Spatstat
3) Spatial Statistics
Centrographic Statistics (O&U Ch. 4 p. 77-81)
– single, summary measures of a spatial distribution
Point Pattern Analysis (O&U Ch. 4 p. 81-114)
– pattern analysis; points have no magnitude ("no variable")
Quadrat Analysis
Nearest Neighbor Analysis
– Standard deviation (square root of variance):

$S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n}}$

– Correlation coefficient:

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n\, S_X S_Y}$

where $S_X$ and $S_Y$ are the standard deviations of X and Y, and $\bar{X}$ and $\bar{Y}$ are the means.
Correlation Coefficient example using "calculation formulae"
As we explore spatial statistics, we will see many analogies to the mean, the variance, and the correlation coefficient, and their various formulae. There is an example of calculation later in this presentation.
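A minimal Python sketch of these calculation formulae (the five sample points are borrowed from the mean center example later in this presentation; any paired lists would do):

```python
import math

def correlation(xs, ys):
    """Pearson's r using the calculation formulae above."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)  # std deviation of X
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)  # std deviation of Y
    # r = sum of deviation cross-products / (n * Sx * Sy)
    return sum((x - mean_x) * (y - mean_y)
               for x, y in zip(xs, ys)) / (n * sx * sy)

print(correlation([2, 4, 7, 7, 6], [3, 7, 7, 3, 2]))
```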
• In actuality: an outcome to be expected from a random process: two ways to sit opposite, but four ways to sit catty-corner
Weighted Mean Center:

$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$
Mean Center example for the points (2,3), (4,7), (7,7), (7,3), (6,2):

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$

 i            Xi    Yi
 1             2     3
 2             4     7
 3             7     7
 4             7     3
 5             6     2
 sum          26    22
 Centroid/MC  5.2   4.4
Weighted Mean Center example (same points, now weighted):

$\bar{X}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i} \qquad \bar{Y}_w = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}$

 i            Xi    Yi      wi      wiXi     wiYi
 1             2     3   3,000     6,000    9,000
 2             4     7     500     2,000    3,500
 3             7     7     400     2,800    2,800
 4             7     3     100       700      300
 5             6     2     300     1,800      600
 sum          26    22   4,300    13,300   16,200
 weighted MC                        3.09     3.77
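A short Python sketch of both calculations (the weights are those shown in the table above):

```python
def mean_center(pts):
    """Unweighted mean center: average the X and Y coordinates."""
    n = len(pts)
    return sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n

def weighted_mean_center(pts, weights):
    """Weighted mean center: weighted coordinate sums over total weight."""
    total = sum(weights)
    wx = sum(w * x for (x, _), w in zip(pts, weights))
    wy = sum(w * y for (_, y), w in zip(pts, weights))
    return wx / total, wy / total

pts = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
weights = [3000, 500, 400, 100, 300]
print(mean_center(pts))                    # (5.2, 4.4)
print(weighted_mean_center(pts, weights))  # (~3.09, ~3.77)
```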
Median Center:
Intersection of a north/south and an east/west line drawn so that half of the population lives above and half below the e/w line, and half lives to the left and half to the right of the n/s line.
Mean Center:
Balancing point of a weightless map, if equal weights were placed on it at the residence of every person on census day.
Standard Distance Deviation:

$SDD = \sqrt{\frac{\sum_{i=1}^{n}(X_i - X_c)^2 + \sum_{i=1}^{n}(Y_i - Y_c)^2}{n}}$

which by Pythagoras reduces to:

$SDD = \sqrt{\frac{\sum_{i=1}^{n} d_{iC}^2}{n}}$

---essentially the average distance of points from the center.
Provides a single unit measure of the spread or dispersion of a distribution.
We can also calculate a weighted standard distance analogous to the weighted mean center:

$SDD_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (X_i - X_c)^2 + \sum_{i=1}^{n} w_i (Y_i - Y_c)^2}{\sum_{i=1}^{n} w_i}}$
Standard Distance Deviation Example

$sdd = \sqrt{\frac{\sum_{i=1}^{n}(X_i - X_c)^2 + \sum_{i=1}^{n}(Y_i - Y_c)^2}{n}}$

Circle with radius = SDD = 2.9, centered on the centroid (5.2, 4.4), for the points (2,3), (4,7), (7,7), (7,3), (6,2):

 i          X    Y   (X - Xc)²  (Y - Yc)²
 1          2    3     10.2        2.0
 2          4    7      1.4        6.8
 3          7    7      3.2        6.8
 4          7    3      3.2        2.0
 5          6    2      0.6        5.8
 sum       26   22     18.8       23.2
 Centroid  5.2  4.4
 sum of sums        42.0
 divide by n         8.4
 sq rt               2.90
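The same example in a minimal Python sketch:

```python
import math

def standard_distance(pts):
    """SDD: square root of the mean squared distance from the centroid."""
    n = len(pts)
    xc = sum(x for x, _ in pts) / n   # centroid X
    yc = sum(y for _, y in pts) / n   # centroid Y
    sum_sq = sum((x - xc) ** 2 + (y - yc) ** 2 for x, y in pts)
    return math.sqrt(sum_sq / n)

pts = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
print(standard_distance(pts))  # ~2.90
```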
There appears to be no major difference between the location of the software and the telecommunications industry in North Texas.
Q = # of quadrats = 10
P = # of points = 20, so λ = P/Q = 2
• Test statistic (variance-mean ratio form):

$\chi^2 = \frac{\sum_{i=1}^{Q} x_i^2 - \left(\sum_{i=1}^{Q} x_i\right)^2 / Q}{\lambda}$

• The test will ascertain if a pattern is significantly more clustered than would be expected by chance (but does not test for uniformity)
• The values of the test statistic in our cases would be:

random: [60 − (20²)/10] / 2 = 10    uniform: [40 − (20²)/10] / 2 = 0    clustered: [200 − (20²)/10] / 2 = 80

• For degrees of freedom Q − 1 = 10 − 1 = 9, the value of chi-square at the 1% level is 21.666.
• Thus, there is only a 1% chance of obtaining a value of 21.666 or greater if the points
had been allocated randomly. Since our test statistic for the clustered pattern is 80, we
conclude that there is (considerably) less than a 1% chance that the clustered pattern
could have resulted from a random process
(See O&U pp. 98-100)
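The calculation is easy to script. A minimal Python sketch of the variance-mean ratio statistic; the quadrat counts below are hypothetical, chosen only so the sums of squares (60, 40, 200) match the three cases above:

```python
def quadrat_chi2(counts):
    """chi^2 = [sum(x^2) - (sum x)^2 / Q] / lambda, with df = Q - 1."""
    q = len(counts)                  # number of quadrats
    p = sum(counts)                  # number of points
    lam = p / q                      # mean points per quadrat
    return (sum(x * x for x in counts) - p * p / q) / lam

# Hypothetical counts for P = 20 points in Q = 10 quadrats:
random_counts    = [5, 4, 3, 1, 1, 2, 1, 1, 1, 1]    # sum x^2 = 60 -> chi2 = 10
uniform_counts   = [2] * 10                          # sum x^2 = 40 -> chi2 = 0
clustered_counts = [0, 0, 0, 0, 0, 0, 0, 0, 10, 10]  # sum x^2 = 200 -> chi2 = 80
for counts in (random_counts, uniform_counts, clustered_counts):
    print(quadrat_chi2(counts))
```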
Quadrat Analysis: Frequency Distribution Comparison
• Expected frequencies for a random spatial distribution are derived from the
Poisson frequency distribution and can be calculated with:
$p(0) = e^{-\lambda} = 1 / 2.71828^{P/Q} \qquad p(x) = p(x-1) \cdot \lambda / x$

where x = number of points in a quadrat, p(x) = the probability of x points,
P = total number of points, Q = number of quadrats, and
λ = P/Q (the average number of points per quadrat)
See next slide for a worked example for the clustered case
Calculation of Poisson Frequencies for Kolmogorov-Smirnov test
CLUSTERED pattern as used in lecture
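A minimal Python sketch of the recurrence, using the lecture's P = 20 and Q = 10 (the probabilities apply to all three patterns, since λ is the same):

```python
import math

def poisson_probs(p_points, q_quadrats, max_x):
    """p(0) = e^-lambda, then p(x) = p(x-1) * lambda / x."""
    lam = p_points / q_quadrats
    probs = [math.exp(-lam)]
    for x in range(1, max_x + 1):
        probs.append(probs[-1] * lam / x)
    return probs

for x, p in enumerate(poisson_probs(20, 10, 5)):
    # probability of x points, and expected number of quadrats out of 10
    print(x, round(p, 4), round(p * 10, 2))
```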
Significance test is based on the standard error of the mean nearest neighbor distance:

$SE_{\bar{d}} = \frac{0.26136}{\sqrt{n^2 / A}}$

where n = number of points and A = area of the study region.
Nearest Neighbor Index for three patterns (n = 10 points, area = 50):

                  Random      Clustered    Uniform
 Mean distance     1.09         0.1          2.2
 Area of Region    50           50           50
 Density           0.2          0.2          0.2
 Expected Mean     1.118034     1.118034     1.118034
 NNI (R)           0.974926     0.089443     1.96774
 Z                 -0.1515      -5.508       5.855

Source: Lembro
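A brute-force Python sketch of the NNI and its Z-score under these formulae (the five points and the 50-unit study area are illustrative assumptions, not the data above):

```python
import math

def nearest_neighbor_index(pts, area):
    """NNI = observed mean NN distance / expected mean under randomness."""
    n = len(pts)
    mean_obs = sum(
        min(math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(pts) if j != i)
        for i, (x1, y1) in enumerate(pts)) / n
    expected = 1 / (2 * math.sqrt(n / area))   # 1 / (2 * sqrt(density))
    se = 0.26136 / math.sqrt(n * n / area)     # standard error
    return mean_obs / expected, (mean_obs - expected) / se

pts = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
nni, z = nearest_neighbor_index(pts, area=50)
print(round(nni, 3), round(z, 3))
```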
Evaluating the Nearest Neighbor Index
• Advantages
– NNI takes into account distance
– No quadrat size problem to be concerned with
• However, NNI not as good as might appear
– Index highly dependent on the boundary for the area
• its size and its shape (perimeter)
– Fundamentally based on only the mean distance
– Doesn’t incorporate local variations (could have clustering locally in some areas,
but not overall)
– Based on point location only and doesn’t incorporate magnitude of phenomena at
that point
• An “adjustment for edge effects” available but does not solve all the problems
• Some alternatives to the NNI are the G and F functions, based on the entire
frequency distribution of nearest neighbor distances, and the K function based
on all interpoint distances.
– See O and U pp. 89-95 for more detail.
– Note: the G Function and the General/Local G statistic (to be discussed later) are
related but not identical to each other
Sparse Contiguity Matrix for US States

 State                  FIPS  Ncount  N1, N2, ... (neighbors)
 Connecticut              9     3     44 36 25
 Delaware                10     3     24 42 34
 District of Columbia    11     2     51 24
 Florida                 12     2     13 1
 Georgia                 13     5     12 45 37 1 47
 Idaho                   16     6     32 41 56 49 30 53
 Illinois                17     5     29 21 18 55 19
 Indiana                 18     4     26 21 17 39
 Iowa                    19     6     29 31 17 55 27 46
 Kansas                  20     4     40 29 31 8
 Kentucky                21     7     47 29 18 39 54 51 17
 Louisiana               22     3     28 48 5
 Maine                   23     1     33
 Maryland                24     5     51 10 54 42 11
 Massachusetts           25     5     44 9 36 50 33
 Michigan                26     3     18 39 55
 Minnesota               27     4     19 55 46 38
 Mississippi             28     4     22 5 1 47
 Missouri                29     8     5 40 17 21 47 20 19 31
 Montana                 30     4     16 56 38 46
 Nebraska                31     6     29 20 8 19 56 46
 Nevada                  32     5     6 4 49 16 41
 New Hampshire           33     3     25 23 50
 New Jersey              34     3     10 36 42
 New Mexico              35     5     48 40 8 4 49
 New York                36     5     34 9 42 50 25
 North Carolina          37     4     45 13 47 51
 North Dakota            38     3     46 27 30
 Ohio                    39     5     26 21 54 42 18
 Oklahoma                40     6     5 35 48 29 20 8
 Oregon                  41     4     6 32 16 53
 Pennsylvania            42     6     24 54 10 39 36 34
 Rhode Island            44     2     25 9
 South Carolina          45     2     13 37
 South Dakota            46     6     56 27 19 31 38 30
 Tennessee               47     8     5 28 1 37 13 51 21 29
 Texas                   48     4     22 5 35 40
 Utah                    49     6     4 8 35 56 32 16
 Vermont                 50     3     36 25 33
 Virginia                51     6     47 37 24 54 11 21
 Washington              53     2     41 16
 West Virginia           54     5     51 21 24 39 42
 Wisconsin               55     4     26 17 19 27
 Wyoming                 56     6     49 16 31 8 46 30

• Ncount is the number of neighbors for each state
• Max is 8 (Missouri and Tennessee)
• Sum of Ncount is 218
• Number of common borders (joins) = ncount / 2 = 109
• N1, N2, … are the FIPS codes for the neighbors
Weights Based on Distance (see O&U p 202)
• Most common choice is the inverse (reciprocal) of the distance
between locations i and j (wij = 1/dij)
– Linear distance?
– Distance through a network?
• Other functional forms may be equally valid, such as inverse of
squared distance (wij =1/dij2), or negative exponential (e-d or e-d2)
• Can use length of shared boundary: wij= length (ij)/length(i)
• Inclusion of distance to all points may make it impossible to solve
necessary equations, or may not make theoretical sense (effects
may only be ‘local’)
– Include distance to only the “nth” nearest neighbors
– Include distances to locations only within a buffer distance
• For polygons, distances usually measured centroid to centroid, but
– could be measured from perimeter of one to centroid of other
– For irregular polygons, could be measured between the two closest boundary points (an adjustment is then necessary for contiguous polygons since distance for these would be zero)
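A sketch of building an inverse-distance weights matrix from polygon centroids, with an optional buffer distance (the function name, the example centroids, and the parameter choices are illustrative assumptions):

```python
import math

def inverse_distance_weights(centroids, power=1, max_d=None):
    """w_ij = 1 / d_ij^power; pairs beyond max_d (if given) get weight 0."""
    n = len(centroids)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                          # no self-weight
            d = math.dist(centroids[i], centroids[j])
            if max_d is None or d <= max_d:       # inside the buffer, or no buffer
                w[i][j] = 1 / d ** power
    return w

centroids = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
w = inverse_distance_weights(centroids, power=2, max_d=5)
print(w[0][1])  # weight between the first two polygons
```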
A Note on Sampling Assumptions
• Another factor which influences results from these tests is the
assumption made regarding the type of sampling involved:
– Free (or normality) sampling assumes that the probability of a polygon
having a particular value is not affected by the number or arrangement of
the polygons
• Analogous to sampling with replacement
– Non-free (or randomization) sampling assumes that the probability of a
polygon having a particular value is affected by the number or arrangement
of the polygons (or points), usually because there is only a fixed number of
polygons (e.g. if n = 20, once I have sampled 19, the 20th is determined)
• Analogous to sampling without replacement
• The formulae used to calculate the various statistics (particularly
the standard deviation/standard error) differ depending on which
assumption is made
– Generally, the formulae are substantially more complex for randomization
sampling—unfortunately, it is also the more common situation!
– Usually, assuming normality sampling requires knowledge about larger
trends from outside the region or access to additional information within
the region in order to estimate parameters.
Joins (or Joint or Join) Count Statistic
• For binary (1,0) data only (or ratio data converted to binary)
– Shown here as B/W (black/white)
• Requires a contiguity matrix for polygons
• Based upon the proportion of "joins" between categories, e.g.
– Total of 60 for Rook Case
– Total of 110 for Queen Case
• The "no correlation" case is simply generated by tossing a coin for each cell
• See O&U pp. 186-192 and Lee and Wong pp. 147-156
Patterns:
– Positive spatial autocorrelation: small proportion (or count) of BW joins, large proportion of BB and WW joins
– No spatial autocorrelation: dissimilar proportions (or counts) of BW, BB and WW joins
– Negative spatial autocorrelation: large proportion (or count) of BW joins, small proportion of BB and WW joins
Join Count Statistic Formulae for Calculation
• Test statistic given by:

$Z = \frac{\text{Observed} - \text{Expected}}{\text{SD of Expected}}$

The expected number of joins and its standard deviation are given by formulae that depend on the sampling assumption (see Lee and Wong pp. 147-156).
• There are far more Bush/Bush joins (actual = 60) than would be expected (27)
– Since the test score (3.79) is greater than the critical value (2.54 at 1%), the result is statistically significant at the 99% confidence level (p <= 0.01)
– Strong evidence of spatial autocorrelation—clustering
• There are far fewer Bush/Gore joins (actual = 28) than would be expected (54)
– Since the test score (-5.07) is greater in absolute value than the critical value (2.54 at 1%), the result is statistically significant at the 99% confidence level (p <= 0.01)
– Again, strong evidence of spatial autocorrelation—clustering
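A minimal Python sketch of counting BB, WW, and BW joins on a binary raster with rook contiguity; a 6x6 grid yields the 60 rook-case joins noted above, and the coin-toss grid is the "no correlation" case:

```python
import random

def join_counts(grid):
    """Count BB, WW and BW joins under rook contiguity."""
    rows, cols = len(grid), len(grid[0])
    bb = ww = bw = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # look right and down only, so each join counts once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = grid[r][c], grid[rr][cc]
                    if a == b == 1:
                        bb += 1
                    elif a == b == 0:
                        ww += 1
                    else:
                        bw += 1
    return bb, ww, bw

grid = [[random.randint(0, 1) for _ in range(6)] for _ in range(6)]
bb, ww, bw = join_counts(grid)
print(bb, ww, bw, bb + ww + bw)  # total is 60 for the 6x6 rook case
```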
Moran's I — note the analogy with the correlation coefficient:

$I = \frac{n \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\right)\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(x_i - \bar{x})(x_j - \bar{x}) \,/\, \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}}{\sum_{i=1}^{n}(x_i - \bar{x})^2 \,/\, n}$

Whereas the correlation coefficient relates two variables through $\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\sum_{i=1}^{n}(y_i - \bar{y})^2$, Moran's I relates each value of a single variable to its spatially weighted neighbors: spatial auto-correlation.
E(I) = -1/(n-1)
• However, there are two different formulations for the
standard error calculation
– The randomization or nonfree sampling method
– The normality or free sampling method
The actual formulae for calculation are in Lee and Wong p. 82 and 160-1
• Consequently, two slightly different values for Z are
obtained. In either case, based on the normal frequency
distribution, a value ‘beyond’ +/- 1.96 indicates a statistically
significant result at the 95% confidence level (p <= 0.05)
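A direct Python sketch of the formula above (the four-areas-in-a-row weights matrix and values are a toy example; a real application would use a full contiguity matrix and the attribute of interest):

```python
def morans_i(x, w):
    """Moran's I for values x and a spatial weights matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    w_sum = sum(sum(row) for row in w)               # sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))   # neighbor cross-products
    den = sum(d * d for d in dev)                    # total squared deviation
    return (n / w_sum) * (num / den)

# Four areas in a line (rook contiguity), values trending across space:
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
x = [1, 2, 8, 9]
print(morans_i(x, w))   # 0.4 > E(I) = -1/3: positive spatial autocorrelation
```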
[Moran scatterplot quadrants: Low/Low indicates positive SA; High/Low indicates negative SA]
For Geary's C, however, E(C) = 1
• Again, there are two different formulations for the standard
error calculation
– The randomization or nonfree sampling method
– The normality or free sampling method
The actual formulae for calculation are in Lee and Wong p. 81 and p. 162
• Consequently, two slightly different values for Z are obtained.
In either case, based on the normal frequency distribution, a
value ‘beyond’ +/- 1.96 indicates a statistically significant
result at the 95% confidence level (p <= 0.05)
$E(G) = \frac{W}{n(n-1)} \quad \text{where } W = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(d)$
• For the General G, the terms in the numerator (top) are calculated “within a
distance bound (d),” and are then expressed relative to totals for the entire
region under study.
– As with all of these measures, if adjacent x terms are both large with the
same sign (indicating positive spatial association), the numerator (top) will
be large
– If they are both large with different signs (indicating negative spatial
association), the numerator (top) will again be large, but negative
$Z = \frac{G - E(G)}{SE(G)} \quad \text{with} \quad E(G) = \frac{W}{n(n-1)}$

However, the calculation of the standard error is complex. See Lee and Wong pp. 164-167 for formulae.
• As an example, Lee and Wong find the following values:
G(d) = 0.5557, E(G) = 0.5238.
Since G(d) is greater than E(G), this indicates potential "hot spots" (clusters of high values).
However, the test statistic Z = 0.3463.
Since this does not lie "beyond +/-1.96, our standard marker for a 0.05 significance
level, we conclude that the difference between G(d) and E(G) could have occurred by
chance." There is no compelling evidence for a hot spot.
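A sketch of the General G calculation with binary distance-band weights (w_ij = 1 within distance d, 0 otherwise); the points and attribute values are illustrative assumptions, not the Lee and Wong data:

```python
import math

def general_g(pts, x, d):
    """G(d) = sum over i != j of w_ij * x_i * x_j / sum over i != j of x_i * x_j."""
    n = len(pts)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            prod = x[i] * x[j]
            den += prod                            # all pairs
            if math.dist(pts[i], pts[j]) <= d:
                num += prod                        # pairs within the distance bound
    return num / den

pts = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
x = [10, 50, 60, 20, 15]      # attribute values (must be positive for G)
print(general_g(pts, x, d=3))
```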
[Map: statistically significant clusters (p < 0.05) in Lake, Ashtabula, Geauga, Cuyahoga, Trumbull, Summit, and Portage counties]
Scatter Diagram (Source: Lee and Wong)
Ordinary Least Squares (OLS) Simple Linear Regression
• Conceptually different but mathematically similar to correlation
• Concerned with “predicting” one variable (Y - the dependent
variable) from another variable (X - the independent variable)
Y = a + bX
a is the "intercept term"—the value of Y when X = 0
b is the regression coefficient or slope of the line—the change in Y for a unit change in X
• The coefficient of determination (r2) measures the proportion of
the variance in Y which can be predicted (“explained by”) X.
– It equals the correlation coefficient (r) squared.
The regression line minimizes the sum of the squared deviations between actual $Y_i$ and predicted $\hat{Y}_i$:

$\min \sum_{i} (Y_i - \hat{Y}_i)^2$
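A minimal OLS sketch using the closed-form solution for the slope and intercept (the x/y values are made up for illustration):

```python
def ols_simple(xs, ys):
    """Least-squares a, b and r^2 for Y = a + bX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx             # slope: change in Y per unit change in X
    a = my - b * mx           # intercept: value of Y when X = 0
    syy = sum((y - my) ** 2 for y in ys)
    return a, b, (sxy * sxy) / (sxx * syy)   # r^2 = squared correlation

a, b, r2 = ols_simple([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(a, b, r2)
```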
OLS and Spatial Autocorrelation:
Don’t forget why spatial autocorrelation matters!
• We said earlier:
In ordinary least squares regression (OLS), for example, the correlation
coefficients will be biased and their precision exaggerated
– Bias implies correlation coefficients may be higher than they really are
• They are biased because the areas with higher concentrations of events will have a
greater impact on the model estimate
– Exaggerated precision (lower standard error) implies they are more likely to be
found “statistically significant”
• they will overestimate precision because, since events tend to be concentrated, there are actually fewer independent observations than is being assumed.
• In other words, ordinary regression and correlation are
potentially deceiving in the presence of spatial autocorrelation
• We need to first adjust the data to remove the effects of spatial
autocorrelation, then run the regressions again
– But that’s for another course!
[Cluster map labels — Unique: low value/low crime; Classic suburb: high value/low crime]