MET604 Majid Week-02 Lecture-01
Geo-Statistical Analysis
Lecture 1
2
Evaluation
Quizzes/Assignments/Projects/Presentations: 25 Marks
-----------------------------------------------------------------------------------------------
TOTAL 100 Marks
-----------------------------------------------------------------------------------------------
3
Lecture Schedule (subject to change)
4
Recommended Books/Literature
5
Software
▪ Minitab
Minitab is a statistics package developed at the Pennsylvania
State University in 1972.
▪ ArcGIS
▪ QGIS
▪ GeoDa
Developed by the Center for Spatial Data Science at the
University of Chicago, GeoDa is free and open-source
software that serves as an introduction to spatial data analysis.
▪ R
R is a free software environment for statistical
computing and graphics.
6
What is statistics?
7
What is spatial data?
8
What is geo-spatial data?
9
Spatial Data Analysis (SDA)
SDA is a set of methods and tools for exploratory
spatial analysis.
It is an extension of ordinary data analysis, which is
often used in marketing science, management science,
etc.
SDA handles large amounts of information.
SDA emphasizes a computational approach.
10
Primary Tasks
11
Spatial Data Analysis Results (Examples)
The description of the general weather patterns in a set
of geographic regions is a spatial characteristic rule.
The comparison of two weather patterns in two
geographic regions is a spatial discriminant rule.
A rule like "most cities in Canada are close to the
Canada-US border" is a spatial association rule.
Others: spatial clusters,…
12
Exploratory Data Analysis (EDA)
13
Representation of Data
Graphical methods (Scatter plots and histograms),
useful for visual analysis
Quantitative methods (Summary statistics).
14
Statistical Data Distributions
Histograms, skewness, quantiles
Population statistics
Mean, variance, standard deviation
Continuous distributions (Normal, Student, Chi-square,
F-distributions)
15
Scatter Plot (Index Plot)
[Figure: "INDEX A GRAPH", an index plot of A (vertical axis) against Index A (horizontal axis, 0 to 100)]
16
Histograms
Histograms (the term was first used by Pearson, 1895)
present a graphical representation of the frequency
distribution of the selected variable(s), in which the columns
are drawn over the class intervals and the heights of the
columns are proportional to the class frequencies.
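As a rough illustration of how class frequencies are obtained, the sketch below counts values per equal-width class interval and prints a text histogram. The helper name `histogram_counts`, the interval width of 2, and the starting value 0 are assumptions chosen for this example; the data are the values used in the mode example of this lecture.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width class intervals beginning at start.

    Each interval is [lower, upper), except that the last interval also
    includes its upper edge. Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    upper_edge = start + n_bins * width
    for v in values:
        k = int((v - start) // width)
        if v == upper_edge:
            k = n_bins - 1  # place the right-most edge in the last interval
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Data taken from the mode example in this lecture
data = [3, 3, 4, 2, 5, 5, 4, 6, 4, 4, 6, 7, 1, 4, 6, 9, 1, 0, 3, 4]
counts = histogram_counts(data, start=0, width=2, n_bins=5)

# Column "heights" are proportional to the class frequencies
for k, c in enumerate(counts):
    print(f"{2 * k:2d} to {2 * k + 2:2d}: " + "#" * c)
```

With these assumed intervals, the tallest column is the one covering the values 4 and 5.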
17
Histogram
[Figure: histogram titled "INDEX A"; horizontal axis A (-3 to 3), vertical axis Frequency]
18
Frequency and Class Intervals
Guidelines for forming the class intervals:
19
Interpretation of Histograms
Skewness
$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
where $\bar{x}$ is the mean, $s$ is the standard deviation, and $n$ is the number
of data points.
The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.
Negative values for the skewness indicate data that are
skewed left and positive values for the skewness
indicate data that are skewed right.
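A minimal sketch of the skewness calculation, assuming the form with (n - 1) s^3 in the denominator and with s computed as the sample standard deviation (divisor n - 1). The function name `sample_skewness` is hypothetical; the first data set is the one used in the variance example of this lecture.

```python
import math

def sample_skewness(values):
    """Skewness as sum((x_i - mean)^3) / ((n - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with n - 1 in the denominator (assumption)
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)

skew = sample_skewness([3, 5, 7, 7, 38])        # one large value on the right
symmetric = sample_skewness([1, 2, 3, 4, 5])    # symmetric about its mean

print(f"skewed data:    {skew:.3f}")
print(f"symmetric data: {symmetric:.3f}")
```

The first data set gives a positive value (skewed right) because of the single large observation, while the symmetric data set gives zero.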
21
Kurtosis
It is a measure of the size of the tails and the steepness
of the peak.
It gives an indication of measurement outliers.
The kurtosis of a normal distribution is 3.
Kurtosis above 3 (positive excess kurtosis) indicates a "peaked"
distribution, and kurtosis below 3 (negative excess kurtosis)
indicates a "flat" distribution.
22
Excess Kurtosis
The kurtosis for a standard normal distribution is three. For
this reason, excess kurtosis is defined as
$$\text{Excess kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)\, s^4} - 3$$
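The excess kurtosis can be sketched in the same way as the skewness. This example assumes that s is the sample standard deviation (divisor n - 1); the function name `excess_kurtosis` is hypothetical, and the data set is the one used in the variance example of this lecture.

```python
def excess_kurtosis(values):
    """Excess kurtosis as sum((x_i - mean)^4) / ((n - 1) * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with n - 1 in the denominator (assumption); s^4 = (s^2)^2
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)
    fourth = sum((x - mean) ** 4 for x in values)
    return fourth / ((n - 1) * s2 ** 2) - 3

data = [3, 5, 7, 7, 38]
ek = excess_kurtosis(data)
print(f"excess kurtosis: {ek:.4f}")
```

With only five observations the estimate is very unstable, so the value should be read as an arithmetic check of the formula rather than as a statement about the shape of the underlying distribution.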
23
Collection of Data: Sample vs. Population
Population: consists of all possible measurements that
can be made on a particular item or procedure. Often a
population has an infinite number of data elements.
24
Summary Statistics
Summary statistics are used to describe some simple
characteristics of the data. Any set of measurements has two
important characteristics:
25
The Center Value
(Central Tendency)
Mean
Median
Mode
26
Mean
The sample mean is defined as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
27
Example
Observation 1 2 3 4 5
Data value 5 7 3 38 7
$$\bar{x} = \frac{5 + 7 + 3 + 38 + 7}{5} = \frac{60}{5} = 12$$
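The same calculation can be checked with a few lines of code. This is only a sketch of the arithmetic above; the variable names are chosen for illustration.

```python
# Data values from the example above
data = [5, 7, 3, 38, 7]

# Sample mean: sum of the observations divided by their number
mean = sum(data) / len(data)
print(mean)  # 12.0
```

Note that the single large value 38 pulls the mean above four of the five observations.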
28
Median
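The median is another center value used to describe the overall data: the
middle value of the ordered observations, or the average of the two middle
values when the number of observations is even.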
29
Example
Observation 1 2 3 4 5 (Odd)
Data value 3 5 7 38 7
Ordered 3 5 7 7 38
The median = 7
Observation 1 2 3 4 (Even)
Ordered 3 5 7 7
The median = (5 + 7) /2 = 6
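The two cases (odd and even number of observations) can be sketched as follows. The function name `median_value` is hypothetical; Python's standard `statistics.median` applies the same rule.

```python
def median_value(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median_value([3, 5, 7, 38, 7])   # ordered: 3 5 7 7 38
even_case = median_value([3, 5, 7, 7])      # ordered: 3 5 7 7

print(odd_case)   # 7
print(even_case)  # 6.0
```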
30
Mode
A measure of central tendency, the mode (the term was first
used by Pearson, 1895) of a sample is the value which occurs
most frequently in the sample.
Example:
Data value: 3, 3, 4, 2, 5, 5, 4, 6, 4, 4, 6, 7, 1, 4, 6, 9, 1, 0, 3, 4
Mode = 4 (6 occurrences)
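A short sketch of the count behind this example, using `collections.Counter` from the Python standard library to tally how often each value occurs.

```python
from collections import Counter

# Data values from the example above
data = [3, 3, 4, 2, 5, 5, 4, 6, 4, 4, 6, 7, 1, 4, 6, 9, 1, 0, 3, 4]

# most_common(1) returns a list with the single most frequent (value, count) pair
mode, occurrences = Counter(data).most_common(1)[0]
print(mode, occurrences)  # 4 6
```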
31
Quantiles
The quantile (Kendall, 1940) of a distribution of values is a
number x_p such that a proportion p of the population values
are less than or equal to x_p.
32
25 and 75 Percentiles
The 0.25 quantile (also referred to as the 25th
percentile or lower quartile) of a variable is a value x_p
such that 25% of the values of the variable fall below
that value x_p.
The 0.75 quantile (also referred to as the 75th
percentile or upper quartile) is a value x_p such that
75% of the values of the variable fall below that value
x_p and is calculated accordingly.
Observe that the median is the 50th percentile.
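As a sketch, the quartiles of a small data set can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Several interpolation conventions exist and the lecture does not name one, so the choice of `method="inclusive"` here is an assumption; other conventions can give slightly different quartiles for small samples.

```python
import statistics

# Data set used in the median and variance examples of this lecture
data = [3, 5, 7, 7, 38]

# n=4 splits the data into four groups and returns the three cut points
q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")

print(q25, q50, q75)
print(q50 == statistics.median(data))  # the 50th percentile is the median
```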
33
Measures of Spread
Measures of spread provide information on how far
from the center the data tend to range.
Variance
Standard deviation
Root Mean Square, RMS
Interquartile Range, IQR
34
Sample Variance
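The variance measures the spread of the data about the mean
(compare it to the IQR). It is the average squared distance from the
mean, with the sum divided by the degrees of freedom:
$$s^2 = \frac{\sum_{i=1}^{n} (\bar{x} - x_i)^2}{n - 1} = \frac{\sum_{i=1}^{n} v_i^2}{f}$$
where $s^2$ is the variance, $v_i = \bar{x} - x_i$ is the residual, and
$f = n - 1$ is the degrees of freedom.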
35
Example
Using data set {3, 5, 7, 7, 38}, we get the mean equal to 12, f
= 4 and the variance:
$$s^2 = \frac{(12 - 3)^2 + (12 - 5)^2 + (12 - 7)^2 + (12 - 7)^2 + (12 - 38)^2}{4}$$
$$s^2 = 214$$
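The worked example can be reproduced step by step. This is a sketch of the arithmetic only; the variable names are illustrative, and `statistics.variance` from the standard library is used as a cross-check because it also divides by n - 1.

```python
import statistics

data = [3, 5, 7, 7, 38]
n = len(data)
mean = sum(data) / n                      # 12.0

# Residuals v_i = mean - x_i and their squares
residuals = [mean - x for x in data]      # 9, 7, 5, 5, -26
squared = [v ** 2 for v in residuals]     # 81, 49, 25, 25, 676

f = n - 1                                 # degrees of freedom
variance = sum(squared) / f
print(variance)                           # 214.0
print(statistics.variance(data))          # same value from the standard library
```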
36
Standard Deviation
Note that the variance is in squared units. To get an average
measure in the units of the data, we take the square root of
the variance. Hence
$$s = \sqrt{\frac{\sum_{i=1}^{n} (\bar{x} - x_i)^2}{n - 1}}$$
where $\bar{x}$ is the mean and $v_i = \bar{x} - x_i$ is the residual.
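A brief sketch for the data set {3, 5, 7, 7, 38} used in this lecture, taking the square root of the sample variance and comparing it with `statistics.stdev`, which uses the same n - 1 divisor.

```python
import math
import statistics

data = [3, 5, 7, 7, 38]
mean = sum(data) / len(data)

# Square root of the sample variance (divisor n - 1)
s = math.sqrt(sum((mean - x) ** 2 for x in data) / (len(data) - 1))

print(round(s, 4))
print(round(statistics.stdev(data), 4))
```

The result, about 14.63, is in the same units as the data, unlike the variance.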
37
Root Mean Square (RMS)
$$\text{RMS} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}}$$
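A minimal sketch of the RMS calculation, applied to the data set {3, 5, 7, 7, 38} used elsewhere in this lecture. Unlike the standard deviation, the values are squared directly, without subtracting the mean, and the sum is divided by n.

```python
import math

data = [3, 5, 7, 7, 38]

# Mean of the squared values, then the square root
mean_square = sum(x ** 2 for x in data) / len(data)   # 1576 / 5
rms = math.sqrt(mean_square)

print(round(rms, 3))
```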
38
Range, R, and Interquartile Range (IQR)
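The IQR (the difference between the 75th and 25th percentiles)
contains half the data values in the data set and is
indicative of the range of values. It is less affected by
extreme values than the range R (the difference between the
maximum and minimum values).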
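The sensitivity of the two measures to an extreme value can be sketched by computing both with and without the value 38 from the lecture's example data. The helper names are hypothetical, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption, since the lecture does not specify one.

```python
import statistics

def value_range(values):
    """Range R: maximum minus minimum."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q75 - q25

with_extreme = [3, 5, 7, 7, 38]
without_extreme = [3, 5, 7, 7]

range_with, range_without = value_range(with_extreme), value_range(without_extreme)
iqr_with, iqr_without = iqr(with_extreme), iqr(without_extreme)

print("range:", range_without, "->", range_with)
print("IQR:  ", iqr_without, "->", iqr_with)
```

Adding the single extreme value changes the range from 4 to 35, while the IQR changes only from 2.5 to 2.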
39
Box and Whiskers Plot
40
Covariance
Given two groups of observations, x and y with n
observations in each group, the covariance is defined as
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$
where $\bar{x}$ and $\bar{y}$ are the means of the observations in each
group.
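A small sketch of the covariance as defined above, with n in the denominator. The function name `covariance` and the two data sets are made up for illustration. Note that many software packages divide by n - 1 instead, so their results differ by a factor of n / (n - 1).

```python
def covariance(x, y):
    """Covariance with divisor n, following the definition above."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Illustrative data (not from the lecture): y increases with x
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

s_xy = covariance(x, y)
print(s_xy)  # 4.0
```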
41
Correlation
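The correlation coefficient is a standardized form of covariance.
For a population:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad -1 \le \rho_{xy} \le 1$$
For a sample:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}, \qquad -1 \le r_{xy} \le 1$$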
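The sample correlation coefficient can be sketched as below. The example assumes that the covariance and both standard deviations use the same divisor, in which case the divisor cancels and only the sums of products and squares are needed. The function name `correlation` and the data are illustrative, not taken from the lecture.

```python
import math

def correlation(x, y):
    """r_xy = s_xy / (s_x * s_y), assuming one common divisor for all three terms."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

x = [1, 2, 3, 4, 5]
r_positive = correlation(x, [2, 4, 6, 8, 10])   # y rises with x
r_negative = correlation(x, [10, 8, 6, 4, 2])   # y falls as x rises
r_weak = correlation(x, [2, 1, 4, 3, 3])        # loosely related values

print(r_positive, r_negative, round(r_weak, 3))
```

Perfectly linear data give values at the limits of +1 and -1, and less strongly related data fall between them.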
43
Correlation (Graphical)
[Figure: scatter plots illustrating positive correlation, negative correlation, weak correlation, and no correlation]
[Figure: "INDEX A GRAPH", an index plot of A (vertical axis) against Index A (horizontal axis, 0 to 100)]
45
How to Detect Outliers?
Any value x that fulfills either of the following conditions is an
outlier:
$$x > Q_{75} + 1.5 \times \text{IQR}$$
$$x < Q_{25} - 1.5 \times \text{IQR}$$
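The rule can be sketched as a small function that flags values lying more than 1.5 IQR above the 75th percentile or below the 25th percentile. The function name `find_outliers` is hypothetical, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption; a different convention can move the limits slightly for small data sets.

```python
import statistics

def find_outliers(values):
    """Return values above Q75 + 1.5*IQR or below Q25 - 1.5*IQR."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25
    lower = q25 - 1.5 * iqr
    upper = q75 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

# Data set used in the earlier examples of this lecture
data = [3, 5, 7, 7, 38]
outliers = find_outliers(data)
print(outliers)  # [38]
```

Here Q25 = 5 and Q75 = 7, so the IQR is 2 and the limits are 2 and 10; only the value 38 falls outside them.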
46
Descriptive and Inferential Statistics
Descriptive statistics
• Concerned with obtaining summary measures to
describe a set of data (we discussed some of them
earlier: mean, variance, etc.)
47
Questions?
48