MET604 Majid Week-02 Lecture-01


MET604

Geo-Statistical Analysis

Dr. Majid Nazeer

Semesters 1 & 2 (Spring 2018)

Lecture 1

Department of Meteorology, COMSATS Institute of Information Technology, Islamabad, Pakistan.
14 February, 2018
Lecture Details
 Every Wednesday 14:30 – 17:30 hours, Room # 209 (AB-II)

 Instructor: Dr. Majid Nazeer ([email protected])

 Office hours: Monday 14:30 – 15:30, or by appointment

Evaluation
 Quizzes/Assignments/Projects/Presentations: 25 Marks

 Mid Term Exam: 25 Marks

 Final Term Exam: 50 Marks

-----------------------------------------------------------------------------------------------
TOTAL 100 Marks
-----------------------------------------------------------------------------------------------

Lecture Schedule (subject to change)

Recommended Books/Literature

Software
▪ Minitab
 Minitab is a statistics package developed at the Pennsylvania
State University in 1972.

▪ ArcGIS

▪ QGIS

▪ GeoDa
 Developed by the Center for Spatial Data Science at the University of Chicago, GeoDa is free and open-source software that serves as an introduction to spatial data analysis.
▪ R
 R is a free software environment for statistical
computing and graphics.
What is statistics?

What is spatial data?

What is geo-spatial data?

Spatial Data Analysis (SDA)
 SDA is a set of methods and tools for exploratory spatial analysis
 It extends ordinary data analysis, which is often used in marketing science, management science, etc.
 SDA deals with very large amounts of information
 SDA emphasizes a computational approach

Primary Tasks

 Analyzing Spatial Association Rules


 Spatial Classification and Prediction
 Spatial Data Clustering Analysis
 Spatial Outlier Analysis
 Exploratory Data Analysis (Statistical), EDA
 Exploratory Spatial Data Analysis, ESDA

Spatial Data Analysis Results (Examples)
 The description of the general weather patterns in a set
of geographic regions is a spatial characteristic rule.
 The comparison of two weather patterns in two
geographic regions is a spatial discriminant rule.
 A rule like “most cities in Canada are close to the
Canada-US border” is a spatial association rule
 Others: spatial clusters,…

Exploratory Data Analysis (EDA)

Representation of Data
 Graphical methods (Scatter plots and histograms),
useful for visual analysis
 Quantitative methods (Summary statistics).

Statistical Data Distributions
 Histograms, skewness, quantiles
 Population statistics
 Mean, variance, standard deviation
 Continuous distributions (Normal, Student, Chi-square,
F-distributions)

Scatter Plot (Index Plot)

[Figure: "INDEX A GRAPH", an index plot of A (vertical axis) against Index A (horizontal axis, 0 to 100).]

Histograms
Histograms (a term first used by Pearson, 1895) present a graphical representation of the frequency distribution of the selected variable(s), in which the columns are drawn over the class intervals and the heights of the columns are proportional to the class frequencies.

Histogram
[Figure: histogram of INDEX A, with Frequency on the vertical axis and A (from -3 to 3) on the horizontal axis.]

Frequency and Class Intervals
Guidelines for forming the class intervals:

 Use intervals with midpoints at convenient round numbers
 For small data sets, use a small number of intervals
 For large data sets, use more intervals

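As a rough illustration of the guidelines above, the following Python sketch counts class frequencies for equal-width intervals. The data values, the interval width of 2 (which gives midpoints 1, 3, 5, 7, 9), and the helper name `class_frequencies` are illustrative assumptions, not part of the lecture.

```python
def class_frequencies(values, start, width, n_classes):
    """Count the values falling in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class interval containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical small data set; intervals [0,2), [2,4), ..., [8,10) have midpoints 1, 3, 5, 7, 9.
data = [3, 3, 4, 2, 5, 5, 4, 6, 4, 4, 6, 7, 1, 4, 6, 9, 1, 0, 3, 4]
counts = class_frequencies(data, start=0, width=2, n_classes=5)

# Print a simple text histogram: one '#' per observation in the class.
for k, c in enumerate(counts):
    print(f"[{2 * k}, {2 * k + 2}): {'#' * c} ({c})")
```

With only 20 values, five intervals already show the shape; a larger data set would usually justify more, narrower intervals.
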
Interpretation of Histograms

Common histogram shapes are: less precise, precise, bimodal, left skewed, and right skewed. Left and right are defined by the location of the tail, or by the location of the mean with respect to the median, i.e.

Right skewed: mean > median
Left skewed: mean < median
Skewness

$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
Here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the number of data points.
 The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.
 Negative values for the skewness indicate data that are
skewed left and positive values for the skewness
indicate data that are skewed right.
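
The skewness formula can be checked numerically with a short Python sketch. The function name `sample_skewness` and the symmetric data set are illustrative assumptions; the second data set is the one used in the later mean and variance examples.

```python
import math

def sample_skewness(values):
    """Skewness as sum((x_i - mean)^3) / ((n - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with n - 1 in the denominator.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)

symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetric data
right_tailed = [3, 5, 7, 7, 38]   # one large value creates a long right tail

print(sample_skewness(symmetric))     # 0.0
print(sample_skewness(right_tailed))  # positive, roughly 1.3
```
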

Kurtosis
 It is a measure of the size of the tails and the steepness of the peak
 Gives an indication of measurement outliers
 Positive excess kurtosis (kurtosis above 3) indicates a "peaked" distribution and negative excess kurtosis (kurtosis below 3) indicates a "flat" distribution.
 Kurtosis of a normal distribution is 3

Excess Kurtosis
The kurtosis for a standard normal distribution is three. For this reason, excess kurtosis is defined as

$$\text{Excess kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)\, s^4} - 3$$

so that the standard normal distribution has an excess kurtosis of zero.
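
A minimal Python sketch of the excess kurtosis formula, assuming $s$ is the sample standard deviation with $n - 1$ in the denominator. The function name `excess_kurtosis` and both data sets are hypothetical and chosen only to show the sign of the result.

```python
import math

def excess_kurtosis(values):
    """Excess kurtosis as sum((x_i - mean)^4) / ((n - 1) * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 4 for x in values) / ((n - 1) * s ** 4) - 3

flat = [1, 2, 3, 4, 5]             # evenly spread values, no pronounced peak
peaked = [0] * 18 + [-10, 10]      # sharp peak at zero with two extreme values

print(excess_kurtosis(flat))    # negative: "flat" distribution
print(excess_kurtosis(peaked))  # positive: "peaked" distribution with heavy tails
```
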

Collection of Data: Sample vs. Population
 Population - consists of all possible measurements that can be made on a particular item or procedure. Often a population has an infinite number of data elements.

 Sample - a subset of data selected from the population. Can be gathered in an economical fashion. May or may not be representative of the population.

Summary Statistics
Summary statistics are used to describe some simple characteristics of the data. Any set of measurements has two important characteristics:

 The center value (central tendency),
 The spread about that value (as seen in the histogram)

The Center Value
(Central Tendency)

 Mean
 Median
 Mode

Mean
The sample mean is defined as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where $\bar{x}$ is the sample mean, $x_i$ are the observations, and $n$ is the number of observations.

Example

Observation   1   2   3   4    5
Data value    5   7   3   38   7

$$\bar{x} = \frac{5 + 7 + 3 + 38 + 7}{5} = \frac{60}{5} = 12$$

Median
The median is another center value used to describe the overall data.

 To find the median of a data set, arrange the data in order from smallest to largest (or vice versa). The median is the value in the middle if the number of data points is odd. If the number of data points is even, the median is the average of the two data points nearest the middle.

Example
Observation   1   2   3   4    5    (Odd)
Data value    3   5   7   38   7
Ordered       3   5   7   7    38
The median = 7 (the middle value)

Observation   1   2   3   4    (Even)
Ordered       3   5   7   7

The median = (5 + 7) / 2 = 6
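
The odd and even cases can be written as a short Python sketch. The function name `median` is an illustrative choice (Python's standard library offers `statistics.median` for the same purpose).

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 5, 7, 38, 7]))  # 7   (odd number of data points)
print(median([3, 5, 7, 7]))      # 6.0 (even number of data points)
```
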

Mode
The mode (a term first used by Pearson, 1895) is a measure of central tendency: it is the value which occurs most frequently in the sample.

Example:
Data value: 3, 3, 4, 2, 5, 5, 4, 6, 4, 4, 6, 7, 1, 4, 6, 9,1, 0, 3, 4
Mode = 4 (6 occurrences)

What happens if the data has fractions?
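
The mode of the example data can be verified by counting occurrences. This is a small sketch using Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

data = [3, 3, 4, 2, 5, 5, 4, 6, 4, 4, 6, 7, 1, 4, 6, 9, 1, 0, 3, 4]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 4 6
```
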

Quantiles
The quantile (Kendall, 1940) of a distribution of values is a
number xp such that a proportion p of the population values
are less than or equal to xp.

25th and 75th Percentiles
 The 0.25 quantile (also referred to as the 25th percentile or lower quartile) of a variable is a value xp such that 25% of the values of the variable fall below that value.
 The 0.75 quantile (also referred to as the 75th percentile or upper quartile) is a value xp such that 75% of the values of the variable fall below that value; it is calculated accordingly.
 Observe that the median is the 50th percentile.

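A brief Python sketch of the quartiles, using `statistics.quantiles` from the standard library (Python 3.8 or later). The data set is hypothetical, and the `"inclusive"` interpolation method is an assumption: software packages use slightly different quantile conventions, so results for small samples can differ.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical values

# n=4 splits the data into quarters and returns the three cut points.
q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
print(q25, q50, q75)            # 3.0 5.0 7.0
print(statistics.median(data))  # 5, the same value as the 50th percentile
```
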
Measures of Spread
Measures of spread provide information on how far from the center the data tend to range.

 Variance
 Standard deviation
 Root Mean Square, RMS
 Interquartile Range, IQR

Sample Variance
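The variance measures the spread of the data about the mean (compare it to the IQR). It is the average squared distance from the mean, with the degrees of freedom in the denominator:

$$s^2 = \frac{\sum_{i=1}^{n} (\bar{x} - x_i)^2}{n - 1} = \frac{\sum_{i=1}^{n} v_i^2}{f}$$

where $s^2$ is the variance, $v_i = \bar{x} - x_i$ is the residual, and $f = n - 1$ is the number of degrees of freedom.
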
Example
Using data set {3, 5, 7, 7, 38}, we get the mean equal to 12, f = 4 and the variance:

$$s^2 = \frac{(12 - 3)^2 + (12 - 5)^2 + (12 - 7)^2 + (12 - 7)^2 + (12 - 38)^2}{4} = 214$$

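The same calculation, written out as a small Python sketch that follows the residual and degrees-of-freedom notation of the variance formula. Variable names are illustrative.

```python
data = [3, 5, 7, 7, 38]
n = len(data)

mean = sum(data) / n                   # sample mean
residuals = [mean - x for x in data]   # v_i = mean - x_i
f = n - 1                              # degrees of freedom
variance = sum(v ** 2 for v in residuals) / f

print(mean, f, variance)  # 12.0 4 214.0
```
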
Standard Deviation
Note that the variance is expressed in squared units. To obtain a measure of spread in the units of the data, we take the square root of the variance. Hence

$$s = \sqrt{\frac{\sum_{i=1}^{n} (\bar{x} - x_i)^2}{n - 1}}$$

where $\bar{x}$ is the mean and $v_i = \bar{x} - x_i$ is the residual.

Root Mean Square (RMS)

The Root Mean Square (RMS) value of n observations $x_i$ is defined as

$$\text{RMS} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}}$$

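As a quick numerical comparison, the following Python sketch computes the standard deviation and the RMS for the data set {3, 5, 7, 7, 38} used in the variance example. Note that the standard deviation measures spread about the mean, while the RMS is taken about zero. Variable names are illustrative.

```python
import math

data = [3, 5, 7, 7, 38]
n = len(data)
mean = sum(data) / n

# Standard deviation: square root of the sample variance (n - 1 in the denominator).
std_dev = math.sqrt(sum((mean - x) ** 2 for x in data) / (n - 1))

# Root mean square: square root of the mean of the squared observations.
rms = math.sqrt(sum(x ** 2 for x in data) / n)

print(round(std_dev, 3))  # 14.629
print(round(rms, 3))      # 17.754
```
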
Range, R, and Interquartile Range (IQR)

 R = Maximum value - Minimum value

 IQR contains half the data values in the data set and is indicative of the range of values. It is less affected by extreme values than the range.

$$\mathrm{IQR} = Q_{75} - Q_{25}$$

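The difference in sensitivity to extreme values can be seen in a short Python sketch on the data set {3, 5, 7, 7, 38}. The quartiles come from `statistics.quantiles` with the `"inclusive"` method, which is an assumed convention; other quartile rules may give slightly different values.

```python
import statistics

data = [3, 5, 7, 7, 38]

value_range = max(data) - min(data)
q25, _, q75 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q75 - q25

# The single large value 38 inflates the range but leaves the IQR small.
print(value_range, iqr)  # 35 2.0
```
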
Box and Whiskers Plot

Covariance
Given two groups of observations, x and y, with n observations in each group, the covariance is defined as

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$

where $\bar{x}$ and $\bar{y}$ are the means of the observations in each group.

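A minimal Python sketch of the covariance as defined above, with n in the denominator. The function name `covariance` and the data are hypothetical. Many libraries divide by n - 1 instead, so their results differ by a factor of n / (n - 1).

```python
def covariance(x, y):
    """Covariance with n in the denominator, as in the definition above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]     # rises with x
print(covariance(x, y))  # 4.0
```
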
Correlation
The correlation coefficient is a standardized form of covariance

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad -1 \le \rho_{xy} \le 1$$

or in terms of sample statistics

$$r_{xy} = \frac{s_{xy}}{s_x s_y}, \qquad -1 \le r_{xy} \le 1$$

This latter expression is also known as the Bivariate Pearson Product Moment Correlation Coefficient (r).
Bivariate Pearson Product Moment
Correlation Coefficient ( r )
Measures the degree of association or strength of the relationship between two continuous variables. It varies on a scale from –1 through 0 to +1
 -1 implies perfect negative association
• E.g.: As values on one variable rise, those on the other fall (price and quantity purchased)
 0 implies no linear association
 +1 implies perfect positive association
• E.g.: As values rise on one they also rise on the other (house price and income of occupants).

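The three reference values of r can be reproduced with a short Python sketch. The function name `pearson_r` and the data sets are hypothetical. The sketch works with sums of products directly, so the same denominator (n or n - 1) cancels between the covariance and the two standard deviations.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # 1.0: perfect positive association
print(pearson_r(x, [10, 8, 6, 4, 2]))  # -1.0: perfect negative association
print(pearson_r(x, [1, 3, 2, 3, 1]))   # 0.0: no linear association
```
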
Correlation (Graphical)
[Figure: scatter plots illustrating negative correlation, positive correlation, weak correlation, and no correlation. Source: STATISTICA Electronic Manual]


Outlier
Outliers are atypical (by definition), infrequent observations;
data points which do not appear to follow the characteristic
distribution of the rest of the data.

[Figure: "INDEX A GRAPH", an index plot of A (vertical axis) against Index A (horizontal axis, 0 to 100).]

How to Detect Outliers?
 Any value x that fulfills either of the following conditions is an outlier:

$$x > Q_{75} + 1.5\,\mathrm{IQR}$$
$$x < Q_{25} - 1.5\,\mathrm{IQR}$$

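The rule above can be sketched in a few lines of Python. The function name `find_outliers` is hypothetical, and the quartiles are computed with `statistics.quantiles` using the `"inclusive"` method, which is an assumed convention; a different quartile rule can shift the limits slightly.

```python
import statistics

def find_outliers(values):
    """Return the values beyond 1.5 * IQR below Q25 or above Q75."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25
    lower = q25 - 1.5 * iqr
    upper = q75 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

# For {3, 5, 7, 7, 38}: Q25 = 5, Q75 = 7, IQR = 2, so the limits are 2 and 10.
print(find_outliers([3, 5, 7, 7, 38]))  # [38]
```
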
Descriptive and Inferential Statistics
 Descriptive statistics
• Concerned with obtaining summary measures to
describe a set of data (we discussed some of them
earlier, mean, variance, etc…)

 Inference and inferential statistics


• Concerned with making inferences from samples
about populations
• Concerned with making legitimate inferences about
underlying processes from observed patterns

Questions?
