Objectives of STAT5002
Stat 5002
An Introduction to Statistics with Applications
in Computing
Lecture 1
To introduce students to
9/03/2015
We use information from a SAMPLE to answer questions or discover features about a target POPULATION.

Population
Specify characteristics that identify the members of the population. Who/What? Where? When?
Example: Characteristics such as age, income, education, gender and marital status are typically used in studies concerning people.
A sampling frame is a list or rule defining the population. This is usually unachievable, and we often need to restrict our studies to the population to which we can gain access.
©Sydney University 14
Experimental Studies
An experimental study is one in which the
investigator has some control over the determinant.
Data Mining
from Databases
Obtaining Data: Experimental Studies
[Diagram: Population; Sampling; Experimental Group and Control Group; First Data Collection (Before) for each group; Treatment for one group and No Treatment for the other; Comparison (Compare!)]
Example: Stanford prison experiment
http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html
http://www.med.uottawa.ca/sim/data/Study_Designs_e.htm
Data Mining
Variables
Measurements taken on subjects in a study vary amongst
subjects.
These measurements (data) are usually organised in a
spreadsheet consisting of rows and columns.
The rows contain information about individual subjects or records.
The columns contain the values of the measurements that vary: the variables.
A Spreadsheet
(Station = BOM station number)

Station  Month  Day  Max_1913  Min_1913  Max_2013  Min_2013  >34_1913  >34_2013  <9_1913  <9_2013  Diff_Max  Diff_Min
66062    1      1    25        19.1      26.2      20.2      0         0         0        0        7.1       1.1
66062    1      2    27.1      17.1      22.9      20.3      0         0         0        0        5.8       3.2
66062    1      3    32.6      20.7      24.8      18.4      0         0         0        0        4.1       -2.3
66062    1      4    21.9      17.5      26.6      18.3      0         0         0        0        9.1       0.8
66062    1      5    23.1      15.8      28.3      20.9      0         0         0        0        12.5      5.1
66062    1      6    24.6      15.4      28        21.6      0         0         0        0        12.6      6.2
66062    1      7    23.9      18.9      27.5      21.4      0         0         0        0        8.6       2.5
66062    1      8    23.8      18.6      42.3      20.9      0         1         0        0        23.7      2.3
66062    1      9    23.9      16.8      25        21.1      0         0         0        0        8.2       4.3
66062    1      10   25.2      16        25.4      20.2      0         0         0        0        9.4       4.2
66062    1      11   26.3      19.8      29.6      21.2      0         0         0        0        9.8       1.4
66062    1      12   26.9      20.1      31.2      23.5      0         0         0        0        11.1      3.4
66062    1      13   31.3      19.8      23.8      20.7      0         0         0        0        4         0.9
66062    1      14   25.2      20        23.7      17.1      0         0         0        0        3.7       -2.9
66062    1      15   25.9      20.1      24.9      16.8      0         0         0        0        4.8       -3.3
66062    1      16   27.1      20.9      27.2      19.1      0         0         0        0        6.3       -1.8
66062    1      17   27.8      20.4      29        21.4      0         0         0        0        8.6       1
66062    1      18   30.6      19.4      45.8      21.7      0         1         0        0        26.4      2.3
66062    1      19   27.7      20.2      24.8      21.5      0         0         0        0        4.6       1.3
66062    1      20   21.7      19.6      24.3      20.2      0         0         0        0        4.7       0.6
66062    1      21   22.2      16.9      26.6      20.7      0         0         0        0        9.7       3.8
66062    1      22   24.6      15        29.6      20.9      0         0         0        0        14.6      5.9

http://www.bom.gov.au/climate/data/

Types of Data
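As a rough illustration of rows as records and columns as variables, the sketch below types the first three rows of the spreadsheet into an R data frame. The object name `temps` is made up for this example, and the rule used for `Diff_Min` (Min_2013 minus Min_1913) is an assumption inferred from the listed values, not a definition given in the notes.

```r
# First three rows of the temperature spreadsheet, typed in by hand.
# `temps` is a hypothetical name used only for this sketch.
temps <- data.frame(
  Station  = c(66062, 66062, 66062),
  Month    = c(1, 1, 1),
  Day      = c(1, 2, 3),
  Min_1913 = c(19.1, 17.1, 20.7),
  Min_2013 = c(20.2, 20.3, 18.4)
)

nrow(temps)    # number of records (rows)
ncol(temps)    # number of variables (columns)
names(temps)   # the variable names

# Derived variable. Assumed rule: Min_2013 minus Min_1913.
temps$Diff_Min <- temps$Min_2013 - temps$Min_1913
print(temps)
```

Each row is one day's record and each column is one variable, so adding a variable means adding a column.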
Variable Types: Categorical/Group variables
Categorical variables are variables where each observation falls into one of a finite number of groups.
Nominal variables: named variables with no implicit order.
Examples: Type of cancer, Personality type
Ordinal variables: grouped variables with implicit order.
Examples: Level of education, grade
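One way to express the nominal/ordinal distinction in R is with factors. The sketch below is only illustrative: the category labels and values are invented, and the level order chosen for `education` is an assumption.

```r
# Nominal: categories with no implicit order (hypothetical values).
personality <- factor(c("Type A", "Type B", "Type A", "Type B"))

# Ordinal: categories with an implicit order, declared through `levels`.
education <- factor(
  c("Primary", "Tertiary", "Secondary", "Secondary"),
  levels  = c("Primary", "Secondary", "Tertiary"),
  ordered = TRUE
)

is.ordered(personality)   # FALSE
is.ordered(education)     # TRUE

# Order comparisons are only meaningful for the ordinal variable.
below_tertiary <- education < "Tertiary"
below_tertiary            # TRUE FALSE TRUE TRUE

table(education)          # counts per category, in level order
```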
A title
Clearly labelled axes
Appropriate comments
to have clarity
to be aesthetically satisfying
For categorical data we simply tabulate the counts and/or proportions of data (denoted p in a sample, or in a population) in the categories of interest.

A Clustered Bar Chart is a visual display showing associations between two categorical variables.
[Figure: clustered bar chart "Numbers of Very Hot, and Very Cold Days in 1913 and 2013"; horizontal axis: Temperatures (< 9C, Not so extreme, > 34C); vertical axis: Number of Days, 0 to 350; bars for 1913 and 2013]

Counts of Days in each Year
Year    < 9C    Not so extreme    > 34C
1913    65      295               5
2013    33      326               6

Percentages of Days in each Year
Year    < 9C      Not so extreme    > 34C
1913    17.81%    80.82%            1.37%
2013    9.04%     89.32%            1.64%

It appears that
the daily temperatures were not so extreme in both 1913 and 2013
there was a larger proportion of extremely cold days in 1913 than in 2013
the proportion of very hot days was low in both years
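The percentages can be reproduced from the counts. The following is a minimal sketch that types the counts table into a matrix (the object names are chosen for this example) and converts each year's row to percentages with `prop.table()`.

```r
# Counts of days per temperature category, one row per year.
counts <- matrix(
  c(65, 295, 5,
    33, 326, 6),
  nrow = 2, byrow = TRUE,
  dimnames = list(Year = c("1913", "2013"),
                  Temperature = c("< 9C", "Not so extreme", "> 34C"))
)

rowSums(counts)   # 365 days in each year

# Row proportions (margin = 1), expressed as percentages to 2 decimals.
percentages <- round(100 * prop.table(counts, margin = 1), 2)
print(percentages)
```

Calling `barplot(counts, beside = TRUE, legend.text = TRUE)` on the same matrix should give a clustered bar chart grouped by temperature category.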
Histogram
A histogram is a simple and
effective display, useful for
displaying the distribution of
numerical data.
There are no gaps between bins, unless a bin is empty.
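As a small sketch of how a histogram bins numerical data, the code below uses the first 13 Max_1913 values from the spreadsheet. The bin edges (every 2 degrees from 20 to 34) are an arbitrary choice for this example.

```r
# Max_1913 values for 1 to 13 January, taken from the spreadsheet.
max_1913 <- c(25, 27.1, 32.6, 21.9, 23.1, 24.6, 23.9,
              23.8, 23.9, 25.2, 26.3, 26.9, 31.3)

# plot = FALSE returns the binning without drawing anything.
h <- hist(max_1913, breaks = seq(20, 34, by = 2), plot = FALSE)

h$breaks   # bin edges: 20 22 24 ... 34
h$counts   # 1 4 3 3 0 1 1  (the 28 to 30 bin is empty)
```

Dropping `plot = FALSE` draws the histogram; the empty bin then shows up as the only gap between bars.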
[Figure: boxplot with the lower quartile and upper quartile labelled]
A boxplot also identifies any unusually large or small values in a dataset, called outliers.
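To see which values a boxplot would flag, R's `boxplot.stats()` can be used. This sketch relies on R's default convention (points more than 1.5 times the box length beyond the hinges are reported as outliers), which is R's rule and not necessarily the one intended in these notes. The data are the first 13 Max_2013 values from the spreadsheet.

```r
# Max_2013 values for 1 to 13 January, taken from the spreadsheet.
max_2013 <- c(26.2, 22.9, 24.8, 26.6, 28.3, 28, 27.5,
              42.3, 25, 25.4, 29.6, 31.2, 23.8)

stats <- boxplot.stats(max_2013)   # default coef = 1.5

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
stats$stats   # 22.9 25.0 26.6 28.3 31.2

# Values beyond the whiskers, drawn as separate points by boxplot().
stats$out     # 42.3
```

R's hinges are a close approximation to the lower and upper quartiles and can differ slightly from them for some sample sizes.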
[Figure: 1913_Min, 2013_Min, 1913_Max and 2013_Max compared on a common axis; Temperature (°C), 0 to 50]
Compare centres and spreads, and mention unusual observations.
It appears that both minimum and maximum daily
temperatures in 2013 were slightly higher than those in
1913.
See: http://freedom.indiemaps.com/
X              Y
predictor      response
determinant    outcome
independent    dependent

If X increases and Y increases then a POSITIVE relation exists.
If X increases and Y decreases then a NEGATIVE relation exists.
[Figures: two plots of Y against X, one for each kind of relation]
[Figure: scatter plots "Minimum Temperatures: 1913 and 2013" and "Maximum Temperatures: 1913 and 2013"; horizontal axes: Minimum Temperatures 1913 and Maximum Temperatures 1913; axes scaled from 4 to 44]

Displaying Data
Data Type: Categorical, Numerical
One Variable Only
Comparative
Numerical: Scatter plots, Box plots

http://www.gapminder.org/videos/the-joy-of-stats/
http://www.gapminder.org/world
Wordle
(c)Sydney University, 2014 http://www.oceancalendars.com.au
Measures of Centre
Mode: The most frequently occurring value in the dataset.
The data may be nominal, ordinal or numeric.
Median: The middle value when all the data are placed in order.
The data must be ordinal or numerical.
For an even number of values the median is the
average of the two middle values.
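A short sketch of both measures in R, using made-up data with an even number of values. Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type), so the mode is read off a frequency table here.

```r
# Hypothetical data with an even number of values.
x <- c(2, 3, 3, 5, 7, 10)

# Mode: the value(s) with the highest frequency.
freq <- table(x)
mode_value <- as.numeric(names(freq)[freq == max(freq)])

# Median: the average of the two middle values (3 and 5) when n is even.
median_value <- median(x)

mode_value     # 3
median_value   # 4
```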
http://www.youtube.com/watch?v=oNdVynH6hcY
Medians and Means
The mean is affected by outliers, the median is not.
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

Mean = (1 + 3 + 6 + 10)/4 = 5
[Figure: number line from 0 to 10]

We use Sample Statistics to estimate Population Parameters.
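The effect of an outlier on the mean and the median can be checked with the four values 1, 3, 6, 10 used in the mean calculation. In this sketch the replacement value 100 is an invented outlier.

```r
x <- c(1, 3, 6, 10)
mean(x)       # 5
median(x)     # 4.5

# Replace the largest value with an invented outlier.
x_out <- c(1, 3, 6, 100)
mean(x_out)     # 27.5  (pulled towards the outlier)
median(x_out)   # 4.5   (unchanged)
```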
Sample Statistics estimate Population Parameters

            Sample Statistic    Population Parameter
Mean        x̄                   μ
Median      x̃                   μ̃

Measures of spread
Numeric data are often described, or summarised, using two statistics:
a measure of centrality, or location, and
a measure of spread, or dispersion.

[Figure: Minimum and Maximum temperatures; horizontal axis: Temperature (°C), 0 to 50]
A measure of variability is important
[Figure: two charts with vertical axes from -10 to 40]

The inter-quartile range
The inter-quartile range (IQR) is the difference between the upper and lower quartiles in an ordered set of numerical data.
IQR = UQ - LQ
The IQR gives the range of the middle 50% of a set of data, so is sometimes called the midspread.
The inter-quartile range is rarely influenced by outliers in the data.

Daily Minimum and Maximum Temperatures, 2013
[Figure: Minimum and Maximum temperatures; horizontal axis: Temperature (°C), 0 to 50]
For the minimum temperatures in 2013: IQR ≈ 18 - 11 = 7
For the maximum temperatures in 2013: IQR ≈ 26.5 - 21 = 5.5
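A minimal sketch of IQR = UQ - LQ in R, using small made-up datasets. R's `quantile()` uses one particular quartile convention by default (type 7), so hand calculations with another convention may differ slightly on other data.

```r
x <- c(1, 3, 5, 7, 9)

# Lower and upper quartiles, then their difference.
q <- quantile(x, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

iqr_manual   # 4
IQR(x)       # 4, the built-in equivalent

# Replacing the largest value with an invented outlier leaves the IQR unchanged.
x_out <- c(1, 3, 5, 7, 90)
IQR(x_out)   # 4
```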
Range = max - min
The range will be influenced by outliers in the data.
[Figure: Daily Min and Max Temps, 2013; Minimum and Maximum]
For the minimum temperatures in 2013: Range ≈ 24 - 7 = 17

The larger the standard deviation, the greater the spread.
It is defined in terms of the deviations of the data from the mean (called residuals).
The sample standard deviation, s, is the square root of the average (sort of) squared residual.

$$ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

Data              Mean    sd
5 5 5 5 5 5 5     5       0
1 3 5 7 9         5       3.16
0 5 15 34 86      28      34.94
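The definition of s can be written out directly and compared with R's built-in `sd()`. This is a sketch only: `sample_sd` is a helper name made up for this example, and the two datasets are the second and third rows of the Mean/sd table.

```r
# Sample standard deviation written out from its definition.
# `sample_sd` is a hypothetical helper, not a base R function.
sample_sd <- function(x) {
  residuals <- x - mean(x)                  # deviations from the mean
  sqrt(sum(residuals^2) / (length(x) - 1))  # divide by n - 1, then square root
}

a <- c(1, 3, 5, 7, 9)
b <- c(0, 5, 15, 34, 86)

round(sample_sd(a), 2)   # 3.16
round(sample_sd(b), 2)   # 34.94

# The built-in sd() uses the same n - 1 divisor.
all.equal(sample_sd(b), sd(b))   # TRUE

# Data with no variation have standard deviation 0.
sample_sd(c(5, 5, 5, 5, 5))      # 0
```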
            Sample Statistic    Population Parameter
Mean        x̄                   μ
Median      x̃                   μ̃
Std. dev    s                   σ
Variance    s²                  σ²

A measure of how much the data are spread around the mean
The variance, σ², is the square of the standard deviation and is estimated by s².
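As a quick check that the sample variance is the square of the sample standard deviation, the sketch below applies `var()` and `sd()` to two of the small datasets from the Mean/sd table.

```r
a <- c(1, 3, 5, 7, 9)
b <- c(0, 5, 15, 34, 86)

var(a)   # 10
var(b)   # 1220.5

# s^2 equals the square of s (up to floating-point rounding).
all.equal(var(a), sd(a)^2)   # TRUE
all.equal(var(b), sd(b)^2)   # TRUE
```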
R command Outcome
plot() 2-D scatterplot
oma/index.htm?utm_source=twitterfeed&utm_medium=twitter

Example:
par(mfrow=c(1,1), mar=c(3.0,3.0,3.0,3.0), mgp=c(1.1,0.1,0),
    oma=c(0,2,1.4,0), las=1, tcl=0.2, cex=0.8)
barplot(counts, main="Number of Very Hot Days in 2013",
        names.arg=c("35C or more","Less than 35C"),
        xlab="Maximum Temperature",
        ylab="Counts",
        col="darkred")

[Figure: bar chart "Number of Very Hot Days in 2013"; bars VeryHot and NotVeryHot; horizontal axis: Maximum Temperature; vertical axis: Counts, 0 to 350]
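The `barplot()` call needs a `counts` object, which the notes do not define. A self-contained sketch might look like the following. The two counts are an assumption: they reuse the 6 days above 34C (and the remaining 359 days) from the 2013 counts table, and treating "above 34C" as "35C or more" is a guess. `pdf(NULL)` is used only so the sketch runs without opening a plot window.

```r
# Assumed counts for 2013: 6 very hot days out of 365 (see the counts table).
# Equating "> 34C" with "35C or more" is an assumption made for this sketch.
counts <- c(VeryHot = 6, NotVeryHot = 359)

pdf(NULL)   # null graphics device; remove this line (and dev.off()) to see the plot
par(mfrow = c(1, 1), las = 1, cex = 0.8)

# barplot() returns the x positions of the bar midpoints.
mids <- barplot(counts,
                main = "Number of Very Hot Days in 2013",
                names.arg = c("35C or more", "Less than 35C"),
                xlab = "Maximum Temperature",
                ylab = "Counts",
                col = "darkred")
dev.off()
```

The labels in `names.arg` are applied in the order of the values in `counts`, so it is worth checking that order before relabelling the bars.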
Easy to read!!!
Tables