Objectives of STAT5002

9/03/2015
Stat 5002
An Introduction to Statistics with Applications in Computing
Lecture 1
©Sydney University
https://elearning.sydney.edu.au/webapps/

Introduction to Statistical Thinking

• Samples and Populations;
• Sample Statistics and Population Parameters;
• Graphical summaries of Data;
• Numerical summaries of Data.

Objectives of STAT5002

To introduce students to
• basic statistical concepts and methods for further studies.
• methodologies related to statistical data analysis and Data Mining.
• a number of useful statistical models
• computer oriented estimation procedures
• smoothing and nonparametric concepts
• analysis of large data sets.
• the R computing language for all computational aspects in the course

Objective of Statistics
Samples and Populations

We use information from a SAMPLE to answer questions or discover features about a POPULATION.

Populations (ALL)

Define the target population: the population to which we want to generalize our findings.
• Specify characteristics that identify the members of the population. Who/What? Where? When?
  Example: Characteristics such as age, income, education, gender and marital status are typically used in studies concerning people.
• A sampling frame is a List or Rule Defining the target Population. This is usually unachievable, and we often need to restrict our studies to the population to which we can gain access.
Samples: Some of All

• Individual observations should be selected independently!
• Samples need to be representative of the population (not biased).
• Sample size needs to be large enough!

Representative Sample

A random sample is one where each member of the population has the same chance of being selected.
∴ random sample:
• Independent observations
• Representative of population
It is often difficult, or even impossible, to obtain a random sample.
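A simple random sample, in which each member of the population has the same chance of being selected, can be sketched in R with sample(). The population of 1000 numbered members, the sample size of 25 and the seed are assumptions chosen purely for illustration.

```r
# Hypothetical population: members labelled 1 to 1000 (an assumption for illustration).
population <- 1:1000

# Fix the random seed so that the sketch is reproducible.
set.seed(5002)

# Draw 25 members without replacement; every member is equally likely to be chosen.
random_sample <- sample(population, size = 25, replace = FALSE)

random_sample
```

Sampling without replacement means no member can appear twice in the sample.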
Samples need to be representative of the target population.
Observations within samples must be independent of each other.
Samples must not be biased!

Bias

Bias may be defined as any systematic error (i.e. not occurring randomly) which results in incorrect conclusions about the target population.
Some types of bias include
• selection bias
• measurement bias
• response bias
• confounding
Types of Bias

• Selection Bias
  Selection bias refers to any systematic differences occurring in the way that subjects are selected for a study.
• Measurement Bias
  Measurement bias refers to systematic differences in the measurement of variables.
• Response Bias
  Response bias can occur when the response rate to a survey is too low.
• Confounding
  A confounder is a variable that distorts (increases or decreases) the apparent effect of one variable (determinant) on another variable (outcome).

Two Schools of Thought

Frequentist                Bayesian
Population is fixed        Population varies
Samples vary (somewhat)    Sample is fixed
Scope of Statistics/Data Mining

Understand Problem!!

Study                        Data Mining
Design Study
Collect Sample               Obtain Data
Organise Data                Organise Data
Data Analysis                Exploratory Data Analysis
Interpretation of Results    Interpretation of Results
Report Results               Report Results?

Where do data come from?

Statistical Studies
• Observational Studies
• Experimental Studies
Data Mining
• from Databases

Types of Statistical Studies

An observational study is one in which there is no intervention by the investigator and no treatment is imposed.
An experimental study is one in which the investigator has some control over the determinant.
Obtaining Data

Statistical Studies
• Observational Studies
• Experimental Studies
  Stanford prison experiment
  http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html
  http://www.med.uottawa.ca/sim/data/Study_Designs_e.htm
Data Mining
• from Databases
  http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html

Experimental Studies

[Diagram: Experimental Studies]
Population → Sampling → Sample → Randomisation → Experimental Group / Control Group
Experimental Group: First Data Collection (Before) → Treatment → Second Data Collection (After)
Control Group: First Data Collection (Before) → No Treatment → Second Data Collection (After)
Comparison: Compare!
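The randomisation step of an experimental study, where chance alone decides who joins the experimental group and who joins the control group, might be sketched in R as follows. The 20 numbered subjects, the equal group sizes and the seed are assumptions made only for illustration.

```r
# Hypothetical sample of 20 subjects, identified by number (illustration only).
subjects <- 1:20

set.seed(5002)  # reproducible allocation

# Randomisation: shuffle the group labels so that chance alone decides
# which subject goes into which group.
group <- sample(rep(c("Experimental", "Control"), each = 10))

allocation <- data.frame(subject = subjects, group = group)
table(allocation$group)  # 10 subjects in each group
```

Shuffling a fixed set of labels, rather than tossing a coin per subject, guarantees equally sized groups.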
CRISP Data Mining

Cross Industry Standard Process for Data Mining
Aim: To develop an industry, tool and application neutral process for conducting Knowledge Discovery (KD).
Data: Evidence from Samples

Variables

Measurements taken on subjects in a study vary amongst subjects.
These measurements (data) are usually organised in a spreadsheet consisting of rows and columns.
• The rows contain information about individual subjects or records.
• The columns contain the values of the measurements that vary: the variables.

Variables usually take on specific roles

determinants influence outcomes

Predictors                Outcomes
Explanatory variable/s    Response variable/s
Input                     Output
independent variable/s    dependent variable/s
A Spreadsheet

BOM station number  Month  Day  Max_1913  Min_1913  Max_2013  Min_2013  >34_1913  >34_2013  <9_1913  <9_2013  Diff Max  Diff Min
66062               1      1    25        19.1      26.2      20.2      0         0         0        0        7.1       1.1
66062               1      2    27.1      17.1      22.9      20.3      0         0         0        0        5.8       3.2
66062               1      3    32.6      20.7      24.8      18.4      0         0         0        0        4.1       -2.3
66062               1      4    21.9      17.5      26.6      18.3      0         0         0        0        9.1       0.8
66062               1      5    23.1      15.8      28.3      20.9      0         0         0        0        12.5      5.1
66062               1      6    24.6      15.4      28        21.6      0         0         0        0        12.6      6.2
66062               1      7    23.9      18.9      27.5      21.4      0         0         0        0        8.6       2.5
66062               1      8    23.8      18.6      42.3      20.9      0         1         0        0        23.7      2.3
66062               1      9    23.9      16.8      25        21.1      0         0         0        0        8.2       4.3
66062               1      10   25.2      16        25.4      20.2      0         0         0        0        9.4       4.2
66062               1      11   26.3      19.8      29.6      21.2      0         0         0        0        9.8       1.4
66062               1      12   26.9      20.1      31.2      23.5      0         0         0        0        11.1      3.4
66062               1      13   31.3      19.8      23.8      20.7      0         0         0        0        4         0.9
66062               1      14   25.2      20        23.7      17.1      0         0         0        0        3.7       -2.9
66062               1      15   25.9      20.1      24.9      16.8      0         0         0        0        4.8       -3.3
66062               1      16   27.1      20.9      27.2      19.1      0         0         0        0        6.3       -1.8
66062               1      17   27.8      20.4      29        21.4      0         0         0        0        8.6       1
66062               1      18   30.6      19.4      45.8      21.7      0         1         0        0        26.4      2.3
66062               1      19   27.7      20.2      24.8      21.5      0         0         0        0        4.6       1.3
66062               1      20   21.7      19.6      24.3      20.2      0         0         0        0        4.7       0.6
66062               1      21   22.2      16.9      26.6      20.7      0         0         0        0        9.7       3.8
66062               1      22   24.6      15        29.6      20.9      0         0         0        0        14.6      5.9

http://www.bom.gov.au/climate/data/

Types of Data
Variable Types

Categorical / Group
• nominal
• ordinal
Numerical / Quantitative
• discrete
• continuous

Categorical variables

Categorical variables are variables where each observation falls into one of a finite number of groups.
Nominal variables: named variables with no implicit order.
  Examples: Type of cancer, Personality type
Ordinal variables: grouped variables with implicit order.
  Examples: Level of education, grade
If there are two groups the variable is often referred to as being binary or dichotomous (having two possible values).
Binary variables can be either
• nominal, such as sex, or
• ordinal, such as age group, e.g. < 20 years, ≥ 20 years.

[Illustration: items arranged by Size (ordinal): Small, Medium, Large; and by Colour (nominal): White, Green, Purple]

Numerical / Quantitative Variables

Numerical variables are measured variables and can be either discrete or continuous.
Discrete variables are variables that take discrete values: e.g. number of children, number of people in a store.
Continuous variables are those that can assume any value within a certain range or interval: e.g. height, weight, pulse rate.
Numerical variables are also referred to as interval or scale variables.
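In R, the distinction between nominal and ordinal categorical variables can be expressed with factors. The following is a minimal sketch; the four colour and size values are invented for illustration and loosely follow the Size/Colour picture on the slide.

```r
# Nominal variable: named groups with no implicit order (values are illustrative).
colour <- factor(c("White", "Green", "Purple", "Green"))

# Ordinal variable: groups with an implicit order, declared with ordered = TRUE.
size <- factor(c("Small", "Large", "Medium", "Small"),
               levels = c("Small", "Medium", "Large"),
               ordered = TRUE)

is.ordered(colour)  # FALSE
is.ordered(size)    # TRUE
size < "Large"      # comparisons are meaningful only for ordered factors
table(size)         # counts are reported in the declared order
```

Declaring the levels explicitly matters: without it R would sort them alphabetically (Large, Medium, Small).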
'Numerical' data

• cover a range of values
• usually measured with an instrument or along some scale, or counted (when the counts can be large).

Discrete: can only take some values
For example:
• Marks in a test (max half mark accuracy only)
• number of steps walked in a day (whole numbers)

Continuous: can take any values
Example:
• Distance walked in a day (2.6km, 2.67km, 2.675km, etc)
• University entrance scores

The Variables in the Spreadsheet

Variable            Description                             Data Type
BOM station number  Bureau of Meteorology station number    Categorical, Nominal
Month               Month of Year (1:12)                    Categorical, Ordinal
Day                 Day of Month (1:31)                     Numeric, discrete
Max_1913            1913 Daily max temp (°C)                Numeric, continuous
Min_1913            1913 Daily min temp (°C)                Numeric, continuous
Max_2013            2013 Daily max temp (°C)                Numeric, continuous
Min_2013            2013 Daily min temp (°C)                Numeric, continuous
Very Hot?_1913      1: 1913 max temp > 34; 0: otherwise     Categorical, Ordinal
Very Hot?_2013      1: 2013 max temp > 34; 0: otherwise     Categorical, Ordinal
Very Cold?_1913     1: 1913 min temp < 9; 0: otherwise      Categorical, Ordinal
Very Cold?_2013     1: 2013 min temp < 9; 0: otherwise      Categorical, Ordinal
Diff Max            Max_2013 - Max_1913                     Numeric, continuous
Diff Min            Min_2013 - Min_1913                     Numeric, continuous

Note: the Diff Max values shown in the spreadsheet equal Max_2013 - Min_1913.
Summarising Data: Graphical Methods

All graphs need
• A title
• Clearly labelled axes
• Appropriate comments
• to have clarity
• to be aesthetically satisfying
Displaying Categorical Data: One Variable: Bar Chart

[Bar chart: Number of Very Hot Days in 1913; categories: Not so Hot, >34 C]
[Bar chart: Number of Very Cold Days 2013; categories: Not so Cold, <9C]

Contingency Table: Showing counts for Two Categorical Variables

         Temperature
Year     < 9C    Not so extreme    > 34C    Total
1913     65      295               5        365
2013     33      326               6        365
Total    98      621               11       730
Clustered bar chart

A Clustered Bar Chart is a visual display showing associations between two categorical variables.

[Clustered bar chart: Numbers of Very Hot, and Very Cold Days in 1913 and 2013; x-axis: Temperatures (< 9C, Not so extreme, > 34C); y-axis: Number of Days; bars: 1913, 2013]

It appears that
• most daily temperatures were not so extreme in both 1913 and 2013
• there was a larger proportion of extremely cold days in 1913 than in 2013
• the proportion of very hot days was low in both years

Numerical Summary: Categorical Data

For categorical data we simply tabulate the counts and/or proportions of data (denoted p in a sample, or π in a population) in the categories of interest.

Counts of Days in each Year
Year    < 9C      Not so extreme    > 34C
1913    65        295               5
2013    33        326               6

Percentages of Days in each Year
Year    < 9C      Not so extreme    > 34C
1913    17.81%    80.82%            1.37%
2013    9.04%     89.32%            1.64%
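The counts and percentages tabulated for the two years can be reproduced in R from the counts alone. This is a sketch that types the six counts in by hand (the object names counts and row_pct are chosen here for illustration) and then draws a clustered bar chart from the same table.

```r
# Counts of days by year and temperature category, as tabulated on the slide.
counts <- as.table(matrix(c(65, 295, 5,
                            33, 326, 6),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Year = c("1913", "2013"),
                                          Temperature = c("< 9C", "Not so extreme", "> 34C"))))

addmargins(counts)                               # adds the row and column totals

row_pct <- 100 * prop.table(counts, margin = 1)  # percentages within each year
round(row_pct, 2)

# Clustered bar chart: one cluster per temperature category, one bar per year.
barplot(counts, beside = TRUE, legend.text = TRUE,
        main = "Numbers of Very Hot and Very Cold Days in 1913 and 2013",
        xlab = "Temperatures", ylab = "Number of Days")
```

Using margin = 1 divides each count by its row (year) total, so each row of percentages sums to 100.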
Displaying Numerical Data

Histogram

A histogram is a simple and effective display of the distribution of numerical data.
A histogram shows the number of observations that fall into each of several non-overlapping groups or bins.
The bins of a histogram adjoin each other so there are no gaps between bins, unless a bin is empty.

[Histogram: Daily Minimum Temperatures in 2013]
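As a small illustration of bins and counts, the sketch below draws a histogram of the 2013 minimum temperatures for 1 to 13 January only, typed in from the spreadsheet excerpt. The one-degree bin width is an assumption made for this example, not the binning used on the slide.

```r
# Min_2013 for 1-13 January, copied from the spreadsheet excerpt.
min_2013 <- c(20.2, 20.3, 18.4, 18.3, 20.9, 21.6, 21.4,
              20.9, 21.1, 20.2, 21.2, 23.5, 20.7)

# Non-overlapping, adjoining bins of width 1 degree from 18 to 24.
h <- hist(min_2013, breaks = seq(18, 24, by = 1),
          main = "Daily Minimum Temperatures, 1-13 January 2013",
          xlab = "Temperature (degrees C)")

h$counts  # number of observations in each bin: 2 0 6 4 0 1
```

The two zero counts show up as gaps in the plot, which is the "empty bin" case mentioned above.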
Structure of a Box Plot

[Diagram: box plot with labels: whiskers, lower quartile, median, upper quartile, outliers]

Median and quartiles

A boxplot displays a five-number summary of a numerical set of data. These numbers are
Minimum: the smallest value
Lower Quartile: separates the lower 25% of values from the rest
Median: the half-way point of the data
Upper Quartile: separates the upper 25% of values from the rest
Maximum: the largest value
A boxplot also identifies any unusually large or small values in a dataset, called outliers.
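The five-number summary and the outliers flagged by a box plot can be computed directly in R. The sketch below uses the 2013 maximum temperatures for 1 to 13 January from the spreadsheet excerpt. Note that fivenum() uses Tukey's hinges for the quartiles, which can differ slightly from other quartile definitions, and that boxplot.stats() applies R's default rule of 1.5 times the box length when flagging outliers.

```r
# Max_2013 for 1-13 January, copied from the spreadsheet excerpt.
max_2013 <- c(26.2, 22.9, 24.8, 26.6, 28.3, 28, 27.5,
              42.3, 25, 25.4, 29.6, 31.2, 23.8)

# Minimum, lower quartile (hinge), median, upper quartile (hinge), maximum.
fivenum(max_2013)            # 22.9 25.0 26.6 28.3 42.3

# Values lying more than 1.5 box lengths beyond the quartiles.
boxplot.stats(max_2013)$out  # 42.3

boxplot(max_2013, horizontal = TRUE,
        main = "Daily Maximum Temperatures, 1-13 January 2013",
        xlab = "Temperature (degrees C)")
```

The very hot day (42.3) is drawn as a separate point, and the upper whisker stops at the largest value that is not an outlier.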
Comparative Box Plots

Box plots enable the comparison of several samples of data simultaneously.

[Comparative box plots: Daily Minimum and Maximum Temperatures, 1913 and 2013; groups: 1913_Min, 2013_Min, 1913_Max, 2013_Max; x-axis: Temperature (°C), 0 to 50]

It appears that both minimum and maximum daily temperatures in 2013 were slightly higher than those in 1913.
See: http://freedom.indiemaps.com/

Comparing box plots

When making comparisons using box plots
• compare centres
• compare spreads and
• mention unusual observations
Scatter Plot

A scatter plot shows the relation between two numerical variables.
The two variables, X and Y, are referred to as the predictor and response variable respectively, although they do have other names.

X              Y
predictor      response
determinant    outcome
independent    dependent

Construction of scatter plot

• Draw X and Y axes to cover the range of the two variables.
• Label the axes and mark the scale.
• Plot one point for each observation, i.e. (x, y).
• Comment on the plot.

If Y increases as X increases then a POSITIVE relation exists.
If Y decreases as X increases then a NEGATIVE relation exists.
Scatterplots of Temperatures

[Scatter plot: Minimum Temperatures: 1913 and 2013; x-axis: Minimum Temperatures 1913; y-axis: Minimum Temperatures 2013]
[Scatter plot: Maximum Temperatures: 1913 and 2013; x-axis: Maximum Temperatures 1913; y-axis: Maximum Temperatures 2013]

The points on the diagonal lines represent days where the minimum (or maximum) temperature in 1913 was the same as in 2013.
Is there a sensible message here??

Displaying Data

Data Type            Categorical              Numerical
Categorical          Clustered bar chart      Comparative Box plots
Numerical            Comparative Box plots    Scatter plots
One Variable Only    Bar Chart                Histograms

http://www.gapminder.org/videos/the-joy-of-stats/
http://www.gapminder.org/world

Wordle
http://www.oceancalendars.com.au
Data summaries: Numerical Data

Measures of Centre

Mode: The most frequently occurring value in the dataset.
  The data may be nominal, ordinal or numeric.
Median: The middle value when all the data are placed in order.
  The data must be ordinal or numerical.
  For an even number of values the median is the average of the two middle values.
Mean: The Arithmetic Average.
  The data must be either discrete or continuous.
  The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.
http://www.youtube.com/watch?v=oNdVynH6hcY
The Mean

The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$x_i$: the $i$th value of the data
$\bar{x}$: the average or the 'mean' of the x values
$\sum$ (sigma): 'the sum of'
$n$: the number of values

The mean is the centre of gravity (point of balance) of the data.

Mean versus median

The median cuts the data into two sections with the same number of observations in each (50% and 50%).
Symmetric Data: Mean = Median
The mean is affected by outliers, the median is not.
Medians and Means: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

Mean: Centre of balance

Data: 1 3 6 10
Mean = (1 + 3 + 6 + 10)/4 = 5
[Number line from 0 to 10 showing the data balancing at the mean]

Samples and Populations

We use Sample Statistics to estimate Population Parameters.
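The mean calculation for the data 1, 3, 6, 10 and the different sensitivity of the mean and the median to an outlier can be checked in R. The extra value of 100 added below is an invented outlier, used only to illustrate the point.

```r
x <- c(1, 3, 6, 10)   # the data from the "centre of balance" slide

mean(x)               # (1 + 3 + 6 + 10) / 4 = 5
sum(x) / length(x)    # the same calculation written out from the formula
median(x)             # average of the two middle values: (3 + 6) / 2 = 4.5

# Add one hypothetical outlier (100 is an invented value, for illustration only).
x_out <- c(x, 100)
mean(x_out)           # 24: pulled strongly towards the outlier
median(x_out)         # 6: moves only to the next ordered value
```

The median does shift a little when a value is added, but by far less than the mean.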
Sample Statistics estimate Population Parameters

          Sample Statistic    Population Parameter
Mean      $\bar{x}$           $\mu$
Median    $\tilde{x}$         $\tilde{\mu}$

Measures of spread

Numeric data is often described, or summarised, using two statistics
• a measure of centrality, or location, and
• a measure of spread, or dispersion.

[Box plots: Daily Minimum and Maximum Temperatures, 2013; groups: Minimum, Maximum; x-axis: Temperature (°C), 0 to 50]
A measure of variability is important

[Figure: two plots on a vertical axis running from -10 to 40]

The inter-quartile range

The inter-quartile range (IQR) is the difference between the upper and lower quartiles in an ordered set of numerical data.
IQR = UQ - LQ
The IQR gives the range of the middle 50% of a set of data, so is sometimes called the midspread.
The inter-quartile range is rarely influenced by outliers in the data.

[Box plots: Daily Minimum and Maximum Temperatures, 2013; groups: Minimum, Maximum; x-axis: Temperature (°C), 0 to 50]
For the minimum temperatures in 2013: IQR ≈ 18 - 11 = 7
For the maximum temperatures in 2013: IQR ≈ 26.5 - 21 = 5.5
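To see how little the IQR reacts to an outlier compared with the range (maximum minus minimum), the sketch below uses the 2013 maximum temperatures for 1 to 13 January from the spreadsheet excerpt, with and without the single very hot day. IQR() relies on R's default quantile definition, so its quartiles may differ slightly from values read off a box plot.

```r
# Max_2013 for 1-13 January, copied from the spreadsheet excerpt.
max_2013 <- c(26.2, 22.9, 24.8, 26.6, 28.3, 28, 27.5,
              42.3, 25, 25.4, 29.6, 31.2, 23.8)

quantile(max_2013, c(0.25, 0.75))  # lower and upper quartiles: 25.0 and 28.3
IQR(max_2013)                      # UQ - LQ = 3.3
diff(range(max_2013))              # range = max - min = 19.4

# Drop the single very hot day (42.3) and recompute both measures.
without_hot <- max_2013[max_2013 < 40]
IQR(without_hot)                   # 3.125: barely changed
diff(range(without_hot))           # 8.3: less than half the original range
```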
The range

The range is the difference between the maximum value and the minimum value in an ordered set of numerical data.
Range = max - min
The range will be influenced by outliers in the data.

[Box plots: Daily Min and Max Temps, 2013; groups: Minimum, Maximum; x-axis: Temperature (°C), 0 to 50]
For the minimum temperatures in 2013: Range ≈ 24 - 7 = 17
For the maximum temperatures in 2013: Range ≈ 46 - 13 = 33

The Standard Deviation

The standard deviation is a measure of how closely the data are grouped about the mean.
• The larger the standard deviation the greater the spread.
It is defined in terms of the deviations of the data from the mean (called residuals).
The sample standard deviation, s, is the square root of the average (sort of) squared residual.

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Residual = $x_i - \bar{x}$, i.e. observed value - sample mean.
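The sample standard deviation can be worked through step by step in R and compared with the built-in sd() function. The sketch uses two of the small data sets from the lecture's standard deviation examples; the object names residuals and s_by_hand are chosen here for illustration.

```r
x <- c(0, 5, 15, 34, 86)

residuals <- x - mean(x)   # deviations from the sample mean (28)
n <- length(x)

# Square root of the sum of squared residuals divided by n - 1.
s_by_hand <- sqrt(sum(residuals^2) / (n - 1))

s_by_hand                  # about 34.94
sd(x)                      # the built-in function uses the same n - 1 divisor
var(x)                     # the sample variance s^2 is the square of s: 1220.5
sd(c(1, 3, 5, 7, 9))       # about 3.16
```

Dividing by n - 1 rather than n is why the slide calls it the average "(sort of)" squared residual.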
Deviations of points from the mean

[Figure: deviations of points from the mean]

Standard deviation (s)

A measure of how much the data are spread around the mean

Data             Mean    sd
5 5 5 5 5 5 5    5       0
1 3 5 7 9        5       3.16
0 5 15 34 86     28      34.94

Sample Statistics estimate Population Parameters

            Sample Statistic    Population Parameter
Mean        $\bar{x}$           $\mu$
Median      $\tilde{x}$         $\tilde{\mu}$
Std.dev     $s$                 $\sigma$
Variance    $s^2$               $\sigma^2$

The variance, $\sigma^2$, is the square of the standard deviation and is estimated by $s^2$.
Doing it with R!

The data in Excel

• The data we have been using this week are stored in an Excel workbook named Daily MaxMin Temp_1859-2013.xlsx.
• The data we will be using are stored in the spreadsheet called 1913; 2013.
• Use File…Save as… and save the data in Text (Tab-delimited) (*.txt) format, named Daily MaxMin Temp_1859-2013.txt
Reading in the Data into R

From the File Drop Down menu in R select Change dir… and change the working directory in R to the directory and folder where your data are stored.
• Read in the data, type
temp.dat <- read.table("Daily MaxMin Temp_1859-2013.txt", header = TRUE, sep = "\t")
  (header = TRUE: the first row of the dataset contains the names of each variable; sep = "\t": the file is tab-delimited)
• To look at the first 10 rows of data, type
temp.dat[1:10, ]
• To edit the data, type
fix(temp.dat)   (Make changes directly on the spreadsheet)

Renaming variables

You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
# you can re-enter all the variable names in order
# changing the ones you need to change.
# the limitation is that you need to enter all of them!
names(mydata) <- c("x1","age","y", "ses")

# Recoding a continuous variable into a categorical variable
# Mark those days whose maximum temperature is >34 as "VeryHot", and those with <=34 as "NotVeryHot":
temp.dat$VHot2013[temp.dat$Max_2013 > 34] <- "VeryHot"
temp.dat$VHot2013[temp.dat$Max_2013 <= 34] <- "NotVeryHot"
# Convert the column to a factor!!!
temp.dat$VHot2013 <- factor(temp.dat$VHot2013)
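The recoding step can be tried without the data file. The sketch below builds a tiny data frame from five rows of the spreadsheet excerpt (6 to 10 January) and uses ifelse(), an alternative to the two indexed assignments shown above; the name temp.small is hypothetical.

```r
# Five rows of the spreadsheet excerpt, typed in by hand for illustration.
temp.small <- data.frame(Day = 6:10,
                         Max_2013 = c(28, 27.5, 42.3, 25, 25.4))

# ifelse() recodes the whole column in one step; factor() makes it categorical.
temp.small$VHot2013 <- factor(ifelse(temp.small$Max_2013 > 34,
                                     "VeryHot", "NotVeryHot"))

temp.small
table(temp.small$VHot2013)  # NotVeryHot: 4, VeryHot: 1
```

Factor levels are sorted alphabetically by default, so "NotVeryHot" comes before "VeryHot" in tables and bar charts.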
Graphing data in R

Some Graphics commands

R command    Outcome
plot()       2-D scatterplot
barplot()    Bar graph
hist()       Histogram
lines()      Adds lines to a plot
points()     Adds points to a plot
legend()     Adds a legend to the plot
axis()       Adds an axis to the plot
Setting the Graphing Parameters

The par() function defines the settings for subsequent commands.
Arguments within other graphics functions can also be used.
http://www.statmethods.net/advgraphs/parameters.html
http://research.stowers-institute.org/efg/R/Graphics/Basics/mar-oma/index.htm?utm_source=twitterfeed&utm_medium=twitter

Example:
par(mfrow=c(1,1), mar=c(3.0,3.0,3.0,3.0), mgp=c(1.1,0.1,0),
    oma=c(0,2,1.4,0), las=1, tcl=0.2, cex=0.8)

Bar Charts in R

• To construct a bar chart of the categorical variable VHot2013, type
counts <- table(temp.dat$VHot2013)
barplot(counts)

## Detail:
# The bars follow the order of the factor levels: "NotVeryHot" first, then "VeryHot".
barplot(counts, main="Number of Very Hot Days in 2013",
        names.arg=c("34C or less","More than 34C"),
        xlab="Maximum Temperature",
        ylab="Counts",
        col="darkred")

[Bar chart: Number of Very Hot Days in 2013; x-axis: Maximum Temperature; y-axis: Counts]
Numerical summaries in R

Presentation of Numerical data

Present numerical summaries of data in neatly organised tables, with column and row headings
• Easy to read!!!

            n      median    mean     std.dev
Min_1913    365    13.9      13.73    4.35
Max_1913    365    21.3      21.52    5.13
Min_2013    365    14.9      15.03    4.2
Max_2013    365    23.6      23.71    4.36
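A summary table with one row per variable and columns for n, median, mean and standard deviation can be assembled with sapply(). The sketch below uses only the 13 January days typed in from the spreadsheet excerpt, so its numbers will not match the full-year values in the lecture's table; the names temps and summary_table are hypothetical.

```r
# 1-13 January values copied from the spreadsheet excerpt (not the full year).
temps <- data.frame(
  Min_1913 = c(19.1, 17.1, 20.7, 17.5, 15.8, 15.4, 18.9,
               18.6, 16.8, 16, 19.8, 20.1, 19.8),
  Max_1913 = c(25, 27.1, 32.6, 21.9, 23.1, 24.6, 23.9,
               23.8, 23.9, 25.2, 26.3, 26.9, 31.3),
  Min_2013 = c(20.2, 20.3, 18.4, 18.3, 20.9, 21.6, 21.4,
               20.9, 21.1, 20.2, 21.2, 23.5, 20.7),
  Max_2013 = c(26.2, 22.9, 24.8, 26.6, 28.3, 28, 27.5,
               42.3, 25, 25.4, 29.6, 31.2, 23.8)
)

# One row per variable, one column per summary statistic.
summary_table <- data.frame(
  n       = sapply(temps, length),
  median  = sapply(temps, median),
  mean    = round(sapply(temps, mean), 2),
  std.dev = round(sapply(temps, sd), 2)
)

summary_table
```

Rounding inside the table keeps the presentation tidy while the underlying data stay untouched.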
Tables

# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell proportions
prop.table(mytable, 1) # row proportions
prop.table(mytable, 2) # column proportions
# multiply by 100 to express the proportions as percentages

More examples in the Tutorial!
References
Introductory Statistics Lecture Notes, Macquarie University
Susan Imberman: notes on Data Mining vs. Statistics
Wasserman: Chapter 1
R
http://www.statmethods.net/
http://www.statmethods.net/graphs/
http://addictedtor.free.fr/graphiques/
http://www.rseek.org
http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
http://rprogramming.net/
http://it-ebooks.info/book/537/
http://www.ats.ucla.edu/stat/r/