Objectives of STAT5002


9/03/2015

Stat 5002
An Introduction to Statistics with Applications
in Computing

Lecture 1

Introduction to Statistical Thinking

• Samples ↔ Populations;
• Sample Statistics ↔ Population Parameters;
• Graphical summaries of Data;
• Numerical summaries of Data.

https://elearning.sydney.edu.au/webapps/

Objectives of STAT5002

To introduce students to

• basic statistical concepts and methods for further studies.
• methodologies related to statistical data analysis and Data Mining.
• a number of useful statistical models
• computer oriented estimation procedures
• smoothing and nonparametric concepts
• analysis of large data sets.
• the R computing language for all computational aspects in the course

Objective of Statistics
©Sydney University 3


Samples ↔ Populations

We use information from a SAMPLE to answer questions or discover features about a target POPULATION.

Populations (ALL)

• Define the target population: the population to which we want to generalize our findings.
• Specify characteristics that identify the members of the population. Who/What? Where? When?
  Example: Characteristics such as age, income, education, gender and marital status are typically used in studies concerning people.
• A sampling frame is a List or Rule Defining the Population. This is usually unachievable, and we often need to restrict our studies to the population to which we can gain access.

Samples: Some of All

• Individual observations should be selected independently!
• Samples need to be representative of the population (not biased)
• Sample size needs to be large enough!

Representative Sample

It is often difficult, or even impossible, to obtain a random sample.

A random sample is one where each member of the population has the same chance of being selected.

Independent observations
∴ random sample:
Representative of population
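As a minimal sketch of what "the same chance of being selected" means in practice, the R code below draws a simple random sample. The population of 1000 ID numbers, the sample size of 50 and the seed are all assumptions made for illustration; they are not part of the course data.

```r
# Hypothetical sampling frame: ID numbers for a population of 1000 members.
set.seed(5002)                  # fixed seed so the draw is reproducible
population_ids <- 1:1000

# Simple random sample of 50 members, drawn without replacement,
# so every member has the same chance of being selected.
srs <- sample(population_ids, size = 50, replace = FALSE)

length(srs)          # number of members selected
anyDuplicated(srs)   # 0 means no member was selected twice
```

A real study also needs a usable sampling frame; the code only illustrates the selection step.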


Samples need to be

• Samples need to be representative of the target population
• Observations within samples must be independent of each other
• Samples must not be biased!

Bias

Bias may be defined as any systematic error (i.e. not occurring randomly) which results in incorrect conclusions about the target population.

Some types of bias include

• selection bias
• measurement bias
• response bias
• confounding

Types of Bias

• Selection Bias
  Selection bias refers to any systematic differences occurring in the way that subjects are selected for a study.
• Measurement Bias
  Measurement bias refers to systematic differences in the measurement of variables.
• Response bias
  Response bias can occur when the response rate to a survey is too low.
• Confounding
  A confounder is a variable that distorts (increases or decreases) the apparent effect of one variable (determinant) on another variable (outcome).

Two schools of Thought

Frequentist                Bayesian
Population is fixed        Population varies
Samples vary (somewhat)    Sample is fixed


Scope of Statistics/Data Mining

Understand Problem!!

Study                        Data Mining
Design Study
Collect Sample               Obtain Data
Organise Data                Organise Data
Data Analysis                Exploratory Data Analysis
Interpretation of Results    Interpretation of Results
Report Results               Report Results?

Scope of Statistics

Where do data come from?

Statistical Studies
• Observational Studies
• Experimental Studies

Data Mining
• from Databases

Types of Statistical Studies

An observational study is one in which there is no intervention by the investigator nor is there any treatment imposed.

An experimental study is one in which the investigator has some control over the determinant.


Obtaining Data

Statistical Studies
• Observational Studies
• Experimental Studies
  Stanford prison experiment
  http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html
  http://www.med.uottawa.ca/sim/data/Study_Designs_e.htm

Data Mining
• from Databases
  http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html

Experimental Studies

[Diagram: Population → Sampling → Sample → Randomisation → Experimental Group and Control Group. First Data Collection (Before) in both groups, then Compare. The Experimental Group receives the Treatment; the Control Group receives No Treatment. Second Data Collection (After) in both groups, then Compare.]
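The randomisation step of an experimental study can be sketched in R as follows. The 20 subjects, the equal group sizes and the seed are assumptions chosen for illustration only.

```r
# Hypothetical sample of 20 subjects, identified by number.
set.seed(2015)
subjects <- 1:20

# Shuffle ten "Experimental" and ten "Control" labels so that
# chance alone decides which subjects receive the treatment.
group <- sample(rep(c("Experimental", "Control"), each = 10))
allocation <- data.frame(subject = subjects, group = group)

table(allocation$group)   # ten subjects in each group
```

Random allocation is what lets the later before/after comparison be attributed to the treatment rather than to how the groups were formed.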

CRISP Data Mining


Cross Industry Standard Process for Data Mining
Aim: To develop an industry tool and application neutral process
for conducting Knowledge Discovery (KD).

Data Mining


Variables

Measurements taken on subjects in a study vary amongst subjects.
These measurements (data) are usually organised in a spreadsheet consisting of rows and columns.
• The rows contain information about individual subjects or records.
• The columns contain the values of the measurements that vary: the variables.

Data: Evidence from Samples

Variables usually take on specific roles

determinants influence outcomes

Predictors                 Outcomes
Explanatory variable/s     Response variable/s
Input                      Output
independent variable/s     dependent variable/s

A Spreadsheet

BOM station                Max_   Min_   Max_   Min_   >34_   >34_   <9_    <9_    Diff   Diff
number       Month   Day   1913   1913   2013   2013   1913   2013   1913   2013   Max    Min
66062        1       1     25     19.1   26.2   20.2   0      0      0      0      7.1    1.1
66062        1       2     27.1   17.1   22.9   20.3   0      0      0      0      5.8    3.2
66062        1       3     32.6   20.7   24.8   18.4   0      0      0      0      4.1    -2.3
66062        1       4     21.9   17.5   26.6   18.3   0      0      0      0      9.1    0.8
66062        1       5     23.1   15.8   28.3   20.9   0      0      0      0      12.5   5.1
66062        1       6     24.6   15.4   28     21.6   0      0      0      0      12.6   6.2
66062        1       7     23.9   18.9   27.5   21.4   0      0      0      0      8.6    2.5
66062        1       8     23.8   18.6   42.3   20.9   0      1      0      0      23.7   2.3
66062        1       9     23.9   16.8   25     21.1   0      0      0      0      8.2    4.3
66062        1       10    25.2   16     25.4   20.2   0      0      0      0      9.4    4.2
66062        1       11    26.3   19.8   29.6   21.2   0      0      0      0      9.8    1.4
66062        1       12    26.9   20.1   31.2   23.5   0      0      0      0      11.1   3.4
66062        1       13    31.3   19.8   23.8   20.7   0      0      0      0      4      0.9
66062        1       14    25.2   20     23.7   17.1   0      0      0      0      3.7    -2.9
66062        1       15    25.9   20.1   24.9   16.8   0      0      0      0      4.8    -3.3
66062        1       16    27.1   20.9   27.2   19.1   0      0      0      0      6.3    -1.8
66062        1       17    27.8   20.4   29     21.4   0      0      0      0      8.6    1
66062        1       18    30.6   19.4   45.8   21.7   0      1      0      0      26.4   2.3
66062        1       19    27.7   20.2   24.8   21.5   0      0      0      0      4.6    1.3
66062        1       20    21.7   19.6   24.3   20.2   0      0      0      0      4.7    0.6
66062        1       21    22.2   16.9   26.6   20.7   0      0      0      0      9.7    3.8
66062        1       22    24.6   15     29.6   20.9   0      0      0      0      14.6   5.9

http://www.bom.gov.au/climate/data/

Types of Data


Types of Data

Variable Types
• Categorical/Group: nominal, ordinal
• Numerical/Quantitative: discrete, continuous

Categorical variables

Categorical variables are variables where each observation falls into one of a finite number of groups.

Nominal variables: named variables with no implicit order.
Examples: Type of cancer, Personality type

Ordinal variables: grouped variables with implicit order.
Examples: Level of education, grade

If there are two groups the variable is often referred to as being binary or dichotomous (having two possible values).
Binary variables can be either
• nominal, such as sex, or
• ordinal, such as age group, e.g. < 20 years, ≥ 20 years.
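In R, categorical variables are usually stored as factors. The sketch below shows one way to declare a nominal factor, an ordinal factor and a binary ordinal variable; all the values are made up for illustration.

```r
# Nominal: named groups with no implicit order (hypothetical values).
cancer_type <- factor(c("lung", "skin", "breast", "skin"))

# Ordinal: groups with an implicit order, declared with ordered = TRUE.
education <- factor(c("secondary", "primary", "tertiary", "secondary"),
                    levels = c("primary", "secondary", "tertiary"),
                    ordered = TRUE)

is.ordered(cancer_type)   # FALSE: no order among the groups
is.ordered(education)     # TRUE
education < "tertiary"    # order comparisons only make sense for ordinal data

# Binary (dichotomous) ordinal variable built from a numerical one.
age <- c(15, 22, 19, 40)
age_group <- cut(age, breaks = c(-Inf, 20, Inf), right = FALSE,
                 labels = c("< 20 years", ">= 20 years"),
                 ordered_result = TRUE)
age_group
```

Declaring the level order explicitly matters because R otherwise sorts levels alphabetically.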

[Figure: items classified by Size (ordinal: Small, Medium, Large) and Colour (nominal: White, Green, Purple).]

Numerical / Quantitative Variables

Numerical variables are measured variables and can be either discrete or continuous.

Discrete variables are variables that take discrete values: e.g. number of children, number of people in a store.

Continuous variables are those that can assume many values within a certain range or interval: e.g. height, weight, pulse rate.

Numerical variables are also referred to as interval or scale variables.


'Numerical' data

• cover a range of values
• usually measured with an instrument or along some scale or counted (but a large number).

Discrete - can only take some values
For example:
• Marks in a test (max half mark accuracy only)
• number of steps walked in a day (whole numbers)

Continuous - can take any values
Example:
• Distance walked in a day (2.6km, 2.67km, 2.675km, etc)
• University entrance scores

The Variables in the Spreadsheet

Variable             Description                             Data Type
BOM station number   Bureau of Meteorology station number    Categorical, Nominal
Month                Month of Year (1:12)                    Categorical, Ordinal
Day                  Day of Month (1:31)                     Numeric, discrete
Max_1913             1913 Daily max temp (°C)                Numeric, continuous
Min_1913             1913 Daily min temp (°C)                Numeric, continuous
Max_2013             2013 Daily max temp (°C)                Numeric, continuous
Min_2013             2013 Daily min temp (°C)                Numeric, continuous
Very Hot?_1913       1: 1913 max temp > 34; 0: otherwise     Categorical, Ordinal
Very Hot?_2013       1: 2013 max temp > 34; 0: otherwise     Categorical, Ordinal
Very Cold?_1913      1: 1913 min temp < 9; 0: otherwise      Categorical, Ordinal
Very Cold?_2013      1: 2013 min temp < 9; 0: otherwise      Categorical, Ordinal
Diff Max             Max_2013 - Max_1913                     Numeric, continuous
Diff Min             Min_2013 - Min_1913                     Numeric, continuous

All graphs need

A title
Clearly labelled axes
Appropriate comments

 to have clarity
 to be aesthetically satisfying

Summarising Data: Graphical Methods


Displaying Categorical Data: One Variable - Bar Chart

[Figure: bar chart "Number of Very Hot Days in 1913", with bars for "Not so Hot" and ">34 C".]
[Figure: bar chart "Number of Very Cold Days 2013", with bars for "Not so Cold" and "<9C".]

Contingency Table: Showing counts for Two Categorical Variables

                  Temperature
Year     < 9C   Not so extreme   > 34C   Total
1913     65     295              5       365
2013     33     326              6       365
Total    98     621              11      730
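The contingency table of counts can be rebuilt in R from its six cell counts, as in the sketch below. The counts are those given in the lecture; the object names counts and with_totals are chosen here for illustration.

```r
# Counts of days by year and temperature category, as given in the lecture.
counts <- matrix(c(65, 295, 5,
                   33, 326, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Year = c("1913", "2013"),
                                 Temperature = c("< 9C", "Not so extreme", "> 34C")))
counts <- as.table(counts)

# addmargins() appends the row and column totals (labelled "Sum").
with_totals <- addmargins(counts)
with_totals
```

With raw data, table() applied to two factor columns would produce the same kind of object directly.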

Clustered bar chart

A Clustered Bar Chart is a visual display showing associations between two categorical variables.

[Figure: clustered bar chart "Numbers of Very Hot, and Very Cold Days in 1913 and 2013"; x-axis: Temperatures (< 9C, Not so extreme, > 34C); y-axis: Number of Days; one bar per year (1913, 2013).]

It appears that
• most days were not so extreme in both 1913 and 2013
• there was a larger proportion of extremely cold days in 1913 than in 2013
• the proportion of very hot days was low in both years

Numerical Summary: Categorical Data

For categorical data we simply tabulate the counts and/or proportions of data (denoted p in a sample, or π in a population) in the categories of interest.

Counts of Days in each Year
Year   < 9C     Not so extreme   > 34C
1913   65       295              5
2013   33       326              6

Percentages of Days in each Year
Year   < 9C     Not so extreme   > 34C
1913   17.81%   80.82%           1.37%
2013   9.04%    89.32%           1.64%
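The percentages of days in each year are row proportions of the table of counts. A minimal R sketch, re-entering the lecture's counts so that the block runs on its own (the object names are illustrative):

```r
# Counts of days by year and temperature category, as given in the lecture.
counts <- matrix(c(65, 295, 5,
                   33, 326, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Year = c("1913", "2013"),
                                 Temperature = c("< 9C", "Not so extreme", "> 34C")))

# margin = 1 divides each count by its row (year) total.
row_pct <- round(100 * prop.table(counts, margin = 1), 2)
row_pct

# Clustered bar chart: one cluster per temperature category, one bar per year.
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Temperatures", ylab = "Number of Days")
```

Each row of row_pct sums to 100 (up to rounding), which is a quick check that the correct margin was used.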


Displaying Numerical Data

Histogram

A histogram is a simple and effective display, useful for displaying the distribution of numerical data.

• A histogram shows the number of observations that fall into each of several non-overlapping groups or bins.
• The bins of a histogram adjoin each other so there are no gaps between bins, unless a bin is empty.

[Figure: histogram "Daily Minimum Temperatures in 2013"; x-axis from 7 to 24, y-axis from 0 to 40.]
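The binning behind a histogram can be inspected in R, since hist() returns the bin boundaries and counts. The 15 temperatures below are invented for illustration and are not the course data; the bin width of 4 is likewise an arbitrary choice.

```r
# Hypothetical daily minimum temperatures (degrees C); not the course data.
min_temp <- c(7.2, 9.5, 11.0, 12.4, 13.1, 13.9, 14.6, 15.0, 15.8,
              16.3, 17.7, 18.2, 19.4, 20.9, 23.6)

# Adjoining, non-overlapping bins of width 4, from 4 to 24.
h <- hist(min_temp, breaks = seq(4, 24, by = 4),
          main = "Daily Minimum Temperatures (hypothetical)",
          xlab = "Temperature (degrees C)", ylab = "Number of days")

h$breaks        # bin boundaries
h$counts        # number of observations falling in each bin
sum(h$counts)   # every observation is counted exactly once
```

Changing the breaks argument changes the picture, so it is worth trying a few bin widths before settling on one.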

Structure of a Box Plot

[Figure: a box plot with its whiskers, lower quartile, median, upper quartile and outliers labelled.]

Median and quartiles

A boxplot displays a five-number summary of a numerical set of data. These numbers are

Minimum:          the smallest value
Lower Quartile:   separates the lower 25% of values from the rest
Median:           the half-way point of the data
Upper Quartile:   separates the upper 25% of values from the rest
Maximum:          the largest value

A boxplot also identifies any unusually large or small values in a dataset, called outliers.
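A five-number summary can be computed in R with quantile(), as sketched below on ten made-up values. Note that R supports several quartile conventions, so the quartiles may differ slightly from a hand calculation; boxplot() itself uses "hinges", which are close to but not always equal to the quantile() quartiles.

```r
# Hypothetical temperatures (degrees C), including one unusually large value.
temps <- c(7, 11, 13, 14, 15, 16, 18, 19, 24, 42)

five <- quantile(temps, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five) <- c("Minimum", "Lower Quartile", "Median", "Upper Quartile", "Maximum")
five

# Values that boxplot() would draw as separate outlier points.
boxplot.stats(temps)$out

boxplot(temps, horizontal = TRUE, xlab = "Temperature (degrees C)")
```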


Comparative Box Plots

Box plots enable the comparison of several samples of data simultaneously.

[Figure: comparative box plots "Daily Minimum and Maximum Temperatures, 1913 and 2013" for 1913_Min, 2013_Min, 1913_Max and 2013_Max; x-axis: Temperature (°C), 0 to 50.]

It appears that both minimum and maximum daily temperatures in 2013 were slightly higher than those in 1913.
See: http://freedom.indiemaps.com/

Comparing box plots

When making comparisons using box plots compare
• centres
• spreads and
• mention unusual observations

Scatter Plot

A scatter plot shows the relation between two numerical variables.

The two variables, X and Y, are referred to as the predictor and response variable respectively, although they do have other names.

X              Y
predictor      response
determinant    outcome
independent    dependent

Construction of scatter plot

• Draw X and Y axes to cover the range of the two variables.
• Label the axes and mark the scale
• Plot one point for each observation, i.e. (x, y)
• Comment on the plot.

If X increases and Y increases then a POSITIVE relation exists.
If X increases and Y decreases then a NEGATIVE relation exists.


Scatterplots of Temperatures

[Figure: two scatter plots. "Minimum Temperatures: 1913 and 2013" (x-axis: Minimum Temperatures 1913; y-axis: Minimum Temperatures 2013) and "Maximum Temperatures: 1913 and 2013" (x-axis: Maximum Temperatures 1913; y-axis: Maximum Temperatures 2013). All axes run from 4 to 44.]

The points on the diagonal lines represent days where the minimum (or maximum) temperature in 1913 was the same as in 2013.
Is there a sensible message here??

Displaying Data

Data Type            Categorical             Numerical
Categorical          Clustered bar chart     Comparative Box plots
Numerical            Comparative Box plots   Scatter plots
One Variable Only    Bar Chart               Histograms

http://www.gapminder.org/videos/the-joy-of-stats/

http://www.gapminder.org/world

Wordle

http://www.oceancalendars.com.au

Data summaries: Numerical Data

Measures of Centre

Mode:    The most frequently occurring value in the dataset.
         The data may be nominal, ordinal or numeric.

Median:  The middle value when all the data are placed in order.
         The data must be ordinal or numerical.
         For an even number of values the median is the average of the two middle values.

Mean:    The Arithmetic Average.
         The data must be either discrete or continuous.
         The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.

http://www.youtube.com/watch?v=oNdVynH6hcY
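The three measures of centre can be computed in R as sketched below. mean() and median() are base R functions; base R has no function for the statistical mode, so stat_mode() is a hypothetical helper written for this example, and the data are made up.

```r
# Hypothetical data with an even number of values.
x <- c(1, 3, 3, 6, 10, 7)

mean(x)     # sum(x) / length(x)
median(x)   # even n: average of the two middle values, here (3 + 6) / 2

# Hypothetical helper: the most frequently occurring value(s), as text.
stat_mode <- function(v) {
  tab <- table(v)
  names(tab)[tab == max(tab)]
}
stat_mode(x)
```

Because stat_mode() works from a frequency table, it can also be applied to nominal or ordinal data, where the mean is not defined.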


The Mean

The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$x_i$: the $i$th value of the data
$\bar{x}$: the average or the 'mean' of the $x$ values
$\Sigma$ (sigma): 'the sum of'.

Medians and Means
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

Mean versus median

The median cuts the data into two sections with the same number of observations in each.

The mean is the centre of gravity (point of balance) of the data.

[Figure: Symmetric Data, with 50% of the observations on each side of the centre; Mean = Median.]

The mean is affected by outliers, the median is not.
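A small R illustration of how an outlier moves the mean but not the median. Both data sets are invented; they differ only in the largest value.

```r
# Hypothetical maximum temperatures (degrees C).
temps        <- c(21, 22, 23, 24, 25)
with_outlier <- c(21, 22, 23, 24, 45)   # largest value replaced by an outlier

c(mean = mean(temps),        median = median(temps))         # both are 23
c(mean = mean(with_outlier), median = median(with_outlier))  # mean 27, median 23
```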

Mean: Centre of balance

Data: 1 3 6 10

Mean = (1 + 3 + 6 + 10)/4
     = 5

[Figure: the data marked on a number line from 0 to 10, balancing at the mean.]

Samples ↔ Populations

We use Sample Statistics to estimate Population Parameters


Sample Statistics estimate Population Parameters

          Sample Statistics   Population Parameters
Mean      $\bar{x}$           $\mu$
Median    $\tilde{x}$         $\tilde{\mu}$

Measures of spread

Numeric data are often described, or summarised, using two statistics
• a measure of centrality, or location, and
• a measure of spread, or dispersion.

[Figure: box plots "Daily Minimum and Maximum Temperatures, 2013" for Minimum and Maximum; x-axis: Temperature (°C), 0 to 50.]

A measure of variability is important

[Figure: two plots on a common vertical scale from -10 to 40, showing data sets with different variability.]

The inter-quartile range

The inter-quartile range (IQR) is the difference between the upper and lower quartiles in an ordered set of numerical data.

IQR = UQ - LQ

The IQR gives the range of the middle 50% of a set of data, so is sometimes called the midspread.

The inter-quartile range is rarely influenced by outliers in the data.

[Figure: box plots "Daily Minimum and Maximum Temperatures, 2013" for Minimum and Maximum; x-axis: Temperature (°C), 0 to 50.]

For the minimum temperatures in 2013:
IQR ≈ 18 - 11 = 7
For the maximum temperatures in 2013:
IQR ≈ 26.5 - 21 = 5.5
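The definition IQR = UQ - LQ can be checked in R against the built-in IQR() function. The sketch uses made-up values and R's default quartile convention; replacing the largest value with an outlier leaves the IQR unchanged.

```r
# Hypothetical temperatures (degrees C).
x         <- c(7, 11, 13, 14, 15, 16, 18, 19, 24, 25)
x_outlier <- replace(x, length(x), 42)   # largest value becomes an outlier

q <- quantile(x, probs = c(0.25, 0.75))  # lower and upper quartiles
iqr_manual <- unname(q[2] - q[1])        # IQR = UQ - LQ

iqr_manual
IQR(x)           # same value from the built-in function
IQR(x_outlier)   # unchanged by the outlier
```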


The range

The range is the difference between the maximum value and the minimum value in an ordered set of numerical data.

Range = max - min

The range will be influenced by outliers in the data.

[Figure: box plots "Daily Min and Max Temps, 2013" for Minimum and Maximum; x-axis: Temperature (°C), 0 to 50.]

For the minimum temperatures in 2013:
Range ≈ 24 - 7 = 17
For the maximum temperatures in 2013:
Range ≈ 46 - 13 = 33

The Standard Deviation

The standard deviation is a measure of how closely the data are grouped about the mean.

• The larger the standard deviation the greater the spread.

It is defined in terms of the deviations of the data from the mean (called residuals).
The sample standard deviation, s, is the square root of the average (sort of) squared residual.

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Residual = $x_i - \bar{x}$, i.e. observed value - sample mean.
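The formula for s can be followed step by step in R and compared with the built-in sd() function. The sketch uses the values 1, 3, 5, 7, 9, which also appear in the lecture's standard deviation examples; the object names are illustrative.

```r
x <- c(1, 3, 5, 7, 9)

residuals <- x - mean(x)                  # observed value - sample mean
sum_sq    <- sum(residuals^2)             # sum of squared residuals
s_manual  <- sqrt(sum_sq / (length(x) - 1))

s_manual   # about 3.16
sd(x)      # built-in function gives the same value
var(x)     # the variance is the square of the standard deviation
```

Dividing by n - 1 rather than n is why the lecture calls it the average "(sort of)" squared residual.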

Deviations of points from the mean

[Figure: deviations of the data points from the mean.]

Standard deviation (s)

Data             Mean   sd
5 5 5 5 5 5 5    5      0
1 3 5 7 9        5      3.16
0 5 15 34 86     28     34.94

A measure of how much the data are spread around the mean

Sample Statistics estimate Population Parameters

           Sample Statistics   Population Parameters
Mean       $\bar{x}$           $\mu$
Median     $\tilde{x}$         $\tilde{\mu}$
Std.dev    $s$                 $\sigma$
Variance   $s^2$               $\sigma^2$

The variance, $\sigma^2$, is the square of the standard deviation and is estimated by $s^2$.

The data in Excel


The data we have been using this week is stored in an
Excel workbook named Daily MaxMin Temp_1859-
2013.xlsx.

The data we will be using are stored in the spreadsheet


called 1913; 2013.

Use File…Save as… and save the data in Text (Tab-delimited) (*.txt) format, named Daily MaxMin Temp_1859-2013.txt

Doing it with R!

Reading in the Data into R

From the File Drop Down menu in R select Change dir… and change the working directory in R to the directory and folder where your data are stored in Excel.

• Read in the data, type
    # header=TRUE: the first row of the dataset contains the names of each variable
    # sep="\t": the file was saved in tab-delimited format
    temp.dat = read.table("Daily MaxMin Temp_1859-2013.txt", header=TRUE, sep="\t")
• To look at the first 10 rows of data, type
    temp.dat[1:10, ]
• To edit the data, type
    fix(temp.dat)   # make changes directly on the spreadsheet

Renaming variables

You can rename variables programmatically or interactively.

    # rename interactively
    fix(mydata) # results are saved on close

    # rename programmatically
    # you can re-enter all the variable names in order
    # changing the ones you need to change.
    # the limitation is that you need to enter all of them!
    names(mydata) <- c("x1","age","y", "ses")

    # Recoding a continuous variable into categorical variable
    # Mark days whose maximum temperature is > 34 as "VeryHot", and those
    # with <= 34 as "NotVeryHot":
    temp.dat$VHot2013[temp.dat$Max_2013 > 34] <- "VeryHot"
    temp.dat$VHot2013[temp.dat$Max_2013 <= 34] <- "NotVeryHot"

    # Convert the column to a factor!!!
    temp.dat$VHot2013 <- factor(temp.dat$VHot2013)
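The recoding step can be tried without the course file. The sketch below uses a small made-up data frame, temp_demo, whose Max_2013 column mirrors the variable name used in the lecture, and uses ifelse() as an alternative to the two indexed assignments.

```r
# Hypothetical maximum temperatures (degrees C); not read from the course file.
temp_demo <- data.frame(Max_2013 = c(26.2, 22.9, 42.3, 34.0, 45.8))

# Recode in one step: > 34 is "VeryHot", otherwise "NotVeryHot".
temp_demo$VHot2013 <- factor(
  ifelse(temp_demo$Max_2013 > 34, "VeryHot", "NotVeryHot"),
  levels = c("NotVeryHot", "VeryHot")
)

# Frequency table of the new categorical variable.
counts <- table(temp_demo$VHot2013)
counts
```

Note that a day with a maximum of exactly 34.0 is classed as "NotVeryHot", because the rule is strictly greater than 34.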

Graphing data in R

Some Graphics commands

R command    Outcome
plot()       2-D scatterplot
barplot()    Bar graph
hist()       Histogram
lines()      Adds lines to a plot
points()     Adds points to a plot
legend()     Adds a legend to the plot
axis()       Adds an axis to the plot


Setting the Graphing Parameters

The par() function defines the settings for subsequent commands.
Arguments within other graphics functions can also be used.
http://www.statmethods.net/advgraphs/parameters.html
http://research.stowers-institute.org/efg/R/Graphics/Basics/mar-oma/index.htm?utm_source=twitterfeed&utm_medium=twitter

Example:
    par(mfrow=c(1,1), mar=c(3.0,3.0,3.0,3.0), mgp=c(1.1,0.1,0),
        oma=c(0,2,1.4,0), las=1, tcl=0.2, cex=0.8)

Bar Charts in R

• To construct a bar chart of the categorical variable VHot2013, type
    counts <- table(temp.dat$VHot2013)
    barplot(counts)

## Detail:
    # table() orders the factor levels alphabetically: "NotVeryHot", then "VeryHot",
    # so the bar labels in names.arg must follow the same order
    barplot(counts, main="Number of Very Hot Days in 2013",
            names.arg=c("34C or less","More than 34C"),
            xlab="Maximum Temperature",
            ylab="Counts",
            col="darkred")

[Figure: bar chart "Number of Very Hot Days in 2013"; x-axis: Maximum Temperature; y-axis: Counts.]
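Because par() changes the settings for all subsequent plots, a common pattern is to save the old settings and restore them afterwards. A minimal sketch, assuming the 2013 counts implied by the lecture's contingency table (6 days above 34C and 33 + 326 = 359 other days); the layout and titles are arbitrary choices:

```r
# 2013 counts derived from the lecture's contingency table.
counts <- c(NotVeryHot = 359, VeryHot = 6)

# par() returns the previous values of the settings it changes.
old_par <- par(mfrow = c(1, 2), las = 1, cex = 0.8)

barplot(counts, main = "Counts", col = "darkred")
barplot(100 * counts / sum(counts), main = "Percentages", col = "darkred")

par(old_par)   # restore the earlier settings for later plots
```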

Numerical summaries in R

Presentation of Numerical data

Present numerical summaries of data in neatly organised tables, with column and row headings

• Easy to read!!!

            n     median   mean    std.dev
Min_1913    365   13.9     13.73   4.35
Max_1913    365   21.3     21.52   5.13
Min_2013    365   14.9     15.03   4.2
Max_2013    365   23.6     23.71   4.36
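One possible way to build such a summary table in R is sketched below. It uses only the first five days of the 2013 columns from the spreadsheet shown earlier, typed in by hand, so its numbers will not match the full-year table; temp_demo and summary_table are illustrative names.

```r
# First five days of Min_2013 and Max_2013 from the lecture's spreadsheet.
temp_demo <- data.frame(Min_2013 = c(20.2, 20.3, 18.4, 18.3, 20.9),
                        Max_2013 = c(26.2, 22.9, 24.8, 26.6, 28.3))

# One row per variable, one column per summary statistic.
summary_table <- t(sapply(temp_demo, function(v) {
  c(n = length(v), median = median(v), mean = mean(v), std.dev = sd(v))
}))

round(summary_table, 2)
```

sapply() applies the same function to every column, so adding more variables to the data frame adds rows to the table without further code.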


Tables

    # 2-Way Frequency Table
    # A and B are two categorical variables (columns) in the data frame mydata
    mytable <- table(mydata$A, mydata$B) # A will be rows, B will be columns
    mytable # print table

    margin.table(mytable, 1) # A frequencies (summed over B)
    margin.table(mytable, 2) # B frequencies (summed over A)

    prop.table(mytable)    # cell proportions (multiply by 100 for percentages)
    prop.table(mytable, 1) # row proportions
    prop.table(mytable, 2) # column proportions

More examples in the Tutorial!

References
Introductory Statistics Lecture Notes, Macquarie University
Susan Imberman: notes on Data Mining vs. Statistics
Wasserman: Chapter 1

R
http://www.statmethods.net/
http://www.statmethods.net/graphs/
http://addictedtor.free.fr/graphiques/
http://www.rseek.org
http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
http://rprogramming.net/
http://it-ebooks.info/book/537/
http://www.ats.ucla.edu/stat/r/
