Exploratory Data Analysis

Exploratory data analysis
Objective
• Get a quick idea of the data
– visualization is the easiest way
– check descriptive statistics
• Clean the data to reduce the number of data problems later on
– handle missing data, outliers, typos, etc.
– be careful: dropping or altering values can change later results
• Explore your data to determine whether the model assumptions are met
– e.g., check normality of the data
Visualization and descriptive statistics
• Visualization
– Histogram
– Boxplot
– Scatter plot (to find the relationship between 2 variables)
• Descriptive statistics
– Mean, median, variance, skewness, kurtosis etc.
– Correlation (for 2 variables)
• Get a rough idea of the distribution of the data
• Check for outliers or missingness
• In general, use these to check the normality of the data
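The scatter plot and correlation listed above can be sketched in a few lines of R. This is a minimal illustration using R's built-in airquality data set; the choice of Temp and Ozone as the two variables is an assumption made only for the example.

```r
# Scatter plot and correlation for two variables of the built-in airquality data.
# Temp and Ozone are picked for illustration only.
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temp", ylab = "Ozone", main = "Ozone vs. Temp")

# Ozone contains missing values, so use only the complete pairs.
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
r
```

A value of r near +1 or -1 suggests a strong linear relationship; the scatter plot helps to see whether the relationship is actually linear.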
Visualization and descriptive statistics
• Skewed right: typically mean > median
• Skewed left: typically mean < median
o The median is robust to extreme values; the mean is not.
o Skewness can therefore be guessed from the mean and median values.
• https://demonstrations.wolfram.com/ExploringSkewnessInBoxPlots/ (boxplot and skewness)
• Normality can be checked (informally) from plots and descriptive statistics
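The mean/median rule of thumb and the robustness of the median can be checked on simulated data. The sketch below assumes an exponential sample (which is right-skewed) and one artificial extreme value of 1000; both are illustrative choices, not part of the original slides.

```r
set.seed(1)                     # arbitrary seed, for reproducibility
x <- rexp(1000, rate = 1)       # exponential data are right-skewed

mean_x   <- mean(x)
median_x <- median(x)
mean_x > median_x               # right skew: the mean is pulled toward the long tail

# Robustness of the median: append one extreme value and compare the shifts.
x_out <- c(x, 1000)
mean_shift   <- mean(x_out) - mean_x
median_shift <- median(x_out) - median_x
c(mean_shift = mean_shift, median_shift = median_shift)
```

The single extreme value moves the mean far more than the median, which is why the median is called robust.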
Example: airquality
• Daily air quality measurements in New York, May to September 1973 (R built-in data)
• 153 observations on 6 variables: Ozone, Solar.R, Wind, …
hist(airquality$Ozone,main="Ozone",xlab="Ozone")
A histogram shows the distribution of the data. It can also be used to assess potential outlier concerns.
boxplot(airquality$Ozone,ylab="Ozone")
points(mean(airquality$Ozone, na.rm=TRUE), col="red")
Example of descriptive statistics
summary(airquality$Ozone, na.rm=TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 31.50 42.13 63.25 168.00 37
# skewness() is not in base R; it needs an add-on package.
library(e1071)  # assumed; the slides do not name the package
mean(airquality$Ozone, na.rm=TRUE)      # mean
## [1] 42.12931
var(airquality$Ozone, na.rm=TRUE)       # variance
## [1] 1088.201
skewness(airquality$Ozone, na.rm=TRUE)  # skewness
## [1] 1.209866
range(airquality$Ozone, na.rm=TRUE)     # range [min, max]
## [1] 1 168
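Skewness and kurtosis can also be computed in base R directly from central moments, which avoids depending on any particular package. This is a sketch under the usual moment definitions; packages differ in their small-sample conventions, so the skewness value printed on the slide may correspond to only one of the two variants below.

```r
x <- airquality$Ozone[!is.na(airquality$Ozone)]  # drop missing values first
n <- length(x)
d <- x - mean(x)

m2 <- mean(d^2)   # second central moment (divides by n, unlike var())
m3 <- mean(d^3)   # third central moment
m4 <- mean(d^4)   # fourth central moment

skew_g1     <- m3 / m2^1.5      # moment-based skewness
skew_sd     <- m3 / sd(x)^3     # variant scaled by the sample standard deviation
kurt_excess <- m4 / m2^2 - 3    # excess kurtosis (0 for a normal distribution)

c(skew_g1 = skew_g1, skew_sd = skew_sd, kurt_excess = kurt_excess)
```

A positive skewness agrees with the summary output, where the mean exceeds the median.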
Missing data handling
• Could fill a separate semester-long course
• Missing-data mechanisms:
– Missing Completely at Random (MCAR)
• Missingness occurs purely at random
– Missing at Random (MAR)
– Missing Not at Random (MNAR)
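Before choosing a handling strategy it helps to see how much is missing and where. The sketch below counts missing values in the built-in airquality data and applies listwise deletion; using listwise deletion here is an illustrative assumption, and it is generally safe only when the data are MCAR.

```r
na_counts <- colSums(is.na(airquality))   # number of NAs in each column
na_counts
mean(is.na(airquality$Ozone))             # proportion of missing Ozone values

# Listwise deletion: keep only rows with no missing value in any column.
complete <- airquality[complete.cases(airquality), ]
nrow(airquality)
nrow(complete)
```

Comparing the two row counts shows how many observations listwise deletion discards.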
Important statistical assumptions
• Normality
o Why is checking normality important?
1) A t-test or ANOVA requires the normality assumption
2) With correlation and regression techniques, lack of normality and outliers affect your conclusions
• The normal distribution is symmetric and bell-shaped
o The converse is NOT true: e.g., the Cauchy distribution and the t-distribution are symmetric and bell-shaped but not normal
o Many tests can be used to check for normality and outliers in the data.
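That a symmetric, bell-shaped sample need not be normal can be seen by simulation. The sketch below draws from a normal, a t, and a Cauchy distribution and applies the Shapiro-Wilk test as one example of a formal normality test; the sample size, seed, and degrees of freedom are arbitrary choices for illustration.

```r
set.seed(42)   # arbitrary seed, for reproducibility
n <- 500

samples <- list(
  normal = rnorm(n),
  t3     = rt(n, df = 3),   # symmetric, but heavier tails than the normal
  cauchy = rcauchy(n)       # symmetric, with very heavy tails
)

# Shapiro-Wilk p-value for each sample; small values indicate non-normality.
p_values <- sapply(samples, function(s) shapiro.test(s)$p.value)
p_values
```

The heavy-tailed samples are symmetric around zero, yet the test picks up their departure from normality through the tails.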
Inference based on Normality
• Under the normality assumption, we can perform the following tests.
✓ One-sample t-test
(e.g., test whether iPhone battery life span > 2 years)
✓ Two-sample t-test
(e.g., test whether iPhone and Galaxy have the same life span)
✓ ANOVA test (simply put, comparing group means among more than two groups)
(e.g., test among iPhone, Galaxy and Android phone)
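The three tests can be run in base R with t.test() and aov(). The sketch below uses the built-in airquality data in place of the phone examples; the variables, the months compared, and the threshold of 9 are hypothetical choices made only to show the calls.

```r
# One-sample t-test: is the mean Wind greater than 9? (9 is a hypothetical threshold)
one_sample <- t.test(airquality$Wind, mu = 9, alternative = "greater")
one_sample

# Two-sample t-test: do May and August have the same mean Ozone?
# t.test() uses the Welch version (unequal variances) by default.
oz_may <- airquality$Ozone[airquality$Month == 5]
oz_aug <- airquality$Ozone[airquality$Month == 8]
two_sample <- t.test(oz_may, oz_aug)
two_sample

# One-way ANOVA: compare mean Ozone across the five months.
fit <- aov(Ozone ~ factor(Month), data = airquality)
summary(fit)
```

Both functions drop missing Ozone values before testing, so no explicit na.rm argument is needed.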
Detection of Normality
• How to check normality?
✓ Qualitative check by looking at:
: histogram, boxplot, quantile-quantile plot (Q-Q plot), etc.
✓ Quantitative check by a formal test
: Shapiro-Wilk test …
• For a comparison among groups (e.g., t-test, ANOVA), the normality check should be conducted within each group
• If at least one group does not follow a normal distribution, t-test or ANOVA conclusions may NOT be valid.
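A per-group normality check can be sketched by splitting the response by group and testing each piece. The example below assumes Month is the grouping variable for Ozone in the built-in airquality data and uses α = 0.05; both are illustrative choices.

```r
# Split Ozone by Month and run a Shapiro-Wilk test within each group.
ozone_by_month <- split(airquality$Ozone, airquality$Month)
p_by_month <- sapply(ozone_by_month, function(x) shapiro.test(x)$p.value)
round(p_by_month, 4)

# Groups where normality is rejected at the 5% level.
names(p_by_month)[p_by_month < 0.05]
```

shapiro.test() drops missing values itself, so each month is tested on its observed Ozone values only.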
qqnorm(airquality$Ozone); qqline(airquality$Ozone, col = 2)
Quantile-quantile plots (a.k.a. Q-Q plots): a useful diagnostic of how well a specified theoretical distribution fits your data. If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line.
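What a normal Q-Q plot draws can be reproduced by hand: sorted data values are plotted against standard normal quantiles at evenly spaced probabilities. The sketch below is an illustration of that idea on the Ozone values with missing values removed, compared against the coordinates returned by qqnorm().

```r
oz <- airquality$Ozone[!is.na(airquality$Ozone)]  # drop missing values
n  <- length(oz)

theoretical <- qnorm(ppoints(n))  # standard normal quantiles at n plotting positions
observed    <- sort(oz)           # sample quantiles

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() returns the same coordinates (in the original data order).
qq <- qqnorm(oz, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```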
Shapiro-Wilk Normality test
shapiro.test(airquality$Ozone)
##
## Shapiro-Wilk normality test
##
## data: airquality$Ozone
## W = 0.87867, p-value = 2.79e-08
H0: Data follows a normal distribution
H1: Data does not follow a normal distribution
• If the p-value is larger than the significance level (in general α=0.05), we do not have enough evidence to reject the null hypothesis, so we proceed as if the data follows a normal distribution (this does not prove normality)
• If the p-value is smaller than the significance level, we have enough evidence to reject the null hypothesis, so we conclude that the data does not follow a normal distribution
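The decision rule can be written out in code by comparing the test's p-value with the significance level. This is a minimal sketch assuming α = 0.05, applied to the Ozone test shown above.

```r
alpha <- 0.05                            # significance level (assumed)
res   <- shapiro.test(airquality$Ozone)  # missing values are dropped by the test

reject_h0 <- res$p.value < alpha
if (reject_h0) {
  message("Reject H0: the data does not appear to follow a normal distribution")
} else {
  message("Fail to reject H0: no evidence against normality")
}
```

For Ozone the p-value reported above (2.79e-08) is far below 0.05, so H0 is rejected.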