Exploratory Data Analysis
Objective
• Get a quick idea of the data
– visualization is the easiest way
– check descriptive statistics
• Clean the data to reduce the number of
data problems later
– handle missing data, outliers, typos, etc.
– need to be careful!
• Explore your data to determine whether the model
assumptions are met etc.
– E.g., check normality of data
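As a rough sketch of the cleaning checks listed above, the following counts missing values and flags unusually large values in the built-in airquality data. The 1.5 × IQR fence is the usual boxplot convention and is an assumption here, not a rule taken from the slides; flagged values should be inspected, not dropped automatically.

```r
# airquality ships with base R (datasets package)
data(airquality)

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Flag possible high outliers in Ozone with the 1.5 * IQR boxplot rule
# (assumed convention, for illustration only)
oz <- airquality$Ozone
q3 <- quantile(oz, 0.75, na.rm = TRUE)
fence_hi <- q3 + 1.5 * IQR(oz, na.rm = TRUE)
outliers <- oz[!is.na(oz) & oz > fence_hi]
print(outliers)
```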
Visualization and descriptive statistics
• Visualization
– Histogram
– Boxplot
– Scatter plot (to find the relationship between 2 variables)
• Descriptive statistics
– Mean, median, variance, skewness, kurtosis etc.
– Correlation (for 2 variables)
• https://demonstrations.wolfram.com/ExploringSkewnessInBoxPlots/
(boxplot and skewness)
• Normality can be checked informally using plots and
descriptive statistics
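The statistics and plots listed above could be computed along these lines for the airquality data. Base R has no built-in skewness or kurtosis function, so the moment-based formulas below are written out by hand as one possible definition; the variable names are illustrative.

```r
data(airquality)
oz <- na.omit(airquality$Ozone)   # drop missing Ozone values

# Basic descriptive statistics
oz_mean   <- mean(oz)
oz_median <- median(oz)
oz_var    <- var(oz)

# Moment-based skewness and kurtosis (hand-written, illustrative)
z <- (oz - oz_mean) / sqrt(mean((oz - oz_mean)^2))
oz_skew <- mean(z^3)   # 0 for a symmetric distribution
oz_kurt <- mean(z^4)   # 3 for a normal distribution

# Correlation between two variables, using complete pairs only
r_oz_temp <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")

# Visual checks: boxplot of one variable, scatter plot of two
boxplot(oz, main = "Ozone", ylab = "Ozone")
plot(airquality$Temp, airquality$Ozone, xlab = "Temp", ylab = "Ozone")
```

A mean well above the median together with positive skewness is an informal sign that the data are right-skewed and so unlikely to be normal.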
Example : airquality
• Daily air quality measurements in New York,
May to September 1973. (R built-in data)
• 153 observations on 6 variables – Ozone,
Solar.R, Wind, …
hist(airquality$Ozone,main="Ozone",xlab="Ozone")
Quantile-Quantile Plots
(a.k.a. Q-Q plots): A useful diagnostic of how well a specified
theoretical distribution fits your data. If the quantiles of the
theoretical and data distributions agree, the plotted points fall
on or near the reference line.
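A normal Q-Q plot of the Ozone variable can be sketched with base R's qqnorm() and qqline(). This is one way to draw it and not necessarily the code behind the slide's figure.

```r
data(airquality)
oz <- na.omit(airquality$Ozone)   # drop missing Ozone values

# Sample quantiles against theoretical normal quantiles
qq <- qqnorm(oz, main = "Normal Q-Q plot: Ozone")

# Reference line through the first and third quartiles
qqline(oz)
```

Points that bend away from the line, especially in the tails, suggest the data depart from a normal distribution.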
Shapiro-Wilk Normality test
shapiro.test(airquality$Ozone)
##
## Shapiro-Wilk normality test
##
## data: airquality$Ozone
## W = 0.87867, p-value = 2.79e-08
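To read the result programmatically, the test object can be inspected as in this short sketch. The 0.05 significance level is an assumed convention, not something specified in the slides.

```r
data(airquality)

# shapiro.test() drops missing values before testing
res <- shapiro.test(airquality$Ozone)

alpha <- 0.05   # assumed significance level, for illustration
reject_normality <- res$p.value < alpha

if (reject_normality) {
  message("Reject the null hypothesis of normality at alpha = ", alpha)
} else {
  message("No evidence against normality at alpha = ", alpha)
}
```

With the p-value reported above (2.79e-08), normality of Ozone is rejected at this level, which agrees with the informal impression from the histogram and Q-Q plot.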