Course Title: Data Pre-Processing and Visualization


Ram Mohan Dhara
IMTG/ PGDM/ Term – VI / 2017-2019
Session 3 : EDA (Exploratory Data Analysis)
Session objectives
After completing this session, you will be able to:
• Carry out a comprehensive exploration of a dataset in R
Exploratory Data Analysis
• The dlookr package includes the following EDA functions:
• describe() - provides descriptive statistics for numerical variables
• normality() and plot_normality() - test numerical variables for normality and visualize their normality
• correlate() and plot_correlate() - calculate the correlation coefficients between pairs of numerical variables and plot the correlation matrix
• target_by() - defines the target variable
• relate() - describes the relationship between the target variable and a variable of interest
• plot.relate() - visualizes the relationship between the target variable and a variable of interest
• summary() - gives a detailed summary of the analysis
• eda_report() - performs an exploratory data analysis and reports the results
Calculating descriptive statistics using describe() in R
• n : number of observations excluding missing values
• na : number of missing values
• mean : arithmetic average
• sd : standard deviation
• se_mean : standard error of the mean, sd/sqrt(n)
• IQR : interquartile range (Q3-Q1)
• skewness : skewness
• kurtosis : kurtosis
• p25 : Q1, the 25th percentile
• p50 : Q2, the median (50th percentile)
• p75 : Q3, the 75th percentile
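The statistics above can be reproduced for a single numeric variable in base R, which may help in checking what each column means. The sketch below is an illustration only: the helper name describe_one is hypothetical, the built-in mtcars data is used as a stand-in dataset, and the skewness and kurtosis formulas are simple moment-based choices that may differ from the estimators dlookr uses.

```r
# Base R sketch of the statistics listed above for one numeric variable.
# describe_one() is a hypothetical helper, not part of dlookr.
describe_one <- function(x) {
  n_missing <- sum(is.na(x))
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  # Central moments (population form); an assumption, not dlookr's formula
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m4 <- mean((x - m)^4)
  data.frame(
    n = n,                     # observations excluding missing values
    na = n_missing,            # number of missing values
    mean = m,
    sd = s,
    se_mean = s / sqrt(n),     # standard error of the mean
    IQR = q[3] - q[1],         # Q3 - Q1
    skewness = m3 / m2^1.5,
    kurtosis = m4 / m2^2,      # non-excess kurtosis (about 3 for normal data)
    p25 = q[1],
    p50 = q[2],
    p75 = q[3]
  )
}

# One missing value is appended to show how n and na differ
mpg_stats <- describe_one(c(mtcars$mpg, NA))
print(mpg_stats)
```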
Test of normality of numeric variables using normality()
• statistic : test statistic of the Shapiro-Wilk test
• p_value : p-value of the Shapiro-Wilk test
• sample : number of sample observations used in the Shapiro-Wilk test
Visualization of normality of numerical variables using plot_normality()
• Histogram of original data
• Q-Q plot of original data
• Histogram of log transformed data
• Histogram of square root transformed data
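As a rough illustration of the test and the plots listed here, the same checks can be run with base R's shapiro.test(), hist() and qqnorm(). This is a sketch under assumptions: mtcars$mpg stands in for a numeric variable, run_shapiro is a hypothetical helper, and the output columns only mimic the names listed above.

```r
# Shapiro-Wilk test on the original, log and square-root transformed data.
x <- mtcars$mpg  # strictly positive, so log() and sqrt() are defined

# run_shapiro() is a hypothetical helper that tidies the test result
run_shapiro <- function(values, label) {
  res <- shapiro.test(values)
  data.frame(
    data = label,
    statistic = unname(res$statistic),  # W statistic
    p_value = res$p.value,
    sample = length(values)             # observations used in the test
  )
}

normality_tbl <- rbind(
  run_shapiro(x, "original"),
  run_shapiro(log(x), "log"),
  run_shapiro(sqrt(x), "sqrt")
)
print(normality_tbl)

# Four panels in the spirit of the plots listed above;
# drawn only in an interactive session
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  hist(x, main = "Original data")
  qqnorm(x, main = "Q-Q plot of original data")
  qqline(x)
  hist(log(x), main = "Log transformed data")
  hist(sqrt(x), main = "Square root transformed data")
  par(op)
}
```

A small p-value suggests the data depart from normality; comparing the three rows gives a hint as to whether a log or square root transformation brings the variable closer to normal.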
Calculation of correlation coefficient using correlate()
• r : Pearson's correlation coefficient
Visualization of the correlation matrix using plot_correlate()
• Visualizes the correlation matrix
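To make the coefficient concrete, Pearson's r for every pair of numeric variables can be computed with base R's cor(). The sketch below assumes three mtcars columns as example variables and reshapes the matrix into one row per variable pair; the column names var1, var2 and r are illustrative choices, not a statement of the dlookr output format.

```r
# Pearson correlation matrix for a few numeric variables
num_vars <- mtcars[, c("mpg", "hp", "wt")]
cor_mat <- cor(num_vars, method = "pearson")
print(round(cor_mat, 3))

# Long format: one row per ordered pair of distinct variables
pairs_idx <- which(row(cor_mat) != col(cor_mat), arr.ind = TRUE)
cor_long <- data.frame(
  var1 = rownames(cor_mat)[pairs_idx[, "row"]],
  var2 = colnames(cor_mat)[pairs_idx[, "col"]],
  r = cor_mat[pairs_idx]  # matrix indexing picks r for each (row, col) pair
)
print(cor_long)
```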


EDA on target variable using target_by(),
relate(), plot.relate(), summary()
• target_by() - creates a target object that marks one variable as the target
• relate() - describes the relationship between the target and a predictor
• plot.relate() - plots the relationship between the target and a predictor
• summary() - gives a summary of the analysis carried out in the background
EDA on target variable
Target variable | Predictor variable | target_by() nomenclature | relate() | plot.relate() | summary()
Categorical | Continuous | tar_cat_pred_cont | Description of the predictor at different levels of the target | Density plot of the predictor at different levels of the target | Summary of target
Categorical | Categorical | tar_cat_pred_cat | Creates a contingency table | Mosaic plot between target and predictor | Chi-sq statistic; tests independence between target and predictor
Continuous | Continuous | tar_cont_pred_cont | Runs a simple linear regression | Scatter plot with a trend line between target and predictor | Linear regression output between target and predictor
Continuous | Categorical | tar_cont_pred_cat | Creates an ANOVA table | Box plots of the target at different levels of the predictor | F-statistic
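The four target and predictor combinations correspond to familiar base R analyses. The sketch below shows one plausible base R counterpart for each case, assuming mtcars as the example data with am and cyl treated as categorical; it illustrates the underlying statistics and is not a description of how dlookr implements relate().

```r
# Example data: treat am and cyl as categorical variables
df <- transform(mtcars, am = factor(am), cyl = factor(cyl))

# Categorical target, continuous predictor:
# describe the predictor at each level of the target
cat_cont <- aggregate(mpg ~ am, data = df, FUN = mean)

# Categorical target, categorical predictor:
# contingency table and chi-squared test of independence
cat_cat <- table(target = df$am, predictor = df$cyl)
chi <- suppressWarnings(chisq.test(cat_cat))  # small counts trigger a warning

# Continuous target, continuous predictor: simple linear regression
cont_cont <- lm(mpg ~ wt, data = df)

# Continuous target, categorical predictor: one-way ANOVA (F-statistic)
cont_cat <- aov(mpg ~ cyl, data = df)

print(cat_cont)
print(cat_cat)
print(chi)
print(summary(cont_cont))
print(summary(cont_cat))
```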
A comprehensive report of EDA using eda_report()
• Introduction
• Information of Dataset
• Information of Variables
• Numerical Variables
• Univariate Analysis
• Descriptive Statistics
• Normality Test of Numerical Variables
• Statistics and Visualization of (Sample) Data
• Relationship Between Variables
• Correlation Coefficient
• Correlation Coefficient by Variable Combination
• Correlation Plot of Numerical Variables
• Target based Analysis
• Numerical Variables and Categorical Variables
• Correlation and Correlation Plots
Summary : what we have learnt
• Essential steps in exploration of data using R

• describe()
• normality() and plot_normality()
• correlate() and plot_correlate()
• target_by()
• relate() and plot.relate()
• summary()
• eda_report()
This concludes the session :
EDA (Exploratory Data Analysis)

Next session :
Introduction to Visual Analytics and Tableau
