Numerical Data Analysis


Exploratory Data Analysis

NUMERICAL DESCRIPTIVE MEASURES


Case 1: Advertising and Sales

Overview
FOOD4U is a major food and beverage company.
It sells a number of different products across different markets.
It uses “advertising” heavily to promote the products.

Challenge
The company invests heavily in advertising across different media.
However, the company is not sure of the utility of advertising.

Objective
To figure out the effectiveness of advertising for the product across several markets.
Advertising Data Set
The Advertising data set consists of the sales (in
thousands of units) of a particular product in 200
different markets.

It also contains the advertising budgets (in thousands
of dollars) for the product in each of the markets for
three different media: TV, Radio, and Newspaper.
Advertising Data Set
Market ID TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1 4.8
Important Questions for an Effective Market Plan

1. Is there any relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we predict future sales?
Graphical Methods for Quantitative
Variables

Univariate Analysis

Bivariate Analysis
Univariate Analysis
NUMERICAL MEASURES FOR SALES
Univariate Analysis

Measures of Central Tendency

Measures of Dispersion

Measures of Relative Dispersion

Source: https://www.studyblue.com/notes/note/n/types-of-data-variance-and-central-tendency/deck/6635625
Measures of Central
Tendency
• Mean
• Median
• Mode
Measures of Central Tendency For Sales
Measure Value
Mean 14.02
Median 12.9
Mode 9.7
[Figure: histogram of Sales; x-axis: Sales (5 to 30), y-axis: Frequency (0 to 90)]
Measures of
Dispersion
• Range
• Variance
• Standard Deviation
Range
• Range is defined as

𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥𝑖𝑚𝑢𝑚 − 𝑀𝑖𝑛𝑖𝑚𝑢𝑚.


• It is possibly the simplest measure.
• It focusses on extreme values.
• For the Advertising data set, the range of Sales
is 25.4 thousand units.
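The mean, median, and range can be sketched in a few lines of Python. This is only an illustration under a stated assumption: it uses the nine markets listed in the data set excerpt, not all 200 markets, so its results will not match the slide values (mean 14.02, median 12.9, range 25.4). The mode is left out because all nine values are distinct.

```python
from statistics import mean, median

# Sales (thousands of units) for the nine markets shown in the data set excerpt.
# Assumption: this is only a subset of the 200 markets behind the slide values.
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]

sales_mean = mean(sales)                 # arithmetic average
sales_median = median(sales)             # middle value of the sorted data
sales_range = max(sales) - min(sales)    # Range = Maximum - Minimum

print(f"mean   = {sales_mean:.2f}")
print(f"median = {sales_median:.2f}")
print(f"range  = {sales_range:.2f}")
```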
Variance and Standard Deviation
[Figure: Sales for each Market (1 to 200), plotted together with the Average of Sales]
Variance and Standard Deviation
• For a given sample, the sample variance is given by
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
• The sample standard deviation is the square root of the sample variance and is
denoted by $s$.
• A low standard deviation suggests that the values tend to be close to the mean,
while a high standard deviation indicates that the values are spread out over a
wider range.
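A minimal sketch of the sample variance and sample standard deviation, written directly from the definition with the n − 1 divisor. The function names are illustrative, and the data are the nine Sales values from the data set excerpt, so the results differ from the full-sample values reported for all 200 markets.

```python
import math

def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Assumption: only the nine markets shown in the excerpt, not all 200.
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]

print(f"sample variance    = {sample_variance(sales):.2f}")
print(f"standard deviation = {sample_std(sales):.2f}")
```

Python's standard `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and can serve as a cross-check.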
Summation Notation
• Note that
$$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3.$$
• Similarly,
$$\sum_{i=1}^{3} (x_i - \bar{x})^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2.$$
• Note that
$$\sum_{i=1}^{3} (x_i - \bar{x}) = 0.$$
Summary: Advertising Data, Sales
Measure Value
Sample Variance 27.22
Standard Deviation 5.21
Chebyshev’s
Theorem
• For any data set, the proportion of
observations that lie within $k$ standard
deviations from the mean is at least
$$1 - \frac{1}{k^2},$$
where $k$ is any number greater than 1.

Application of
Chebyshev’s Theorem
• Consider a large lecture class with 280
students. The mean score on an exam is
74 with a standard deviation of 8. At
least how many students scored within
58 and 90?
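• The scores 58 and 90 lie two standard deviations below and above the
mean: 74 − 2 × 8 = 58 and 74 + 2 × 8 = 90.
• With k = 2, we have $1 - 1/2^2 = 0.75$.
• At least 75% of 280, or 210, students
scored between 58 and 90.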
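The calculation for the lecture class can be sketched as follows. The function and variable names are illustrative; the numbers (mean 74, standard deviation 8, 280 students, interval 58 to 90) come from the example.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k**2

mean_score, sd_score, n_students = 74, 8, 280
low, high = 58, 90

# The interval is symmetric about the mean, so one side is enough to find k.
k = (high - mean_score) / sd_score
bound = chebyshev_lower_bound(k)

# "At least" this many students, so round up to a whole student.
min_students = math.ceil(bound * n_students)

print(f"k = {k}, lower bound = {bound:.2f}, at least {min_students} students")
```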
Application of Chebyshev’s Theorem to Advertising Data Set
Measure Value
Mean 14.023
Standard Deviation 5.217
Mean − 2 × Standard Deviation 3.588
Mean + 2 × Standard Deviation 24.457
Empirical Rule
• Approximately 68% of all
observations fall in the interval
$\bar{x} \pm s$.
• Approximately 95% of all
observations fall in the interval
$\bar{x} \pm 2s$.
• Almost all observations fall in
the interval $\bar{x} \pm 3s$.
Empirical Rule
• Reconsider the example of the lecture class
with 280 students with a mean score of 74 and a
standard deviation of 8. Assume that the
distribution is symmetric and bell-shaped.
Approximately how many students scored between
58 and 90?
✓ The score 58 is two standard deviations below
the mean, while the score 90 is two standard
deviations above the mean.
✓ Therefore, approximately 95% of the 280 students,
or 0.95(280) = 266 students, scored between 58
and 90.
Chebyshev’s Theorem vs. the Empirical Rule
• The main advantage of Chebyshev’s Theorem is that it applies to all data
sets, regardless of their distributions. It gives only a lower bound on the
percentage of observations lying in a given interval; the actual
percentage can be much greater.
• If we know that the data are drawn from a symmetric, bell-shaped
distribution, it is better to use the Empirical Rule because it is more precise.
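The contrast can be illustrated with a small simulation. This is a sketch on simulated data, not on anything from the slides: it assumes 10,000 exam scores drawn from a normal (bell-shaped) distribution with mean 74 and standard deviation 8, and a fixed seed chosen only for reproducibility.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is reproducible

# Assumption: simulated bell-shaped scores, not real exam data.
scores = [random.gauss(74, 8) for _ in range(10_000)]

x_bar = statistics.mean(scores)
s = statistics.stdev(scores)

# Observed share of scores within two standard deviations of the mean.
within_2s = sum(1 for x in scores if abs(x - x_bar) <= 2 * s) / len(scores)

chebyshev_bound = 1 - 1 / 2**2   # guaranteed minimum for any data set
empirical_rule = 0.95            # approximation for bell-shaped data

print(f"observed share within 2s : {within_2s:.3f}")
print(f"Chebyshev lower bound    : {chebyshev_bound:.2f}")
print(f"Empirical Rule estimate  : {empirical_rule:.2f}")
```

For bell-shaped data the observed share sits close to the Empirical Rule estimate and well above the Chebyshev bound, which holds for any distribution and is therefore conservative.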
Coefficient of Variation
Case 2: Investment Decision
RETURNS FROM GROWTH AND VALUE FUNDS
Case 2: Investment Decision
• As an investment counselor at a large bank, Jacqueline Brennan was
asked by an inexperienced investor to explain the differences between
two top-performing mutual funds:
✓Vanguard’s Growth Index (growth)
✓Vanguard’s Value Index (value)
• The investor has collected sample returns for these two funds for years
2007 through 2016. These data are presented in the next slide.
Case 2: Investment Decision
[Figure: chart of the Growth and Value fund returns for the 10 sample years; vertical axis from −50 to 50]
Descriptive Statistics
Measure             Growth Fund   Value Fund
Mean                10.09         7.56
Standard Error      6.47          5.84
Median              13.02         13.67
Mode                #N/A          #N/A
Standard Deviation  20.45         18.46
Sample Variance     418.10        340.74
Kurtosis            3.42          3.26
Skewness            -1.38         -1.42
Range               74.61         68.82
Minimum             -38.32        -35.97
Maximum             36.29         32.85
Sum                 100.88        75.60
Count               10.00         10.00
Coefficient of Variation (CV)
• CV is a useful statistic for comparing the degree of variation from one data
series to another, even if the means are drastically different from one another.
• In this sense, CV serves as a relative measure of dispersion.
• CV adjusts for differences in the magnitudes of the means.
• CV is unit-less, allowing easy comparisons of mean-adjusted dispersion across
different data sets.
• CV is given by
$$\mathrm{CV} = \frac{\text{SD}}{\text{Mean}} = \frac{s}{\bar{x}}.$$
CV
Fund CV
Growth 2.02
Value 2.44

• The coefficient of variation indicates that returns for the Value mutual fund
have more relative dispersion.
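The comparison can be reproduced from the descriptive statistics with a short sketch. The function name is illustrative, and the inputs are the rounded standard deviations and means reported for the two funds, so the Growth result comes out near 2.03 rather than the 2.02 in the table; the gap is presumably a rounding effect.

```python
def coefficient_of_variation(sd, mean):
    """Relative dispersion: standard deviation divided by the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean

# (standard deviation, mean) taken from the descriptive statistics, as rounded there.
funds = {
    "Growth": (20.45, 10.09),
    "Value": (18.46, 7.56),
}

cv = {name: coefficient_of_variation(sd, mean) for name, (sd, mean) in funds.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.2f}")
```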
Percentiles and
Box Plots

Source: https://www.mana.md/understanding-percentiles/
Percentiles
• In general, the 𝑝-th percentile divides a data set into two parts:
✓ Approximately 𝑝 percent of the observations have values less
than the 𝑝-th percentile;
✓ Approximately (100 − 𝑝) percent of the observations have
values greater than the 𝑝-th percentile.


Percentiles
• Calculating the $p$-th percentile:
✓ First arrange the data in ascending order.
✓ Locate the position, $L_p$, of the $p$-th percentile by using the
formula:
$$L_p = (n + 1)\frac{p}{100}$$
✓ We use this position to find the percentile as shown next.
Percentiles
• Consider the following data:

Position 1 2 3 4 5 6 7 8 9 10
Value −56.02 −7.34 8.09 18.33 33.35 34.30 36.13 43.79 59.45 76.46
Percentiles
• For the 25th percentile, we locate the position:
$$L_{25} = (n + 1)\frac{p}{100} = (10 + 1)\frac{25}{100} = 2.75.$$
• Similarly, for the 75th percentile, we first find:
$$L_{75} = (n + 1)\frac{p}{100} = (10 + 1)\frac{75}{100} = 8.25.$$
Calculation of the 𝑝-th percentile
• Once you find 𝐿𝑝, observe whether it is an integer.
✓ If 𝐿𝑝 is an integer, then the 𝐿𝑝-th observation in the sorted data
set is the 𝑝-th percentile.
✓ If 𝐿𝑝 is not an integer, then interpolate between the two
corresponding observations to approximate the 𝑝-th
percentile.
Percentiles
• Both $L_{25} = 2.75$ and $L_{75} = 8.25$ are non-integers, thus
✓ The 25th percentile is located 75% of the distance between the second
and third observations, and it is
$$-7.34 + 0.75 \times (8.09 - (-7.34)) = 4.23$$
✓ The 75th percentile is located 25% of the distance between the eighth and
ninth observations, and it is
$$43.79 + 0.25 \times (59.45 - 43.79) = 47.71$$
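The steps above can be sketched as a small function. The name `percentile` and the handling of positions that fall outside 1 to n (clamping to the smallest or largest observation) are assumptions made for this illustration; the slides do not cover that case.

```python
def percentile(values, p):
    """p-th percentile using the position L_p = (n + 1) * p / 100."""
    data = sorted(values)                 # step 1: ascending order
    n = len(data)
    position = (n + 1) * p / 100          # step 2: locate L_p

    # Assumption: clamp positions outside 1..n to the extreme observations.
    if position <= 1:
        return data[0]
    if position >= n:
        return data[-1]

    lower = int(position)                 # integer part of L_p (1-based position)
    fraction = position - lower           # share of the distance to the next value
    # If L_p is an integer, fraction is 0 and the L_p-th observation is returned.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

returns = [-56.02, -7.34, 8.09, 18.33, 33.35, 34.30, 36.13, 43.79, 59.45, 76.46]

print(f"25th percentile = {percentile(returns, 25):.4f}")
print(f"75th percentile = {percentile(returns, 75):.4f}")
```

Other software may use a different position formula, so percentiles computed elsewhere can differ slightly from these values.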
Box Plot
• A box plot allows you to
✓ graphically display the distribution of a data set.
✓ compare two or more distributions.
✓ identify outliers in a data set.

[Figure: annotated box plot showing the box, the whiskers, the outliers (marked with asterisks), and the largest and smallest observations]
Calculation
Measure Value
First Quartile (𝑄1) 10.325
Second Quartile (𝑄2) 12.900
Third Quartile (𝑄3) 17.400
IQR (= 𝑄3 − 𝑄1) 7.075
Q1 − 1.5 × IQR -0.288
Q3 + 1.5 × IQR 28.013
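The outlier limits in this calculation can be sketched as follows. The function names are illustrative; the quartiles are the Sales values from the table, and the 1.5 × IQR rule is the one used there.

```python
def outlier_fences(q1, q3, multiplier=1.5):
    """Lower and upper limits beyond which an observation is flagged as an outlier."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def is_outlier(value, q1, q3):
    """True if the value lies outside the 1.5 x IQR fences."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper

q1, q3 = 10.325, 17.400   # first and third quartiles of Sales
lower_fence, upper_fence = outlier_fences(q1, q3)

print(f"lower fence = {lower_fence:.3f}")
print(f"upper fence = {upper_fence:.3f}")
print("Sales of 22.1 flagged:", is_outlier(22.1, q1, q3))
print("Sales of 38.0 flagged:", is_outlier(38.0, q1, q3))
```

A Sales value of 22.1 stays inside the fences, while a value of 38.0, as in the modified data set for market 1, falls above the upper fence.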


Box Plot of
Sales
Modified Advertising Data Set
Market ID TV Radio Newspaper Sales
1 230.1 37.8 69.2 38.0
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1 4.8
Box Plot for
Modified Data
Outliers can be
influential…
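A small sketch of this influence, assuming only the nine markets listed in the two data set excerpts (not all 200): changing the Sales value of market 1 from 22.1 to 38.0 shifts the mean but leaves the median unchanged.

```python
from statistics import mean, median

# Sales for the nine markets shown; assumption: a subset of the full data set.
original = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]

# Modified data set: only market 1 changes, from 22.1 to 38.0.
modified = [38.0] + original[1:]

print(f"mean:   {mean(original):.2f} -> {mean(modified):.2f}")
print(f"median: {median(original):.2f} -> {median(modified):.2f}")
```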
Case 3: IPL Dataset
Box Plot for IPL Player Salary Data
Outliers
Name Salary
V Kohli $2,656,250
MS Dhoni $2,343,750
RG Sharma $2,343,750
RR Pant $2,343,750
Ben Stokes $1,953,130
DA Warner $1,953,130
SP Narine $1,953,130
SPD Smith $1,953,130
AB de Villiers $1,718,750
CH Morris $1,718,750
HH Pandya $1,718,750
KL Rahul $1,718,750
MK Pandey $1,718,750
SK Raina $1,718,750
Association between Two Variables

Source: https://allabouthealthychoices.wordpress.com/2016/03/17/associationcorrelation-vs-causation/
Interesting Example (1)
Insight: Higher crime, more Uber rides. In San Francisco, the areas with the most prostitution, alcohol, theft, and burglary are most positively correlated with Uber trips.
Organization: Uber
Suggested Explanation: “We hypothesized that crime should be a proxy for non-residential population. . . Uber riders are not causing more crime. Right, guys?”

Source: https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/?redirect=1
Interesting Example (2)
Insight: Typing with proper capitalization indicates creditworthiness. Online loan applicants who complete the application form with the correct case are more dependable debtors. Those who complete the form with all lower-case letters are slightly less reliable payers; all capitals reveals even less reliability.
Organization: A financial services startup company
Suggested Explanation: Adherence to grammatical rules reflects a general propensity to correctly comply.

Source: https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/?redirect=1
Interesting Example (3)
Insight: Female-named hurricanes are more deadly. Based on a study of the most damaging hurricanes in the United States during six recent decades, the ones with “relatively feminine” names killed an average of 42 people, almost three times the 15 killed by hurricanes with “relatively male” names.
Organization: University researchers
Suggested Explanation: This may result from “a hazardous form of implicit sexism.” Psychological experiments in a related study “suggested that this is because feminine- versus masculine-named hurricanes are perceived as less risky and thus motivate less preparedness. . . . Individuals systematically underestimate their vulnerability to hurricanes with more feminine names.”

Source: https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/?redirect=1
Relationship between Two
Quantitative Variables
Plot of
Advertising
Data Set
Covariance and Correlation
• The covariance (𝑠𝑥𝑦 ) describes the direction of the linear relationship between
two variables, x and y.
• The correlation coefficient (𝑟𝑥𝑦 ) describes both the direction and the strength of
the linear relationship between x and y.
Covariance and Correlation
• The sample covariance $s_{xy}$ is computed as
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$
Possible Explanation
[Figure: plot divided into quadrants I, II, III, and IV by the Mean of Sales and the Mean of TV Ad Budget]
Covariance and Correlation
• The sample correlation coefficient $r_{xy}$ is computed as
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\text{Covariance}(X, Y)}{\text{SD}(X)\,\text{SD}(Y)}.$$
• Note that
$$-1 \le r_{xy} \le 1.$$
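Both formulas can be sketched directly from their definitions. The function names are illustrative, and the data are the TV budgets and Sales of the nine markets listed in the data set excerpt, so the resulting correlation is not expected to equal the full-sample value for all 200 markets.

```python
import math

def sample_covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Assumption: only the nine markets shown in the excerpt, not all 200.
tv = [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6]
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]

s_tv_sales = sample_covariance(tv, sales)
r_tv_sales = sample_correlation(tv, sales)

print(f"covariance(TV, Sales)  = {s_tv_sales:.2f}")
print(f"correlation(TV, Sales) = {r_tv_sales:.2f}")
```

The covariance depends on the units of both variables, while the correlation is unit-free and always lies between −1 and 1, which is why it can describe strength as well as direction.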
Correlation
[Figure: plots of Sales against each medium: TV (r = 0.78), Radio (r = 0.58), Newspaper (r = 0.23)]
Correlation Matrix
            TV     Radio  Newspaper  Sales
TV          1.00
Radio       0.05   1.00
Newspaper   0.06   0.35   1.00
Sales       0.78   0.58   0.23       1.00
Correlation and Causation
Correlation
does not
imply
causation…

Source: https://www.reddit.com/r/shittyaskscience/comments/19gv2e/why_exactly_does_increased_sales_in_ice_cream/
Correlation and Causation
1. Two apparently unrelated variables can be correlated by just sheer coincidence.

Fig 1. Correlation between money spent on pets in the US and number of lawyers in California
[Figure: dual-axis line chart, 2000 to 2009; series: Money spent on pets in US (Billions of Dollars, 30 to 70) and No. of lawyers in California (Lawyers, 120000 to 150000)]
• Correlation of the two variables is close to perfect = 0.998
• Evidently there is no connection between these two variables though.

Source: Adapted from Tyler Vigen, “Spurious Correlations,” http://www.tylervigen.com/view_correlation?id=2956, accessed March 2015.
Correlation and Causation
2. Causation is an asymmetric relation, while correlation is a symmetric relation.

➢ That X causes Y does not mean that Y causes X.
➢ But if X is correlated with Y, then Y is correlated with X.
➢ For example, the weather forecast is correlated with the actual weather, but this does not
mean that the weather forecast causes the actual weather.
Correlation and Causation
3. Correlation can fail to indicate
causation when a third factor (𝒁)
influences both of the correlated
variables (𝑿 and 𝒀).
[Diagram: Z influences both X and Y]
➢ The ice cream sales (X) and the
rate of drowning deaths (Y) are
found to be correlated. This does not
imply that X causes Y or that Y causes X.
Reading Materials
• Sections 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.8.

• Exercise: 12, 15, 22, 23, 36, 37, 45, 46, 53, 68, 70, 83, 87.
• Case Study 3.4
