Numerical Data Analysis
1. Is there any relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we predict future sales?
Graphical Methods for Quantitative Variables
Univariate Analysis
Bivariate Analysis
Univariate Analysis
NUMERICAL MEASURES FOR SALES
Measures of Dispersion
Source: https://www.studyblue.com/notes/note/n/types-of-data-variance-and-central-tendency/deck/6635625
Measures of Central Tendency
• Mean
• Median
• Mode
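The three measures above can be sketched in a few lines of Python using the standard `statistics` module. The `sales` values here are hypothetical and chosen only for illustration; they are not the Advertising data.

```python
import statistics

# Hypothetical sales figures, used only to illustrate the three measures.
sales = [9.7, 12.9, 9.7, 14.0, 22.1]

mean_sales = statistics.mean(sales)      # arithmetic average of the values
median_sales = statistics.median(sales)  # middle value of the sorted data
mode_sales = statistics.mode(sales)      # most frequently occurring value

print("Mean:", round(mean_sales, 2))
print("Median:", median_sales)
print("Mode:", mode_sales)
```

The mean is pulled upward by the single large value (22.1), while the median and mode are not, which is one reason to report more than one measure.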
Measures of Central Tendency for Sales
Measure   Value
Mean      14.02
Median    12.9
Mode      9.7
[Figure: histogram of Sales; x-axis: Sales (5 to 30), y-axis: Frequency (0 to 90)]
Measures of Dispersion
• Range
• Variance
• Standard Deviation
Range
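• Range is defined as the difference between the maximum and the minimum values in the data set: Range = Maximum − Minimum.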
Variance and Standard Deviation
[Figure: Sales by Market (x-axis: Market, 1 to 197; y-axis: 0 to 30), with series Sales and Average]
• For a given sample, the sample variance is given by
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
• The sample standard deviation is the square root of the sample variance and is
denoted by 𝑠.
• A low standard deviation suggests that the values tend to be close to the mean,
while a high standard deviation indicates that the values are spread out over a
wider range.
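As a rough illustration of the dispersion measures, the sketch below computes the range, the sample variance (with the n − 1 divisor), and the sample standard deviation. The `sales` values and the helper name `sample_variance` are hypothetical, introduced only for this example.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical sales figures, used only for illustration.
sales = [9.7, 12.9, 9.7, 14.0, 22.1]

data_range = max(sales) - min(sales)  # maximum minus minimum
s2 = sample_variance(sales)           # sample variance
s = math.sqrt(s2)                     # sample standard deviation

print("Range:", round(data_range, 2))
print("Variance:", round(s2, 3))
print("Standard deviation:", round(s, 3))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same n − 1 divisor, so they can serve as a cross-check.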
Summation Notation
• Note that
$$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3.$$
• Similarly,
$$\sum_{i=1}^{3} (x_i - \bar{x})^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2.$$
• Note that
$$\sum_{i=1}^{3} (x_i - \bar{x}) = 0.$$
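The summation identities can be checked numerically. This is a minimal sketch with three hypothetical values; it shows that the deviations from the mean sum to zero, while the squared deviations do not.

```python
# Three hypothetical observations x1, x2, x3.
x = [2.0, 5.0, 11.0]

total = sum(x)              # x1 + x2 + x3
x_bar = total / len(x)      # sample mean

deviations = [xi - x_bar for xi in x]
sum_of_deviations = sum(deviations)                  # always zero (up to rounding)
sum_of_squares = sum(d ** 2 for d in deviations)     # numerator of the sample variance

print("Sum:", total)
print("Mean:", x_bar)
print("Sum of deviations:", sum_of_deviations)
print("Sum of squared deviations:", sum_of_squares)
```

Because positive and negative deviations cancel, the variance is built from squared deviations rather than raw ones.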
Summary Measures
[Figure: Growth and Value fund returns plotted over periods 1 to 10]
Measure           Growth Fund   Value Fund
Mean              10.09         7.56
Standard Error    6.47          5.84
Median            13.02         13.67
Mode              #N/A          #N/A
• The coefficient of variation (the standard deviation divided by the mean) indicates that returns for the Value mutual fund have more relative dispersion than returns for the Growth mutual fund.
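The comparison can be sketched from the summary table. The table reports standard errors rather than standard deviations, so this sketch assumes the usual relation standard error = s / √n and assumes n = 10 observations per fund; both are assumptions, not values stated on the slide. The function names are hypothetical.

```python
import math

def std_dev_from_std_error(std_error, n):
    """Recover the sample standard deviation, assuming std_error = s / sqrt(n)."""
    return std_error * math.sqrt(n)

def coefficient_of_variation(mean, std_dev):
    """Relative dispersion: standard deviation divided by the mean."""
    return std_dev / mean

n = 10  # assumption: ten observations per fund

# Mean and standard error as reported in the summary table.
cv_growth = coefficient_of_variation(10.09, std_dev_from_std_error(6.47, n))
cv_value = coefficient_of_variation(7.56, std_dev_from_std_error(5.84, n))

print("CV (Growth):", round(cv_growth, 2))
print("CV (Value):", round(cv_value, 2))
print("Value fund has more relative dispersion:", cv_value > cv_growth)
```

Since both funds are scaled by the same √n, the ordering of the two coefficients does not depend on the assumed sample size.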
Percentiles and Box Plots
Source: https://www.mana.md/understanding-percentiles/
Percentiles
• In general, the 𝑝-th percentile divides a data set into two parts:
✓ Approximately 𝑝 percent of the observations have values less
than the p-th percentile;
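✓ Approximately (100 − 𝑝) percent of the observations have values greater than the 𝑝-th percentile.
✓ To locate the 𝑝-th percentile in the sorted data, first compute its position with the formula:
$$L_p = (n + 1) \frac{p}{100}$$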
✓ We use this position to find the percentile as shown next.
Percentiles
• Consider the following data:
Position 1 2 3 4 5 6 7 8 9 10
Value −56.02 −7.34 8.09 18.33 33.35 34.30 36.13 43.79 59.45 76.46
Percentiles
• For the 25th percentile, we locate the position:
$$L_{25} = (n + 1) \frac{p}{100} = (10 + 1) \frac{25}{100} = 2.75.$$
• Similarly, for the 75th percentile, we first find:
$$L_{75} = (n + 1) \frac{p}{100} = (10 + 1) \frac{75}{100} = 8.25.$$
Calculation of the 𝑝-th percentile
• Once you find 𝐿𝑝, observe whether it is an integer.
✓ If 𝐿𝑝 is an integer, then the 𝐿𝑝-th observation in the sorted data set is the 𝑝-th percentile.
✓ If 𝐿𝑝 is not an integer, then interpolate between the two neighboring observations to approximate the 𝑝-th percentile.
Percentiles
• Both L25 = 2.75 and L75 = 8.25 are non-integers, thus
✓ The 25th percentile is located 75% of the distance between the second and third observations, and it is
−7.34 + 0.75 × (8.09 − (−7.34)) = 4.23
✓ The 75th percentile is located 25% of the distance between the eighth and ninth observations, and it is
43.79 + 0.25 × (59.45 − 43.79) = 47.71
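The position-and-interpolate procedure can be sketched in Python using the ten values from the example. The function name `percentile` is hypothetical, and clamping positions that fall outside the data to the smallest or largest observation is an assumption of this sketch, not something the slides specify.

```python
def percentile(values, p):
    """p-th percentile using the position L_p = (n + 1) * p / 100."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p / 100
    lower = int(position)          # index (1-based) of the observation below L_p
    fraction = position - lower    # how far L_p lies toward the next observation
    if lower < 1:                  # assumption: clamp to the smallest observation
        return data[0]
    if lower >= n:                 # assumption: clamp to the largest observation
        return data[-1]
    # If L_p is an integer, fraction is 0 and the observation itself is returned.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

# The ten observations from the percentile example.
returns = [-56.02, -7.34, 8.09, 18.33, 33.35, 34.30, 36.13, 43.79, 59.45, 76.46]

p25 = percentile(returns, 25)
p75 = percentile(returns, 75)
print("25th percentile:", round(p25, 4))
print("75th percentile:", round(p75, 4))
```

Other software may use a different position formula, so results from spreadsheet or library percentile functions can differ slightly from this method.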
Box Plot
• A box plot allows you to
✓ graphically display the distribution of a data set.
✓ compare two or more distributions.
✓ identify outliers in a data set.
[Figure: annotated box plot labeling the box, the whiskers, the smallest and largest observations, and outliers (marked **)]
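One way to flag outliers numerically is sketched below. It assumes the common 1.5 × IQR convention for the fences, which the slides do not state, and it reuses the (n + 1)p/100 position formula for the quartiles. The function names are hypothetical.

```python
def percentile(values, p):
    """p-th percentile using the position L_p = (n + 1) * p / 100."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p / 100
    lower = int(position)
    fraction = position - lower
    if lower < 1:      # assumption: clamp to the smallest observation
        return data[0]
    if lower >= n:     # assumption: clamp to the largest observation
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

def find_outliers(values, k=1.5):
    """Return observations beyond k * IQR from the quartiles (k = 1.5 is an assumed convention)."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# The ten observations from the percentile example.
returns = [-56.02, -7.34, 8.09, 18.33, 33.35, 34.30, 36.13, 43.79, 59.45, 76.46]

iqr = percentile(returns, 75) - percentile(returns, 25)
print("IQR:", round(iqr, 4))
print("Outliers:", find_outliers(returns))
```

Under this assumed rule, none of the ten observations is flagged, even though −56.02 looks far from the rest; a different fence multiplier would give a different answer.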
Source: https://allabouthealthychoices.wordpress.com/2016/03/17/associationcorrelation-vs-causation/
Interesting Example (1)
Insight: Higher crime, more Uber rides. In San Francisco, the areas with the most prostitution, alcohol, theft, and burglary are most positively correlated with Uber trips.
Organization: Uber
Suggested Explanation: “We hypothesized that crime should be a proxy for non-residential population. . . Uber riders are not causing more crime. Right, guys?”
Source: https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/?redirect=1
Interesting Example (2)
Insight: Typing with proper capitalization indicates creditworthiness. Online loan applicants who complete the application form with the correct case are more dependable debtors. Those who complete the form with all lower-case letters are slightly less reliable payers; all capitals reveals even less reliability.
Organization: A financial services startup company
Suggested Explanation: Adherence to grammatical rules reflects a general propensity to correctly comply.
Source: https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/?redirect=1
Interesting Example (3)
Insight: Female-named hurricanes are more deadly. Based on a study of the most damaging hurricanes in the United States during six recent decades, the ones with “relatively feminine” names killed an average of 42 people, almost three times the 15 killed by hurricanes with “relatively male” names.
Organization: University researchers
Suggested Explanation: This may result from “a hazardous form of implicit sexism.” Psychological experiments in a related study “suggested that this is because feminine- versus masculine-named hurricanes are perceived as less risky and thus motivate less preparedness. . . . Individuals systematically underestimate their vulnerability to hurricanes with more feminine names.”
Source: https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/?redirect=1
Relationship between Two Quantitative Variables
[Figure: plot of the Advertising data set]
Covariance and Correlation
• The covariance (𝑠𝑥𝑦 ) describes the direction of the linear relationship between
two variables, x and y.
• The correlation coefficient (𝑟𝑥𝑦) describes both the direction and the strength of the linear relationship between x and y.
Covariance and Correlation
• The sample covariance 𝑠𝑥𝑦 is computed as
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$
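The covariance formula translates directly into code. The sketch below uses hypothetical TV budgets and sales figures, not the Advertising data, and the function name `sample_covariance` is introduced only for this example.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired observations, used only for illustration.
tv_budget = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [6.0, 9.0, 13.0, 15.0, 17.0]

s_xy = sample_covariance(tv_budget, sales)
print("Sample covariance:", s_xy)
```

A positive result means that above-average budgets tend to occur with above-average sales; the magnitude depends on the units of both variables, so it does not measure strength on its own.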
[Figure: Possible Explanation. Scatter plot of Sales and TV Ad Budget, split into quadrants I to IV by the Mean of Sales and the Mean of TV Ad Budget]
Covariance and Correlation
• The sample correlation coefficient 𝑟𝑥𝑦 is computed as
$$r_{xy} = \frac{s_{xy}}{s_x s_y},$$
where 𝑠𝑥 and 𝑠𝑦 are the sample standard deviations of x and y.
Correlation Matrix
            TV      Radio   Newspaper   Sales
TV          1.00
Radio       0.05    1.00
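A correlation coefficient can be sketched as the sample covariance divided by the product of the two sample standard deviations, which is the standard definition. The data below are hypothetical and the function names are introduced only for this example.

```python
import statistics

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Correlation: covariance scaled by both sample standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical paired observations, used only for illustration.
tv_budget = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [6.0, 9.0, 13.0, 15.0, 17.0]

r_xy = sample_correlation(tv_budget, sales)
print("Sample correlation:", round(r_xy, 4))
```

Unlike the covariance, the result is unit-free and always lies between −1 and 1, so values for different pairs of variables can be compared in a matrix.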
Source: https://www.reddit.com/r/shittyaskscience/comments/19gv2e/why_exactly_does_increased_sales_in_ice_cream/
Correlation and Causation
1. Two apparently unrelated variables can be correlated by sheer coincidence.
[Figure: time series for 2000 to 2009 with two vertical axes, Lawyers (120000 to 135000) and Billions of Dollars (30 to 50)]
• Evidently, there is no connection between these two variables.
• Exercise: 12, 15, 22, 23, 36, 37, 45, 46, 53, 68, 70, 83, 87.
• Case Study 3.4