Business Analyst Slide Deck
course
Introduction to the course
Description
Business Analyst
1 This video is dedicated to bureaucracies
Econometrics and
Causal Inference
Segmentation
2 The course has statistics as the base
Predictive
Analytics
3 A business analyst needs to know 3 Analytics types
Description
Description
Game Plan
2 Master the principles to make it easier in the future
Description Visualization
Methodological Representation
$$\bar{x} = \frac{\sum x_i}{n}$$
Description
Case Study Briefing – Baseball
1 We have a dataset with baseball teams’ data from 1962 to 2012
3 We will practice statistical concepts on the dataset
https://www.baseball-reference.com/
Mode and Median
Mode
The most frequent number in a set
X: 2 3 3 5 -> Mode is 3
Median
The central number of an ordered set
X: 2 3 5 -> Median is 3
X: 2 3 5 10 -> Median is 4 (the average of the two central numbers, 3 and 5)
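The slide's examples can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative only.

```python
from statistics import median, mode

repeated_set = [2, 3, 3, 5]   # the mode example from the slide
odd_set = [2, 3, 5]           # odd number of values: one central number
even_set = [2, 3, 5, 10]      # even number of values: two central numbers

mode_value = mode(repeated_set)    # most frequent number
median_odd = median(odd_set)       # the middle value
median_even = median(even_set)     # average of the two middle values, 3 and 5

print(mode_value, median_odd, median_even)
```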
(Pearson) Correlation
Description Visualization
0 indicates no relationship
Correlation does not imply causation
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
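As a sketch of how the Pearson formula translates into code, the function below computes r from two equally long lists. The function name `pearson_r` and the sample numbers are assumptions made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_variation / sqrt(ss_x * ss_y)

# Illustrative data: a perfectly increasing and a perfectly decreasing relationship
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```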
Standard Deviation
[Figure: histograms of X against # occurrences]
Description
3 We have 2 datasets
Normal Distribution aka Gaussian Distribution
Description Visualization
Description Visualization
Key Idea
A normal distribution is a pattern, and
patterns enable us to categorize data
with more confidence
1SD 2SD 3SD
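To make the 1SD / 2SD / 3SD bands concrete, the sketch below computes the share of a normal distribution that falls within k standard deviations of the mean, using `statistics.NormalDist` from the standard library. The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"{k}SD: {share:.1%}")
```

Roughly 68%, 95% and 99.7% of the values fall within 1, 2 and 3 standard deviations.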
Case Study – Wine Quality
Description
You are a wannabe Wine Statistics Connoisseur
Challenge1
1 Normal Distribution
2 Standard Errors
3 Confidence Intervals
Paulo Cortez,
University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal
@2009
P-value is all about likelihood
Description
The probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming the null hypothesis is correct.
It helps us decide whether to reject or "fail to reject" (loosely, "accept") the null hypothesis
A small p-value (small probability) would mean we favor the alternative hypothesis
P-value threshold usually used: 0.05
Examples
H0: The average salary of business analysts is €60k
H1: Business analysts’ average salary is not €60k
P-value = 0.2 -> We fail to reject the null hypothesis
H0: Blueberries prevent cancer
H1: Blueberries do not prevent cancer
P-value = 0.01 -> We reject the null hypothesis
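The decision rule in the two examples can be written as a tiny function. This is only a sketch of the rule with the usual 0.05 threshold; the function name `decide` is an assumption, and the p-values are the ones quoted on the slide.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value against the threshold alpha (0.05 by convention)."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"

salary_decision = decide(0.2)      # salary example from the slide
blueberry_decision = decide(0.01)  # blueberry example from the slide
print(salary_decision, "|", blueberry_decision)
```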
Shapiro-Wilk test
Description
Quantifies how likely it is that the data was drawn from a Gaussian distribution
Created in 1965 and is one of many normality tests
Interpretation
H0: The distribution is Gaussian
If p-value > 0.05
The data appear to be normally distributed
In practical terms
Helps us understand whether one group is
(statistically) different from the other
How do we know?
If the p-value is less than 0.05, then the groups are statistically different
[Figure: Hand usage while talking]
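A hedged sketch of what sits behind the comparison: Welch's version of the two-sample t statistic, which does not assume equal variances. The function name and the sample data are made up for illustration. Turning t into a p-value needs the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) covers that step.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard error of each group mean (sample variance / group size)
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical measurements for two groups
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 3), round(dof, 2))
```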
Challenge – Understanding Remote Work predictions
1 T-tests
2 Chi-square tests
(Pearson) Chi-square test
Visualization
                       Wears black: Yes   Wears black: No
Lives in Berlin: Yes
Lives in Berlin: No
Chi-square test
Determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies
Difference from t-test
A t-test tests a null hypothesis about two means; a chi-square test compares observed frequencies with expected frequencies
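As an illustration, the chi-square statistic for a table like "Lives in Berlin" by "Wears black" can be computed as below. The counts are hypothetical (the slide shows no numbers), and the function name is an assumption.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = lives in Berlin (yes, no), columns = wears black (yes, no)
table = [[30, 10],
         [20, 40]]
chi2 = chi_square_statistic(table)
print(round(chi2, 3))
# For a 2x2 table (1 degree of freedom) the 5% critical value is 3.841;
# a statistic above it corresponds to a p-value below 0.05.
```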
Description
Package: Ecdat
(Linear) Regression crash course
Visualization Definition
Study of a relationship between a dependent
variable and at least one independent variable
Price
Intuition perspective
$$Y = a + b \cdot X + e$$
Interpretation
If I increase X by 1, Y increases by b
If X happens, Y increases by b
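A minimal sketch of how a and b are estimated with ordinary least squares for a single X. The data points are invented so that the fitted line is easy to verify by eye.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b * X with one independent variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar  # the fitted line passes through the point of means
    return a, b

# Invented data lying exactly on Y = 1 + 2 * X
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # increasing X by 1 increases Y by b
```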
How to read a Regression result
Coefficients
One carat increases the price by 11.6k
Statistical Significance
If P>|t| is less than 0.05, we have statistical
significance
Dummy variable trap
Multicollinearity
The three dummies always sum to 1, so each one is fully determined by the other two (with only two brands, the correlation between Coca-Cola and Pepsi would be -1)
Observation   Coca-Cola   Pepsi   White Label
a             1           0       0
b             1           0       0
c             1           0       0
d             0           0       1
e             0           0       1
f             0           1       0
g             0           1       0
h             0           0       1
j             0           1       0
Solution: remove one dummy variable
Removing does not mean information is lost
A zero also represents information
The removed dummy variable becomes part of the intercept.
You can see it as being your baseline.
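The solution (drop one dummy and treat it as the baseline) might look like this in code. The helper `dummy_encode` is a hypothetical name, and choosing "White Label" as the baseline is an assumption made for the example.

```python
def dummy_encode(values, baseline):
    """One 0/1 column per category, except the baseline (absorbed by the intercept)."""
    categories = sorted(set(values) - {baseline})
    return [{category: int(value == category) for category in categories}
            for value in values]

brands = ["Coca-Cola", "White Label", "Pepsi"]
rows = dummy_encode(brands, baseline="White Label")
for brand, row in zip(brands, rows):
    print(brand, row)  # the baseline brand is the row where every dummy is 0
```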
MULTILINEAR
REGRESSION
Description
Beware of multicollinearity or
overfitting
Shape
Case Study Description
Visualization Interpretation
Outliers can damage your analysis
[Figure: Salaries against Professional experience]
Modelling is finding the balance between under and
overfitting
Description
Splitting between training and test enables an unbiased
model assessment
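A minimal sketch of a random training/test split. The 20% test share and the fixed seed are assumptions for the example, not values from the course.

```python
import random

def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle the rows and hold out a share of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(10))
print(len(train), len(test))
```

The model is fitted on `train` only; `test` stays untouched until the assessment, which is what keeps the assessment unbiased.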
Model Assessment
Mean Absolute Error (MAE) vs Root Mean Squared Error (RMSE)
Visualization Key ideas
$$MAE = \frac{\sum |y - \hat{y}|}{n}$$
$$RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}}$$
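Both error measures in code, as a short sketch with invented actual and predicted values. Because the errors are squared before averaging, the root mean squared error weighs the single large error (3) more heavily than the mean absolute error does.

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error."""
    return sqrt(sum((y_hat - y) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual))

actual = [3, 5, 7]       # invented values
predicted = [2, 5, 10]   # errors of 1, 0 and 3
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(round(mae, 3), round(rmse, 3))
```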
Challenge
2 Analyze the data i.e. summary statistics
3 Correlation Matrix
6 Assess Accuracy
Dataset: Ecdat package
LOGISTIC
REGRESSION
Description
spam?
3 Can we predict which emails are spam?
DAAG package
Logistic Regression crash course
$$Y = a + b \cdot X + e$$
b = 0.5: For each X unit increase, Y increases by 0.5
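The slide shows the linear form only. In logistic regression the linear part a + b * X is passed through the logistic function, so the output is a probability between 0 and 1 and b acts on the odds rather than on Y directly. A minimal sketch, with a = 0 and b = 0.5 chosen purely for illustration:

```python
from math import exp

def predicted_probability(x, a, b):
    """Logistic function applied to the linear predictor a + b * x."""
    return 1 / (1 + exp(-(a + b * x)))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

a, b = 0.0, 0.5  # illustrative coefficients
p_at_0 = predicted_probability(0, a, b)
p_at_1 = predicted_probability(1, a, b)
odds_ratio = odds(p_at_1) / odds(p_at_0)  # equals exp(b)
print(p_at_0, round(p_at_1, 3), round(odds_ratio, 3))
```

Under these assumptions a one-unit increase in X multiplies the odds by exp(0.5), roughly 1.65, instead of adding 0.5 to Y.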
Logistic Regression
Confusion Matrix
                Predicted: False   Predicted: True
Actual: False   True Negative      False Positive
Actual: True    False Negative     True Positive
Accuracy
Accuracy = (True Positive + True Negative) / All
Balanced dataset
Sensitivity, Recall or True Positive Rate
True Positive / (True Positive + False Negative)
Focus is on True values
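Both metrics follow directly from the four cells of the confusion matrix. The counts below are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Sensitivity / true positive rate: share of actual positives that were found."""
    return tp / (tp + fn)

# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 40, 45, 5, 10
model_accuracy = accuracy(tp, tn, fp, fn)
model_recall = recall(tp, fn)
print(model_accuracy, model_recall)
```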
Description
Use Logistic Regression to predict the sex of penguins
Challenge
2 Plot Histograms of the character variables
Econometrics and Causal Inference
1 Decision-Making
According to BNP Paribas, Sustainability-focused
companies perform better
Description
Visualization
Context Visualization
This graph shows the sales in the market. The event started where the red line is.
Comparing before and after would subject you to omitted variable bias.
[Figure: market sales between 10000 and 20000, from 9/1/2020 to 11/1/2020; a red line marks the start of the event]
Brodersen, Kay H.; Gallusser, Fabian; Koehler, Jim; Remy, Nicolas; Scott, Steven L. Inferring causal impact using Bayesian structural time-series models. Ann. Appl. Stat. 9
(2015), no. 1, 247--274. doi:10.1214/14-AOAS788. https://projecteuclid.org/euclid.aoas/1430226092
Difference-in-differences framework
The delta between what actually happened and the what-if scenario is the treatment impact
[Figure: Price over Time; series: Bitcoin, Google, Google + Post, and "Bitcoin, if same evolution as Google"; event date: October 20th]
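The logic of the figure in a few lines: the control group's change is used to build the what-if scenario for the treated group, and the impact is the gap between what actually happened and that scenario. All numbers are hypothetical.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Treatment impact = actual post value minus the what-if (counterfactual) value."""
    # What-if scenario: the treated group evolves like the control group
    counterfactual_post = treated_pre + (control_post - control_pre)
    return treated_post - counterfactual_post

# Hypothetical KPI values before and after the event
impact = diff_in_diff(treated_pre=100, treated_post=130,
                      control_pre=200, control_post=210)
print(impact)
```

The treated group rose by 30 and the control group by 10, so the estimated impact is 20.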
Causal Impact Step by Step
The Treatment and Control Groups are assumed to have the same evolution for the KPI
There must be only one policy or initiative that differentiates the treatment from control groups.
Correlation in Time Series
Description Visualization
Statistical test: Dickey-Fuller test. If the p-value is less than 0.05, the time series is considered stationary
[Figure: two time series, Y against t]
Making Data Stationary
Value   First difference
5 NA
9 4
1 -8
7 6
3 -4
7 4
4 -3
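The second column of the table is the first difference: each value minus the previous one. A short sketch that reproduces it, with `None` standing in for NA:

```python
def first_difference(series):
    """Each value minus the previous one; the first entry has no predecessor."""
    if not series:
        return []
    return [None] + [current - previous
                     for previous, current in zip(series, series[1:])]

values = [5, 9, 1, 7, 3, 7, 4]  # first column of the slide's table
differences = first_difference(values)
print(differences)
```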
Impact evolution
Context Visualization
Description
Use Causal Impact to measure the impact of the CO2
scandal on Volkswagen's stock price
Context
Amazon Prime is a loyalty program that provides free shipping, discounts and other services
If you were asked to estimate the impact of Amazon Prime on its financials, how would you do it?
You cannot just simply compare the average Prime and
non-prime subscriber
Context
Both groups may be inherently different from the start. Hence, they are not comparable.
In a nutshell, you create a counterfactual group with similar characteristics to your treatment group
Treatment
Control
Unmatched
Treatment
Control
Case Study Briefing
Description
Use Matching to understand whether Catholic schools are better than others (from a standardized test score view)
Context Visualization
Visualization Context
Imagine you have a variable with 3 options
Finally, a third
Key Idea
Make sure you create a model that is as simple as possible
How to determine the Common Support Region
[Figure: Density against Probability for the Treatment and control groups; the overlap is the Common Support Region]
List of confounders -> Via Logistic Regression -> Probability of being treated (the probability that someone is part of the treatment group)
There will be people with high likelihood of participating. You are not likely to find a control group for them.
The greater the overlap, the higher the matching quality
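One simple way to delimit the common support region is the range where the treated and control probabilities overlap (a min/max rule). This is a sketch of that rule only, not necessarily the method used in the course; the propensity scores are hypothetical.

```python
def common_support(treated_scores, control_scores):
    """Overlap of the two score ranges, or None if the groups do not overlap."""
    lower = max(min(treated_scores), min(control_scores))
    upper = min(max(treated_scores), max(control_scores))
    return (lower, upper) if lower <= upper else None

# Hypothetical probabilities of being treated (e.g. from a logistic regression)
treated = [0.40, 0.60, 0.80, 0.95]
control = [0.10, 0.30, 0.50, 0.70]

region = common_support(treated, control)
# Treated units above the region have no comparable control units
matchable_treated = [s for s in treated if region[0] <= s <= region[1]]
print(region, matchable_treated)
```

The treated units at 0.80 and 0.95 fall outside the region: they are the "high likelihood of participating" cases for which no control group is likely to be found.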
Dataset Variables
Sample Remove 1
Matching Matching
Challenge
2 T-test Loop
4 Perform Matching
My experience with
2 What is the incrementality?
3 Different audiences means non-comparability
4 Tiny treatment group = good common support region
Game Plan
3 Challenge: purchasing behavior
Visualization Description
20%
What is RFM?
Recency: Days gone. How long since they last purchased
Frequency: Frequency (Q). How often have they purchased
What else?
Value
3 Create an RFM model with 3 levels
segmentation
4 Define 3 segments
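A possible shape for the challenge, as a sketch: score each of Recency, Frequency and Monetary value on 3 levels by rank, add the scores, and cut the total into 3 segments. The customer data, the rank-based scoring and the segment thresholds are all assumptions made for illustration.

```python
def score_three_levels(values, higher_is_better=True):
    """Rank-based score from 1 (worst third) to 3 (best third); ties follow input order."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, index in enumerate(order):
        scores[index] = 1 + (3 * rank) // len(values)
    return scores

# Hypothetical customers
days_since_purchase = [5, 40, 200, 10, 90, 365]   # recency: fewer days is better
purchase_count = [12, 4, 1, 8, 3, 1]              # frequency
total_spend = [600, 150, 30, 400, 90, 20]         # monetary value

r = score_three_levels(days_since_purchase, higher_is_better=False)
f = score_three_levels(purchase_count)
m = score_three_levels(total_spend)
rfm_total = [sum(parts) for parts in zip(r, f, m)]

def segment(total):
    """Hypothetical thresholds for 3 segments on a total score between 3 and 9."""
    return "high" if total >= 8 else "medium" if total >= 5 else "low"

segments = [segment(total) for total in rfm_total]
print(rfm_total, segments)
```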
Game Plan
3 Challenge: Credit card users
X1
Gaussian Mixture Model
Case Study
1 Determine the optimal number of segments for the dataset
Gaussian Mixture
1 Prepare data set
2 Determine optimal number of clusters
My experience with Segmentation
2 Their status-quo was a value-based segmentation
3 They wanted a behavioral segmentation
4 The first difficulty was how massive the data was
5 We tried to be hypothesis-driven
6 We had 7 interpretable segments in the end
Game Plan
3 Practice case study: Credit card applicants
Description
[Figure: points plotted on x1 against x2, with split lines at x1 = 40, x1 = 70 and x2 = 50]
Tree: X2 < 50?
  No: X1 > 40? (No / Yes)
  Yes: X1 > 70? (No / Yes)
  Leaf labels shown: Blue, Blue
Key Ideas:
• A split is made following an entropy logic
  - Where would it yield the most information
• The prediction would be done based on the relative frequency
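The two key ideas can be sketched in code: entropy measures how mixed a group of labels is, the information gain of a split is the entropy it removes, and a leaf predicts from the relative frequency of its labels. The labels "blue" and "red" are illustrative (only Blue appears on the slide).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """0 for a pure group, 1 for an even mix of two labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

def leaf_prediction(labels):
    """Most frequent label in the leaf and its relative frequency."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

parent = ["blue", "blue", "red", "red"]
gain = information_gain(parent, left=["blue", "blue"], right=["red", "red"])
prediction = leaf_prediction(["blue", "blue", "blue", "red"])
print(gain, prediction)
```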
Description
A Dataset with credit card applicants
Case Study Briefing
1 The key metric of success is whether someone was accepted or not
2 We want to predict the acceptance
Description
1 Tendency to overfit
What is it?
2 Good with multicollinearity
4 Robust to Outliers
Parameter Tuning
Context
Advanced models have parameters to tune to optimize accuracy
Description
Parameter Run Model Measure error Save error
n_estimators: 50 5000
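The loop on the slide (pick a parameter value, run the model, measure the error, save the error) can be sketched as below. The candidate values and `toy_error` are hypothetical stand-ins: in the real exercise `run_model` would train the model with that `n_estimators` and return its test error.

```python
def tune(candidates, run_model):
    """Try each parameter value, save its error, and return the best value."""
    errors = {}
    for value in candidates:
        errors[value] = run_model(value)  # run model + measure error, then save it
    best_value = min(errors, key=errors.get)
    return best_value, errors

def toy_error(n_estimators):
    # Hypothetical stand-in for "train the model and measure its error"
    return 1.0 / n_estimators + n_estimators / 100_000

best_value, errors = tune([50, 100, 500, 1000, 5000], toy_error)
print(best_value, errors)
```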
Challenge – Random Forest
1 Prepare data set
2 Create Random Forest Regressor model
3 Measure accuracy
4 Tune the model
5 Generate insights
FACEBOOK
PROPHET
Description
Game Plan
3 Practice case study: Udemy Wikipedia page visits
Visualization Description
Structural Time Series is the decomposition of the data in at least:
Trend
Seasonality
Exogenous impacts
[Figure panels: Data, Seasonality, Trend, Exogenous impacts, Error Term]
Methodological framework
$$y(t) = c(t) + s(t) + x(t) + \epsilon$$
Facebook Prophet quick facts
Description
1 Built by Facebook
2 Stan background - probabilistic programming language for statistical inference
3 Dynamic Holidays
Where:
c(t) Trend +
s(t) Seasonality +
Visualization
Facebook Prophet
You state Valentine’s as a key event and specify how many days before/after
Other models:
You must create dummy variables for each day, if you believe they have different impacts
[Figure: Chocolate demand, February 11 to 15]
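For the "other models" route, the per-day dummy variables could be built as in the sketch below: one 0/1 column for each day in the window around the event. The window of 3 days before and 1 day after mirrors February 11 to 15 in the figure; the function and column names are assumptions. In Prophet, as far as its documentation describes, the same window is declared once through the `lower_window` and `upper_window` columns of the holidays data frame.

```python
from datetime import date, timedelta

def event_day_dummies(day, event, days_before, days_after):
    """One 0/1 dummy per day in the window around the event."""
    dummies = {}
    for offset in range(-days_before, days_after + 1):
        column = f"event_{offset:+d}"  # e.g. event_-1, event_+0, event_+1
        dummies[column] = int(day == event + timedelta(days=offset))
    return dummies

valentines = date(2024, 2, 14)  # illustrative year
row = event_day_dummies(date(2024, 2, 13), valentines, days_before=3, days_after=1)
print(row)
```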
Training and Test Set in Time Series
Dataset Time
Key Ideas
For forecasting models, the data is usually split by time into a pre period (training) and a post period (test)
The Test Set should be of the size of a real-world forecast
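Unlike a random split, a time series split keeps the order: everything before the cutoff is training data and the last observations form the test set. A minimal sketch, where `horizon` (an assumed name) is the length of the real-world forecast:

```python
def time_series_split(series, horizon):
    """Training = everything before the last `horizon` points; test = the last `horizon` points."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be positive and shorter than the series")
    return series[:-horizon], series[-horizon:]

observations = list(range(10))  # stand-in for 10 time-ordered observations
train, test = time_series_split(observations, horizon=3)
print(train, test)
```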
Facebook Prophet Model
Component Description
holidays_prior_scale Larger values allow the model to fit larger holiday effects; smaller values dampen them
If the seasonal effect adds absolute values, then it is additive.
[Figure: two time series plotted against t]
Cross Validation
Key Idea
Repeating the assessment of our model reinforces its evaluation
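For time series, repeating the assessment usually means moving the cutoff forward and re-evaluating each time (a rolling origin). The sketch below only generates the index ranges for each repetition; the parameter names are assumptions. Prophet ships its own helper for this in `prophet.diagnostics`.

```python
def rolling_origin_splits(n_obs, initial, horizon, step):
    """(train indices, test indices) for each cutoff, moving forward by `step`."""
    splits = []
    cutoff = initial
    while cutoff + horizon <= n_obs:
        splits.append((range(0, cutoff), range(cutoff, cutoff + horizon)))
        cutoff += step
    return splits

splits = rolling_origin_splits(n_obs=10, initial=4, horizon=2, step=2)
for train_idx, test_idx in splits:
    print(list(train_idx), list(test_idx))
```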
Parameters to tune
Component Description
holidays_prior_scale Larger values allow the model to fit larger holiday effects; smaller values dampen them
Forecasting at Uber
3 But they also need it to measure investments
4 They need to forecast at scale