Business Analyst Slide Deck

Business Analyst course
Introduction to the course
Description
1 This video is dedicated to bureaucracies
2 The course has statistics as the base
3 A business analyst needs to know 3 Analytics types
4 The course is practice-focused
5 The course materials are in the next lecture
Visualization: Statistics and Descriptive Analytics as the base, with Econometrics and Causal Inference, Segmentation, and Predictive Analytics on top
The Modern-day Business Analyst
Description
1 Being only good with numbers is a thing of the past
2 Proficient with Statistics & Analytics methodologies
3 A bridge between technical and non-technical people

The impact of weather on sales
Description
1 Weather influences seasonal industries
2 External factors are uncontrollable by nature
3 How to prove weather influences sales?
4 If weather influences sales, then sales move when weather changes and stay constant otherwise, all else being equal
5 The Technique I used was Google Causal Impact

Predicting the future
Description
1 Commercial teams are belief-driven in nature
2 Advanced Analytics give numbers to beliefs
3 But making the change is difficult
4 Simple and interpretable models usually have high errors
5 One of the techniques used was Facebook Prophet

BASIC
STATISTICS
Game Plan
Description
1 Backbone for the full course
2 Master the principles to make it easier in the future
3 You will do an exercise for each statistic learned
4 Moneyball case study at the end

(Arithmetic) Mean
Description
Same thing as average
When we say mean, we refer to the arithmetic mean
Represents the expected value
Methodological Representation
\bar{x} = \frac{\sum x_i}{n}
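As a minimal sketch of the formula above, the mean is the sum of the values divided by their count. The list of season wins is hypothetical and only there for illustration.

```python
# Hypothetical sample of wins per season; the numbers are illustrative only.
wins = [94, 88, 103, 76, 81]

# Arithmetic mean: sum of the values divided by how many there are.
mean_wins = sum(wins) / len(wins)
print(mean_wins)
```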
Case Study Briefing – Baseball
Description
1 We have a dataset with baseball teams’ data from 1962 to 2012
2 There are 12 KPIs for each Team, League and Year
3 We will practice statistical concepts on the dataset
https://www.baseball-reference.com/
Mode and Median
Mode
The most frequent number in a set
Fashion is a statistical term ☺
Visualization
X 2 3 3 5
Mode is 3
Median
The central number of an ordered set
If the set has an even number of values, you average the two middle points
Used with skewed datasets
Visualization
X 2 3 5
Median is 3
X 2 3 5 10
Median is 4
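The three small sets from the slide can be checked with Python's standard `statistics` module. This is only a quick sketch to confirm the numbers.

```python
from statistics import median, mode

mode_value = mode([2, 3, 3, 5])        # most frequent value
median_odd = median([2, 3, 5])         # odd count: the middle value
median_even = median([2, 3, 5, 10])    # even count: average of the two middle values

print(mode_value, median_odd, median_even)
```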
(Pearson) Correlation
Measures the strength of the linear relationship between 2 variables
Varies between -1 and 1
1 means a perfect positive relationship
-1 means a perfect negative relationship
0 indicates no linear relationship
Correlation does not imply causation
Methodological Representation
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
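A short sketch of the Pearson formula, computed term by term. The function name `pearson_r` and the runs/wins figures are hypothetical, chosen only to show a strong positive relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed term by term from the formula."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical data: runs scored and wins for five teams.
runs = [650, 700, 720, 780, 810]
wins = [72, 80, 79, 90, 95]
r = pearson_r(runs, wins)
print(round(r, 3))
```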
Standard Deviation
Description
Measures the variation or dispersion of a set of values
High values mean higher variability
Methodological Representation
\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}
Visualization: two histograms (# occurrences against X), one with high variability and one with low variability
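To make the high versus low variability contrast concrete, here is a small sketch with two hypothetical samples that share the same mean. `statistics.pstdev` divides by n, which matches the formula above.

```python
from statistics import pstdev

high_variability = [1, 5, 9, 13, 17]   # hypothetical values, widely spread
low_variability = [8, 9, 9, 9, 10]     # hypothetical values, tightly clustered

# pstdev is the population standard deviation (divides by n).
sd_high = pstdev(high_variability)
sd_low = pstdev(low_variability)
print(sd_high, sd_low)
```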
Moneyball Case Study
Description
1 Moneyball is set in the world of baseball
2 The A’s had success despite financial struggles
3 The team looked for undervalued players
4 Other teams did not look at statistics
5 The General Manager looked at specific statistics
6 With the right system, you can beat anyone

INTERMEDIARY
STATISTICS
Game Plan
Description
1 Level up our statistics game
2 Please use the Q&A
3 We have 2 datasets
Normal Distribution aka Gaussian Distribution
Symmetric distribution with the mean in the middle
Data occurring near the mean is more frequent
The graph resembles a bell-shaped curve
Statistical methods, e.g. regression, assume normally distributed errors
In real life, most problems will show some degree of similarity to it
68-95-99.7 Rule in Normal Distributions
Within ±1 standard deviation (SD), you can find 68% of observations
Within ±2 SD, you should encounter 95% of observations
Within ±3 SD, it is 99.7%
Key Idea
A normal distribution is a pattern, and patterns enable us to categorize data with more confidence
Visualization: bell curve with bands at 1SD (68%), 2SD (95%) and 3SD (99.7%)
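The rule can be verified numerically with the standard normal distribution from Python's `statistics` module. The helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of observations within +-k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"+-{k} SD: {share:.1%}")
```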
Case Study – Wine Quality
Description
You are a wannabe Wine Statistics Connoisseur
Challenge
1 Normal Distribution
2 Standard Errors
3 Confidence Intervals
Paulo Cortez,
University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal
@2009
P-value is all about likelihood
Description
The probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming the null hypothesis is correct.
It helps us decide whether to reject or “fail to reject” the null hypothesis
A small p-value (small probability) would mean we favor the alternative hypothesis
P-value threshold usually used: 0.05
Examples
H0: The average salary of business analysts is €60k
H1: Business analysts’ average salary is not €60k
P-value = 0.2 -> We fail to reject the null hypothesis
H0: Blueberries prevent cancer
H1: Blueberries do not prevent cancer
P-value = 0.01 -> We reject the null hypothesis
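A minimal sketch of how a p-value could be obtained for the salary example. The survey numbers are hypothetical, and the use of a two-sided z-test (normal approximation, reasonable for larger samples) is an assumption rather than something stated in the course.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical survey of business analysts (salaries in thousands of euros).
sample_mean, sample_sd, n = 61.5, 8.0, 47
h0_mean = 60.0  # H0: the average salary is 60k

z = (sample_mean - h0_mean) / (sample_sd / sqrt(n))
# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
print(round(p_value, 3), decision)
```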
Shapiro-Wilk test
Description
Quantifies how likely it is that the data was drawn from a Gaussian distribution
Created in 1965 and is one of many normality tests
Interpretation
H0: The distribution is Gaussian
If p-value > 0.05
The distribution appears to be normal -> fail to reject the null hypothesis
If p-value < 0.05
The distribution does not look Gaussian -> reject the null hypothesis
Standard Error (of the sample mean)
The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean.
Standard deviation is the degree to which individuals within the sample differ from the sample mean.
Methodological Representation
SE = \frac{\sigma}{\sqrt{n}}
Z-Score
Gives you an idea of how far a data point is from the mean.
Z-scores are a way to compare results to a “normal” population
It is a way to standardize values
Methodological Representation
z = \frac{x - \mu}{\sigma}
Example – Diogo College grades
Uni: Z = (16 – 13) / (2) = 1.5
GMAT: Z = (680 – 560) / (120) = 1
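The two grades from the example can be standardized with a one-line function. The name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations between x and the mean."""
    return (x - mean) / sd

uni_z = z_score(16, 13, 2)
gmat_z = z_score(680, 560, 120)
print(uni_z, gmat_z)
```

On this scale the university grade is the stronger result, even though the raw numbers are not comparable.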

Confidence Interval (when n > 30)
Description
A range that gives a sense of how precisely a statistic estimates a parameter.
The associated confidence level gives the probability with which an estimated interval will contain the true value of the parameter
Z-values table
Confidence Interval  Z-Value
80%                  1.28
85%                  1.44
90%                  1.65
95%                  1.96
99%                  2.58
99.9%                3.29
Visualization – 95% CI: lower bound at 2.5%, upper bound at 97.5%
Methodological Representation
CI = \bar{x} \pm z * \frac{\sigma}{\sqrt{n}}
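A sketch of the interval calculation, assuming a large sample (n > 30) so that z-values apply. The wine figures are hypothetical, and the z-value is derived from the confidence level rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(mean, sd, n, level=0.95):
    """Return (lower, upper) bounds of the z-based confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% level
    margin = z * sd / sqrt(n)                   # z times the standard error
    return mean - margin, mean + margin

# Hypothetical wine sample: mean alcohol 10.4%, SD 1.1, 100 bottles.
lower, upper = confidence_interval(10.4, 1.1, 100)
print(round(lower, 3), round(upper, 3))
```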
T-Tests
T Test formally
Tests any statistical hypothesis in which the test statistic follows a Student's t-distribution under the null hypothesis.
In practical terms
Helps us understand whether one group is (statistically) different from the other
How do we know?
If the p-value is less than 0.05, then the groups are statistically different
Visualization: distributions of hand usage while talking (# of people) for Non-Italians and Italians
Challenge – Understanding Remote Work predictions
Stack Overflow dataset
Workers’ characteristics and job-related queries
1 T-tests
2 Chi-square tests
(Pearson) Chi-square test
Chi-square test
Determines whether there is a statistically significant difference between the expected frequencies and the observed frequencies
Null hypothesis
There is no relationship between variables
Difference from t-test
A t-test tests a null hypothesis about two means; a chi-square test requires categorical variables, each having any number of levels.
Visualization: 2x2 table of Lives in Berlin (Yes/No) against Wears black (Yes/No)
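The following sketch computes the chi-square statistic for a 2x2 table by comparing observed with expected frequencies. The counts are hypothetical, no continuity correction is applied, and the closed-form p-value only holds for 1 degree of freedom.

```python
from math import erfc, sqrt

# Hypothetical counts. Rows: lives in Berlin (yes, no); columns: wears black (yes, no).
observed = [[30, 20],
            [15, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected frequency if the two variables were unrelated.
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (count - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom; for that case the p-value has this closed form.
p_value = erfc(sqrt(chi2 / 2))
print(round(chi2, 3), round(p_value, 4))
```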
Powerposing and p-hacking
Description
1 You put your body in a powerpose
2 You would perform better in high-pressure moments
3 Powerposing is not backed up by science
4 Powerposing results were not replicated by others
5 P-hacking is manipulating the analysis, e.g. removing some individuals, to achieve statistical significance
LINEAR
REGRESSION
Game Plan
Description
1 Building block in our learning capacity
2 I learned how to do it by hand. Yeah, really!
3 We will have a practice-focused approach

Case Study Briefing – Pricing Diamonds
Description
1 We have a dataset of roughly 300 diamonds
2 We have the price, carats and other KPIs
3 We want to understand how carats influence Diamond Prices
Package: Ecdat
(Linear) Regression crash course
Definition
Study of a relationship between a dependent variable and at least one independent variable
Intuition perspective
Method for “What is the impact of X on Y?”
How is it different from a correlation?
• Correlation studies the direction and strength
• Regression studies the impact
Visualization: scatter plot of Price against Carat with a fitted line of slope β
Linear Regression
Methodological View
Y = a + b * X + e
Interpretation
If I increase X by 1, Y increases by b
If X happens, Y increases by b
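A minimal sketch of fitting Y = a + b * X by ordinary least squares, done by hand. The function name `fit_line` and the carat and price values are hypothetical and not taken from the course dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b * X; returns (a, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical diamonds: carats and prices (in thousands).
carats = [0.3, 0.5, 0.7, 1.0]
prices = [1.2, 3.6, 5.9, 9.5]
a, b = fit_line(carats, prices)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

The slope b is read exactly as in the interpretation above: one more unit of X moves Y by b.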
How to read a Regression result
Coefficients
One additional carat increases the price by 11.6k
R-Squared and adj R-squared
We can explain 89.3% of the variance
Confidence interval (95%)
The Carat coefficient is between 11.1k and 12.1k
Statistical Significance
If P>|t| is less than 0.05, we have statistical significance
Dummy variable trap
Multicollinearity
Observation  Coca cola  Pepsi
a            1          0
b            1          0
c            1          0
d            1          0
e            0          1
f            0          1
g            0          1
h            0          0
j            0          1
Correlation between Coca-Cola and Pepsi is -1
Solution: remove one dummy variable
Removing does not mean information is lost
A zero also represents information
The removed dummy variable becomes part of the intercept.
You can see it as being your baseline.
Dummy variable trap
Multicollinearity
Observation  Coca cola  Pepsi  White Label
a            1          0      0
b            1          0      0
c            1          0      0
d            0          0      1
e            0          0      1
f            0          1      0
g            0          1      0
h            0          0      1
j            0          1      0
Correlation between Coca-Cola and Pepsi is -1
Solution: remove one dummy variable
Removing does not mean information is lost
A zero also represents information
The removed dummy variable becomes part of the intercept.
You can see it as being your baseline.
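To show what removing one dummy variable looks like, here is a small sketch that encodes a brand column and drops the first category as the baseline. The rows and the choice of baseline are hypothetical.

```python
brands = ["Coca cola", "Coca cola", "Pepsi", "White Label", "Pepsi"]  # hypothetical rows

categories = sorted(set(brands))          # ['Coca cola', 'Pepsi', 'White Label']
baseline, kept = categories[0], categories[1:]

# One 0/1 column per kept category; the baseline is the row where every dummy is 0.
dummies = [{category: int(brand == category) for category in kept} for brand in brands]

for brand, row in zip(brands, dummies):
    print(brand, row)
```

A Coca cola row is all zeros, so no information is lost by dropping its column.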
MULTILINEAR
REGRESSION
Game Plan
Description
1 Topics: outliers, assessment and overfitting
2 Practice tutorial: Teachers’ salaries
3 Challenge: Retail Store drivers
4 The added value of Regression is interpretability

Multilinear Regression
Description
It is very rare that one input explains the output
We often need more predictors to improve the models
Beware of multicollinearity or overfitting
Visualization: Linear Regression (X -> Y, Carat -> Diamond) vs. Multilinear Regression (Carat, Color and Shape -> Diamond)
Briefing – Professors’ salaries
1 The 2008-09 academic salary for Professors in a college in the U.S.
2 The data was collected to monitor salary differences between male and female faculty members.
3 Use Multilinear Regression to study

Outliers
Interpretation
Outliers can damage your analysis
You must distinguish between noise and valuable information
Consider using models that cope well with outliers or non-linearity (e.g., Random Forest)
Visualization: scatter plot of Salaries against Professional experience
Modelling is finding the balance between under- and overfitting
Insights
A model that is too simple will get you nowhere (underfitting)
A model that is too complex will not yield results in other testing scenarios (overfitting)
You should iterate based on results
Let‘s imagine this is our full data set
Description
Splitting between training and test enables an unbiased
model assessment
Training Set Test Set
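A minimal sketch of a random training and test split using only the standard library. The 80/20 share, the seed and the function name are assumptions for illustration.

```python
import random

def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into a training and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))          # stand-in for the full data set
train, test = train_test_split(data)
print(len(train), len(test))
```

The model only ever sees the training rows, so its error on the test rows is an unbiased assessment.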
Model Assessment
Mean Absolute Error (MAE) vs Root Mean Squared Error (RMSE)
Key ideas
MAE and RMSE are performance indicators for Regression models with continuous outputs
MAE = \frac{\sum |y - \hat{y}|}{n}
RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}}
RMSE is useful for models with extremes / outliers, as it penalizes large errors more
MAE is more interpretable.
Visualization: scatter plot of Y against X with the fitted Model line
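Both error measures can be sketched in a few lines. The actual and predicted values are hypothetical, with one deliberately large miss to show how the two measures react differently.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    return sqrt(sum((y_hat - y) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual))

actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 22]   # the last prediction is a large miss

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
print(mae_value, round(rmse_value, 3))
```

The single large miss pulls the squared-error measure well above the absolute-error one.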
Challenge
Description
Use Multilinear Regression to study a Store’s sales drivers
1 Pick variables for your model
2 Analyze the data, i.e. summary statistics
3 Correlation Matrix
4 Create a Training and a Test Set
5 Use Multilinear Regression
6 Assess Accuracy
Dataset: Ecdat package
LOGISTIC
REGRESSION
Game Plan
Description
1 We now face a classification problem
2 The question influences the analytical technique
3 How do we measure accuracy?
4 Case study: Which emails are spam?
5 Challenge: the sex of penguins

Case Study Briefing – Is it spam?
Description
1 Dataset with ~5k emails
2 What makes an email spammy?
3 Can we predict which emails are spam?
DAAG package
Logistic Regression crash course
What is a Logistic Regression?
Relationship study between a discrete dependent variable and at least one independent variable
From an Intuition Perspective?
What is the impact of X on Y happening?
How is it different from a Linear Regression?
Linear is for continuous outcomes, logistic is for discrete ones
With Linear we fit a straight line, with logistic a curve
Visualization: Spam (Y = 1, N = 0) plotted against the # of times the word money occurs
How to read Logistic Regression coefficients
Linear Regression
Y = a + b * X + e
Interpretation
For each X unit increase, Y increases by b
B = 0.5: For each X unit increase, Y increases by 0.5
Logistic Regression
Y = \frac{e^{a + bX}}{1 + e^{a + bX}}
Interpretation
For each X unit increase, the odds of Y happening increase by (exp(b) – 1) * 100 %
B = 0.5: For each X unit increase, the odds of Y happening increase by about 65%
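A quick numerical check of the logistic formula with B = 0.5. Strictly speaking, exp(b) – 1 is the relative change in the odds p / (1 – p); the change in the probability itself depends on where you start. The intercept value here is hypothetical.

```python
from math import exp

def probability(a, b, x):
    """Logistic function: probability of Y happening at a given X."""
    return exp(a + b * x) / (1 + exp(a + b * x))

a, b = -2.0, 0.5   # hypothetical intercept; b matches the B = 0.5 example

odds_change = exp(b) - 1            # relative change in the odds per unit of X
p0 = probability(a, b, 0)
p1 = probability(a, b, 1)
odds0 = p0 / (1 - p0)
odds1 = p1 / (1 - p1)

print(f"odds change: {odds_change:.1%}")
print(f"probability moves from {p0:.3f} to {p1:.3f}")
```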
The Confusion Matrix allows us to assess the results of a classifier
Confusion Matrix
              Predicted False  Predicted True
Actual False  True Negative    False Positive
Actual True   False Negative   True Positive
Accuracy
Accuracy = (True Positive + True Negative) / All
Balanced dataset
F1-Score
F1-score = 2 * TP / (2 * TP + FP + FN)
Unbalanced dataset
The Confusion Matrix allows us to assess the results of a classifier
Confusion Matrix
              Predicted False  Predicted True
Actual False  True Negative    False Positive
Actual True   False Negative   True Positive
Specificity or True Negative Rate
True Negative / (True Negative + False Positive)
When we focus on the accuracy of False values
Sensitivity, Recall or True Positive Rate
True Positive / (True Positive + False Negative)
Focus is on True values
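The four metrics follow directly from the confusion matrix cells. The counts below are hypothetical results of a spam classifier, used only to illustrate the arithmetic.

```python
# Hypothetical counts from a spam classifier.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
specificity = tn / (tn + fp)     # true negative rate
sensitivity = tp / (tp + fn)     # recall, true positive rate

print(accuracy, round(f1, 3), specificity, sensitivity)
```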
Challenge
Description
Use Logistic Regression to predict the sex of penguins
2 Plot Histograms of the character variables
3 Transform the character variables into binary
4 Create a Training and a Test Set
5 Use Logistic Regression
6 Assess Accuracy through the classification report

GOOGLE CAUSAL
IMPACT
Why Econometrics and Causal Inference
Description
1 Decision-Making
2 Understanding and tackling biases
According to BNP Paribas, Sustainability-focused companies perform better
BNP Paribas
Description
1 Are there other differences between sustainability-focused companies and others?
2 People define politics and decisions
3 Not including all factors is falling into the omitted variable bias
Does Smoking prevent Parkinson’s?
Smoking and Parkinson’s
Description
1 The incidence of Parkinson’s in people between 55 and 75 is twice as high in non-smokers
2 Is there a causal relationship between smoking and Parkinson’s?
3 People are more likely to get Parkinson’s the older they get
4 Smokers’ life expectancy is lower than non-smokers’
5 Non-smokers are more likely to have Parkinson’s because they live longer, not because they don’t smoke
Game Plan
Description
1 Causal Impact was developed by Google
2 Practice Case Study: Paypal and Bitcoin
3 Challenge: Volkswagen CO2 scandal
4 Causal Impact is my most-used technique

Case study: Paypal x Bitcoin
Description
Use Google Causal Impact to estimate the impact of Paypal allowing crypto payments on Bitcoin price
1 On October 21st, 2020, Paypal announced it had entered the Crypto industry.
2 Given the bull market and other volatilities, we cannot compare the price before and after
3 We need to find comparable control groups

What is Time Series Data?
Key ideas
Sequence of data points in time order (oldest to newest)
Most commonly, it is data recorded in equally spaced time periods
Type of Panel Data (multidimensional dataset)
Visualization: Bitcoin Price (10000 to 22000) from 9/1/2020 to 11/1/2020
Comparing before and after impact leads to omitted variable bias
Context
How to measure the impact of Paypal on Bitcoin price?
This graph shows the Bitcoin price in the market. The event started where the red line is
Comparing before and after would subject you to omitted variable bias.
Visualization: Bitcoin Price (10000 to 22000) from 9/1/2020 to 11/1/2020, with a red line marking the event
Brodersen, Kay H.; Gallusser, Fabian; Koehler, Jim; Remy, Nicolas; Scott, Steven L. Inferring causal impact using Bayesian structural time-series models. Ann. Appl. Stat. 9 (2015), no. 1, 247-274. doi:10.1214/14-AOAS788. https://projecteuclid.org/euclid.aoas/1430226092
Difference-in-differences framework
Key ideas
We use Google to create an artificial control group
The delta between what actually happened and the what-if scenario is the treatment impact
Visualization: Price over Time for Bitcoin and Google, before and after October 20th. The gap between what actually happened with Bitcoin and Bitcoin with the same evolution as Google is the Paypal impact
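A sketch of the difference-in-differences arithmetic with hypothetical numbers. Both series are assumed to be indexed to 100 in the pre-period so that their changes are comparable; that indexing is a simplification made for this example.

```python
# Hypothetical average price indices (pre-period = 100 for both series).
bitcoin_before, bitcoin_after = 100.0, 136.0   # treatment
google_before, google_after = 100.0, 110.0     # control

treatment_change = bitcoin_after - bitcoin_before
control_change = google_after - google_before

# What-if scenario: Bitcoin evolving the same way as the control.
counterfactual_after = bitcoin_before + control_change

# The treatment impact is the difference between the two differences.
did_impact = treatment_change - control_change
print(counterfactual_after, did_impact)
```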
Causal Impact Step by Step
Define pre and post period
Retrieve the data we need
Check whether the variables are correlated in the pre period
Remove non-correlated data
Use Causal Impact

Assumptions
Parallel Trends Assumption
The Treatment and Control Groups are assumed to have the same evolution for the KPI
Visualization: Price over Time for Bitcoin, Amazon and Google, with October 20th marked
Confounding Policy Change
There must be only one policy or initiative that differentiates the treatment from control groups.
You can only measure the impact of one treatment.
How to Strengthen the assumptions
Use more control groups
Use a longer training period
Keep post-period to the bare minimum
Correlation in Time Series
Measures the relationship strength between 2 variables
If the Time-Series grows over time, then the correlation might be spurious
The data must be stationary

Stationarity
Key idea
Mean, variance and covariance are not time dependent
Stationary Time Series have a defined pattern
Statistical test: Dickey-Fuller test. If the p-value is less than 0.05, the time series is considered stationary
Visualization: four plots of Y against t (Stationary Time Series, Time dependent mean, Time dependent variance, Time dependent covariance)
Making Data Stationary
Time Series Differencing
5 NA
9 4
1 -8
7 6
3 -4
7 4
4 -3
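The differencing column in the table above can be reproduced with a short sketch: each value minus the one before it. `None` stands in for the NA of the first row.

```python
series = [5, 9, 1, 7, 3, 7, 4]

# First difference: each value minus the previous one; the first has no predecessor.
differenced = [None] + [current - previous
                        for previous, current in zip(series, series[1:])]
print(differenced)
```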
Impact evolution
Let’s discuss what the impact of Paypal adopting Bitcoin should look like:
Greater in the beginning
Impact gradually increases
You can also point out that the impact should continue days after the announcement
Causal Impact allows the impact to vary over time
Challenge
Description
Use Causal Impact to measure the impact of the CO2 scandal on Volkswagen stock Price
1 Pick Stocks for the control groups
2 Perform a correlation matrix
3 Measure the impact

MATCHING
Game Plan
Description
1 There is no comparable control group
2 Helps us with (self-)selection bias
3 How to measure referral programs?
4 What is the incremental value of Mobile Shopping?
5 Practice case study: Catholic Schools and scores
6 Challenge: Remote work and career satisfaction

How do you figure out the value of Amazon Prime?
Context
Amazon Prime is a loyalty program that provides free shipping, discounts and other services
The goal of the program is fourfold:
• Increase customer loyalty
• Increase revenue per customer
• Decrease marketing spending on customer re-activation
• Decrease paid advertising in conversion
The subscription lasts 1 year
If you were asked to estimate the impact of Amazon Prime on its financials, how would you do it?
You cannot simply compare the average Prime and non-Prime subscriber
Context
Both groups may be inherently different from the start. Hence, they are not comparable.
Beware of (self-)selection bias
A possible solution is Matching.
In a nutshell, you create a counterfactual group with similar characteristics to your treatment group
Visualization: Treatment and Control groups, with the Unmatched observations set aside
Briefing – Are catholic schools better?
Use Matching to understand whether catholic schools are better than others (from a standardized test score view)
1 We have a dataset with kids’ background, their parents’ upbringing, among others
2 The key metric of success is the standardized test scores
3 We need to re-create a comparable control group

Unconfoundedness
The variables (confounders) used are enough to describe the people or entities (W)
The characteristics affect the likelihood of someone being part of the treatment (X)
The combination of the confounders and the treatment leads to the outcome (Y)
Meeting the Unconfoundedness assumption is a tall order
Visualization: diagram in which W points to X and Y, and X points to Y
Curse of Dimensionality
Context
Imagine you have a variable with 3 options
Then you add a second with 3 more
Finally, a third
The number of observations needed to fill each bucket grows exponentially
The Matching outcome can be spurious when few elements belong to a “dimension”
Key Idea
Make sure you create a model that is as simple as possible
How to determine the Common Support Region
Key ideas
We predict whether someone is part of the treatment group
There will be people with a high likelihood of participating. You are not likely to find a control group for them.
The greater the overlap, the higher the matching quality
Examples
Unconfoundedness: List of confounders -> Via Logistic Regression -> Probability of being treated
Visualization: Density against Probability of being treated, for the treated group and the non-treated group. The overlap is the Common Support Region
Robustness checks
Repeated experiment: Dataset -> Sample -> Subsample -> Matching -> Store Results -> Repeat
Removing 1 confounder: Variables -> Remove 1 -> Confounders -> Matching -> Store Results -> Repeat

Challenge
Description
Use Matching to figure out whether Remote workers have higher Career Satisfaction
2 T-test Loop
3 Transform the character variables into binary
4 Perform Matching
5 Perform a robustness check

My experience with Matching
Description
1 Introducing English in the Zalando.de website
2 What is the incrementality?
3 Different audiences mean non-comparability
4 Tiny treatment group = good common support region
5 Practice case study: Catholic Schools and scores
6 I used the repeated experiments for robustness
RECENCY
FREQUENCY
MONETARY
Game Plan
Description
1 Introducing value-based segmentation
2 Case study: online shoppers segmentation
3 Challenge: purchasing behavior
4 Simple yet powerful concepts in this section

Value-based segmentation
Description
Companies rank customers
Who to prioritize
Understand where to focus
Who is more loyal
Pareto rule
20% of the causes result in 80% of the consequences
Visualization: the Top 20% of Customers account for 80% of Profit, the Bottom 80% for 20%
What is RFM?
Description  Typical RFM    Our Model      Meaning
Recency      Days gone      Days gone      How long since they last purchased
Frequency    Frequency (Q)  Frequency (Q)  How often they have purchased
Monetary     Sales (P*Q)    Basket (P)     Average Purchase Value
What else?
Include Churn or Customer Retention
Include Time Horizon
Replace Average Purchase with Average Profit
How does it work?
Each variable is scored from 1 to 4 between its Min, Median and Max
Position  Frequency  Recency  Monetary
Max       4          1        4
          3          2        3
Median
          2          3        2
          1          4        1
Min
Final Values
11-12  Superstar
8-10   Future Champion
6-7    High Potential
3-5    Low Relevance
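The scoring logic could be sketched as follows. Using quartile cut points to assign the 1 to 4 scores and summing them into the final value are assumptions about how the model is built; the customers and all function names are hypothetical.

```python
from statistics import quantiles

def quartile_score(value, cuts, higher_is_better=True):
    """Score 1-4 from the three quartile cut points of a variable."""
    rank = 1 + sum(value > cut for cut in cuts)   # 1 = lowest quartile, 4 = highest
    return rank if higher_is_better else 5 - rank

def segment(total):
    """Map the summed score (3 to 12) to a segment name."""
    if total >= 11:
        return "Superstar"
    if total >= 8:
        return "Future Champion"
    if total >= 6:
        return "High Potential"
    return "Low Relevance"

# Hypothetical customers: (days since last purchase, number of purchases, average basket).
customers = {
    "A": (3, 25, 120.0), "B": (40, 4, 35.0), "C": (10, 12, 80.0), "D": (200, 1, 20.0),
    "E": (75, 6, 55.0), "F": (15, 18, 95.0), "G": (120, 2, 30.0), "H": (30, 9, 60.0),
}

recency_cuts, frequency_cuts, monetary_cuts = (
    quantiles([c[i] for c in customers.values()], n=4) for i in range(3)
)

segments = {}
for name, (recency, frequency, monetary) in customers.items():
    # Fewer days since the last purchase is better, so recency is scored in reverse.
    total = (quartile_score(recency, recency_cuts, higher_is_better=False)
             + quartile_score(frequency, frequency_cuts)
             + quartile_score(monetary, monetary_cuts))
    segments[name] = segment(total)

print(segments)
```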
Case Study Briefing
Description
A Dataset with Online Shoppers data
1 We have a dataset with purchases of customers, detailed by items
2 Create a customer dataset with the Recency, Frequency and Monetary variables
3 Create an RFM model and apply it to the dataset

Challenge – Customer Value segmentation
Description
A dataset with customer data
1 Prepare basket variable
2 Rename variables
3 Create an RFM model with 3 levels
4 Define 3 segments
5 Prepare final table overview

GAUSSIAN
MIXTURE MODEL
Game Plan
Description
1 Clustering is a lazy person’s favorite
2 Case study: Credit Card applicants
3 Challenge: Credit card users
4 New concepts: AIC and BIC

What are clustering techniques?
Groups observations in terms of their characteristics
Main task of exploratory data mining
Clustering is an art rather than a science
Visualization: scatter plot of X2 against X1
Gaussian Mixture Model
Gaussian Mixture Model is a probabilistic method for clustering
Often more flexible than traditional clustering algorithms, like K-means
The probabilities allow us to better evaluate edge cases
Case Study Briefing
Description
A Dataset with Credit Card applicants
1 Determine the optimal number of segments for the dataset
2 Use Gaussian Mixture Model
3 Interpret the segments and name them according to their characteristics
Akaike’s Information Criterion (AIC) and Bayesian Information Criterion (BIC)
Key Ideas
• AIC and BIC help us determine the optimal number of clusters
• AIC and BIC provide a means to select a model
• Trade-off between simplicity and goodness of fit
• Deal with overfitting
• BIC penalizes overfitting more than the AIC
Pseudo-visualization: Goodness of fit vs. Simplicity
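A sketch of the two criteria using their standard definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where lower is better. The log-likelihoods and parameter counts per number of clusters are hypothetical.

```python
from math import log

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    return n_params * log(n_obs) - 2 * log_likelihood

# Hypothetical fits on 500 observations: clusters -> (log-likelihood, number of parameters).
fits = {2: (-1200.0, 11), 3: (-1150.0, 17), 4: (-1146.0, 23)}
n_obs = 500

# The best model has the lowest criterion value.
best_by_aic = min(fits, key=lambda k: aic(*fits[k]))
best_by_bic = min(fits, key=lambda k: bic(*fits[k], n_obs))
print(best_by_aic, best_by_bic)
```

Going from 3 to 4 clusters barely improves the fit, so the extra parameters are not worth their penalty under either criterion.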

Challenge – Gaussian Mixture Model
Description
A Dataset with customer data
1 Prepare data set
2 Determine optimal number of clusters
3 Create GMM model
4 Interpret segments
My experience with Segmentation
Description
1 A closed contest for a big conglomerate
2 Their status-quo was a value-based segmentation
3 They wanted a behavioral segmentation
4 The first difficulty was how massive the data was
5 We tried to be hypothesis-driven
6 We had 7 interpretable segments in the end
7 We were complex and did not consider scalability

RANDOM FOREST
Game Plan
Description
1 Random Forest is an advanced analytics technique
2 Learn about Decision Trees and Ensemble Learning
3 Practice case study: Credit card applicants
4 Challenge: Customer‘s income prediction

Random Forest is an Ensemble Learning Algorithm
What is it?
Description
1 Ensemble Learning is when you have a plurality of models predicting your output
2 Ensemble is an average of Models
3 A Random Forest is a combination of decision trees
4 Can be used for Regression and Classification problems

How do Decision trees work?
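Decision tree
First split: X2 < 50 (No / Yes)
Second splits: X1 > 40 (No / Yes) and X1 > 70 (No / Yes)
Leaves: Blue, Blue
Visualization: x1 against x2, with boundaries at x2 = 50, x1 = 40 and x1 = 70
Key Ideas:
• A split is chosen following an entropy logic, i.e. the largest information gain
- Where would it yield more information
• The prediction is based on the relative frequency of the classes in a leaf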
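To illustrate "where would it yield more information", here is a sketch of entropy and information gain for two candidate splits. The "Red" label and both splits are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    return -sum((c / len(labels)) * log2(c / len(labels)) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

parent = ["Blue"] * 4 + ["Red"] * 4        # hypothetical labels
clean_split = (["Blue"] * 4, ["Red"] * 4)
poor_split = (["Blue", "Blue", "Red", "Red"], ["Blue", "Blue", "Red", "Red"])

gain_clean = information_gain(parent, *clean_split)
gain_poor = information_gain(parent, *poor_split)
print(gain_clean, gain_poor)
```

A tree would pick the first split, since it separates the classes completely.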
Case Study Briefing
Description
A Dataset with credit card applicants
1 The key metric of success is whether someone was accepted or not
2 We want to predict the acceptance
3 We also want to generate insights

Random Forest quirks
What is it?
Description
1 Tendency to overfit
2 Good with multicollinearity
3 Works well with non-linearity
4 Robust to Outliers
Parameter Tuning
Context
Advanced models have parameters to tune to optimize accuracy
Description
Parameter -> Run Model -> Measure error -> Save error
n_estimators: 50 5000

Challenge – Random Forest
Description
What is the income of your customers?
1 Prepare data set
2 Create Random Forest Regressor model
3 Measure accuracy
4 Tune the model
5 Generate insights
FACEBOOK
PROPHET
Game Plan
Description
1 Technique to predict the future
2 Forecasting is a common task for Business analysts
3 Practice case study: Udemy Wikipedia page visits
4 Challenge: Shelter Demand in New York City

Structural Time Series
Description
Structural Time Series is the decomposition of the data into at least:
Trend
Seasonality
Exogenous impacts
Visualization: plots of the Data and its components (Trend, Seasonality, Exogenous impacts, Error Term)
Methodological framework
y(t) = c(t) + s(t) + x(t) + \epsilon
Facebook Prophet quick facts
Which?
Description
1 Built by Facebook
2 Stan background - probabilistic programming language for statistical inference
3 Dynamic Holidays
4 Prophet is customizable in ways that are intuitive to non-experts
5 Built-in Cross Validation

Prophet Mechanics
Methodological framework
y(t) = c(t) + s(t) + h(t) + x(t) + \epsilon
Where:
c(t) Trend +
s(t) Seasonality +
h(t) Holiday effects +
x(t) External regressors +
\epsilon error
Case Study Briefing
Description
A Dataset with Daily Udemy Wikipedia Visits
1 Predict the number of visits to the Wikipedia page of Udemy
2 Learn cross-validation
3 Combine with Parameter Tuning

Dynamic Holidays – Valentine’s example
Facebook Prophet
You state Valentine’s as a key event and specify how many days before/after
Other models:
You must create dummy variables for each day, if you believe they have different impacts
Visualization: Chocolate demand from February 11 to 15
Training and Test Set in Time Series
Training set Test set
Dataset Time
Key Ideas
Forecasting Models are usually split into a pre and post period from a time perspective
The Test Set should be of the size of a real-world forecast
Facebook Prophet Model
Component                Description
Growth                   Linear or Logistic
Holidays                 Dataframe that we prepared
Seasonality              Yearly, weekly or daily. True or False
Seasonality_mode         Multiplicative or additive
Seasonality_prior_scale  Strength of the seasonality; larger values allow the model to fit larger seasonal fluctuations
Holidays_prior_scale     Strength of the holiday effects; larger values allow the model to fit larger holiday effects
Changepoint_prior_scale  Does the Trend change easily?

Additive vs. Multiplicative
Additive: y(t) = T(t) + S(t) + e(t)
Multiplicative: y(t) = T(t) * S(t) * e(t)
Key ideas
If we talk about seasonality in terms of percentage, the seasonality is multiplicative
If it adds absolute values, then it is additive.
Visualization: two plots of Y against t, one additive and one multiplicative
Cross Validation
Training set Test set
Key Idea
Repeating the assessment of our model reinforces its evaluation
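One common way to repeat the assessment on time series is an expanding training window, where each test set always comes after its training set. This is a generic sketch with hypothetical sizes and a hypothetical function name, not Prophet's built-in routine.

```python
def expanding_window_splits(n_obs, initial, horizon, step):
    """Yield (train_indices, test_indices); the test set always follows the training set in time."""
    train_end = initial
    while train_end + horizon <= n_obs:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += step

splits = list(expanding_window_splits(n_obs=100, initial=60, horizon=10, step=10))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Each split yields one error measurement; averaging them gives a sturdier evaluation than a single training and test set.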
Parameters to tune
Component                Description
Seasonality_prior_scale  Strength of the seasonality; larger values allow the model to fit larger seasonal fluctuations
Holidays_prior_scale     Strength of the holiday effects; larger values allow the model to fit larger holiday effects
Changepoint_prior_scale  Flexibility of the automatic changepoint selection

Challenge - Demand Forecasting
Shelter Demand
How many people need a shelter?
1 Prepare dataframe
2 Training and test set
3 Create model and assess accuracy
4 Visualize the output
5 Perform parameter tuning

Forecasting at Uber
Description
1 They need strategic and tactical forecasts
2 They need marketplace forecasts to allocate cars
3 But they also need it to measure investments
4 They need to forecast at scale
5 They use simple statistical models
6 Machine Learning when exogenous regressors are available
7 They try multiple approaches to find the best result
