Business Analyst Slide Deck


Business Analyst course

Introduction to the course

Description

1 This video is dedicated to bureaucracies
2 The course has statistics as the base
3 A business analyst needs to know 3 Analytics types
4 The course is practice-focused
5 The course materials are in the next lecture

Visualization: Business Analyst; Econometrics and Causal Inference; Segmentation; Predictive Analytics; Statistics and Descriptive Analytics
The Modern-day Business Analyst

Description

1 Being only good with numbers is a thing of the past
2 Proficient with Statistics & Analytics methodologies
3 A bridge between technical and non-technical people


The impact of weather on sales

Description

1 Weather influences seasonal industries

2 External factors are uncontrollable by nature

3 How to prove weather influences sales?

4 If weather influences sales, then sales move when the weather changes and stay constant otherwise, all else being equal

5 The Technique I used was Google Causal Impact


Predicting the future

Description

1 Commercial teams are belief-driven in nature

2 Advanced Analytics give numbers to beliefs

3 But making the change is difficult

4 Simple and interpretable usually has high errors

5 One of the techniques used was Facebook Prophet


BASIC
STATISTICS
Game Plan

Description

1 Backbone for the full course
2 Master the principles to make it easier in the future
3 You will do an exercise for each statistic learned
4 Moneyball case study at the end


(Arithmetic) Mean

Description

Same thing as average
When we say mean, we refer to the arithmetic mean
Represents the expected value

Methodological Representation

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Case Study Briefing – Baseball

Description

1 We have a dataset with baseball teams’ data from 1962 to 2012
2 There are 12 KPIs for each Team, League and Year
3 We will practice statistical concepts on the dataset

https://www.baseball-reference.com/
Mode and Median

Mode
The most frequent number in a set
Fashion is a statistical term ☺
Visualization: X = 2, 3, 3, 5. Mode is 3

Median
The central number of an ordered set
If the set has an even number of values, then you average both middle points
Used with skewed datasets
Visualization: X = 2, 3, 5. Median is 3
Visualization: X = 2, 3, 5, 10. Median is 4
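The three small sets on this slide can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are made up for illustration.

```python
from statistics import mean, median, mode

# The example sets shown on the slide
mode_example = [2, 3, 3, 5]
odd_example = [2, 3, 5]
even_example = [2, 3, 5, 10]

print(mode(mode_example))    # most frequent value
print(median(odd_example))   # middle value of the ordered set
print(median(even_example))  # average of the two middle values
print(mean(even_example))    # sum of the values divided by their count
```

Note how the mean of the last set is pulled above the median by the value 10, which is why the median is preferred for skewed data.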
(Pearson) Correlation

Description

Measures the relationship strength between 2 variables
Varies between -1 and 1
1 means strong positive relationship
-1 means strong negative relationship
0 indicates no relationship
Correlation does not imply causation

Methodological Representation

$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
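As a minimal sketch, the Pearson formula can be written out by hand in plain Python. The function name and the two sample lists are hypothetical and not taken from the course dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the products of the deviations from each mean
    co_deviation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the squared deviations
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return co_deviation / sqrt(ss_x * ss_y)

# Hypothetical data: wins tend to rise with runs scored
runs_scored = [650, 700, 720, 800, 850]
wins = [70, 78, 75, 90, 95]
print(round(pearson_r(runs_scored, wins), 3))
```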
Standard Deviation

Description

Measures the variation or dispersion of a set of values
High values mean higher variability

Methodological Representation

$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

Visualization: two histograms (# occurrences against X), one with high variability and one with low variability
Moneyball Case Study

Description

1 Moneyball is set in the world of baseball
2 The A’s had success despite financial struggles
3 The team looked for undervalued players
4 Other teams did not look at statistics
5 The General Manager looked at specific statistics
6 With the right system, you can beat anyone


INTERMEDIARY
STATISTICS
Game Plan

Description

1 Level up our statistics game
2 Please use the Q&A
3 We have 2 datasets
Normal Distribution aka Gaussian Distribution

Description

Symmetric distribution with the mean in the middle
Data occurring near the mean is more frequent
Graph is similar to a bell-shaped curve
Statistical methods, e.g. regression, assume normally distributed errors
In real life, most problems show some degree of similarity to the normal distribution
68-95-99.7 Rule in Normal Distributions

Description

Within ±1 standard deviation (SD), you can find 68% of observations
Within ±2 SD, you should encounter 95% of observations
Within ±3 SD, it is 99.7%

Key Idea
A normal distribution is a pattern, and patterns enable us to categorize data with more confidence

Visualization: bell curve with 68% of observations within 1 SD, 95% within 2 SD and 99.7% within 3 SD
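The three percentages follow from the normal distribution itself. A short sketch using the error function from Python's `math` module reproduces them; the function name is an assumption made for this example.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution lying within +-k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within +-{k} SD: {share_within(k):.1%}")
```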
Case Study – Wine Quality

Description
You are a wannabe Wine Statistics Connoisseur

Challenge
1 Normal Distribution
2 Standard Errors
3 Confidence Intervals

Paulo Cortez,
University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal
@2009
P-value is all about likelihood

Description

The probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming the null hypothesis is correct.
It helps us understand what is the likelihood of “accepting” aka “failing to reject” the hypothesis
A small p-value (small probability) would mean we favor the alternate hypothesis
P-value threshold usually used: 0.05

Examples

H0: The average salary of business analysts is €60k
H1: Business analysts’ average salary is not €60k
P-value = 0.2 -> We fail to reject the null hypothesis

H0: Blueberries prevent cancer
H1: Blueberries do not prevent cancer
P-value = 0.01 -> We reject the null hypothesis
Shapiro-Wilk test

Description

Quantifies how likely it is that the data was drawn from a Gaussian distribution
Created in 1965 and is one of many normality tests

Interpretation

H0: The distribution is Gaussian
If p-value > 0.05: the distribution appears to be normal
If p-value < 0.05: the distribution does not look Gaussian -> reject the null hypothesis
Standard Error (of the sample mean)

Description

The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean.
Standard deviation is the degree to which individuals within the sample differ from the sample mean.

Methodological Representation

$SE = \frac{\sigma}{\sqrt{n}}$
Z-Score

Description

Gives you an idea of how far from the mean a data point is.
Z-scores are a way to compare results to a “normal” population
It is a way to standardize values

Methodological Representation

$z = \frac{x - \mu}{\sigma}$

Example – Diogo College grades

Uni: Z = (16 – 13) / (2) = 1.5
GMAT: Z = (680 – 560) / (120) = 1
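The two grades in the example can be standardized with a one-line function. This is a sketch; the function and variable names are illustrative only.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

uni = z_score(16, 13, 2)        # university grade from the slide
gmat = z_score(680, 560, 120)   # GMAT score from the slide
print(uni, gmat)
```

Once both results are on the same scale, the university grade (1.5 SD above the mean) turns out to be the stronger of the two.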


Confidence Interval (when n > 30)

Description

A range that gives a sense of how precisely a statistic estimates a parameter.
The associated confidence level gives the probability with which an estimated interval will contain the true value of the parameter

Z-values table

Confidence Interval   Z-Value
80%                   1.28
85%                   1.44
90%                   1.65
95%                   1.96
99%                   2.58
99.9%                 3.29

Methodological Representation

$CI = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$

Visualization – 95% CI: the lower bound sits at the 2.5% point and the upper bound at the 97.5% point
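A minimal sketch of the interval formula, assuming the sample standard deviation is used as the estimate of σ. The z-values come from the table on the slide; the salary sample is made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# z-values as listed in the slide's table
Z_VALUES = {0.80: 1.28, 0.85: 1.44, 0.90: 1.65, 0.95: 1.96, 0.99: 2.58, 0.999: 3.29}

def confidence_interval(sample, level=0.95):
    """z-based confidence interval for the mean of a sample with n > 30."""
    n = len(sample)
    if n <= 30:
        raise ValueError("the z-based interval is meant for n > 30")
    standard_error = stdev(sample) / sqrt(n)   # SE = sigma / sqrt(n)
    margin = Z_VALUES[level] * standard_error  # z * SE
    centre = mean(sample)
    return centre - margin, centre + margin

# Hypothetical salaries in thousands of euros (44 observations)
salaries = [55 + (i % 11) for i in range(44)]
low, high = confidence_interval(salaries, level=0.95)
print(round(low, 2), round(high, 2))
```

A higher confidence level uses a larger z-value and therefore gives a wider interval.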
T-Tests

T-test formally
Test any statistical hypothesis in which the test statistic follows a Student's t-distribution under the null hypothesis.

In practical terms
Helps us understand whether one group is (statistically) different from the other

How do we know?
If the p-value is less than 0.05, then the groups are statistically different

Visualization: distributions of hand usage while talking (# of people) for Non-Italians and Italians
Challenge – Understanding Remote Work predictions

Stack Overflow dataset
Workers’ characteristics and job-related queries

Challenge
1 T-tests
2 Chi-square tests
(Pearson) Chi-square test

Chi-square test
Determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies

Null hypothesis
There is no relationship between variables

Difference from t-test
A t-test tests a null hypothesis about two means; a chi-square test requires categorical variables, each having any number of levels.

Visualization: contingency table of Lives in Berlin (Yes / No) against Wears black (Yes / No)
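For a 2x2 table like the Berlin example, the statistic can be computed by hand. The sketch below uses invented counts, applies no continuity correction, and relies on the fact that with 1 degree of freedom the p-value can be obtained from the complementary error function.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of observed counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = lives in Berlin (yes, no), columns = wears black (yes, no)
observed = [[30, 10], [20, 40]]
statistic, p_value = chi_square_2x2(observed)
print(round(statistic, 2), p_value < 0.05)
```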
Powerposing and p-hacking

Description

1 You put your body in a powerpose

2 You would perform better in high-pressure moments

3 Powerposing is not backed up by science

4 Powerposing results were not replicated by others

5 P-hacking is the removal of some individuals to achieve statistical significance
LINEAR
REGRESSION
Game Plan

Description

1 Building block in our learning capacity
2 I learned how to do it by hand. Yeah, really!
3 We will have a practice-focused approach


Case Study Briefing – Pricing Diamonds

Description

1 We have a dataset of roughly 300 diamonds
2 We have the price, carats and other KPIs
3 We want to understand how carats influence Diamond Prices

Package: Ecdat
(Linear) Regression crash course

Definition
Study of a relationship between a dependent variable and at least one independent variable

Intuition perspective
Method for “What is the impact of X on Y?”

How is it different from a correlation?
• Correlation studies the direction
• Regression studies the impact

Visualization: Price plotted against Carat, with a fitted line of slope β
Linear Regression

Visualization Methodological View

𝑌 =𝑎+𝑏∗𝑋+𝑒

Interpretation
If I increase X by 1, Y increases by b
If X happens, Y increases by b
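To make the roles of a and b concrete, here is a minimal least-squares fit done by hand for a single predictor. The function name and the carat and price figures are hypothetical, not the course data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b * X with one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-deviation of X and Y divided by the squared deviation of X
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means
    a = y_bar - b * x_bar
    return a, b

# Hypothetical diamonds: carats and prices in thousands
carats = [0.3, 0.5, 0.7, 1.0]
prices = [2.0, 4.5, 6.8, 10.1]
a, b = fit_line(carats, prices)
print(f"price = {a:.2f} + {b:.2f} * carat")
```

Here b reads as the price change associated with one extra carat, which is the "If I increase X by 1, Y increases by b" interpretation.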
How to read a Regression result

Coefficients
One carat increases the price by 11.6k

R-Squared and adj R-squared


We can explain 89.3% of the variance

Confidence interval (95%)


The Carat coefficient is between 11.1k-12.1k

Statistical Significance
If P>|t| is less than 0.05, we have statistical
significance
Dummy variable trap

Multicollinearity

Observation   Coca cola   Pepsi
a             1           0
b             1           0
c             1           0
d             1           0
e             0           1
f             0           1
g             0           1
h             0           0
j             0           1

Correlation between Coca-Cola and Pepsi is -1
Solution: remove one dummy variable
Removing does not mean information is lost
A zero also represents information
The removed dummy variable becomes part of the intercept.
You can see it as being your baseline.
Dummy variable trap

Multicollinearity

Observation   Coca cola   Pepsi   White Label
a             1           0       0
b             1           0       0
c             1           0       0
d             0           0       1
e             0           0       1
f             0           1       0
g             0           1       0
h             0           0       1
j             0           1       0

Correlation between Coca-Cola and Pepsi is -1
Solution: remove one dummy variable
Removing does not mean information is lost
A zero also represents information
The removed dummy variable becomes part of the intercept.
You can see it as being your baseline.
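A small sketch of the fix in plain Python: encode every brand except one baseline, so the baseline is represented by all zeros. The helper name is hypothetical, and White Label is chosen as the baseline purely for illustration.

```python
def one_hot_drop_baseline(values, baseline):
    """Dummy-encode a categorical column, leaving out the baseline level."""
    if baseline not in values:
        raise ValueError("baseline must be one of the observed levels")
    levels = sorted(set(values) - {baseline})
    # Each row gets one 0/1 flag per remaining level; the baseline is all zeros
    return [{level: int(value == level) for level in levels} for value in values]

# Observations a to j from the table above
brands = ["Coca cola", "Coca cola", "Coca cola", "White Label", "White Label",
          "Pepsi", "Pepsi", "White Label", "Pepsi"]
rows = one_hot_drop_baseline(brands, baseline="White Label")
for brand, row in zip(brands, rows):
    print(brand, row)
```

In a regression on these two dummies, the intercept would describe the White Label observations and each coefficient the difference from that baseline.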
MULTILINEAR
REGRESSION
Game Plan

Description

1 Topics: outliers, assessment and overfitting
2 Practice tutorial: Teacher’s salaries
3 Challenge: Retail Store drivers
4 Regression's added value is interpretability


Multilinear Regression

Description

It is very rare that one input explains the output
We often need more predictors to improve the models
Beware of multicollinearity or overfitting

Visualization: Linear Regression uses one input (X -> Y, e.g. Carat -> Diamond); Multilinear Regression uses several (Carat, Color, Shape -> Diamond)
Case Study Briefing – Professors’ salaries

Description

1 The 2008-09 academic salary for Professors in a college in the U.S.
2 The data was collected to monitor salary differences between male and female faculty members.
3 Use Multilinear Regression to study

Outliers

Interpretation

Outliers can damage your analysis
You must distinguish between noise and valuable information
Consider using models good with outliers or non-linearity (e.g., Random Forest)

Visualization: Salaries plotted against Professional experience
Modelling is finding the balance between under and overfitting

Insights

A model that is too simple (underfitting) will get you nowhere
A model that is too complex (overfitting) will not yield results in other testing scenarios
You should iterate based on results

Visualization: an underfitting model next to an overfitting model
Let‘s imagine this is our full data set

Description
Splitting between training and test enables an unbiased
model assessment

Training Set Test Set

Model Assessment
Mean Absolute Error (MAE) vs Root Mean Squared Error (RMSE)

Key ideas

MAE and RMSE are performance indicators for Regression models with continuous outputs

$MAE = \frac{\sum |y - \hat{y}|}{n}$

$RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}}$

RMSE is useful for models with extremes / outliers
MAE is more interpretable.

Visualization: Y plotted against X with the fitted Model line
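Both error measures take only a few lines of plain Python. In this sketch the actual and predicted values are invented, with one large miss to show how each measure reacts.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: squaring weighs large errors more heavily."""
    return sqrt(sum((y_hat - y) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual))

# Hypothetical sales: three close predictions and one large miss
actual = [100, 200, 300, 400]
predicted = [110, 190, 300, 480]
print(mae(actual, predicted), round(rmse(actual, predicted), 2))
```

The single large miss pushes the second measure well above the first, which is why it is the more sensitive indicator when extremes matter.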
Challenge

Description
Use Multilinear Regression to study a Store's sales drivers

1 Pick variables for your model
2 Analyze the data, i.e. summary statistics
3 Correlation Matrix
4 Create a Training and a Test Set
5 Use Multilinear Regression
6 Assess Accuracy

Dataset: Ecdat package
LOGISTIC
REGRESSION
Game Plan

Description

1 We now face a classification problem
2 The question influences the analytical technique
3 How do we measure accuracy?
4 Case study: Which emails are spam?
5 Challenge: the sex of penguins


Case Study Briefing – Is it spam?

Description

1 Dataset with ~5k emails
2 What makes an email spammy?
3 Can we predict which emails are spam?

DAAG package
Logistic Regression crash course

What is a Logistic Regression?
Relationship study between a discrete dependent variable and at least one independent variable

From an Intuition Perspective?
What is the impact of X on Y happening?

How is it different from a Linear Regression?
Linear is for continuous, logistic is discrete
Linear we fit a straight line, logistic a curve

Visualization: Spam, Y (1) or N (0), plotted against # times the word money occurs
How to read Logistic Regression coefficients

Linear Regression

$Y = a + b \cdot X + e$

Interpretation
For each X unit increase, Y increases by b
b = 0.5: For each X unit increase, Y increases by 0.5

Logistic Regression

$Y = \frac{e^{a + bX}}{1 + e^{a + bX}}$

Interpretation
For each X unit increase, the odds of Y happening increase by (exp(b) – 1) * 100 %
b = 0.5: For each X unit increase, the odds of Y happening increase by about 65%
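A short numeric check helps when reading the coefficient. Strictly speaking, exp(b) – 1 is the relative change in the odds of Y; the change in the probability itself depends on where you start. The intercept and the X values below are hypothetical.

```python
from math import exp

def logistic(a, b, x):
    """Predicted probability of Y for a given X."""
    return exp(a + b * x) / (1 + exp(a + b * x))

def odds(p):
    """Odds that correspond to a probability p."""
    return p / (1 - p)

a, b = -1.0, 0.5            # hypothetical intercept, slope as on the slide
p_before = logistic(a, b, 2)
p_after = logistic(a, b, 3)  # X increased by one unit

odds_change = odds(p_after) / odds(p_before) - 1
print(f"probability: {p_before:.3f} -> {p_after:.3f}")
print(f"odds change: {odds_change:.1%}")
```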
The Confusion Matrix allows us to assess the results of a classifier

Confusion Matrix

               Predicted False   Predicted True
Actual False   True Negative     False Positive
Actual True    False Negative    True Positive

Accuracy
Accuracy = (True positive + True negative) / All
Balanced dataset

F1-Score
F1-score = 2 * TP / (2 * TP + FP + FN)
Unbalanced dataset
The Confusion Matrix allows us to assess the results of a classifier

Confusion Matrix

               Predicted False   Predicted True
Actual False   True Negative     False Positive
Actual True    False Negative    True Positive

Specificity or True Negative Rate
True Negative / (True Negative + False Positive)
When we focus on the accuracy of False values

Sensitivity, Recall or True Positive Rate
True Positive / (True Positive + False Negative)
Focus is on True values
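The four measures can be computed directly from the cells of the confusion matrix. The counts in this sketch are invented to mimic an unbalanced spam dataset.

```python
def classifier_metrics(tp, tn, fp, fn):
    """Accuracy, F1-score, specificity and sensitivity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),   # true negative rate
        "sensitivity": tp / (tp + fn),   # recall, true positive rate
    }

# Hypothetical unbalanced case: 90 spam emails among 1,000
metrics = classifier_metrics(tp=40, tn=900, fp=10, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy looks high even though more than half of the spam emails are missed, which is why the F1-score is preferred on unbalanced datasets.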
Challenge

Description
Use Logistic Regression to predict the sex of penguins

1 Pick variables for your model
2 Plot Histograms of the character variables
3 Transform the character variables into binary
4 Create a Training and a Test Set
5 Use Logistic Regression
6 Assess Accuracy through the classification report


GOOGLE CAUSAL
IMPACT
Why Econometrics and Causal Inference

Description

1 Decision-Making
2 Understanding and tackling biases
According to BNP Paribas, Sustainability-focused
companies perform better
BNP Paribas

Description

1 Are there other differences between sustainability-focused companies and others?
2 People define politics and decision
3 Not including all factors is falling into the omitted variable bias
Does Smoking prevent Parkinson’s?

Smoking and Parkinson’s

Description

1 The incidence of Parkinson’s in people between 55 and 75 is twice as high in non-smokers
2 Is there a causal relationship between smoking and Parkinson’s?
3 People are more likely to get Parkinson’s the older they get
4 Smokers’ life expectancy is lower than non-smokers’
5 Non-smokers are more likely to have Parkinson’s because they live longer, not because they don’t smoke
Game Plan

Description

1 Causal Impact was developed by Google
2 Practice Case Study: Paypal and Bitcoin
3 Challenge: Volkswagen CO2 scandal
4 Causal Impact is my most-used technique


Case study: Paypal x Bitcoin

Description
Use Google Causal Impact to estimate the impact of Paypal allowing crypto payments on Bitcoin price

1 On October 21st, 2020, Paypal announced it was entering the Crypto industry.
2 Given the bull market and other volatilities, we cannot compare the price before and after
3 We need to find comparable control groups


What is Time Series Data?

Key ideas

Sequence of data points in time order (oldest to newest)
Most commonly, it is data recorded in equally distanced time periods
Type of Panel Data (multidimensional dataset)

Visualization: Bitcoin Price (roughly 10,000 to 22,000) from 9/1/2020 to 11/1/2020
Comparing before and after impact leads to omitted
variable bias

Context

How to measure the impact of Paypal on Bitcoin price?
This graph shows the Bitcoin price in the market. The event started where the red line is
Comparing before and after would subject you to omitted variable bias.

Visualization: Bitcoin Price (roughly 10,000 to 22,000) from 9/1/2020 to 11/1/2020, with a red line marking the event

Brodersen, Kay H.; Gallusser, Fabian; Koehler, Jim; Remy, Nicolas; Scott, Steven L. Inferring causal impact using Bayesian structural time-series models. Ann. Appl. Stat. 9
(2015), no. 1, 247--274. doi:10.1214/14-AOAS788. https://projecteuclid.org/euclid.aoas/1430226092
Difference-in-differences framework

Key ideas

We use Google to create an artificial control group
The delta between what actually happened and the what-if scenario is the treatment impact

Visualization: Price against Time, with a marker at October 20th. Lines shown: what actually happened with Bitcoin; Bitcoin, if same evolution as Google (the what-if scenario); Google; Google post October 20th. The gap between the two Bitcoin lines is the Paypal impact.
Causal Impact Step by Step

Define pre and post period

Retrieve the data we need

Check whether the variables are correlated in the pre period

Remove non-correlated data

Use Causal Impact


Assumptions

Parallel Trends Assumption
The Treatment and Control Groups are assumed to have the same evolution for the KPI

Visualization: Price against Time for Bitcoin, Amazon and Google, with a marker at October 20th

Confounding Policy Change
There must be only one policy or initiative that differentiates the treatment from control groups.
You can only measure the impact of one treatment.

How to strengthen the assumptions
Use more control groups
Use a longer training period
Keep post-period to the bare minimum
Correlation in Time Series

Description

Measures the relationship strength between 2 variables
If the Time-Series grows over time, then the correlation might be random
The data must be stationary


Stationarity

Key idea

Mean, variance and covariance are not time dependent
Stationary Time Series have a defined pattern
Statistical test: Dickey-Fuller test. If the p-value is less than 0.05, the time series is considered stationary

Visualization: four panels of Y against t: Stationary Time Series, Time dependent mean, Time dependent variance, Time dependent covariance
Making Data Stationary

Time Series Differencing

5 NA
9 4
1 -8
7 6
3 -4
7 4
4 -3
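The differencing column above can be reproduced in a couple of lines. In this sketch, `None` stands in for the NA of the first row, which has no previous value to subtract.

```python
def difference(series):
    """First difference: each value minus the previous one."""
    return [None] + [current - previous for previous, current in zip(series, series[1:])]

series = [5, 9, 1, 7, 3, 7, 4]
print(difference(series))  # [None, 4, -8, 6, -4, 4, -3]
```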
Impact evolution

Context

Let’s discuss what should be the impact of Paypal adopting Bitcoin:
Greater in the beginning
Impact gradually increases
You can also point out that the impact should continue days after the announcement
Causal Impact allows the impact to vary over time

Challenge

Description
Use Causal Impact to measure the impact of the CO2 scandal on the Volkswagen stock price

1 Pick Stocks for the control groups
2 Perform a correlation matrix
3 Measure the impact


MATCHING
Game Plan

Description

1 There is no comparable control group
2 Helps us with (self)-selection bias
3 How to measure referral programs?
4 What is the incremental value of Mobile Shopping?
5 Practice case study: Catholic Schools and scores
6 Challenge: Remote work and career satisfaction


How do you figure out the value of Amazon Prime?

Context

Amazon Prime is a loyalty program that provides free shipping, discounts and other services

The goal of the program is fourfold:

• Increase customer loyalty
• Increase revenue per customer
• Decrease marketing spending in customer re-activation
• Decrease paid advertising in conversion

The subscription lasts 1 year

If you were asked to provide the impact of Amazon Prime on its financials, how would you do it?
You cannot simply compare the average Prime and non-Prime subscriber

Context
Both groups may be inherently different from the start. Hence, they are not comparable.
Beware of (self-)selection bias
A possible solution is Matching.
In a nutshell, you create a counterfactual group with similar characteristics to your treatment group

Visualization: Treatment and Control groups
Case Study Briefing – Are catholic schools better?

Description
Use Matching to understand whether catholic schools are better than others (from a standardized test score view)

1 We have a dataset with kids’ background, their parents’ upbringing among others
2 The key metric of success is the standardized test scores
3 We need to re-create a comparable control group


Unconfoundedness

Context

The variables (confounders) used are enough to describe the people or entities (W)
The characteristics affect the likelihood of someone being part of the treatment (X)
The combination of the confounders and the treatment leads to the outcome (Y)
Meeting the Unconfoundedness assumption is a tall order

Visualization: causal diagram in which W points to X and to Y, and X points to Y
Curse of Dimensionality

Context

Imagine you have a variable with 3 options
Then you add a second with 3 more
Finally, a third
The number of observations needed to fill each bucket grows exponentially
The Matching outcome can be spurious when few elements belong to a “dimension”

Key Idea
Make sure you create a model that is as simple as possible
How to determine the Common Support Region

Examples

Unconfoundedness: list of confounders
Via Logistic Regression: probability of being treated
Common Support Region

Key ideas

We predict whether someone is part of the treatment group
There will be people with a high likelihood of participating. You are not likely to find a control group for them.
The greater the overlap, the higher the matching quality

Visualization: Density against Probability of Treatment, comparing the probability of the treated group being treated with the probability of the non-treated group being treated
Robustness checks

Repeated experiment
Dataset -> Sample / Subsample -> Matching -> Store Results -> Repeat

Removing 1 confounder
Variables -> Remove 1 Confounder -> Matching -> Store Results -> Repeat


Challenge

Description
Use Matching to figure out whether Remote workers have higher Career Satisfaction

1 Pick variables for your model
2 T-test Loop
3 Transform the character variables into binary
4 Perform Matching
5 Perform a robustness check


My experience with Matching

Description

1 Introducing English in the Zalando.de website
2 What is the incrementality?
3 Different audiences mean non-comparability
4 Tiny treatment group = good common support region
5 Practice case study: Catholic Schools and scores
6 I used the repeated experiments for robustness
RECENCY
FREQUENCY
MONETARY
Game Plan

Description

1 Introducing value-based segmentation
2 Case study: online shoppers segmentation
3 Challenge: purchasing behavior
4 Simple yet powerful concepts in this section


Value-based segmentation

Description

Companies rank customers
Who to prioritize
Understand where to focus
Who is more loyal

Pareto rule
20% of the causes result in 80% of the consequences

Visualization: the Top 20% of Customers generate 80% of Profit; the Bottom 80% generate 20%
What is RFM?

Description   Typical RFM     Our Model       Meaning
Recency       Days gone       Days gone       How long since they last purchased
Frequency     Frequency (Q)   Frequency (Q)   How often have they purchased
Monetary      Sales (P*Q)     Basket (P)      Average Purchase Value

What else?

Include Churn or Customer Retention
Include Time Horizon
Change Average Purchase by Average Profit
How does it work?

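Range (Max at the top, Median in the middle, Min at the bottom)

Range                Frequency   Recency   Monetary
Max to upper half    4           1         4
Above Median         3           2         3
Below Median         2           3         2
Lower half to Min    1           4         1

Final Values (sum of the three scores)

11-12   Superstar
8-10    Future Champion
6-7     High Potential
3-5     Low Relevance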
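The scoring logic can be sketched in plain Python, under two assumptions: each variable is split into quartiles, and the final value is the sum of the three scores. The four customers are invented, and Recency is scored in reverse because fewer days gone is better.

```python
def quartile_score(value, values, higher_is_better=True):
    """Score from 1 to 4 based on the quartile of `value` within `values`."""
    share_at_or_below = sum(v <= value for v in values) / len(values)
    if share_at_or_below <= 0.25:
        score = 1
    elif share_at_or_below <= 0.50:
        score = 2
    elif share_at_or_below <= 0.75:
        score = 3
    else:
        score = 4
    return score if higher_is_better else 5 - score

def segment(total):
    """Map the summed RFM score to the segment names on the slide."""
    if total >= 11:
        return "Superstar"
    if total >= 8:
        return "Future Champion"
    if total >= 6:
        return "High Potential"
    return "Low Relevance"

# Hypothetical customers: days since last purchase, number of purchases, average basket
recency = [5, 30, 90, 200]
frequency = [12, 6, 3, 1]
basket = [80, 50, 40, 20]

segments = []
for r, f, m in zip(recency, frequency, basket):
    total = (quartile_score(r, recency, higher_is_better=False)
             + quartile_score(f, frequency)
             + quartile_score(m, basket))
    segments.append(segment(total))
print(segments)
```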
Case Study Briefing

Description
A Dataset with Online Shoppers data

1 We have a dataset with purchases of customers, detailed by items
2 Create a customer dataset with the Recency, Frequency and Monetary variables
3 Create an RFM model and apply to the dataset


Challenge – Customer Value segmentation

Description
A dataset with customer data

1 Prepare basket variable
2 Rename variables
3 Create an RFM model with 3 levels
4 Define 3 segments
5 Prepare final table overview


GAUSSIAN
MIXTURE MODEL
Game Plan

Description

1 Clustering is a lazy person’s favorite
2 Case study: Credit Card applicants
3 Challenge: Credit card users
4 New concepts: AIC and BIC


What are clustering techniques?

Key ideas

Groups observations in terms of their characteristics
Main task of exploratory data mining
Clustering is an art rather than a science

Visualization: observations plotted on X1 against X2
Gaussian Mixture Model

Key ideas

Gaussian Mixture Model is a probabilistic method for clustering
Better to use than traditional clustering algorithms, like K-means
The probabilities allow us to better evaluate edge cases
Case Study Briefing

Description
A Dataset with Credit Card applicants

1 Determine the optimal number of segments for the dataset
2 Use Gaussian Mixture Model
3 Interpret the segments and name them according to their characteristics
Akaike’s Information Criterion (AIC) and Bayesian
Information Criterion (BIC)
Key Ideas

• AIC and BIC help us determine the optimal number of clusters
• AIC and BIC provide a means to select a model
• Trade-off between simplicity and goodness of fit
• Deal with overfitting
• BIC penalizes overfitting more than the AIC

Pseudo-visualization: Goodness of fit versus Simplicity

Challenge – Gaussian Mixture Model

Description
A Dataset with customer data

1 Prepare data set
2 Determine optimal number of clusters
3 Create GMM model
4 Interpret segments
My experience with Segmentation

Description

1 A closed contest for a big conglomerate
2 Their status-quo was a value-based segmentation
3 They wanted a behavioral segmentation
4 The first difficulty was how massive the data was
5 We tried to be hypothesis-driven
6 We had 7 interpretable segments in the end
7 We were complex and did not consider scalability


RANDOM FOREST
Game Plan

Description

1 Random Forest is an advanced analytics technique
2 Learn about Decision Trees and Ensemble Learning
3 Practice case study: Credit card applicants
4 Challenge: Customer’s income prediction


Random Forest is an Ensemble Learning Algorithm

What is it?

Description

1 Ensemble Learning is when you have a plurality of models predicting your output
2 Ensemble is an average of Models
3 A Random Forest is a combination of decision trees
4 Can be used for Regression and Classification problems


How do Decision trees work?

Decision tree

X2 < 50?
  No: X1 > 40? (No / Yes)
  Yes: X1 > 70? (No / Yes)
Two of the leaves shown are labelled Blue

Key Ideas:
• A split is chosen following an entropy logic
  - Where would it yield more information
• The prediction would be done based on the relative frequency

Visualization: observations plotted on x1 against x2, partitioned at x2 = 50, x1 = 40 and x1 = 70
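The "where would it yield more information" idea can be sketched with entropy and information gain. The class labels below are hypothetical; only Blue appears on the slide.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

parent = ["Blue", "Blue", "Red", "Red"]
clean_split = information_gain(parent, ["Blue", "Blue"], ["Red", "Red"])
useless_split = information_gain(parent, ["Blue", "Red"], ["Blue", "Red"])
print(clean_split, useless_split)
```

A tree-building algorithm would prefer the first split, because it separates the classes completely, while the second leaves both children as mixed as the parent.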
Case Study Briefing

Description
A Dataset with credit card applicants

1 The key metric of success is whether someone was accepted or not
2 We want to predict the acceptance
3 We also want to generate insights


Random Forest quirks

What is it?

Description

1 Tendency to overfit
2 Good with multicollinearity
3 Works well with non-linearity
4 Robust to Outliers
Parameter Tuning

Context
Advanced models have parameters to tune to optimize accuracy

Description
Parameter -> Run Model -> Measure error -> Save error

Parameter           Error
n_estimators: 50    5000
n_estimators: 100   6000
n_estimators: 150   6100
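The loop described here (pick a parameter value, run the model, measure the error, save it) can be sketched generically. In this example `run_model` is only a stand-in that returns the errors shown on the slide; in practice it would fit a Random Forest with the given `n_estimators` and return its test error.

```python
def tune(parameter_values, run_model):
    """Run the model once per candidate value and save each measured error."""
    saved_errors = {}
    for value in parameter_values:
        saved_errors[value] = run_model(value)
    best_value = min(saved_errors, key=saved_errors.get)
    return best_value, saved_errors

# Stand-in for "fit the model and measure its error", using the slide's numbers
slide_errors = {50: 5000, 100: 6000, 150: 6100}

def run_model(n_estimators):
    return slide_errors[n_estimators]

best_n, errors = tune([50, 100, 150], run_model)
print(best_n, errors)
```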


Challenge – Random Forest

Description
What is the income of your customers?

1 Prepare data set
2 Create Random Forest Regressor model
3 Measure accuracy
4 Tune the model
5 Generate insights
FACEBOOK
PROPHET
Game Plan

Description

1 Technique to predict the future
2 Forecasting is a common task for Business analysts
3 Practice case study: Udemy Wikipedia page visits
4 Challenge: Shelter Demand in New York City


Structural Time Series

Description
Structural Time Series is the decomposition of the data in at least:
Trend
Seasonality
Exogenous impacts

Methodological framework
$y(t) = c(t) + s(t) + x(t) + \epsilon$

Visualization: the Data split into Trend, Seasonality, Exogenous impacts and Error Term panels
Facebook Prophet quick facts

Which?

Description

1 Built by Facebook
2 Stan background - probabilistic programming language for statistical inference
3 Dynamic Holidays
4 Prophet is customizable in ways that are intuitive to non-experts
5 Built-in Cross Validation


Prophet Mechanics

Methodological framework
$y(t) = c(t) + s(t) + h(t) + x(t) + \epsilon$

Where:
c(t)   Trend
s(t)   Seasonality
h(t)   Holiday effects
x(t)   External regressors
ε      error
Case Study Briefing

Description
A Dataset with Daily Udemy Wikipedia Visits

1 Predict the number of visits to the Wikipedia page of Udemy
2 Learn cross-validation
3 Combine with Parameter Tuning


Dynamic Holidays – Valentine‘s example

Facebook Prophet
You state Valentine’s as a key event and specify how many days before/after

Other models:
You must create dummy variables for each day, if you believe they have different impacts

Visualization: Chocolate demand from 11 to 15 February
Training and Test Set in Time Series

Training set Test set

Dataset Time

Key Ideas
Forecasting Models are usually split into a pre and post period from a time perspective
The Test Set should be of the size of a real-world forecast
Facebook Prophet Model

Component                 Description
Growth                    Linear or Logistic
Holidays                  Dataframe that we prepared
Seasonality               Yearly, weekly or daily. True or False
Seasonality_mode          Multiplicative or additive
Seasonality_prior_scale   Strength of the seasonality
Holidays_prior_scale      Strength of the holiday effects; smaller values dampen them
Changepoint_prior_scale   Does the Trend change easily?


Additive vs. Multiplicative

Additive
$y(t) = T(t) + S(t) + e(t)$

Multiplicative
$y(t) = T(t) \cdot S(t) \cdot e(t)$

Key ideas
If we talk about seasonality in terms of percentage, the seasonality is multiplicative
If it is in adding absolute values, then it is additive.

Visualization: two charts of Y against t, one additive and one multiplicative
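A tiny numeric sketch of the difference, with an invented trend and seasonal pattern and the error term left out for clarity.

```python
trend = [100, 110, 120, 130]

# Additive: the seasonal effect is an absolute amount (+10 or -10 units)
additive_season = [10, -10, 10, -10]
additive = [t + s for t, s in zip(trend, additive_season)]

# Multiplicative: the seasonal effect is a percentage (+10% or -10%)
multiplicative_season = [1.10, 0.90, 1.10, 0.90]
multiplicative = [t * s for t, s in zip(trend, multiplicative_season)]

print(additive)
print([round(value, 2) for value in multiplicative])
```

In the additive series the seasonal swing stays at 10 units, while in the multiplicative series it grows from 10 to 13 units as the trend rises.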
Cross Validation

Training set Test set

Key Idea
Repeating the assessment of our model reinforces its evaluation
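One common way to repeat the assessment on time series is to move the cutoff between training and test set forward step by step. The sketch below only produces index splits; the function name and its parameters are assumptions made for illustration and are not Prophet's API.

```python
def rolling_origin_splits(n_obs, initial, horizon, step):
    """Successive (train, test) index lists in which the test set always follows the training set."""
    splits = []
    cutoff = initial
    while cutoff + horizon <= n_obs:
        train = list(range(0, cutoff))
        test = list(range(cutoff, cutoff + horizon))
        splits.append((train, test))
        cutoff += step
    return splits

splits = rolling_origin_splits(n_obs=10, initial=4, horizon=2, step=2)
for train, test in splits:
    print("train", train, "test", test)
```

Each split would be used to fit the model on the training indices and measure the error on the test indices, and the errors are then averaged.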
Parameters to tune

Component                 Description
Seasonality_prior_scale   Strength of the seasonality
Holidays_prior_scale      Strength of the holiday effects; smaller values dampen them
Changepoint_prior_scale   Flexibility of the automatic changepoint selection


Challenge - Demand Forecasting

Shelter Demand
How many people need a shelter?

1 Prepare dataframe
2 Training and test set
3 Create model and assess accuracy
4 Visualize the output
5 Perform parameter tuning


Forecasting at Uber

Description

1 They need strategic and tactical forecasts
2 They need marketplace forecasts to allocate cars
3 But they also need it to measure investments
4 They need to forecast at scale
5 They use simple statistical models
6 Machine Learning when exogenous regressors are available
7 They try multiple approaches to find the best result
