*Optional Survey: Programming Language, Apache Spark, Machine Learning
LET’S GET STARTED
Apache Spark™ Overview
Apache Spark Background
▪ Founded as a research project at UC Berkeley in 2009
▪ Open-source unified data analytics engine for big data
▪ Built-in APIs in SQL, Python, Scala, R, and Java
Have you ever counted the number of M&Ms in a jar?
Spark Cluster
[Diagram: one driver coordinating a cluster of worker nodes]
Query execution: Logical Plan → Catalyst Optimizer → Physical Execution
Under the Catalyst Optimizer’s Hood
[Diagram: DataFrame → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan → RDDs]
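To see these stages on a real query, here is a minimal PySpark sketch, assuming an active SparkSession named spark (as in a Databricks notebook); explain(True) prints the parsed, analyzed, and optimized logical plans plus the selected physical plan — the pipeline in the diagram above.

from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`.
df = spark.range(1_000_000).withColumn("doubled", F.col("id") * 2)

# Prints each plan Catalyst produces for this query.
df.filter(F.col("doubled") > 10).explain(True)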
When to Use Spark
Machine Learning
[Diagram: a learned function maps Features → Output]
Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)

Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training

Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
Machine Learning Workflow
Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collation → Feature Engineering → Modeling → Deployment
Business Use Cases
▪ Data distribution
▪ Feature interactions
▪ Missing values
▪ Target variable type
▪ Deployment considerations
▪ Speed of training
▪ Need for accuracy
▪ Need for interpretability
Simple Linear Regression
ŷ = w₀ + w₁x, with y ≈ ŷ + ϵ
where...
▪ x: feature
▪ y: label
▪ ŷ: predicted label
▪ w₀: y-intercept
▪ w₁: slope of the line of best fit
▪ ϵ: residual (y − ŷ)
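As a minimal sketch of these quantities (toy numbers, plain numpy rather than Spark), the ordinary-least-squares coefficients for a single feature can be computed in closed form:

import numpy as np

# Toy data: one feature x and a label y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form ordinary least squares for one feature:
# w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0 = y.mean() - w1 * x.mean()

y_hat = w0 + w1 * x       # predictions ŷ
residuals = y - y_hat     # ϵ for each observation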
Minimizing the Residuals
[Plot: residuals as vertical distances between the observations and the fitted line]
Evaluation Metrics
▪ Loss: (y − ŷ)
▪ Absolute loss: |y − ŷ|
▪ Squared loss: (y − ŷ)²
Evaluation Metric: Root-Mean-Squared Error (RMSE)
RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
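A minimal Spark ML sketch computing RMSE on held-out data; the train_df/test_df DataFrames and the "features"/"price" column names are illustrative assumptions:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(train_df)
preds = model.transform(test_df)   # adds a "prediction" column

evaluator = RegressionEvaluator(labelCol="price", metricName="rmse")
rmse = evaluator.evaluate(preds)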
Linear Regression Assumptions
▪ Linear relationship between each feature and Y
▪ Observations are independent from one another
▪ Features are independent from one another
▪ The value of the residuals does not depend on the feature values
Linear Regression Assumptions
So, which datasets are suited for linear regression?
Train vs. Test RMSE
[Plot: RMSE on the training set vs. the test set]
Evaluation Metric: R²
R² = 1 − (SS_res / SS_tot), the fraction of the variance in y explained by the model (1 is a perfect fit; 0 is no better than always predicting the mean)
One-Hot Encoding (OHE)
Animal → OHE vector
Dog → 1 0 0
Cat → 0 1 0
Fish → 0 0 1
▪ But what if we have an entire zoo of animals? That would result in really wide data!
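A minimal sketch of one-hot encoding in Spark ML, assuming a DataFrame df with a string column "animal" (an illustrative name):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

indexer = StringIndexer(inputCol="animal", outputCol="animal_idx")
encoder = OneHotEncoder(inputCols=["animal_idx"], outputCols=["animal_ohe"])

indexed = indexer.fit(df).transform(df)
encoded = encoder.fit(indexed).transform(indexed)

Note that Spark stores the encoded column as a sparse vector, so even a zoo's worth of categories does not materialize a wide dense table.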
[Diagram: a decision tree predicting Salary, splitting on features such as Commute? and Bonus? against a $50,000 threshold]
Lines vs. Boundaries
Linear Regression
▪ Lines through data
▪ Assumed linear relationship

Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships

[Plot: decision-tree boundaries over a Commute (1 hour) vs. Salary ($50,000) feature space]
Linear Regression or Decision Tree?
Tree Depth: the length of the longest path from the root node to a leaf node
[Diagram: a tree whose root node, at depth 0, splits on Salary > $50,000 into Yes/No branches]
Note: shallow trees tend to underfit, and deep trees tend to overfit
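A minimal sketch of that depth trade-off in Spark ML; a prepared train_df with "features"/"label" columns is assumed:

from pyspark.ml.regression import DecisionTreeRegressor

shallow = DecisionTreeRegressor(maxDepth=2).fit(train_df)   # tends to underfit
deep = DecisionTreeRegressor(maxDepth=20).fit(train_df)     # tends to overfit
print(shallow.toDebugString)  # prints the learned split rules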
Underfitting vs. Overfitting
[Figure: three fits labeled Underfitting, Just Right, and Overfitting along a model-complexity axis]
Additional Resource: https://www.explainxkcd.com/wiki/index.php/2021:_Software_Development
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
Bootstrap Sampling
A method for simulating N new datasets:
Random Forest Algorithm
[Diagram: Full Training Data → bootstrap samples → one decision tree per sample → Aggregation → Final Prediction]
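A minimal numpy sketch of bootstrap sampling: each simulated dataset draws the same number of rows from the original data, with replacement (toy values):

import numpy as np

rng = np.random.default_rng(seed=42)
data = np.array([5, 2, 8, 4])  # toy training set

# Simulate N "new" datasets by sampling with replacement.
N = 3
bootstraps = [rng.choice(data, size=len(data), replace=True) for _ in range(N)]

Spark ML's RandomForestRegressor performs this sampling internally; setting numTrees=500 would build the five hundred trees from the title.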
Question: With 3-fold cross validation, how many models will this build?
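In general, k-fold cross-validation trains folds × grid-combinations models. A minimal Spark ML sketch, assuming train_df with "features"/"label" columns: the 2 × 2 grid below with numFolds=3 builds 3 × 4 = 12 models, plus one final refit on the full training set.

from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 10])
        .addGrid(rf.numTrees, [50, 100])
        .build())                      # 2 x 2 = 4 combinations

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="label"),
                    numFolds=3)        # 3 folds x 4 combinations = 12 models
cv_model = cv.fit(train_df)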
HYPERPARAMETER TUNING DEMO
HYPERPARAMETER TUNING LAB
Hyperparameter Tuning with Hyperopt
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have a non-parametric search space
▪ You want to pick your hyperparameters based on past results
Hyperopt
▪ Open-source Python library
▪ Serial and parallel optimization over awkward search spaces
▪ Spark integration for distributed tuning
▪ Three core algorithms for optimization:
▪ Random Search
▪ Tree of Parzen Estimators (TPE)
▪ Adaptive TPE
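A minimal Hyperopt sketch using TPE; the quadratic objective is purely illustrative:

from hyperopt import fmin, tpe, hp

def objective(x):
    # Illustrative loss; in practice this would train a model and
    # return a validation metric to minimize.
    return (x - 3) ** 2

best = fmin(fn=objective,
            space=hp.uniform("x", -10, 10),  # a search space, not a fixed grid
            algo=tpe.suggest,                # Tree of Parzen Estimators
            max_evals=50)                    # the training budget
print(best)

For the Spark integration mentioned above, passing trials=SparkTrials(parallelism=...) to fmin distributes these evaluations across a cluster.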
Optimizing Hyperparameter Values
Random Search
Boosting vs. Bagging
GBDT
▪ Starts with high bias, low variance
▪ Works to the right: adding trees reduces bias

RF
▪ Starts with high variance, low bias
▪ Works to the left: averaging trees reduces variance

[Plot: bias² falls and variance rises with model complexity; the optimum model complexity sits where their sum is minimized]
Gradient Boosted Decision Trees Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Highly parallel
▪ Works nicely with Spark in Scala
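A minimal sketch of the Spark ML implementation above; train_df with "features"/"label" columns is assumed:

from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(maxIter=100,   # number of boosting rounds (trees)
                   maxDepth=5,    # depth of each weak learner
                   stepSize=0.1,  # learning rate
                   featuresCol="features",
                   labelCol="label")
gbt_model = gbt.fit(train_df)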
XGBOOST DEMO
Appendix
Electives
The following electives are also available:
▪ Logistic Function: σ(z) = 1 / (1 + e⁻ᶻ)
▪ Large positive inputs → 1
▪ Large negative inputs → 0
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
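A minimal numpy sketch of the logistic function and the 0.5 threshold described above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid(np.array([-6.0, 0.0, 6.0]))  # ≈ [0.0025, 0.5, 0.9975]
classes = (probs >= 0.5).astype(int)         # P[y = 1 | x] ≥ 0.5 → class 1 (Red)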
Evaluation Metrics
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
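A minimal sketch computing all four metrics from raw confusion-matrix counts (toy numbers):

# Toy confusion-matrix counts; a real evaluation would supply these.
TP, FP, TN, FN = 40, 10, 45, 5

accuracy = (TP + TN) / (TP + FP + TN + FN)          # 0.85
precision = TP / (TP + FP)                          # 0.8
recall = TP / (TP + FN)                             # ≈ 0.889
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.842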
K-Means Clustering
▪ Unsupervised learning
▪ Unlabeled data (no known function output)
▪ Categorize records based on features
K-Means Clustering
[Diagram: depending on centroid initialization, K-Means can converge to a local minimum of the clustering cost rather than the global minimum]
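A minimal Spark ML sketch, assuming a DataFrame df with a "features" vector column; fixing the seed makes the initialization (and thus which minimum is reached) reproducible:

from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=3, seed=42, featuresCol="features")
model = kmeans.fit(df)
centers = model.clusterCenters()   # one centroid per cluster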
Other Clustering Techniques
Collaborative Filtering
Recommendation Systems
Naive Approaches to Recommendation
▪ Hand-curated lists
▪ Aggregates
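Beyond those naive approaches, Spark ML implements collaborative filtering via ALS. A minimal sketch, assuming a ratings_df with illustrative "userId", "movieId", and "rating" columns:

from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")   # skip users/items unseen during training
als_model = als.fit(ratings_df)
recs = als_model.recommendForAllUsers(10)   # top-10 recommendations per user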