
Scalable Machine Learning

with Apache Spark™


Introductions
▪ Instructor Introduction
▪ Student Introductions
▪ Name
▪ Professional Responsibilities
▪ Fun Personal Interest/Fact
▪ Expectations for the Course
Course Objectives
1 Create data processing pipelines with Spark

2 Build and tune machine learning models with Spark ML

3 Track, version, and deploy machine learning models with MLflow

4 Perform distributed hyperparameter tuning with Hyperopt

5 Scale the inference of single-node models with Spark


Agenda

Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1

Day 2
1. Linear Regression, pt. 1 Lab
2. Linear Regression, pt. 2
3. Break
4. Linear Regression, pt. 2 Lab
5. MLflow Tracking
6. Break
7. MLflow Model Registry
8. MLflow Lab

Day 3
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Break
5. Hyperparameter Tuning Lab
6. Hyperopt

Day 4
1. Hyperopt Lab
2. MLlib Deployment Options*
3. XGBoost*
4. Break
5. Inference with Pandas UDFs
6. Training with Pandas UDFs
7. Pandas UDFs Lab
8. Koalas
9. Break
10. Capstone Project*

*Optional
Survey
▪ Apache Spark
▪ Machine Learning
▪ Programming Language
LET’S GET STARTED
Apache Spark™ Overview
Apache Spark Background
▪ Founded as a research project at UC
Berkeley in 2009
▪ Open-source unified data analytics
engine for big data
▪ Built-in APIs in SQL, Python, Scala, R,
and Java
Have you ever counted the
number of M&Ms in a jar?
Spark Cluster

[Diagram: one driver coordinates multiple workers; each worker runs an executor in its own JVM]

▪ One driver
▪ Many executor JVMs

Spark’s Structured Data APIs

RDD (2011)
▪ Distributed collection of JVM objects
▪ Functional operators (map, filter, etc.)

DataFrame (2013)
▪ Distributed collection of row objects
▪ Expression-based operations and UDFs
▪ Logical plans and optimizer
▪ Fast/efficient internal representations

Dataset (2015)
▪ Internally rows, externally JVM objects
▪ Almost the “best of both worlds”: type safe + fast
▪ But still slower than DataFrames
▪ Not as good for interactive analysis, especially with Python
Spark DataFrame Execution

PySpark DataFrame / Java/Scala DataFrame / SparkR DataFrame
→ Logical Plan
→ Catalyst Optimizer
→ Physical Execution
Under the Catalyst Optimizer’s Hood

SQL Query / DataFrame
→ Unresolved Logical Plan
→ Analysis → Logical Plan
→ Logical Optimization → Optimized Logical Plan
→ Physical Planning → Physical Plans
→ Cost Model → Selected Physical Plan
→ Code Generation → RDDs
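To see these stages on a real query, here is a minimal PySpark sketch; explain(True) prints the parsed, analyzed, and optimized logical plans plus the selected physical plan:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A trivial DataFrame query for Catalyst to plan.
df = spark.range(1000).withColumn("double_id", F.col("id") * 2)
df.filter(F.col("double_id") > 10).explain(True)
```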
When to Use Spark

▪ Scaling Out: data or model is too large to process on a single machine, commonly resulting in out-of-memory errors
▪ Speeding Up: data or model is processing slowly and could benefit from shorter processing times and faster results
Delta Lake Overview
Open-source Storage Layer
Delta Lake’s Key Features
▪ ACID transactions
▪ Time travel (data versioning)
▪ Schema enforcement and evolution
▪ Audit history
▪ Parquet format
▪ Compatible with Apache Spark API
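A minimal sketch of these features in practice (path and DataFrame hypothetical), assuming a Delta-enabled Spark environment:

```python
# ACID write in Parquet-based Delta format; schema is enforced on write.
df.write.format("delta").mode("overwrite").save("/tmp/demo/delta_table")

# Read the current version of the table.
current = spark.read.format("delta").load("/tmp/demo/delta_table")

# Time travel: read the table as of an earlier version.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/demo/delta_table"))
```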
Machine Learning Overview
What is Machine Learning?
▪ Learn patterns and relationships in your data without explicitly programming them
▪ Derive an approximation function to map features to an output or relate them to each other

Features → Machine Learning → Output
Types of Machine Learning

Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)

Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
Types of Machine Learning

Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training

Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
Machine Learning Workflow

Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collation → Feature Engineering → Modeling → Deployment
Business Use Cases

What business use cases do you have?
Defining and Measuring Success
Baseline Models
▪ Simple, dummy model
▪ Examples include:
▪ Most common case (not hot dog)
▪ Target variable mean
▪ Point-of-reference

[Figure: a coin-flip baseline model predicts 50% heads, 50% tails]
Algorithm Selection
How do we decide which machine learning algorithms to use?

▪ Data distribution
▪ Feature interactions
▪ Missing values
▪ Target variable type
▪ Deployment considerations
▪ Speed of training
▪ Need for accuracy
▪ Need for interpretability

Note: Be aware of any interpretability requirements due to data regulations like the General Data Protection Regulation.
How do we get this information?
▪ Exploratory data analysis
▪ Data visualization
▪ Data cleaning
▪ Data summaries
▪ Data relationships
DATA CLEANSING DEMO
Importance of Data Visualization
How do we build and evaluate models?
DATA EXPLORATION LAB
Linear Regression

Goal: Find the line of best fit.

ŷ = w0 + w1x
y ≈ ŷ + ϵ

where...
▪ x: feature
▪ y: label
▪ w0: y-intercept
▪ w1: slope of the line of best fit
Minimizing the Residuals

[Figure: scatter plot with the line of best fit and the residuals to each point]

▪ Blue point: true value
▪ Green dotted line: positive residual
▪ Orange dotted line: negative residual
▪ Red line: line of best fit

The goal is to draw a line that minimizes the sum of the squared residuals.
Regression Evaluators

Measure the “closeness” between the actual value (y) and the predicted value (ŷ).

Evaluation Metrics
▪ Loss: (y − ŷ)
▪ Absolute loss: |y − ŷ|
▪ Squared loss: (y − ŷ)²

Evaluation Metric: Root mean-squared error, RMSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²)
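A minimal sketch (DataFrames and column names hypothetical) of fitting and evaluating this model with Spark ML:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Spark ML expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["x"], outputCol="features")

lr = LinearRegression(featuresCol="features", labelCol="y")
model = lr.fit(assembler.transform(train_df))
print(model.intercept, model.coefficients)  # w0 and w1

preds = model.transform(assembler.transform(test_df))
rmse = RegressionEvaluator(labelCol="y", metricName="rmse").evaluate(preds)
print(rmse)
```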
Linear Regression Assumptions
▪ Linear relationship between each feature and Y
▪ Observations are independent from one another
▪ Features are independent from one another
▪ The value of the residuals is not dependent on the feature values
Linear Regression Assumptions
So, which datasets are suited for linear regression?
Train vs. Test RMSE

Which is more important? Why?

[Figure: train RMSE vs. test RMSE as model complexity grows]
Evaluation Metric: R²

What is the range of R²?
Do we want it to be higher or lower?
Machine Learning Libraries

Scikit-learn is a popular single-node machine learning library.

But what if our data or model gets too big?
Machine Learning in Spark

Scale Out and Speed Up: machine learning in Spark allows us to work with bigger data and train models faster by distributing the data and computations across multiple workers.

Spark Machine Learning Libraries

MLlib
▪ Original ML API for Spark
▪ Based on RDDs
▪ In maintenance mode

Spark ML
▪ Newer ML API for Spark
▪ Based on DataFrames
▪ The supported API
LINEAR REGRESSION DEMO I
LINEAR REGRESSION LAB I
Non-numeric Features
Two primary types of non-numeric features

Categorical Features
▪ A series of categories of a single feature
▪ No intrinsic ordering
▪ e.g. Dog, Cat, Fish

Ordinal Features
▪ A series of categories of a single feature
▪ Relative ordering, but not necessarily consistent spacing
▪ e.g. Infant, Toddler, Adolescent, Teen, Young Adult, etc.
Non-numeric Features in Linear Regression

How do we handle non-numeric features for linear regression?
▪ The X-axis is numeric, so features need to be numeric
▪ Convert our non-numeric features to numeric features?

Could we assign numeric values to each of the categories?
▪ “Dog” = 1, “Cat” = 2, “Fish” = 3, etc.
▪ Does this make sense?

[Figure: life expectancy plotted against animal category]

This implies 1 Cat is equal to 2 Dogs!
Non-numeric Features in Linear Regression

What about with ordinal variables?
▪ Since ordinal variables have an order just like numbers, could this work?
▪ “Infant” = 1, “Toddler” = 2, “Child” = 3, etc.
▪ Does this make sense?

[Figure: height plotted against life stage]

Remember that the ordinal categories aren’t necessarily evenly spaced, so it’s still not perfect and not particularly scalable.
Non-numeric Features in Linear Regression
Instead, we commonly use a practice known as one-hot encoding (OHE).
▪ Creates a binary “dummy” feature for each category

Animal | Dog | Cat | Fish
Dog    |  1  |  0  |  0
Cat    |  0  |  1  |  0
Fish   |  0  |  0  |  1

▪ Doesn’t force a uniformly-spaced, ordered numeric representation
One-hot Encoding at Scale
You might be thinking...
▪ Okay, I see what’s happening here … this works for a handful of animals.
▪ But what if we have an entire zoo of animals? That would result in really wide data!

Spark uses sparse vectors for this…

DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])

▪ Sparse vectors take the form: (number of elements, [indices of non-zero elements], [values of non-zero elements])

LINEAR REGRESSION DEMO II
LINEAR REGRESSION LAB II
MLflow Tracking
MLflow

▪ Open-source platform for machine learning lifecycle


▪ Operationalizing machine learning
▪ Developed by Databricks
▪ Pre-installed on the Databricks Runtime for ML
Core Machine Learning Issues
▪ Keeping track of experiments or model development
▪ Reproducing code
▪ Comparing models
▪ Standardization of packaging and deploying models

MLflow addresses these issues.


MLflow Components
▪ MLflow Tracking
▪ MLflow Projects
▪ MLflow Models
▪ MLflow Plugins
▪ APIs: CLI, Python, R, Java, REST
MLflow Tracking
▪ Logging API
▪ Specific to machine learning
▪ Library and environment agnostic

Runs
▪ Executions of data science code
▪ E.g. a model build, an optimization run

Experiments
▪ Aggregations of runs
▪ Typically correspond to a data science project
What Gets Tracked
▪ Parameters
▪ Key-value pairs of parameters (e.g. hyperparameters)
▪ Metrics
▪ Evaluation metrics (e.g. RMSE)
▪ Artifacts
▪ Arbitrary output files (e.g. images, pickled models, data files)
▪ Source
▪ The source code from the run
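A minimal sketch of logging each of these (values hypothetical):

```python
import mlflow

with mlflow.start_run(run_name="lr-baseline"):
    mlflow.log_param("maxIter", 10)           # parameter: key-value pair
    mlflow.log_metric("rmse", 0.87)           # evaluation metric
    mlflow.log_artifact("residual_plot.png")  # arbitrary output file
```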
Examining Past Runs
▪ Querying past runs via the API
▪ MlflowClient object
▪ List experiments
▪ Search runs
▪ Return run metrics
▪ MLflow UI
▪ Built into the Databricks platform
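A minimal sketch of querying past runs via the client API (experiment ID hypothetical):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(experiment_ids=["0"],
                          order_by=["metrics.rmse ASC"], max_results=5)
for run in runs:
    print(run.info.run_id, run.data.metrics.get("rmse"))
```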
MLFLOW TRACKING DEMO
MLflow Model Registry
MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitate experimentation, testing, and production
▪ Integrate with approval and governance workflows
▪ Monitor ML deployments and their performance

Databricks MLflow Blog Post
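A minimal sketch (model name and run ID hypothetical) of registering a logged model and promoting it through stages:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged under a tracked run.
mlflow.register_model(f"runs:/{run_id}/model", "airbnb-price-model")

client = MlflowClient()
client.transition_model_version_stage(name="airbnb-price-model",
                                      version=1, stage="Production")
```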


MLFLOW MODEL REGISTRY DEMO
MLFLOW LAB
Decision Trees
Decision Making

Salary > $50,000 (root node)
├─ Yes → Commute > 1 hr (decision node)
│    ├─ Yes → Decline Offer (leaf node)
│    └─ No → Offers Free Coffee (decision node)
│         ├─ Yes → Accept Offer (leaf node)
│         └─ No → Decline Offer (leaf node)
└─ No → Decline Offer (leaf node)
Determining Splits

Candidate splits:
▪ Commute? (< 30 min | 30 min - 1 hr | > 1 hr)
▪ Bonus? (Yes | No)

Commute is a better choice because it provides information about the classification.
Creating Decision Boundaries

Salary > $50,000
├─ Yes → Commute > 1 hr
│    ├─ Yes → Decline Offer
│    └─ No → Accept Offer
└─ No → Decline Offer

[Figure: the same tree drawn as boundaries in feature space, with salary on the x-axis split at $50,000 and commute on the y-axis split at 1 hour; only the region with salary > $50,000 and commute ≤ 1 hour is “Accept Offer”]
Lines vs. Boundaries

Linear Regression
▪ Lines through data
▪ Assumed linear relationship

Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships
Linear Regression or Decision Tree?

It depends on the data...


Tree Depth

Tree depth: the length of the longest path from the root node to a leaf node.

Salary > $50,000 (depth 0)
├─ Yes → Commute > 1 hr (depth 1)
│    ├─ Yes → Decline Offer (leaf, depth 2)
│    └─ No → Offers Free Coffee (depth 2)
│         ├─ Yes → Accept Offer (leaf, depth 3)
│         └─ No → Decline Offer (leaf, depth 3)
└─ No → Decline Offer (leaf, depth 1)

Note: shallow trees tend to underfit, and deep trees tend to overfit.
Underfitting vs. Overfitting

[Figure: three fits of the same data — underfitting | just right | overfitting]
Additional Resource

R2D3 has an excellent visualization of how decision trees work.
DECISION TREE DEMO
Random Forests
Decision Trees

Pros
▪ Interpretable
▪ Simple
▪ Classification
▪ Regression
▪ Nonlinear relationships

Cons
▪ Poor accuracy
▪ High variance
Bias vs. Variance
Bias-Variance Tradeoff

Error = Variance + Bias² + noise

[Figure: total error, variance, and bias² plotted against model complexity; total error is minimized at an optimum model complexity]

▪ Reduce bias
▪ Build more complex models
▪ Reduce variance
▪ Use a lot of data
▪ Build simple models
▪ What about the noise?

https://www.explainxkcd.com/wiki/index.php/2021:_Software_Development
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
Bootstrap Sampling
A method for simulating N new datasets:
1. Take a sample with replacement from the original training set
2. Repeat N times
Bootstrap Visualization

[Figure: a training set (N = 100) and four bootstrapped samples (each N = 100) drawn from it]

Why are some points in the bootstrapped samples not selected?
Training Set Coverage
Assume we are bootstrapping N draws from a training set with N observations ...
▪ Probability of an element getting picked in each draw: 1/N
▪ Probability of an element not getting picked in each draw: 1 − 1/N
▪ Probability of an element not getting drawn in the entire sample: (1 − 1/N)^N

As N → ∞, the probability for each element of not getting picked in a sample approaches 1/e ≈ 0.368.
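A quick numeric check of this limit in plain Python:

```python
import math

# (1 - 1/N)^N approaches 1/e ≈ 0.368 as N grows.
for n in (10, 100, 10_000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", math.exp(-1))
```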
Bootstrap Aggregating
▪ Train a tree on each sample, and average the predictions
▪ This is bootstrap aggregating, commonly referred to as bagging

Bootstrap 1 → Decision Tree 1, Bootstrap 2 → Decision Tree 2, ... → Final Prediction
Random Forest Algorithm

Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K

At each split, a subset of features is considered to ensure each tree is different.
Random Forest Aggregation

Scoring record → prediction from each tree → aggregation → final prediction

▪ Majority voting for classification
▪ Mean for regression
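A minimal Spark ML sketch (data hypothetical); numTrees, maxDepth, and featureSubsetStrategy map directly to the ideas above:

```python
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol="features", labelCol="label",
    numTrees=100,                  # number of bootstrapped trees
    maxDepth=5,                    # depth of each tree
    featureSubsetStrategy="auto",  # features considered at each split
)
rf_model = rf.fit(train_vec)       # train_vec: assembled training data
```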
RANDOM FOREST DEMO
Hyperparameter Tuning
What is a Hyperparameter?

A parameter whose value is used to control the training process.

▪ Examples for Random Forest:
▪ Tree depth
▪ Number of trees
▪ Number of features to consider
Selecting Hyperparameter Values
▪ Build a model for each hyperparameter value
▪ Evaluate each model to identify the optimal hyperparameter value
▪ What dataset should we use to train and evaluate?

Training Validation Test

What if there isn’t enough data to split


into three separate sets?
K-Fold Cross Validation

Pass 1: Training | Training | Validation
Pass 2: Training | Validation | Training
Pass 3: Validation | Training | Training

Average the validation errors across passes to identify the optimal hyperparameter values.

Final Pass: Training with Optimal Hyperparameters | Test
Optimizing Hyperparameter Values
Grid Search
▪ Train and validate every unique combination of hyperparameters

Candidate values:
Tree Depth: 5, 8
Number of Trees: 2, 4

Resulting combinations:
Tree Depth | Number of Trees
5 | 2
5 | 4
8 | 2
8 | 4

Question: With 3-fold cross validation, how many models will this build?
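A minimal Spark ML sketch of this grid with 3-fold cross validation (estimator and data hypothetical):

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# 2 depths x 2 tree counts = 4 combinations.
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 8])
        .addGrid(rf.numTrees, [2, 4])
        .build())

# 4 combinations x 3 folds = 12 model fits, plus one final refit of the
# best combination on the full training set.
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
cv_model = cv.fit(train_vec)
```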
HYPERPARAMETER TUNING DEMO
HYPERPARAMETER TUNING LAB
Hyperparameter Tuning with Hyperopt
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have a non-parametric search space
▪ You want to pick your hyperparameters based on past results
Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces
▪ Serial
▪ Parallel
▪ Spark integration
▪ Three core algorithms for optimization:
▪ Random Search
▪ Tree of Parzen Estimators (TPE)
▪ Adaptive TPE

Paper
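A minimal sketch of Hyperopt's core API (objective hypothetical); SparkTrials distributes trials across the cluster:

```python
from hyperopt import fmin, tpe, hp, SparkTrials

def objective(params):
    # Train a model with these hyperparameters and return the
    # validation loss; a toy quadratic stands in here.
    return (params["x"] - 3) ** 2

space = {"x": hp.uniform("x", -10, 10)}

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=SparkTrials(parallelism=4))
print(best)  # best hyperparameter values found
```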
Optimizing Hyperparameter Values
Random Search

▪ Generally outperforms grid search


▪ Can struggle on some datasets (e.g. convex spaces)
Optimizing Hyperparameter Values
Tree of Parzen Estimators

▪ Meta-learner, Bayesian process


▪ Non-parametric densities
▪ Returns candidate hyperparameters based on best expected
improvement
▪ Provide a range and distribution for continuous and discrete
values
▪ Adaptive TPE better tunes the search space
▪ Freezes hyperparameters
▪ Tunes number of random trials before TPE
HYPEROPT DEMO
HYPEROPT LAB
MLlib Deployment Options
Data Science vs. Data Engineering
▪ Data Science != Data Engineering
▪ Data Science
▪ Scientific
▪ Art
▪ Business problems
▪ Model mathematically
▪ Optimize performance
▪ Data Engineering
▪ Reliability
▪ Scalability
▪ Maintainability
▪ SLAs
Model Operations (ModelOps)
▪ DevOps
▪ Software development and IT operations
▪ Manages deployments
▪ CI/CD of features, patches, updates, and rollbacks
▪ Agile vs. waterfall
▪ ModelOps
▪ Data modeling and deployment operations
▪ Java environments
▪ Containers
▪ Model performance monitoring
The Four ML Deployment Options
▪ Batch
▪ 80-90 percent of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
▪ Continuous/Streaming
▪ 10-15 percent of deployments
▪ Moderately fast scoring on new data
▪ Real-time
▪ 5-10 percent of deployments
▪ Usually using REST (Azure ML, SageMaker, containers)
▪ On-device
ML DEPLOYMENT DEMO
Gradient Boosted Decision Trees
Decision Tree Ensembles
▪ Combine many decision trees
▪ Random Forest
▪ Bagging (full training data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K)
▪ Independent trees
▪ Results aggregated to a final prediction
▪ There are other methods of ensembling decision trees
Boosting
▪ Sequential (one tree at a time)
▪ Each tree learns from the last
▪ The sequence of trees is the final model
Gradient Boosted Decision Trees
▪ Common boosted trees algorithm
▪ Fits each tree to the residuals of the previous tree
▪ On the first iteration, residuals are the actual label values

Model 1:
Y  | Prediction | Residual
40 | 35 | 5
60 | 67 | -7
30 | 28 | 2
33 | 32 | 1

Model 2 (fit to Model 1’s residuals):
Y  | Prediction | Residual
5  | 3  | 2
-7 | -4 | -3
2  | 3  | -1
1  | 0  | 1

Final Prediction (Model 1 + Model 2):
Y  | Prediction
40 | 38
60 | 63
30 | 31
33 | 32
Boosting vs. Bagging

GBDT
▪ Starts with high bias, low variance
▪ Works right (adds complexity as trees are added)

RF
▪ Starts with high variance, low bias
▪ Works left (averaging reduces variance)

[Figure: the bias-variance tradeoff curve of total error, variance, and bias² against model complexity, with an optimum between the two]
Gradient Boosted Decision Trees Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Highly parallel
▪ Works nicely with Spark in Scala
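A minimal single-node sketch with the Python xgboost package (data hypothetical); the Scala xgboost4j-spark integration exposes the same core parameters:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=100,   # number of boosting rounds (trees)
    max_depth=4,
    learning_rate=0.1,  # shrinks each tree's contribution
    reg_lambda=1.0,     # L2 regularization to help prevent overfitting
)
model.fit(X_train, y_train)   # X_train, y_train: hypothetical training data
preds = model.predict(X_test)
```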
XGBOOST DEMO
Appendix
Electives
The following electives are also available:

▪ Machine Learning Algorithms and Applications


▪ K-Means
▪ Logistic Regression Lab
▪ Time Series Forecasting
▪ Isolation Forests for Outlier and Fraud Detection
▪ Collaborative Filtering for Recommendation Systems Lab
▪ Tools
▪ Joblib
▪ Other
▪ Databricks Best Practices
Logistic Regression
Types of Supervised Learning

Regression
▪ Predicting a continuous output

Classification
▪ Predicting a categorical/discrete output

Types of Classification

Binary Classification
▪ Two label classes

Multiclass Classification
▪ Three or more label classes

Model output is commonly the probability of a record belonging to each of the classes.
Binary Classification
▪ Two label classes
▪ Outputs:
▪ Probability that the record is Red given a set of features
▪ Probability that the record is Blue given a set of features
▪ Reminders:
▪ Probabilities are bounded between 0 and 1
▪ And linear regression returns any real number
Bounding Binary Classification Probabilities
How can we keep model outputs between 0 and 1?

▪ Logistic function: σ(z) = 1 / (1 + e^(−z))
▪ Large positive inputs → 1
▪ Large negative inputs → 0
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x], where x represents the features

But we need class predictions, not probability predictions
▪ Set a threshold on the probability predictions
▪ 𝐏[y = 1 | x] < 0.5 → y = 0
▪ 𝐏[y = 1 | x] ≥ 0.5 → y = 1
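A minimal Spark ML sketch (column names and data hypothetical); LogisticRegression outputs these probabilities and applies the 0.5 threshold by default:

```python
from pyspark.ml.classification import LogisticRegression

# threshold=0.5 is the default; shown explicitly for clarity.
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        threshold=0.5)
lr_model = lr.fit(train_vec)          # train_vec: assembled training data
lr_model.transform(test_vec).select("probability", "prediction").show(5)
```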
Evaluating Binary Classification Models
▪ How can the model be wrong?
▪ Type I error: false positive
▪ Type II error: false negative
▪ Represent these errors with a confusion matrix
Binary Classification Metrics

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
K-Means
Clustering
▪ Unsupervised learning
▪ Unlabeled data (no known function output)
▪ Categorize records based on features
K-Means Clustering

▪ Most common clustering algorithm


▪ Number of clusters, K, is manually
chosen
▪ Each cluster has a centroid
▪ Objective of minimizing the total
distance between all of the points and
their assigned centroid
K-Means Algorithm
▪ Step 1: Randomly create centroids for k clusters
▪ Repeat until convergence/stopping criteria:
▪ Step 2: Assign each data point to the cluster with the
closest centroid
▪ Step 3: Move the cluster centroids to the average location
of their assigned data points
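A minimal Spark ML sketch (feature column and data hypothetical):

```python
from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=3, seed=42)  # K chosen manually
km_model = kmeans.fit(df_vec)            # df_vec: assembled feature vectors
print(km_model.clusterCenters())         # one centroid per cluster
clustered = km_model.transform(df_vec)   # adds a "prediction" column
```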
Visualizing K-Means
Choosing the Number of Clusters
▪ K is a hyperparameter
▪ Methods of identifying the optimal K
▪ Prior knowledge
▪ Visualizing data
▪ Elbow method for within-cluster distance

Note: Error will always decrease as K increases, unless a penalty is imposed.


Issues with K-Means
▪ Local optima vs. global optima: the algorithm can converge to a local minimum rather than the global minimum
▪ Straight-line (Euclidean) distance
Other Clustering Techniques
Collaborative Filtering
Recommendation Systems
Naive Approaches to Recommendation
▪ Hand-curated lists
▪ Aggregates

Question: What are problems with these approaches?
Content-based Recommendation
▪ Idea: Recommend items to a customer that are similar to other items
the customer liked
▪ Creates a profile for each user or product
▪ User: demographic info, ratings, etc.
▪ Item: genre, flavor, brand, actor list, etc.
Content-based Recommendation
▪ Advantages
▪ No need for data from other users
▪ New item recommendations
▪ Disadvantages
▪ Cold-start problem
▪ Determining appropriate feature comparisons
▪ Implicit information
Collaborative Filtering
▪ Idea: Make recommendations for one customer (filtering) by collecting
and analyzing the interests of many users (collaboration)
▪ Advantages over content-based recommendation
▪ Relies only on past user behavior (no profile creation)
▪ Domain independent
▪ Generally more accurate
▪ Disadvantages
▪ Extremely susceptible to cold-start problem (user and item)
Types of Collaborative Filtering
▪ Neighborhood Methods: Compute relationships between items or
users
▪ Computationally expensive
▪ Not empirically as good
▪ Latent Factor Models: Explain the ratings by characterizing items and
users by small number of inferred factors
▪ Matrix factorization
▪ Characterizes both items and users by vectors of factors from
item-rating pattern
▪ Explicit feedback: sparse matrix
▪ Scalable
Latent Factor Approach
Ratings Matrix
Matrix Factorization
Alternating Least Squares
▪ Step 1: Randomly initialize user and movie factors
▪ Step 2: Repeat the following
1. Fix the movie factors, and optimize user factors
2. Fix the user factors, and optimize movie factors
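A minimal Spark ML sketch (column names and data hypothetical) of distributed ALS on explicit ratings:

```python
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10,                    # number of latent factors
          maxIter=10,                 # alternating optimization passes
          coldStartStrategy="drop")   # drop predictions for unseen users/items
als_model = als.fit(ratings_df)
recs = als_model.recommendForAllUsers(5)  # top-5 items per user
```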
Why not SVD?
▪ The matrix is too sparse
▪ Imputation can be inaccurate
▪ Imputation can be expensive
Distributed ALS Implementation
▪ Naive approach
▪ Broadcast R, U, and V
▪ Problems?
▪ R is large, and it’s duplicating copies for each worker
▪ Better approach
▪ Distribute R and broadcast U and V
▪ Problems?
▪ U and V might be large, too, and we’re still duplicating copies
▪ Best approach
▪ Join ALS
Join ALS
Blocked Join ALS
▪ Spark implements a smarter version of Join ALS
▪ Limits data shuffling
▪ ALS is a distributed model (i.e. stored across executors)
