Deep Learning Models at Scale With Apache Spark: May 31, 2019, ASA SDSS
About me
Joseph Bradley
• Software Engineer at Databricks
• Apache Spark committer & PMC member
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
MISSION: Making Big Data Simple
PRODUCT: Unified Analytics Platform
Try for free today: databricks.com
Apache Spark + AI
Apache Spark: The First Unified Analytics Engine
[Diagram labels: Runtime, Delta, Spark Core Engine]
AI is re-shaping the world
Disruptive innovations are affecting enterprises across the planet
Better AI needs more data
When AI goes distributed ...
Larger datasets
→ More need for distributed training
→ More open-source offerings: distributed TensorFlow, Horovod, distributed MXNet
Challenges
Two user stories
As a data scientist,
Distributed training
[Diagram: load using a Spark cluster → save data → fit on a GPU cluster → model]
Streaming model inference
required:
● save to stream sink
● GPU for fast inference
A hybrid Spark and AI cluster?
[Diagram, training: load using a Spark cluster w/ GPUs → fit a model distributedly on the same cluster → model]
[Diagram, inference: load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model]
Unfortunately, it doesn’t work out of the box.
Project Hydrogen
Big data for AI
There are many efforts from the Spark community to
integrate Spark with AI/ML frameworks:
● (Yahoo) CaffeOnSpark, TensorFlowOnSpark
● (Intel) BigDL
● (John Snow Labs) Spark-NLP
● (Databricks) TensorFrames, Deep Learning Pipelines,
spark-sklearn
● … 80+ ML/AI packages on spark-packages.org
Project Hydrogen to fill the major gaps
From Databricks:
● Our use of Project Hydrogen features
● Lessons learned and best practices
Story #1:
Distributed training
[Diagram: load using a Spark cluster w/ GPUs → fit a model on the same cluster → model]
Project Hydrogen
Distributed training
Complete coordination among tasks
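import horovod.tensorflow as hvd  # assumes the TensorFlow flavor of Horovod
from sparkdl import HorovodRunner

def train_hvd():
    hvd.init()
    ...  # train using Horovod

HorovodRunner(np=2).run(train_hvd)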
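The "complete coordination among tasks" requirement can be illustrated without Spark. The following is a minimal sketch in plain Python, not the HorovodRunner implementation: it uses `threading.Barrier` as a stand-in for gang scheduling, so that no worker starts training until all workers have launched. The names `run_gang` and `fake_train` are hypothetical and exist only for this illustration.

```python
import threading


def run_gang(num_tasks, train_fn):
    """Run train_fn in num_tasks threads that all start together."""
    barrier = threading.Barrier(num_tasks)
    results = [None] * num_tasks

    def task(rank):
        # Block until every task has launched; this mimics all-or-nothing
        # scheduling, where training cannot begin with a partial set of workers.
        barrier.wait(timeout=10)
        results[rank] = train_fn(rank, num_tasks)

    threads = [threading.Thread(target=task, args=(r,)) for r in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_train(rank, size):
    # Hypothetical stand-in for a Horovod training function.
    return f"worker {rank} of {size} done"


outputs = run_gang(2, fake_train)
print(outputs)
```

If one worker never arrives, the barrier times out and the others fail instead of training with a partial group, which is the behavior distributed training frameworks generally expect.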
Implementation of HorovodRunner
[Diagram: Executor 0 and Executor 1 each have GPU:0 and GPU:1. Tasks 0 to 3 are each placed on a GPU; the GPU for Task 4 is marked "?".]
Workarounds (a.k.a. hacks)
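import tensorflow as tf
from pyspark import TaskContext

# API as shown in the talk; Spark 3.0 later shipped TaskContext.resources()["gpu"].addresses
context = TaskContext.get()
assigned_gpu = context.getResources()["gpu"][0]
with tf.device(assigned_gpu):
    ...  # training code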
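To make the GPU assignment problem concrete, here is a small sketch in plain Python of the bookkeeping that a scheduler would need per executor: each task is handed a free GPU address, and a task that arrives when all GPUs are busy gets nothing until one is released. The `GpuPool` class is hypothetical and is not part of Spark; it only illustrates the idea behind handing each task an explicit GPU address.

```python
class GpuPool:
    """Tracks which GPU addresses on one executor are free."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.assigned = {}

    def acquire(self, task_id):
        # Return a GPU address for the task, or None if all GPUs are busy.
        if not self.free:
            return None
        gpu = self.free.pop(0)
        self.assigned[task_id] = gpu
        return gpu

    def release(self, task_id):
        # Return the task's GPU to the free list.
        gpu = self.assigned.pop(task_id)
        self.free.append(gpu)


pool = GpuPool(["0", "1"])
first = pool.acquire("task-0")
second = pool.acquire("task-1")
third = pool.acquire("task-4")  # no free GPU left on this executor
pool.release("task-0")
retry = pool.acquire("task-4")  # succeeds once a GPU is released
print(first, second, third, retry)
```

Without this kind of tracking, two tasks on the same executor can end up using the same GPU, which is the situation the question mark in the diagram points at.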
Cluster manager support
● Standalone: SPARK-27360
● YARN: SPARK-27361
● Kubernetes: SPARK-27362
● Mesos: SPARK-27363
Future features in discussion
JIRA: SPARK-24579
Pandas User-Defined Function (UDF)
Pandas UDFs were introduced in Spark 2.3:
• Pandas for vectorized computation
• Apache Arrow for data exchange
Pandas UDF for distributed inference
from pyspark.sql.functions import col, pandas_udf

@pandas_udf(...)
def predict(features):
    ...

spark.readStream(...) \
    .withColumn('prediction', predict(col('features')))
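The benefit of a vectorized UDF for inference is that the prediction function is called once per batch of rows rather than once per row. The sketch below shows that calling pattern in plain Python, without Spark, pandas, or Arrow. The helper `batches`, the function `predict_batch`, and the "model" (a sum of the features) are all hypothetical stand-ins chosen for illustration.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


call_count = 0


def predict_batch(features):
    # Hypothetical model: the prediction is the sum of each feature vector.
    # A real model would score the whole batch in one call, for example on a GPU.
    global call_count
    call_count += 1
    return [sum(f) for f in features]


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
predictions = []
for batch in batches(rows, batch_size=2):
    predictions.extend(predict_batch(batch))

print(predictions, call_count)
```

Five rows are scored with three calls here; with realistic batch sizes, the per-call overhead of invoking the model is spread over many more rows.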
Support for complex return types
@pandas_udf(...)
def predict(features):
    # ...
    return pd.DataFrame({'labels': labels, 'scores': scores})
Project Hydrogen
• Apache Spark JIRA & dev mailing list
• spark.apache.org
Acknowledgements