SPARK
Afef Bahri
Cours: ESC Tunis
1 23/12/2023
Plan
Motivation
From MapReduce to Spark
RDD
Action execution
Spark Cluster
2 23/12/2023
Brief history of Spark
Timeline
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark top-level
MapReduce introduced a general batch-processing paradigm,
but it had limitations:
programming in MapReduce is difficult
batch processing does not fit many use cases
3 23/12/2023
Motivations
Explosion of data: social media such as Twitter feeds,
Facebook posts, SMS
Example: how can you find out what your customers want and
offer it to them right away?
Response time: a batch job may take too long to complete
4 23/12/2023
MapReduce to Spark
Time: job run times are not acceptable in many situations
5 23/12/2023
Spark
Apache Spark [1] is an in-memory distributed computing
platform designed for large-scale data processing
6 23/12/2023
Spark
Speed
Because of its in-memory implementation, Spark can be up to 100 times
faster than Hadoop [3]
8 23/12/2023
Spark
Expands on Hadoop's capabilities
9 23/12/2023
Spark
Data scientist
Analyze and model the data to obtain insights using ad-hoc analysis
Transform the data into a usable format
SQL, statistics, machine learning; Python, MATLAB, or R
Data engineers
Develop a data processing system or application
Inspect, monitor, and tune their applications
Program with Spark's API
Everyone else
Ease of use
Wide variety of functionality
Mature and reliable
10 23/12/2023
Spark use cases
Data Streaming
Machine Learning
Collaborative Filtering
Interactive Analysis
11 23/12/2023
Spark unified stack
The Spark core is at the center of the Spark Unified Stack
12 23/12/2023
Spark Unified Stack
Various add-in components can run on top of the core
13 23/12/2023
SPARK Core
A general-purpose system for scheduling, distributing, and
monitoring applications across a cluster
15 23/12/2023
Spark Unified Stack: High level
On top of RDDs and DataFrames, Spark provides two
higher-level data access models for processing semi-
structured data
Spark SQL is designed to work with Spark via SQL and
HiveQL (a Hive variant of SQL)
17 23/12/2023
Spark Unified Stack: High level
GraphX is a graph processing library with APIs to manipulate
graphs and perform graph-parallel computations
page rank
connected components
shortest paths, and others
18 23/12/2023
Spark VS Hadoop ecosystem
Hadoop ecosystem -> Spark equivalent
Apache Storm -> Spark Streaming
Apache Giraph -> Spark GraphX
Apache Pig and Apache Sqoop -> Spark Core and Spark SQL (Spork)
19 23/12/2023
RDD
20 23/12/2023
Need for RDD
Data reuse is common in many iterative machine learning
and graph algorithms
21 23/12/2023
Need for RDD
In most frameworks, the only way to reuse data between
computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system (e.g., HDFS)
22 23/12/2023
Need for RDD
In a distributed computing system, data is stored in an
intermediate stable distributed store such as HDFS
This makes job computation slower, since it involves many I/O
operations, replications, and serializations along the way
23 23/12/2023
Need for RDD
RDDs solve these problems by enabling fault-tolerant,
distributed, in-memory computation
24 23/12/2023
SPARK RDD
RDD stands for “Resilient Distributed Dataset”. It is the
fundamental data structure of Apache Spark
A dataset represents the records of the data you work with. The user
can load the dataset from an external source, such as a JSON file,
a CSV file, a text file, or a database via JDBC, with no specific data
structure required
25 23/12/2023
SPARK Core: RDD
Spark's primary abstraction: a distributed collection of
elements, parallelized across the cluster
26 23/12/2023
RDD
RDD: Resilient Distributed Dataset
Operations (a task; a program is a set of tasks):
Transformations
Lazy evaluation: create a new RDD and execute nothing
Input: an RDD
Output: an RDD that points to the previous RDD
Example: map
Actions:
Trigger the execution of all the previous operations (the transformations defined in the DAG) plus the action itself: the lineage
Input: an RDD
Output: not an RDD, but a value
Storage: external storage
Example: reduce
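A minimal Scala sketch of the distinction (assuming a running SparkContext sc):
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)        // transformation: lazy, nothing is executed yet
val total   = doubled.reduce(_ + _)  // action: triggers execution of the lineage, returns 110 to the driver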
27 23/12/2023
DAG
A program is a set of tasks/operations
(transformations and actions)
Transformations: map, filter, flatMap, join
Actions: reduce, count
Executing a program:
Spark creates a DAG: a graph whose nodes are RDDs and whose edges
are transformations
Execution is triggered by an action
28 23/12/2023
Transformation RDD
A transformation is a function that produces a new RDD from existing RDDs
map, filter, …
Thus, the input RDDs cannot be changed, since RDDs are immutable in
nature
29 23/12/2023
Transformation RDD
Example
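A minimal sketch of a transformation (the names and data are assumptions):
val nums  = sc.parallelize(Seq(1, 2, 3, 4, 5))
val evens = nums.filter(_ % 2 == 0)   // new RDD; nums itself is unchanged (RDDs are immutable)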
30 23/12/2023
RDD transformations
Transformations do not return a value to the driver; they return a new RDD
Lazy evaluation
32 23/12/2023
Action RDD
Transformations create RDDs from each other; an action triggers the computation and returns a result to the driver
33 23/12/2023
RDD lineage
Applying transformations builds an RDD lineage: the chain of
all the parent RDDs of the final RDD(s)
35 23/12/2023
RDD operations: Transformations
These are some of the available transformations
Transformations are lazily evaluated
Each returns a pointer to the transformed RDD
36 23/12/2023
flatMap
The flatMap function is similar to map, but each input item
can be mapped to 0 or more output items
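A minimal sketch contrasting map and flatMap (the data is an assumption):
val lines  = sc.parallelize(Seq("hello world", "spark"))
val words  = lines.flatMap(_.split(" "))   // 3 elements: "hello", "world", "spark"
val arrays = lines.map(_.split(" "))       // 2 elements, each an Array[String]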
38 23/12/2023
RDD operations: Actions
Actions return values to the driver program
39 23/12/2023
RDD operations: Actions
collect: returns all the elements of the dataset as an array to
the driver program
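A few common actions, sketched on an assumed RDD of numbers:
val nums = sc.parallelize(1 to 5)
nums.collect()   // Array(1, 2, 3, 4, 5), returned to the driver
nums.count()     // 5
nums.first()     // 1
nums.take(3)     // Array(1, 2, 3)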
41 23/12/2023
RDD basic operations
Loading a file
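A minimal sketch (the file name is an assumption); this lines RDD is reused on the next slides:
val lines = sc.textFile("data.txt")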
42 23/12/2023
RDD basic operations
Applying transformation
val lineLengths = lines.map(s => s.length)
43 23/12/2023
RDD basic operations
Invoking action
val totalLengths = lineLengths.reduce((a,b) => a + b)
44 23/12/2023
RDD basic operations
View the DAG: display the series of transformations that produced an RDD with
lineLengths.toDebugString
Read the DAG from the bottom up
Example: the following DAG starts with a textFile and goes
through a series of transformations such as map and filter,
followed by more map operations (a sketch follows)
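A minimal sketch matching that description (file name and transformations are assumptions); the printed lineage reads from the bottom up:
val lengths = sc.textFile("data.txt")
  .map(_.trim)
  .filter(_.nonEmpty)
  .map(_.length)
println(lengths.toDebugString)   // textFile at the bottom, then map, filter, map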
45 23/12/2023
What happens when an action is executed? (1/8)
Example: code to analyze some log files (sketched below)
The first line loads the log from the Hadoop file system (HDFS)
The next two lines filter the log messages, keeping only the errors
The filtered dataset is then cached
Further filters extract the specific error messages related to
mysql and php, followed by the count action (number of errors)
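A minimal Scala sketch of this example (the HDFS path and message format are assumptions):
val logFile = sc.textFile("hdfs:///logs/app.log")
val errors  = logFile.filter(_.contains("ERROR"))                 // keep only error messages
errors.cache()                                                    // cache the filtered dataset
val mysqlErrors = errors.filter(_.contains("mysql")).count()      // first action: reads from HDFS
val phpErrors   = errors.filter(_.contains("php")).count()        // second action: served from the cache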
46 23/12/2023
What happens when an action is executed? (2/8)
Apply the transformations: filter produces the messages RDD (2) from the log RDD (1)
48 23/12/2023
What happens when an action is executed? (3/8)
49 23/12/2023
What happens when an action is executed? (4/8)
The executors read the HDFS blocks to prepare the data
for the operations in parallel
50 23/12/2023
What happens when an action is executed? (5/8)
After a series of transformations, you can cache the results
up to that point in memory
A cache is created
51 23/12/2023
What happens when an action is executed? (6/8)
After the first action completes, the results are sent back to
the driver
52 23/12/2023
What happens when an action is executed? (7/8)
To process the second action, Spark will use the data on
the cache; it does not need to go to the HDFS data again
53 23/12/2023
What happens when an action is executed? (8/8)
Finally the results are sent back to the driver and you have
completed a full cycle
54 23/12/2023
Spark Cluster overview
55 23/12/2023
Spark cluster overview
Components: Driver, Cluster Manager, Executors
[Figure: example of a cluster configuration]
56 23/12/2023
Spark cluster overview
There are three main components of a Spark cluster: the driver, the cluster manager, and the executors
The executors are the processes that run computations and store
data for the application
57 23/12/2023
Spark context
SparkContext is the entry point of Spark functionality.
59 23/12/2023
Driver
The process where the SparkContext is created, within the main
program
60 23/12/2023
Executor
The executors are the processes that run computations and
store the data for the application
61 23/12/2023
Executor
Each application gets its own executor processes
The executors stay up for the entire duration of the application
The benefit is that applications are isolated from each other,
both on the scheduling side and because they run in different
JVMs
You need to externalize the data if you wish to share it
between different applications (i.e., different instances of SparkContext)
62 23/12/2023
Programming with Spark
63 23/12/2023
Programming with Spark
Spark with interactive shells: (TP3, Shell)
64 23/12/2023
Initializing spark: scala/java/python
Build a SparkConf object that contains information about
your application
(Code examples in Scala, Java, and Python)
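A minimal Scala sketch (the application name and master URL are assumptions):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")
val sc   = new SparkContext(conf)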
66 23/12/2023
Review
67 23/12/2023
Data.csv on disk
Loading Data.csv creates a corresponding RDD, numbered RDD1
Running the process means running a set of tasks:
map
filter
count
Two types of tasks: transformations and actions
Transformation: transforms RDD1 into RDD2 (lazy
evaluation)
RDD1 --map--> RDD2 --filter--> RDD3 --count--> result
A DAG is created, then the DAG is executed
68 23/12/2023
Data.csv on disk
Loading Data.csv creates a corresponding RDD, numbered RDD1
Running the process means running a set of tasks:
map
filter
count
Transformation: transforms RDD1 into RDD2
map(RDD1) -> RDD2
filter(RDD2) -> RDD3
count(RDD3) -> output: the action triggers the execution of
all the tasks (map, filter, count)
Lazy evaluation (transformations); see the sketch below
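A minimal Scala sketch of this pipeline (the split and filter logic are assumptions):
val rdd1   = sc.textFile("Data.csv")            // RDD1
val rdd2   = rdd1.map(_.split(","))             // RDD2 (transformation, lazy)
val rdd3   = rdd2.filter(_.length > 1)          // RDD3 (transformation, lazy)
val result = rdd3.count()                       // action: the DAG (map, filter, count) is executed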
69 23/12/2023
[Diagram: a dataset in external storage is loaded into an RDD; applying map produces a new RDD]
70 23/12/2023
Example: MapReduce Job With Hadoop
MapReduce job: map, reduce, map, reduce. With Hadoop, each stage writes its intermediate results to disk (HDFS: replicated, partitioned)
[Diagram from the online course slides: map1 -> reduce1 -> map2 -> reduce2, with HDFS between stages]
71 23/12/2023
Example: mapReduce job with Spark
MapReduce job: map, reduce, map, reduce. With Spark, each transformation creates a new RDD that points to its parent, and intermediate results stay in memory
[Diagram: dataset -> RDD, map -> new RDD, reduce -> result]
72 23/12/2023
Appendix
73 23/12/2023
Compatible versions of software
Scala:
Spark 1.6.3 uses Scala 2.10
Spark 2.1.1 is built and distributed to work with Scala 2.11 by default
To write applications in Scala, you will need to use a compatible Scala
version (e.g. 2.10.X)
Python:
Spark 1.x works with Python 2.6 or higher (but not yet with Python 3)
Java:
Spark 1.x works with Java 6 and higher - and Java 8 supports lambda
expressions
74 23/12/2023
Linking with Spark
Scala
Java
Python
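A minimal build.sbt sketch, assuming sbt and the Spark 1.6.3 / Scala 2.10 combination listed in the version table:
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"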
75 23/12/2023
Initializing spark: Spark Properties
SparkConf: allows you to configure some of the common properties
master URL
application name
76 23/12/2023
Resilient Distributed Datasets (RDDs)
There are two ways to create RDDs: parallelizing an existing collection in the driver program, or referencing a dataset in an external storage system (e.g., HDFS)
distData.reduce((a, b) => a + b)
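The distData in the line above comes from parallelizing a collection; a minimal sketch (the values are an assumption):
val data     = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
val sum      = distData.reduce((a, b) => a + b)   // 15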
79 23/12/2023
Resilient Distributed Datasets (RDDs)
There are two ways to create RDDs: the second is referencing a dataset in an external storage system, such as HDFS or another Hadoop-supported data source
80 23/12/2023
RDD operations with scala
Define an RDD and run a map followed by a reduce (sketched below)
At this point Spark breaks the computation into tasks to run on separate machines
Each machine runs both its part of the map and a local reduction, returning only
its answer to the driver program
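A sketch of what the missing snippet likely looked like, mirroring the earlier lines/lineLengths example (the file name is an assumption):
val lines       = sc.textFile("data.txt")     // define the RDD
val lineLengths = lines.map(_.length)         // transformation, run as tasks on each partition
val total       = lineLengths.reduce(_ + _)   // each executor reduces its partitions locally, the driver combines the answers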
81 23/12/2023
Run spark application: dependencies
1. Define the dependencies using any build system
(Ant, SBT, Maven, Gradle)
82 23/12/2023
Run spark application: dependencies
pom.xml
83 23/12/2023
Run spark application: dependencies
Create a zip file (e.g., abc.zip) containing all your
dependencies, and give the zip file name when creating the
Spark context:
84 23/12/2023
Run spark application: dependencies
2. Create a JAR package containing the application's code
and submit it to run the program
Scala: sbt
Python: spark-submit
85 23/12/2023
RDD Features
86 23/12/2023
Spark RDD features
In-memory Computation
Spark RDDs support in-memory computation
Lazy Evaluations
All transformations in Apache Spark are lazy, in that they do
not compute their results right away
88 23/12/2023
Spark RDD features
Immutability
Data is safe to share across processes
An RDD can also be created or re-created at any time, which makes
caching, sharing, and replication easy
Thus, immutability is a way to reach consistency in computations
Partitioning
A partition is the fundamental unit of parallelism in a Spark
RDD
Each partition is one logical division of the data, which is immutable
New partitions can be created by applying transformations to
existing partitions
89 23/12/2023
Spark RDD features
Persistence
Users can state which RDDs they will reuse and choose a
storage strategy for them (e.g., in memory or on disk)
Coarse-grained Operations
Operations apply to all the elements of a dataset, e.g.:
map
filter
groupBy
90 23/12/2023
Spark RDD features
Location-Stickiness
RDDs can define a placement preference for computing their
partitions
91 23/12/2023
How RDDs are represented
RDDs are made up of 4 parts:
Partitions: atomic pieces of the dataset; one or many per
compute node
Dependencies: the parent RDDs each partition is derived from
A compute function: how to derive the dataset from the parent RDDs
Metadata: about the partitioning scheme and data placement
92 23/12/2023
RDD: logical vs physical partition
Every dataset in an RDD is logically partitioned across many
servers so that it can be computed on different nodes of
the cluster
93 23/12/2023
RDD persistence
94 23/12/2023
RDD persistence: caching
One of Spark's key capabilities is its speed, achieved by
persisting or caching datasets in memory
96 23/12/2023
RDD persistence:
Two methods for RDD persistence: persist(), cache()
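A minimal sketch (the storage level choice is an assumption):
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("app.log").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)   // explicit storage level
// errors.cache() would be shorthand for errors.persist(StorageLevel.MEMORY_ONLY)
errors.count()                                 // the first action materializes and persists the RDD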
97 23/12/2023
RDD persistence strategies
98 23/12/2023
Best practices for which storage level to choose
MEMORY_ONLY: the default storage level, and the best choice
The most CPU-efficient option
Allows operations on the RDDs to run as fast as possible
It is the fastest option and takes full advantage of Spark's design
MEMORY_ONLY_SER:
Use it with a fast serialization library to make objects more
space-efficient
Is reasonably fast to access
99 23/12/2023
Best practices for which storage level to choose
DISK_ONLY ?
When to use disk-only storage
100 23/12/2023
Best practices for which storage level to choose
The experimental OFF_HEAP mode has several
advantages:
102 23/12/2023
RDD Dependencies
103 23/12/2023
Dependencies
Transformations can have two kinds of dependencies:
Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD; no shuffle necessary
104 23/12/2023
Dependencies
Wide dependencies: the elements required to compute the
records in a single partition may live in many partitions of
the parent RDD, so a shuffle is necessary
105 23/12/2023
Narrow dependencies vs. wide dependencies
106 23/12/2023
Narrow dependencies
Transformations with (usually) narrow dependencies:
map
mapValues
flatMap
filter
mapPartitions
mapPartitionsWithIndex
Narrow dependency objects
OneToOneDependency
PruneDependency
RangeDependency
107 23/12/2023
Wide dependencies
Transformations with (usually) wide dependencies (these might cause a shuffle); a sketch follows the list:
cogroup
groupWith
join
leftOuterJoin
rightOuterJoin
groupByKey
reduceByKey
combineByKey
distinct
intersection
repartition
coalesce
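A minimal sketch contrasting the two kinds (data and keys are assumptions); toDebugString shows the shuffle boundary:
val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs  = words.map(w => (w, 1))            // narrow dependency: no shuffle
val counts = pairs.reduceByKey(_ + _)          // wide dependency: data is shuffled across partitions
println(counts.toDebugString)                  // the ShuffledRDD marks the shuffle boundary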
109 23/12/2023
Spark functions
110 23/12/2023
Spark libraries
Extensions of the core Spark API
Improvements made to the core are passed to these libraries
Little overhead to use with the Spark core
111 23/12/2023
Spark SQL
Allows relational queries expressed in
SQL
HiveQL
Scala
SchemaRDD
Row objects
Schema
Created from:
Existing RDD
Parquet file
JSON dataset
HiveQL against Apache Hive
Supports Scala, Java, R, and Python
112 23/12/2023
Spark Streaming
Scalable, high-throughput, fault-tolerant stream processing of live data
streams
Receives live input data and divides it into small batches, which are processed
and returned as batches
DStream - sequence of RDD
Currently supports Scala, Java, and Python
113 23/12/2023
Spark streaming internals
The input stream (DStream) goes into Spark Streaming
The data is broken up into batches that are fed into the Spark
engine for processing
The final results are generated as a stream of batches
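A minimal word-count sketch (host, port, and batch interval are assumptions):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))        // 1-second batches

val lines  = ssc.socketTextStream("localhost", 9999)     // DStream of text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()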
114 23/12/2023
MLlib
MLlib is Spark's machine learning library, under active
development
It currently provides the following common algorithms and
utilities (a short example follows the list):
Classification
Regression
Clustering
Collaborative filtering
Dimensionality reduction
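A minimal clustering sketch with the RDD-based MLlib API (the file name, k, and iteration count are assumptions):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data   = sc.textFile("kmeans_data.txt")                                  // one point per line, space-separated
val points = data.map(l => Vectors.dense(l.split(' ').map(_.toDouble))).cache()
val model  = KMeans.train(points, 2, 20)                                     // k = 2, maxIterations = 20
println(model.clusterCenters.mkString(", "))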
115 23/12/2023
GraphX
GraphX for graph processing
Graphs and graph parallel computation
Social networks and language modeling
116 23/12/2023
Berkeley Data Analytics Stack (BDAS)
117 23/12/2023
Spark Monitoring
118 23/12/2023
Spark monitoring
Three ways to monitor Spark applications
1. Web UI
Port 4040 (lab exercise on port 8088)
Available for the duration of the application
119 23/12/2023
Spark monitoring
2. Metrics
Based on the Coda Hale Metrics Library
Report to a variety of sinks (HTTP, JMX, and CSV)
/conf/metrics.properties
3. External instrumentation
Cluster-wide monitoring tool (Ganglia)
OS profiling tools (dstat, iostat, iotop)
JVM utilities (jstack, jmap, jstat, jconsole)
120 23/12/2023
Spark Monitoring Web UI
121 23/12/2023
Spark Monitoring web UI
122 23/12/2023
Spark Monitoring: Metrics
123 23/12/2023
Spark monitoring: Ganglia
124 23/12/2023
Conclusion
Purpose of Apache Spark in the Hadoop ecosystem