Data Engineering 101 - Spark Concepts
Shwetank Singh, GritSetGrow - GSGLearn.com
Spark: all the concepts you need to get started.
Driver Program
The main program that runs the user's
application and creates the SparkContext.
Example
Initializing a SparkContext:
from pyspark import SparkContext
sc = SparkContext("local", "App")
Executors
Worker nodes that execute tasks and store
data for the application.
Example
Executors process tasks and cache data in
memory.
Cluster Manager
Manages resources and schedules jobs across
the cluster.
Example
Using YARN, Mesos, or Standalone mode for
resource management.
SparkContext
The entry point to Spark functionality.
Example
sc = SparkContext("local", "App")
SparkSession
The entry point for DataFrame and SQL
functionality.
Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("App").getOrCreate()
Resilient Distributed
Datasets (RDDs)
Immutable distributed collections of objects
that can be processed in parallel.
Example
rdd = sc.parallelize([1, 2, 3, 4, 5]).
Transformations
Operations that create a new RDD from an
existing one.
Example
rdd2 = rdd.map(lambda x: x * 2).
Actions
Operations that trigger computation and
return a result to the driver.
Example
count = rdd.count().
Lazy Evaluation
Transformations are not executed until an
action is called.
Example
Defining transformations but no computation
until action:
rdd2 = rdd.map(lambda x: x * 2)  # nothing is computed yet
rdd2.count()  # the action triggers execution
DAG (Directed Acyclic Graph)
The execution graph Spark builds from the chain of transformations; it is split into stages and tasks when an action runs.
Example
Viewing the DAG for a job in the Spark UI.
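A minimal sketch (the data is illustrative): chaining transformations builds the DAG, and the RDD's debug string, or the Spark UI's DAG visualization, shows the resulting stages:
rdd = sc.parallelize(range(10))
pairs = rdd.map(lambda x: (x % 2, x))  # narrow transformation
sums = pairs.reduceByKey(lambda a, b: a + b)  # wide transformation adds a stage boundary
print(sums.toDebugString().decode())  # textual view of the lineage/DAG
sums.collect()  # the action submits the DAG as a job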
Spark SQL
A module for working with structured data
using SQL.
Example
df = spark.sql("SELECT * FROM table").
DataFrame
A distributed collection of data organized into
named columns.
Example
df = spark.read.json("file.json").
Dataset
A strongly typed, distributed collection of objects, available in the Scala and Java APIs.
Example
Scala: val ds = spark.createDataset(Seq((1, "Alice"), (2, "Bob")))
Catalyst Optimizer
Spark's query optimizer that generates
efficient execution plans.
Example
Analyzing query plans with df.explain().
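A minimal sketch (the column name is illustrative) of inspecting the plans Catalyst produces:
df = spark.range(100).withColumnRenamed("id", "value")
df.filter(df["value"] > 10).select("value").explain(True)  # prints parsed, analyzed, optimized and physical plans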
PySpark
The Python API for Spark, allowing Python
developers to use Spark's features.
Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("App").getOrCreate()
SQL
Allows execution of SQL queries in PySpark.
Example
spark.sql("SELECT COUNT(*) FROM table").
DataFrame
Similar to pandas DataFrame, but distributed
across a cluster.
Example
df = spark.read.csv("file.csv").
RDD
Low-level API for distributed data
manipulation.
Example
rdd = sc.textFile("file.txt").
UDF
User-defined functions for extending
PySpark's built-in functions.
Example
spark.udf.register("add_one", lambda x: x + 1,
IntegerType()).
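Once registered, the UDF can be called from SQL; a short usage sketch, assuming a DataFrame df with an age column:
df.createOrReplaceTempView("people")
spark.sql("SELECT add_one(age) AS age_plus_one FROM people").show()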
Window Functions
Functions that allow calculations across a set
of rows related to the current row.
Example
df.withColumn("rank", row_number()
.over(Window.partitionBy("dept")
.orderBy("salary"))).
Machine Learning
Using MLlib for scalable machine learning in
PySpark.
Example
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
Streaming
Processing real-time data streams with Spark
Streaming.
Example
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
GraphFrames
A library for graph processing using PySpark.
Example
from graphframes import GraphFrame
g = GraphFrame(vertices, edges)
Accumulators
Variables that can only be "added" to, used
for aggregating information across executors.
Example
accum = sc.accumulator(0).
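A short usage sketch: executors add to the accumulator inside an action and the driver reads the result:
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
print(accum.value)  # 10, visible only on the driver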
Broadcast Variables
Variables that are cached on each machine to
efficiently distribute large values.
Example
broadcastVar = sc.broadcast([1, 2, 3]).
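Tasks read the broadcast value through .value instead of shipping it with every closure; a short usage sketch:
broadcastVar = sc.broadcast([1, 2, 3])
rdd = sc.parallelize([0, 1, 2, 3, 4])
rdd.filter(lambda x: x in broadcastVar.value).collect()  # [1, 2, 3]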
Partitioning
Dividing data into partitions for parallel
processing.
Example
rdd = rdd.repartition(4).
Shuffling
Redistributing data across different nodes,
required by certain transformations like
reduceByKey.
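Example
A minimal sketch (the data is illustrative): reduceByKey moves rows with the same key to the same partition, which forces a shuffle:
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
counts = pairs.reduceByKey(lambda a, b: a + b)  # the shuffle happens when an action runs
counts.collect()  # [('a', 3), ('b', 1)] (order may vary)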
Caching
Storing RDDs in memory for faster access
during iterative algorithms.
Example
rdd.cache().
Persisting
Storing RDDs with different storage levels
(memory, disk, etc.).
Example
from pyspark import StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)
Checkpointing
Saving RDDs to reliable storage to truncate
lineage graphs and provide fault tolerance.
Example
sc.setCheckpointDir("/tmp/checkpoints")  # a checkpoint directory must be set first; the path is illustrative
rdd.checkpoint()
Fault Tolerance
Recovering lost data using lineage
information.
Example
Handling node failures by recomputing lost
partitions.
Lineage
Tracking the sequence of transformations that
produced an RDD.
Example
Viewing lineage with print(rdd.toDebugString().decode()).
MapReduce
A programming model for processing large
datasets with a distributed algorithm on a
cluster.
Example
Implementing a simple MapReduce job:
rdd.map(...).reduceByKey(...).
HDFS
Integration with Hadoop Distributed File
System for data storage.
Example
rdd = sc.textFile("hdfs://path/to/file").
Data Sources
Reading and writing data from various sources
like HDFS, S3, JDBC, and more.
Example
df = spark.read.jdbc(url, table, properties=properties)
SQL Functions
Built-in functions for data manipulation and
analysis in Spark SQL.
Example
df.select(concat(col("name"), lit(" "),
col("surname"))).
Column Pruning
An optimization technique that only reads
necessary columns from the data source.
Example
Selecting only the columns you need from a columnar source such as Parquet, so Spark reads just those columns.
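A minimal sketch, assuming a Parquet dataset at a hypothetical path (column names are illustrative); the ReadSchema entry in the physical plan shows that only the selected columns are read:
df = spark.read.parquet("/data/events")  # hypothetical path
df.select("user_id", "event_type").explain()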
Predicate Pushdown
An optimization technique that pushes filter
conditions to the data source.
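Example
A minimal sketch (hypothetical Parquet path, illustrative column name): the filter appears under PushedFilters in the physical plan, so the source can skip non-matching data:
df = spark.read.parquet("/data/events")  # hypothetical path
df.filter(df["event_type"] == "click").explain()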
Catalyst Optimization
Spark's internal optimization framework for
generating efficient execution plans.
Example
Viewing the optimized logical plan with df.queryExecution.optimizedPlan (Scala) or df.explain(True) (PySpark).
Tungsten
An execution engine in Spark SQL for memory
management and binary processing.
Example
Tungsten is enabled by default since Spark 2.0; its whole-stage code generation can be toggled with spark.sql.codegen.wholeStage.
Job
A complete computation expressed as a high-level action like count or saveAsTextFile.
Example
rdd.count().
Stage
A set of parallel tasks that will be executed as
a unit.
Example
Viewing stages in Spark UI for a job.
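A minimal sketch: the shuffle introduced by reduceByKey splits this job into two stages, which appear under the job's details in the Spark UI:
rdd = sc.parallelize(range(100))
counts = rdd.map(lambda x: (x % 3, 1)).reduceByKey(lambda a, b: a + b)
counts.count()  # running the action creates the job and its stages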
Task
The smallest unit of work in Spark, executed
by an executor.
Example
Monitoring task execution in Spark UI.
GroupBy
Grouping data by one or more columns for
aggregation.
Example
df.groupBy("department").agg(avg("salary")).
Join
Combining rows from two or more
DataFrames based on a related column.
Example
df1.join(df2, df1("id") === df2("id")).
Sorting
Ordering rows in a DataFrame based on
column values.
Example
df.sort("age", ascending=True).
Aggregations
Computing summary statistics of data groups.
Example
df.groupBy("department").agg(sum("salary"))
Pivot
Transforming unique values from one column
into multiple columns.
Example
df.groupBy("year").pivot("quarter").sum("rev
enue").
DataFrame Operations
Common operations on DataFrames, such as
selecting, filtering, and grouping.
Example
df.select("name").filter(df("age") > 21).
Schema
Defining the structure of a DataFrame with
column names and types.
Example
schema = StructType([StructField("name",
StringType(), True), StructField("age",
IntegerType(), True)]).
DataFrame Creation
Creating DataFrames from various data
sources like CSV, JSON, and RDDs.
Example
df = spark.read.csv("file.csv").
DataFrame
Transformation
Modifying DataFrames using operations like
withColumn, filter, groupBy, and agg.
Example
df.withColumn("new_col", df("col1") + df("col2"))
DataFrame Actions
Triggering computations and returning results
to the driver.
Example
df.show().
SQL Queries
Executing SQL queries on DataFrames.
Example
df.createOrReplaceTempView("table");
spark.sql("SELECT * FROM table").
Joins
Combining DataFrames based on common
columns.
Example
df1.join(df2, df1("id") === df2("id")).
Aggregations
Aggregating data in DataFrames using
functions like sum, avg, and count.
Example
df.groupBy("department").agg(sum("salary"))
Filtering
Filtering rows based on column values.
Example
df.filter(df("age") > 21).
Selection
Selecting specific columns from a DataFrame.
Example
df.select("name", "age").
Renaming
Renaming columns in a DataFrame.
Example
df.withColumnRenamed("old_name",
"new_name")
Dropping Columns
Removing columns from a DataFrame.
Example
df.drop("column_name").
Merging
Combining multiple DataFrames into one.
Example
df1.union(df2).
Pivoting
Transforming data to show unique values
from one column as multiple columns.
Example
df.groupBy("year")
.pivot("quarter")
.sum("revenue").
Unpivoting
Transforming data from columns into rows.
Example
melted_df = df.selectExpr("stack(3, 'Q1', Q1,
'Q2', Q2, 'Q3', Q3) as (quarter, revenue)").
Writing
Writing DataFrames to external storage
systems like HDFS, S3, or databases.
Example
df.write.csv("output.csv").
Reading
Reading data from various external storage
systems into DataFrames.
Example
df = spark.read.json("input.json").
Repartitioning
Changing the number of partitions of a
DataFrame.
Example
df.repartition(10).
Coalescing
Reducing the number of partitions of a
DataFrame.
Example
df.coalesce(1).
Serialization
Converting DataFrame rows into a format that
can be stored and transmitted.
Example
Python objects are serialized with pickle/cloudpickle, while JVM-side data can use Java serialization or Kryo via the spark.serializer setting.
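A sketch of configuring the JVM-side Kryo serializer when building the session (the app name is illustrative); Python rows themselves are still pickled:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("App")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())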
Deserialization
Converting serialized data back into
DataFrame rows.
Example
Reading serialized data and converting it back
to DataFrame.
Functions
Built-in functions for performing operations
on DataFrames.
Example
from pyspark.sql.functions import col, lit
df.select(col("name"), lit(1))
Column Operations
Performing operations on DataFrame
columns.
Example
df.withColumn("new_col", df("col1") + df("col2"))
Exploding
Transforming a DataFrame with nested
structures into a flat DataFrame.
Example
df.select("name", explode("hobbies"))
Aggregation Functions
Built-in functions for aggregating data in
DataFrames.
Example
from pyspark.sql.functions import sum, avg;
df.groupBy("dept")
.agg(sum("salary"), avg("salary"))
Statistics
Computing summary statistics of DataFrames.
Example
df.describe().show().
Union
Combining rows from two DataFrames into
one.
Example
df1.union(df2).
Intersect
Finding common rows between two
DataFrames.
Example
df1.intersect(df2).
Except
Finding rows in one DataFrame that are not in
another.
Example
PySpark: df1.exceptAll(df2) or df1.subtract(df2) (Scala: df1.except(df2)).
Distinct
Removing duplicate rows from a DataFrame.
Example
df.distinct().
Drop Duplicates
Removing duplicate rows based on specific
columns.
Example
df.dropDuplicates(["col1", "col2"]).
Handling Missing Data
Replacing null values in a DataFrame.
Example
df.fillna(0).
Date Functions
Working with date and time data in
DataFrames.
Example
df.withColumn("year", year("date_col"))
String Functions
Performing operations on string columns in
DataFrames.
Example
df.withColumn("uppercase_name",
upper("name"))
Conversion
Converting between different data types in
DataFrames.
Example
df.withColumn("int_col", col("string_col")
.cast("int")).
Explode
Converting an array column into multiple
rows, one for each element in the array.
Example
df.select("name", explode("hobbies")).
JSON Functions
Working with JSON data in DataFrames.
Example
df.selectExpr("json_tuple(json_col, 'key1',
'key2')").
Advanced Analytics
Using advanced analytical functions available
in PySpark.
Example
df.stat.freqItems(["col1", "col2"]).
Vectorized UDFs
UDFs that operate on entire columns at once,
improving performance.
Example
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("int")
def add_one(s: pd.Series) -> pd.Series:
    return s + 1
THANK YOU