
BDA1


Components of Apache Spark (EcoSystem)

Introduction

Now that we have some understanding of Spark, let us dive deeper and understand its
components. Apache Spark consists of the Spark Core Engine, Spark SQL, Spark Streaming,
MLlib, GraphX, and Spark R. You can use the Spark Core Engine along with any of the other
five components mentioned above; it is not necessary to use all the Spark components
together. Depending on the use case and application, any one or more of these can be used
along with Spark Core.

Let us look at each of these components in detail.

1. Spark Core
2. Spark SQL
3. Spark Streaming
4. MLlib (Machine Learning Library)
5. GraphX
6. Spark R


Spark Core: Spark Core is the heart of the Apache Spark framework. It provides the execution
engine for the Spark platform, which the other components are built on top of and rely on as
required. Spark Core provides in-memory computing and the ability to reference datasets stored
in external storage systems. It is also Spark Core's responsibility to perform all the basic I/O
functions, scheduling, monitoring, and so on. Fault recovery and effective memory management
are its other important functions.

Spark Core uses a very special data structure called the RDD (Resilient Distributed Dataset).
Data sharing in distributed processing systems like MapReduce requires intermediate data to be
written to and then read back from permanent storage like HDFS or S3, which makes processing
slow due to serialization, deserialization, and disk I/O. RDDs overcome this because they are
in-memory, fault-tolerant data structures that can be shared across different tasks within the
same Spark process. RDDs are immutable, partitioned collections and can contain any type of
object: Python, Scala, Java, or user-defined class objects. RDDs can be created either by
transformations of an existing RDD or by loading data from external sources like HDFS or
HBase. We will look into RDDs and their transformations in depth in later sections of the
tutorial.
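
As a minimal sketch using the Java API (the sample records and the commented-out HDFS path
are hypothetical), an RDD can be created from an in-memory collection or from external
storage and then transformed without modifying the original data:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        // Local master URL is used here only for illustration.
        SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD from an existing in-memory collection.
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("India,10", "USA,20", "India,5"));

        // Transformations are lazy and return a new, immutable RDD.
        JavaRDD<String> countries = lines.map(line -> line.split(",")[0]);

        // An RDD can also be loaded from external storage such as HDFS
        // (hypothetical path):
        // JavaRDD<String> fromHdfs = sc.textFile("hdfs:///data/sales.csv");

        // Actions trigger the actual computation.
        System.out.println(countries.distinct().collect());
        sc.close();
    }
}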

Spark SQL: Spark SQL grew out of Shark, which was the first interactive SQL engine on the
Hadoop system. Shark was built on top of the Hive codebase and achieved performance
improvements by swapping out Hive's physical execution engine. But due to the limitations of
Hive, Shark was not able to achieve the performance it was expected to. So the Shark project
was stopped, and Spark SQL was built with the lessons learned from Shark on top of the Spark
Core engine to leverage the power of Spark. You can read more about Shark in a blog post by
Reynold Xin, one of the Spark SQL code maintainers.

Spark SQL is named this way because it works with data in a fashion similar to SQL. In fact,
Spark SQL aims to be compatible with the SQL-92 standard. The gist is that it allows
developers to write declarative code, letting the engine use as much of the data and its stored
structure (RDDs) as it can to optimize the resulting distributed query behind the scenes. The
goal is to let the user worry less about the distributed nature of the data and focus on the
business use case. Users can perform extract, transform, and load operations on data from a
variety of sources and formats such as JSON, Parquet, or Hive, and then execute ad-hoc queries
using Spark SQL.

The DataFrame is the main abstraction of Spark SQL. A DataFrame in Spark is a distributed
collection of data organized into named columns. In earlier versions of Spark SQL, DataFrames
were referred to as SchemaRDDs. The DataFrame API integrates with Spark's procedural code
to provide tight integration between procedural and relational processing. The DataFrame API
evaluates operations lazily, so that relational optimizations can be applied and the overall data
processing workflow can be optimized. All relational functionality in Spark can be accessed
through the SQLContext or HiveContext.
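
A minimal sketch of the DataFrame API in Java (the JSON file path and the column names
below are hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlExample")
                .master("local[*]")
                .getOrCreate();

        // Load a DataFrame from a JSON source (hypothetical path and schema).
        Dataset<Row> sales = spark.read().json("sales.json");

        // Declarative, lazily evaluated query; the optimizer plans the execution.
        sales.createOrReplaceTempView("sales");
        Dataset<Row> perCountry = spark.sql(
                "SELECT country, SUM(amount) AS total FROM sales GROUP BY country");

        perCountry.show();
        spark.stop();
    }
}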

Spark Streaming: As the name suggests, this library is for streaming data. It is a very popular
Spark library because it takes Spark's big data processing power and cranks up the speed.
Spark Streaming can process gigabytes of data per second, and this combination of big and fast
data has a lot of potential. Spark Streaming is used for analyzing a continuous stream of data; a
common example is processing log data from a website or server.

Spark Streaming is not really streaming in the strict technical sense. What it actually does is
break the incoming data into individual chunks that are processed together as small RDDs. So
it does not process data byte by byte as it arrives, but every second, every two seconds, or at
some other fixed interval. Strictly speaking, Spark Streaming is therefore not real-time but near
real-time (micro-batching), which suffices for the vast majority of applications.

Spark Streaming can be configured to talk to a variety of data sources. We can simply listen on
a port that has data being pushed to it, or we can connect to sources like Amazon Kinesis,
Kafka, Flume, etc., for which connectors are available. Another good thing about Spark
Streaming is that it is reliable. It has a concept called "checkpointing" that stores state to disk
periodically, and depending on the kind of data source or receiver used, it can pick up data
from the point of failure. This is a very robust mechanism for handling all kinds of failures
such as disk or node failures. Spark Streaming provides exactly-once message guarantees and
helps recover lost work without requiring any extra code or additional configuration.

Just as Spark SQL has the DataFrame/Dataset concept built on top of RDDs, Spark Streaming
has something called the DStream. This is a collection of RDDs that embodies the entire
stream of data. The good thing about a DStream is that most of the built-in RDD functions,
such as flatMap and map, can also be applied to it. A DStream can also be broken into
individual RDDs and processed one chunk at a time. Spark developers can reuse the same code
for stream and batch processing and can also integrate streaming data with historical data.
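
A minimal sketch of a DStream word count in Java, assuming a text source on a local socket
(the host, port, and checkpoint directory are hypothetical); checkpointing is enabled to
illustrate the fault-recovery mechanism mentioned above:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]");
        // Micro-batches are created every two seconds.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
        ssc.checkpoint("/tmp/spark-checkpoint"); // hypothetical checkpoint directory

        // Each micro-batch of lines read from the socket becomes an RDD inside the DStream.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        JavaPairDStream<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();
        ssc.start();
        ssc.awaitTermination();
    }
}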

MLlib: Today many companies focus on building customer-centric data products and services,
which require machine learning to produce predictive insights, recommendations, and
personalized results. Data scientists can solve these problems using popular languages like
Python and R, but they spend a lot of time building and supporting infrastructure for those
languages. Spark has built-in support for doing machine learning and data science at massive
scale on a cluster. This component is called MLlib, which stands for Machine Learning
Library.

MLlib is a low-level machine learning library that can be called from the Java, Scala, and
Python programming languages. It is simple to use, scalable, and can easily be integrated with
other tools and frameworks. MLlib eases the development and deployment of scalable machine
learning pipelines. Machine learning is a subject in itself and cannot be covered in detail here,
but these are some of the important features and capabilities Spark MLlib offers (a minimal
usage sketch follows the list):

 Linear regression, logistic regression
 Support Vector Machines
 Naive Bayes classifier
 K-Means clustering
 Decision trees
 Recommendations using Alternating Least Squares
 Basic statistics: chi-squared test, Pearson or Spearman correlation, min, max, mean, variance
 Feature extraction
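
As a minimal sketch in Java, a logistic regression model can be trained with MLlib's RDD-based
API (the labels and feature values below are made-up toy data):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class MllibExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MllibExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy training data: a label plus a two-dimensional feature vector.
        JavaRDD<LabeledPoint> training = sc.parallelize(Arrays.asList(
                new LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
                new LabeledPoint(0.0, Vectors.dense(2.0, 1.0)),
                new LabeledPoint(1.0, Vectors.dense(0.1, 1.3)),
                new LabeledPoint(0.0, Vectors.dense(1.9, 0.8))
        )).cache();

        // Train a logistic regression model; the work is distributed across the cluster.
        LogisticRegressionModel model = new LogisticRegressionWithLBFGS().run(training.rdd());

        // Score a new point with the trained model.
        double prediction = model.predict(Vectors.dense(0.2, 1.0));
        System.out.println("Prediction: " + prediction);
        sc.close();
    }
}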

GraphX: For graphs and graph-parallel processing, Apache Spark provides another API called
GraphX. "Graph" here does not mean charts, line graphs, or bar graphs, but graphs in the
computer-science sense, such as social networks, which consist of vertices, where each vertex
is an individual user connected to other users by edges. These edges represent the relationships
between the users in the network.

GraphX is useful for computing overall information about a graph network: for example, it can
count how many triangles appear in the graph and apply the PageRank algorithm to it. It can
measure things like connectedness, degree distribution, average path length, and other
high-level properties of a graph. It can also join graphs together and transform graphs quickly,
and it supports the Pregel API for traversing a graph. Spark GraphX provides the Resilient
Distributed Graph (RDG, an abstraction over Spark RDDs). The RDG API is used by data
scientists to perform several graph operations through various computational primitives.
Similar to basic RDD operations like map and filter, property graphs also come with basic
operators. These operators take UDFs (user-defined functions) and produce new graphs with
transformed properties and structure.

Spark R: The R programming language is widely used by data scientists because of its
simplicity and its ability to run complex algorithms. But R suffers from the limitation that its
data processing capacity is restricted to a single node, which makes it unusable for processing
huge amounts of data. This problem is solved by SparkR, an R package in Apache Spark.
SparkR provides a data frame implementation that supports operations like selection, filtering,
and aggregation on large distributed datasets. SparkR also has support for distributed machine
learning using Spark MLlib.

Spark Architecture and components


What is Spark?

The Spark architecture is an open-source, framework-based design used in Apache Spark to
process large amounts of unstructured, semi-structured, and structured data for analytics. It is
regarded as an alternative to the Hadoop and MapReduce architectures for big data processing.
The RDD and the DAG, Spark's data storage and processing abstractions, are used to store and
process data, respectively. The Spark architecture consists of four components: the Spark
driver, executors, cluster managers, and worker nodes. It uses Datasets and DataFrames as the
fundamental data storage mechanism to optimise the Spark process and big data computation.

Apache Spark Features

Apache Spark is a popular open-source cluster computing framework created to accelerate data
processing applications. It enables applications to run faster by utilising in-memory cluster
computing. A cluster is a collection of nodes that communicate with each other and share data.
Because of implicit data parallelism and fault tolerance, Spark can be applied to a wide range
of sequential and interactive processing demands.
 Speed: Spark performs up to 100 times faster than MapReduce when processing
large amounts of data. It is also able to divide the data into chunks in a
controlled way.
 Powerful Caching: A simple programming layer offers powerful caching and
disk persistence capabilities.
 Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark's
own cluster manager.
 Real-Time: Because of its in-memory processing, it offers real-time
computation and low latency.
 Polyglot: Spark supports four languages: Java, Scala, Python, and R. You can
write Spark code in any of these languages. Spark also provides command-line
interfaces in Scala and Python.

Two Main Abstractions of Apache Spark

The Apache Spark architecture consists of two main abstraction layers:

Resilient Distributed Datasets (RDD):


The RDD is the key tool for data computation. It acts as an interface for immutable data and
makes it possible to recompute data in the event of a failure. There are two kinds of operations
on RDDs: transformations and actions.

Directed Acyclic Graph (DAG):


The driver converts the program into a DAG for each job; a DAG represents the sequence of
connections between nodes, that is, the operations applied to the data. The Apache Spark
ecosystem includes various components such as the Spark Core API, Spark SQL, Spark
Streaming for real-time processing, MLlib, and GraphX. Using the Spark shell you can read
large volumes of data, and through the SparkContext you can run or cancel a job, where a job
(a computation) is made up of tasks (units of work).
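
A minimal sketch in Java of how transformations only build up the DAG and an action triggers
its execution (the numbers are toy data):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DagExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DagExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // Transformations are lazy: nothing runs yet, the lineage (DAG) is only recorded.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
        JavaRDD<Integer> big = doubled.filter(n -> n > 4);

        // The action triggers the driver to turn the DAG into stages and tasks
        // that are scheduled onto executors.
        long count = big.count();
        System.out.println("Count: " + count);
        sc.close();
    }
}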

Architecture:

When the driver program in the Apache Spark architecture executes, it calls the actual program
of the application and creates a SparkContext, which contains all of the basic functions. The
Spark driver also includes several other components, such as the DAG Scheduler, Task
Scheduler, Backend Scheduler, and Block Manager, which are responsible for translating
user-written code into jobs that are actually executed on the cluster.

The cluster manager manages the execution of the various jobs in the cluster. The Spark driver
works in conjunction with the cluster manager, which allocates the resources for the job. Once
the job has been broken down into smaller tasks, which are distributed to worker nodes, the
Spark driver controls their execution. Many worker nodes can be used to process an RDD
created in the SparkContext, and the results can also be cached.

The SparkContext receives task information from the cluster manager and enqueues it on the
worker nodes. The executors are in charge of carrying out these tasks, and the lifespan of an
executor is the same as that of the Spark application. We can increase the number of workers
if we want to improve the performance of the system; in this way, the job can be divided into
more coherent parts.

Spark Architecture Applications

A high-level view of the architecture of the Apache Spark application is as follows:

The Spark driver

The driver process is the master process: it coordinates the workers and oversees the tasks. The
application is split into jobs that are scheduled to be executed on executors in the cluster. The
driver program calls the main application and creates a Spark context (which acts as a
gateway) to connect to a Spark cluster and monitor the job running in that cluster. Everything
is executed using the Spark context.

Each Spark session has an entry in the Spark context. The Spark driver works with additional
components, as well as the cluster manager, to execute jobs in the cluster. Because Spark
clusters can be connected to different types of cluster managers, the context acquires worker
nodes on which to execute tasks and store data. When a process is executed in the cluster, the
job is divided into stages, and the stages into scheduled tasks.

The Spark executors

An executor is responsible for executing tasks and storing data in a cache. At the outset,
executors register with the driver program. Each executor has a number of slots for running
parts of the application concurrently. Executors run tasks once they have loaded their data, and
they are removed when idle. An executor runs as a Java process; executors are allocated
dynamically and are constantly added and removed during the execution of the tasks. The
driver program monitors the executors during their work, and users' tasks are executed inside
the executor's Java process.
Cluster Manager

The cluster manager is responsible for acquiring and allocating the resources (such as executor
processes, CPU, and memory) that a Spark application needs on the cluster. Spark can work
with several cluster managers, such as its own standalone cluster manager, Mesos, or Hadoop
YARN. The driver program negotiates with the cluster manager for executors on the worker
nodes and then coordinates the work that runs on them.

Worker Nodes

The worker (slave) nodes host the executors that process tasks and return the results back to
the Spark context. The master node issues tasks to the Spark context, and the worker nodes
execute them. The process is scaled by increasing the number of worker nodes (1 to n) so that
as many jobs as possible can be handled in parallel, with the job divided into sub-jobs across
multiple machines. In Spark, a partition is the unit of work, and each partition is assigned to
one executor.

The following points are worth remembering about this design:

1. There are multiple executor processes for each application, which run tasks on
multiple threads over the course of the whole application. This allows
applications to be isolated both on the scheduling side (drivers can schedule
tasks individually) and the executor side (tasks from different apps can run in
different JVMs). Therefore, data must be written to an external storage system
before it can be shared across different Spark applications.
2. Spark can be run on a cluster manager that also supports other applications
(e.g. Mesos or YARN), as long as it can acquire executor processes and these
can communicate with each other.
3. The driver program must listen for and accept incoming connections from its
executors throughout its lifetime. Workers must be able to connect to the driver
program via the network.
4. The driver is responsible for scheduling tasks on the cluster. It should be run on
the same local network as the worker nodes, preferably on the same machine. If
you want to send requests to the cluster, it’s preferable to open an RPC and
have the driver submit operations from nearby rather than running the driver far
away from the worker nodes.
Modes of Execution

You can choose from three different execution modes, which determine where your
application's resources are physically located when you run it:

1. Cluster mode
2. Client mode
3. Local mode

Cluster mode: Cluster mode is the most frequent way of running Spark Applications.
In cluster mode, a user delivers a pre-compiled JAR, Python script, or R script to a
cluster manager. Once the cluster manager receives the pre-compiled JAR, Python
script, or R script, the driver process is launched on a worker node inside the cluster,
in addition to the executor processes. This means that the cluster manager is in charge
of all Spark application-related processes.

Client mode: In contrast to cluster mode, in client mode the Spark driver remains on the client
machine that submitted the application, so that machine is responsible for maintaining the
Spark driver process while the executors run inside the cluster. These client machines are
usually referred to as gateway machines or edge nodes.

Local mode: Local mode runs the entire Spark Application on a single machine, in contrast to
the previous two modes, which parallelize the application across the executors of a cluster. In
local mode, parallelism is achieved through threads on that one machine. This is a common
way to experiment with Spark, try out your applications, or iterate quickly without needing a
cluster.
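
As a small illustrative sketch (one common way to pick a mode, not the only one), the master
URL given to SparkConf, or to spark-submit via --master, selects where the application runs;
the cluster URLs mentioned in the comments are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterUrlExample {
    public static void main(String[] args) {
        // "local[*]" runs everything in one JVM, using one thread per core (local mode).
        // "yarn" or "spark://master-host:7077" (hypothetical host) would instead hand the
        // application to a cluster manager; whether the driver itself runs on the cluster
        // or on the submitting machine is chosen with spark-submit's --deploy-mode flag.
        SparkConf conf = new SparkConf()
                .setAppName("MasterUrlExample")
                .setMaster("local[*]");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Running with master: " + conf.get("spark.master"));
        sc.close();
    }
}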
We will now create our first Java MapReduce program:

Step 1)
Create a new directory with the name MapReduceTutorial, as shown in the MapReduce
example below
sudo mkdir MapReduceTutorial

Give permissions
sudo chmod -R 777 MapReduceTutorial

SalesMapper.java

package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String valueString = value.toString();
        String[] SingleCountryData = valueString.split(",");
        output.collect(new Text(SingleCountryData[7]), one);
    }
}

SalesCountryReducer.java

package SalesCountry;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        while (values.hasNext()) {
            // Add each occurrence to the running total for this country
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}

SalesCountryDriver.java

package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {
    public static void main(String[] args) {
        JobClient my_client = new JobClient();
        // Create a configuration object for the job
        JobConf job_conf = new JobConf(SalesCountryDriver.class);

        // Set a name of the Job
        job_conf.setJobName("SalePerCountry");

        // Specify data type of output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // Specify names of Mapper and Reducer Class
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

        // Specify formats of the data type of Input and output
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // Set input and output directories using command line arguments:
        // args[0] = name of input directory on HDFS, and args[1] = name of output
        // directory to be created to store the output file.
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Spark Streaming was added to Apache Spark in 2013. It is an extension of the core Spark API
that provides scalable, high-throughput, and fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like Kafka, Apache Flume, Amazon Kinesis,
or TCP sockets, and processed using complex algorithms expressed with high-level functions
like map, reduce, join, and window. Finally, the processed data can be pushed out to
filesystems, databases, and live dashboards.
Internally it works as follows: live input data streams are received and divided into batches by
Spark Streaming, and these batches are then processed by the Spark engine to generate the
final stream of results, also in batches.

Its key abstraction is the Apache Spark Discretized Stream, or Spark DStream for short, which
represents a stream of data divided into small batches. DStreams are built on Spark RDDs,
Spark's core data abstraction. This allows Spark Streaming to integrate seamlessly with other
Apache Spark components like Spark MLlib and Spark SQL.
Need for Streaming in Apache Spark
To process the data, most traditional stream processing systems are designed with a
continuous operator model, which works as follows:

 Streaming data is received from data sources (e.g. live logs, system
telemetry data, IoT device data, etc.) into some data ingestion system like
Apache Kafka, Amazon Kinesis, etc.
 The data is then processed in parallel on a cluster.
 Results are given to downstream systems like HBase, Cassandra, Kafka,
etc.

There is a set of worker nodes, each of which runs one or more continuous operators.
Each continuous operator processes the streaming data one record at a time and
forwards the records to other operators in the pipeline.

Data is received from ingestion systems via Source operators and given as output to
downstream systems via sink operators.

Continuous operators are a simple and natural model. However, this traditional architecture
has run into some challenges with today's trend towards larger-scale and more complex
real-time analytics:

a) Fast Failure and Straggler Recovery

In real time, the system must be able to recover quickly and automatically from failures and
stragglers to provide results, which is challenging in traditional systems due to the static
allocation of continuous operators to worker nodes.

b) Load Balancing

In a continuous operator system, uneven allocation of the processing load between the
workers can cause bottlenecks. The system needs to be able to dynamically adapt the
resource allocation based on the workload.

c) Unification of Streaming, Batch and Interactive Workloads

In many use cases, it is also attractive to query the streaming data interactively, or to combine
it with static datasets (e.g. pre-computed models). This is hard in continuous operator systems,
which are not designed for adding new operators for ad-hoc queries. It requires a single engine
that can combine batch, streaming, and interactive queries.

d) Advanced Analytics with Machine learning and SQL Queries

Complex workloads require continuously learning and updating data models, or even
querying the streaming data with SQL queries. Having a common abstraction across
these analytic tasks makes the developer’s job much easier.
Why Streaming in Spark?
Batch processing systems like Apache Hadoop have high latency that is not suitable for near
real-time processing requirements. Storm guarantees that a record is processed if it has not
been processed before, but this can lead to inconsistencies because records may be processed
more than once, and state is lost if a node running Storm goes down. In most environments,
Hadoop is used for batch processing while Storm is used for stream processing, which
increases code size, the number of bugs to fix, and development effort, introduces a learning
curve, and causes other issues. This illustrates the difference between big data processing with
Hadoop and with Apache Spark.

Spark Streaming helps in fixing these issues and provides a scalable, efficient,
resilient, and integrated (with batch processing) system. Spark has provided a unified
engine that natively supports both batch and streaming workloads. Spark’s single
execution engine and unified Spark programming model for batch and streaming lead
to some unique benefits over other traditional streaming systems.

Goals of Spark Streaming


This architecture allows Spark Streaming to achieve the following goals:

a) Dynamic load balancing


Dividing the data into small micro-batches allows for fine-grained allocation of computations
to resources. Consider a simple workload where the input data stream needs to be partitioned
by a key and processed. In the traditional record-at-a-time approach, if one of the partitions is
more computationally intensive than the others, the node to which that partition is assigned
becomes a bottleneck and slows down the pipeline. In Spark Streaming, the job's tasks are
naturally load-balanced across the workers: some workers process a few longer tasks while
others process more of the shorter tasks.

b) Fast failure and straggler recovery

Traditional systems have to restart the failed operator on another node to recompute the lost
information in case of node failure. Because only one node handles the recomputation, the
pipeline cannot proceed until the new node has caught up after the replay. In Spark, the
computation is discretized into small tasks that can run anywhere without affecting correctness,
so failed tasks can be distributed evenly across all the other nodes in the cluster to perform the
recomputation and recover from the failure faster than the traditional approach.

c) Unification of batch, streaming and interactive analytics

A DStream in Spark is just a series of RDDs, which allows batch and streaming workloads to
interoperate seamlessly. Arbitrary Apache Spark functions can be applied to each batch of
streaming data. Since the batches of streaming data are stored in the Spark workers' memory,
they can be interactively queried on demand.

d) Advanced analytics like machine learning and interactive SQL

Spark's interoperability extends to rich libraries like MLlib (machine learning), SQL,
DataFrames, and GraphX. RDDs generated by DStreams can be converted to DataFrames and
queried with SQL, and machine learning models trained offline with MLlib can be applied to
streaming data.
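
A minimal sketch of this interoperability in Java, assuming a socket text source (the host and
port are hypothetical); each micro-batch RDD is converted to a DataFrame and queried with
SQL:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSqlExample {
    // Simple bean so that each line of the stream can become a DataFrame row.
    public static class Record implements java.io.Serializable {
        private String word;
        public String getWord() { return word; }
        public void setWord(String word) { this.word = word; }
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingSqlExample").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd -> {
            SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();

            // Convert this batch's RDD into a DataFrame and register it as a temp view.
            Dataset<Row> df = spark.createDataFrame(rdd.map(w -> {
                Record r = new Record();
                r.setWord(w);
                return r;
            }), Record.class);
            df.createOrReplaceTempView("words");

            // Ad-hoc SQL over the current micro-batch.
            spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show();
        });

        ssc.start();
        ssc.awaitTermination();
    }
}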

e) Performance
Spark Streaming's ability to batch data and leverage the Spark engine leads to higher
throughput compared to other streaming systems, and it can achieve latencies as low as a few
hundred milliseconds.
