
Introduction to SPARK

Zikra Shaikh
What is Apache Spark?
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop
MapReduce and extends the MapReduce model to efficiently support more types of computations, including
interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive
queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management
burden of maintaining separate tools.

Evolution of Apache Spark


Spark is one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s AMPLab by Matei Zaharia. It was
open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark
became a top-level Apache project in February 2014.
Why Apache Spark?
Features of Apache Spark
Apache Spark has the following features.

Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times
faster when running on disk. This is possible by reducing the number of read/write operations to disk; the
intermediate processing data is stored in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can
write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’; it also supports SQL queries, streaming
data, machine learning (ML), and graph algorithms.
Apache Spark Architecture

Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.

The Spark architecture depends upon two abstractions:

○ Resilient Distributed Dataset (RDD)


○ Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)


Resilient Distributed Datasets are groups of data items that can be stored in memory on worker nodes. Here,

○ Resilient: able to restore the data on failure.
○ Distributed: data is distributed among different nodes.
○ Dataset: a group of data.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations performed on data. Each node is an RDD
partition, and each edge is a transformation applied on top of the data. Here, the graph describes the navigation through the sequence of
operations, while directed and acyclic refer to how it is carried out: every edge points forward from one operation to the next, and the
sequence never loops back on itself.

Let's understand the Spark architecture.


Driver Program
The Driver Program is a process that runs the main() function of the application and creates the SparkContext object. The
purpose of the SparkContext is to coordinate the Spark applications, which run as independent sets of processes on a cluster.

To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:

○ It acquires executors on nodes in the cluster.
○ Then, it sends your application code to the executors. Here, the application code can be defined by JAR or Python
files passed to the SparkContext.
○ At last, the SparkContext sends tasks to the executors to run; a minimal driver sketch follows this list.
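For illustration, a minimal driver program might look like the following sketch. The application name, the local[*] master URL, and the toy computation are placeholders chosen for this example, not part of the original text.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // main() creates the SparkContext that coordinates the application
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The SparkContext distributes this computation to the executors as tasks
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"Sum = $sum")

    sc.stop()
  }
}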
Cluster Manager

○ The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large
number of clusters.
○ Spark supports various types of cluster managers such as Hadoop YARN, Apache Mesos and the Standalone Scheduler.
○ Here, the Standalone Scheduler is Spark's own cluster manager, which makes it possible to install Spark on an empty
set of machines.

Worker Node

○ The worker node is a slave node.
○ Its role is to run the application code in the cluster.

Executor

○ An executor is a process launched for an application on a worker node.
○ It runs tasks and keeps data in memory or disk storage across them.
○ It reads and writes data to and from external sources.
○ Every application has its own executors.

Task

○ A unit of work that will be sent to one executor.


Spark Components
The Spark project consists of several tightly integrated components. At its core, Spark is a computational engine
that can schedule, distribute and monitor multiple applications.

Let's understand each Spark component in detail.


Apache Spark Core:

○ The Spark Core is the heart of Spark and performs the core functionality.
○ It holds the components for task scheduling, fault recovery, interacting with storage systems and memory
management.

Spark SQL
○ Spark SQL is built on top of Spark Core. It provides support for structured data.
○ It allows the data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called
HQL (Hive Query Language).
○ It supports JDBC and ODBC connections that establish a relation between Java objects and existing databases, data
warehouses and business intelligence tools.
○ It also supports various sources of data like Hive tables, Parquet, and JSON.
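As a rough illustration, structured data can be queried through Spark SQL as in the sketch below. The file name people.json and its columns (name, age) are invented for this example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlExample").getOrCreate()

// Load structured data (JSON in this sketch) into a DataFrame
val people = spark.read.json("people.json")

// Register the DataFrame as a temporary view and query it with SQL
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()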
Spark Streaming
○ Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
○ It uses Spark Core's fast scheduling capability to perform streaming analytics.
○ It accepts data in mini-batches and performs RDD transformations on those mini-batches.
○ Its design ensures that applications written for streaming data can be reused to analyze batches of historical data with
little modification.
○ The log files generated by web servers can be considered a real-time example of a data stream.
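For example, a minimal streaming word count over a socket source might look like the following sketch; the host, port and 5-second batch interval are placeholders, not values from the original text.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
// Process the stream in mini-batches of 5 seconds
val ssc = new StreamingContext(conf, Seconds(5))

// Read lines of text from a TCP socket (placeholder host/port)
val lines = ssc.socketTextStream("localhost", 9999)

// The same RDD-style transformations used for batch data apply to each mini-batch
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()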

MLlib
○ MLlib is a machine learning library that contains various machine learning algorithms.
○ These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
○ It is nine times faster than the disk-based implementation used by Apache Mahout.
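As a small illustration of one of these capabilities, the RDD-based MLlib API can compute a correlation between two series. The data values below are toy numbers invented for this sketch, and sc is assumed to be an existing SparkContext.

import org.apache.spark.mllib.stat.Statistics

// Two numeric series as RDD[Double] (toy data for this sketch)
val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
val y = sc.parallelize(Seq(2.0, 4.0, 6.0, 8.0, 10.0))

// Pearson correlation between the two series
val corr = Statistics.corr(x, y, "pearson")
println(s"Correlation = $corr")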
GraphX
○ GraphX is a library used to manipulate graphs and perform graph-parallel computations.
○ It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.
○ To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices, and
aggregateMessages.
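A minimal sketch of building such a property graph follows. The vertex names and edge values are made up for illustration, and sc is assumed to be an existing SparkContext.

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices are (id, property) pairs; edges carry (source, destination, property)
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

val graph = Graph(vertices, edges)

// A simple graph-parallel computation: count incoming edges per vertex
graph.inDegrees.collect().foreach(println)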
Spark Shell
The Spark Shell is an interactive command-line tool provided by Apache Spark that allows users to explore and manipulate
data using Spark's APIs. It provides an interactive environment for running Spark code snippets, performing data analysis, and
experimenting with Spark features without the need to write and execute full Spark applications.

There are two primary Spark Shells:

Spark Shell (Scala):


● The Scala Spark Shell (spark-shell) is the default interactive shell for Apache Spark, based on Scala.
● It provides a Scala REPL (Read-Eval-Print Loop) environment with Spark pre-configured, allowing users to write and
execute Spark code interactively.
● Users can directly enter Scala code to create RDDs and DataFrames, perform transformations, and run Spark actions.
PySpark Shell (Python):
● The PySpark Shell (pyspark) is the interactive Python shell for Apache Spark.
● It provides a Python REPL environment with Spark pre-configured, enabling users to write and execute Spark code using
Python.
● Users can interactively work with RDDs and DataFrames, perform transformations, and execute Spark actions using
Python syntax.
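For instance, inside spark-shell a SparkContext is already available as sc (and a SparkSession as spark), so a quick interactive computation might look like this sketch:

// Typed directly at the spark-shell (Scala) prompt
val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0)
evens.count()          // returns 50
evens.take(5)          // returns Array(2, 4, 6, 8, 10)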
Spark Core: RDD
The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster,
so that we can execute various parallel operations on it.

There are two ways to create RDDs:

○ Parallelizing an existing collection in the driver program
○ Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is
copied to form a distributed dataset that can be operated on in parallel.

val info = Array(1, 2, 3, 4)
val distinfo = sc.parallelize(info)

Now, we can operate on the distributed dataset (distinfo) in parallel, for example distinfo.reduce((a, b) => a + b).
External Datasets
In Spark, distributed datasets can be created from any type of storage source supported by Hadoop, such as HDFS,
Cassandra, HBase, and even our local file system. Spark provides support for text files, SequenceFiles, and other
types of Hadoop InputFormat.

SparkContext's textFile method can be used to create an RDD from a text file. This method takes a URI for the file (either a local
path on the machine or an hdfs:// URI) and reads the data of the file as a collection of lines.

Now, we can operate on the data with dataset operations; for example, we can add up the sizes of all the lines using the map and
reduce operations as follows: data.map(s => s.length).reduce((a, b) => a + b).
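Putting this together as a short sketch (the file name data.txt is a placeholder for any text file reachable from the cluster):

// Create an RDD from a text file; each element is one line of the file
val data = sc.textFile("data.txt")

// Add up the lengths of all the lines
val totalLength = data.map(s => s.length).reduce((a, b) => a + b)
println(totalLength)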
RDD Operations

The RDD provides the two types of operations:

○ Transformation
○ Action

Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy,
as they are only computed when an action requires a result to be returned to the driver program.

Let's see some of the frequently used RDD Transformations.
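A few commonly used transformations, sketched on a small made-up RDD; note that nothing is computed yet, since transformations stay lazy until an action is called.

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

val squares  = nums.map(n => n * n)            // map: apply a function to each element
val evens    = nums.filter(_ % 2 == 0)         // filter: keep elements matching a predicate
val pairs    = nums.flatMap(n => Seq(n, -n))   // flatMap: map each element to zero or more elements
val distinct = pairs.distinct()                // distinct: remove duplicate elements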


Action
In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset.
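Continuing the sketch above, a few common actions force the computation and return values to the driver:

squares.collect()                 // returns all elements as an Array
nums.count()                      // returns the number of elements
nums.reduce((a, b) => a + b)      // aggregates the elements, here returning 15
nums.take(3)                      // returns the first 3 elements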
