BDA 5
Zikra Shaikh
What is Apache Spark
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Speed − Spark helps run applications in a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. This is possible because Spark reduces the number of read/write operations to disk by storing intermediate processing data in memory (see the caching sketch after this list).
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so applications can be written in different languages. Spark also offers around 80 high-level operators for interactive querying.
Advanced Analytics − Spark supports not only 'Map' and 'Reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
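As a small illustration of the in-memory feature mentioned under Speed, the sketch below (Scala, run in the Spark shell where sc is already defined; the log path is a placeholder) caches an intermediate RDD so that later actions reuse it instead of rereading the file from disk:

    // Hypothetical log file; sc is the SparkContext provided by the Spark shell.
    val lines = sc.textFile("hdfs://namenode:9000/logs/access.log")
    val errors = lines.filter(line => line.contains("ERROR"))

    // cache() keeps the filtered RDD in memory, so repeated actions
    // reuse the in-memory data instead of rereading the file from disk.
    errors.cache()

    println(errors.count())                                 // first action: reads the file and caches the result
    println(errors.filter(_.contains("timeout")).count())   // reuses the cached data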
Apache Spark Architecture
Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.
To run on a cluster, the SparkContext connects to one of several types of cluster manager and then performs the following tasks:
○ The role of the cluster manager is to allocate resources across applications. Spark is capable of running on clusters with a large number of nodes.
○ Several cluster managers are supported, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler (selected through the master URL, as sketched below).
○ The Standalone Scheduler is Spark's own cluster manager, which makes it easy to install Spark on an empty set of machines.
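As a rough sketch of how an application is pointed at a particular cluster manager, the master URL is set on the SparkConf (the host names and ports below are placeholders, not values from this document):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://master-host:7077")   // Standalone Scheduler master (placeholder host/port);
                                               // use "yarn" for Hadoop YARN, "mesos://mesos-host:5050" for Apache Mesos,
                                               // or "local[*]" to run locally for testing
    val sc = new SparkContext(conf)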
[Architecture diagram: each Worker Node runs an Executor, which executes the Tasks assigned to it.]
Spark Core
○ Spark Core is the heart of Spark and performs its core functionality.
○ It holds the components for task scheduling, fault recovery, interacting with storage systems and memory
management.
Spark SQL
○ Spark SQL is built on top of Spark Core. It provides support for structured data.
○ It allows data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
○ It supports JDBC and ODBC connections that establish a relation between Java objects and existing databases, data
warehouses and business intelligence tools.
○ It also supports various sources of data like Hive tables, Parquet, and JSON.
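A minimal Spark SQL sketch (Scala, assuming the spark SparkSession created by the Spark shell and a hypothetical people.json file):

    // Read a JSON source into a DataFrame (people.json is a placeholder file).
    val people = spark.read.json("people.json")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    people.createOrReplaceTempView("people")

    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()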
Spark Streaming
○ Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
○ It uses Spark Core's fast scheduling capability to perform streaming analytics.
○ It accepts data in mini-batches and performs RDD transformations on that data.
○ Its design ensures that the applications written for streaming data can be reused to analyze batches of historical data with
little modification.
○ The log files generated by web servers can be considered as a real-time example of a data stream.
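A minimal Spark Streaming sketch (Scala; the host and port are placeholders) that counts words arriving on a TCP socket in 10-second mini-batches:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Build a streaming context on top of the existing SparkContext, with a
    // 10-second mini-batch interval. (When running locally, the SparkContext
    // needs at least two cores, e.g. master "local[2]".)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Placeholder source: lines of text arriving on a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The familiar RDD-style transformations are applied to each mini-batch.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing the stream
    ssc.awaitTermination()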
MLlib
○ MLlib is Spark's machine learning library; it contains various machine learning algorithms.
○ These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
○ It is nine times faster than the disk-based implementation used by Apache Mahout.
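As one small illustration (Scala, using the RDD-based mllib API; the data values are made up), MLlib's Statistics helper can compute the correlation between two series:

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    // Made-up series; in practice these would come from a real dataset.
    val seriesX: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
    val seriesY: RDD[Double] = sc.parallelize(Seq(2.1, 4.2, 6.1, 8.3, 10.2))

    // Pearson correlation between the two series.
    val correlation = Statistics.corr(seriesX, seriesY, "pearson")
    println(s"Correlation: $correlation")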
GraphX
○ GraphX is a library used to manipulate graphs and perform graph-parallel computations.
○ It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.
○ To manipulate graphs, it provides various fundamental operators such as subgraph, joinVertices, and aggregateMessages.
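A small GraphX sketch (Scala; the vertices and edges are made up) that builds a directed property graph and uses aggregateMessages to count each vertex's incoming edges:

    import org.apache.spark.graphx.{Edge, Graph}

    // Made-up vertices (id, name) and directed edges (src, dst, label).
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

    val graph = Graph(vertices, edges)

    // aggregateMessages: every edge sends the value 1 to its destination vertex,
    // and the messages are summed per vertex, giving each vertex's in-degree.
    val inDegrees = graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
    inDegrees.collect().foreach(println)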
Spark Shell
The Spark Shell is an interactive command-line tool provided by Apache Spark that allows users to interactively explore and manipulate
data using Spark's APIs. It provides an interactive environment for running Spark code snippets, performing data analysis, and
experimenting with Spark features without the need to write and execute full Spark applications.
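For example, after launching the shell with ./bin/spark-shell (path relative to an assumed Spark installation), a short interactive session might look like this sketch:

    // Inside the Spark shell, sc (SparkContext) and spark (SparkSession)
    // are already created for you.
    val nums = sc.parallelize(1 to 100)
    val evens = nums.filter(_ % 2 == 0)
    println(evens.count())   // prints 50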
Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is copied to form a distributed dataset that can be operated on in parallel.
We can then operate on the distributed dataset (distinfo) in parallel, for example distinfo.reduce((a, b) => a + b), as in the sketch below.
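For example (Scala, in the Spark shell), the steps above look like:

    // An ordinary Scala collection in the driver program.
    val info = Array(1, 2, 3, 4)

    // parallelize copies its elements into a distributed dataset (an RDD).
    val distinfo = sc.parallelize(info)

    // The RDD can now be operated on in parallel, e.g. summing its elements.
    val sum = distinfo.reduce((a, b) => a + b)   // sum == 10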
External Datasets
In Spark, distributed datasets can be created from any type of storage source supported by Hadoop, such as HDFS, Cassandra, HBase and even our local file system. Spark provides support for text files, SequenceFiles, and other types of Hadoop InputFormat.
SparkContext's textFile method can be used to create a text-file RDD. This method takes a URI for the file (either a local path on the machine or an hdfs:// URI) and reads the data of the file.
We can then operate on the data with dataset operations; for example, we can add up the sizes of all the lines using the map and reduce operations as follows: data.map(s => s.length).reduce((a, b) => a + b).
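Putting the above together (Scala; data.txt is a placeholder path):

    // Create an RDD from a text file; each element is one line of the file.
    val data = sc.textFile("data.txt")   // also accepts hdfs:// URIs

    // map converts each line to its length, and reduce adds the lengths up.
    val totalLength = data.map(s => s.length).reduce((a, b) => a + b)
    println(totalLength)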
RDD Operations
○ Transformation
○ Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy because they are only computed when an action requires a result to be returned to the driver program.
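A small sketch (Scala) of this laziness:

    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Transformation: nothing is computed yet; Spark only records the lineage.
    val squares = numbers.map(n => n * n)

    // Action: triggers the actual computation and returns the result to the driver.
    val result = squares.collect()   // Array(1, 4, 9, 16, 25)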