What Is Spark?: History of Apache Spark
Uses of Spark
•Data integration: The data generated by different systems is often not consistent enough to be combined for analysis. To fetch consistent data from these systems, we can use a process such as Extract, Transform, and Load (ETL). Spark is used to reduce the cost and time required for this ETL process.
Cluster Manager
•The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large number of clusters.
Worker Node
•The worker node is a slave node whose role is to run the application code on the cluster.
Executor
•An executor is a process launched for an application on a worker node.
•It runs tasks and keeps data in memory or disk storage across them.
Task
•A unit of work that will be sent to one executor.
Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
•The Spark Core is the heart of Spark and performs the core functionality.
•It holds the components for task scheduling, fault recovery, interacting
with storage systems and memory management.
Spark SQL
•Spark SQL is built on top of Spark Core. It provides support for structured data.
•It allows data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
•It supports JDBC and ODBC connections, which establish a link between Java objects and existing databases, data warehouses, and business intelligence tools.
•It also supports various sources of data like Hive tables, Parquet, and
JSON.
Spark Streaming
•Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
MLlib
•The MLlib is a Machine Learning library that contains various machine
learning algorithms.
GraphX
•The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
What is RDD?
The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster, so that we can execute various parallel operations on it.
There are two ways to create RDDs:
Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is copied to form a distributed dataset that can be operated on in parallel, as in the sketch below.
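A minimal spark-shell sketch; the collection values here are assumed sample data, not taken from the original text:
scala> val data = Array(1, 2, 3, 4, 5)        // assumed sample collection
scala> val distData = sc.parallelize(data)    // distribute it across the cluster
scala> distData.collect                       // Array(1, 2, 3, 4, 5)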
External Datasets
In Spark, distributed datasets can be created from any type of storage source supported by Hadoop, such as HDFS, Cassandra, HBase, and even our local file system. Spark provides support for text files, SequenceFiles, and other types of Hadoop InputFormat.
SparkContext's textFile method can be used to create an RDD from a text file. This method takes a URI for the file (either a local path on the machine or an hdfs:// URI) and reads the data of the file.
Now we can operate on the dataset; for example, we can add up the sizes of all the lines using the map and reduce operations as follows:
data.map(s => s.length).reduce((a, b) => a + b)
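A short sketch of the same idea in spark-shell; the file name data.txt is an assumption for illustration:
scala> val data = sc.textFile("data.txt")                                  // read a text file into an RDD
scala> val totalLength = data.map(s => s.length).reduce((a, b) => a + b)   // sum of all line lengths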
RDD Operations
The RDD provides the two types of operations:
•Transformation
•Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program.
Let's see some of the frequently used RDD Transformations.
Transformation: map(func)
Description: It returns a new distributed dataset formed by passing each element of the source through a function func.
Action
In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset.
Action: reduce(func)
Description: It aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
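A minimal spark-shell sketch of reduce, using assumed sample values:
scala> val data = sc.parallelize(List(1, 2, 3, 4, 5))   // assumed sample data
scala> data.reduce((a, b) => a + b)                     // Int = 15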
Storage Level: MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Description: It is the same as the levels above (e.g., MEMORY_ONLY, MEMORY_AND_DISK), but replicates each partition on two cluster nodes.
Broadcast variable
Broadcast variables allow a read-only variable to be cached on each machine rather than shipping a copy of it with tasks. Spark uses efficient broadcast algorithms to distribute broadcast variables in order to reduce communication cost.
The execution of Spark actions passes through several stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data required by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task.
To create a broadcast variable (let's say, v), call SparkContext.broadcast(v). Let's understand this with an example.
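A minimal spark-shell sketch; the array contents are assumed sample values:
scala> val v = Array(1, 2, 3)                  // assumed data to share with all tasks
scala> val broadcastVar = sc.broadcast(v)      // create the broadcast variable
scala> broadcastVar.value                      // Array(1, 2, 3), readable on any executor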
Spark map Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD using a parallelized collection and check its contents.
scala> data.collect
•Apply the map function and pass the expression to be applied to each element (a complete sketch follows).
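A minimal end-to-end sketch of the map function; the sample data and the mapping expression (adding 10) are assumptions for illustration:
scala> val data = sc.parallelize(List(1, 2, 3, 4, 5))   // assumed sample data
scala> data.collect                                     // Array(1, 2, 3, 4, 5)
scala> val mapfunc = data.map(x => x + 10)              // apply the expression to each element
scala> mapfunc.collect                                  // Array(11, 12, 13, 14, 15)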
Spark filter Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD using a parallelized collection.
scala> data.collect
•Apply the filter function and check the result.
scala> filterfunc.collect
Here, we got the desired output (a complete sketch follows).
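A minimal sketch of the filter function; the sample data and the predicate (keeping values greater than 20) are assumptions:
scala> val data = sc.parallelize(List(10, 20, 30, 40))   // assumed sample data
scala> data.collect                                      // Array(10, 20, 30, 40)
scala> val filterfunc = data.filter(x => x > 20)         // keep only elements matching the predicate
scala> filterfunc.collect                                // Array(30, 40)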
Spark count Function
•Create an RDD using a parallelized collection and check its contents.
scala> data.collect
•Apply the count() function to count the number of elements (a complete sketch follows).
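A minimal sketch of count(), with assumed sample data:
scala> val data = sc.parallelize(List(1, 2, 3, 4, 5))   // assumed sample data
scala> data.count                                       // Long = 5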
Spark distinct Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD using a parallelized collection.
scala> data.collect
•Apply the distinct() function to remove duplicate elements and check the result.
scala> distinctfunc.collect
Here, we got the desired output (a complete sketch follows).
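A minimal sketch of distinct(), with assumed sample data containing duplicates:
scala> val data = sc.parallelize(List(1, 2, 2, 3, 3, 3))   // assumed sample data
scala> data.collect                                        // Array(1, 2, 2, 3, 3, 3)
scala> val distinctfunc = data.distinct()                  // drop duplicate elements
scala> distinctfunc.collect                                // Array(1, 2, 3), order may vary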
Spark union Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD using a parallelized collection.
scala> data1.collect
•Create another RDD using a parallelized collection.
scala> data2.collect
•Apply the union() function to return the union of the elements (a complete sketch follows).
scala> unionfunc.collect
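A minimal sketch of union(); both input lists are assumed sample data:
scala> val data1 = sc.parallelize(List(1, 2, 3))    // assumed first dataset
scala> val data2 = sc.parallelize(List(3, 4, 5))    // assumed second dataset
scala> val unionfunc = data1.union(data2)           // union keeps duplicates
scala> unionfunc.collect                            // Array(1, 2, 3, 3, 4, 5)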
Spark intersection Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create two RDDs using parallelized collections.
scala> data1.collect
scala> data2.collect
•Apply the intersection() function and check the result (a complete sketch follows).
scala> intersectfunc.collect
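A minimal sketch of intersection(); the overlapping sample values are assumptions:
scala> val data1 = sc.parallelize(List(1, 2, 3, 4))        // assumed first dataset
scala> val data2 = sc.parallelize(List(3, 4, 5, 6))        // assumed second dataset
scala> val intersectfunc = data1.intersection(data2)       // elements common to both
scala> intersectfunc.collect                               // Array(3, 4), order may vary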
Spark cartesian Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD using a parallelized collection.
scala> data1.collect
•Create another RDD using a parallelized collection.
scala> data2.collect
•Apply the cartesian() function and check the result.
scala> cartesianfunc.collect
Here, we got the desired output (a complete sketch follows).
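A minimal sketch of cartesian(); the two small datasets are assumed for illustration:
scala> val data1 = sc.parallelize(List(1, 2))              // assumed first dataset
scala> val data2 = sc.parallelize(List("a", "b"))          // assumed second dataset
scala> val cartesianfunc = data1.cartesian(data2)          // all pairs from the two datasets
scala> cartesianfunc.collect                               // Array((1,a), (1,b), (2,a), (2,b))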
Spark sortByKey Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD of key-value pairs using a parallelized collection.
scala> data.collect
•For ascending order, apply sortByKey() and check the result.
scala> sortfunc.collect
•For descending order, apply sortByKey(false) and check the result (a complete sketch follows).
scala> sortfunc.collect
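A minimal sketch of sortByKey() on assumed (key, value) sample data, showing both sort orders:
scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("B", 2)))   // assumed pair data
scala> val sortfunc = data.sortByKey()                                // ascending by default
scala> sortfunc.collect                                               // Array((A,1), (B,2), (C,3))
scala> val sortfuncDesc = data.sortByKey(false)                       // descending order
scala> sortfuncDesc.collect                                           // Array((C,3), (B,2), (A,1))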
Spark groupByKey Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD of key-value pairs using a parallelized collection.
scala> data.collect
•Apply the groupByKey() function to group the values and check the result (a complete sketch follows).
scala> groupfunc.collect
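A minimal sketch of groupByKey() with assumed pair data:
scala> val data = sc.parallelize(Seq(("A", 1), ("B", 2), ("A", 3)))   // assumed pair data
scala> val groupfunc = data.groupByKey()                              // group all values per key
scala> groupfunc.collect                        // Array((A,CompactBuffer(1, 3)), (B,CompactBuffer(2)))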
Spark reduceByKey Function
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD of key-value pairs using a parallelized collection.
scala> data.collect
•Apply the reduceByKey() function to aggregate the values per key and check the result (a complete sketch follows).
scala> reducefunc.collect
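A minimal sketch of reduceByKey() with assumed pair data, summing the values for each key:
scala> val data = sc.parallelize(Seq(("A", 1), ("B", 2), ("A", 3)))   // assumed pair data
scala> val reducefunc = data.reduceByKey((a, b) => a + b)             // sum values per key
scala> reducefunc.collect                                             // Array((A,4), (B,2))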
Spark cogroup Function
In Spark, the cogroup function operates on two datasets of pairs, let's say (K, V) and (K, W), and returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also known as groupWith.
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD of key-value pairs using a parallelized collection.
scala> data1.collect
•Create another RDD using a parallelized collection.
scala> data2.collect
•Apply the cogroup() function and check the result.
scala> cogroupfunc.collect
Here, we got the desired output (a complete sketch follows).
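A minimal sketch of cogroup() on two assumed pair datasets:
scala> val data1 = sc.parallelize(Seq(("A", 1), ("B", 2)))      // assumed (K, V) dataset
scala> val data2 = sc.parallelize(Seq(("A", "x"), ("C", "y")))  // assumed (K, W) dataset
scala> val cogroupfunc = data1.cogroup(data2)                   // group both datasets by key
scala> cogroupfunc.collect
// Array((A,(CompactBuffer(1),CompactBuffer(x))), (B,(CompactBuffer(2),CompactBuffer())), (C,(CompactBuffer(),CompactBuffer(y))))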
•To open Spark in Scala mode, run the following command.
$ spark-shell
•Create an RDD using a parallelized collection.
scala> data.collect
Spark Word Count Example
•Create a text file on your local machine and write some text into it.
$ nano sparkdata.txt
$ cat sparkdata.txt
•Create a directory in HDFS where the text file will be kept, then open the Spark shell.
$ spark-shell
•Let's create an RDD from the file by using the following command.
•Now, we can read the generated result by using the following command.
scala> data.collect
•Here, we split the existing data into individual words by using the following command.
scala> splitdata.collect
•Now, we can read the generated result by using the following command.
scala> mapdata.collect
•Now, perform the reduce operation.
•Now, we can read the generated result by using the following command (a complete end-to-end sketch follows).
scala> reducedata.collect
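A complete sketch of the word count flow. The file name sparkdata.txt comes from the steps above; the HDFS directory /spark and the sample contents of the file are assumptions:
$ hdfs dfs -mkdir /spark                       // create a directory in HDFS (assumed name)
$ hdfs dfs -put sparkdata.txt /spark           // upload the local file into HDFS
$ spark-shell
scala> val data = sc.textFile("/spark/sparkdata.txt")          // read the file into an RDD
scala> val splitdata = data.flatMap(line => line.split(" "))   // split each line into words
scala> val mapdata = splitdata.map(word => (word, 1))          // pair each word with a count of 1
scala> val reducedata = mapdata.reduceByKey(_ + _)             // sum the counts for each word
scala> reducedata.collect                                      // Array of (word, count) pairs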