Apache Spark Self Learning 1

Apache Spark is a cluster computing technology designed for fast computation. It builds on
the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more
types of computation, including interactive queries and stream processing. The main feature
of Spark is its in-memory cluster computing, which increases the processing speed of an
application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries, and streaming. Besides supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
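
To make the in-memory point concrete, here is a minimal sketch in plain Python. It is an analogy only and does not use Spark: the two stages, the function names, and the temporary JSON file are all assumptions made for illustration. It runs the same two-stage job twice, once passing the intermediate result through a file and once keeping it in memory, and counts the disk operations each variant needs.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First stage (hypothetical): square every input value."""
    return [n * n for n in numbers]


def stage_two(values):
    """Second stage (hypothetical): sum the intermediate values."""
    return sum(values)


def run_via_disk(numbers):
    """Pass the intermediate result between stages through a file."""
    disk_ops = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.json")
        with open(path, "w") as f:
            json.dump(stage_one(numbers), f)
        disk_ops += 1  # one write of the intermediate data
        with open(path) as f:
            intermediate = json.load(f)
        disk_ops += 1  # one read of the intermediate data
    return stage_two(intermediate), disk_ops


def run_in_memory(numbers):
    """Keep the intermediate result in memory between stages."""
    intermediate = stage_one(numbers)
    return stage_two(intermediate), 0  # no disk round trip


data = list(range(1, 6))
disk_result, disk_ops = run_via_disk(data)
mem_result, mem_ops = run_in_memory(data)
print(disk_result, disk_ops)
print(mem_result, mem_ops)
```

Both variants compute the same answer; the in-memory one simply skips the write and read between stages, which is the kind of saving that grows with every extra stage or iteration.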

Features:
• Speed: Spark can run an application in a Hadoop cluster up to 100 times
faster in memory, and up to 10 times faster when running on disk, compared
with Hadoop MapReduce. This is possible because Spark reduces the number of
read/write operations to disk by storing intermediate processing data in memory.
• Supports multiple languages: Spark provides built-in APIs in Java, Scala, and
Python, so you can write applications in different languages. Spark also
comes with more than 80 high-level operators for interactive querying.
• Advanced analytics: Spark supports more than ‘map’ and ‘reduce’. It also
supports SQL queries, streaming data, machine learning (ML), and graph
algorithms.
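
For readers new to the ‘map’ and ‘reduce’ model mentioned above, the following is a small word-count sketch in plain Python. It does not call any Spark API; the function names `map_phase`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this illustration. The map step emits a `(word, 1)` pair per word, and the reduce step folds those pairs into totals, with the intermediate pairs held in memory.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(counts, pair):
    """Reduce: fold one (word, 1) pair into the running totals."""
    word, n = pair
    counts[word] += n
    return counts


def word_count(lines):
    """Run the map step over all lines, then reduce the pairs to totals."""
    # The intermediate pairs live in an in-memory list between the two steps.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce(reduce_phase, pairs, Counter()))


lines = ["spark is fast", "Spark runs in memory"]
counts = word_count(lines)
print(counts["spark"])
```

Spark's higher-level operators generalize this pattern: the same split into a per-record transformation and an aggregation applies, but the work is distributed across a cluster.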

Deployment modes:
• Standalone: In a standalone deployment, Spark sits on top of HDFS (Hadoop
Distributed File System), and space is allocated for HDFS explicitly. Spark and
MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN: In a YARN deployment, Spark runs on YARN without any
pre-installation or root access. This helps integrate Spark into the Hadoop
ecosystem (the Hadoop stack) and allows other components to run on top of
the stack.
• Spark in MapReduce (SIMR): SIMR is used to launch Spark jobs in addition
to standalone deployment. With SIMR, a user can start Spark and use its shell
without any administrative access.
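
As a rough illustration of how the first two deployment options differ when a job is submitted, the sketch below builds a `spark-submit` argument list in Python. The helper `submit_command`, the host name `master-host`, and the application file `app.py` are assumptions for this example; the `--master` option, the `spark://HOST:PORT` URL form, the `yarn` master value, and the default standalone port 7077 come from Spark itself. SIMR is omitted from the sketch.

```python
def submit_command(mode, app, host="master-host"):
    """Build a spark-submit argument list for a deployment mode.

    `mode` is "standalone" or "yarn"; `host` is a placeholder name for
    the standalone master and is ignored for YARN.
    """
    if mode == "standalone":
        # Standalone masters are addressed as spark://HOST:PORT (7077 by default).
        master = f"spark://{host}:7077"
    elif mode == "yarn":
        # On YARN the cluster location comes from the Hadoop configuration.
        master = "yarn"
    else:
        raise ValueError(f"unsupported deployment mode: {mode}")
    return ["spark-submit", "--master", master, app]


print(" ".join(submit_command("standalone", "app.py")))
print(" ".join(submit_command("yarn", "app.py")))
```

The application itself stays the same in both cases; only the master setting changes.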

Spark DataFrames and Datasets

The Apache Spark Dataset API provides a type-safe, object-oriented programming
interface. DataFrame is an alias for an untyped Dataset[Row].

The Azure Databricks documentation uses the term DataFrame for most technical
references and guides, because this language is inclusive for Python, Scala, and R.
See Notebook example: Scala Dataset aggregator.
