Apache Spark Self Learning 1
Features:
Speed − Spark can run applications on a Hadoop cluster up to 100 times faster
in memory and up to 10 times faster on disk. It achieves this by reducing the
number of read/write operations to disk, storing intermediate processing data
in memory instead.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and
Python, so you can write applications in any of these languages. Spark also
ships with more than 80 high-level operators for interactive querying.
Advanced Analytics − Spark supports not only ‘Map’ and ‘Reduce’ but also SQL
queries, streaming data, machine learning (ML), and graph algorithms.
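The ‘Map’ and ‘Reduce’ pattern that Spark builds on can be sketched in plain Python (standard library only; Spark's own distributed API differs): a word count that maps each word to a (word, 1) pair and then reduces the pairs by key.

```python
from functools import reduce
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every line.
lines = ["spark runs fast", "spark scales"]
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts for each key.
def merge(acc, pair):
    word, count = pair
    acc[word] += count
    return acc

counts = dict(reduce(merge, pairs, defaultdict(int)))
print(counts)  # {'spark': 2, 'runs': 1, 'fast': 1, 'scales': 1}
```

Spark generalizes this idea: the map and reduce steps run in parallel across a cluster, and the same engine also exposes SQL, streaming, ML, and graph operations.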
Deployment modes:
Standalone − In a standalone deployment, Spark sits on top of HDFS (Hadoop
Distributed File System), with space allocated for HDFS explicitly. Here,
Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any
pre-installation or root access. This makes it easy to integrate Spark into
the Hadoop ecosystem or Hadoop stack, and it allows other components to run on
top of the stack.
Spark in MapReduce (SIMR) − SIMR launches Spark jobs in addition to a
standalone deployment. With SIMR, a user can start Spark and use its shell
without any administrative access.
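In practice, the choice of deployment mode typically shows up in the `--master` option of `spark-submit`. The host name and application file below are placeholders, and SIMR uses its own launcher rather than these flags:

```shell
# Standalone cluster: point --master at the standalone master URL.
spark-submit --master spark://master-host:7077 app.py

# Hadoop YARN: YARN schedules the job across the Hadoop cluster.
spark-submit --master yarn --deploy-mode cluster app.py

# Local mode (useful for testing), with 2 worker threads.
spark-submit --master local[2] app.py
```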
Spark DataFrames and Datasets
The Azure Databricks documentation uses the term DataFrame for most technical
references and guides, because the term applies equally to Python, Scala, and R.
See the notebook example: Scala Dataset aggregator.