
Apache Spark Interview Questions

Spark has become popular among data scientists and big data enthusiasts. If you are looking for the best
collection of Apache Spark Interview Questions for your data analyst, big data or machine learning job, you
have come to the right place.

In this Spark Tutorial, we shall go through some of the frequently asked Spark Interview Questions.

Entry Level Spark Interview Questions
Medium Level Spark Interview Questions
Advanced Level Spark Interview Questions

Entry Level Spark Interview Questions

What is Apache Spark?


Apache Spark is an open-source project from the Apache Software Foundation. It is a data processing engine
used for data processing and data analytics, with built-in libraries for machine learning, graph processing,
streaming, and SQL querying. Spark is horizontally scalable and is very efficient in terms of speed when
compared to Hadoop.

When should you choose Apache Spark?


Choose Apache Spark when:

the application needs to scale;
the application needs both batch and real-time processing of records;
the application needs to connect to multiple data stores such as Apache Cassandra, Apache HBase, Apache Hive, and SQL databases;
the application should be able to query structured datasets spread across different data platforms.

Which built-in libraries does Spark have?


Spark has four built-in libraries. They are:

SQL and DataFrames


Spark Streaming
MLlib (Machine Learning Library)
GraphX

How fast is Apache Spark when compared to Hadoop? Give us an example.


Apache Spark is about 100 times faster than Apache Hadoop for in-memory workloads. For a typical logistic
regression problem, if Hadoop MapReduce takes around 110 seconds to complete, Spark finishes in roughly 1 second.

Why is Spark faster than Hadoop?


Spark is fast because it performs computations in memory wherever possible, and it uses a state-of-the-art
DAG scheduler, a query optimizer, and a physical execution engine.
Which programming languages can be used for Spark Application Development?
One can use the following programming languages:

Java
Scala
Python
R
SQL

On which platforms can Spark run?


Spark can run in its own standalone cluster mode, or on the following cluster managers and platforms:

Hadoop YARN
Apache Mesos
Kubernetes
Amazon EC2 (cloud deployment)

Which data sources can Spark access?


Spark can access data from hundreds of sources. Some of them are:

HDFS (Hadoop Distributed File System)
Apache Cassandra
Apache HBase
Apache Hive

Can structured data be queried in Spark? If so, how?


Yes. Structured data can be queried using either of the following (a brief sketch follows the list):

SQL
DataFrame API
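
A minimal sketch in Java of both approaches, assuming a local Spark installation and a hypothetical people.json file that has name and age fields:

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("StructuredQuery").master("local[*]").getOrCreate();
Dataset<Row> people = spark.read().json("people.json");   // hypothetical input file

// 1. SQL: register a temporary view and run a SQL query against it.
people.createOrReplaceTempView("people");
spark.sql("SELECT name FROM people WHERE age > 21").show();

// 2. DataFrame API: express the same query with method calls.
people.filter(col("age").gt(21)).select("name").show();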

What is Spark MLlib?


MLlib is Spark’s scalable, built-in machine learning library. It contains many machine learning algorithms,
as well as utilities to transform data and extract useful information or inferences from it.

How is MLlib scalable?


MLlib's scalability comes from Spark itself: its algorithms run on Spark's distributed execution engine through
the DataFrame-based API, so the same code scales from a single machine to a large cluster.

What kinds of machine learning use cases does MLlib solve?


MLlib contains common learning algorithms that can solve problems like the following (a brief sketch follows the list):

Clustering
Classification
Regression
Recommendation
Topic Modelling
Frequent itemsets
Association rules
Sequential pattern mining
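
As an illustration, here is a minimal clustering sketch in Java using the RDD-based MLlib API; the data points and the choice of k = 2 are made up for the example, and jsc is assumed to be an existing JavaSparkContext:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Toy data set of two-dimensional points.
JavaRDD<Vector> points = jsc.parallelize(Arrays.asList(
    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));

// Cluster the points into k = 2 clusters using at most 20 iterations.
KMeansModel model = KMeans.train(points.rdd(), 2, 20);
System.out.println("Cluster centers: " + Arrays.toString(model.clusterCenters()));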

What is Spark Context?


A SparkContext instance sets up internal services for a Spark application and establishes a connection to
a Spark execution environment. The SparkContext is created by the Spark driver application.

What is an RDD?
RDD, short for Resilient Distributed Dataset, is a collection of elements that is fault tolerant and can be
operated on in parallel. RDDs provide the abstraction for distributed computation across the nodes of a Spark cluster.
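
A minimal sketch of creating RDDs in Java, assuming jsc is an existing JavaSparkContext (the file path is a hypothetical placeholder):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// From an in-memory collection.
JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// From an external data source.
JavaRDD<String> lines = jsc.textFile("hdfs:///path/to/input.txt");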

What are RDD transformations?


Transformations are operations which create a new RDD from an existing RDD.

Some of the RDD transformations are (see the sketch after this list):

map
filter
union
intersection
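
A brief sketch of map and filter in Java, building on the numbers RDD created in the earlier sketch (the squaring function and the threshold are arbitrary examples):

// map: apply a function to every element, producing a new RDD.
JavaRDD<Integer> squares = numbers.map(x -> x * x);

// filter: keep only the elements that satisfy a predicate.
JavaRDD<Integer> bigSquares = squares.filter(x -> x > 4);

// Nothing has executed yet -- these calls only record the lineage of new RDDs.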

What are RDD actions?


Actions are operations that run a computation on an RDD and return an output value to the driver (or write it to storage).

Some of the RDD actions are (see the sketch after this list):

reduce
collect
count
countByKey
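
Continuing the sketch above, the actions below trigger the actual execution of the lazy transformations:

import java.util.List;

long howMany = bigSquares.count();               // count: number of elements in the RDD
int total = bigSquares.reduce((a, b) -> a + b);  // reduce: combine elements with a function
List<Integer> values = bigSquares.collect();     // collect: bring all elements back to the driver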

What do you mean by ‘RDD Transformations are lazy’?


Transformations on an RDD are not actually executed until an action is encountered. Hence, RDD
Transformations are called lazy.

What do you know about RDD Persistence?


Persistence means caching. When an RDD is persisted, each node in the cluster stores the partitions of the
RDD that it computes in memory (RAM). When multiple transformations or actions are run on the same RDD,
persistence cuts down latency by avoiding the need to reload the data from file storage into memory (or to
recompute the RDD) every time.

What is the difference between cache() and persist() for an RDD?


cache() uses default storage level, i.e., MEMORY_ONLY.

persist() can be provided with any of the possible storage levels.

What do you mean by the default storage level: MEMORY_ONLY?


The default storage level, MEMORY_ONLY, stores the RDD as deserialized Java objects in the JVM. If the RDD
does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they
are needed.
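
A short sketch of the difference in Java, reusing the RDDs from the earlier sketches (MEMORY_AND_DISK is just one of the possible storage levels):

import org.apache.spark.storage.StorageLevel;

numbers.cache();                                   // equivalent to persist(StorageLevel.MEMORY_ONLY())
lines.persist(StorageLevel.MEMORY_AND_DISK());     // partitions that do not fit in memory spill to disk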

What is a Sliding Window Operation?


In Spark Streaming, a sliding window operation applies a transformation over a sliding window of data in a
DStream. The window is defined by a window length (the duration of the window) and a sliding interval (how
often the windowed computation is performed); both must be multiples of the DStream's batch interval.
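
A minimal sketch in Java, assuming an existing JavaPairDStream<String, Integer> named pairs (for example, word counts) created from a JavaStreamingContext with a 10-second batch interval:

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Sum the counts over the last 60 seconds of data, recomputed every 20 seconds.
JavaPairDStream<String, Integer> windowedCounts =
    pairs.reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(60), Durations.seconds(20));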

Medium Level Spark Interview Questions

How does the SparkContext in a Spark application pick the value for the Spark master?
That can be done in two ways.

1. Create a new SparkConf object and set the master using its setMaster() method. This Spark configuration object is passed
as an argument while creating the new SparkContext.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Build the configuration: application name, master URL, and memory settings.
SparkConf conf = new SparkConf().setAppName("JavaKMeansExample")
                                .setMaster("local[2]")
                                .set("spark.executor.memory", "3g")
                                .set("spark.driver.memory", "3g");

// Pass the configuration to the JavaSparkContext when creating it.
JavaSparkContext jsc = new JavaSparkContext(conf);

2. The <apache-installation-directory>/conf/spark-env.sh file, located locally on the machine, contains the Spark
environment configuration. The Spark master is one of the parameters that can be provided in this configuration file.

How do you configure Spark Application?


A Spark application can be configured using properties set directly on the SparkConf object that is
passed during SparkContext initialization.

Following are some of the properties that can be configured for a Spark application (a brief sketch follows the list).

Spark Application Name


Number of Spark Driver Cores
Spark Driver’s Maximum Result Size
Spark Driver’s Memory
Spark Executors’ Memory
Spark Extra Listeners
Spark Local Directory
Log Spark Configuration
Spark Master
Deploy Mode of Spark Driver
Log Application Information
Spark Driver Supervise Action
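
A sketch of how several of these map to SparkConf properties in Java (the values shown are arbitrary examples):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("MyApp")                       // Spark application name
    .setMaster("local[4]")                     // Spark master
    .set("spark.driver.cores", "2")            // number of driver cores
    .set("spark.driver.maxResultSize", "2g")   // driver's maximum result size
    .set("spark.driver.memory", "4g")          // driver memory
    .set("spark.executor.memory", "4g")        // executor memory
    .set("spark.local.dir", "/tmp/spark")      // Spark local directory
    .set("spark.logConf", "true");             // log the effective Spark configuration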

Reference : Configure Spark Application

What is the use of Spark Environment Parameters? How do you configure those?
Spark environment parameters affect the behavior, working, and memory usage of the nodes in a cluster.

These parameters can be configured in the local configuration file spark-env.sh located at
<apache-installation-directory>/conf/spark-env.sh. A brief example is given below.

Reference: Configure Spark Ecosystem
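
A few illustrative entries for spark-env.sh; the variable names come from the spark-env.sh.template shipped with Spark, and the values are placeholders:

# <apache-installation-directory>/conf/spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # JVM used by Spark daemons
export SPARK_MASTER_HOST=192.168.0.10                # bind the standalone master to this host/IP
export SPARK_WORKER_CORES=4                          # number of cores each worker may use
export SPARK_WORKER_MEMORY=8g                        # total memory each worker may use
export SPARK_LOCAL_DIRS=/tmp/spark                   # scratch space for shuffle and spill files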

How do you establish a connection from Apache Mesos to Apache Spark?


A connection to Mesos can be established in two ways.

The Spark driver program can be configured to connect to Mesos by pointing the Spark master at the Mesos master URL. The Spark
binaries must also be placed in a location accessible to Mesos.
The other way is to install Spark in the same location on all the Mesos agent machines, and configure the property
spark.mesos.executor.home to point to the directory where Spark is installed.
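
A minimal sketch of the driver-side configuration in Java; the Mesos master URL and the location of the Spark binaries are placeholder values:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("MesosExample")
    .setMaster("mesos://192.168.0.10:5050")                       // placeholder Mesos master URL
    .set("spark.executor.uri", "hdfs:///spark/spark-x.y.z.tgz");  // placeholder location of the Spark binaries

JavaSparkContext jsc = new JavaSparkContext(conf);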

How do you minimize data transfers between nodes in a cluster?


Minimizing shuffles minimizes the data transfers between nodes, and thus makes the job faster. There
are some practices that reduce the use of shuffle operations: use broadcast variables to join a small
RDD (or lookup table) with a large RDD, and use accumulators to update values of variables in parallel.
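
A small sketch of a broadcast join in Java, assuming jsc is an existing JavaSparkContext; the lookup map is a stand-in for a small data set that would otherwise have to be shuffled:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Small lookup table, shipped once per executor instead of once per task.
Map<Integer, String> lookup = new HashMap<>();
lookup.put(1, "one");
lookup.put(2, "two");
Broadcast<Map<Integer, String>> broadcastLookup = jsc.broadcast(lookup);

// The large RDD is joined against the broadcast map without a shuffle.
JavaRDD<Integer> ids = jsc.parallelize(Arrays.asList(1, 2, 1, 2));
JavaRDD<String> names = ids.map(id -> broadcastLookup.value().get(id));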

To run Spark Applications, should we install Spark on all the nodes of a YARN cluster?
Spark programs can be executed on top of YARN. So, there is no need to install Spark on all the nodes of a
YARN cluster to run Spark applications.

To run Spark Applications, should we install Spark on all the nodes of a Mesos cluster?
Spark programs can be executed on top of Mesos. So, there is no need to install Spark on all the nodes of a
Mesos cluster to run Spark applications.

What is the usage of GraphX module in Spark?


GraphX is Spark's graph processing library. It can be used to build and transform graphs and to run graph
algorithms on them. Many algorithms are available in the GraphX library; PageRank is one of them.

How does Spark handle distributed processing?


Spark provides an abstraction over distributed processing through the Spark RDD API. A general user does not
need to worry about how data is processed across a distributed cluster. There are some exceptions, though:
when you optimize an application for performance, you should understand which operations and actions require
data transfer (shuffles) between nodes.

Advanced Level Spark Interview Questions
