Spark Interview Questions
Spark has become popular among data scientists and big data enthusiasts. If you are looking for a solid
collection of Apache Spark interview questions for a data analyst, big data, or machine learning job, you
have come to the right place.
In this Spark Tutorial, we shall go through some of the frequently asked Spark Interview Questions.
Which programming languages can be used to develop Spark applications?
Java
Scala
Python
R
SQL
On which cluster managers and environments can Spark applications run?
Hadoop YARN
EC2
Mesos
Kubernetes
Which data storage systems can Spark access?
HDFS
Apache Cassandra
Apache HBase
Apache Hive
Which interfaces does Spark SQL provide for working with structured data? The two main ones are listed below; a short example combining them follows the list.
SQL
DataFrame API
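As a hedged illustration of these two interfaces, here is a minimal Java sketch; the people.json input file and its name and age columns are assumptions made for this example, not something defined in this tutorial.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSQLExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSQLExample")
                .master("local[2]")
                .getOrCreate();

        // DataFrame API: load a JSON file into a DataFrame (Dataset<Row>)
        Dataset<Row> people = spark.read().json("people.json"); // input path is an assumption

        // SQL interface: register the DataFrame as a temporary view and query it with SQL
        people.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
    }
}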
Which machine learning algorithms and techniques does Spark MLlib provide? The main families are listed below; a clustering sketch follows the list.
Clustering
Classification
Regression
Recommendation
Topic Modelling
Frequent itemsets
Association rules
Sequential pattern mining
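As a hedged sketch of the clustering family, the following Java example runs K-Means through the spark.ml API; the sample_kmeans_data.txt input (in libsvm format), the number of clusters and the seed are illustrative assumptions.

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JavaKMeansExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("JavaKMeansExample")
                .master("local[2]")
                .getOrCreate();

        // Load feature vectors from a libsvm-formatted file (file name is an assumption)
        Dataset<Row> dataset = spark.read().format("libsvm").load("sample_kmeans_data.txt");

        // Train a K-Means model with 2 clusters (k and seed are arbitrary for the example)
        KMeans kmeans = new KMeans().setK(2).setSeed(1L);
        KMeansModel model = kmeans.fit(dataset);

        // Print the learned cluster centers
        for (Vector center : model.clusterCenters()) {
            System.out.println(center);
        }

        spark.stop();
    }
}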
What is an RDD?
RDD, short for Resilient Distributed Dataset, is Spark's core abstraction: an immutable, partitioned collection of elements that can be operated on in parallel across the nodes of a Spark cluster. RDDs are fault tolerant, since a lost partition can be recomputed from the lineage of transformations that produced it.
Which operations can be performed on an RDD?
RDD operations fall into two groups; a minimal example is sketched after the list.
Transformations, which lazily build a new RDD:
map
filter
union
intersection
Actions, which trigger execution and return a result to the driver:
reduce
collect
count
countByKey
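A minimal Java sketch of transformations followed by actions; the class name and the input numbers are assumptions made for the example.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDOperationsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDDOperationsExample").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build an RDD from a local collection
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // Transformations are lazy: map and filter only record the lineage
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
        JavaRDD<Integer> multiplesOfFour = doubled.filter(n -> n % 4 == 0);

        // Actions trigger execution and return results to the driver
        System.out.println("count = " + multiplesOfFour.count());
        System.out.println("values = " + multiplesOfFour.collect());

        sc.close();
    }
}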
How does the Spark Context in a Spark Application pick the value for the Spark Master?
The master can be provided in two ways.
1. Create a new SparkConf object and set the master using its setMaster() method. This Spark Configuration object is passed
as an argument while creating the new Spark Context.
// Configure the application name, master URL and memory settings
SparkConf conf = new SparkConf().setAppName("JavaKMeansExample")
        .setMaster("local[2]")
        .set("spark.executor.memory", "3g")
        .set("spark.driver.memory", "3g");
// The configuration object is passed while creating the Spark Context
JavaSparkContext sc = new JavaSparkContext(conf);
2. Pass the master URL through the --master option of spark-submit (or set spark.master in conf/spark-defaults.conf). Values set directly on the SparkConf in code take precedence over flags passed to spark-submit, which in turn take precedence over spark-defaults.conf. Other properties, such as spark.app.name, spark.executor.memory and spark.driver.memory shown above, can be configured for a Spark Application in the same ways.
What is the use of Spark Environment Parameters? How do you configure those?
Spark environment parameters control per-node settings such as the IP address a node binds to, the Java installation to use, and the cores and memory a worker offers to executors. They are configured as environment variables in the local configuration file spark-env.sh, located at <apache-installation-directory>/conf/spark-env.sh on each node; a sample file is sketched below.
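A minimal sketch of a spark-env.sh for a standalone cluster; the paths, host name, core count and memory values below are illustrative assumptions, not recommendations.

# spark-env.sh - sourced when Spark daemons and applications start on this node
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # JVM used by Spark (path is an assumption)
export SPARK_MASTER_HOST=master-node                  # host the standalone master binds to (assumed)
export SPARK_WORKER_CORES=4                           # cores each worker offers to executors
export SPARK_WORKER_MEMORY=8g                         # memory each worker offers to executors
export SPARK_LOCAL_IP=192.168.1.10                    # IP address this node binds to (assumed)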
How do you connect a Spark Application to Apache Mesos?
The Spark driver program has to be configured to connect to the Mesos master, typically through a master URL of the form mesos://host:5050. The Spark binaries must also be reachable by the Mesos agents, for example by uploading a Spark binary package and pointing spark.executor.uri at it.
The other way is to install Spark in the same location on all Mesos agents and configure the property
spark.mesos.executor.home to point to that Spark installation directory.
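A hedged Java sketch of the first approach; the Mesos master host and port and the package URI are placeholder assumptions.

// Point the driver at the Mesos master and tell the agents where to fetch Spark
SparkConf conf = new SparkConf()
        .setAppName("SparkOnMesosExample")
        .setMaster("mesos://mesos-master:5050")    // Mesos master URL (host and port assumed)
        .set("spark.executor.uri", "hdfs:///packages/spark-3.5.0-bin-hadoop3.tgz"); // uploaded Spark package (path assumed)
JavaSparkContext sc = new JavaSparkContext(conf);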
To run Spark Applications, should we install Spark on all the nodes of a YARN cluster?
No. Spark applications run on top of YARN, so Spark does not need to be installed on every node of the YARN
cluster. It is enough to have Spark available on the node from which applications are submitted; YARN distributes the required Spark runtime to the containers it allocates.
To run Spark Applications, should we install Spark on all the nodes of a Mesos cluster?
No. Spark applications run on top of Mesos, so Spark does not need to be installed on every node of a Mesos
cluster, as long as the Spark binary package is reachable by the Mesos agents (for example through spark.executor.uri, as described above).
More Spark interview questions are grouped under the following topics:
Spark RDD
Spark Parallelize
Topic Modelling
Spark SQL
Spark Others