Bigdata Notes


1. What file sizes are used in the environment?

Ans. Terabytes of data.

2. What is the cluster size and node size?

Ans. A 6-node cluster; each node has 2 TB of disk space and 16 GB of RAM.

3. What challenges have you faced in your project? Explain in detail.
Ans. 1. In PySpark, converting nested JSON to a DataFrame.
2. Creating a dynamic schema for the required databases from unstructured data.

4. Where is the data pushed: Hadoop or AWS?

Ans. After the data is preprocessed, the generated data is pushed into AWS (S3), HDFS, or Hive.

5. Explain big data real-time processing.

Ans. Real-time processing is done with Spark Streaming and Kafka working together.

6. What kind of analytics is performed on the collected video streams?

Ans. Image classification, object identification, and camera tampering detection.

7. How do you transfer the collected video data using Kafka topics?

Ans. Videos are broken into frames, and the frames are then pushed into Kafka topics.

The Kafka producers are written in Python. Multiple producers are created according to
the number of cameras; for each region, one producer creates a Kafka topic.
Frames are created at 30 fps.

Develop the code and create the producer data ingestion pipeline; on AWS, only S3 is used.

The consumer is written in Python, using Spark and Kafka.

Data is initially stored in HDFS or AWS S3.

Cluster setup is difficult, as is creating the Hive metadata.

PySpark, converting nested JSON to a DataFrame: we flattened the nested JSON to flat JSON and
then created the DataFrame.
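
The flattening step can be sketched in plain Python. This is only a minimal illustration, not the project's actual code: the function name `flatten_record`, the `_` separator, and the sample record are assumptions. In PySpark, the resulting flat dictionaries could then be used to build a DataFrame.

```python
import json


def flatten_record(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys
            flat.update(flatten_record(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical sample record, made up for illustration
raw = '{"camera": {"id": 7, "region": {"name": "north"}}, "fps": 30}'
flat = flatten_record(json.loads(raw))
print(flat)
```

Lists inside the JSON are left untouched by this sketch; exploding them into rows would be a separate step.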

query optimization

8. Describe OpenCV. Why only OpenCV for video streams?

Ans. OpenCV is an open-source library for computer vision, machine learning, and image
processing; it now plays a major role in real-time operation.

10. How long does it take to run your script in the production cluster? How did you
optimize the timings? What challenges have you faced?
Ans. 1. Each data pipeline takes a couple of minutes.

11. End-to-end project description, your role in it, the team size, and their
roles.
Ans. 1. Team size: 10.
2. From data ingestion through storing the data in the data lake.
3. The video stream produced by OpenCV is published to Kafka topics; Spark
Streaming consumes from the Kafka topics, we run analytics based
on the given problem, and the results are pushed to the data lake (HDFS, Hive, S3).

12. Which domain are you working in, service or product?


Ans. Service based

===================================================================================
=================================================(Project)
===================================================================================
===================================================(KAFKA)
KAFKA (https://data-flair.training/blogs/kafka-interview-questions/)

1. What is Kafka and how does it work?

Ans. Kafka is a distributed, publish-subscribe based, fault-tolerant messaging system.
It is fast, scalable, and distributed by design.
Kafka messages are persisted on disk and replicated within the cluster to
prevent data loss.
Kafka is built on top of the ZooKeeper synchronization service.
Kafka is very fast and is designed for zero downtime and zero data loss.
Kafka uses a leader and follower concept.
-producers
-kafka cluster
-broker
-topics - stream of messages that belong to a particular category
-topic log - Kafka stores topics in logs
-ZooKeeper
-partitions
-offset -> a sequence ID given to messages as they arrive in a partition
-consumers

2. How is redundancy achieved in Kafka? Redundancy is handled by setting up a key.

Ans. Through leaders and followers.
When a topic is created, it is partitioned and replicated. Among the replicas of a
partition, one is elected leader and the others follow it; the replicas are stored
distributed across the Kafka brokers. ZooKeeper helps with the leader election.

3. How is data transmitted in Kafka, as keys or values?

Ans. https://medium.com/@durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f - details about data storage
http://cloudurable.com/blog/kafka-architecture-topics/index.html

4. How does partitioning happen in Kafka, by key or by value?


Ans. https://sookocheff.com/post/kafka/kafka-in-a-nutshell/

5. How many partitions should be created in Kafka, and why?


Ans.

6. How is leader election done in Kafka?

Ans. Whenever a new topic is created, Kafka runs its leader election algorithm to
figure out the preferred leader of a partition.
The first replica in the list of replicas is the one elected as leader.
(https://medium.com/@mandeep309/preferred-leader-election-in-kafka-4ec09682a7c4)

7. What is an in-sync replica (ISR)?

Ans. The ISR is simply all the replicas of a partition that are "in-sync" with the
leader.
The definition of "in-sync" depends on the topic configuration, but by
default it means that a replica is, or has been, fully caught up with the leader
in the last 10 seconds.
The ISR consists of the leader replica and any additional follower replica
that is also considered in-sync. Followers replicate data from the leader to
themselves by sending Fetch Requests periodically, by default every 500 ms.
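
The ISR rule can be sketched as a small Python function. This is a toy model, not broker code: the function name, the broker names, and the timestamps are made up, and the 10-second threshold simply mirrors the default quoted in the answer.

```python
def in_sync_replicas(leader, last_caught_up, now, max_lag_seconds=10.0):
    """Return the leader plus every follower that caught up recently enough.

    last_caught_up maps a replica name to the time (in seconds) at which it
    was last fully caught up with the leader.
    """
    isr = [leader]  # the leader is always part of the ISR
    for replica, caught_up_at in sorted(last_caught_up.items()):
        if replica != leader and now - caught_up_at <= max_lag_seconds:
            isr.append(replica)
    return isr


# broker-2 caught up 5 s ago, broker-3 caught up 20 s ago (hypothetical values)
isr = in_sync_replicas("broker-1", {"broker-2": 95.0, "broker-3": 80.0}, now=100.0)
print(isr)
```

Here broker-3 drops out of the ISR because it has lagged for longer than the threshold.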

8. What is ZooKeeper, and what are its roles in Kafka and elsewhere?

Ans. A critical dependency of Apache Kafka is Apache ZooKeeper, which is a
distributed configuration and synchronization service.
ZooKeeper serves as the coordination interface between the Kafka brokers,
producers and consumers.
The Kafka servers share information via a ZooKeeper cluster.
Kafka stores basic metadata in ZooKeeper, such as information about topics,
brokers, consumer offsets and so on.
ZooKeeper is also used to recover from the previously committed offset if
any node fails, because offsets are committed to it periodically.
Leader election among the Kafka brokers is also done using ZooKeeper
in the event of leader failure.
Kafka uses ZooKeeper to store the offsets of messages consumed for a specific
topic and partition by a specific consumer group (this applies to older Kafka
versions; newer consumers store offsets in an internal Kafka topic).

9. Explain the role of the offset.


Ans. Messages in a partition are given a sequential ID number, which
we call an offset. We use these offsets to identify each message in the partition
uniquely.
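
A toy append-only log makes the idea concrete. This is a sketch for intuition only, not how Kafka stores data on disk; the class name `PartitionLog` and the message values are assumptions.

```python
class PartitionLog:
    """Toy in-memory log for a single partition."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # The offset is simply the position of the message in the log
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read_from(self, offset):
        # A consumer resumes from a stored offset and reads onward
        return list(enumerate(self._messages))[offset:]


log = PartitionLog()
offsets = [log.append(m) for m in ["frame-a", "frame-b", "frame-c"]]
resumed = log.read_from(1)
print(offsets, resumed)
```

Because offsets only ever increase within a partition, a consumer that remembers its last offset can resume exactly where it left off.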

10. What are main APIs of Kafka?


Ans. Apache Kafka has 4 main APIs:
-Producer API
-Consumer API
-Streams API
-Connector API

11. Explain the concept of Leader and Follower.


Ans. For every partition in Kafka, there is one server which acts as the Leader, and
zero or more servers which act as Followers.

12. What does ISR stand in Kafka environment?


Ans. ISR stands for in-sync replicas. It is the set of
message replicas that are in sync with the leader.

13. How do you define a Partitioning Key?


Ans. Within the Producer, the role of a Partitioning Key is to indicate the
destination partition of the message. By default, a hashing-based
Partitioner is used to determine the partition ID given the key.
Alternatively, users can also use customized Partitioners.
(https://medium.com/@stephane.maarek/the-kafka-api-battle-producer-vs-consumer-vs-kafka-connect-vs-kafka-streams-vs-ksql-ef584274c1e)
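
Hash-based partitioning can be sketched as below. This is an illustration under stated assumptions: the function name `choose_partition` is made up, and `zlib.crc32` is used only as a stand-in hash (Kafka's Java client uses murmur2), so the partition numbers will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition ID with a stable hash."""
    # Stand-in hash for illustration; a real Kafka client hashes differently
    return zlib.crc32(key) % num_partitions


first = choose_partition(b"camera-1", 3)
second = choose_partition(b"camera-1", 3)
print(first, second)
```

The useful property is that the same key always lands in the same partition, which keeps all messages for one key in order.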
===================================================================================
===================================================(Kafka)
===================================================================================
===================================================(SPARK)
SPARK

1. What is Spark?
Ans. Apache Spark is a cluster computing platform designed to be fast and general
purpose.
It can execute streaming as well as batch workloads.
It also integrates closely with other Big Data tools. In particular, Spark can
run in Hadoop clusters and access any Hadoop data source, including
Cassandra.
Spark does in-memory data processing.
The main abstraction of Spark is the RDD.

2. three cluster types for Spark?


Ans. standalone
Mesos
Hadoop yarn

3. Spark components?
Ans. Driver program -> the central entry point for the Spark shell; it runs the main function
of the application and creates the SparkContext.
The driver stores metadata about RDDs, their locations, and their
partitions.
SparkContext/SparkSession -> a client of Spark's execution environment
that acts as the master of the Spark application.
Cluster manager -> responsible for acquiring resources on the Spark cluster and for job
allocation.
Worker nodes -> responsible for executing tasks.
Executors
Tasks

4. What is the Spark driver program?

Ans. At a high level, every Spark application consists of a driver program that
launches various parallel operations on a cluster.
The driver program contains your application's main function and defines
distributed datasets on the cluster, then applies operations to them.
When you use the Spark shell, the driver program is the Spark shell itself.
Driver programs access Spark through a SparkContext object,
which represents a connection to a computing cluster.
In the shell, a SparkContext is automatically created for you as the variable
called sc.
The Spark driver is also responsible for
converting a user program into units of physical execution called tasks.

5. what is sparkContext?
Ans. A SparkContext is a client of Spark’s execution environment and it acts as the
master of the Spark application.
SparkContext is the entry gate of Apache Spark functionality.
SparkContext sets up internal services and establishes a connection to a Spark
execution environment.
The most important step of any Spark driver application is to generate
SparkContext.
After the SparkContext is created, you can create RDDs, accumulators and broadcast
variables, access Spark services, and run jobs (until the SparkContext stops).
It allows your Spark application to access the Spark cluster with the help of a
resource manager (YARN/Mesos).
To create a SparkContext, a SparkConf should be made first.
The SparkConf holds the configuration parameters that our Spark driver application
will pass to the SparkContext.

6. What is an RDD?
Ans. RDD stands for Resilient Distributed Dataset. An RDD is an abstract representation of
data which is divided into partitions and distributed across the
cluster.
This collection is made up of data partitions, each a small collection of
data stored in RAM or on disk.
An RDD is immutable, lazily evaluated, and cacheable.

7. How does Spark work on YARN? - A Hadoop cluster is configured, and on top of that we
install Spark.
Ans. A Spark application is launched on a set of machines using an external service
called a cluster manager.
Spark is packaged with a built-in cluster manager called the Standalone
cluster manager.
Spark also works with Hadoop YARN and Apache Mesos.
Spark has a spark driver
(https://www.youtube.com/watch?v=bPHQouyUGBk)

8. dataframe VS rdd vs datasets.


Ans. DataFrame -> DataFrame is a distributed collection of data organized into
named columns.
It is conceptually equivalent to a table in a relational database or
a data frame in Python but with richer optimizations.
RDD -> RDD stands for Resilient Distributed Dataset. It is a read-only,
partitioned collection of records.
The RDD is the fundamental data structure of Spark. It allows a programmer to
perform in-memory computations on large clusters in a fault-tolerant
manner.
DataSet -> A Dataset is a data structure in Spark SQL which is strongly typed and
maps to a relational schema.
It represents structured queries with encoders.
It is an extension of the DataFrame API.
A Dataset is a strongly typed collection of domain-specific objects
that can be transformed in parallel using functional or
relational operations.
Each Dataset also has an untyped view called a DataFrame, which is a
Dataset of Row.

9. Different DataFrame operations?

Ans. Different DataFrame operations include groupBy, sort, select, and join.

10. Difference between map and flatMap?

Ans. The map() transformation takes in a function and applies it to each element in
the RDD; the result of the function becomes the new value of each element in
the resulting RDD.
The flatMap() transformation is used to produce multiple output elements for each input
element.
As with map(), the function we provide to flatMap() is called individually
for each element in our input RDD.
Instead of returning a single element, the function returns an iterator with the return
values, and the resulting RDD contains the elements from all of the iterators.
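
The difference can be seen without a cluster by emulating both transformations on a plain Python list. This is a sketch of the semantics only; in PySpark the corresponding calls would be `rdd.map(...)` and `rdd.flatMap(...)`, and the sample lines are made up.

```python
from itertools import chain

lines = ["hello world", "hi"]

# map-like: exactly one output element per input element (here, a list of words)
mapped = [line.split(" ") for line in lines]

# flatMap-like: each input yields several outputs, flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

The map-like result keeps one nested list per line, while the flatMap-like result is a single flat list of words.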

11. what is fold and reduce?


Ans.
12. what is difference dataframe and dataset?
Ans. DATAFRAMES - A DataFrame gives a schema view of the data; basically, it is an
abstraction. In a DataFrame, the data is organized as columns with
column names and type information. In addition, we can say the data in a DataFrame is
the same as a table in a relational database.
A DataFrame is an immutable distributed collection of data. Unlike an
RDD, the data is organized into named columns, like a table in a relational
database. A DataFrame is a Dataset of Row.
The DataFrame API does not support compile-time type safety, which limits you
from manipulating data when the structure is not known.

DATASETS - In Spark, Datasets are an extension of DataFrames. Basically, the API has
two different characteristics: strongly typed and
untyped. Datasets are by default a collection of strongly typed JVM objects, unlike
DataFrames. Moreover, they use Spark's Catalyst optimizer.
Datasets in Apache Spark are an extension of the DataFrame API which
provides a type-safe, object-oriented programming interface.
A Dataset takes advantage of Spark's Catalyst optimizer by exposing
expressions and data fields to a query planner.
The Dataset API provides compile-time safety, which was not available in
DataFrames.

13. serialization concept in Spark.


Ans. Serialization is used in most distributed applications for performance.
A serialization framework helps you convert objects into a stream of bytes.
By default, Spark serializes objects using Java's ObjectOutputStream framework, and can
work with any class you create that implements java.io.Serializable.

14. spark RDD architecture


Ans

15. why is RDD immutable?


Ans. - Immutable data is always safe to share across multiple processes as well as
multiple threads.
- Since an RDD is immutable, we can recreate the RDD at any time (from the lineage
graph).
- If the computation is time-consuming, we can cache the RDD, which
results in a performance improvement.

16. what is Catalyst Query optimizer?


Ans. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced
programming language features to build an extensible query optimizer.
Catalyst is based on functional programming constructs in Scala and designed
with these two key purposes:
- Easily add new optimization techniques and features to Spark SQL.
- Enable external developers to extend the optimizer (e.g. adding data source
specific rules, support for new data types, etc.).
Catalyst is the optimization framework in Spark SQL.
It allows Spark to automatically transform SQL queries by adding new
optimizations to build a faster processing system.

17. what is Spark Tungsten(Spark Tungsten Execution Engine)?


Ans.

18. what is spark reduce explain it and what is fold and reduce?
Ans.

19. Explain DStreams.

Ans. The Spark Streaming abstraction is the DStream, or discretized stream, which represents
a continuous stream of data. DStreams can be created either from input data
streams from sources such as Kafka, or by applying high-level operations on other
DStreams. A DStream is represented as a sequence of RDDs.
DStreams can be created from various sources like Apache Kafka, HDFS, and
Apache Flume. DStreams have two kinds of operations:
- Transformations, which produce a new DStream.
- Output operations, which write data to an external system.

20. What are the logical and physical plans? (Spark)

Ans. Logical Plan: Let's say we have some code (DataFrame, Dataset, SQL). The
first step is the generation of the Logical Plan.
The Logical Plan is divided into three parts:
-Unresolved Logical Plan OR Parsed Logical Plan
-Resolved Logical Plan OR Analyzed Logical Plan OR Logical Plan
-Optimized Logical Plan
The Logical Plan is an abstraction of all the transformation steps that need to
be performed; it does not refer to anything about the Driver
(Master Node) or Executor (Worker Node). The SparkContext is responsible for
generating and storing it.
Basically, the Catalyst Optimizer performs the logical optimization.
Physical Plan: The Physical Plan is an internal optimization for
Spark. Once our Optimized Logical Plan is created,
the Physical Plan is generated. It simply specifies how our Logical Plan is going to be
executed on the cluster.
Spark generates different kinds of execution strategies and then keeps
comparing them in the "Cost Model".
Once the best Physical Plan is selected, it is time to generate
the executable code (DAG of RDDs) for the query that is to be executed
on the cluster in a distributed fashion. This process is called Codegen, and it is the
job of Spark's Tungsten Execution Engine.
(https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/)

21. What is a window in SQL, and what is the difference between GROUP BY and a window?
Ans. Window functions operate on a set of rows and return a single value for each
row from the underlying query.
The term window describes the set of rows on which the function operates.
A window function uses values from the rows in a window to calculate the
returned values.
GROUP BY, by contrast, collapses each group of rows into a single output row.
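
The contrast between grouping and windowing can be sketched in plain Python. The rows and column meanings are made up for illustration; the windowed result corresponds roughly to what `SUM(amount) OVER (PARTITION BY region)` would return in SQL.

```python
from collections import defaultdict

# Hypothetical (region, amount) rows
rows = [("north", 10), ("north", 30), ("south", 5)]

# Total per region, shared by both styles below
totals = defaultdict(int)
for region, amount in rows:
    totals[region] += amount

# GROUP BY style: one output row per group
grouped = sorted(totals.items())

# Window style: every input row is kept, with the group total attached
windowed = [(region, amount, totals[region]) for region, amount in rows]

print(grouped)
print(windowed)
```

The grouped result has two rows (one per region), while the windowed result still has three rows, each carrying its region total.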

22. How spark know that it is writing data into external location?
Ans.

23. What are Broadcast Variables(or broadcast join)?


Ans. Broadcast variables in Apache Spark are a mechanism for sharing read-only variables
across executors.
Without broadcast variables, these variables would be shipped to each executor
for every transformation and action, which can cause network overhead.
Broadcast variables are useful when large datasets need to be cached on
executors.
Broadcast variables also increase the efficiency of joins between small and
large RDDs: they allow keeping a read-only variable cached on
every machine instead of shipping a copy of it with tasks.
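
The idea behind a broadcast join can be sketched without Spark: the small table is held as a read-only lookup that every partition of the large dataset consults locally, so the large side is never shuffled. All names and values below are hypothetical.

```python
# Small table, conceptually "broadcast" once to every executor
region_names = {1: "north", 2: "south"}

# Large dataset split into partitions of (event_id, region_id) pairs
partitions = [[(101, 1), (102, 2)], [(103, 1)]]


def join_partition(partition, lookup):
    """Join one partition against the read-only lookup table."""
    return [(event_id, lookup.get(region_id)) for event_id, region_id in partition]


# Each partition is joined independently; no data moves between partitions
joined = [row for part in partitions for row in join_partition(part, region_names)]
print(joined)
```

In PySpark the lookup would typically be wrapped with `sc.broadcast(...)` and read through its `.value` attribute inside the tasks.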

24. what are Accumulators


Ans. Accumulators are shared variables that are used for aggregating information
across the executors.
Accumulators are a special kind of variable that we basically use to update
some data points across executors.
An accumulator is a type of shared variable that is only added to through
associative and commutative operations.
Using an accumulator, we can update the value of the variable in parallel while
executing.
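
Why the operation has to be associative and commutative can be seen in a small sketch: each partition produces a partial result, and the partial results can be merged in any order. This is plain Python for intuition, not the Spark accumulator API; the data and function name are made up.

```python
# Hypothetical data split across three partitions
partitions = [[3, -1, 4], [-1, 5], [9, -2]]


def count_negatives(partition):
    """Partial result computed independently on one partition."""
    return sum(1 for value in partition if value < 0)


partial_counts = [count_negatives(p) for p in partitions]

# Addition is associative and commutative, so merge order does not matter
total_negatives = sum(partial_counts)
order_independent = sum(reversed(partial_counts)) == total_negatives
print(total_negatives, order_independent)
```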

25. What is the difference between Temp View and Global Temp View?
Ans. Temporary views in Spark SQL are tied to the SparkSession that created the
view, and will not be available once the SparkSession is terminated.
Global temporary views are not tied to a single SparkSession; they can be shared
across multiple Spark sessions.

26. How are aggregations performed on DataFrames?


Ans. DataFrames have built-in functions that provide common aggregations
such as count(), countDistinct(), avg(), max(), min().

27. What Is Lineage Graph?


Ans. The representation of dependencies between RDDs is known as the lineage
graph. Lineage graph information is used to compute each RDD on demand, so
that whenever a part of a persistent RDD is lost, the lost data can be
recovered using the lineage graph information.
Spark does not replicate data in memory by default; thus, if any data is
lost, it is rebuilt using RDD lineage. RDD lineage is a process that
reconstructs lost data partitions. The best part is that an RDD always remembers how to
build itself from other datasets.
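
Lineage-based recovery can be sketched with a toy class that remembers how each partition is computed from its parent. `ToyRDD` and everything in it is hypothetical and greatly simplified; it only shows that a lost partition can be rebuilt from the recorded recipe rather than from a replica.

```python
class ToyRDD:
    """Toy dataset that records how to compute each partition (its lineage)."""

    def __init__(self, compute_partition, num_partitions):
        self.compute_partition = compute_partition
        self.num_partitions = num_partitions
        self.cache = {}

    def map(self, fn):
        parent = self
        # The child keeps a recipe that refers to the parent, not a copy of data
        return ToyRDD(lambda i: [fn(x) for x in parent.get(i)], self.num_partitions)

    def get(self, i):
        if i not in self.cache:
            self.cache[i] = self.compute_partition(i)  # (re)build on demand
        return self.cache[i]


source = [[1, 2], [3, 4]]
base = ToyRDD(lambda i: list(source[i]), 2)
doubled = base.map(lambda x: x * 2)

before = doubled.get(1)
del doubled.cache[1]      # simulate losing a cached partition
after = doubled.get(1)    # rebuilt from lineage
print(before, after)
```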

28. What Do You Understand By Pair Rdd?


Ans. Paired RDD is a distributed collection of data with the key-value pair.
It is a subset of Resilient Distributed Dataset(RDD). So it has all the
feature of RDD and some new feature for the key-value pair.
Special operations can be performed on RDDs in Spark using key/value pairs and
such RDDs are referred to as Pair RDDs.
Pair RDDs allow users to access each key in parallel.
These operations on Paired RDD are very useful to solve many use cases that
require sorting, grouping, reducing some value/function.

29. Port that Spark monitors?

Ans. 4040 (the default port of the Spark application web UI)

30. Directory already exists in spark what we will do?


Ans.

31. How much data is transfered in spark streaming?


Ans.

32. What is spark execution process?


Ans. Spark gives us two kinds of operations for solving any problem: transformations and
actions.
When we apply a transformation to an RDD, it gives us a new RDD, but it does not
start the execution of those transformations. The execution is performed only
when an action is performed on the new RDD and gives us a final result.
So once you perform an action on an RDD, the Spark context gives your program to
the driver.
The driver creates the DAG (directed acyclic graph), or execution plan (job),
for your program. Once the DAG is created, the driver divides this DAG into a
number of stages. These stages are divided into smaller tasks, and all the tasks are
given to the executors for execution.
(https://dzone.com/articles/how-spark-internally-executes-a-program)
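
Lazy evaluation can be sketched with a toy pipeline that only records transformations and runs them when an action is called. `LazyPipeline` is a hypothetical class for illustration, not Spark's implementation; the `calls` list exists only to show when the work actually happens.

```python
class LazyPipeline:
    """Toy pipeline: transformations are recorded, collect() triggers execution."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):
        return LazyPipeline(self.data, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self.data, self.steps + (("filter", fn),))

    def collect(self):
        # The action: run every recorded step in order
        result = list(self.data)
        for kind, fn in self.steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []


def traced_double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


pipeline = LazyPipeline([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # nothing has executed yet
result = pipeline.collect()
print(calls_before_action, result)
```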

33. what is spark core?


Ans. Spark Core is the fundamental unit of the whole Spark project. It provides all
sorts of functionality, such as task dispatching, scheduling, and input-output
operations.
Spark makes use of a special data structure known as the RDD (Resilient Distributed
Dataset). Spark Core is the home of the API that defines and manipulates RDDs.
Spark Core is the distributed execution engine, with all other functionality built
on top of it.
All the basic functionality of Apache Spark, such as in-memory computation, fault
tolerance, memory management, monitoring, and task scheduling, is provided by Spark
Core.

34. How spark is better than Hadoop?


Ans. Apache Spark is a lightning-fast cluster computing tool. It can be up to 100 times
faster than Hadoop MapReduce for some workloads, due to its in-memory data
processing.
Apache Spark is a general purpose data processing engine and is generally used
on top of HDFS.
Apache Spark is suitable for the variety of data processing requirements
ranging from Batch Processing to Data Streaming.

35. What is DAG?


Ans. A Directed Acyclic Graph (DAG) is a directed graph that contains no cycles.
In Spark, the DAG contains the set of all the operations that are applied to an
RDD. When any action is called on an RDD, Spark creates the DAG and submits it to the
DAG scheduler.
Only after the DAG is built does Spark create the query optimization plan.
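
A scheduler can only order work if the graph of operations is directed and free of cycles. The sketch below orders a small, made-up operation graph with Kahn's algorithm; the function name and the operation names are assumptions, and this is not Spark's DAG scheduler.

```python
from collections import deque


def topological_order(dependencies):
    """Order nodes so every node comes after the nodes it depends on.

    dependencies maps each node to the list of nodes it depends on.
    """
    remaining = {node: set(parents) for node, parents in dependencies.items()}
    ready = deque(sorted(n for n, parents in remaining.items() if not parents))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for other in sorted(remaining):
            if node in remaining[other]:
                remaining[other].discard(node)
                if not remaining[other]:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("cycle detected: not a DAG")
    return order


# Hypothetical job: two reads feed a join, which feeds a count
dag = {"read_logs": [], "read_users": [], "join": ["read_logs", "read_users"], "count": ["join"]}
order = topological_order(dag)
print(order)
```

If the graph contained a cycle, no valid execution order would exist, which is why the function raises an error in that case.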

36. Different running mode of spark?


Ans. local, standalone, cluster

37. Explain the levels of parallelism in Spark Streaming.

Ans. (1) Increase the number of receivers: If there are too many records for a single
receiver (single machine) to read in and distribute, that receiver becomes a bottleneck.
So we can increase the number of receivers, depending on the scenario.
(2) Re-partition the received data: If one is not in a position to increase
the number of receivers, redistribute the data by
re-partitioning.
(3) Increase parallelism in aggregation :

38. what happen if there is a latency or late data in spark streaming?


Ans.

39. How does Hive connect to Spark?

Ans. Since Spark SQL connects to the Hive metastore using Thrift, we need to provide
the Thrift server URI while creating the Spark session.

40. Do you need to install Spark on all nodes of YARN cluster?


Ans. No, because Spark runs on top of YARN. Spark runs independently of where it is
installed. Spark has options to use YARN when dispatching jobs
to the cluster, rather than its own built-in manager or Mesos. Further, there are
some configurations for running on YARN. They include master, deploy-mode, driver-memory,
executor-memory, executor-cores, and queue.

41. What is Executor Memory in a Spark application?


Ans. Every Spark application has the same fixed heap size and fixed number of cores for
each Spark executor. The heap size is what is referred to as the Spark executor
memory, which is controlled with the spark.executor.memory property or the
--executor-memory flag.
Every Spark application will have one executor on each worker node. The
executor memory is basically a measure of how much of the worker node's memory
the application will utilize.

42. Define Partitions in Apache Spark.


Ans. As the name suggests, partition is a smaller and logical division of data
similar to ‘split’ in MapReduce. It is a logical chunk of a large
distributed data set.
Partitioning is the process of deriving logical units of data to speed up
processing. Spark manages data using partitions that help parallelize
distributed data processing with minimal network traffic for sending data between
executors. By default, Spark tries to read data into an RDD from the nodes
that are close to it. Since Spark usually accesses distributed partitioned data, to
optimize transformation operations it creates partitions to hold the data
chunks. Everything in Spark is a partitioned RDD.

43. Can you use Spark to access and analyze data stored in Cassandra databases?
Ans. Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark to a
Cassandra cluster, a Cassandra Connector will need to be added to the Spark
project. In the setup, a Spark executor will talk to a local Cassandra node and
will only query for local data. It makes queries faster by reducing the
usage of the network to send data between Spark executors (to process data) and
Cassandra nodes (where data lives).

44. Explain a scenario where you will be using Spark Streaming.


Ans. When it comes to Spark Streaming, the data is streamed in near real-time onto
our Spark program.
Twitter Sentiment Analysis is a real-life use case of Spark Streaming.
Trending Topics can be used to create campaigns and attract a larger
audience. It helps in crisis management, service adjusting and target marketing.
Sentiment refers to the emotion behind a social media mention online.
Sentiment Analysis is categorizing the tweets related to a particular topic
and performing data mining using Sentiment Automation Analytics Tools.
Spark Streaming can be used to gather live tweets from around the world into
the Spark program. This stream can be filtered using Spark SQL and then we
can filter tweets based on the sentiment. The filtering logic will be implemented
using MLlib where we can learn from the emotions of the public and change
our filtering scale accordingly.

45. Is it possible to run Apache Spark on Apache Mesos?


Ans. Yes, Apache Spark can be run on hardware clusters managed by Mesos.
In a standalone cluster deployment, the cluster manager
is a Spark master instance. When using Mesos, the Mesos master replaces
the Spark master as the cluster manager. Mesos determines which machines handle which
tasks. Because it takes into account other frameworks when scheduling these
many short-lived tasks, multiple frameworks can coexist on the same cluster without
resorting to a static partitioning of resources.

===================================================================================
===================================================(spark)
===================================================================================
=========================================(HADOOP AND YARN)
HADOOP and YARN and Big data

1. What is a bag?
Ans. Pig Latin works on relations.
A relation is a bag.
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.

2. What is YARN?
Ans. YARN - Yet Another Resource Negotiator (a global resource manager that can run any
number of distributed applications at the same time on the same cluster).
YARN is the Hadoop processing layer; it contains
- resource manager
- node manager
- containers
- job scheduler
YARN allows multiple data processing engines to run in a single Hadoop cluster
- batch programs (Spark, MapReduce)
- advanced analytics (Spark, Impala)
- interactive SQL (Impala)
- streaming (Spark Streaming)
YARN daemons
- resource manager
  - runs on the master node
  - global resource scheduler
- node manager
  - runs on the slave nodes
  - communicates with the resource manager

3. Types of scheduler in YARN?

Ans. FIFO, Capacity and Fair (https://medium.com/@bilalmhassan/schedulers-in-yarn-concepts-to-configurations-5dd7ced6c214)
FIFO - First in, first out. It runs the applications in submission order
by placing them in a queue.
Capacity - Maintains a separate queue for small jobs in order to start them
as soon as a request is submitted.
Fair - When a job starts, if it is the only job running, it gets all
the resources of the cluster.
When a second job starts, it gets resources as soon as some
containers become free.
After the small job finishes, the scheduler assigns the resources to the
large one.

4. cmd to copy from one node to another?


Ans. hadoop distcp <source path> <destination path>

5. What is Hadoop?
Ans. Hadoop is a framework that allows us to store and process large datasets in a
parallel and distributed fashion.
Hadoop has
HDFS - used for storage; it allows storing data of various formats across the
cluster.
- distributed file system, scalable, with fast access.
- no schema needed before dumping data
- horizontal scaling as per requirement (adding more data nodes is horizontal
scaling; adding more resources (RAM, CPU) is vertical scaling)
- name node -> contains the metadata of the data stored in the data nodes.
-> master daemon that maintains and manages the data nodes.
-> two files are associated with the metadata
- fsimage -> contains the complete state of the file system since the
start of the name node.
- edit logs -> all recent modifications made to the file system
- data node -> stores the actual data and also holds replicated data.
-> sends heartbeats to the name node (every 3 seconds)
-> sends block reports to the name node
-> slave node, commodity hardware
- secondary name node -> works concurrently with the name node as a helper
daemon to the name node
- once data is dumped into HDFS, data blocks are created (128 MB is the
default block size) and stored across the data nodes.
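
The block arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch: the 128 MB block size comes from the notes, while the replication factor of 3 (the usual HDFS default), the function name, and the 1000 MB file are assumptions.

```python
import math

BLOCK_SIZE_MB = 128       # default block size mentioned in the notes
REPLICATION_FACTOR = 3    # assumed; the usual HDFS default


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB across all replicas)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies the space it needs, so storage scales
    # with the file size rather than with blocks * block size
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_storage_mb


blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)
```

A 1000 MB file is split into 8 blocks (seven full blocks plus one partial block) and occupies about 3000 MB of raw disk across the cluster with three replicas.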

Hadoop MapReduce - MapReduce is the computing layer that is responsible for
data processing.
Applications are written with it to process unstructured and
structured data stored in HDFS.
It is responsible for the parallel processing of high volumes
of data by dividing the data into independent tasks.
The processing is done in two phases, Map and Reduce.
Map is the first phase of processing, where the
complex logic is specified; Reduce is the second phase of
processing, where light-weight operations are specified.
Map - In this phase, the input data is split among map tasks. The
map tasks run in parallel. The split data is used for
analysis.
Reduce - In this phase, similar split data is aggregated from
the entire collection and the result is produced.
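
The two phases can be sketched with the classic word count, run in a single Python process. This only illustrates the data flow (map, then grouping by key, then reduce); it is not Hadoop code, and the function names and input splits are made up.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


splits = ["big data big", "data lake"]  # each split would go to its own map task
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```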

YARN -> performs all the processing activity and scheduling
-> one resource manager (with a standby resource manager) per cluster
-> one node manager per data node
-> the scheduler tracks how far along a task is on a node manager
-> when a job is submitted, the request goes to the resource manager, and parts of the job
are sent to node managers
-> a node manager has two parts after getting a job from the resource manager
to do the task
- App master - where the actual processing happens
- container - contains the resources that a job requires
(the execution environment)
-> the container sends its report to the scheduler, which is part of the resource
manager
-> the resource manager finds a node manager and requests it to launch a
container (App master)
-> the resource manager has an App manager part
- when the App master requests extra resources, the request goes to the App manager and
those resources are allocated to the App master.
-> MapReduce - the processing unit; it allows parallel processing of
data that is stored across the HDFS cluster.
-> resource manager (job tracker)
-> node manager (task tracker)

6. replication factor cmd?


Ans. hdfs dfs -setrep -w 2 /<path>

7. Where does Hadoop lag behind Spark?

Ans. Hadoop can't process data in real time, whereas Spark can do near-real-time
processing.
Spark processes data faster than Hadoop.

8. What is the HDFS architecture?
Ans. -name node
-secondary name node
-data node

9. high data availability of hadoop or spark or big data


Ans.

12. how do we check hadoop cluster configuration?


Ans.

13. what is fsck?


Ans. fsck stands for File System Check. It is a command used by HDFS.
This command is used to check inconsistencies and if there is any problem in
the file. For example,
if there are any missing blocks for a file, HDFS gets notified through this
command.

14. What are the main differences between NAS (Network-attached storage) and HDFS?
Ans. HDFS runs on a cluster of machines while NAS runs on an individual machine.
Because HDFS replicates every block across machines, data redundancy is
inherent in HDFS. NAS uses a different replication protocol,
so the chances of data redundancy are much less.
In HDFS, data is stored as data blocks on the local drives of the machines.
In NAS, it is stored on dedicated hardware.

15. What is the Command to format the NameNode?


Ans. hdfs namenode -format

16. Will you optimize algorithms or code to make them run faster?
Ans. “Yes.” Real world performance matters and it doesn’t depend on the data or
model you are using in your project.

17. How would you transform unstructured data into structured data?
Ans.

18. What happens when two users try to access the same file in the HDFS?
Ans. The HDFS NameNode supports exclusive writes only. Hence, only the first user
will receive the grant to write to the file and the second user's write request
will be rejected. Multiple users can still read the same file at the same time.

19. How to recover a NameNode when it is down?


Ans. 1. Use the FsImage, which is the file system metadata replica, to start a new
NameNode.
2. Configure the DataNodes and also the clients to make them acknowledge the
newly started NameNode.
3. Once the new NameNode has finished loading the last checkpoint FsImage and
has received enough block reports from the DataNodes, it will
start to serve clients.
4. In large Hadoop clusters, the NameNode recovery process consumes
a lot of time, which turns out to be a significant challenge in case of
routine maintenance.

20. What do you understand by Rack Awareness in Hadoop?


Ans. It is an algorithm applied by the NameNode to decide how blocks and their
replicas are placed.
Based on the rack definitions, network traffic between racks is minimized by
preferring DataNodes within the same rack.
With a replication factor of 3, two copies are placed on one rack
and the third copy on a separate rack.

21. What is the difference between “HDFS Block” and “Input Split”? And what is
block scanner?
Ans. HDFS divides the input data physically into blocks for storage; each one is
known as an HDFS Block.
An Input Split is a logical division of the data that a mapper consumes for
the map operation.
Block Scanner - the Block Scanner tracks the list of blocks present on a DataNode
and verifies them to find any kind of checksum errors. Block Scanners use a
throttling mechanism to reserve disk bandwidth on the datanode.
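
A small arithmetic sketch of the physical division into blocks. The 128 MB block size is an assumption (it is the usual dfs.blocksize default in Hadoop 2 and later), and the helper names are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; configurable through dfs.blocksize

def hdfs_block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of physical HDFS blocks needed to store a file."""
    return math.ceil(file_size_mb / block_size_mb)

def last_block_size_mb(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Size of the final block, which is usually only partly filled."""
    remainder = file_size_mb % block_size_mb
    return remainder if remainder else block_size_mb

# A hypothetical 300 MB file: two full blocks plus one partial block.
print(hdfs_block_count(300))
print(last_block_size_mb(300))
```

An input split, being logical, does not have to end exactly where a block ends, so a record that crosses a block boundary can still be read whole by one mapper.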

22. What are the common input formats in Hadoop?


Ans. Below are the common input formats in Hadoop –
Text Input Format – The default input format defined in Hadoop is the Text
Input Format. Each line is a record.
Sequence File Input Format – To read sequence files (Hadoop's binary key/value
files), the Sequence File Input Format is used.
Key Value Input Format – The input format used for plain text files (files
broken into lines) where each line is split into a key and a value.

23. Explain some important features of Hadoop.


Ans. Hadoop supports the storage and processing of big data.
It is the best solution for handling big data challenges. Some important
features of Hadoop are –
Open Source
Distributed Processing – Hadoop supports distributed processing of data, i.e.
faster processing.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three
replicas for each block at different nodes, by default.
Reliability – Hadoop stores data on the cluster in a reliable manner
that is independent of any one machine. So, the data stored in the Hadoop
environment is not affected by the failure of a machine.
Scalability – Another important feature of Hadoop is scalability. It
is compatible with different hardware and we can easily
add new nodes to the cluster.
High Availability – The data stored in Hadoop is available to access even
after a hardware failure.
In case of hardware failure, the data can be accessed
from another path.

24. Explain the different modes in which Hadoop runs.


Ans. Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e. on a
non-distributed, single node. This mode uses the local file system
to perform input and output operations.
Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a
single node just like the Standalone mode. In this mode, each
daemon runs in a separate Java process.
Fully-Distributed Mode – Hadoop runs on a cluster of multiple nodes, with the
daemons spread across separate machines.

25. What are the Port Numbers for NameNode, Task Tracker, and Job Tracker?
Ans. NameNode – Port 50070
Task Tracker – Port 50060
Job Tracker – Port 50030

26. Explain the process that overwrites the replication factors in HDFS.
Ans. $ hadoop fs -setrep -w 2 /my/test_file

27.
===================================================================================
=========================================(Hadoop and yarn)
===================================================================================
====================================================(HIVE)
HIVE - IN DETAILS

1. Difference b/w managed table and external table in Hive?


Ans. Managed tables are Hive owned tables where the entire lifecycle of the tables’
data are managed and controlled by Hive.
All the write operations to the Managed tables are performed using Hive SQL
commands. If a Managed table or partition is dropped, the data and metadata
associated with that table or partition are deleted.

External tables are tables where Hive has loose coupling with the data.
The writes on External tables can be performed using Hive SQL commands but
data files can also be accessed and managed by processes outside of Hive.
If an External table or partition is dropped,only the metadata associated with
the table or partition is deleted but the underlying data files stay intact.
Hive supports replication of External tables with data to target cluster and
it retains all the properties of External tables.

2. what is hive?
Ans. Data warehousing package built on top of hadoop and is used for analyzing
structured and semi-structured data.
Used for data analytics
provide tools to enable easy data ETL.
It provides a mechanism to project structure onto the data and perform queries
written in HQL that are similar to SQL statements.
Internally, these queries or HQL gets converted to map reduce jobs by the Hive
compiler.

3. What are the different types of tables available in Hive?


Ans. There are two types: managed table and external table.
In a managed table both the data and the schema are under the control of Hive, but in
an external table only the schema is under the control of Hive.

4. Is Hive suitable to be used for OLTP systems? Why?


Ans. No. Hive is built for batch analytics and does not provide low-latency insert
and update at row level. So it is not suitable for OLTP systems.

5. What is a metastore in Hive?


Ans. It is a relational database storing the metadata of hive tables, partitions,
Hive databases etc.
Metastore in Hive stores the metadata information using an RDBMS and an open
source ORM (Object Relational Mapping) layer called DataNucleus, which
converts the object representation into a relational schema and vice versa.

6. What are the three different modes in which hive can be run?
Ans. Local mode
Distributed mode
Pseudodistributed mode

7. What are collection data type in Hive?


Ans. Three types Array, map and struct

8. What do you mean by schema on read?


Ans. The schema is validated with the data when reading the data and not enforced
when writing data.
9. Where does the data of a Hive table gets stored?
Ans. By default, the Hive table is stored in an HDFS directory –
/user/hive/warehouse.
One can change it by specifying the desired directory in the
hive.metastore.warehouse.dir configuration parameter present in hive-site.xml.

10. Why Hive does not store metadata information in HDFS?


Ans. The reason for choosing RDBMS is to achieve low latency as HDFS read/write
operations are time consuming processes.

11. What is the difference between local and remote metastore?


Ans. Local Metastore: In the local metastore configuration, the metastore service
runs in the same JVM as the Hive service and
connects to a database running in a separate JVM, either on the same
machine or on a remote machine.
Remote Metastore: In the remote metastore configuration, the metastore
service runs in its own separate JVM and not in the Hive service JVM.
Other processes communicate with the metastore server using Thrift network APIs.
You can have one or more metastore servers in this case to
provide more availability.

12. What is the default database provided by Apache Hive for metastore?
Ans. By default, Hive provides an embedded Derby database instance backed by the
local disk for the metastore. This is called the embedded metastore
configuration.

13. What is a Hive variable? What do we use it for?


Ans. Hive variables are created in the Hive environment and are
referenced by Hive scripts.
They allow some values to be passed to a Hive query when the query starts
executing. They are used with the source command.

14. What are Buckets in Hive?


Ans. Buckets in Hive are used to segregate Hive table data into multiple files or
directories. They are used for efficient querying.
Hive bucketing decomposes table data sets into more manageable parts, based
on the hash of the bucketing column.
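
A rough sketch of how rows could be assigned to buckets. It assumes a table declared with something like CLUSTERED BY (emp_id) INTO 4 BUCKETS and, as a simplification, treats the hash of an integer key as the integer itself. The column name, the bucket count and the rows are all hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(emp_id, num_buckets=NUM_BUCKETS):
    # Simplifying assumption: the hash of an integer key is the integer itself.
    return emp_id % num_buckets

def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Distribute rows into buckets by hash(key) mod number of buckets."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row["emp_id"], num_buckets)].append(row)
    return dict(buckets)

rows = [{"emp_id": i, "name": f"emp{i}"} for i in range(1, 9)]
buckets = bucketize(rows)
for bucket_id in sorted(buckets):
    print(bucket_id, [r["emp_id"] for r in buckets[bucket_id]])
```

Because equal keys always land in the same bucket, a query that filters or joins on the bucketing column only has to read the matching bucket files.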

15. How to skip header rows from a table in Hive?


Ans. add TBLPROPERTIES("skip.header.line.count"="2") while creating the table
(this skips the first 2 lines; use "1" for a single header row).

16. What is the maximum size of a string data type supported by Hive?
Ans. 2 GB

17. What is the available mechanism for connecting applications when we run Hive
as a server?
Ans. Thrift Client: Using Thrift, we can call Hive commands from various
programming languages, such as C++, PHP, Java, Python, and Ruby.
JDBC Driver: JDBC Driver enables accessing data with JDBC support, by
translating calls from an application into SQL and passing the SQL
queries to the Hive engine.
ODBC Driver:It implements the ODBC API standard for the Hive DBMS,enabling
ODBC-compliant applications to interact seamlessly with Hive.

18. what is SerDe?


Ans. SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for
IO.
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row
object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
The interface handles both serialization and deserialization and also
interpreting the results of serialization as individual fields for
processing.
Deserializer:-> The Hive deserializer converts record (string or binary) into
a java(Row) object that Hive can process (modify).
Serializer: -> Now, the Hive serializer will take this Java object, convert it
into suitable format that can be stored into HDFS.
Basically, a SerDe is responsible for converting the record bytes into
something that can be used by Hive. Hive comes with several SerDes, like
a JSON SerDe for JSON files, a CSV SerDe for CSV files etc.

19. what is map side join(a.k.a broadcast join) in hive?


Ans. A map-side join loads the smaller table into memory, ensuring a very fast
join operation that is performed entirely within the mapper, without
having to use both map and reduce phases.
SELECT /*+ MAPJOIN(dataset2) */ dataset1.first_name, dataset1.eid, dataset2.eid
FROM dataset1 JOIN dataset2 ON dataset1.first_name = dataset2.first_name;
hive.auto.convert.join=true -> lets Hive convert joins to map joins automatically
hive.mapjoin.smalltable.filesize=25000000 -> size threshold for the small table
(default is 25MB)
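
The idea behind a map-side join can be sketched in plain Python: the small table is loaded into an in-memory hash table once, and the big table is streamed past it. This is an illustration only, not how Hive implements it; the sample rows are made up, and only the table and column names come from the query above.

```python
def map_side_join(big_rows, small_rows, key):
    # Build phase: load the small table into an in-memory hash table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: each mapper streams its share of the big table and joins
    # locally, so no reduce phase (and no shuffle) is needed.
    joined = []
    for row in big_rows:
        for match in lookup.get(row[key], []):
            joined.append((row["first_name"], row["eid"], match["eid"]))
    return joined

# Hypothetical data for dataset1 (big) and dataset2 (small).
dataset1 = [
    {"first_name": "asha", "eid": 1},
    {"first_name": "ravi", "eid": 2},
    {"first_name": "john", "eid": 3},
]
dataset2 = [
    {"first_name": "asha", "eid": 10},
    {"first_name": "john", "eid": 30},
]
result = map_side_join(dataset1, dataset2, "first_name")
print(result)
```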

20. What is Map Join in Hive?


Ans. Apache Hive Map Join is also known as Auto Map Join, or Map Side Join, or
Broadcast Join. And refer Q19.

21. Different types of partitions?


Ans. Static and dynamic.

===================================================================================
====================================================(Hive)
===================================================================================
=================================================(GENERAL)
BIG DATA CONCEPTS

10. different types of files?


Ans. CSV, TSV, XML, txt,
json - JavaScript Object Notation
Avro - row-based binary format that stores the schema with the data
orc - Optimized Row Columnar format
parquet - is a columnar format.

11. what is data pipeline - data ingestion pipeline, data extraction pipeline,
data preprocessing pipeline
Ans. A data pipeline connects two or more operations together.
A data ingestion pipeline connects NiFi and Kafka together.
A data preprocessing pipeline connects Hive and Spark together.

12. What is cross join?


Ans. In SQL, the CROSS JOIN is used to combine each row of the first table with each
row of the second table. It is also known as the Cartesian join since it
returns the Cartesian product of the sets of rows from the joined tables.
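
A quick illustration of the Cartesian product in Python, using two made-up "tables" held as lists. The result has (rows in first) x (rows in second) rows.

```python
from itertools import product

# Hypothetical single-column tables.
sizes = ["S", "M"]
colors = ["red", "blue", "green"]

# Every row of the first table is paired with every row of the second.
cross = list(product(sizes, colors))

for row in cross:
    print(row)
print(len(cross))
```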

===================================================================================
=================================================(general)
===================================================================================
===================================================(SCALA)
SCALA

1. Scala vs Java?
Ans. Scala (https://www.geeksforgeeks.org/scala-vs-java/)
Scala is a mixture of both object oriented and functional programming.
Scala is less readable due to nested code.
The process of compiling source code into byte code is slow.
Scala support operator overloading.

Java
Java is a general purpose object oriented language.
Java is more readable.
The process of compiling source code into byte code is fast.
Java does not support operator overloading.

2. Describe Scala and its features?


Ans.

3. Explain what is Scala?


Ans. Scala is an object functional programming and scripting language for general
software applications designed to express solutions in a concise manner.

4. What is a ‘Scala set’? What are methods through which operation sets are
expressed?
Ans. A Scala set is a collection of pairwise different elements of the same type. A
Scala set does not contain any duplicate elements. There are two kinds of sets,
mutable and immutable.

5. What is a ‘Scala map’?


Ans. A Scala map is a collection of key/value pairs. Any value can
be retrieved based on its key. Values are not unique but keys are unique in the Map.

6. What is the advantage of Scala?


Ans. Less error prone functional style
High maintainability and productivity
High scalability
High testability
Provides features of concurrent programming

7. In what ways Scala is better than other programming language?


Ans. Scala arrays use regular generics, while in other languages generics are bolted
on as an afterthought and are completely separate but have overlapping
behaviours with arrays.
Scala has immutable “val” as a first class language feature. The “val” of Scala
is similar to Java final variables. Contents may mutate but the top reference is
immutable.
Scala lets ‘if blocks’, ‘for-yield loops’, and ‘code’ in braces return a
value. This is preferable, and eliminates the need for a separate
ternary operator.
Scala has singleton objects rather than C++/Java/C# classic static. It is
a cleaner solution.
Persistent immutable collections are the default and built into the standard
library.
It has native tuples and concise code.
It has little boilerplate code.

8. Mention the difference between an object and a class ?


Ans. A class is a definition, a description. It defines a type in terms of
methods and composition of other types.
A class is a blueprint for objects, while an object (declared with the object
keyword) is a singleton, a unique instance of a class.
An anonymous class is created for every object in the code; it inherits from
whatever classes you declared the object to implement.
A class combines data and its methods, whereas an object is one particular
instance of a class.

9. What is tail recursion in Scala?


Ans. Recursion is when a function calls itself, either directly or indirectly (for
example, function ‘A’ calls function ‘B’, which calls function ‘A’ again).
It is a technique used frequently in functional programming. For a function to be
tail recursive, the call back to the function must be the last operation
performed.

10. What is ‘scala trait’ in scala?


Ans. ‘Traits’ are used to define object types specified by the signature of the
supported methods.
Scala allows traits to be partially implemented, but traits may not have
constructor parameters (in Scala 2). A trait consists of method and field
definitions; by mixing them into classes they can be reused.
A trait can be defined as a unit which encapsulates methods and their
variables or fields.

11. What is Case Classes?


Ans. Case classes provide a recursive decomposition mechanism via pattern
matching. They are regular classes which export their constructor
parameters. The constructor parameters of case classes can be accessed
directly and are treated as public values.
A case class is just like a regular class, with features for modeling
immutable data. It is also useful in pattern matching.
It is defined with the case modifier. Thanks to the case keyword, the compiler
generates boilerplate code (such as equals, hashCode, toString and copy) that
would otherwise have to be written in many places with little or no alteration.

12. What is the use of tuples in scala?


Ans. A Scala tuple combines a fixed number of items together so that they can be
passed around as a whole. A tuple is immutable and can hold objects of
different types, unlike an array or list.

13. Why scala prefers immutability?


Ans. Scala prefers immutability in design and in many cases uses it as default.
Immutability can help when dealing with equality issues or concurrent
programs.

14. Explain how Scala is both Functional and Object-oriented Programming Language?
Ans. Scala treats every single value as an Object which even includes Functions.
Hence, Scala is the fusion of both Object-oriented and Functional programming
features.

15. Explain Streams in Scala.


Ans. In simple words, a Stream is a lazy list which evaluates its elements
only when it needs to. This sort of lazy computation enhances the performance of
the program.

16. Mention the Advantages of Scala


Ans. Some of the major Advantages of Scala are as follows:
It is highly Scalable
It is highly Testable
It is highly Maintainable and Productive
It facilitates Concurrent programming
It is both Object-Oriented and Functional
It has no Boilerplate code
Singleton objects are a cleaner solution than Static
Scala Arrays use regular Generics
Scala has Native Tuples and Concise code

17. Why do we need App in Scala?


Ans. App is a helper trait that provides the main method and holds its members
together. The App trait can be used to quickly turn objects into executable
programs. We can have our objects extend App so that their body becomes the
executable code.

18. Mention how Scala is different from Java


Ans. A few scenarios where Scala differs from Java are as follows:
All values are treated as Objects.
Scala supports Closures
Scala Supports Concurrency.
It has Type-Inference.
Scala can support Nested functions.
It has DSL support [Domain Specific Language]
Traits

19. How is the Scala code compiled?


Ans. Code is written in a Scala IDE or the Scala REPL. Later, the code is compiled
into byte code and passed to the JVM (Java Virtual Machine) for
execution.

20. what is Currying Functions ?


Ans. Currying in Scala is a technique for transforming a
function that takes multiple arguments into a chain of functions that
each take a single argument. It is applied widely in functional languages.
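
The transformation can be sketched in Python (used here only for illustration; Scala expresses the same idea with multiple parameter lists). The function names are made up.

```python
def add(x, y, z):
    """An ordinary function that takes all three arguments at once."""
    return x + y + z

def curried_add(x):
    """Curried form of add: a chain of functions taking one argument each."""
    def take_y(y):
        def take_z(z):
            return add(x, y, z)
        return take_z
    return take_y

# Supplying the arguments one at a time gives the same result as add(1, 2, 3).
print(curried_add(1)(2)(3))

# A partially applied function can be stored and reused.
add_one = curried_add(1)
print(add_one(10)(100))
```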

21. val lst = List(1,2,3....,100) output should be List((1,2),(2,3),....(100,101))

Ans. lst.map(x => (x, x + 1))

22. List((1,2),(2,3),.... (100,101)) out put should be List(3,5,..... 201)


Ans. lst.map(r => r._1 + r._2)

23. val x =(1,(2,(3,(4,5)))) select 4 from the nested tuple

Ans. x._2._2._2._1

24. what are higher order functions in scala.


Ans. A higher-order function takes other functions as a parameter or returns a
function as a result.

===================================================================================
===================================================(scala)
===================================================================================
==================================================(PYTHON)
PYTHON

1. what is parallel processing in python?


Ans. Parallel processing is a mode of operation where the task is executed
simultaneously in multiple processors in the same computer.
It is meant to reduce the overall processing time.
In python, the multiprocessing module is used to run independent parallel
processes by using subprocesses (instead of threads).
It allows you to leverage multiple processors on a machine (both Windows and
Unix), which means, the processes can be run in completely separate
memory location
(https://www.machinelearningplus.com/python/parallel-processing-python/)
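
A minimal sketch with the multiprocessing module. The function names and the worker count are illustrative choices, and the `if __name__ == "__main__":` guard is there because some platforms start workers by re-importing the script.

```python
from multiprocessing import Pool

def square(n):
    """CPU-bound work that each worker process runs independently."""
    return n * n

def parallel_squares(numbers, workers=2):
    # Each worker is a separate process with its own memory space.
    with Pool(processes=workers) as pool:
        return pool.map(square, numbers)

if __name__ == "__main__":
    print(parallel_squares([1, 2, 3, 4]))
```

pool.map splits the input among the worker processes and returns the results in the original order.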

2. what is multi-threading in python?


Ans.

3. how can you change the way two instances of a specific class behave on
comparison?
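
One common approach is to define the rich comparison methods (`__eq__`, `__lt__` and so on) on the class. The sketch below uses a made-up Version class, and functools.total_ordering fills in the remaining comparison operators from the two that are defined.

```python
from functools import total_ordering

@total_ordering
class Version:
    """Hypothetical class whose instances compare by (major, minor)."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _key(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Defining __eq__ removes the default hash, so restore a matching one.
        return hash(self._key())

print(Version(1, 2) == Version(1, 2))
print(Version(1, 2) < Version(1, 10))
print(max(Version(2, 0), Version(1, 9)).major)
```

Without `__eq__`, two instances with the same field values would compare as different objects, because the default comparison is by identity.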

===================================================================================
==================================================(python)
===================================================================================
=====================================================(AWS)
AWS

1. Define and explain the three basic types of cloud services and the AWS
products that are built based on them?
Ans. Computing - These include EC2, Elastic Beanstalk, Lambda, Auto-Scaling, and
Lightsail.
Storage - These include S3, Glacier, Elastic Block Storage, Elastic File
System.
Networking - These include VPC, Amazon CloudFront, Route53

2. What is the relation between the Availability Zone and Region?


Ans. Regions are separate geographical areas, like US West (N. California,
us-west-1) and Asia Pacific (Mumbai, ap-south-1).
On the other hand, availability zones are the areas that are present inside
the regions.
These are generally isolated zones, and resources can be replicated across
them whenever required.

3. What is auto-scaling?
Ans. Auto-scaling is a function that allows you to provision and launch new
instances whenever there is a demand.
It allows you to automatically increase or decrease resource capacity in
relation to the demand.

4. What is geo-targeting in CloudFront?


Ans. Geo-Targeting is a concept where businesses can show personalized content to
their audience based on their geographic location without changing the
URL.
This helps you create customized content for the audience of a specific
geographical area, keeping their needs in the forefront.

5. What are the steps involved in a CloudFormation Solution?


Ans. 1. Create or use an existing CloudFormation template using JSON or YAML
format.
2. Save the code in an S3 bucket, which serves as a repository for the code.
3. Use AWS CloudFormation to call the bucket and create a stack on your
template.
4. CloudFormation reads the file and understands the services that are called,
their order, the relationship between the services, and provisions
the services one after the other.
6. What services can be used to create a centralized logging solution?
Ans. The essential services that you can use are Amazon CloudWatch Logs to collect
the logs, Amazon S3 to store them, and Amazon Elasticsearch to visualize them.
You can use Amazon Kinesis Firehose to move the data from Amazon S3 to Amazon
Elasticsearch.

7. what is a DDoS attack, and what services can minimize them?


Ans. DDoS is a cyber-attack in which the perpetrator accesses a website and creates
multiple sessions so that other legitimate users cannot access the
service. The native tools that can help you deny DDoS attacks on your AWS
services are:
AWS Shield, AWS WAF, Amazon Route53, Amazon CloudFront, ELB, VPC

8. Different types of s3?


Ans.

9. How to upload to an S3 bucket?


Ans. AWS CLI, AWS SDK, or Amazon S3 REST API

===================================================================================
=====================================================(aws)
===================================================================================
===================================================(HBASE)
HBASE

1. What is Apache HBase?


Ans. It is a column-oriented database which is used to store sparse data sets.
It runs on top of the Hadoop Distributed File System (HDFS).
Apache HBase is a database that runs on a Hadoop cluster.
Clients can access HBase data through either a native Java API or through a
Thrift or REST gateway,making it accessible by any language. Some of the
key properties of HBase include:
NOSQL -> HBase is not a traditional relational database (RDBMS).HBase
relaxes the ACID
(Atomicity,Consistency,Isolation,Durability) properties of traditional RDBMS
systems in order to achieve much greater scalability.
Wide-Column->
Distributed and Scalable
Consistent ->

2. Give the name of the key components of HBase.


Ans. The key components of HBase are Zookeeper, RegionServer, Region, Catalog
Tables and HBase Master.

3. What is the use of get() method?


Ans. get() method is used to read the data from the table.

4. What is the reason of using HBase?


Ans. HBase is used because it provides random read and write operations and it can
perform a large number of operations per second on large data sets.

5. Define column families?


Ans. It is a collection of columns whereas row is a collection of column families.

6. What is decorating Filters?


Ans. It is useful to modify, or extend, the behavior of a filter to gain additional
control over the returned data.

7. What are the operational commands of HBase?


Ans. Operational commands of HBase are Get, Delete, Put, Increment, and Scan.

8. Which code is used to open the connection in Hbase?


Ans. Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");

9.

===================================================================================
===================================================(hbase)
===================================================================================
===================================================(NOSQL)
NOSQL

1. What are NoSQL databases? What are the different types of NoSQL databases?
Ans. NoSQL database provides a mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used in relational
databases (like SQL, Oracle, etc.).
Types of NoSQL databases:
Document Oriented
Key Value
Graph
Column Oriented

2. What do you understand by NoSQL databases? Explain.
Ans. At the present time, the internet is loaded with big data, big users, big
complexity etc., and it is becoming more complex day by day.
NoSQL is an answer to these problems. It is not a traditional relational
database management system. NoSQL stands for “Not Only SQL”. NoSQL is a type
of database that can handle and sort all
types of unstructured, messy and complicated data.
It is just a new way to think about the database.

3. What are the advantages of NoSQL over traditional RDBMS?
Ans. NoSQL is better than RDBMS because of the following reasons/properties of
NoSQL:
-It supports semi-structured data and volatile data
-It does not have a fixed schema
-Read/Write throughput is very high
-Horizontal scalability can be achieved easily
-Will support Bigdata in volumes of terabytes & petabytes
-Provides good support for Analytic tools on top of Bigdata
-Can be hosted on cheaper hardware machines
-In-memory caching option is available to increase the performance of queries
-Faster development life cycles for developers
Still, RDBMS is better than NoSQL for the following reasons/properties of
RDBMS:

-Transactions with ACID properties - Atomicity, Consistency, Isolation &
Durability
-Adherence to Strong Schema of data being written/read
-Real time query management ( in case of data size < 10 terabytes )
-Execution of complex queries involving join & group by clauses

4. Explain difference between scaling horizontally and vertically for databases?


Ans. Horizontal scaling means that you scale by adding more machines into your pool
of resources whereas
Vertical scaling means that you scale by adding more power (CPU, RAM) to an
existing machine.
In the database world, horizontal scaling is often based on partitioning of
the data, i.e. each node contains only part of the data. In vertical scaling
the data resides on a single node and scaling is done through multi-core,
i.e. spreading the load between the CPU and RAM resources of that machine.
Good examples of horizontal scaling are Cassandra, MongoDB and Google Cloud
Spanner. A good example of vertical scaling is MySQL on Amazon RDS (the
cloud version of MySQL).

5. What is ACID in RDBMS?


Ans. ACID means that any update is:
-Atomic: it either fully completes or it does not happen at all
-Consistent: no reader will see a "partially applied" update
-Isolated: no reader will see a "dirty" read
-Durable: once committed, the update is not lost (with the appropriate write
concern)

6. How does column-oriented NoSQL differ from document-oriented?


Ans. The main difference is that document stores (e.g. MongoDB and CouchDB) allow
arbitrarily complex documents, i.e. subdocuments within
subdocuments, lists with documents, etc., whereas column stores (e.g. Cassandra
and HBase) only allow a fixed format, e.g. strict one-level or two-level
dictionaries.

7. What does Document-oriented vs. Key-Value mean in context of NoSQL?


Ans.

8. When should I use a NoSQL database instead of a relational database?


Ans. Relational databases enforce ACID. So, you will have schema-based,
transaction-oriented data stores. It's proven and suitable for 99% of real world
applications. You can practically do anything with relational databases.
But there are limitations on speed and scaling when it comes to massive high
availability data stores. For example, Google and Amazon have terabytes of data
stored in big data centers. Querying and inserting is not performant in these
scenarios because of the blocking/schema/transaction nature
of the RDBMS.
That's the reason they have implemented their own databases (actually,
key-value stores) for massive performance gain and scalability.

9. What is Denormalization?
Ans. It is the process of improving the performance of the database by adding
redundant data.

10. What are the features of NoSQL?


Ans. When compared to relational databases, NoSQL database are more scalable and
provide superior performance, and their data model addresses several
issues that the relational model is not designed to address:
-Large volumes of structured, semi-structured, and unstructured data
-Agile sprints, quick iteration, and frequent code pushes
-Object-oriented programming that is easy to use and flexible
-Efficient, scale-out architecture instead of expensive, monolithic
architecture
11.
===================================================================================
==================================================(no sql)
===================================================================================
=====================================================(END)

LIVE INTERVIEW QUESTIONS

1. what kind of join you use for performance?

2. broadcast join and accumulators?

4. what is the spark-submit syntax to run a spark job?


Ans. spark-submit --master local --deploy-mode DEPLOY_MODE --executor-cores NUM
--driver-memory 2g --executor-memory 2g --class classpath jarfile

5. higher order functions in scala


Ans. A higher-order function takes other functions as a parameter or returns a
function as a result.

6. can we use collect for large size of data in spark?

7. how to create different dataframes in spark

8. temp table in spark(or scala)

9. how you deploy code to production?


Ans. i wont do

10. what kind of scheduler used?


internal-oozie , yarn scheduler

11. what is the version control using?

12. Query optimization in spark
Ans. https://www.xenonstack.com/blog/apache-spark-optimisation/
(https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/)

13. distribute by and cluster by in spark?
Ans. https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/

14. what is a map-side join (a.k.a. broadcast join) in hive?
Ans. A map-side join loads the smaller table into memory so that the join is
performed entirely within the mappers, with no reduce phase. This makes the join
very fast, provided the smaller table fits in memory.
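
A minimal sketch of the mechanism in plain Python, not Hive itself. The two tables and the function name map_side_join are hypothetical. The small table becomes an in-memory lookup and the large table is scanned once:

```python
# Hypothetical tables: a small dimension table and a larger fact table.
small_table = [(1, "Sales"), (2, "Engineering")]
large_table = [("alice", 1), ("bob", 2), ("carol", 1), ("dave", 3)]


def map_side_join(large_rows, small_rows):
    """Inner-join large_rows to small_rows using an in-memory hash table."""
    lookup = dict(small_rows)          # small table loaded fully into memory
    joined = []
    for name, dept_id in large_rows:   # single pass, as each mapper would do
        if dept_id in lookup:          # rows without a match are dropped
            joined.append((name, dept_id, lookup[dept_id]))
    return joined


result = map_side_join(large_table, small_table)
print(result)
```

No sorting or regrouping of the large table is needed, which is why there is no reduce phase.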

15. bucket in hive?

16. shuffle partitions in spark
Ans. Shuffle partitions are the partitions of the DataFrame produced by an
operation that shuffles data, such as a group-by or a join. Their number is set by
spark.sql.shuffle.partitions (default 200), so it usually differs from the number
of partitions in the original DataFrame.
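
A simplified sketch of how a shuffle places rows, in plain Python. Spark uses its own hash function internally; CRC32 is an assumption made here only to get a stable hash, and the function names are hypothetical:

```python
import zlib


def shuffle_partition(key: str, num_partitions: int) -> int:
    """Pick a target partition from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(rows, num_partitions):
    """Redistribute (key, value) rows so that equal keys share a partition."""
    out = [[] for _ in range(num_partitions)]
    for key, value in rows:
        out[shuffle_partition(key, num_partitions)].append((key, value))
    return out


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
# 4 stands in for the configured number of shuffle partitions.
parts = shuffle(rows, 4)
print(len(parts), parts)
```

The output always has the configured number of partitions, regardless of how the input was partitioned, and some of them may be empty.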
17. reduceByKey and groupByKey
Ans. For example, rdd.groupByKey().mapValues(_.sum) produces the same result as
rdd.reduceByKey(_ + _). However, the former transfers the entire
dataset across the network, while the latter computes local sums for each key
in each partition and combines those local sums into larger sums after
shuffling.
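
The difference can be simulated in plain Python without a cluster. This is a sketch with hypothetical input partitions and function names. It counts how many records would cross the network in each approach:

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]


def group_by_key_sum(partitions):
    """groupByKey style: every record is shuffled, then summed per key."""
    shuffled = [pair for part in partitions for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def reduce_by_key_sum(partitions):
    """reduceByKey style: sum locally per partition, shuffle only partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped, grouped_shuffled = group_by_key_sum(partitions)
reduced, reduced_shuffled = reduce_by_key_sum(partitions)
print(grouped, grouped_shuffled)
print(reduced, reduced_shuffled)
```

Both approaches return the same sums, but the second shuffles one record per key per partition instead of every input record.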

18. broadcast variable

19. Did you use JDBC/ODBC?

20. debugging in spark

21. data skew in spark
Ans. https://unraveldata.com/common-failures-slowdowns-part-ii/

22. The specific variant of SQL that is used to parse queries can also be selected
using the spark.sql.dialect option. This parameter can be changed using either the
setConf method on a SQLContext or by using a SET key=value command in SQL

23. spark.sql.broadcastTimeout (default 300): timeout in seconds for the broadcast
wait time in broadcast joins.

24. The DataFrame API has no provision for compile-time type safety.

sampleColorRdd: all information about colors

I want to select red and/or blue

sampleColorRdd.filter(c=>

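numbers = [1, 2, 3, 4, 56, 7, 89, 89]
large = []

# Collect every element that is greater than the element right after it.
for i in range(len(numbers) - 1):
    if numbers[i] > numbers[i + 1]:
        large.append(numbers[i])

# Deduplicate and sort, because a set cannot be indexed; print the largest.
unique = sorted(set(large))
print(unique[-1])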

--LOAD DATA INPATH 'hdfs://localhost:9000/hdfs/employee_id.txt' OVERWRITE INTO TABLE employee_id;
