Bigdata Notes
3. What are the challenges you have faced in your project? Explain in detail.
Ans. 1. In PySpark, converting nested JSON to a DataFrame.
     2. Creating a dynamic schema for the required databases from unstructured data.
     Kafka with Python: each camera is a producer; we created multiple producers,
          one producer per region, and each producer writes to its own Kafka topic.
          Frames are produced at 30 fps.
     We developed the producer code and built the data ingestion pipeline; on AWS we
          used only S3.
     PySpark converting nested JSON to DataFrame - we flattened the nested JSON and
          then created the DataFrame (see the sketch below).
     Query optimization.
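A minimal PySpark sketch of that flattening step, assuming a single nested JSON
record; the field names (camera, region, ts) and the flatten helper are invented
for the example.

# Hedged sketch: flatten the nested levels of a JSON record, then build a DataFrame.
import json
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("flatten-example").getOrCreate()

raw = '{"camera": {"id": 1, "region": "east"}, "ts": "2021-01-01T00:00:00"}'

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into prefixed top-level keys."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat

df = spark.createDataFrame([Row(**flatten(json.loads(raw)))])
df.show()  # columns include camera_id, camera_region, ts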
10. How long does it take to run your script in the production cluster? How did you
     optimize the timings? What challenges have you faced?
Ans. 1. Each data pipeline takes a couple of minutes.
11. End-to-end project description, your role in it, the team size and their roles.
Ans. 1. Team size: 10.
     2. From data ingestion till storing the data in the data lake.
     3. The video stream produced with OpenCV is published to Kafka topics; Spark
          Streaming consumes from the Kafka topics, we do analytics based on the
          given problem, and the results are pushed to the data lake (HDFS, Hive,
          S3). A producer sketch follows below.
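A hedged sketch of the camera-to-Kafka producer step described above, assuming the
kafka-python client; the broker address, topic name, and camera index are
placeholder assumptions.

# Hedged sketch: read frames with OpenCV and publish them to a Kafka topic.
import cv2
from kafka import KafkaProducer  # kafka-python client assumed

producer = KafkaProducer(bootstrap_servers="localhost:9092")
capture = cv2.VideoCapture(0)    # camera index 0 stands in for one region's camera

for _ in range(300):             # roughly 10 seconds of frames at 30 fps
    ok, frame = capture.read()
    if not ok:
        break
    ok, buffer = cv2.imencode(".jpg", frame)   # serialize the frame as JPEG bytes
    if ok:
        producer.send("camera-frames", buffer.tobytes())

capture.release()
producer.flush()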
===================================================================================
=================================================(Project)
===================================================================================
===================================================(KAFKA)
KAFKA (https://data-flair.training/blogs/kafka-interview-questions/)
1. What is Spark?
Ans. Apache Spark is a cluster computing platform designed to be fast and general
          purpose.
     It can execute streaming as well as batch workloads.
     It also integrates closely with other Big Data tools. In particular, Spark can
          run in Hadoop clusters and access any Hadoop data source, including
          Cassandra.
     Spark does in-memory data processing.
     The main abstraction of Spark is the RDD.
3. Spark components?
Ans. Driver program -> central entry point for the Spark shell; it runs the main
          function of the application and creates the SparkContext.
          The driver stores metadata about RDDs, their locations and partitions.
     SparkContext/SparkSession -> a client of Spark's execution environment that
          acts as the master of the Spark application.
     Cluster manager -> responsible for acquiring resources on the Spark cluster
          and for job allocation.
     Worker nodes -> responsible for the execution of tasks.
     Executors
     Tasks
5. What is SparkContext?
Ans. A SparkContext is a client of Spark’s execution environment and it acts as the
master of the Spark application.
SparkContext is the entry gate of Apache Spark functionality.
SparkContext sets up internal services and establishes a connection to a Spark
execution environment.
The most important step of any Spark driver application is to generate
SparkContext.
     You can create RDDs, accumulators and broadcast variables, access Spark
          services and run jobs (until the SparkContext stops) after the creation
          of the SparkContext.
It allows your Spark Application to access Spark Cluster with the help of
Resource Manager (YARN/Mesos).
To create SparkContext, first SparkConf should be made.
The SparkConf has a configuration parameter that our Spark driver application
will pass to SparkContext.
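A minimal PySpark sketch of the SparkConf -> SparkContext flow described above; the
app name and the local master are placeholder assumptions.

# Hedged sketch: build a SparkConf, pass it to a SparkContext, run one job.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.count())  # the job runs through the SparkContext

sc.stop()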
6. What is RDD?
Ans. RDD stands for Resilient Distributed Dataset. An RDD is an abstract
          representation of the data, which is divided into partitions and
          distributed across the cluster.
     This collection is made up of data partitions, each a small collection of data
          stored in RAM or on disk.
     An RDD is immutable, lazily evaluated and cacheable.
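A small illustrative RDD example; the numbers, partition count and lambda are made
up for the sketch.

# Hedged sketch: a partitioned, lazily evaluated, cacheable RDD.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(10), numSlices=4)  # collection split into 4 partitions
squares = rdd.map(lambda x: x * x)            # lazy transformation, nothing runs yet
squares.cache()                               # mark the result as cacheable
print(squares.collect())                      # the action triggers evaluation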
7. How does Spark work on YARN? - a Hadoop cluster is configured and Spark is
     installed on top of it.
Ans. A Spark application is launched on a set of machines using an external service
          called a cluster manager.
     Spark is packaged with a built-in cluster manager called the Standalone
          cluster manager.
     Spark also works with Hadoop YARN and Apache Mesos.
     Spark has a Spark driver that requests resources from the cluster manager and
          coordinates the executors running on the worker nodes.
     (https://www.youtube.com/watch?v=bPHQouyUGBk)
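A hedged sketch of pointing Spark at YARN. Submission usually goes through
spark-submit (e.g. spark-submit --master yarn --deploy-mode cluster job.py), but the
master can also be set in code; this assumes the Hadoop/YARN client configuration
(HADOOP_CONF_DIR/YARN_CONF_DIR) is available, and the app name is a placeholder.

# Hedged sketch: a SparkSession whose master is the YARN resource manager.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-example")   # placeholder app name
         .master("yarn")            # requires the Hadoop/YARN config on the node
         .getOrCreate())
print(spark.sparkContext.master)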
18. What is Spark reduce? Explain it, and what is the difference between fold and
     reduce?
Ans. reduce() aggregates the elements of an RDD using a binary function that should
          be commutative and associative.
     fold() does the same but additionally takes a "zero" value that is used as the
          initial value for each partition and for merging the partition results.
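An illustrative reduce vs fold on a toy RDD; the numbers are made up.

# Hedged sketch: summing an RDD with reduce and with fold.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([1, 2, 3, 4, 5])

total_reduce = rdd.reduce(lambda a, b: a + b)   # 15
total_fold = rdd.fold(0, lambda a, b: a + b)    # 15, with 0 as the zero value
print(total_reduce, total_fold)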
21. What is a window in SQL? And what is the difference between GROUP BY and a
     window function?
Ans. Window functions operate on a set of rows and return a single value for each
          row from the underlying query.
     The term window describes the set of rows on which the function operates.
     A window function uses values from the rows in a window to calculate the
          returned values.
     Unlike GROUP BY, which collapses each group into a single output row, a window
          function keeps every input row and attaches the computed value to it.
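A hedged Spark SQL sketch of that difference; the column names (region, amount) and
the values are invented for the example.

# Hedged sketch: GROUP BY collapses rows, a window function keeps them.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-example").getOrCreate()
df = spark.createDataFrame(
    [("east", 10), ("east", 20), ("west", 5)], ["region", "amount"])

# GROUP BY: one output row per region.
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

# Window: every input row is kept, with the per-region total attached.
w = Window.partitionBy("region")
df.withColumn("region_total", F.sum("amount").over(w)).show()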
22. How does Spark know that it is writing data into an external location?
Ans.
25. What is the difference between a Temp View and a Global Temp View?
Ans. Temporary views in Spark SQL are tied to the SparkSession that created the
          view, and will not be available once that SparkSession is terminated.
     Global temporary views are not tied to a single Spark session and can be
          shared across multiple Spark sessions.
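A short sketch of both view types; the DataFrame contents and view names are
placeholders.

# Hedged sketch: a session-scoped temp view vs a global temp view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("views-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

df.createOrReplaceTempView("people")           # visible only to this session
df.createOrReplaceGlobalTempView("people_g")   # visible to other sessions too

spark.sql("SELECT * FROM people").show()
# Global temp views live in the reserved global_temp database.
spark.newSession().sql("SELECT * FROM global_temp.people_g").show()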
43. Can you use Spark to access and analyze data stored in Cassandra databases?
Ans. Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark
          to a Cassandra cluster, the Cassandra Connector needs to be added to the
          Spark project. In this setup, a Spark executor talks to a local Cassandra
          node and will only query for local data. This makes queries faster by
          reducing the use of the network to send data between the Spark executors
          (which process the data) and the Cassandra nodes (where the data lives).
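A hedged sketch of reading a Cassandra table through the connector; the connector
package coordinates/version, host, keyspace and table names are all assumptions.

# Hedged sketch: Spark reads a Cassandra table via the Spark Cassandra Connector.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-example")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")  # assumed version
         .config("spark.cassandra.connection.host", "127.0.0.1")             # assumed host
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")   # placeholder names
      .load())
df.show()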
===================================================================================
===================================================(spark)
===================================================================================
=========================================(HADOOP AND YARN)
HADOOP and YARN and Big data
1. What is a bag?
Ans. Pig Latin works on relations.
     A relation is a bag.
     A bag is a collection of tuples.
     A tuple is an ordered set of fields.
     A field is a piece of data.
2. What is YARN?
Ans. YARN - Yet Another Resource Negotiator. (A global resource manager; it can run
          N distributed applications at the same time on the same cluster.)
     YARN is the Hadoop processing layer that contains
          - resource manager
          - node manager
          - containers
          - job scheduler
     YARN allows multiple data processing engines to run in a single Hadoop cluster
          - batch programs (Spark, MapReduce)
          - advanced analytics (Spark, Impala)
          - interactive SQL (Impala)
          - streaming (Spark Streaming)
     YARN daemons
          - resource manager
               - runs on the master node
               - global resource scheduler
          - node manager
               - runs on the slave nodes
               - communicates with the resource manager
5. What is Hadoop?
Ans. Hadoop is a framework that allows us to store and process large datasets in a
          parallel and distributed manner.
     Hadoop has
     HDFS - used for storage; it allows storing data of various formats across the
          cluster.
          - distributed file system, scalable and fast access.
          - no schema is needed before dumping the data.
          - horizontal scaling as per requirement (adding more data nodes is
               horizontal scaling; adding more resources (RAM, CPU) is vertical
               scaling).
          - name node -> contains metadata about the data stored in the data nodes.
               -> master daemon that maintains and manages the data nodes.
               -> two files are associated with the metadata:
                    - fsimage -> contains the complete state of the file system
                         since the start of the name node.
                    - edit logs -> all recent modifications made to the file system.
          - data node -> stores the actual data and also holds replicated data.
               -> sends heartbeats to the name node (3-second frequency).
               -> sends block reports to the name node.
               -> slave node, commodity hardware.
          - secondary name node -> works concurrently with the name node as a
               helper daemon to the name node.
          - once data is dumped into HDFS, data blocks are created (128 MB default
               block size) and stored across the data nodes.
14. What are the main differences between NAS (Network-attached storage) and HDFS?
Ans. HDFS runs on a cluster of machines while NAS runs on an individual machine.
          Hence, data redundancy (through replication) is common in HDFS. In
          contrast, the replication protocol is different in the case of NAS, so
          the chances of data redundancy are much lower.
     Data is stored as data blocks on the local drives of the machines in the case
          of HDFS. In the case of NAS, it is stored on dedicated hardware.
16. Will you optimize algorithms or code to make them run faster?
Ans. "Yes." Real-world performance matters, and it does not depend on the data or
          model you are using in your project.
17. How would you transform unstructured data into structured data?
Ans. Parse the raw data (for example, flatten nested JSON or apply a regex/parser
          to text), impose a schema on the extracted fields, and store the result
          in a structured form such as a DataFrame or a Hive table.
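A hedged PySpark sketch of that idea; the log format, regex and column names are
invented for the example.

# Hedged sketch: turn unstructured log lines into a structured DataFrame.
import re
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("structure-example").getOrCreate()
lines = spark.sparkContext.parallelize([
    "2021-01-01 INFO job started",
    "2021-01-01 ERROR disk full",
])

def parse(line):
    # Split each line into date, level and message fields.
    date, level, message = re.match(r"^(\S+) (\S+) (.+)$", line).groups()
    return Row(date=date, level=level, message=message)

df = spark.createDataFrame(lines.map(parse))
df.show()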
18. What happens when two users try to access the same file in the HDFS?
Ans. The HDFS NameNode supports exclusive write only. Hence, only the first user
          will receive the grant for file access and the second user will be
          rejected.
21. What is the difference between an "HDFS Block" and an "Input Split"? And what
     is a block scanner?
Ans. HDFS physically divides the input data into blocks for storage and processing;
          these are known as HDFS Blocks.
     An Input Split is the logical division of the data used by the mapper for the
          map operation.
     Block Scanner - the Block Scanner tracks the list of blocks present on a
          DataNode and verifies them to find any kind of checksum errors. Block
          Scanners use a throttling mechanism to conserve disk bandwidth on the
          DataNode.
25. What are the Port Numbers for NameNode, Task Tracker, and Job Tracker?
Ans. NameNode – Port 50070
Task Tracker – Port 50060
Job Tracker – Port 50030
26. Explain the process that overwrites the replication factors in HDFS.
Ans. $ hadoop fs -setrep -w 2 /my/test_file
===================================================================================
=========================================(Hadoop and yarn)
===================================================================================
====================================================(HIVE)
HIVE - IN DETAILS
External tables are tables where Hive has loose coupling with the data.
The writes on External tables can be performed using Hive SQL commands but
data files can also be accessed and managed by processes outside of Hive.
If an External table or partition is dropped, only the metadata associated with
the table or partition is deleted, but the underlying data files stay intact.
Hive supports replication of External tables with data to target cluster and
it retains all the properties of External tables.
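A hedged sketch of creating and dropping an external Hive table through Spark SQL;
the table name, columns and HDFS location are placeholders.

# Hedged sketch: external table DDL issued from PySpark with Hive support enabled.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-example")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        id INT,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events'
""")

# Dropping an external table removes only the metadata; the files under LOCATION stay.
spark.sql("DROP TABLE events")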
2. What is Hive?
Ans. A data warehousing package built on top of Hadoop, used for analyzing
          structured and semi-structured data.
     Used for data analytics.
     Provides tools to enable easy data ETL.
     It provides a mechanism to project structure onto the data and perform queries
          written in HQL that are similar to SQL statements.
     Internally, these HQL queries get converted into MapReduce jobs by the Hive
          compiler.
6. What are the three different modes in which Hive can be run?
Ans. Local mode
     Distributed mode
     Pseudo-distributed mode
12. What is the default database provided by Apache Hive for metastore?
Ans. By default, Hive provides an embedded Derby database instance backed by the
local disk for the metastore. This is called the embedded metastore
configuration.
16. What is the maximum size of a string data type supported by Hive?
Ans. 2 GB
17. What is the available mechanism for connecting applications when we run Hive
as a server?
Ans. Thrift Client: Using Thrift, we can call Hive commands from various
programming languages, such as C++, PHP, Java, Python, and Ruby.
JDBC Driver: JDBC Driver enables accessing data with JDBC support, by
translating calls from an application into SQL and passing the SQL
queries to the Hive engine.
ODBC Driver: It implements the ODBC API standard for the Hive DBMS, enabling
     ODBC-compliant applications to interact seamlessly with Hive.
===================================================================================
====================================================(Hive)
===================================================================================
=================================================(GENERAL)
BIG DATA CONCEPTS
11. What is a data pipeline - data ingestion pipeline, data extraction pipeline,
     data preprocessing pipeline?
Ans. A data pipeline connects two or more operations together.
     A data ingestion pipeline connects, for example, NiFi and Kafka together (see
          the sketch below).
     A data preprocessing pipeline connects, for example, Hive and Spark together.
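A hedged sketch of an ingestion step that connects Kafka and Spark: Structured
Streaming reads a Kafka topic and writes to a data-lake path. The broker address,
topic name and output paths are placeholders, and the Kafka source needs the
spark-sql-kafka package on the classpath.

# Hedged sketch: Kafka -> Spark Structured Streaming -> data lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingestion-example").getOrCreate()

frames = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "camera-frames")                 # assumed topic
          .load())

query = (frames.select("key", "value", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///datalake/frames")             # assumed output path
         .option("checkpointLocation", "hdfs:///checkpoints/frames")
         .start())
query.awaitTermination()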
===================================================================================
=================================================(general)
===================================================================================
===================================================(SCALA)
SCALA
1. Scala vs Java?
Ans. Scala (https://www.geeksforgeeks.org/scala-vs-java/)
Scala is a mixture of both object oriented and functional programming.
Scala is less readable due to nested code.
The process of compiling source code into byte code is slow.
Scala supports operator overloading.
Java
Java is a general purpose object oriented language.
Java is more readable.
The process of compiling source code into byte code is fast.
Java does not support operator overloading.
4. What is a 'Scala set'? What are the methods through which set operations are
     expressed?
Ans. A Scala set is a collection of pairwise different elements of the same type.
          A Scala set does not contain any duplicate elements. There are two kinds
          of sets, mutable and immutable.
14. Explain how Scala is both Functional and Object-oriented Programming Language?
Ans. Scala treats every single value as an Object which even includes Functions.
Hence, Scala is the fusion of both Object-oriented and Functional programming
features.
===================================================================================
===================================================(scala)
===================================================================================
==================================================(PYTHON)
PYTHON
3. How can you change the way two instances of a specific class behave on
     comparison?
Ans. Implement the rich comparison special methods on the class (__eq__, __ne__,
          __lt__, __le__, __gt__, __ge__), optionally with functools.total_ordering
          to derive the rest from __eq__ and one ordering method.
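A short sketch of customizing comparison; the Employee class and its salary field
are invented for the example.

# Hedged sketch: rich comparison methods decide how instances compare.
from functools import total_ordering

@total_ordering
class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def __eq__(self, other):
        return self.salary == other.salary   # equality compares salaries

    def __lt__(self, other):
        return self.salary < other.salary    # ordering compares salaries

print(Employee("a", 100) < Employee("b", 200))   # True
print(Employee("a", 100) == Employee("b", 100))  # True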
===================================================================================
==================================================(python)
===================================================================================
=====================================================(AWS)
AWS
1. Define and explain the three basic types of cloud services and the AWS
products that are built based on them?
Ans. Computing - These include EC2, Elastic Beanstalk, Lambda, Auto Scaling, and
          Lightsail.
Storage - These include S3, Glacier, Elastic Block Storage, Elastic File
System.
Networking - These include VPC, Amazon CloudFront, Route53
3. What is auto-scaling?
Ans. Auto-scaling is a function that allows you to provision and launch new
instances whenever there is a demand.
It allows you to automatically increase or decrease resource capacity in
relation to the demand.
===================================================================================
=====================================================(aws)
===================================================================================
===================================================(HBASE)
HBASE
9.
===================================================================================
===================================================(hbase)
===================================================================================
===================================================(NOSQL)
NOSQL
1. What are NoSQL databases? What are the different types of NoSQL databases?
Ans. NoSQL database provides a mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used in relational
databases (like SQL, Oracle, etc.).
Types of NoSQL databases:
Document Oriented
Key Value
Graph
Column Oriented
3. What are the advantages of NoSQL over traditional RDBMS?
Ans. NoSQL is better than RDBMS because of the following reasons/properties of
          NoSQL:
     - It supports semi-structured data and volatile data.
     - It does not require a fixed schema.
     - Read/write throughput is very high.
     - Horizontal scalability can be achieved easily.
     - It supports big data in volumes of terabytes and petabytes.
     - It provides good support for analytic tools on top of big data.
     - It can be hosted on cheaper hardware machines.
     - An in-memory caching option is available to increase query performance.
     - Faster development life cycles for developers.
     Still, RDBMS is better than NoSQL for the following reasons/properties of
          RDBMS:
     - ACID transactions and strong consistency.
     - A mature ecosystem, standard SQL, and support for complex joins.
     - Strong data integrity constraints (keys, constraints, normalization).
9. What is Denormalization?
Ans. It is the process of improving the performance of the database by adding
redundant data.
22. The specific variant of SQL that is used to parse queries can also be selected
using the spark.sql.dialect option. This parameter can be changed using either the
setConf method on a SQLContext or by using a SET key=value command in SQL
24. The DataFrame API doesn't have provision for compile-time type safety. With a
     typed RDD, by contrast, the element type of the lambda is checked at compile
     time, e.g. (illustrative predicate):
     sampleColorRdd.filter(c => c.startsWith("R"))
# Collect the values that are greater than the element immediately after them,
# de-duplicate them, and print the largest such value.
nums = [1, 2, 3, 4, 56, 7, 89, 89]
larger_than_next = []
for i in range(len(nums) - 1):
    if nums[i] > nums[i + 1]:            # compare with the next element, not a fixed index
        larger_than_next.append(nums[i])
unique = sorted(set(larger_than_next))   # sets are unordered and not subscriptable
print(unique[-1])