BDA (2019) Two Marks (QB)
1/CSE
K.S.R. COLLEGE OF ENGINEERING(Autonomous)
KSRCE/QM/7.5.1/CSE
All of an enterprise's data is housed in a central server, whereas in a big data environment the data resides
in a distributed file system. The distributed file system scales horizontally (scaling in or out), compared to a
typical database server, which scales vertically.
PART-B
1. Define data. Describe the types of digital data.
Collection of information - types of digital data - Unstructured data: has no defined form; Semi-structured
data: has some structure; Structured data: is in an organised form.
2. Write briefly about Big Data.
Big data - characteristics of data - evolution - challenges with big data - why big data is used.
3. Define Big Data Analytics. Explain the terminologies used in the Big Data environment.
Big data analytics - classifications - why it is used - Data science - Terminologies: In-Memory Analytics -
In-Database Processing - Security - Schema - Continuous availability - Consistency.
4. Explain the typical Hadoop environment.
Typical warehouse environment – architecture of Hadoop – HDFS (Hadoop Distributed File System).
UNIT-2
Part- A
1. What is meant by NoSQL?
NoSQL stands for Not Only SQL. NoSQL databases are non-relational, open source, distributed databases.
They are hugely popular today owing to their ability to scale horizontally and their adeptness at dealing with a
rich variety of data: structured, semi-structured and unstructured.
2. Types of NoSQL
Key-value or big hash table
Schema-less
3. Why NoSQL?
It has a scale-out architecture instead of the monolithic architecture of a relational database. It can
house large volumes of structured, semi-structured and unstructured data. It has a dynamic schema: a NoSQL
database allows insertion of data without a pre-defined schema.
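The dynamic-schema point can be illustrated with a minimal Python sketch. It uses a plain in-memory list as a stand-in for a NoSQL collection; the names `collection` and `insert` are illustrative assumptions, not the API of any real database.

```python
# A toy in-memory "collection": a list of dict documents with no fixed schema.
# `collection` and `insert` are hypothetical names, not a real NoSQL API.
collection = []

def insert(document):
    """Store a document as-is; no pre-defined schema is checked."""
    collection.append(dict(document))

# Two documents with different fields can sit side by side.
insert({"RollNo": 101, "Age": 19})
insert({"RollNo": 102, "Hobbies": ["chess", "music"]})

# Collect every field name that appears across the stored documents.
fields = sorted({key for doc in collection for key in doc})
print(fields)
```

A relational table would need its columns declared first; here the second document simply carries a field the first one lacks.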
4. Advantages of NOSQL
Cheap, easy to implement
Easy to distribute
Can easily scale up and down
Relaxes the data consistency requirement
Does not require a pre-defined schema
Data can be replicated to multiple nodes and can be partitioned
SQL NOSQL
8. Define Hadoop
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java.
Hadoop uses Google's MapReduce and Google File System technologies as its foundation.
9. Features of Hadoop
It is optimized to handle massive quantities of structured, semi-structured and unstructured data.
Hadoop has a shared-nothing architecture.
It complements On-Line Transaction Processing (OLTP) and On-Line Analytical Processing
(OLAP). However, it is not a replacement for a relational database management system.
Hadoop is designed for high throughput rather than low latency.
SQL (RDBMS) | Hadoop
On-line transaction processing | Off-line batch processing
Costs around $10,000 to $14,000 per terabyte of storage | Costs around $4,000 per terabyte of storage
Needs expensive hardware | A node requires only a processor, a network card and a few hard drives
17. Difference between SQL and MapReduce
SQL | MapReduce
Interactive and batch access | Batch access
Read and write many times | Write once, read many times
PART B
HDFS – Hbase – Hive – Pig – Zookeeper – Oozie – Mahout – Chukwa – Sqoop – Ambari
Unit 3
Part A:
1. What is MONGODB?
MongoDB is:
Cross-platform
Open source
Non-relational
Distributed
NoSQL
Document-oriented data store.
2. Why MONGODB?
A few of the major challenges with a traditional RDBMS are dealing with large volumes of
data, handling a rich variety of data (particularly unstructured data), and meeting the scale needs of
enterprise data. The need is for a database that can scale out, that is, scale horizontally, to meet these
requirements.
3. What are the terms used In MONGODB?
Database
Collection
Document
Fields/Key Value pairs
Index
Embedded Documents
4. What are the data types in MONGODB?
String
Integer
Boolean
Double
Arrays
Null
Date
5. Explain the MONGODB query language.
CRUD (Create, Read, Update and Delete)
Create – creation of data
Read – reading of data
Update – updating of data
Delete – deletion of data, using the remove() method.
6. Write a query for performing an insert operation in MONGODB.
db.students.insert(
{
RollNo: 101,
Age: 19,
ContactNo: 10124585,
EmailId: "[email protected]"
})
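The four CRUD operations can also be sketched in plain Python. This is only an in-memory illustration of what each operation does, assuming a list of dictionaries in place of a collection; the helper names `create`, `read`, `update` and `delete` are hypothetical and are not the MongoDB shell or driver API.

```python
# In-memory stand-in for a document collection; not the MongoDB driver API.
students = []

def create(doc):
    """Create: add a new document."""
    students.append(dict(doc))

def read(criteria):
    """Read: return documents whose fields match every key/value in `criteria`."""
    return [d for d in students
            if all(d.get(k) == v for k, v in criteria.items())]

def update(criteria, changes):
    """Update: apply `changes` to each matching document; return the count."""
    matched = read(criteria)
    for d in matched:
        d.update(changes)
    return len(matched)

def delete(criteria):
    """Delete: remove matching documents; return the count."""
    matched = read(criteria)
    for d in matched:
        students.remove(d)
    return len(matched)

create({"RollNo": 101, "Age": 19})
create({"RollNo": 102, "Age": 20})
update({"RollNo": 101}, {"Age": 20})
delete({"RollNo": 102})
print(read({"Age": 20}))
```

Each helper takes a criteria dictionary, which loosely mirrors the way a document store selects documents by field values.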
Part B
1. Explain in detail the MONGODB query language with examples.
Cross-platform
Open source
Non-relational
Distributed
NoSQL
Document-oriented data store
2. Write in detail about the features of Cassandra.
Peer to Peer Network
Gossip and Failure Detection
Partitioner
Replication Factor
Anti-Entropy and Read Repair
Writes in Cassandra
Hinted Handoffs
3. Explain about CRUD operations with example?
Create
Read
Update
Delete
4. Explain in detail about collections.
Set
List
Map
5. Explain in detail the import and export of data in Cassandra.
Export from Cassandra:
Check the records of the table "e-learninglists" present in the "student" database.
Execute the export command at the cqlsh prompt.
Check the existence of the "e-learninglists.csv" file in D:/.
Import into Cassandra:
Check for the table "e-learninglists" in the "student" database.
Check the contents of "D:/e-learninglists.csv".
Execute the command to import the data into the table in the database.
Unit – IV
Part - A
1) Define MapReduce.
Answer: MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes
a set of data and converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map task.
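The Map and Reduce tasks described above can be sketched as a word count in plain Python. This is a single-process illustration only, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample lines are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single tuple."""
    return key, sum(values)

# Made-up input lines for illustration.
lines = ["big data big ideas", "data moves fast"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

The seven (word, 1) pairs produced by the map step are combined into five tuples by the reduce step, which is the "smaller set of tuples" the definition refers to.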
7) What is Hive?
Answer: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
8) What are the features of Hive?
Answer: * Hive provides data summarization, query, and analysis in a much easier manner.
* Hive supports external tables, which make it possible to process data without actually storing it in
HDFS.
* Apache Hive fits the low-level interface requirement of Hadoop perfectly.
* It also supports partitioning of data at the level of tables to improve performance.
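Why table-level partitioning helps performance can be shown with a small Python sketch. It is a conceptual illustration, not HiveQL: the rows, the `dept` column and the `partition_by` helper are made-up assumptions.

```python
from collections import defaultdict

# Made-up sample rows for illustration.
rows = [
    {"name": "Asha", "dept": "CSE", "marks": 81},
    {"name": "Ravi", "dept": "ECE", "marks": 74},
    {"name": "Mina", "dept": "CSE", "marks": 92},
]

def partition_by(rows, column):
    """Bucket rows under a key such as 'dept=CSE', one bucket per value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return partitions

partitions = partition_by(rows, "dept")

# A query filtered on the partition column reads one bucket, not every row.
cse_rows = partitions["dept=CSE"]
print(sorted(partitions), len(cse_rows))
```

A query that filters on the partition column only has to scan the matching bucket, so less data is read.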
4) List the Hive Datatypes and explain briefly about Hive File Format?
Integral
String
Timestamp
Union types
5) Explain about HQL and also list the DDL, Aggregation, Bucketing?
From clause
As Clause
Select clause
Where Clause
Order by Clause
Update , Delete, DDL, Aggregation, Bucketing
Unit-5
Part-A
1) What is Pig?
Apache Pig is a platform for data analysis. It is an alternative to MapReduce programming. Pig was
developed as a research project at Yahoo.
2) Define the anatomy of Pig.
Data flow language (Pig Latin)
Interactive shell where you can type Pig Latin statements
Pig interpreter and execution engine
3) What is the Pig philosophy?
Pigs eat anything
Pigs live anywhere
Pigs are domestic animals
Pigs fly
4) What are Pig Latin statements?
A Pig Latin statement is an operator.
Pig Latin statements are the basic constructs used to process data with Pig.
Pig Latin statements should end with a semicolon.
5) What are the modes of running Pig?
You can run Pig in two ways:
Interactive mode
Batch mode
6) What are the data types in Pig?
int, long, float, double, chararray, bytearray, datetime, boolean
7) Define local mode and MapReduce mode.
Local mode:
To run Pig in local mode, you need to have your files in the local file system.
Syntax: pig -x local filename
MapReduce mode:
To run Pig in MapReduce mode, you need to have access to a Hadoop cluster to read/write files.
8) Define relational operator?
Filter
For each
Distinct
Group
Limit
Order by
Join
Union
Split
Sample
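What a few of these operators do can be sketched with plain Python equivalents. This is not Pig Latin; it only mirrors the behaviour of FILTER, GROUP, FOREACH, ORDER BY and LIMIT on a made-up relation of (name, dept, marks) tuples.

```python
from itertools import groupby

# Made-up sample relation: (name, dept, marks).
students = [
    ("Asha", "CSE", 81),
    ("Ravi", "ECE", 74),
    ("Mina", "CSE", 92),
    ("John", "ECE", 55),
]

# FILTER: keep only the tuples that satisfy a condition.
passed = [s for s in students if s[2] >= 60]

# GROUP: collect tuples by key (groupby needs its input sorted by that key).
ordered = sorted(passed, key=lambda s: s[1])
by_dept = {dept: list(group) for dept, group in groupby(ordered, key=lambda s: s[1])}

# FOREACH: produce one output tuple per group, here the average marks.
averages = [(dept, sum(s[2] for s in group) / len(group))
            for dept, group in by_dept.items()]

# ORDER BY (descending) followed by LIMIT 1.
top = sorted(averages, key=lambda t: t[1], reverse=True)[:1]
print(top)
```

Each step takes a relation and yields a new one, which is the data flow style that Pig Latin scripts follow.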
Part B
1) Explain briefly Pig ETL processing and give an overview of Pig Latin.
Pig Latin: statements
Pig Latin: keywords
Pig Latin: identifiers
Pig Latin: comments
Pig Latin: case sensitivity
Operators in Pig Latin
2) Define machine learning and its algorithm?
Machine learning definition
Machine learning algorithm
3) Define execution mode of pig and running pig?
Run pig in two ways
Interactive pig
Batch pig
Execute pig in two ways
Local pig
Map reduce mode