BDA Answers
NoSQL (Not Only SQL) refers to a broad class of database management systems that do not rely on the
traditional relational database model, which uses structured query language (SQL) for data
manipulation. NoSQL databases are designed to handle large volumes of unstructured, semi-structured,
or structured data with high performance and scalability, making them suitable for big data
applications.
Unlike relational databases that use predefined schemas and tables, NoSQL databases provide more
flexibility in data storage and processing. They can scale horizontally by adding more servers to handle
increasing loads. NoSQL databases are particularly useful for distributed, cloud-based applications that
require high availability, fault tolerance, and rapid response times.
There are four primary types of NoSQL databases, each optimized for different use cases:
1. Document-Oriented Databases
Description: These databases store data as documents (usually in JSON, BSON, or XML format) rather
than in the rows and columns of a relational database. Each document can have a different structure,
which makes the model flexible for handling varying data formats.
Use Cases: Ideal for managing hierarchical data, content management systems, e-commerce
applications, and real-time big data analytics.
Examples:
MongoDB: One of the most popular document-oriented NoSQL databases, widely used for web
applications and big data platforms.
CouchDB: Stores data in JSON format and uses JavaScript for query processing.
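As a brief illustration of the document model, the sketch below uses the MongoDB Java driver to insert
and query a JSON-like document. It is only a minimal sketch: the connection string, the "shop" database,
and the "products" collection are hypothetical placeholders, and a MongoDB server is assumed to be
running locally.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class DocumentStoreExample {
    public static void main(String[] args) {
        // Assumed local MongoDB server; adjust the connection string as needed.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");                      // hypothetical database
            MongoCollection<Document> products = db.getCollection("products"); // hypothetical collection

            // Documents in the same collection may have different fields (flexible schema).
            products.insertOne(new Document("name", "laptop")
                    .append("price", 799)
                    .append("tags", java.util.Arrays.asList("electronics", "portable")));

            // Query by a field value, similar to a WHERE clause in SQL.
            Document found = products.find(Filters.eq("name", "laptop")).first();
            System.out.println(found != null ? found.toJson() : "not found");
        }
    }
}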
2. Key-Value Stores
Description: In key-value databases, data is stored as a collection of key-value pairs, where each key is
unique. These databases are simple, fast, and highly scalable, and are primarily used for storing and
retrieving values by their keys.
Use Cases: Best suited for caching, session management, real-time recommendation engines, and
storing configuration settings.
Examples:
Redis: An in-memory key-value store known for its high performance, often used for caching and real-
time analytics.
Amazon DynamoDB: A fully managed key-value store provided by AWS, used for large-scale
applications requiring low-latency performance.
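For a concrete feel of the key-value model, here is a minimal sketch using the Jedis client for Redis. It
assumes a Redis server running locally on the default port; the key name "session:42" is just an
illustrative placeholder.
import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Assumed local Redis server on the default port 6379.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store a value under a unique key (hypothetical session key).
            jedis.set("session:42", "user=alice");

            // Retrieve the value by its key -- the core key-value operation.
            String value = jedis.get("session:42");
            System.out.println(value);

            // Remove the key when it is no longer needed.
            jedis.del("session:42");
        }
    }
}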
3. Column-Family Stores
Description: These databases organize data into column families: groups of related columns that are
stored together. Each row is identified by a unique key and may contain only some of the columns, which
allows efficient storage and retrieval of sparse data and lets large datasets be spread across distributed
systems.
Use Cases: Ideal for data warehousing, distributed data storage, and applications requiring high write
throughput like social networks and recommendation engines.
Examples:
Apache Cassandra: A highly scalable, distributed column-family store used by companies like
Facebook, Instagram, and Netflix.
HBase: Built on top of Hadoop, HBase is designed to handle large amounts of sparse data across
distributed systems.
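The following sketch shows what reads and writes against a column-family store can look like, using the
DataStax Java driver for Apache Cassandra. The contact point, the data center name, and the "ks.users"
table are assumptions made only for illustration; the keyspace and table would need to exist beforehand.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.net.InetSocketAddress;

public class ColumnFamilyExample {
    public static void main(String[] args) {
        // Assumed local Cassandra node and data center name.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {

            // Each row is addressed by its key; columns can be sparse per row.
            session.execute("INSERT INTO ks.users (user_id, name) VALUES (1, 'alice')");

            ResultSet rs = session.execute("SELECT name FROM ks.users WHERE user_id = 1");
            Row row = rs.one();
            System.out.println(row != null ? row.getString("name") : "not found");
        }
    }
}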
4. Graph Databases
Description: Graph databases are designed to represent and store data in the form of nodes (entities)
and edges (relationships between entities). These databases excel at handling complex relationships
between data points.
Use Cases: Commonly used for social networks, recommendation engines, fraud detection, and
network analysis where data relationships are crucial.
Examples:
Neo4j: One of the most widely used graph databases, known for its performance in handling complex
queries on large-scale graphs.
Amazon Neptune: A managed graph database service that supports both property graphs and RDF
graph models.
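As an illustration of the graph model, the sketch below uses the Neo4j Java driver to create two nodes
connected by a relationship and then traverse it with a Cypher query. The Bolt URL and the credentials
are placeholders, and a running Neo4j instance is assumed.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class GraphExample {
    public static void main(String[] args) {
        // Assumed local Neo4j instance; credentials are placeholders.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Nodes (Person) and an edge (FRIENDS_WITH) model the relationship directly.
            session.run("CREATE (a:Person {name: $a})-[:FRIENDS_WITH]->(b:Person {name: $b})",
                    Values.parameters("a", "Alice", "b", "Bob"));

            // Traverse the relationship instead of joining tables.
            Result result = session.run(
                    "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
                    Values.parameters("a", "Alice"));
            while (result.hasNext()) {
                System.out.println(result.next().get("friend").asString());
            }
        }
    }
}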
Conclusion
NoSQL databases offer flexibility, scalability, and high performance, making them well-suited for
handling large volumes of data typical in big data environments. Depending on the use case, different
types of NoSQL databases—document-oriented, key-value, column-family, and graph—provide
tailored solutions for managing and querying data in modern applications.
7.b) Hive Architecture and Working
Apache Hive is a data warehouse software built on top of Hadoop that provides an SQL-like interface
to query data stored in Hadoop Distributed File System (HDFS). Hive abstracts the complexity of
Hadoop's MapReduce programming by allowing users to write queries in HiveQL, a query language
similar to SQL. This makes it easier for analysts and developers to work with large datasets in Hadoop.
Hive Architecture
The Hive architecture consists of several key components that work together to process data:
1. User Interface
The user interface allows users to interact with Hive by submitting queries, managing tables, and
performing data operations. Hive provides several interfaces such as:
CLI (Command Line Interface): The basic interface for writing HiveQL queries.
Web UI: A browser-based interface for submitting queries and browsing results.
JDBC/ODBC Drivers: Allow external applications and BI tools to connect to Hive and submit queries.
2. HiveQL Process Engine
HiveQL (Hive Query Language) is the SQL-like language used to query data in Hive. When a user
submits a query, the HiveQL process engine parses and compiles the query and converts it into
MapReduce jobs that can be executed on Hadoop.
3. Metastore
The Metastore is a crucial component in Hive. It stores metadata about the tables, databases, columns,
data types, partitions, and other elements. It helps the Hive engine understand the schema and structure
of the data.
Metastore can be backed by a relational database like MySQL or Derby, which stores the schema and
related metadata.
4. Driver
The Driver acts as the controller that receives the query, initiates the query execution, and interacts with
all components. It manages the lifecycle of the query, including parsing, optimizing, and compiling it
into MapReduce tasks.
Parser: Checks the submitted query for syntax errors and validates table and column references
against the Metastore before a logical plan is built.
Compiler: Converts the logical plan (from the parser) into an execution plan of MapReduce jobs.
Optimizer: Optimizes the query execution by rearranging the jobs for better performance (e.g.,
reducing the number of MapReduce stages).
5. Execution Engine
The Execution Engine translates the execution plan generated by the Driver into MapReduce jobs and
executes them on Hadoop. The results of these jobs are then passed back to the Driver and, eventually,
to the user.
6. HDFS and MapReduce (Hadoop Layer)
Hive relies on Hadoop's storage and processing infrastructure. Data is stored in HDFS (Hadoop
Distributed File System), and computations are done through MapReduce jobs. Hive abstracts the
complexity of creating MapReduce programs by translating HiveQL queries into corresponding
MapReduce tasks that run across the Hadoop cluster.
Workflow of Hive
1. Query Submission: The user submits a query using the CLI, web UI, or external tools via
JDBC/ODBC.
2. Query Parsing: The Hive driver parses the query to check for syntax errors and validates the schema
using the Metastore.
3. Logical Plan Creation: The query is converted into a logical plan by the Compiler, which defines the
steps to retrieve the data.
4. Optimization: The logical plan is optimized for better execution by reducing the number of
MapReduce jobs or by combining jobs.
5. Job Execution: The Execution Engine translates the optimized plan into MapReduce jobs, which are
submitted to the Hadoop cluster for processing.
6. Result Retrieval: After the MapReduce jobs are completed, the results are passed back to the driver
and displayed to the user.
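To make this workflow concrete, the sketch below submits a HiveQL query through the JDBC interface
mentioned in step 1; Hive then parses, optimizes, and executes it as described above. The HiveServer2
address, the credentials, and the "sales" table are assumptions made only for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and a hypothetical table sales(region STRING, amount DOUBLE).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Hive compiles this HiveQL into MapReduce jobs and runs them on the cluster.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, SUM(amount) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}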
Hive Architecture Diagram
+-----------------------------+
|       User Interface        |
|  (CLI, Web UI, ODBC/JDBC)   |
+--------------+--------------+
               |
               v
+-----------------------------+
|    HiveQL Process Engine    |
+--------------+--------------+
               |
               v
+-----------------------------+
|           Driver            |
|(Parser, Compiler, Optimizer)|
+--------------+--------------+
               |
               v
+-----------------------------+
|      Execution Engine       |
+--------------+--------------+
               |
               v
+-----------------------------+
|       Hadoop Cluster        |
|     (HDFS & MapReduce)      |
+--------------+--------------+
               |
               v
+-----------------------------+
|        Data Storage         |
|    (HDFS, HBase, etc.)      |
+-----------------------------+
Conclusion
Hive architecture leverages Hadoop’s storage and processing capabilities to handle big data while
providing a SQL-like interface for users. Its components, including the Metastore, Driver, and
Execution Engine, work in tandem to process large-scale queries efficiently by translating them into
MapReduce tasks. This architecture allows Hive to process structured and semi-structured data, making
it a powerful tool in the Hadoop ecosystem for data analytics.
4.a) Hadoop Distributed File System (HDFS) is a core component of the Apache Hadoop ecosystem
and is designed to store vast amounts of data across many machines in a cluster. It provides high-
throughput access to this data and is highly fault-tolerant. Here's a detailed breakdown of HDFS, its
architecture, components, and features:
HDFS is modeled after Google's proprietary Google File System (GFS) and is optimized for handling
large data sets across distributed machines. It is particularly suited for systems that involve processing
and analyzing massive datasets using parallel computation frameworks like MapReduce.
HDFS is designed as a master-slave architecture, which divides tasks between two types of nodes: the
NameNode and the DataNodes. The NameNode is the master that manages the filesystem namespace and
block metadata, while the DataNodes are the workers that store and serve the actual data blocks.
Blocks and Replication
HDFS splits large files into blocks, which are distributed across the cluster of DataNodes.
- Block Size: HDFS uses large block sizes (typically 128 MB or 256 MB) to minimize the overhead of
block management and maximize throughput.
- Block Replication: To achieve fault tolerance, HDFS replicates each block across multiple
DataNodes. The default replication factor is 3, meaning each block is stored on three different
DataNodes. With the default rack-aware placement policy (a small client-side example of writing a file
and inspecting its block size and replication factor follows this list):
- The first replica is written to the DataNode local to the writer (or a random node if the writer runs
outside the cluster).
- The second replica is placed on a DataNode in a different rack.
- The third replica is placed on a different DataNode in the same rack as the second replica.
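To see blocks and replication from the client side, here is a minimal sketch using Hadoop's FileSystem
API to write a file into HDFS and then read back the block size and replication factor applied to it. The
NameNode address and the file path are placeholders; in a real deployment they come from the cluster
configuration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/sample.txt");   // illustrative path

        // The client streams the data; HDFS splits it into blocks and the NameNode
        // chooses the DataNodes on which each replica is placed.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello hdfs\n");
        }

        // Inspect the block size and replication factor recorded for the file.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());

        fs.close();
    }
}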
Scalability
HDFS is designed to scale horizontally. You can add more DataNodes to the cluster to store more data
and increase processing capacity.
Rack Awareness
HDFS is rack-aware, which means it takes into account the physical location (rack) of the DataNodes
when placing replicas. This improves fault tolerance by ensuring that not all replicas of a block are
stored in the same rack, which protects against rack-level failures (e.g., a power failure affecting an
entire rack).
Limitations of HDFS
- Latency: HDFS is optimized for high-throughput batch processing and is not ideal for applications
that require low-latency access.
- Small files: Storing many small files is inefficient in HDFS because each file's metadata is held in
the NameNode's memory, and too many small files can overwhelm the NameNode.
- Single Point of Failure: In older versions of Hadoop, the NameNode was a single point of failure
(SPOF). If the NameNode crashed, the entire HDFS became inaccessible. Modern versions of Hadoop
address this through the High Availability (HA) setup, where a standby NameNode takes over in case
of failure.
High Availability (HA)
To mitigate the SPOF problem with the NameNode, HDFS introduced High Availability. In HA mode,
there are:
- Active NameNode: The primary NameNode, which handles client requests.
- Standby NameNode: A backup NameNode that is synchronized with the active NameNode. If the
active NameNode fails, the standby node takes over, minimizing downtime.
The synchronization between the two NameNodes is facilitated by a shared edit log that both nodes can
access, ensuring that no data is lost during the failover process.
HDFS Federation
HDFS Federation allows scaling of the NameNode horizontally. In traditional HDFS, the entire
filesystem is managed by a single NameNode, which can become a bottleneck as the system grows.
Federation introduces multiple independent NameNodes, each managing a portion of the namespace,
allowing for improved scalability.
Conclusion
HDFS is a distributed file system that is critical for handling massive amounts of data across multiple
nodes in a Hadoop cluster. Its ability to store large files by splitting them into blocks, coupled with
replication, ensures reliability, scalability, and fault tolerance. However, HDFS is best suited for write-
once, read-many workloads and batch processing environments. It remains a foundational technology
in big data ecosystems due to its robustness and efficiency in managing large-scale data storage and
retrieval.
4.b) WordCount MapReduce program in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Enclosing driver class (required so the nested Mapper and Reducer classes compile)
public class WordCount {

    // Mapper class: emits (word, 1) for every token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reducer class: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();