BDA Answers


7.a) What is NoSQL?

NoSQL (Not Only SQL) refers to a broad class of database management systems that do not rely on the
traditional relational database model, which uses structured query language (SQL) for data
manipulation. NoSQL databases are designed to handle large volumes of unstructured, semi-structured,
or structured data with high performance and scalability, making them suitable for big data
applications.

Unlike relational databases that use predefined schemas and tables, NoSQL databases provide more
flexibility in data storage and processing. They can scale horizontally by adding more servers to handle
increasing loads. NoSQL databases are particularly useful for distributed, cloud-based applications that
require high availability, fault tolerance, and rapid response times.

Types of NoSQL Databases

There are four primary types of NoSQL databases, each optimized for different use cases:

1. Document-Oriented Databases

Description: These databases store data in documents (usually JSON, BSON, or XML format) rather
than in rows and columns like in relational databases. Each document can have different structures,
making it flexible for handling varying data formats.

Use Cases: Ideal for managing hierarchical data, content management systems, e-commerce
applications, and real-time big data analytics.

Examples:

MongoDB: One of the most popular document-oriented NoSQL databases, widely used for web
applications and big data platforms.

CouchDB: Stores data in JSON format and uses JavaScript for query processing.
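
For illustration, here is a minimal sketch of storing and querying a schema-flexible document with the MongoDB Java driver (sync driver 4.x API assumed); the connection URI, the "shop" database, the "products" collection, and the field names are assumptions for the example, not something prescribed above.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class DocumentStoreExample {
    public static void main(String[] args) {
        // Assumed local MongoDB instance; adjust the URI for your environment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products");

            // Each document can have its own structure; no fixed schema is required.
            Document doc = new Document("name", "laptop")
                    .append("price", 999.99)
                    .append("tags", java.util.Arrays.asList("electronics", "portable"));
            products.insertOne(doc);

            // Query by a field value rather than by rows and columns.
            Document found = products.find(Filters.eq("name", "laptop")).first();
            System.out.println(found != null ? found.toJson() : "not found");
        }
    }
}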

2. Key-Value Stores

Description: In key-value databases, data is stored as a collection of key-value pairs, where each key is
unique. These databases are simple, fast, and highly scalable. They are primarily used for storing and
retrieving simple data.

Use Cases: Best suited for caching, session management, real-time recommendation engines, and
storing configuration settings.

Examples:

Redis: An in-memory key-value store known for its high performance, often used for caching and real-time analytics.

Amazon DynamoDB: A fully managed key-value store provided by AWS, used for large-scale applications requiring low-latency performance.
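
A minimal sketch of the key-value model using the Jedis client for Redis is shown below; the server address, key names, and expiry time are illustrative assumptions.

import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Assumed local Redis server; keys and values are illustrative.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store and retrieve a simple value by its unique key.
            jedis.set("user:1001:name", "Alice");
            String name = jedis.get("user:1001:name");
            System.out.println(name);

            // Typical caching/session pattern: a value that expires after 30 minutes.
            jedis.setex("session:abc123", 1800, "user:1001");
        }
    }
}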

3. Column-Family Stores

Description: These databases organize data into column families rather than fixed rows and columns. A column family is a collection of rows, each identified by a unique row key, and each row can hold a different set of columns. This allows efficient storage and retrieval of sparse data and can handle large datasets spread across distributed systems.

Use Cases: Ideal for data warehousing, distributed data storage, and applications requiring high write
throughput like social networks and recommendation engines.

Examples:

Apache Cassandra: A highly scalable, distributed column-family store used by companies like
Facebook, Instagram, and Netflix.

HBase: Built on top of Hadoop, HBase is designed to handle large amounts of sparse data across
distributed systems.
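
The sketch below illustrates the column-family model with the HBase Java client, assuming an existing table named "users" with a column family "profile"; the table, family, column names, and values are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnFamilyExample {
    public static void main(String[] args) throws Exception {
        // Assumes a reachable HBase cluster and an existing table "users" with family "profile".
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: each row is addressed by a row key; columns live inside a column family.
            Put put = new Put(Bytes.toBytes("user1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Read: fetch the row and pick out individual columns; absent columns cost nothing.
            Result result = table.get(new Get(Bytes.toBytes("user1001")));
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name")));
            System.out.println(name);
        }
    }
}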

4. Graph Databases

Description: Graph databases are designed to represent and store data in the form of nodes (entities)
and edges (relationships between entities). These databases excel at handling complex relationships
between data points.

Use Cases: Commonly used for social networks, recommendation engines, fraud detection, and
network analysis where data relationships are crucial.

Examples:

Neo4j: One of the most widely used graph databases, known for its performance in handling complex
queries on large-scale graphs.

Amazon Neptune: A managed graph database service that supports both property graphs and RDF
graph models.
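
As a sketch of the graph model, the example below uses the Neo4j Java driver (4.x API assumed) to create two nodes joined by a relationship and then traverse it with Cypher; the Bolt URI, credentials, labels, and property names are assumptions for illustration.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class GraphExample {
    public static void main(String[] args) {
        // Assumed local Neo4j instance and credentials; the data model is illustrative.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Nodes (entities) and an edge (relationship) between them.
            session.run("CREATE (a:Person {name: $a})-[:FRIENDS_WITH]->(b:Person {name: $b})",
                    Values.parameters("a", "Alice", "b", "Bob"));

            // Traverse the relationship to find Alice's friends.
            Result result = session.run(
                    "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
                    Values.parameters("a", "Alice"));
            while (result.hasNext()) {
                System.out.println(result.next().get("friend").asString());
            }
        }
    }
}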

Conclusion

NoSQL databases offer flexibility, scalability, and high performance, making them well-suited for
handling large volumes of data typical in big data environments. Depending on the use case, different
types of NoSQL databases—document-oriented, key-value, column-family, and graph—provide
tailored solutions for managing and querying data in modern applications.
7.b) Hive Architecture and Working

Apache Hive is a data warehouse software built on top of Hadoop that provides an SQL-like interface
to query data stored in Hadoop Distributed File System (HDFS). Hive abstracts the complexity of
Hadoop's MapReduce programming by allowing users to write queries in HiveQL, a query language
similar to SQL. This makes it easier for analysts and developers to work with large datasets in Hadoop.

Hive Architecture

The Hive architecture consists of several key components that work together to process data:

1. User Interface (UI)

The user interface allows users to interact with Hive by submitting queries, managing tables, and
performing data operations. Hive provides several interfaces such as:

CLI (Command Line Interface): The basic interface for writing HiveQL queries.

Web Interface: A browser-based UI to interact with Hive.

ODBC/JDBC Drivers: Connects external applications to Hive for querying.
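
As an illustration of the JDBC route, the sketch below submits a HiveQL query to HiveServer2 from a Java program; the connection URL, the "sales" table, and the query itself are assumptions for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumes a HiveServer2 instance on localhost and an existing table "sales".
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // HiveQL is submitted like SQL; Hive compiles it into MapReduce jobs behind the scenes.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getDouble("total"));
            }
            // EXPLAIN <query> can be used to inspect the plan produced by the compiler/optimizer.
        }
    }
}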

2. HiveQL Process Engine

HiveQL (Hive Query Language) is the SQL-like language used to query data in Hive. When a user submits a query, the HiveQL process engine parses and compiles it, converting the query into MapReduce jobs that can be executed on Hadoop.

3. Metastore

The Metastore is a crucial component in Hive. It stores metadata about the tables, databases, columns,
data types, partitions, and other elements. It helps the Hive engine understand the schema and structure
of the data.

Metastore can be backed by a relational database like MySQL or Derby, which stores the schema and
related metadata.

4. Driver

The Driver acts as the controller that receives the query, initiates the query execution, and interacts with
all components. It manages the lifecycle of the query, including parsing, optimizing, and compiling it
into MapReduce tasks.

The driver has the following key components:


Parser: Breaks the HiveQL query into smaller parts and checks for syntax errors.

Compiler: Converts the logical plan (from the parser) into an execution plan with MapReduce jobs.

Optimizer: Optimizes the query execution by rearranging the jobs for better performance (e.g.,
reducing the number of MapReduce stages).

5. Execution Engine

The Execution Engine translates the execution plan generated by the Driver into MapReduce jobs and
executes them on Hadoop. The results of these jobs are then passed back to the Driver and, eventually,
to the user.

6. Hadoop (HDFS & MapReduce)

Hive relies on Hadoop’s storage and processing infrastructure. Data is stored in HDFS (Hadoop
Distributed File System), and computations are done through MapReduce jobs.

Hive abstracts the complexity of creating MapReduce programs by translating HiveQL queries into
corresponding MapReduce tasks that run across the Hadoop cluster.

Workflow of Hive

1. Query Submission: The user submits a query using the CLI, web UI, or external tools via
JDBC/ODBC.

2. Query Parsing: The Hive driver parses the query to check for syntax errors and validates the schema
using the Metastore.

3. Logical Plan Creation: The query is converted into a logical plan by the Compiler, which defines the
steps to retrieve the data.

4. Optimization: The logical plan is optimized for better execution by reducing the number of
MapReduce jobs or by combining jobs.

5. Job Execution: The Execution Engine translates the optimized plan into MapReduce jobs, which are
submitted to the Hadoop cluster for processing.

6. Result Retrieval: After the MapReduce jobs are completed, the results are passed back to the driver
and displayed to the user.

Hive Architecture Diagram

Here’s a basic diagram to illustrate the components and flow:

+-------------------------------+
|        User Interface         |
|   (CLI, Web UI, ODBC/JDBC)    |
+---------------+---------------+
                |
                v
+-------------------------------+
|     HiveQL Process Engine     |
+---------------+---------------+
                |
                v
+-------------------------------+
|            Driver             |
| (Parser, Compiler, Optimizer) |
+---------------+---------------+
                |
                v
+-------------------------------+
|       Execution Engine        |
+---------------+---------------+
                |
                v
+-------------------------------+
|        Hadoop Cluster         |
|      (HDFS & MapReduce)       |
+---------------+---------------+
                |
                v
+-------------------------------+
|         Data Storage          |
|      (HDFS, HBase, etc.)      |
+-------------------------------+

Conclusion

Hive architecture leverages Hadoop’s storage and processing capabilities to handle big data while
providing a SQL-like interface for users. Its components, including the Metastore, Driver, and
Execution Engine, work in tandem to process large-scale queries efficiently by translating them into
MapReduce tasks. This architecture allows Hive to process structured and semi-structured data, making
it a powerful tool in the Hadoop ecosystem for data analytics.

4.a) Hadoop Distributed File System (HDFS) is a core component of the Apache Hadoop ecosystem
and is designed to store vast amounts of data across many machines in a cluster. It provides high-
throughput access to this data and is highly fault-tolerant. Here's a detailed breakdown of HDFS, its
architecture, components, and features:

### *1. Overview of HDFS*

HDFS is modeled after Google's proprietary Google File System (GFS) and is optimized for handling
large data sets across distributed machines. It is particularly suited for systems that involve processing
and analyzing massive datasets using parallel computation frameworks like MapReduce.

- *Scalability*: It can store data across hundreds or even thousands of nodes.
- *Fault tolerance*: HDFS ensures data availability even in the event of hardware failures.
- *Data locality*: HDFS tries to move computation to the location of the data, minimizing data transfer and improving efficiency.

### *2. HDFS Architecture*

HDFS is designed as a master-slave architecture, which divides tasks between different types of nodes:
*NameNode* and *DataNodes*.

#### *a. NameNode (Master Node)*


The NameNode is the centerpiece of HDFS and manages the filesystem metadata, such as:
- The directory structure (file names, permissions, etc.)
- Mapping of files to blocks
- Mapping of blocks to DataNodes

##### Key Responsibilities:


- *File system namespace management*: The NameNode keeps track of the overall structure of the
HDFS (i.e., directories, files, and the hierarchy).
- *Block management*: HDFS splits large files into smaller, fixed-size blocks (typically 128 MB by
default, but configurable). These blocks are stored on different DataNodes, and the NameNode tracks
where each block is stored.
- *Metadata storage*: The actual data of a file is stored on DataNodes, but the NameNode stores
metadata about these files, including where each block is stored and replication factors (how many
copies of a block exist).

#### *b. DataNodes (Slave Nodes)*


The DataNodes are the worker nodes in the HDFS architecture. They store the actual data blocks of the
files.

##### Key Responsibilities:


- *Data storage*: DataNodes store the actual file blocks as per instructions from the NameNode.
- *Heartbeat signals*: DataNodes regularly send heartbeat signals to the NameNode to confirm their
availability. If a DataNode fails to send heartbeats, the NameNode assumes it is down and re-replicates
the data blocks that were stored on that node to ensure redundancy.
- *Block reports*: DataNodes periodically report the list of blocks they store to the NameNode.

### *3. HDFS Blocks*

HDFS splits large files into blocks, which are distributed across the cluster of DataNodes.

- *Block Size*: HDFS uses large block sizes (usually 128 MB or 256 MB) to minimize the overhead of
block management and maximize throughput.
- *Block Replication*: To achieve fault tolerance, HDFS replicates each block across multiple
DataNodes. The default replication factor is 3, meaning each block is stored on three different
DataNodes.
- *First replica*: placed on the DataNode where the client is writing (or on a randomly chosen node if the client is outside the cluster).
- *Second replica*: placed on a DataNode in a different rack, so a rack-level failure cannot take out all copies.
- *Third replica*: placed on a different DataNode in the same rack as the second replica.
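
As a sketch, block size and replication can also be influenced from a client program through the Hadoop FileSystem API; the property values and file path below are illustrative, and cluster-wide defaults are normally set in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSettingsExample {
    public static void main(String[] args) throws Exception {
        // Per-client overrides of the cluster defaults; values here are illustrative.
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728");   // 128 MB block size
        conf.set("dfs.replication", "3");         // three replicas per block

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt"); // illustrative path

        // Block size and replication can also be passed explicitly when creating a file.
        try (FSDataOutputStream out = fs.create(file, true, 4096,
                (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("HDFS splits this file into 128 MB blocks, each replicated 3 times.");
        }
    }
}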

### *4. Key Features of HDFS*

#### a. *Fault Tolerance*


Fault tolerance is a key feature of HDFS. When a DataNode fails, HDFS ensures data integrity by:
- *Replication*: Each block is replicated to multiple DataNodes. If one node fails, another replica is
available.
- *Re-replication*: When a node goes down, the NameNode detects the missing replicas and instructs
the system to create new copies from existing replicas on healthy DataNodes.

#### b. *Data Integrity*


HDFS checksums data blocks as they are written, ensuring data integrity by verifying these checksums
during read operations. If a block is corrupted, HDFS retrieves the data from an alternate replica.

#### c. *Scalability*
HDFS is designed to scale horizontally. You can add more DataNodes to the system to store more data
and increase processing power.

#### d. *High Throughput*


HDFS is optimized for high-throughput applications. It is not designed for low-latency or real-time
access but for batch processing systems like MapReduce, where large datasets are processed.

#### e. *Write-Once, Read-Many Model*


HDFS follows a write-once, read-many model. Once data is written to HDFS, it cannot be modified
(though appending data is possible). This makes it suitable for scenarios where large datasets are
created once and read multiple times later for analysis (e.g., logs, sensor data, etc.).

### *5. Data Flow in HDFS*

#### *Write Operation*


1. *Client interaction*: The client communicates with the NameNode to request permission to write a
file.
2. *Block allocation*: The NameNode allocates blocks and selects DataNodes to store the replicas of
each block.
3. *Pipeline setup*: The client writes the data block-by-block to the first DataNode. The DataNodes
form a pipeline where the first DataNode sends a copy of the block to the second, and the second sends
a copy to the third.
4. *Acknowledgment*: Once all replicas are written, the DataNodes acknowledge the client.

#### *Read Operation*


1. *Client interaction*: The client requests file location information from the NameNode.
2. *Block retrieval*: The NameNode returns the addresses of the DataNodes that store the blocks. The
client directly reads the blocks from the DataNodes, usually from the closest replica (to optimize for
data locality).
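
These read and write flows are hidden behind the Hadoop FileSystem API; the sketch below writes and then reads a small file, with the client library handling the NameNode lookups and the DataNode pipeline. The path used is illustrative.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);    // Client-side entry point to HDFS
        Path file = new Path("/data/notes.txt"); // Illustrative path

        // Write: the client asks the NameNode for block allocations, then streams
        // data through the DataNode replication pipeline.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations and the client reads the
        // blocks directly from the nearest DataNode replicas.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}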

### *6. Rack Awareness*

HDFS is rack-aware, which means it takes into account the physical location (rack) of the DataNodes
when placing replicas. This improves fault tolerance by ensuring that not all replicas of a block are
stored in the same rack, which protects against rack-level failures (e.g., a power failure in the entire
rack).

### *7. Limitations of HDFS*

- *Latency*: HDFS is optimized for high-throughput batch processing and is not ideal for applications
that require low-latency access.
- *Small files*: Storing many small files can be inefficient in HDFS because each file’s metadata is
stored on the NameNode, and too many small files can overwhelm the memory of the NameNode.
- *Single Point of Failure*: In older versions of Hadoop, the NameNode was a single point of failure (SPOF). If the NameNode crashed, the entire HDFS became inaccessible. Modern versions of Hadoop address this through features like the *High Availability* (HA) setup, where a standby NameNode takes over in case of failure.

### *8. High Availability (HA) in HDFS*

To mitigate the SPOF problem with the NameNode, HDFS introduced High Availability. In HA mode,
there are:
- *Active NameNode*: The primary NameNode, which handles client requests.
- *Standby NameNode*: A backup NameNode that is synchronized with the active NameNode. If the
active NameNode fails, the standby node takes over, minimizing downtime.

The synchronization between the two NameNodes is facilitated by a shared edit log (typically maintained by a quorum of JournalNodes or on shared storage) that both nodes can access, ensuring that no data is lost during the failover process.

### *9. HDFS Federation*

HDFS Federation allows scaling of the NameNode horizontally. In traditional HDFS, the entire
filesystem is managed by a single NameNode, which can become a bottleneck as the system grows.
Federation introduces multiple independent NameNodes, each managing a portion of the namespace,
allowing for improved scalability.

### *Conclusion*
HDFS is a distributed file system that is critical for handling massive amounts of data across multiple
nodes in a Hadoop cluster. Its ability to store large files by splitting them into blocks, coupled with
replication, ensures reliability, scalability, and fault tolerance. However, HDFS is best suited for write-
once, read-many workloads and batch processing environments. It remains a foundational technology
in big data ecosystems due to its robustness and efficiency in managing large-scale data storage and
retrieval.

4.b) The following Java program implements the classic WordCount example using Hadoop MapReduce:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // The map method
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // Emit (word, 1)
            }
        }
    }

    // Reducer class
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // The reduce method
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // Sum the counts for each word
            }
            result.set(sum);
            context.write(key, result); // Emit (word, total_count)
        }
    }

    // Main method to configure and run the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // Optional combiner to optimize
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // Input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
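
To run the program, the class is typically packaged into a JAR and submitted to the cluster with a command along the lines of "hadoop jar wordcount.jar WordCount /input /output" (the JAR name and paths here are illustrative). The mappers emit (word, 1) pairs, the optional combiner pre-aggregates counts locally on each node, and the reducers write the final (word, total_count) pairs to the output directory in HDFS.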
