BDA 3rd Unit QB

17. Enlist the key advantages of Hadoop

1. **Scalability**: Hadoop is highly scalable, allowing organizations to easily scale up their data storage and processing capabilities by simply adding more commodity hardware to the cluster. This makes it suitable for handling large volumes of data.

2. **Cost-effective**: Hadoop is built to run on commodity hardware, which is much cheaper than traditional proprietary hardware. Additionally, its open-source nature eliminates licensing costs, making it a cost-effective solution for storing and processing big data.

3. **Fault tolerance**: Hadoop's distributed storage and processing framework ensures fault tolerance by replicating data across multiple nodes in a cluster. If any node fails, data can easily be recovered from the replicas on other nodes, ensuring high availability and reliability.

4. **Parallel processing**: Hadoop's MapReduce programming model enables parallel processing of large datasets across a distributed cluster of commodity hardware. This allows for much faster data processing compared to traditional single-node systems (see the word-count sketch after this list).

5. **Flexibility**: Hadoop is designed to handle various types of data, including structured, semi-structured, and unstructured data. This flexibility makes it suitable for a wide range of use cases, from traditional data warehousing to advanced analytics and machine learning.

6. **Ecosystem**: Hadoop has a rich ecosystem of tools and technologies built around it,
such as Apache Hive for data warehousing, Apache Spark for in-memory processing,
and Apache HBase for real-time NoSQL database capabilities. This ecosystem provides
comprehensive solutions for different data processing and analytics needs.

7. **Community support**: Being an open-source project, Hadoop benefits from a large and active community of developers and users who contribute to its development, share knowledge, and provide support through forums, mailing lists, and other channels.
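As an illustration of the parallel-processing point above, here is a minimal word-count sketch using the standard `org.apache.hadoop.mapreduce` API. The class names (`WordCount`, `TokenizerMapper`, `IntSumReducer`) follow the conventional example and are only illustrative; input splitting, shuffling, and scheduling across the cluster are handled by the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Each mapper instance processes one input split in parallel,
    // emitting a (word, 1) pair for every token it sees.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The framework groups all values for a key and hands them to a reducer,
    // which sums the per-word counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

Each mapper works on its own input split, ideally on the node where the data block resides, which is where the speed-up over a single-node system comes from.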
18. Difference between RDBMS and Hadoop

20. Describe HDFS


HDFS, or Hadoop Distributed File System, is the primary storage system used by
Hadoop for distributed storage of large volumes of data across a cluster of commodity
hardware. It is designed to provide high throughput and fault tolerance, making it
suitable for storing and processing big data.

1. **Distributed Storage**:
- HDFS divides large files into smaller blocks (typically 128 MB or 256 MB in size) and
distributes these blocks across multiple data nodes in the cluster.
- Each block is replicated across multiple data nodes (usually three replicas by default) to
ensure fault tolerance. If a data node fails, the blocks stored on that node can be retrieved
from replicas stored on other nodes.
2. **Master-Slave Architecture**:
- HDFS follows a master-slave architecture with two main components: NameNode and
DataNode.
- NameNode: Acts as the master node and manages the metadata of the file system,
including the namespace hierarchy, file-to-block mapping, and location of each block's
replicas.
- DataNodes: Act as the slave nodes and store the actual data blocks. DataNodes communicate with the NameNode to report the list of blocks they are storing and to perform block replication or deletion as instructed.

3. **Namespace and Metadata**:
- HDFS maintains a hierarchical namespace similar to a typical file system, with
directories and files organized in a tree-like structure.
- The metadata, including file names, directory structure, and block locations, is stored in memory by the NameNode and periodically persisted to disk as a file system image (fsimage) and edit logs.

4. **Data Replication**:
- HDFS replicates data blocks across multiple DataNodes to ensure fault tolerance and high availability.
- The default replication factor is typically set to three, meaning each data block is replicated three times across different nodes in the cluster. However, this replication factor can be configured based on the desired level of fault tolerance and data redundancy (see the sketch after this list).

5. **Streaming Access**:
- HDFS is optimized for streaming data access rather than random access. It is well-suited
for applications that process large files sequentially, such as batch processing, data
warehousing, and log analysis.
- HDFS achieves high throughput by prioritizing large sequential reads and writes over
low-latency random access operations.
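To make the block-size and replication points concrete, below is a minimal sketch using Hadoop's `FileSystem` Java API. The NameNode address `hdfs://namenode:9000` and the path `/demo/sample.txt` are placeholders; in a real deployment the defaults would normally come from `core-site.xml` and `hdfs-site.xml`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/sample.txt");

        // Write a file with 3 replicas and a 128 MB block size,
        // overriding the cluster defaults for this file only.
        try (FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeBytes("hello hdfs\n");
        }

        // File metadata such as replication and block size is served by the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());
        System.out.println("block size  = " + status.getBlockSize());

        // Sequential (streaming) read, the access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(file)) {
            in.transferTo(System.out);
        }
        fs.close();
    }
}
```

The per-file replication and block size passed to `create()` simply override the cluster-wide defaults (`dfs.replication`, `dfs.blocksize`) for that one file.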
21. Describe processing data with Hadoop

Processing data with Hadoop typically involves the MapReduce programming model or other distributed processing frameworks such as Apache Spark or Apache Flink. Here is an overview of how data processing is done with Hadoop:

Processing data with Hadoop begins with ingesting data from various sources into the
Hadoop cluster using tools like Apache Flume, Apache Sqoop, or custom scripts. Once
ingested, the data is stored in HDFS, Hadoop's distributed file system, which ensures
fault tolerance and high availability by distributing data across the cluster. Data
processing is then carried out using distributed processing frameworks like MapReduce,
Spark, or Flink, which enable parallel processing of large datasets across the cluster.
After processing, results can be stored back into HDFS or exported to external systems
for further analysis, visualization, or reporting. Throughout the process, monitoring
tools like Apache Ambari or Cloudera Manager are used to ensure the health and
performance of the Hadoop cluster. Dynamic resource management may be employed
to optimize performance and resource utilization during data processing. Overall,
processing data with Hadoop offers a scalable and efficient solution for handling large
volumes of data across distributed environments.
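As a simplified, end-to-end sketch of the processing step, the driver below submits a MapReduce job that reads its input from HDFS and writes the results back to HDFS, reusing the `WordCount` mapper and reducer sketched under question 17; the `/data/input` and `/data/output` paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer from the sketch under question 17.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths for the ingested input and the job output.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In practice the compiled classes are packaged into a jar and submitted to the cluster (for example with the `hadoop jar` command) after the input data has been ingested into HDFS with tools such as Flume or Sqoop.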
