BDA 3rd Unit QB
6. **Ecosystem**: Hadoop has a rich ecosystem of tools and technologies built around it,
such as Apache Hive for data warehousing, Apache Spark for in-memory processing,
and Apache HBase for real-time NoSQL database capabilities. This ecosystem provides
comprehensive solutions for different data processing and analytics needs.
1. **Distributed Storage**:
- HDFS divides large files into smaller blocks (typically 128 MB or 256 MB in size) and
distributes these blocks across multiple data nodes in the cluster.
- Each block is replicated across multiple data nodes (usually three replicas by default) to
ensure fault tolerance. If a data node fails, the blocks stored on that node can be retrieved
from replicas stored on other nodes.
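A minimal Java sketch of this block layout, using the Hadoop FileSystem API (the path /data/input/logs.txt is only an illustrative example): it asks the NameNode for the blocks of a file and prints which DataNodes hold each replica.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical example file stored in HDFS
        FileStatus status = fs.getFileStatus(new Path("/data/input/logs.txt"));

        // Ask the NameNode for the block layout of the whole file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```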
2. **Master-Slave Architecture**:
- HDFS follows a master-slave architecture with two main components: NameNode and
DataNode.
- NameNode: Acts as the master node and manages the metadata of the file system,
including the namespace hierarchy, file-to-block mapping, and location of each block's
replicas.
- DataNode: Acts as a slave node and stores the actual data blocks. DataNodes
communicate with the NameNode to report the blocks they are storing and to
perform block replication or deletion as instructed.
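The read path illustrates this master-slave split. In the sketch below (assuming the same illustrative path as above), FileSystem.open() consults the NameNode for block locations, while the returned stream pulls the actual bytes directly from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode where the blocks live; the data itself
        // is streamed from the DataNodes, not from the NameNode.
        try (FSDataInputStream in = fs.open(new Path("/data/input/logs.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```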
4. **Data Replication**:
- HDFS replicates data blocks across multiple DataNodes to ensure fault tolerance and high
availability.
- The default replication factor is three, meaning three copies of each data block are kept
on different nodes in the cluster. This factor can be configured based on the desired
level of fault tolerance and data redundancy.
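A short sketch of how client code can influence the replication factor, both for newly created files (via the dfs.replication property) and for an existing file; the path is again only an illustrative example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing (hypothetical) file;
        // the NameNode schedules the extra copies or deletions in the background.
        fs.setReplication(new Path("/data/input/logs.txt"), (short) 2);
        fs.close();
    }
}
```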
5. **Streaming Access**:
- HDFS is optimized for streaming data access rather than random access. It is well-suited
for applications that process large files sequentially, such as batch processing, data
warehousing, and log analysis.
- HDFS achieves high throughput by prioritizing large sequential reads and writes over
low-latency random access operations.
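A sketch of the write-once, sequential-streaming style HDFS favours: one output stream, appended to in order, with no random seeks. The output path is an illustrative example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical output file; records are appended strictly in sequence
        try (FSDataOutputStream out = fs.create(new Path("/data/output/events.log"))) {
            for (int i = 0; i < 1_000; i++) {
                out.writeBytes("event-" + i + "\n");   // sequential writes, no random access
            }
        }   // closing the stream flushes the last block and finalizes the file
        fs.close();
    }
}
```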
21. Describe the processing of data with Hadoop
Processing data with Hadoop typically involves using the MapReduce programming model
or other distributed processing frameworks such as Apache Spark or Apache Flink. Here's
an overview of how data processing is done with Hadoop:
Processing data with Hadoop begins with ingesting data from various sources into the
Hadoop cluster using tools like Apache Flume, Apache Sqoop, or custom scripts. Once
ingested, the data is stored in HDFS, Hadoop's distributed file system, which ensures
fault tolerance and high availability by distributing data across the cluster. Data
processing is then carried out using distributed processing frameworks like MapReduce,
Spark, or Flink, which enable parallel processing of large datasets across the cluster.
After processing, results can be stored back into HDFS or exported to external systems
for further analysis, visualization, or reporting. Throughout the process, monitoring
tools like Apache Ambari or Cloudera Manager are used to ensure the health and
performance of the Hadoop cluster. Dynamic resource management may be employed
to optimize performance and resource utilization during data processing. Overall,
processing data with Hadoop offers a scalable and efficient solution for handling large
volumes of data across distributed environments.
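To make the MapReduce step concrete, the classic WordCount job is sketched below: mappers tokenize input splits in parallel, the framework shuffles intermediate (word, count) pairs by key, and reducers sum the counts. Input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);            // emit (word, 1) for each token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();                    // add up all counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```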