BDA Module-2
1. Hadoop Common
a. Purpose: Provides essential libraries and utilities for other modules in Hadoop.
b. Functions:
i. Contains the common libraries and utilities required by the other Hadoop modules.
ii. Manages general input/output, serialization, and Java RPC (Remote Procedure Call).
iii. Facilitates file-based data structures.
2. Hadoop Distributed File System (HDFS)
a. Purpose: A Java-based distributed file system for storing large datasets.
b. Functions:
i. Stores data across multiple nodes in a cluster.
ii. Handles structured, unstructured, and semi-structured data.
iii. Provides high fault tolerance and scalability by replicating data blocks (a short shell example follows this list).
3. MapReduce v1
a. Purpose: A programming model for processing large data sets in parallel.
b. Functions:
i. Divides tasks into "Map" and "Reduce" steps for parallel execution.
ii. Processes data in batch mode, suitable for large-scale data computations.
4. YARN (Yet Another Resource Negotiator)
a. Purpose: Manages resources for distributed computing.
b. Functions:
i. Allocates resources for application tasks or sub-tasks running on Hadoop.
ii. Schedules tasks in parallel and handles resource requests.
iii. Ensures distributed and efficient task execution.
5. MapReduce v2 (Hadoop 2 YARN)
a. Purpose: An updated version of MapReduce that runs as an application on YARN and handles larger datasets.
b. Functions:
i. Enables parallel processing of large datasets.
ii. Improves scalability and resource management compared to MapReduce v1.
iii. Optimized for distributed processing of application tasks.
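As a quick illustration of how these modules are used in practice, here is a minimal HDFS shell session with the standard hdfs dfs commands (the directory and file names are hypothetical):

    $ hdfs dfs -mkdir -p /user/demo           # create a directory in HDFS
    $ hdfs dfs -put report.csv /user/demo/    # copy a local file into HDFS
    $ hdfs dfs -ls /user/demo                 # list the directory contents
    $ hdfs dfs -cat /user/demo/report.csv     # print the file's contents

Each file copied into HDFS is split into blocks and replicated across DataNodes according to the configured replication factor.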
Hadoop Ecosystem Components
Refers to a combination of technologies and tools that work together with Hadoop.
Supports storage, processing, access, analysis, governance, security, and operations for Big Data.
HDFS Architecture
HDFS operates using a Master/Slave architecture consisting of the following key components:
1. NameNode (Master)
• Role:
o Manages the file system namespace and regulates client access to files.
o Stores and manages metadata, such as file structure, file locations, permissions, and the mapping of file blocks to DataNodes.
• Key Responsibilities:
o Handles file system namespace operations: Opening, closing, renaming files, and directories.
o Determines the mapping of blocks to DataNodes.
o Monitors the health of DataNodes using heartbeat signals.
o Handles block creation, deletion, and replication.
o Ensures high availability of data by replicating blocks based on the replication factor.
• Data Flow:
o Does not store actual data but provides clients with metadata for interacting with DataNodes.
o Keeps all metadata in memory for fast access.
2. DataNodes (Slaves)
• Role:
o Store the actual data blocks of files.
o Serve read and write requests directly from/to clients.
• Key Responsibilities:
o Report block information to the NameNode through periodic heartbeats and block reports.
o Replicate data blocks as instructed by the NameNode.
o Handle actual data transfer between the client and storage.
• Data Replication:
o To ensure data reliability, the same block is replicated across multiple DataNodes.
o If multiple racks exist, the NameNode tries to replicate blocks across different racks to enhance fault tolerance (see the configuration sketch after this component list).
3. Client
• Role: Acts as the user-facing component that interacts with HDFS.
• Interaction Process:
o Requests file creation or retrieval from the NameNode.
o Receives metadata (e.g., block locations) from the NameNode.
o Interacts directly with DataNodes for data read/write operations.
4. Secondary NameNode (Checkpoint Node)
• Role:
o Assists the primary NameNode in managing metadata.
o Periodically merges the namespace image (fsimage) and the edit log to create a new namespace image checkpoint.
• Key Characteristics:
o Not a backup or failover node.
o Does not replace the NameNode in case of failure.
o Keeps the edit log from growing unbounded, which shortens NameNode restart time and improves the manageability of metadata.
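The replication factor referred to above is typically set cluster-wide in hdfs-site.xml. A minimal configuration sketch (the value 3 is the common default and is shown here only for illustration):

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- each block is stored on three DataNodes -->
      </property>
    </configuration>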
Data Flow in HDFS
1. Write Operation:
a. The client requests the NameNode to create a file.
b. The NameNode allocates the required blocks and provides the DataNodes for storage.
c. The client writes data to the designated DataNodes.
d. The NameNode ensures replication of blocks.
2. Read Operation:
a. The client requests a file from the NameNode.
b. The NameNode returns the DataNodes storing the file blocks.
c. The client reads data directly from the DataNodes.
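The same write/read flow can be driven from the Hadoop Java client API. The sketch below assumes a configured cluster and a hypothetical path /user/demo/example.txt; FileSystem.get() obtains metadata through the NameNode, while the data itself streams to and from DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);           // client entry point; talks to the NameNode

            Path file = new Path("/user/demo/example.txt"); // hypothetical path

            // Write: the NameNode allocates blocks; the client streams bytes to DataNodes
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello HDFS");
            }

            // Read: the NameNode returns block locations; the client reads from DataNodes
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }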
Fault Tolerance in HDFS
• Heartbeat Monitoring:
o The NameNode monitors DataNodes via heartbeats.
o A prolonged absence of heartbeats marks a DataNode as failed (the heartbeat and timeout intervals are configurable; see the sketch after these bullets).
• Re-Replication: The NameNode re-replicates the blocks that were stored on the failed DataNode to maintain the replication factor.
• Rack Awareness: Replication is designed to optimize fault tolerance by storing data across different racks.
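The heartbeat behaviour is tunable in hdfs-site.xml. A sketch using the usual defaults (with these values a DataNode is declared dead only after roughly ten minutes without heartbeats):

    <configuration>
      <property>
        <name>dfs.heartbeat.interval</name>
        <value>3</value> <!-- DataNodes send a heartbeat every 3 seconds -->
      </property>
      <property>
        <name>dfs.namenode.heartbeat.recheck-interval</name>
        <value>300000</value> <!-- NameNode recheck window, in milliseconds -->
      </property>
    </configuration>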
Key Features of MapReduce
1. Automatic Parallelization and Distribution: Automates the division of computation tasks across several processors, enabling efficient parallel processing.
2. Distributed Processing: Processes data stored on clusters of DataNodes distributed across racks.
3. Large-Scale Data Handling: Supports the processing of vast amounts of data simultaneously.
4. Scalability: Offers scalability by enabling the usage of numerous servers in the cluster.
5. Batch-Oriented Programming Model: Provides a batch-oriented model for processing in Hadoop version 1.
6. Enhanced Processing in Hadoop 2 (YARN-Based): Running on YARN (Yet Another Resource Negotiator), Hadoop 2 supports additional processing modes beyond batch:
o Query processing
o Graph databases
o Real-time analytics
o Streaming data
These features align with the 3Vs of Big Data (Volume, Variety, Velocity).
Framework Architecture
The MapReduce framework handles two primary functions:
1. Job Distribution: Distributes the application tasks (jobs) to nodes within the cluster for parallel execution.
2. Data Aggregation: Collects and organizes intermediate results from nodes into a cohesive response.
Execution Process
1. JobTracker (Master Node)
a. A daemon (background process) responsible for:
i. Estimating resource requirements for tasks.
ii. Analyzing the states of slave nodes.
iii. Assigning map tasks to TaskTrackers on the DataNodes that hold the input data (data locality).
iv. Monitoring the progress of tasks.
v. Handling failures by rescheduling tasks.
2. TaskTracker (Slave Nodes)
a. Executes the Map and Reduce tasks assigned by the JobTracker.
b. Reports the status of tasks back to the JobTracker.
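To make the Map and Reduce steps concrete, here is the classic WordCount example sketched in the standard Hadoop Java MapReduce API (input and output paths come from the command line; this is a minimal illustration, not production code):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts collected for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The Map phase corresponds to the job-distribution function described above, and the Reduce phase to data aggregation; the combiner is an optional local optimization applied before the shuffle.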
4. APACHE HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates data summarization, ad
hoc queries, and analysis of large datasets using a SQL-like language called HiveQL. It provides an interface for
querying and managing data stored in Hadoop's distributed file system (HDFS) or other storage systems like
HBase.
Using Hive
1. Start Hive
a. Enter the Hive command to begin a session and access the hive> prompt.
b. Command: $ hive
2. Create a Table
a. Use the CREATE TABLE command to define a table structure.
b. Command: hive> CREATE TABLE pokes (foo INT, bar STRING);
3. List Tables
a. Use the SHOW TABLES command to display all existing tables.
b. Command: hive> SHOW TABLES;
4. Drop a Table
a. Use the DROP TABLE command to delete a specific table.
b. Command: hive> DROP TABLE pokes;
5. Commands End with Semicolon: All Hive commands must conclude with a semicolon (;).
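Beyond table management, a typical session loads data and then queries it. A short HiveQL sketch using the pokes table created above (the input file path and the filter value are illustrative):

    hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
    hive> SELECT foo, bar FROM pokes WHERE foo > 100 LIMIT 10;
    hive> SELECT bar, COUNT(*) FROM pokes GROUP BY bar;

Hive compiles such statements into jobs (classically MapReduce) that execute on the cluster, so queries over large tables run in batch rather than returning instantly.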
Applications of Hive
• Interactive SQL queries on large datasets (petabytes).
• Data analysis and summarization.
• Integration with Hadoop for big data solutions.