
BDA MODULE 2

1. HADOOP – CORE COMPONENTS, ECOSYSTEM COMPONENTS.


Hadoop Core Components

1. Hadoop Common
a. Purpose: Provides essential libraries and utilities for other modules in Hadoop.
b. Functions:
i. Contains components required for the operation of Hadoop.
ii. Manages general input/output, serialization, and Java RPC (Remote Procedure Call).
iii. Facilitates file-based data structures.
2. Hadoop Distributed File System (HDFS)
a. Purpose: A Java-based distributed file system for storing large datasets.
b. Functions:
i. Stores data across multiple nodes in a cluster.
ii. Handles structured, unstructured, and semi-structured data.
iii. Provides high fault tolerance and scalability by replicating data blocks.
3. MapReduce v1
a. Purpose: A programming model for processing large data sets in parallel.
b. Functions:
i. Divides tasks into "Map" and "Reduce" steps for parallel execution.
ii. Processes data in batch mode, suitable for large-scale data computations.
4. YARN (Yet Another Resource Negotiator)
a. Purpose: Manages resources for distributed computing.
b. Functions:
i. Allocates resources for application tasks or sub-tasks running on Hadoop.
ii. Schedules tasks in parallel and handles resource requests.
iii. Ensures distributed and efficient task execution.
5. MapReduce v2 (Hadoop 2 YARN)
a. Purpose: An updated version of MapReduce for handling larger datasets.
b. Functions:
i. Enables parallel processing of large datasets.
ii. Improves scalability and resource management compared to MapReduce v1.
iii. Optimized for distributed processing of application tasks.
Hadoop Ecosystem Components
Refers to a combination of technologies and tools that work together with Hadoop.
Supports storage, processing, access, analysis, governance, security, and operations for Big Data.

1. Core Hadoop Components


a. HDFS (Hadoop Distributed File System): For distributed storage of data.
b. MapReduce: Programming model for distributed data processing.
c. YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks for
distributed computing.
2. Data Store System
a. Consists of clusters, racks, DataNodes, and blocks for storing and managing data.
b. Deploys HDFS for efficient distributed file storage.
3. Application Programming Model
a. Includes models like MapReduce and HBase for data processing and storage.
b. HBase is a column-oriented (columnar) NoSQL database on top of HDFS and supports OLAP
(Online Analytical Processing) style analytical workloads.
4. Support Layer Components
a. AVRO: Data serialization system for efficient communication between Hadoop components.
b. ZooKeeper: Coordination service for managing distributed applications.
5. Application Layer Components
a. Pig: A high-level scripting platform for data transformation and analysis.
b. Hive: A data warehousing tool for querying and managing large datasets using SQL-like queries.
c. Sqoop: Facilitates data transfer between Hadoop and relational databases.
d. Ambari: Provides a web-based interface for managing and monitoring Hadoop clusters.
e. Chukwa: For monitoring and analyzing log data in the Hadoop ecosystem.
f. Mahout: A machine learning library for building scalable algorithms.
g. Spark: A fast and general-purpose cluster computing framework for real-time and batch data
processing.
h. Flink: A stream-processing framework for distributed data processing.
i. Flume: Used for collecting, aggregating, and transporting log data.

2. HDFS – FEATURES, CORE COMPONENTS.


Features
1. Distributed Storage: HDFS stores data across multiple nodes in a distributed manner, ensuring high
scalability and fault tolerance.
2. Replication: Data is replicated across multiple nodes (default replication factor is 3) to ensure availability
and reliability even in case of node failures.
3. Fault Tolerance: HDFS can handle hardware failures by replicating data and recovering lost files from
replicas.
4. Scalability: HDFS is designed to handle large-scale data and can scale horizontally by adding more nodes
to the cluster.
5. Write Once, Read Many: Data in HDFS is written once and read multiple times, which simplifies
concurrency control and improves efficiency.
6. High Throughput: HDFS is optimized for large data transfers, ensuring high throughput for data access
and processing.
7. Block-Based Storage: Files are divided into large blocks (128 MB by default in Hadoop 2, often
configured up to 256 MB) and distributed across the cluster, enabling efficient storage and processing.
8. Support for Large Files: HDFS is designed to handle files in the range of gigabytes to terabytes
efficiently.
9. Compatibility with Heterogeneous Hardware: HDFS works seamlessly with a variety of hardware and
operating systems, making it adaptable to diverse environments.
10. Built-In Redundancy: Data redundancy through replication ensures data integrity and minimizes the risk
of data loss.
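As an illustration of the replication and block-size features above, a minimal hdfs-site.xml sketch is shown below. The property names (dfs.replication, dfs.blocksize) are the standard HDFS configuration keys; the values are examples only, not requirements:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>            <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>    <!-- 128 MB block size, expressed in bytes -->
  </property>
</configuration>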
Components

HDFS operates using a Master/Slave architecture consisting of the following key components:
1. NameNode (Master)
• Role:
o Manages the file system namespace and regulates client access to files.
o Stores and manages metadata, such as file structure, file locations, permissions, and mapping of
file blocks to DataNodes.
• Key Responsibilities:
o Handles file system namespace operations: Opening, closing, renaming files, and directories.
o Determines the mapping of blocks to DataNodes.
o Monitors the health of DataNodes using heartbeat signals.
o Handles block creation, deletion, and replication.
o Ensures high availability of data by replicating blocks based on the replication factor.
• Data Flow:
o Does not store actual data but provides clients with metadata for interacting with DataNodes.
o Keeps all metadata in memory for fast access.
2. DataNodes (Slaves)
• Role:
o Store the actual data blocks of files.
o Serve read and write requests directly from/to clients.
• Key Responsibilities:
o Report block information to the NameNode through periodic heartbeats and block reports.
o Replicate data blocks as instructed by the NameNode.
o Handle actual data transfer between the client and storage.
• Data Replication:
o To ensure data reliability, the same block is replicated across multiple DataNodes.
o If multiple racks exist, the NameNode tries to replicate blocks across different racks to enhance
fault tolerance.
3. Client
• Role: Acts as the user-facing component that interacts with HDFS.
• Interaction Process:
o Requests file creation or retrieval from the NameNode.
o Receives metadata (e.g., block locations) from the NameNode.
o Interacts directly with DataNodes for data read/write operations.
4. Secondary NameNode (Checkpoint Node)
• Role:
o Assists the primary NameNode in managing metadata.
o Periodically merges the namespace image (fsimage) and the edit log to create a new namespace
image checkpoint.
• Key Characteristics:
o Not a backup or failover node.
o Does not replace the NameNode in case of failure.
o Improves the availability and manageability of metadata.
Data Flow in HDFS
1. Write Operation:
a. The client requests the NameNode to create a file.
b. The NameNode allocates the required blocks and provides the DataNodes for storage.
c. The client writes data to the designated DataNodes.
d. The NameNode ensures replication of blocks.
2. Read Operation:
a. The client requests a file from the NameNode.
b. The NameNode returns the DataNodes storing the file blocks.
c. The client reads data directly from the DataNodes.
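The write and read paths described above can be exercised programmatically through Hadoop's Java FileSystem API. The sketch below is illustrative only (the file path and the string written are made up) and assumes the cluster configuration files, such as core-site.xml, are on the classpath so that FileSystem.get() resolves to HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // The client talks to the NameNode through the FileSystem abstraction;
    // the actual bytes are streamed to and from DataNodes behind the scenes.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt");   // hypothetical path

    // Write operation: the NameNode allocates blocks, the client streams data to DataNodes.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello HDFS");
    }

    // Read operation: the NameNode returns block locations, the client reads from DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}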
Fault Tolerance in HDFS
• Heartbeat Monitoring:
o The NameNode monitors DataNodes via heartbeats.
o Lack of a heartbeat indicates a DataNode failure.
• Re-Replication: The NameNode re-replicates blocks stored on the failed DataNode to maintain the
replication factor.
• Rack Awareness: Replication is designed to optimize fault tolerance by storing data across different racks.

3. HDFS MAPREDUCE FRAMEWORK


MapReduce provides a programming framework for processing and analyzing large datasets in parallel across
distributed clusters of computers. The framework handles job scheduling, execution, and data movement
efficiently.

Key Features of MapReduce Framework

1. Automatic Parallelization and Distribution: Automates the division of computation tasks across several
processors, enabling efficient parallel processing.
2. Distributed Processing: Processes data stored on clusters of DataNodes distributed across racks.
3. Large-Scale Data Handling: Supports the processing of vast amounts of data simultaneously.
4. Scalability: Offers scalability by enabling the usage of numerous servers in the cluster.
5. Batch-Oriented Programming Model: Provides a batch-oriented model for processing in Hadoop
version 1.
6. Enhanced Processing in Hadoop 2 (YARN-Based): Additional modes of processing in YARN (Yet
Another Resource Negotiator) enable:
o Query processing
o Graph databases
o Real-time analytics
o Streaming data

These features align with the 3Vs of Big Data (Volume, Variety, Velocity).

Framework Architecture
The MapReduce framework handles two primary functions:
1. Job Distribution: Distributes the application tasks (jobs) to nodes within the cluster for parallel execution.
2. Data Aggregation: Collects and organizes intermediate results from nodes into a cohesive response.

Execution Process
1. JobTracker (Master Node)
a. A daemon (background process) responsible for:
i. Estimating resource requirements for tasks.
ii. Analyzing the states of slave nodes.
iii. Assigning map tasks to the appropriate DataNodes.
iv. Monitoring the progress of tasks.
v. Handling failures by rescheduling tasks.
2. TaskTracker (Slave Nodes)
a. Executes the Map and Reduce tasks assigned by the JobTracker.
b. Reports the status of tasks back to the JobTracker.

Job Execution Process


1. Mapper Phase
a. Deploys map tasks on DataNodes where the application data is stored.
b. Outputs intermediate results serialized using formats like AVRO.
2. Reducer Phase
a. Receives data from the Mapper phase.
b. Performs computations and consolidates results into the final output.
3. Result Collection: The final output is written to HDFS and the job status is returned to the client once all tasks are completed.
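To make the Mapper and Reducer phases concrete, below is the standard WordCount program from the Hadoop MapReduce tutorial (a minimal sketch; the input and output HDFS paths are supplied as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper phase: runs on the DataNodes holding the input splits and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer phase: receives all counts for a given word and consolidates them into a total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and submits it to the cluster for scheduling and execution.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The program is typically packaged as a JAR and submitted with hadoop jar (the paths below are placeholders), after which the JobTracker (or, under YARN, the ResourceManager and ApplicationMaster) schedules the map and reduce tasks described above:

$ hadoop jar wc.jar WordCount /input /output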
Types of Processes in MapReduce
• JobTracker
o Manages jobs submitted by clients.
o Assigns tasks to TaskTrackers and monitors progress.
• TaskTracker
o Executes assigned tasks (Map and Reduce phases).
o Periodically reports status back to the JobTracker.
Daemon
• Refers to a background program dedicated to managing system processes in Hadoop.
• Examples:
o JobTracker: Coordinates jobs and resources.
o TaskTracker: Executes tasks on nodes.

4. APACHE HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates data summarization, ad
hoc queries, and analysis of large datasets using a SQL-like language called HiveQL. It provides an interface for
querying and managing data stored in Hadoop's distributed file system (HDFS) or other storage systems like
HBase.

Key Features of Apache Hive:

1. SQL-Like Interface (HiveQL)


a. Enables users familiar with SQL to work with Hadoop data.
b. Simplifies querying large datasets using declarative language.
2. ETL Support: Offers tools for data extraction, transformation, and loading (ETL).
3. Data Structure and Format Handling: Imposes structure on various data formats to enable querying.
4. HDFS and HBase Integration: Allows access to data stored in HDFS or HBase seamlessly.
5. Query Execution: Queries are compiled into jobs executed by MapReduce or Tez (a DAG-based execution engine that generalizes the MapReduce model).

Using Hive

1. Start Hive
a. Enter the Hive command to begin a session and access the hive> prompt.
b. Command: $ hive
2. Create a Table
a. Use the CREATE TABLE command to define a table structure.
b. Command: hive> CREATE TABLE pokes (foo INT, bar STRING);
3. List Tables
a. Use the SHOW TABLES command to display all existing tables.
b. Command: hive> SHOW TABLES;
4. Drop a Table
a. Use the DROP TABLE command to delete a specific table.
b. Command: hive> DROP TABLE pokes;
5. Commands End with Semicolon: All Hive commands must conclude with a semicolon (;).
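Putting the commands above together, a short end-to-end HiveQL session might look as follows (kv1.txt is the sample data file shipped with the Hive distribution; treat the path as illustrative):

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> SELECT COUNT(*) FROM pokes;
hive> SELECT * FROM pokes WHERE foo = 98;
hive> DROP TABLE pokes;

Behind the scenes, the SELECT queries are typically translated into MapReduce (or Tez) jobs that run on the cluster, while metadata commands such as SHOW TABLES are answered directly from the Hive metastore.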
Applications of Hive
• Interactive SQL queries on large datasets (petabytes).
• Data analysis and summarization.
• Integration with Hadoop for big data solutions.
