BDA Module-2
1. Hadoop Common
a. Purpose: Provides essential libraries and utilities for other modules in Hadoop.
b. Functions:
i. Contains the common libraries and utilities required by the other Hadoop modules.
ii. Manages general input/output, serialization, and Java RPC (Remote Procedure Call).
iii. Facilitates file-based data structures.
2. Hadoop Distributed File System (HDFS)
a. Purpose: A Java-based distributed file system for storing large datasets.
b. Functions:
i. Stores data across multiple nodes in a cluster.
ii. Handles structured, unstructured, and semi-structured data.
iii. Provides high fault tolerance and scalability by replicating data blocks (a short shell example follows this list).
3. MapReduce v1
a. Purpose: A programming model for processing large data sets in parallel.
b. Functions:
i. Divides tasks into "Map" and "Reduce" steps for parallel execution.
ii. Processes data in batch mode, suitable for large-scale data computations.
4. YARN (Yet Another Resource Negotiator)
a. Purpose: Manages resources for distributed computing.
b. Functions:
i. Allocates resources for application tasks or sub-tasks running on Hadoop.
ii. Schedules tasks in parallel and handles resource requests.
iii. Ensures distributed and efficient task execution.
5. MapReduce v2 (Hadoop 2 YARN)
a. Purpose: An updated version of MapReduce that runs as an application on YARN and handles larger datasets.
b. Functions:
i. Enables parallel processing of large datasets.
ii. Improves scalability and resource management compared to MapReduce v1.
iii. Optimized for distributed processing of application tasks.
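As a quick illustration of how these modules are used in practice, here is a minimal HDFS shell session with the standard hdfs dfs commands (the directory and file names are hypothetical):

    $ hdfs dfs -mkdir -p /user/demo           # create a directory in HDFS
    $ hdfs dfs -put report.csv /user/demo/    # copy a local file into HDFS
    $ hdfs dfs -ls /user/demo                 # list the directory contents
    $ hdfs dfs -cat /user/demo/report.csv     # print the file's contents

Each file copied into HDFS is split into blocks and replicated across DataNodes according to the configured replication factor.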
Hadoop Ecosystem Components
Refers to a combination of technologies and tools that work together with Hadoop.
Supports storage, processing, access, analysis, governance, security, and operations for Big Data.
HDFS Architecture
HDFS operates using a Master/Slave architecture consisting of the following key components:
1. NameNode (Master)
• Role:
o Manages the file system namespace and regulates client access to files.
o Stores and manages metadata, such as file structure, file locations, permissions, and the mapping of file blocks to DataNodes.
• Key Responsibilities:
o Handles file system namespace operations: Opening, closing, renaming files, and directories.
o Determines the mapping of blocks to DataNodes.
o Monitors the health of DataNodes using heartbeat signals.
o Handles block creation, deletion, and replication.
o Ensures high availability of data by replicating blocks based on the replication factor.
• Data Flow:
o Does not store actual data but provides clients with metadata for interacting with DataNodes.
o Keeps all metadata in memory for fast access.
2. DataNodes (Slaves)
• Role:
o Store the actual data blocks of files.
o Serve read and write requests directly from/to clients.
• Key Responsibilities:
o Report block information to the NameNode through periodic heartbeats and block reports.
o Replicate data blocks as instructed by the NameNode.
o Handle actual data transfer between the client and storage.
• Data Replication:
o To ensure data reliability, the same block is replicated across multiple DataNodes.
o If multiple racks exist, the NameNode tries to replicate blocks across different racks to enhance fault tolerance (see the configuration sketch after this component list).
3. Client
• Role: Acts as the user-facing component that interacts with HDFS.
• Interaction Process:
o Requests file creation or retrieval from the NameNode.
o Receives metadata (e.g., block locations) from the NameNode.
o Interacts directly with DataNodes for data read/write operations.
4. Secondary NameNode (Checkpoint Node)
• Role:
o Assists the primary NameNode in managing metadata.
o Periodically merges the namespace image (fsimage) and the edit log to create a new namespace image checkpoint.
• Key Characteristics:
o Not a backup or failover node.
o Does not replace the NameNode in case of failure.
o Keeps the edit log from growing unbounded, which shortens NameNode restart time and improves the manageability of metadata.
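The replication factor referred to above is typically set cluster-wide in hdfs-site.xml. A minimal configuration sketch (the value 3 is the common default and is shown here only for illustration):

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- each block is stored on three DataNodes -->
      </property>
    </configuration>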
Data Flow in HDFS
1. Write Operation:
a. The client requests the NameNode to create a file.
b. The NameNode allocates the required blocks and provides the DataNodes for storage.
c. The client writes data to the designated DataNodes.
d. The NameNode ensures replication of blocks.
2. Read Operation:
a. The client requests a file from the NameNode.
b. The NameNode returns the DataNodes storing the file blocks.
c. The client reads data directly from the DataNodes.
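The same write/read flow can be driven from the Hadoop Java client API. The sketch below assumes a configured cluster and a hypothetical path /user/demo/example.txt; FileSystem.get() obtains metadata through the NameNode, while the data itself streams to and from DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);           // client entry point; talks to the NameNode

            Path file = new Path("/user/demo/example.txt"); // hypothetical path

            // Write: the NameNode allocates blocks; the client streams bytes to DataNodes
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello HDFS");
            }

            // Read: the NameNode returns block locations; the client reads from DataNodes
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }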
Fault Tolerance in HDFS
• Heartbeat Monitoring:
o The NameNode monitors DataNodes via heartbeats.
o A prolonged absence of heartbeats marks a DataNode as failed (the heartbeat and timeout intervals are configurable; see the sketch after these bullets).
• Re-Replication: The NameNode re-replicates the blocks that were stored on the failed DataNode to maintain the replication factor.
• Rack Awareness: Replication is designed to optimize fault tolerance by storing data across different racks.
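The heartbeat behaviour is tunable in hdfs-site.xml. A sketch using the usual defaults (with these values a DataNode is declared dead only after roughly ten minutes without heartbeats):

    <configuration>
      <property>
        <name>dfs.heartbeat.interval</name>
        <value>3</value> <!-- DataNodes send a heartbeat every 3 seconds -->
      </property>
      <property>
        <name>dfs.namenode.heartbeat.recheck-interval</name>
        <value>300000</value> <!-- NameNode recheck window, in milliseconds -->
      </property>
    </configuration>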
Key Features of MapReduce
1. Automatic Parallelization and Distribution: Automates the division of computation tasks across several processors, enabling efficient parallel processing.
2. Distributed Processing: Processes data stored on clusters of DataNodes distributed across racks.
3. Large-Scale Data Handling: Supports the processing of vast amounts of data simultaneously.
4. Scalability: Offers scalability by enabling the usage of numerous servers in the cluster.
5. Batch-Oriented Programming Model: Provides a batch-oriented model for processing in Hadoop version 1.
6. Enhanced Processing in Hadoop 2 (YARN-Based): Running on YARN (Yet Another Resource Negotiator), Hadoop 2 supports additional processing modes beyond batch:
o Query processing
o Graph databases
o Real-time analytics
o Streaming data
These features align with the 3Vs of Big Data (Volume, Variety, Velocity).
Framework Architecture
The MapReduce framework handles two primary functions:
1. Job Distribution: Distributes the application tasks (jobs) to nodes within the cluster for parallel execution.
2. Data Aggregation: Collects and organizes intermediate results from nodes into a cohesive response.
Execution Process
1. JobTracker (Master Node)
a. A daemon (background process) responsible for:
i. Estimating resource requirements for tasks.
ii. Analyzing the states of slave nodes.
iii. Assigning map tasks to TaskTrackers on the DataNodes that hold the input data (data locality).
iv. Monitoring the progress of tasks.
v. Handling failures by rescheduling tasks.
2. TaskTracker (Slave Nodes)
a. Executes the Map and Reduce tasks assigned by the JobTracker.
b. Reports the status of tasks back to the JobTracker.
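To make the Map and Reduce steps concrete, here is the classic WordCount example sketched in the standard Hadoop Java MapReduce API (input and output paths come from the command line; this is a minimal illustration, not production code):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts collected for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The Map phase corresponds to the job-distribution function described above, and the Reduce phase to data aggregation; the combiner is an optional local optimization applied before the shuffle.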
4. APACHE HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates data summarization, ad
hoc queries, and analysis of large datasets using a SQL-like language called HiveQL. It provides an interface for
querying and managing data stored in Hadoop's distributed file system (HDFS) or other storage systems like
HBase.
Using Hive
1. Start Hive
a. Enter the Hive command to begin a session and access the hive> prompt.
b. Command: $ hive
2. Create a Table
a. Use the CREATE TABLE command to define a table structure.
b. Command: hive> CREATE TABLE pokes (foo INT, bar STRING);
3. List Tables
a. Use the SHOW TABLES command to display all existing tables.
b. Command: hive> SHOW TABLES;
4. Drop a Table
a. Use the DROP TABLE command to delete a specific table.
b. Command: hive> DROP TABLE pokes;
5. Commands End with Semicolon: All Hive commands must conclude with a semicolon (;).
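Beyond table management, a typical session loads data and then queries it. A short HiveQL sketch using the pokes table created above (the input file path and the filter value are illustrative):

    hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
    hive> SELECT foo, bar FROM pokes WHERE foo > 100 LIMIT 10;
    hive> SELECT bar, COUNT(*) FROM pokes GROUP BY bar;

Hive compiles such statements into jobs (classically MapReduce) that execute on the cluster, so queries over large tables run in batch rather than returning instantly.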
Applications of Hive
• Interactive SQL queries on large datasets (petabytes).
• Data analysis and summarization.
• Integration with Hadoop for big data solutions.