Introduction to Hadoop

Introduction to Hadoop
A few statistics give an idea of the amount of data that gets
generated every day, every minute, and every second.
Every day: (a) NYSE (New York Stock Exchange) processes about 1.5
billion shares of trade data. (b) Facebook stores 2.7 billion
comments and Likes. (c) Google processes about 24
petabytes of data.
Every minute: (a) Facebook users share nearly 2.5 million
pieces of content. (b) Twitter users tweet nearly 300,000
times. (c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps. (f) Email
users send over 200 million messages. (g) Amazon
generates over $80,000 in online sales. (h) Google receives
over 4 million search queries.
Every second: (a) Banking applications process more than
10,000 credit card transactions.
Why Hadoop?
• The key consideration is its capability to handle
massive amounts of data, across different categories
of data, fairly quickly.
• Other considerations:
Why Hadoop?
1. Low cost: Hadoop is an open-source framework and
uses commodity hardware (relatively inexpensive and
easy-to-obtain hardware) to store enormous quantities
of data.
2. Computing power: Hadoop is based on a distributed
computing model that processes very large volumes
of data fairly quickly. The more computing nodes
there are, the more processing power is at hand.
3. Scalability: Scaling boils down to simply adding nodes
as the system grows, and this requires very little
administration.
Why Hadoop?
4. Storage flexibility: Unlike traditional relational
databases, in Hadoop data need not be pre-processed
before being stored. Hadoop provides the convenience of
storing as much data as one needs, along with the added
flexibility of deciding later how to use the stored
data. Hadoop can store unstructured data such as
images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and
executing applications against hardware failure. If a node
fails, Hadoop automatically redirects the jobs that had been
assigned to that node to other functional and available
nodes, ensuring that the distributed computation does not
fail. It goes a step further and stores multiple copies
(replicas) of the data on various nodes across the cluster.
Why Hadoop?
Hadoop framework
Why NOT RDBMS?
• RDBMS is not suitable for storing and
processing large files, images, and videos. RDBMS is
also not a good choice when it comes to
advanced analytics involving machine
learning.
• The figure describes the RDBMS system with
respect to cost and storage: it calls for huge
investment as the volume of data shows an
upward trend.
Why NOT RDBMS?
RDBMS versus Hadoop
Distributed Computing Challenges
Although there are several challenges with
distributed computing, we will focus on two
major challenges:
• Hardware Failure
• How to Process This Gigantic Store of Data?
Distributed Computing Challenges
Hardware Failure
• In a distributed system, several servers are networked
together.
• This implies that, more often than not, there is a
possibility of hardware failure. And when such a failure does
happen, how does one retrieve the data that was stored in
the system? To explain further: a regular hard disk may
fail once in 3 years, and when you have 1,000 such hard disks,
there is a possibility of at least a few being down every day.
• Hadoop's answer to this problem is the Replication Factor
(RF). The replication factor denotes the number of copies
of a given data item/data block stored across the network.
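As a rough sketch (the path /sample/test.txt is illustrative, and a Hadoop client configuration is assumed to be available on the classpath), the replication factor of a file can be inspected and changed through the Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system (HDFS)

        Path file = new Path("/sample/test.txt");   // illustrative path, assumed to exist
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication factor: " + status.getReplication());

        // Ask for three replicas of this file; the NameNode schedules the re-replication.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}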
Distributed Computing Challenges
Distributed Computing Challenges
How to Process This Gigantic Store of Data?
• In a distributed system, the data is spread
across the network on several machines. A key
challenge here is to integrate the data
available on several machines prior to
processing it.
• Hadoop solves this problem by using MapReduce
programming, a programming model for processing
data distributed across a cluster. A short
word-count sketch is given below.
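The sketch uses the standard Hadoop MapReduce Java API; the class names are illustrative, both classes are shown in one listing for brevity, and the driver that configures and submits the Job is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is split into words, emitting (word, 1) pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all counts for the same word arrive together and are summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}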
History of Hadoop
• Hadoop was created by Doug Cutting, the
creator of Apache Lucene (a widely used
text search library). Hadoop grew out of
Apache Nutch (an open-source web search
engine), which was itself a part of the Lucene
project, and was later developed further at Yahoo!.
History of Hadoop
History of Hadoop
The Name “Hadoop”
The name Hadoop is not an acronym; it’s a
made-up name.
The project creator, Doug Cutting, explains
how the name came about: “The name my kid
gave a stuffed yellow elephant. Short,
relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those
are my naming criteria.”
Hadoop Overview
• Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Hadoop Overview
Key Aspects of Hadoop
Hadoop Overview
Hadoop Components
Hadoop Overview
Hadoop Conceptual Layer
Hadoop Overview
Hadoop high-level architecture.
Hadoop Distributors
• The companies shown in Figure 5.12 provide
products that include Apache Hadoop,
commercial support, and/or tools and utilities
related to Hadoop.
HDFS(Hadoop Distributed File System)
1. Storage component of Hadoop.
2. Distributed file system.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS
leverages large block sizes and moves
computation to where the data is stored).
5. A file can be replicated a configurable
number of times, which makes HDFS tolerant
of both software and hardware failures.
HDFS(Hadoop Distributed File System)
6. Re-replicates data blocks automatically from
nodes that have failed.
7. The power of HDFS is realized when you
perform reads or writes on large files
(gigabytes and larger).
8. Sits on top of a native file system such as ext3
or ext4.
HDFS(Hadoop Distributed File System)
HDFS(Hadoop Distributed File System)
HDFS(Hadoop Distributed File System)
HDFS Daemons
1. Namenode
HDFS(Hadoop Distributed File System)
• HDFS breaks a large file into smaller pieces called blocks.
• The NameNode uses a rack ID to identify the DataNodes in a rack.
• A rack is a collection of DataNodes within the cluster.
• The NameNode keeps track of the blocks of a file as they are placed on various
DataNodes. The NameNode manages file-related operations such as read,
write, create, and delete. Its main job is managing the File System
Namespace.
• A file system namespace is the collection of files in the cluster. The NameNode
stores the HDFS namespace.
• The file system namespace includes the mapping of blocks to files and file
properties, and is stored in a file called FsImage. The NameNode uses an EditLog
(transaction log) to record every transaction that happens to the file
system metadata.
• When the NameNode starts up, it reads the FsImage and EditLog from disk and
applies all transactions from the EditLog to the in-memory representation of
the FsImage. It then flushes out a new version of the FsImage to disk and
truncates the old EditLog, because its changes are now reflected in the
FsImage. There is a single NameNode per cluster.
HDFS(Hadoop Distributed File System)
HDFS Daemons
1. Namenode
HDFS(Hadoop Distributed File System)
HDFS Daemons
2. DataNode
• There are multiple DataNodes per cluster.
• During pipelined reads and writes, DataNodes
communicate with each other.
• A DataNode also continuously sends a "heartbeat"
message to the NameNode to confirm the connectivity
between the NameNode and the DataNode.
• In case there is no heartbeat from a DataNode, the
NameNode re-replicates the blocks stored on that DataNode
to other DataNodes in the cluster and keeps on running as if
nothing had happened.
HDFS(Hadoop Distributed File System)
HDFS(Hadoop Distributed File System)
HDFS Daemons
3. Secondary NameNode
• The Secondary NameNode takes a snapshot of the HDFS
metadata at intervals specified in the Hadoop configuration.
• Since the memory requirements of the Secondary NameNode
are the same as those of the NameNode, it is better to run the
NameNode and the Secondary NameNode on different machines.
• In case of failure of the NameNode, the Secondary
NameNode can be configured manually to bring up the
cluster. However, the Secondary NameNode does not
record the real-time changes that happen to the HDFS
metadata.
Anatomy of File Read
Anatomy of File Read
• The steps involved in a file read are as follows:
1. The client opens the file that it wishes to read by calling
open() on the DistributedFileSystem.
2. The DistributedFileSystem communicates with the NameNode to get
the locations of the data blocks. The NameNode returns the
addresses of the DataNodes that the data blocks are stored on.
The DistributedFileSystem then returns an
FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream. The underlying
DFSInputStream, which has the addresses of the DataNodes for the
first few blocks of the file, connects to the closest DataNode for
the first block in the file.
4. The client calls read() repeatedly to stream the data from the
DataNode.
5. When the end of a block is reached, the DFSInputStream closes the
connection with the DataNode and repeats these steps to find the
best DataNode for the next block and for subsequent blocks.
6. When the client completes reading the file, it calls close()
on the FSDataInputStream to close the connection.
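A minimal client-side sketch of this read path, using the Java FileSystem API (the path /sample/test.txt is the illustrative file used elsewhere in these slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when fs.defaultFS points to HDFS

        // Steps 1-2: open() asks the NameNode for block locations and returns an FSDataInputStream.
        FSDataInputStream in = fs.open(new Path("/sample/test.txt"));
        try {
            // Steps 3-5: read() streams the data block by block from the closest DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close() releases the connections.
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}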
Anatomy of File Write
Anatomy of File Write
• The steps involved in a file write are as follows:
1. The client calls create() on the DistributedFileSystem to create a file.
2. An RPC call to the NameNode happens through the
DistributedFileSystem to create a new file. The NameNode
performs various checks before creating the new file (for example,
whether such a file already exists). Initially, the NameNode creates
the file without associating any data blocks with it. The
DistributedFileSystem returns an FSDataOutputStream to the
client to perform the write.
3. As the client writes data, the data is split into packets by the
DFSOutputStream, which writes them to an internal queue,
called the data queue. The DataStreamer consumes the data queue. The
DataStreamer requests the NameNode to allocate new blocks by
selecting a list of suitable DataNodes to store the replicas. This list of
DataNodes makes up a pipeline. Here, we will go with the default
replication factor of three, so there will be three nodes in the
pipeline for the first block.
Anatomy of File Write
4. The DataStreamer streams the packets to the first DataNode in
the pipeline, which stores each packet and forwards it to the second
DataNode in the pipeline. In the same way, the second
DataNode stores the packet and forwards it to the third
DataNode in the pipeline.
5. In addition to the data queue, the DFSOutputStream also
manages an "ack queue" of packets that are waiting to be
acknowledged by the DataNodes. A packet is removed
from the ack queue only once it has been acknowledged by all the
DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on
the stream.
7. This flushes all the remaining packets to the DataNode
pipeline and waits for the relevant acknowledgements before
signalling to the NameNode that the creation of the file is
complete.
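A minimal client-side sketch of this write path (the output path /sample/output.txt is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the NameNode to create the file entry and returns an FSDataOutputStream.
        FSDataOutputStream out = fs.create(new Path("/sample/output.txt"));
        try {
            // Steps 3-5: written bytes are packaged into packets and pushed down the DataNode pipeline.
            out.writeBytes("hello hadoop\n");
        } finally {
            // Steps 6-7: close() flushes the remaining packets and completes the file with the NameNode.
            out.close();
        }
        fs.close();
    }
}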
Replica Placement Strategy
• As per the Hadoop replica placement strategy, the
first replica is placed on the same node as the
client.
• The second replica is placed on a node on a
different rack.
• The third replica is placed on the same rack as the
second, but on a different node in that rack. Once the
replica locations have been decided, a pipeline is built.
• This strategy provides good reliability.
Replica Placement Strategy
Working with HDFS Commands
• To get the list of directories and files at the
root of HDFS.
hadoop fs -ls /
• To get the complete (recursive) list of directories
and files of HDFS.
hadoop fs -ls -R /
• To create a directory (say, sample) in HDFS.
hadoop fs -mkdir /sample
Working with HDFS Commands
• To copy a file from the local file system to HDFS.
hadoop fs -put /root/sample/test.txt /sample/test.txt
• To copy a file from HDFS to the local file system.
hadoop fs -get /sample/test.txt /root/sample/testsample.txt
• To copy a file from the local file system to HDFS via the
copyFromLocal command.
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt
Working with HDFS Commands
• To copy a file from the Hadoop file system to the local
file system via the copyToLocal command.
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
• To display the contents of an HDFS file on the console.
hadoop fs -cat /sample/test.txt
Working with HDFS Commands
• To copy a file from one directory to another on
HDFS.
hadoop fs -cp /sample/test.txt /sample1
• To remove a directory from HDFS (the -r flag deletes
it recursively).
hadoop fs -rm -r /sample1
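For completeness, the same operations can also be performed programmatically; a minimal sketch using the Java FileSystem API with the same illustrative paths as the commands above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/sample"));                                  // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));              // hadoop fs -put ...
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt"));     // hadoop fs -get ...
        fs.delete(new Path("/sample1"), true);                           // hadoop fs -rm -r /sample1

        fs.close();
    }
}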
Managing Resources and Applications
with Hadoop YARN
Managing Resources and Applications
with Hadoop YARN
Limitations of Hadoop 1.0 architecture
Managing Resources and Applications
with Hadoop YARN
Interacting with Hadoop Ecosystem
