Introduction to Hadoop

Introduction to Hadoop
A few statistics give an idea of the amount of data that gets
generated every day, every minute, and every second.
Every day: (a) NYSE (New York Stock Exchange) processes about 1.5
billion shares of trade data. (b) Facebook stores 2.7 billion
comments and Likes. (c) Google processes about 24
petabytes of data.
Every minute: (a) Facebook users share nearly 2.5 million
pieces of content. (b) Twitter users tweet nearly 300,000
times. (c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps. (f) Email
users send over 200 million messages. (g) Amazon
generates over $80,000 in online sales. (h) Google receives
over 4 million search queries.
Every second: (a) Banking applications process more than
10,000 credit card transactions.
Why Hadoop?
• The key consideration is its capability to handle
massive amounts of data, across different categories
of data, fairly quickly.
• Other considerations:
Why Hadoop?
1. Low cost: Hadoop is an open-source framework and
uses commodity hardware (relatively inexpensive and
easy-to-obtain hardware) to store enormous quantities
of data.
2. Computing power: Hadoop is based on a distributed
computing model that processes very large volumes
of data fairly quickly. The more computing nodes
there are, the more processing power is at hand.
3. Scalability: Scaling boils down to simply adding nodes
as the system grows, and this requires very little
administration.
Why Hadoop?
4. Storage flexibility: Unlike traditional relational
databases, in Hadoop data need not be pre-processed
before being stored. Hadoop provides the convenience of
storing as much data as one needs, along with the added
flexibility of deciding later how to use the stored
data. Hadoop can store unstructured data such as
images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and
executing applications against hardware failure. If a node
fails, Hadoop automatically redirects the jobs that had been
assigned to that node to other functional and available
nodes, ensuring that the distributed computation does not
fail. It goes a step further and stores multiple copies
(replicas) of the data on various nodes across the cluster.
Why Hadoop?
Hadoop framework
Why NOT RDBMS?
• RDBMS is not suitable for storing and
processing large files, images, and videos. RDBMS is
also not a good choice when it comes to
advanced analytics involving machine
learning.
• The figure describes the RDBMS system with
respect to cost and storage: it calls for huge
investment as the volume of data shows an
upward trend.
Why NOT RDBMS?
RDBMS versus Hadoop
Distributed Computing Challenges
Although there are several challenges with
distributed computing, we will focus on two
major challenges:
• Hardware Failure
• How to Process This Gigantic Store of Data?
Distributed Computing Challenges
Hardware Failure
• In a distributed system, several servers are networked
together.
• This implies that, more often than not, there is a
possibility of hardware failure. And when such a failure does
happen, how does one retrieve the data that was stored in
the system? To explain further: a regular hard disk may
fail once in 3 years, and when you have 1,000 such hard disks,
there is a possibility of at least a few being down every day.
• Hadoop's answer to this problem is the Replication Factor
(RF). The replication factor denotes the number of copies
of a given data item/data block stored across the network.
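As a rough sketch (the path /sample/test.txt is illustrative, and a Hadoop client configuration is assumed to be available on the classpath), the replication factor of a file can be inspected and changed through the Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system (HDFS)

        Path file = new Path("/sample/test.txt");   // illustrative path, assumed to exist
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication factor: " + status.getReplication());

        // Ask for three replicas of this file; the NameNode schedules the re-replication.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}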
Distributed Computing Challenges
Distributed Computing Challenges
How to Process This Gigantic Store of Data?
• In a distributed system, the data is spread
across the network on several machines. A key
challenge here is to integrate the data
available on several machines prior to
processing it.
• Hadoop solves this problem by using MapReduce
programming, a programming model for processing
data distributed across a cluster. A short
word-count sketch is given below.
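The sketch uses the standard Hadoop MapReduce Java API; the class names are illustrative, both classes are shown in one listing for brevity, and the driver that configures and submits the Job is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is split into words, emitting (word, 1) pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all counts for the same word arrive together and are summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}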
History of Hadoop
• Hadoop was created by Doug Cutting, the
creator of Apache Lucene (a widely used
text search library). Hadoop grew out of
Apache Nutch (an open-source web search
engine), which was itself a part of the Lucene
project, and was later developed further at Yahoo!.
History of Hadoop
History of Hadoop
The Name “Hadoop”
The name Hadoop is not an acronym; it’s a
made-up name.
The project creator, Doug Cutting, explains
how the name came about: “The name my kid
gave a stuffed yellow elephant. Short,
relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those
are my naming criteria.”
Hadoop Overview
• Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Hadoop Overview
Key Aspects of Hadoop
Hadoop Overview
Hadoop Components
Hadoop Overview
Hadoop Conceptual Layer
Hadoop Overview
Hadoop high-level architecture.
Hadoop Distributors
• The companies shown in Figure 5.12 provide
products that include Apache Hadoop,
commercial support, and/or tools and utilities
related to Hadoop.
HDFS(Hadoop Distributed File System)
1. Storage component of Hadoop.
2. Distributed file system.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS
leverages large block sizes and moves
computation to where the data is stored).
5. A file can be replicated a configurable
number of times, which makes HDFS tolerant
of both software and hardware failures.
HDFS(Hadoop Distributed File System)
6. Re-replicates data blocks automatically from
nodes that have failed.
7. The power of HDFS is realized when you
perform reads or writes on large files
(gigabytes and larger).
8. Sits on top of a native file system such as ext3
or ext4.
HDFS(Hadoop Distributed File System)
HDFS(Hadoop Distributed File System)
HDFS(Hadoop Distributed File System)
HDFS Daemons
1. Namenode
HDFS(Hadoop Distributed File System)
• HDFS breaks a large file into smaller pieces called blocks.
• The NameNode uses a rack ID to identify the DataNodes in a rack.
• A rack is a collection of DataNodes within the cluster.
• The NameNode keeps track of the blocks of a file as they are placed on various
DataNodes. The NameNode manages file-related operations such as read,
write, create, and delete. Its main job is managing the File System
Namespace.
• A file system namespace is the collection of files in the cluster. The NameNode
stores the HDFS namespace.
• The file system namespace includes the mapping of blocks to files and file
properties, and is stored in a file called FsImage. The NameNode uses an EditLog
(transaction log) to record every transaction that happens to the file
system metadata.
• When the NameNode starts up, it reads the FsImage and EditLog from disk and
applies all transactions from the EditLog to the in-memory representation of
the FsImage. It then flushes out a new version of the FsImage to disk and
truncates the old EditLog, because its changes are now reflected in the
FsImage. There is a single NameNode per cluster.
HDFS(Hadoop Distributed File System)
HDFS Daemons
1. Namenode
HDFS(Hadoop Distributed File System)
HDFS Daemons
2. DataNode
• There are multiple DataNodes per cluster.
• During pipelined reads and writes, DataNodes
communicate with each other.
• A DataNode also continuously sends a "heartbeat"
message to the NameNode to confirm the connectivity
between the NameNode and the DataNode.
• In case there is no heartbeat from a DataNode, the
NameNode re-replicates the blocks stored on that DataNode
to other DataNodes in the cluster and keeps on running as if
nothing had happened.
HDFS(Hadoop Distributed File System)
HDFS(Hadoop Distributed File System)
HDFS Daemons
3. Secondary NameNode
• The Secondary NameNode takes a snapshot of the HDFS
metadata at intervals specified in the Hadoop configuration.
• Since the memory requirements of the Secondary NameNode
are the same as those of the NameNode, it is better to run the
NameNode and the Secondary NameNode on different machines.
• In case of failure of the NameNode, the Secondary
NameNode can be configured manually to bring up the
cluster. However, the Secondary NameNode does not
record the real-time changes that happen to the HDFS
metadata.
Anatomy of File Read
Anatomy of File Read
• The steps involved in a file read are as follows:
1. The client opens the file that it wishes to read by calling
open() on the DistributedFileSystem.
2. The DistributedFileSystem communicates with the NameNode to get
the locations of the data blocks. The NameNode returns the
addresses of the DataNodes that the data blocks are stored on.
The DistributedFileSystem then returns an
FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream. The underlying
DFSInputStream, which has the addresses of the DataNodes for the
first few blocks of the file, connects to the closest DataNode for
the first block in the file.
4. The client calls read() repeatedly to stream the data from the
DataNode.
5. When the end of a block is reached, the DFSInputStream closes the
connection with the DataNode and repeats these steps to find the
best DataNode for the next block and for subsequent blocks.
6. When the client completes reading the file, it calls close()
on the FSDataInputStream to close the connection.
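A minimal client-side sketch of this read path, using the Java FileSystem API (the path /sample/test.txt is the illustrative file used elsewhere in these slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when fs.defaultFS points to HDFS

        // Steps 1-2: open() asks the NameNode for block locations and returns an FSDataInputStream.
        FSDataInputStream in = fs.open(new Path("/sample/test.txt"));
        try {
            // Steps 3-5: read() streams the data block by block from the closest DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close() releases the connections.
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}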
Anatomy of File Write
Anatomy of File Write
• The steps involved in a file write are as follows:
1. The client calls create() on the DistributedFileSystem to create a file.
2. An RPC call to the NameNode happens through the
DistributedFileSystem to create a new file. The NameNode
performs various checks before creating the new file (for example,
whether such a file already exists). Initially, the NameNode creates
the file without associating any data blocks with it. The
DistributedFileSystem returns an FSDataOutputStream to the
client to perform the write.
3. As the client writes data, the data is split into packets by the
DFSOutputStream, which writes them to an internal queue,
called the data queue. The DataStreamer consumes the data queue. The
DataStreamer requests the NameNode to allocate new blocks by
selecting a list of suitable DataNodes to store the replicas. This list of
DataNodes makes up a pipeline. Here, we will go with the default
replication factor of three, so there will be three nodes in the
pipeline for the first block.
Anatomy of File Write
4. The DataStreamer streams the packets to the first DataNode in
the pipeline, which stores each packet and forwards it to the second
DataNode in the pipeline. In the same way, the second
DataNode stores the packet and forwards it to the third
DataNode in the pipeline.
5. In addition to the data queue, the DFSOutputStream also
manages an "ack queue" of packets that are waiting to be
acknowledged by the DataNodes. A packet is removed
from the ack queue only once it has been acknowledged by all the
DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on
the stream.
7. This flushes all the remaining packets to the DataNode
pipeline and waits for the relevant acknowledgements before
signalling to the NameNode that the creation of the file is
complete.
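A minimal client-side sketch of this write path (the output path /sample/output.txt is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the NameNode to create the file entry and returns an FSDataOutputStream.
        FSDataOutputStream out = fs.create(new Path("/sample/output.txt"));
        try {
            // Steps 3-5: written bytes are packaged into packets and pushed down the DataNode pipeline.
            out.writeBytes("hello hadoop\n");
        } finally {
            // Steps 6-7: close() flushes the remaining packets and completes the file with the NameNode.
            out.close();
        }
        fs.close();
    }
}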
Replica Placement Strategy
• As per the Hadoop replica placement strategy, the
first replica is placed on the same node as the
client.
• The second replica is placed on a node on a
different rack.
• The third replica is placed on the same rack as the
second, but on a different node in that rack. Once the
replica locations have been decided, a pipeline is built.
• This strategy provides good reliability.
Replica Placement Strategy
Working with HDFS Commands
• To get the list of directories and files at the
root of HDFS.
hadoop fs -ls /
• To get the complete (recursive) list of directories
and files of HDFS.
hadoop fs -ls -R /
• To create a directory (say, sample) in HDFS.
hadoop fs -mkdir /sample
Working with HDFS Commands
• To copy a file from the local file system to HDFS.
hadoop fs -put /root/sample/test.txt /sample/test.txt
• To copy a file from HDFS to the local file system.
hadoop fs -get /sample/test.txt /root/sample/testsample.txt
• To copy a file from the local file system to HDFS via the
copyFromLocal command.
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt
Working with HDFS Commands
• To copy a file from the Hadoop file system to the local
file system via the copyToLocal command.
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
• To display the contents of an HDFS file on the console.
hadoop fs -cat /sample/test.txt
Working with HDFS Commands
• To copy a file from one directory to another on
HDFS.
hadoop fs -cp /sample/test.txt /sample1
• To remove a directory from HDFS (the -r flag deletes
it recursively).
hadoop fs -rm -r /sample1
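For completeness, the same operations can also be performed programmatically; a minimal sketch using the Java FileSystem API with the same illustrative paths as the commands above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/sample"));                                  // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));              // hadoop fs -put ...
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt"));     // hadoop fs -get ...
        fs.delete(new Path("/sample1"), true);                           // hadoop fs -rm -r /sample1

        fs.close();
    }
}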
Managing Resources and Applications
with Hadoop YARN
Managing Resources and Applications
with Hadoop YARN
Limitations of Hadoop 1.0 architecture
Managing Resources and Applications
with Hadoop YARN
Interacting with Hadoop Ecosystem
