Introduction To Hadoop
A few statistics give an idea of the amount of data that is
generated every day, every minute, and every second.
Every day: (a) The NYSE (New York Stock Exchange) generates
trade data on about 1.5 billion shares. (b) Facebook stores
2.7 billion comments and Likes. (c) Google processes about 24
petabytes of data.
Every minute: (a) Facebook users share nearly 2.5 million
pieces of content. (b) Twitter users tweet nearly 300,000
times. (c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps. (f) Email
users send over 200 million messages. (g) Amazon
generates over $80,000 in online sales. (h) Google receives
over 4 million search queries.
Every second: (a) Banking applications process more than
10,000 credit card transactions.
Why Hadoop?
• The key consideration is its capability to
handle massive amounts of data, in different
categories, fairly quickly.
• Other considerations:
1. Low cost: Hadoop is an open-source framework and
uses commodity hardware (relatively inexpensive,
easy-to-obtain hardware) to store enormous
quantities of data.
2. Computing power: Hadoop is based on a distributed
computing model that processes very large
volumes of data fairly quickly. The more computing
nodes there are, the more processing power is
at hand.
3. Scalability: Scaling boils down to simply adding nodes
as the system grows, and it requires little
administration.
4. Storage flexibility: Unlike traditional relational
databases, Hadoop does not require data to be pre-processed
before it is stored. Hadoop provides the convenience of
storing as much data as one needs, along with the
flexibility of deciding later how to use the stored
data. In Hadoop, one can store unstructured data such as
images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and
executing applications against hardware failure. If a node
fails, Hadoop automatically redirects the jobs that had been
assigned to that node to other functional, available
nodes, so that the distributed computation does not
fail. It goes a step further and stores multiple copies
(replicas) of the data on various nodes across the cluster.
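The data-protection idea in item 5 can be illustrated with a small simulation. This is only a toy sketch, not Hadoop code: the function names, node names, and block names are all hypothetical, and the replicas are placed at random, whereas a real cluster uses its own placement policy. The replication factor of 3 is assumed here for illustration.

```python
import random


def place_replicas(blocks, nodes, replication=3, seed=0):
    """Store each block on `replication` distinct nodes chosen at random."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in blocks}


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {b for b, holders in placement.items() if holders - failed_nodes}


def reassign_tasks(assignment, failed_nodes, nodes):
    """Move tasks from failed nodes to the remaining live nodes, round-robin."""
    live = [n for n in nodes if n not in failed_nodes]
    if not live:
        raise RuntimeError("no live nodes left")
    result = {}
    moved = 0
    for task, node in assignment.items():
        if node in failed_nodes:
            result[task] = live[moved % len(live)]
            moved += 1
        else:
            result[task] = node
    return result


# Hypothetical five-node cluster holding four data blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_a", "blk_b", "blk_c", "blk_d"]
placement = place_replicas(blocks, nodes)

# Simulate the failure of a single node.
failed = {"node2"}

# With 3 replicas per block, losing one node cannot make any block unreadable.
still_readable = readable_blocks(placement, failed)

# Tasks that were running on the failed node are redirected to live nodes.
tasks = {"task1": "node1", "task2": "node2", "task3": "node2"}
new_tasks = reassign_tasks(tasks, failed, nodes)
```

In this sketch every block stays readable after the failure because each one has at least two replicas on other nodes, and the two tasks from the failed node end up on live nodes while the unaffected task stays where it was.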
[Figure: Hadoop framework]
Why NOT RDBMS?
• An RDBMS is not suitable for storing and
processing large files, images, and videos.
It is also not a good choice for
advanced analytics involving machine
learning.
• The figure describes an RDBMS with
respect to cost and storage: it calls for huge
investment as the volume of data shows an
upward trend.
RDBMS versus Hadoop
Distributed Computing Challenges
Although there are several challenges with
distributed computing, we will focus on two
major challenges.
Hardware Failure