INTRODUCTION TO BIG DATA HADOOP
GOAL
To learn about Big Data and Hadoop programming
AGENDA
Big Data
Hadoop Introduction
History
Comparison to Relational Databases
Hadoop Eco-System and Distributions
Resources
BIG DATA
International Data Corporation (IDC)
publishes estimates of the data created
each year – its 2010 estimate was already
measured in zettabytes
Companies continue to generate large
amounts of data; some 2011 stats:
– Facebook ~ 6 billion messages per day
– eBay ~ 2 billion page views a day, ~ 9 Petabytes of
storage
– Satellite Images by Skybox Imaging ~ 1 Terabyte per day
HADOOP
Existing tools were not designed to handle
such large amounts of data
"The Apache™ Hadoop™ project develops
open-source software for reliable, scalable,
distributed computing." -
http://hadoop.apache.org
– Process Big Data on clusters of commodity hardware
– Vibrant open-source community
– Many products and tools reside on top of Hadoop
HADOOP JOBS
USERS OF HADOOP
Amazon
eBay
Facebook
Twitter
Linkedin
Wayn
IBM
Yahoo
DATA STORAGE
Storage capacity has grown exponentially
but read speed has not kept up
– 1990:
• Store 1,400 MB
• Transfer speed of 4.5MB/s
• Read the entire drive in ~ 5 minutes
– 2010:
• Store 1 TB
• Transfer speed of 100MB/s
• Read the entire drive in ~ 3 hours
• Hadoop - 100 drives working at the same
time can read 1TB of data in 2 minutes
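The figures above can be sanity-checked with simple arithmetic (assuming 1 TB ≈ 1,000,000 MB):

```python
# Back-of-the-envelope check of the drive-scan numbers on this slide.

# 1990: a 1,400 MB drive read at 4.5 MB/s
minutes_1990 = 1400 / 4.5 / 60                     # ~5.2 minutes

# 2010: a 1 TB drive read at 100 MB/s
hours_2010 = 1_000_000 / 100 / 3600                # ~2.8 hours

# Hadoop's trick: 100 drives each read 1/100th of the terabyte in parallel
minutes_parallel = 1_000_000 / (100 * 100) / 60    # ~1.7 minutes

print(round(minutes_1990, 1), round(hours_2010, 1), round(minutes_parallel, 1))
# → 5.2 2.8 1.7
```

Parallelizing the reads, not making individual drives faster, is what closes the gap between capacity and transfer speed.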
HADOOP CLUSTER
A set of "cheap" commodity hardware
• Networked together
• Resides in the same location
– Set of servers in a set of racks in a data center
USE COMMODITY HARDWARE
“Cheap” Commodity Server Hardware
– No need for super-computers; use commodity hardware,
accepting that it is less reliable
– Not desktops
HADOOP SYSTEM PRINCIPLES
Scale-Out rather than Scale-Up
Bring code to data rather than data to code
Deal with failures – they are common
Abstract complexity of distributed and
concurrent applications
SCALE OUT INSTEAD OF SCALE UP
It is harder and more expensive to scale-up
– Add additional resources to an existing node (CPU, RAM)
– Moore’s Law can’t keep up with data growth
– New units must be purchased if the required resources
cannot be added
– Also known as scale vertically
• Scale-Out
– Add more nodes/machines to an existing distributed
application
– The software layer is designed for node addition or removal
– Hadoop takes this approach - A set of nodes are bonded
together as a single distributed system
– Very easy to scale down as well
CODE TO DATA
• Traditional data processing architecture
– Nodes are split into separate processing and storage
nodes connected by a high-capacity link
– Many data-intensive applications are not CPU-bound,
so the network link becomes the bottleneck
[Diagram: data is loaded from storage nodes across the
network to processing nodes, and results are saved back –
the network link is the bottleneck risk]
CODE TO DATA
Hadoop co-locates processors and storage
– Code is moved to data (size is tiny, usually in KBs)
– Processors execute code and access underlying local
storage
[Diagram: four Hadoop nodes, each pairing a processor
with its own local storage]
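The code-to-data idea can be sketched in a few lines of plain Python. This is a conceptual illustration only (not the Hadoop API): each list stands in for a node's local partition of the data, and the small function is the "code" that travels to it.

```python
# Conceptual sketch of code-to-data: the data stays put in per-node
# partitions; only a tiny function and tiny partial results move.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # each list = one node's local data

def local_task(partition):
    # The code we ship is a few KBs; the data it runs over may be GBs.
    return sum(partition)

# "Run" the task on every node against its local storage
partial_results = [local_task(p) for p in partitions]

# Only the small per-partition results cross the network to be combined
total = sum(partial_results)

print(partial_results, total)  # → [6, 9, 30] 45
```

Compare this with the traditional architecture, where the full partitions themselves would have to be loaded across the link before any processing could start.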
FAILURES ARE COMMON
Given a large number of machines, failures are
common
– Large warehouses may see machine failures weekly or
even daily
• Hadoop is designed to cope with node
failures
– Data is replicated
– Tasks are retried
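The "tasks are retried" principle amounts to a bounded retry loop around each unit of work. The sketch below is an illustration of the idea, not Hadoop's actual scheduler; the function names and retry limit are made up for the example.

```python
# Illustrative sketch: re-attempt a task on failure, up to a limit.
def run_with_retries(task, max_attempts=4):
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            # In a real cluster the retry would go to a different node,
            # which still holds a replica of the input data.
            if attempt == max_attempts:
                raise

calls = []

def flaky_task(attempt):
    calls.append(attempt)
    if attempt < 3:                  # simulate two lost nodes
        raise RuntimeError("node lost")
    return "done"

print(run_with_retries(flaky_task))  # → done (succeeds on the 3rd attempt)
```

Data replication is what makes the retry safe: because the input blocks exist on several nodes, a failed task can simply be rescheduled where another copy lives.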
ABSTRACT COMPLEXITIES
Hadoop abstracts many complexities in
distributed and concurrent applications
– Defines a small number of components
– Provides simple, well-defined interfaces for interaction
between these components
• Frees the developer from worrying about system-level
challenges
– race conditions, data starvation
– processing pipelines, data partitioning, code distribution
– etc.
• Allows developers to focus on application
development and business logic
HISTORY OF HADOOP
Started as a sub-project of Apache Nutch
– Nutch’s job is to index the web and expose it for searching
– An open-source alternative to Google search
– Started by Doug Cutting
• In 2004 Google published its Google File System
(GFS) and MapReduce papers
• Doug Cutting and the Nutch team implemented
Google’s frameworks in Nutch
• In 2006 Yahoo! hired Doug Cutting to work on
Hadoop with a dedicated team
• In 2008 Hadoop became Apache Top Level
Project
– http://hadoop.apache.org
NAMING CONVENTIONS
Doug Cutting drew inspiration from his
family
– Lucene: Doug’s wife’s middle name
– Nutch: A word for "meal" that his son used as a toddler
– Hadoop: Yellow stuffed elephant named by his son
COMPARISON TO RDBMS
Until recently many applications utilized
Relational Database Management Systems
(RDBMS) for batch processing
– Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
– Hadoop doesn’t fully replace relational products; many
architectures benefit from using Hadoop alongside
relational products
• Scale-Out vs. Scale-Up
– RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
– Hadoop clusters can scale-out to 100s of machines and to
petabytes of storage
COMPARISON TO RDBMS
Structured Relational vs. Semi-Structured
vs. Unstructured
– RDBMS works well for structured data - tables that
conform to a predefined schema
– Hadoop works best on Semi-structured and Unstructured
data
• Semi-structured data may have a schema that is loosely
followed
• Unstructured data has no predefined structure and is
usually just blocks of text (or, for example, images)
• At processing time, the types for keys and values are
chosen by the implementer
– Certain types of input data, such as images, JSON, or
XML, will not easily fit into a relational schema
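"Types are chosen at processing time" is often called schema-on-read. A minimal sketch of the idea (not a Hadoop API; the record layout and field names here are invented for the example): the raw records are stored as-is, and the implementer decides how to interpret them only when a job runs.

```python
import json

# Semi-structured input: a loosely followed schema, stored verbatim.
raw_records = [
    '{"user": "ann", "views": 3}',
    '{"user": "bob"}',                 # field missing entirely
    '{"user": "cara", "views": "7"}',  # field present but wrong type
]

def parse(record):
    """Schema-on-read: interpretation happens here, not at write time."""
    doc = json.loads(record)
    # The implementer chooses key/value types and how to handle gaps.
    return doc["user"], int(doc.get("views", 0))

print([parse(r) for r in raw_records])
# → [('ann', 3), ('bob', 0), ('cara', 7)]
```

An RDBMS would have rejected the second and third records at load time; here the tolerance for irregular input is written into the processing code instead.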
END
Thank you.