INTRODUCTION TO BIG DATA HADOOP
GOAL
To learn about Big Data and Hadoop programming
AGENDA
Big Data
Hadoop Introduction
History
Comparison to Relational Databases
Hadoop Eco-System and Distributions
Resources
BIG DATA
International Data Corporation (IDC)
publishes estimates of the data created
each year – its 2010 estimate was already
measured in zettabytes
Companies continue to generate large
amounts of data; some 2011 stats:
– Facebook ~ 6 billion messages per day
– eBay ~ 2 billion page views a day, ~ 9 Petabytes of
storage
– Satellite Images by Skybox Imaging ~ 1 Terabyte per day
HADOOP
Existing tools were not designed to handle
such large amounts of data
"The Apache™ Hadoop™ project develops
open-source software for reliable, scalable,
distributed computing." -
http://hadoop.apache.org
– Process Big Data on clusters of commodity hardware
– Vibrant open-source community
– Many products and tools reside on top of Hadoop
HADOOP JOBS
USERS OF HADOOP
Amazon
eBay
Facebook
Twitter
Linkedin
Wayn
IBM
Yahoo
DATA STORAGE
Storage capacity has grown exponentially
but read speed has not kept up
– 1990:
• Store 1,400 MB
• Transfer speed of 4.5MB/s
• Read the entire drive in ~ 5 minutes
– 2010:
• Store 1 TB
• Transfer speed of 100MB/s
• Read the entire drive in ~ 3 hours
• Hadoop - 100 drives working at the same
time can read 1TB of data in 2 minutes
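The figures above can be sanity-checked with simple arithmetic (assuming 1 TB ≈ 1,000,000 MB):

```python
# Back-of-the-envelope check of the drive-scan numbers on this slide.

# 1990: a 1,400 MB drive read at 4.5 MB/s
minutes_1990 = 1400 / 4.5 / 60                     # ~5.2 minutes

# 2010: a 1 TB drive read at 100 MB/s
hours_2010 = 1_000_000 / 100 / 3600                # ~2.8 hours

# Hadoop's trick: 100 drives each read 1/100th of the terabyte in parallel
minutes_parallel = 1_000_000 / (100 * 100) / 60    # ~1.7 minutes

print(round(minutes_1990, 1), round(hours_2010, 1), round(minutes_parallel, 1))
# → 5.2 2.8 1.7
```

Parallelizing the reads, not making individual drives faster, is what closes the gap between capacity and transfer speed.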
HADOOP CLUSTER
A set of "cheap" commodity hardware
• Networked together
• Resides in the same location
– Set of servers in a set of racks in a data center
USE COMMODITY HARDWARE
“Cheap” Commodity Server Hardware
– No need for super-computers; use commodity hardware,
accepting that it is less reliable
– Not desktops
HADOOP SYSTEM PRINCIPLES
Scale-Out rather than Scale-Up
Bring code to data rather than data to code
Deal with failures – they are common
Abstract complexity of distributed and
concurrent applications
SCALE OUT INSTEAD OF SCALE UP
It is harder and more expensive to scale-up
– Add additional resources to an existing node (CPU, RAM)
– Moore’s Law can’t keep up with data growth
– New units must be purchased if the required resources
cannot be added
– Also known as scale vertically
• Scale-Out
– Add more nodes/machines to an existing distributed
application
– The software layer is designed for node addition or removal
– Hadoop takes this approach - A set of nodes are bonded
together as a single distributed system
– Very easy to scale down as well
CODE TO DATA
• Traditional data processing architecture
– Nodes are split into separate processing and storage
nodes connected by a high-capacity link
– Many data-intensive applications are not CPU-bound,
so the network link becomes the bottleneck
[Diagram: data is loaded from storage nodes across the
network to processing nodes, and results are saved back –
the network link is the bottleneck risk]
CODE TO DATA
Hadoop co-locates processors and storage
– Code is moved to data (size is tiny, usually in KBs)
– Processors execute code and access underlying local
storage
[Diagram: four Hadoop nodes, each pairing a processor
with its own local storage]
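The code-to-data idea can be sketched in a few lines of plain Python. This is a conceptual illustration only (not the Hadoop API): each list stands in for a node's local partition of the data, and the small function is the "code" that travels to it.

```python
# Conceptual sketch of code-to-data: the data stays put in per-node
# partitions; only a tiny function and tiny partial results move.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # each list = one node's local data

def local_task(partition):
    # The code we ship is a few KBs; the data it runs over may be GBs.
    return sum(partition)

# "Run" the task on every node against its local storage
partial_results = [local_task(p) for p in partitions]

# Only the small per-partition results cross the network to be combined
total = sum(partial_results)

print(partial_results, total)  # → [6, 9, 30] 45
```

Compare this with the traditional architecture, where the full partitions themselves would have to be loaded across the link before any processing could start.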
FAILURES ARE COMMON
Given a large number of machines, failures are
common
– Large warehouses may see machine failures weekly or
even daily
• Hadoop is designed to cope with node
failures
– Data is replicated
– Tasks are retried
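The "tasks are retried" principle amounts to a bounded retry loop around each unit of work. The sketch below is an illustration of the idea, not Hadoop's actual scheduler; the function names and retry limit are made up for the example.

```python
# Illustrative sketch: re-attempt a task on failure, up to a limit.
def run_with_retries(task, max_attempts=4):
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            # In a real cluster the retry would go to a different node,
            # which still holds a replica of the input data.
            if attempt == max_attempts:
                raise

calls = []

def flaky_task(attempt):
    calls.append(attempt)
    if attempt < 3:                  # simulate two lost nodes
        raise RuntimeError("node lost")
    return "done"

print(run_with_retries(flaky_task))  # → done (succeeds on the 3rd attempt)
```

Data replication is what makes the retry safe: because the input blocks exist on several nodes, a failed task can simply be rescheduled where another copy lives.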
ABSTRACT COMPLEXITIES
Hadoop abstracts many complexities in
distributed and concurrent applications
– Defines a small number of components
– Provides simple, well-defined interfaces for interaction
between these components
• Frees the developer from worrying about system-level
challenges
– race conditions, data starvation
– processing pipelines, data partitioning, code distribution
– etc.
• Allows developers to focus on application
development and business logic
HISTORY OF HADOOP
Started as a sub-project of Apache Nutch
– Nutch’s job is to index the web and expose it for searching
– An open-source alternative to Google search
– Started by Doug Cutting
• In 2004 Google published its Google File System
(GFS) and MapReduce papers
• Doug Cutting and the Nutch team implemented
Google’s frameworks in Nutch
• In 2006 Yahoo! hired Doug Cutting to work on
Hadoop with a dedicated team
• In 2008 Hadoop became Apache Top Level
Project
– http://hadoop.apache.org
NAMING CONVENTIONS
Doug Cutting drew inspiration from his
family
– Lucene: Doug’s wife’s middle name
– Nutch: A word for "meal" that his son used as a toddler
– Hadoop: Yellow stuffed elephant named by his son
COMPARISON TO RDBMS
Until recently many applications utilized
Relational Database Management Systems
(RDBMS) for batch processing
– Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
– Hadoop doesn’t fully replace relational products; many
architectures benefit from using Hadoop alongside
relational products
• Scale-Out vs. Scale-Up
– RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
– Hadoop clusters can scale-out to 100s of machines and to
petabytes of storage
COMPARISON TO RDBMS
Structured Relational vs. Semi-Structured
vs. Unstructured
– RDBMS works well for structured data - tables that
conform to a predefined schema
– Hadoop works best on Semi-structured and Unstructured
data
• Semi-structured data may have a schema that is loosely
followed
• Unstructured data has no predefined structure and is
usually just blocks of text (or, for example, images)
• At processing time, the types for keys and values are
chosen by the implementer
– Certain types of input data, such as images, JSON, or
XML, will not easily fit into a relational schema
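"Types are chosen at processing time" is often called schema-on-read. A minimal sketch of the idea (not a Hadoop API; the record layout and field names here are invented for the example): the raw records are stored as-is, and the implementer decides how to interpret them only when a job runs.

```python
import json

# Semi-structured input: a loosely followed schema, stored verbatim.
raw_records = [
    '{"user": "ann", "views": 3}',
    '{"user": "bob"}',                 # field missing entirely
    '{"user": "cara", "views": "7"}',  # field present but wrong type
]

def parse(record):
    """Schema-on-read: interpretation happens here, not at write time."""
    doc = json.loads(record)
    # The implementer chooses key/value types and how to handle gaps.
    return doc["user"], int(doc.get("views", 0))

print([parse(r) for r in raw_records])
# → [('ann', 3), ('bob', 0), ('cara', 7)]
```

An RDBMS would have rejected the second and third records at load time; here the tolerance for irregular input is written into the processing code instead.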
END
Thank you.