Modeling with Hadoop

Vijay K. Narayanan
Principal Scientist, Yahoo! Labs, Yahoo!

Milind Bhandarkar
Chief Architect, Greenplum Labs, EMC2
Session 1: Overview of Hadoop

•  Motivation	

•  Hadoop	

  •  Map-Reduce	

  •  Distributed File System	

  •  Next Generation MapReduce	

•  Q & A	

Session 2: Modeling with Hadoop

•  Types of learning in MapReduce	

•  Algorithms in MapReduce
   framework	

  •  Data parallel algorithms	

  •  Sequential algorithms	

•  Challenges and Enhancements	

Session 3: Hands On Exercise

•  Spin-up Single Node Hadoop cluster in a
  Virtual Machine	

•  Write a regression trainer	

•  Train model on a dataset	


Overview of Apache Hadoop
Hadoop At Yahoo! (Some Statistics)

•  40,000+ machines in 20+ clusters

•  Largest cluster is 4,000 machines	

•  170 Petabytes of storage	

•  1000+ users	

•  1,000,000+ jobs/month	

BEHIND
EVERY CLICK
Hadoop Overview kdd2011
Who Uses Hadoop ?
Why Hadoop ?	



Big Datasets
(Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
Cost Per Gigabyte                             
(http://www.mkomo.com/cost-per-gigabyte)
Storage Trends
(Graph by Adam Leventhal, ACM Queue, Dec 2009)
Motivating Examples	



Yahoo! Search Assist
Search Assist	

•  Insight: Related concepts appear close
  together in text corpus	

•  Input: Web pages	

  •  1 Billion Pages, 10K bytes each	

  •  10 TB of input data	

•  Output: List(word, List(related words))	

Search Assist	

// Input: List(URL, Text)
foreach URL in Input :
    Tokens = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)
GroupedPairs = Group Pairs by word;
// Result: List (word, List(RelatedWords))
foreach word in GroupedPairs :
    CountedPairs = Count RelatedWords in GroupedPairs;
// Result: List (word, List(RelatedWords, count))
foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))

People You May Know
People You May Know	

•  Insight: You might also know Joe Smith if a
  lot of folks you know, know Joe Smith	

  •  if you don't know Joe Smith already

•  Numbers:	

  •  100 MM users	

  •  Average connections per user is 100	

People You May Know	


// Input: List(UserName, List(Connections))	
	
foreach u in UserList : // 100 MM	
    foreach x in Connections(u) : // 100	
        foreach y in Connections(x) : // 100	
            if (y not in Connections(u)) :	
                Count(u, y)++; // 1 Trillion Iterations	
    Sort (u,y) in descending order of Count(u,y);	
    Choose Top 3 y;	
    Store (u, {y0, y1, y2}) for serving;	
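The loop above translates almost directly into Python. The following is a minimal single-machine sketch, assuming the connection graph fits in memory as a dictionary of sets; the function name `people_you_may_know`, the sample graph, and the extra `y != u` check (so a user is never suggested to themselves) are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """connections: {user: set of connected users}. Returns {user: [suggestions]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:                       # each direct connection of u
            for y in connections.get(x, ()):    # each connection of that connection
                if y != u and y not in friends:
                    counts[y] += 1              # one more mutual connection with y
        # Keep the candidates with the most mutual connections.
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions

# Hypothetical toy graph: ann and joe share two connections but are not connected.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "joe"},
    "cat": {"ann", "joe"},
    "joe": {"bob", "cat"},
}
suggested = people_you_may_know(graph)
```

The triple-nested loop is what drives the trillion-iteration estimate at the scale quoted on the slide.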
	




Performance	

•  101 Random accesses for each user

  •  Assume 1 ms per random access

  •  About 100 ms per user

•  100 MM users

  •  More than 100 days on a single machine
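The estimate can be checked with a few lines of arithmetic. This sketch only restates the slide's assumptions (101 random accesses per user, 1 ms each, 100 MM users); the variable names are illustrative.

```python
# Back-of-the-envelope cost of the single-machine approach.
accesses_per_user = 101        # 1 lookup for the user + 100 for the connections
ms_per_access = 1              # assumed cost of one random access
users = 100_000_000            # 100 MM users

total_ms = accesses_per_user * ms_per_access * users
ms_per_day = 1000 * 60 * 60 * 24
total_days = total_ms / ms_per_day   # a little under 117 days

print(f"{total_days:.1f} days on a single machine")
```

The result is on the order of the 100 days quoted above, which is the motivation for spreading the work across many machines.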

MapReduce Paradigm	



Map & Reduce


•  Primitives in Lisp (& other functional
  languages), 1970s

•  Google Paper 2004	

  •  http://labs.google.com/papers/mapreduce.html



Map	


Output_List = Map (Input_List)	
	




Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =	
	
(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
	



Reduce	


Output_Element = Reduce (Input_List)	
	




Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
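The Square and Sum examples map onto Python's built-in `map` and `functools.reduce`. This is a small illustration of the two primitives on a single machine, not Hadoop code.

```python
from functools import reduce

numbers = range(1, 11)

# Map: apply the same function to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the whole list into a single output element.
total = reduce(lambda acc, x: acc + x, squares)

print(squares)  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(total)    # 385
```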
	




Parallelism	

•  Map is inherently parallel	

  •  Each list element processed
    independently	

•  Reduce is inherently sequential	

  •  Unless processing multiple lists	

•  Grouping to produce multiple lists	

Search Assist Map	

// Input: http://hadoop.apache.org
	
Pairs = Tokenize_And_Pair ( Text ( Input ) )	
	




Output = {	
(apache, hadoop) (hadoop, mapreduce) (hadoop, streaming)
(hadoop, pig) (apache, pig) (hadoop, DFS) (streaming,
commandline) (hadoop, java) (DFS, namenode) (datanode,
block) (replication, default)...	
}	
	

Search Assist Reduce	

// Input: GroupedList (word, GroupedList(words))	
	
CountedPairs = CountOccurrences (word, RelatedWords)	




Output = {	
(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming,
4) (hadoop, mapreduce, 9) ...	
}	
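The map and reduce steps for Search Assist can be sketched in Python as three small functions. This is a single-process illustration under stated assumptions: whitespace tokenization stands in for `Tokenize_And_Pair`, and the names `map_page`, `group_by_word`, and `reduce_word` are hypothetical.

```python
from collections import Counter, defaultdict

def map_page(text):
    """Map: emit (word, neighbor) pairs for adjacent tokens, in both directions."""
    tokens = text.lower().split()   # assumption: simple whitespace tokenizer
    for left, right in zip(tokens, tokens[1:]):
        yield left, right
        yield right, left

def group_by_word(pairs):
    """Grouping: collect all related words under their key word."""
    grouped = defaultdict(list)
    for word, related in pairs:
        grouped[word].append(related)
    return grouped

def reduce_word(word, related_words):
    """Reduce: count occurrences of each related word, most frequent first."""
    return [(word, r, n) for r, n in Counter(related_words).most_common()]

pairs = map_page("apache hadoop mapreduce hadoop streaming")
grouped = group_by_word(pairs)
counted = reduce_word("hadoop", grouped["hadoop"])
```

Each call to `map_page` is independent of every other page, and each call to `reduce_word` is independent of every other word, which is where the parallelism comes from.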



Issues with Large Data	


•  Map Parallelism: Chunking input data	

•  Reduce Parallelism: Grouping related data	

•  Dealing with failures & load imbalance



Apache Hadoop	


•  January 2006: Subproject of Lucene	

•  January 2008: Top-level Apache project	

•  Stable Version: 0.20.203	

•  Latest Version: 0.22 (Coming soon)	


Apache Hadoop	

•  Reliable, Performant Distributed file system	

•  MapReduce Programming framework	

•  Ecosystem: HBase, Hive, Pig, Howl, Oozie,
  Zookeeper, Chukwa, Mahout, Cascading,
  Scribe, Cassandra, Hypertable, Voldemort,
  Azkaban, Sqoop, Flume, Avro ...	



Problem: Bandwidth to Data

•  Scan 100TB Datasets on 1000 node cluster	

  •  Remote storage @ 10MB/s = 165 mins	

  •  Local storage @ 50-200MB/s = 33-8 mins	

•  Moving computation is more efficient than
  moving data	

  •  Need visibility into data placement	
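The scan times follow from dividing the dataset evenly across nodes and reading each share at the per-node rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads; the helper `scan_minutes` is illustrative.

```python
def scan_minutes(dataset_tb, nodes, mb_per_sec):
    """Minutes for every node to read its equal share of the dataset."""
    mb_per_node = dataset_tb * 1_000_000 / nodes
    return mb_per_node / mb_per_sec / 60

remote = scan_minutes(100, 1000, 10)       # remote storage at 10 MB/s
local_slow = scan_minutes(100, 1000, 50)   # local disk, low end
local_fast = scan_minutes(100, 1000, 200)  # local disk, high end

print(f"remote: {remote:.0f} min, local: {local_slow:.0f} to {local_fast:.0f} min")
```

The remote figure comes out near 167 minutes, in line with the rounded 165 minutes above; the local figures match the 33 and 8 minute values.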

Problem: Scaling Reliably

•  Failure is not an option, it's a rule!

  •  1000 nodes, MTBF < 1 day

  •  4000 disks, 8000 cores, 25 switches,
     1000 NICs, 2000 DIMMS (16TB RAM)	

•  Need fault tolerant store with reasonable
  availability guarantees	

  •  Handle hardware faults transparently	
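One way to see why failure becomes routine at this scale: if nodes fail independently at a constant rate, the failure rates add, so the cluster-wide mean time between failures is the per-node figure divided by the node count. The sketch below makes that assumption explicit; the per-node MTBF of 1000 days is a hypothetical input, not a number from the slides.

```python
def cluster_mtbf_days(node_mtbf_days, nodes):
    """Mean time between failures somewhere in the cluster.

    Assumes independent nodes with constant failure rates, so rates add.
    """
    return node_mtbf_days / nodes

# Hypothetical: each node fails on average once every 1000 days (about 3 years).
print(cluster_mtbf_days(1000, 1000))  # 1.0 day between failures cluster-wide
```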

Hadoop Goals	

•  Scalable: Petabytes (10^15 bytes) of data on
  thousands of nodes

•  Economical: Commodity components only	

•  Reliable	

  •  Engineering reliability into every
    application is expensive	



Hadoop MapReduce	



Think MapReduce	


•  Record = (Key, Value)	

•  Key : Comparable, Serializable	

•  Value: Serializable	

•  Input, Map, Shuffle, Reduce, Output	


Seems Familiar ?	



cat /var/log/auth.log* | \
grep "session opened" | cut -d' ' -f10 | \
sort | \
uniq -c > \
~/userlist
	




Map	


•  Input: (Key1, Value1)

•  Output: List(Key2, Value2)

•  Projections, Filtering, Transformation	



Shuffle	


•  Input: List(Key2, Value2)

•  Output

  •  Sort(Partition(List(Key2, List(Value2))))

•  Provided by Hadoop	


Reduce	


•  Input: List(Key2, List(Value2))

•  Output: List(Key3, Value3)

•  Aggregation	
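The Map, Shuffle, and Reduce signatures can be tied together in a toy in-memory driver. This is a sketch of the data flow only, assuming a single partition and no distribution or fault tolerance; `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names, not Hadoop APIs.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """records: iterable of (key1, value1). Returns a list of (key3, value3)."""
    # Map: each input record yields zero or more (key2, value2) pairs.
    mapped = [kv for k1, v1 in records for kv in mapper(k1, v1)]
    # Shuffle: sort by key2 so equal keys are adjacent (one partition here).
    mapped.sort(key=itemgetter(0))
    output = []
    # Reduce: one call per distinct key2, with the list of its values.
    for k2, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(k2, [v for _, v in group]))
    return output

def word_mapper(_offset, line):
    for word in line.split():
        yield word, 1

def count_reducer(word, ones):
    yield word, sum(ones)

result = run_mapreduce([(0, "a b a"), (6, "b a")], word_mapper, count_reducer)
print(result)  # [('a', 3), ('b', 2)]
```

Keys must be comparable because the shuffle sorts on them, which matches the "Key: Comparable" requirement stated earlier.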



Hadoop Streaming	

•  Hadoop is written in Java	

  •  Java MapReduce code is native 	

•  What about Non-Java Programmers ?	

  •  Perl, Python, Shell, R	

  •  grep, sed, awk, uniq as Mappers/Reducers	

•  Text Input and Output	

Hadoop Streaming	

•  Thin Java wrapper for Map & Reduce Tasks

•  Forks actual Mapper & Reducer

•  IPC via stdin, stdout, stderr

•  Key.toString() \t Value.toString() \n

•  Slower than Java programs	

  •  Allows for quick prototyping / debugging	

Hadoop Streaming	


$ bin/hadoop jar hadoop-streaming.jar \
      -input in-files -output out-dir \
      -mapper mapper.sh -reducer reducer.sh

# mapper.sh

sed -e 's/ /\n/g' | grep .

# reducer.sh

uniq -c | awk '{print $2 "\t" $1}'
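The same word count can be written for Streaming in Python, one of the languages listed earlier. The sketch below keeps the mapper and reducer as plain functions over lines of text so they can be tried locally; in a real Streaming job each would read `sys.stdin` and print its results. The function names and the sample input are illustrative, and `line.split()` splits on any whitespace rather than only on single spaces as the `sed` version does.

```python
from itertools import groupby

def mapper(lines):
    """Emit one word per output line, like the sed/grep mapper."""
    for line in lines:
        for word in line.split():
            yield word

def reducer(sorted_words):
    """Count runs of identical words, like `uniq -c` followed by awk.

    Relies on the input being sorted, which the shuffle phase guarantees.
    """
    stripped = (w.rstrip("\n") for w in sorted_words)
    for word, run in groupby(stripped):
        yield f"{word}\t{sum(1 for _ in run)}"

# Local simulation of map -> sort (shuffle) -> reduce.
words = sorted(mapper(["hadoop pig", "pig hadoop pig"]))
counts = list(reducer(words))
print(counts)  # ['hadoop\t2', 'pig\t3']
```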
	



Hadoop Distributed File System (HDFS)



HDFS	

•  Data is organized into files and directories	

•  Files are divided into uniform sized blocks
  (default 128MB) and distributed across
  cluster nodes	

•  HDFS exposes block placement so that
  computation can be migrated to data	



HDFS	

•  Blocks are replicated (default 3) to handle
  hardware failure	

•  Replication for performance and fault
  tolerance (Rack-Aware placement)	

•  HDFS keeps checksums of data for
  corruption detection and recovery	
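The block size and replication defaults translate into simple storage arithmetic. A minimal sketch using the defaults quoted above (128 MB blocks, replication factor 3); the helper names and the 1000 MB example file are illustrative.

```python
import math

BLOCK_MB = 128      # default block size from the slide
REPLICATION = 3     # default replication factor from the slide

def num_blocks(file_mb, block_mb=BLOCK_MB):
    """Number of blocks a file occupies; the last block may be partly filled."""
    return math.ceil(file_mb / block_mb)

def raw_storage_mb(file_mb, replication=REPLICATION):
    """Total disk space consumed across the cluster, counting all replicas."""
    return file_mb * replication

print(num_blocks(1000))      # 8 blocks
print(raw_storage_mb(1000))  # 3000 MB of raw storage
```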



HDFS	


•  Master-Worker Architecture	

•  Single NameNode	

•  Many (Thousands) DataNodes	



HDFS Master (NameNode)

•  Manages filesystem namespace	

•  File metadata (i.e. "inode")

•  Mapping inode to list of blocks + locations

•  Authorization & Authentication

•  Checkpoint & journal namespace changes

Namenode	

•  Mapping of datanode to list of blocks	

•  Monitor datanode health	

•  Replicate missing blocks	

•  Keeps ALL namespace in memory	

•  60M objects (File/Block) in 16GB	
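The last figure implies a rough memory budget per namespace object. This sketch simply divides the two numbers from the slide, assuming decimal gigabytes; it says nothing about how the NameNode actually lays out its data structures.

```python
heap_bytes = 16 * 10**9     # 16 GB, decimal units assumed
objects = 60_000_000        # files + blocks held in memory

bytes_per_object = heap_bytes / objects
print(f"about {bytes_per_object:.0f} bytes per file/block object")
```

Because every file and block costs memory on a single machine, the namespace size is bounded by the NameNode's RAM.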

Datanodes	

•  Handle block storage on multiple volumes
   & block integrity

•  Clients access the blocks directly from data
  nodes	

•  Periodically send heartbeats and block
  reports to Namenode	

•  Blocks are stored as underlying OS's files

HDFS Architecture
Next Generation MapReduce



MapReduce Today	

    (Courtesy: Arun Murthy, Hortonworks)
Why ?	

•  Scalability Limitations today	

  •  Maximum cluster size: 4000 nodes	

  •  Maximum Concurrent tasks: 40,000	

•  Job Tracker SPOF	

•  Fixed map and reduce containers (slots)	

  •  Punishes pleasantly parallel apps	

Why ? (contd)	

•  MapReduce is not suitable for every
  application	

•  Fine-Grained Iterative applications	

  •  HaLoop: Hadoop in a Loop	

•  Message passing applications	

  •  Graph Processing	

Requirements	


•  Need scalable cluster resources manager	

•  Separate scheduling from resource
  management	

•  Multi-Lingual Communication Protocols	


Bottom Line	

•  @techmilind #mrng (MapReduce, Next
  Gen) is in reality, #rmng (Resource
  Manager, Next Gen)	

•  Expect different programming paradigms to
  be implemented	

 •  Including MPI (soon)	

Architecture	

  (Courtesy: Arun Murthy, Hortonworks)
The New World	

•    Resource Manager	

     •    Allocates resources (containers) to applications	

•    Node Manager	

     •    Manages containers on nodes	

•    Application Master	

     •    Specific to paradigm e.g. MapReduce application
          master, MPI application master etc	



Container	


•  In current terminology: A Task Slot	

•  Slice of the node's hardware resources

•  # of cores, virtual memory, disk size, disk
  and network bandwidth etc	

  •  Currently, only memory usage is sliced	
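A container sliced by memory alone can be pictured as simple bookkeeping on each node. The class below is a purely hypothetical toy, not the Resource Manager or Node Manager API; it only shows the idea that requests of any size are granted while free memory remains, in contrast to fixed map and reduce slots.

```python
class Node:
    """Toy model of a node whose only sliced resource is memory."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def allocate(self, request_mb):
        """Grant a container of request_mb if enough memory is free."""
        if request_mb > self.free_mb:
            return False
        self.free_mb -= request_mb
        return True

node = Node("node-1", 8192)                           # hypothetical 8 GB node
granted = [node.allocate(2048) for _ in range(3)]     # three 2 GB containers
too_big = node.allocate(4096)                         # only 2 GB remain
```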


Questions ?	



Ad

More Related Content

What's hot (20)

Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
DATAVERSITY
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
Ishan Bhawantha Hewanayake
 
Data Mesh
Data MeshData Mesh
Data Mesh
Piethein Strengholt
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
Snowflake Computing
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
Prashant Gupta
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
 
Thinking Big - Big data: principes et architecture
Thinking Big - Big data: principes et architecture Thinking Big - Big data: principes et architecture
Thinking Big - Big data: principes et architecture
Lilia Sfaxi
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
Kent Graziano
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Adaryl "Bob" Wakefield, MBA
 
Cloud DW technology trends and considerations for enterprises to apply snowflake
Cloud DW technology trends and considerations for enterprises to apply snowflakeCloud DW technology trends and considerations for enterprises to apply snowflake
Cloud DW technology trends and considerations for enterprises to apply snowflake
SANG WON PARK
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
DATAVERSITY
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
 
Thinking Big - Big data: principes et architecture
Thinking Big - Big data: principes et architecture Thinking Big - Big data: principes et architecture
Thinking Big - Big data: principes et architecture
Lilia Sfaxi
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
Kent Graziano
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
Adaryl "Bob" Wakefield, MBA
 
Cloud DW technology trends and considerations for enterprises to apply snowflake
Cloud DW technology trends and considerations for enterprises to apply snowflakeCloud DW technology trends and considerations for enterprises to apply snowflake
Cloud DW technology trends and considerations for enterprises to apply snowflake
SANG WON PARK
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 

Viewers also liked (16)

Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
Milind Bhandarkar
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
Milind Bhandarkar
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
Milind Bhandarkar
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
Milind Bhandarkar
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
Kevin Weil
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
Milind Bhandarkar
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
Kevin Weil
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Kevin Weil
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
Kevin Weil
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
Kevin Weil
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
Milind Bhandarkar
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
Adam Kawa
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Kevin Weil
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joins
Shalish VJ
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
Milind Bhandarkar
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
Milind Bhandarkar
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
Milind Bhandarkar
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
Kevin Weil
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
Milind Bhandarkar
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
Kevin Weil
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Kevin Weil
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
Kevin Weil
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
Kevin Weil
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
Milind Bhandarkar
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
Adam Kawa
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Kevin Weil
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joins
Shalish VJ
 
Ad

Similar to Hadoop Overview kdd2011 (20)

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
Yahoo Developer Network
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
EMC
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
Dhanashri Yadav
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
hansen3032
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0
Heiko Loewe
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
Hadoop
HadoopHadoop
Hadoop
Anil Reddy
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
cdmaxime
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Hadoop-part1 in cloud computing subject.pptx
Hadoop-part1 in cloud computing subject.pptxHadoop-part1 in cloud computing subject.pptx
Hadoop-part1 in cloud computing subject.pptx
JyotiLohar6
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
MaruthiPrasad96
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
Sharjeel Imtiaz
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
Jazan University
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
cdmaxime
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
EMC
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
Dhanashri Yadav
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
hansen3032
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0
Heiko Loewe
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
cdmaxime
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Hadoop-part1 in cloud computing subject.pptx
Hadoop-part1 in cloud computing subject.pptxHadoop-part1 in cloud computing subject.pptx
Hadoop-part1 in cloud computing subject.pptx
JyotiLohar6
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
MaruthiPrasad96
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
Sharjeel Imtiaz
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
cdmaxime
 
Ad

Hadoop Overview kdd2011

  • 1. Modeling with Hadoop. Vijay K. Narayanan, Principal Scientist, Yahoo! Labs, Yahoo!; Milind Bhandarkar, Chief Architect, Greenplum Labs, EMC2
  • 2. Session 1: Overview of Hadoop •  Motivation •  Hadoop •  Map-Reduce •  Distributed File System •  Next Generation MapReduce •  Q & A
  • 3. Session 2: Modeling with Hadoop •  Types of learning in MapReduce •  Algorithms in MapReduce framework •  Data parallel algorithms •  Sequential algorithms •  Challenges and Enhancements
  • 4. Session 3: Hands On Exercise •  Spin-up Single Node Hadoop cluster in a Virtual Machine •  Write a regression trainer •  Train model on a dataset
  • 6. Hadoop At Yahoo! (Some Statistics) •  40,000+ machines in 20+ clusters •  Largest cluster is 4,000 machines •  170 Petabytes of storage •  1000+ users •  1,000,000+ jobs/month
  • 11. Big Datasets (Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
  • 12. Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)
  • 13. Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)
  • 16. Search Assist •  Insight: Related concepts appear close together in text corpus •  Input: Web pages •  1 Billion Pages, 10 KB each •  10 TB of input data •  Output: List(word, List(related words))
  • 17. Search Assist
        // Input: List(URL, Text)
        foreach URL in Input :
            Tokens = Tokenize(Text(URL));
            foreach word in Tokens :
                Insert (word, Next(word, Tokens)) in Pairs;
                Insert (word, Previous(word, Tokens)) in Pairs;
        // Result: Pairs = List (word, RelatedWord)
        Group Pairs by word;
        // Result: GroupedPairs = List (word, List(RelatedWords))
        foreach word in GroupedPairs :
            Count RelatedWords in GroupedPairs;
        // Result: CountedPairs = List (word, List(RelatedWords, count))
        foreach word in CountedPairs :
            Sort Pairs(word, *) descending by count;
            choose Top 5 Pairs;
        // Result: List (word, Top5(RelatedWords))
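The Search Assist pseudocode can be tried out on a single machine with a few lines of Python. This is only a minimal sketch, not the production pipeline: the `tokenize` rule, the function names, and the two sample pages are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def related_words(pages, top_n=5):
    """pages: iterable of (url, text). Returns {word: [(related_word, count), ...]}."""
    counts = defaultdict(Counter)  # word -> Counter of its neighbouring words
    for _url, text in pages:
        tokens = tokenize(text)
        # Each adjacent pair is recorded in both directions, which matches
        # inserting (word, Next) and (word, Previous) for every word.
        for left, right in zip(tokens, tokens[1:]):
            counts[left][right] += 1
            counts[right][left] += 1
    # Group, count and sort are handled by Counter; keep the top_n per word.
    return {word: c.most_common(top_n) for word, c in counts.items()}


# Hypothetical sample input.
pages = [
    ("http://example.org/a", "hadoop mapreduce hadoop streaming"),
    ("http://example.org/b", "apache hadoop mapreduce"),
]
result = related_words(pages)
print(result["hadoop"])
```

Everything here fits in memory, which is exactly what stops working at 10 TB of input and motivates the MapReduce formulation on the later slides.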
  • 19. People You May Know •  Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith •  if you don't know Joe Smith already •  Numbers: •  100 MM users •  Average connections per user is 100
  • 20. People You May Know
        // Input: List(UserName, List(Connections))
        foreach u in UserList :                    // 100 MM
            foreach x in Connections(u) :          // 100
                foreach y in Connections(x) :      // 100
                    if (y not in Connections(u)) :
                        Count(u, y)++;             // 1 Trillion Iterations
            Sort (u,y) in descending order of Count(u,y);
            Choose Top 3 y;
            Store (u, {y0, y1, y2}) for serving;
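The same triple loop, written as a small in-memory Python sketch. The graph, the names, and the extra `y != u` check (so that a user is never suggested to themselves) are assumptions added for illustration; the slide itself only tests `y not in Connections(u)`.

```python
from collections import Counter


def people_you_may_know(connections, top_n=3):
    """connections: {user: set of users they know}. Returns {user: [suggestions]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:                       # each direct connection of u
            for y in connections.get(x, ()):    # each connection of that connection
                # Skip u itself (an added check) and anyone u already knows.
                if y != u and y not in friends:
                    counts[y] += 1
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions


# Hypothetical five-user graph.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "joe"},
    "cat": {"ann", "joe", "dan"},
    "joe": {"bob", "cat"},
    "dan": {"cat"},
}
print(people_you_may_know(graph)["ann"])
```

Here "joe" is reachable from "ann" through two mutual connections and "dan" through one, so "joe" ranks first.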
  • 21. Performance •  101 random accesses for each user •  Assume 1 ms per random access •  About 100 ms per user •  100 MM users •  About 100 days on a single machine
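The estimate on this slide can be reproduced with a few lines of arithmetic. The reading of "101" as one lookup for the user plus one per connection is an assumption; the other figures come from the slide.

```python
ACCESSES_PER_USER = 101      # presumably 1 lookup for the user + 100 for the connections
SECONDS_PER_ACCESS = 1e-3    # 1 ms per random access, as assumed on the slide
USERS = 100_000_000          # 100 MM users

seconds_per_user = ACCESSES_PER_USER * SECONDS_PER_ACCESS
total_seconds = USERS * seconds_per_user
total_days = total_seconds / 86_400   # seconds in a day

print(f"{seconds_per_user * 1000:.0f} ms per user")
print(f"{total_days:.0f} days on one machine")
```

The exact result is a little under 117 days, which the slide rounds to 100.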
  • 23. Map Reduce •  Primitives in Lisp (& other functional languages), 1970s •  Google Paper 2004 •  http://labs.google.com/papers/mapreduce.html
  • 24. Map •  Output_List = Map (Input_List) •  Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
  • 25. Reduce •  Output_Element = Reduce (Input_List) •  Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
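Python inherits the same two primitives from the functional tradition, so the Square and Sum examples can be written directly with the built-in `map` and `functools.reduce`. This is only an illustration of the primitives, not of Hadoop's API.

```python
from functools import reduce
from operator import add

numbers = range(1, 11)

# Map: apply the same function to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the whole list into a single element.
total = reduce(add, squares)

print(squares)
print(total)
```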
  • 26. Parallelism •  Map is inherently parallel •  Each list element processed independently •  Reduce is inherently sequential •  Unless processing multiple lists •  Grouping to produce multiple lists
  • 27. Search Assist Map
        // Input: http://hadoop.apache.org
        Pairs = Tokenize_And_Pair ( Text ( Input ) )
        Output = { (apache, hadoop) (hadoop, mapreduce) (hadoop, streaming) (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline) (hadoop, java) (DFS, namenode) (datanode, block) (replication, default) ... }
  • 28. Search Assist Reduce
        // Input: GroupedList (word, GroupedList(words))
        CountedPairs = CountOccurrences (word, RelatedWords)
        Output = { (hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ... }
  • 29. Issues with Large Data •  Map Parallelism: Chunking input data •  Reduce Parallelism: Grouping related data •  Dealing with failures & load imbalance
  • 31. Apache Hadoop •  January 2006: Subproject of Lucene •  January 2008: Top-level Apache project •  Stable Version: 0.20.203 •  Latest Version: 0.22 (Coming soon)
  • 32. Apache Hadoop •  Reliable, Performant Distributed file system •  MapReduce Programming framework •  Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
  • 33. Problem: Bandwidth to Data •  Scan 100TB Datasets on 1000 node cluster •  Remote storage @ 10MB/s = 165 mins •  Local storage @ 50-200MB/s = 33-8 mins •  Moving computation is more efficient than moving data •  Need visibility into data placement
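The scan times follow from splitting the dataset evenly across the nodes, each reading its share at the quoted rate. A short check, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even distribution:

```python
DATASET_BYTES = 100 * 10**12   # 100 TB
NODES = 1000


def scan_minutes(mb_per_second):
    """Minutes for every node to read its 1/NODES share at the given rate."""
    per_node_bytes = DATASET_BYTES / NODES          # 100 GB per node
    seconds = per_node_bytes / (mb_per_second * 10**6)
    return seconds / 60


for rate in (10, 50, 200):
    print(f"{rate:>3} MB/s -> {scan_minutes(rate):.0f} min")
```

At 10 MB/s this gives about 167 minutes (the slide rounds to 165), and 33 down to 8 minutes for local reads at 50 to 200 MB/s.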
  • 34. Problem: Scaling Reliably •  Failure is not an option, it's a rule! •  1000 nodes, MTBF < 1 day •  4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16TB RAM) •  Need fault tolerant store with reasonable availability guarantees •  Handle hardware faults transparently
  • 35. Hadoop Goals •  Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes •  Economical: Commodity components only •  Reliable •  Engineering reliability into every application is expensive
  • 37. Think MapReduce •  Record = (Key, Value) •  Key: Comparable, Serializable •  Value: Serializable •  Input, Map, Shuffle, Reduce, Output
  • 38. Seems Familiar?
        cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
  • 39. Map •  Input: (Key1, Value1) •  Output: List(Key2, Value2) •  Projections, Filtering, Transformation
  • 40. Shuffle •  Input: List(Key2, Value2) •  Output: Sort(Partition(List(Key2, List(Value2)))) •  Provided by Hadoop
  • 41. Reduce •  Input: List(Key2, List(Value2)) •  Output: List(Key3, Value3) •  Aggregation
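The Map, Shuffle and Reduce signatures above can be mimicked with a tiny single-process driver. This is a sketch for building intuition, not Hadoop's implementation: `map_reduce`, `word_mapper` and `sum_reducer` are hypothetical names, and partitioning across reducers is left out (everything lands in one sorted partition).

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper and reducer over (key1, value1) records in one process."""
    # Map: (Key1, Value1) -> List(Key2, Value2)
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))

    # Shuffle: group values by Key2 and visit the keys in sorted order.
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)

    # Reduce: (Key2, List(Value2)) -> List(Key3, Value3)
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output


def word_mapper(_offset, line):
    """Transformation: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    """Aggregation: add up the counts collected for one word."""
    return [(word, sum(counts))]


lines = [(0, "hadoop pig hadoop"), (1, "pig hive")]
print(map_reduce(lines, word_mapper, sum_reducer))
```

Only the mapper and reducer are written by the application author; the grouping and sorting in the middle are the part the slide describes as provided by Hadoop.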
  • 42. Hadoop Streaming •  Hadoop is written in Java •  Java MapReduce code is native •  What about Non-Java Programmers? •  Perl, Python, Shell, R •  grep, sed, awk, uniq as Mappers/Reducers •  Text Input and Output
  • 43. Hadoop Streaming •  Thin Java wrapper for Map & Reduce Tasks •  Forks actual Mapper & Reducer •  IPC via stdin, stdout, stderr •  Key.toString() \t Value.toString() \n •  Slower than Java programs •  Allows for quick prototyping / debugging
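  • 44. Hadoop Streaming
        $ bin/hadoop jar hadoop-streaming.jar \
              -input in-files -output out-dir \
              -mapper mapper.sh -reducer reducer.sh
        # mapper.sh: put each word on its own line and drop empty lines
        sed -e 's/ /\n/g' | grep .
        # reducer.sh: count runs of identical (already sorted) words, print word<TAB>count
        uniq -c | awk '{print $2 "\t" $1}'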
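Since Streaming only requires line-oriented text on stdin and stdout, the two shell scripts could equally be written in Python. The sketch below mirrors their logic as plain functions and dry-runs them on an in-memory sample; the function names and sample text are assumptions. In an actual Streaming job each function would live in its own script, reading `sys.stdin` and printing to stdout.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one word per output line, skipping empty tokens (the mapper.sh role)."""
    for line in lines:
        for word in line.split():
            yield word


def reducer(sorted_words):
    """Count runs of identical adjacent words and emit word<TAB>count (the reducer.sh role)."""
    cleaned = (w.rstrip("\n") for w in sorted_words)
    for word, run in groupby(cleaned):
        yield f"{word}\t{sum(1 for _ in run)}"


# Local dry run on hypothetical input; sorted() stands in for the
# sort that Hadoop performs between the map and reduce stages.
sample = io.StringIO("hadoop pig\nhive hadoop\n")
shuffled = sorted(mapper(sample))
for line in reducer(shuffled):
    print(line)
```

Like `uniq -c`, the reducer depends on its input arriving sorted, so identical keys are always adjacent.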
  • 46. HDFS •  Data is organized into files and directories •  Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes •  HDFS exposes block placement so that computation can be migrated to data
  • 47. HDFS •  Blocks are replicated (default 3) to handle hardware failure •  Replication for performance and fault tolerance (Rack-Aware placement) •  HDFS keeps checksums of data for corruption detection and recovery
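To get a feel for what the block size and replication factor mean for a single file, here is a small calculation using the defaults quoted on the slides (128 MB blocks, 3 replicas). The function name is hypothetical, and the 128 MB is taken as 128 × 2^20 bytes.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MB block size quoted on the slide
REPLICATION = 3              # default replication factor quoted on the slide


def block_layout(file_size_bytes):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)   # the last block may be partial
    replicas = blocks * REPLICATION
    raw_bytes = file_size_bytes * REPLICATION          # partial blocks use only what they hold
    return blocks, replicas, raw_bytes


blocks, replicas, raw = block_layout(1024**3)          # a 1 GiB file
print(blocks, replicas, raw / 1024**3)
```

A 1 GiB file therefore becomes 8 blocks and 24 block replicas spread over the cluster, occupying 3 GiB of raw disk.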
  • 48. HDFS •  Master-Worker Architecture •  Single NameNode •  Many (Thousands) DataNodes
  • 49. HDFS Master (NameNode) •  Manages filesystem namespace •  File metadata (i.e. "inode") •  Mapping inode to list of blocks + locations •  Authorization & Authentication •  Checkpoint & journal namespace changes
  • 50. Namenode •  Mapping of datanode to list of blocks •  Monitor datanode health •  Replicate missing blocks •  Keeps ALL namespace in memory •  60M objects (File/Block) in 16GB
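The last figure implies a rough per-object memory budget for the Namenode. A quick back-of-the-envelope check, assuming 16GB means 16 × 2^30 bytes and that the heap is spread evenly over all objects (both simplifications); the helper function is hypothetical:

```python
HEAP_BYTES = 16 * 1024**3    # 16 GB of Namenode memory
OBJECTS = 60_000_000         # 60M file/block objects

bytes_per_object = HEAP_BYTES / OBJECTS


def heap_gb_for(objects):
    """Estimate the heap (in GB) needed for a given number of namespace objects."""
    return objects * bytes_per_object / 1024**3


print(f"{bytes_per_object:.0f} bytes per object")
print(f"{heap_gb_for(120_000_000):.0f} GB for 120M objects")
```

That is on the order of 300 bytes per file or block, which is why the number of objects, rather than the raw data volume, bounds the size of a namespace held by a single Namenode.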
  • 51. Datanodes •  Handle block storage on multiple volumes & block integrity •  Clients access the blocks directly from data nodes •  Periodically send heartbeats and block reports to Namenode •  Blocks are stored as underlying OS's files
  • 53. Next Generation MapReduce
  • 54. MapReduce Today (Courtesy: Arun Murthy, Hortonworks)
  • 55. Why? •  Scalability Limitations today •  Maximum cluster size: 4000 nodes •  Maximum Concurrent tasks: 40,000 •  Job Tracker SPOF •  Fixed map and reduce containers (slots) •  Punishes pleasantly parallel apps
  • 56. Why? (contd) •  MapReduce is not suitable for every application •  Fine-Grained Iterative applications •  HaLoop: Hadoop in a Loop •  Message passing applications •  Graph Processing
  • 57. Requirements •  Need scalable cluster resource manager •  Separate scheduling from resource management •  Multi-Lingual Communication Protocols
  • 58. Bottom Line •  @techmilind #mrng (MapReduce, Next Gen) is in reality, #rmng (Resource Manager, Next Gen) •  Expect different programming paradigms to be implemented •  Including MPI (soon)
  • 59. Architecture (Courtesy: Arun Murthy, Hortonworks)
  • 60. The New World •  Resource Manager •  Allocates resources (containers) to applications •  Node Manager •  Manages containers on nodes •  Application Master •  Specific to paradigm, e.g. MapReduce application master, MPI application master etc.
  • 61. Container •  In current terminology: A Task Slot •  Slice of the node's hardware resources •  # of cores, virtual memory, disk size, disk and network bandwidth etc. •  Currently, only memory usage is sliced