Benchmarking
Steve Loughran
Julio Guijarro




© 2009 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice
Benchmarks
Some Problems

•  Estimating Hadoop performance of hardware
•  Estimating Hadoop performance of a cluster
•  Designing Hadoop-ready servers
•  Designing Hadoop-ready clusters
•  Optimising the network for Hadoop
•  Optimising Hadoop/HDFS for specific applications
Benchmarking
Recent customer request

"They want data for Hadoop Sort for 100GB."
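
A request like this usually maps onto the TeraGen/TeraSort examples shipped with Hadoop. TeraGen is sized in 100-byte rows rather than bytes, so the row count has to be worked out first. The sketch below does that arithmetic; the helper name is hypothetical, and whether the customer's "100GB" means decimal or binary gigabytes is an assumption, so both are shown.

```python
# TeraGen writes fixed-size 100-byte rows and is told how many rows to generate.
TERAGEN_ROW_BYTES = 100


def teragen_rows(size_bytes: int) -> int:
    """Return the number of 100-byte rows needed to reach at least size_bytes.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    return -(-size_bytes // TERAGEN_ROW_BYTES)  # ceiling division


# Assumption: "100GB" could mean 100 * 10^9 bytes or 100 * 2^30 bytes.
rows_decimal = teragen_rows(100 * 10**9)
rows_binary = teragen_rows(100 * 2**30)

print("rows for 100 GB (decimal):", rows_decimal)
print("rows for 100 GiB (binary):", rows_binary)
```

The chosen row count is what gets passed to TeraGen as its first argument, ahead of the output directory.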
Terasort: what else?

•  PageRank: CPU-intensive, small (static) input dataset
•  Something that stresses RAM and CPU
•  Something that seeks in the files?
Test Datasets

•  Wikipedia: 5-10 TB of XML data with changes; user relationships have to be inferred
•  SpamAssassin: 70+ GB of SPAM
•  Physics? Something Small?
Network Measurement

What should be added to Hadoop/Avro/Thrift to monitor network traffic and relate it to specific jobs?
Predicting performance

Can an MR job on small datasets predict performance on full-size datasets?

What extra instrumentation can help?
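
One naive way to explore the question is to time a sort-like job at a few small sizes, fit a cost model, and extrapolate. The sketch below assumes a model of the form seconds = a + b * n * log2(n), which is an assumption for illustration, not something the slides propose. The timings are synthetic values generated from that same model, not measurements, and the function names are hypothetical.

```python
import math


def fit_nlogn(sizes_gb, seconds):
    """Least-squares fit of seconds = a + b * n*log2(n) to (size, time) pairs."""
    xs = [n * math.log2(n) for n in sizes_gb]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(seconds) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, seconds))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, size_gb):
    """Extrapolate the fitted model to a larger input size."""
    return a + b * size_gb * math.log2(size_gb)


# Synthetic timings from a known model (a=30 s startup, b=0.5), NOT real data.
sizes = [1, 2, 4, 8]
times = [30 + 0.5 * n * math.log2(n) for n in sizes]

a, b = fit_nlogn(sizes, times)
print("fitted a, b:", a, b)
print("predicted seconds at 100 GB:", predict(a, b, 100))
```

A fit like this cannot see regime changes, such as map output spilling to disk or the network saturating during the shuffle, which is one reason the question about extra instrumentation matters.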
Hardware Q

What should a Hadoop-ready server look like?

What about a rack?

Or a container?
