APOLLO GROUP




Hadoop Operations: Starting Out Small
So Your Cluster Isn't Yahoo-sized (yet)
Michael Arnold
Principal Systems Engineer
14 June 2012
Agenda

  Who
  What (Definitions)
  Decisions for Now
  Decisions for Later
  Lessons Learned




APOLLO GROUP             © 2012 Apollo Group        2




  Who




Who is Apollo?

        Apollo Group is a leading provider of higher
          education programs for working adults.




Who is Michael Arnold?

  Systems Administrator
  Automation geek
  13 years in IT
  I deal with:
      –Server hardware specification/configuration
      –Server firmware
      –Server operating system
      –Hadoop application health
      –Monitoring all the above






  What
  Definitions




Definitions

  Q: What is a tiny/small/medium/large cluster?
  A (by number of nodes):
      –Tiny:          1-9
      –Small:         10-99
      –Medium:        100-999
      –Large:         1000+
      –Yahoo-sized:   4000
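
As a minimal sketch, the size bands above can be written as a lookup, assuming the ranges count nodes. The function name is hypothetical, and "Yahoo-sized" is treated as a data point inside "large" rather than its own band:

```python
def cluster_size(node_count: int) -> str:
    """Map a node count onto the size labels used on this slide."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count <= 9:
        return "tiny"
    if node_count <= 99:
        return "small"
    if node_count <= 999:
        return "medium"
    return "large"  # 1000+; a 4000-node cluster lands here too
```

The two-cabinet starter layout described later (roughly ten datanodes plus headnodes) sits right at the boundary between "tiny" and "small".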




Definitions

  Q: What is a “headnode”?
  A: A server that runs one or more of the following
   Hadoop processes:
      –NameNode
      –JobTracker
      –Secondary NameNode
      –ZooKeeper
      –HBase Master








  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Now



Which Hadoop distribution?

  Amazon
  Apache
  Cloudera
  Greenplum
  Hortonworks
  IBM
  MapR
  Platform Computing



Should you virtualize?

  Can be OK for small clusters BUT
      –virtualization adds overhead
      –can cause performance degradation
      –cannot take advantage of Hadoop rack locality
  Virtualization can be good for:
      –functional testing of M/R job or workflow changes
      –evaluation of Hadoop upgrades




What sort of hardware should you be considering?

  Inexpensive
  Not “enterprisey” hardware
     –No RAID*
     –No Redundant power*
  Low power consumption
  No optical drives
     –get systems that can boot off the network



                                              * except in headnodes

Plan for capacity expansion

  Start at the bottom and
   work your way up
  Leave room in your
   cabinets for more
   machines




Plan for capacity expansion (cont.)

  Deploy your initial
   cluster in two cabinets
     –One headnode, one
      switch, and several
      (five) datanodes per
      cabinet




Plan for capacity expansion (cont.)

  Install a second cluster
   in the empty space in
   the upper half of the
   cabinet








  What decisions should you
  make now and which can
  you postpone for later?
  Decisions for Later



What size cluster?

  Depends upon your:
  Budget
  Data size
  Workload characteristics
  SLA




What size cluster? (cont.)

  Are your MapReduce jobs:
  compute-intensive?
  reading lots of data?

  http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
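
For the "reading lots of data" case, a rough first estimate of datanode count can come from storage alone. This is only a sketch: the replication factor of 3 is the HDFS default, while the per-node capacity and the fill ceiling are placeholder assumptions to replace with your own numbers. It ignores CPU, RAM, and intermediate M/R output.

```python
import math

def datanodes_needed(data_tb, replication=3, usable_tb_per_node=8.0, max_fill=0.7):
    """Estimate datanodes required to store data_tb of user data.

    usable_tb_per_node and max_fill are illustrative assumptions,
    not recommendations from the talk.
    """
    raw_tb = data_tb * replication              # every block is stored `replication` times
    per_node_tb = usable_tb_per_node * max_fill # leave headroom on each node
    return math.ceil(raw_tb / per_node_tb)
```

With these placeholder values, 20 TB of data needs 60 TB of raw space, which works out to 11 datanodes.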




Should you implement rack awareness?


        If more than one switch in the cluster:

                           YES
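
Rack awareness in Hadoop of this era is driven by a topology script (named by the topology.script.file.name property): Hadoop passes it one or more IP addresses or hostnames as arguments and reads back one rack path per argument. A minimal sketch follows; the subnets and rack names are hypothetical and assume one subnet per cabinet/switch:

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping: one address prefix per cabinet (and therefore per switch).
RACKS = {
    "10.1.1.": "/cabinet1",
    "10.1.2.": "/cabinet2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(address):
    """Return the rack path for one address, or the default if unknown."""
    for prefix, rack in RACKS.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK

def main(argv):
    # One rack path per argument, in the same order, separated by spaces.
    print(" ".join(rack_for(a) for a in argv))

if __name__ == "__main__":
    main(sys.argv[1:])
```

The trailing dot in each prefix keeps 10.1.10.x from being mistaken for 10.1.1.x.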




Should you use automation?

       If not in the beginning, then as soon as
                        possible.

  Boot disks will fail.
  Automated OS and application installs:
      –Save time
      –Reduce errors
          •Cobbler/Spacewalk/Foreman/xCAT/etc.
          •Puppet/Chef/CFEngine/shell scripts/etc.





  Lessons Learned




Keep It Simple

            Don't add redundancy and features
         (server/network) that will make things more
                 complicated and expensive.

               Hadoop has built-in redundancies.

                     Don't overlook them.




Automate the Hardware

  Twelve hours of manual work in the datacenter is
   not fun.
  Make sure all server firmware is configured
   identically.
      –HP SmartStart Scripting Toolkit
      –Dell OpenManage Deployment Toolkit
      –IBM ServerGuide Scripting Toolkit




Rolling upgrades are possible

               (Just not of the Hadoop software.)

   Datanodes can be decommissioned, patched, and
       added back into the cluster without service
                      downtime.




The smallest thing can have a big impact on the cluster


  Bad NIC/switchport can cause cluster slowness.

  Slow disks can cause intermittent job slowdowns.




HDFS blocks are weird

  On ext3/ext4:
      –Small blocks are not padded to the HDFS block size;
       they occupy only the actual size of the data.
      –Each HDFS block is actually two files on the
       datanode's filesystem:
          •The actual data and
          •A metadata/checksum file

 # ls -l blk_1058778885645824207*
 -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
 -rw-r--r-- 1 hdfs hdfs   283 May 14 01:26 blk_1058778885645824207_19155994.meta
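
The two sizes in the listing above fit the usual layout of the checksum file, assuming the defaults of that Hadoop generation: a 7-byte header followed by one 4-byte CRC32 for every 512 bytes of data (io.bytes.per.checksum). A minimal sketch of the arithmetic; the function name is hypothetical:

```python
def meta_file_size(data_bytes, bytes_per_checksum=512, header_bytes=7, crc_bytes=4):
    """Expected size of a block's .meta file under the assumed defaults."""
    # Round up: a partial final chunk still gets its own checksum.
    chunks = (data_bytes + bytes_per_checksum - 1) // bytes_per_checksum
    return header_bytes + crc_bytes * chunks
```

For the 35094-byte block shown, that is 69 chunks, so 7 + 4 * 69 = 283 bytes, matching the .meta file in the listing.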



Do not prematurely optimize

  Be careful tuning your datanode filesystems.
      • mkfs -t ext4 -T largefile4 ... (probably bad)
      • mkfs -t ext4 -i 131072 -m 0 ... (better)

 /etc/mke2fs.conf
 [fs_types]
  hadoop = {
         features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
         inode_ratio = 131072
         blocksize = -1
         reserved_ratio = 0
         default_mntopts = acl,user_xattr
  }
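
Why the inode ratio matters: each HDFS block costs two inodes (data file plus .meta file), and the ratio fixes the inode count at mkfs time. A rough sketch of the arithmetic, assuming a hypothetical 2 TiB data disk; real mke2fs rounds per block group and directories use inodes too, so treat the results as approximations:

```python
LARGEFILE4_RATIO = 4194304  # bytes per inode for the stock largefile4 type
SLIDE_RATIO = 131072        # bytes per inode suggested on this slide

def inode_count(fs_bytes, bytes_per_inode):
    """Approximate number of inodes mke2fs would create."""
    return fs_bytes // bytes_per_inode

def max_hdfs_blocks(fs_bytes, bytes_per_inode):
    """Upper bound on HDFS blocks: each one needs a data file and a .meta file."""
    return inode_count(fs_bytes, bytes_per_inode) // 2

DISK = 2 * 2**40  # hypothetical 2 TiB datanode disk
print(max_hdfs_blocks(DISK, LARGEFILE4_RATIO))  # 262144
print(max_hdfs_blocks(DISK, SLIDE_RATIO))       # 8388608
```

With largefile4, a disk holding many small blocks can run out of inodes long before it runs out of space.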

Use DNS-friendly names for services

       hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
         mapred.delta.hadoop.apollogrp.edu:8021
      http://oozie.delta.hadoop.apollogrp.edu:11000/
      hiveserver.delta.hadoop.apollogrp.edu:10000



   Yes, the names are long, but I bet you can figure out how to
                    connect to Bravo Cluster.
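
The payoff of a naming scheme like this is that endpoints become predictable. A small sketch, assuming every cluster follows the service.cluster.domain pattern and ports shown above; the helper and table names are hypothetical:

```python
# Patterns copied from the Delta examples on this slide.
TEMPLATES = {
    "hdfs": "hdfs://hdfs.{cluster}.{domain}:8020/",
    "mapred": "mapred.{cluster}.{domain}:8021",
    "oozie": "http://oozie.{cluster}.{domain}:11000/",
    "hiveserver": "hiveserver.{cluster}.{domain}:10000",
}

def endpoint(service, cluster, domain="hadoop.apollogrp.edu"):
    """Build a service endpoint for a named cluster."""
    return TEMPLATES[service].format(cluster=cluster.lower(), domain=domain)
```

For example, endpoint("hdfs", "Bravo") gives hdfs://hdfs.bravo.hadoop.apollogrp.edu:8020/.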




Use a parallel, remote execution tool

  pdsh/Cluster SSH/mussh/etc

                 SSH in a for loop is so 2010

  FUNC/MCollective
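
To show what these tools do differently from a for loop, here is a minimal sketch of fanning one command out to many hosts at once. It is not a replacement for pdsh or MCollective; the function names are hypothetical, and it assumes key-based SSH is already set up:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_runner(host, command):
    """Run one command on one host over SSH; return (exit code, stdout)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout

def run_everywhere(hosts, command, runner=ssh_runner, max_workers=16):
    """Run the command on all hosts concurrently; return {host: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(runner, host, command) for host in hosts}
        return {host: future.result() for host, future in futures.items()}
```

The runner parameter lets you swap in a fake for testing, or a different transport, without touching the fan-out logic.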




Make your log directories as large as you can.

  20-100GB /var/log
      –Implement log purging cronjobs or your log
       directories will fill up.


  Beware: M/R jobs can fill up /tmp as well.
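
A log purging job can be as small as the sketch below, which removes files older than a given age. The function names, the directory, and the retention period are assumptions to adapt; it defaults to a dry run so nothing is deleted by accident:

```python
import os
import time

def stale_logs(directory, max_age_days, now=None):
    """Return sorted paths of files under directory older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)

def purge(directory, max_age_days, dry_run=True):
    """List stale files and, unless dry_run is set, delete them."""
    paths = stale_logs(directory, max_age_days)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Run from cron, something like purge("/var/log/hadoop", 14, dry_run=False) would keep two weeks of logs; the same approach applies to job debris in /tmp.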




Insist on IPMI 2.0 for out of band management of server hardware.

  Serial Over LAN is awesome when booting a
   system.
  Standardized hardware/temperature monitoring.
  Simple remote power control.
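
The three points above map onto a handful of ipmitool invocations over the IPMI 2.0 "lanplus" interface. A sketch that only builds the command lines; the short action names and the helper are hypothetical, and the password is expected in the IPMI_PASSWORD environment variable (the -E flag):

```python
# Keys are hypothetical shorthand; values are ipmitool subcommands.
ACTIONS = {
    "power_status": ["chassis", "power", "status"],
    "power_cycle": ["chassis", "power", "cycle"],
    "console": ["sol", "activate"],                  # Serial Over LAN
    "temperatures": ["sdr", "type", "Temperature"],  # sensor readings
}

def ipmi_argv(bmc_host, user, action):
    """Build the argv list for one ipmitool call against a BMC."""
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E"]
    return base + ACTIONS[action]
```

The resulting list can be handed to subprocess.run, or to a parallel execution tool to sweep every BMC in a cabinet.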




Spanning-tree is the devil

  Enable portfast on your server switch ports or the
   BMCs may never get a DHCP lease.




Apollo has rebuilt its cluster four times.

               You may end up doing so as well.




Apollo Timeline

  First build
  Cloudera Professional Services helped install CDH
  Four nodes
  OS built manually via USB CD-ROM.
  CDH2




Apollo Timeline

  Second build
  Cobbler
  All software deployment is via kickstart. Very little
   is in puppet. Config files are deployed via wget.
  CDH2




Apollo Timeline

  Third build
  OS filesystem partitioning needed to change.
  Most software deployment still via kickstart.
  CDH3b2




Apollo Timeline

  Fourth build
  HDFS filesystem inodes needed to be increased.
  Full puppet automation.
  Added redundant/hotswap enterprise hardware for
   headnodes.
  CDH3u1




Cluster failures at Apollo

  Hardware
      –disk failures (40+)
      –disk cabling (6)
      –RAM (2)
      –switch port (1)
  Software
      –Cluster
          •NFS (NN -> 2NN metadata)
      –Job
          •TT java heap
          •Running out of /tmp or /var/log/hadoop
          •Running out of HDFS space

Know your workload

  You can spend all the time in the world trying to get
   the best CPU/RAM/HDD/switch/cabinet
   configuration, but you are running on pure luck
   until you understand your cluster's workload.








  Questions?




Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 

Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)

  • 1. APOLLO GROUP. Hadoop Operations: Starting Out Small. So Your Cluster Isn't Yahoo-sized (yet). Michael Arnold, Principal Systems Engineer. 14 June 2012
  • 2. Agenda: Who; What (Definitions); Decisions for Now; Decisions for Later; Lessons Learned. © 2012 Apollo Group
  • 3. Who
  • 4. Who is Apollo? Apollo Group is a leading provider of higher education programs for working adults.
  • 5. Who is Michael Arnold? Systems Administrator. Automation geek. 13 years in IT. I deal with: server hardware specification/configuration; server firmware; server operating system; Hadoop application health; monitoring all the above.
  • 6. What: Definitions
  • 7. Definitions. Q: What is a tiny/small/medium/large cluster? A: Tiny: 1-9; Small: 10-99; Medium: 100-999; Large: 1000+; Yahoo-sized: 4000.
  • 8. Definitions. Q: What is a “headnode”? A: A server that runs one or more of the following Hadoop processes: NameNode, JobTracker, Secondary NameNode, ZooKeeper, HBase Master.
  • 9. Decisions for Now. What decisions should you make now and which can you postpone for later?
  • 10. Which Hadoop distribution? Amazon, Apache, Cloudera, Greenplum, Hortonworks, IBM, MapR, Platform Computing.
  • 11. Should you virtualize? Can be OK for small clusters, BUT: virtualization adds overhead; can cause performance degradation; cannot take advantage of Hadoop rack locality. Virtualization can be good for: functional testing of M/R job or workflow changes; evaluation of Hadoop upgrades.
  • 12. What sort of hardware should you be considering? Inexpensive. Not “enterprisey” hardware: no RAID*; no redundant power*. Low power consumption. No optical drives: get systems that can boot off the network. (* except in headnodes)
  • 13. Plan for capacity expansion. Start at the bottom and work your way up. Leave room in your cabinets for more machines.
  • 14. Plan for capacity expansion (cont.). Deploy your initial cluster in two cabinets: one headnode, one switch, and several (five) datanodes per cabinet.
  • 15. Plan for capacity expansion (cont.). Install a second cluster in the empty space in the upper half of the cabinet.
  • 16. Decisions for Later. What decisions should you make now and which can you postpone for later?
  • 17. What size cluster? Depends upon your: budget, data size, workload characteristics, SLA.
  • 18. What size cluster? (cont.) Are your MapReduce jobs compute-intensive? Reading lots of data? http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
  • 19. Should you implement rack awareness? If more than one switch in the cluster: YES.
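
Rack awareness is usually wired up through a topology script: Hadoop invokes an executable with one or more datanode IP addresses or hostnames as arguments and expects one rack path per argument on standard output. The following is a minimal sketch in Python, assuming one /24 subnet per cabinet. The subnets and rack names are hypothetical, and the property that points Hadoop at the script (`topology.script.file.name` in Hadoop 1.x-era releases) should be checked against your distribution.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script (hypothetical subnets and rack names)."""
import ipaddress
import sys

# Assumption: each cabinet has its own switch and its own /24 subnet.
RACKS = {
    "10.1.1.0/24": "/cabinet1",
    "10.1.2.0/24": "/cabinet2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host):
    """Return the rack path for one IP address, or the default rack."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames (or anything else unparseable) fall back to the default rack.
        return DEFAULT_RACK
    for subnet, rack in RACKS.items():
        if addr in ipaddress.ip_network(subnet):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Hadoop may pass several addresses at once; print one rack per argument.
    for arg in sys.argv[1:]:
        print(rack_for(arg))
```

With two cabinets and one switch each, as in the deployment described on slide 14, every datanode would resolve to one of the two rack paths.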
  • 20. Should you use automation? If not in the beginning, then as soon as possible. Boot disks will fail. Automated OS and application installs save time and reduce errors. Tools: Cobbler/Spacewalk/Foreman/xCat/etc; Puppet/Chef/Cfengine/shell scripts/etc.
  • 21. Lessons Learned
  • 22. Keep It Simple. Don't add redundancy and features (server/network) that will make things more complicated and expensive. Hadoop has built-in redundancies. Don't overlook them.
  • 23. Automate the Hardware. Twelve hours of manual work in the datacenter is not fun. Make sure all server firmware is configured identically: HP SmartStart Scripting Toolkit; Dell OpenManage Deployment Toolkit; IBM ServerGuide Scripting Toolkit.
  • 24. Rolling upgrades are possible (just not of the Hadoop software). Datanodes can be decommissioned, patched, and added back into the cluster without service downtime.
  • 25. The smallest thing can have a big impact on the cluster. A bad NIC/switchport can cause cluster slowness. Slow disks can cause intermittent job slowdowns.
  • 26. HDFS blocks are weird. On ext3/ext4:
      –Small blocks are not padded to the HDFS block size; they occupy only the actual size of the data.
      –Each HDFS block is actually two files on the datanode's filesystem: the actual data and a metadata/checksum file.
      # ls -l blk_1058778885645824207*
      -rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
      -rw-r--r-- 1 hdfs hdfs   283 May 14 01:26 blk_1058778885645824207_19155994.meta
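
The 283-byte `.meta` file in that listing is consistent with how HDFS lays out checksums, assuming the usual defaults: a 4-byte CRC32 for every 512 bytes of data (the `io.bytes.per.checksum` setting) plus a 7-byte header. The constants below are assumptions about those defaults, not values stated on the slide; the sketch only shows the arithmetic.

```python
import math

BYTES_PER_CHECKSUM = 512  # assumed default chunk size covered by one checksum
CHECKSUM_SIZE = 4         # a CRC32 checksum is 4 bytes
META_HEADER_SIZE = 7      # assumed: 2-byte version + 5-byte checksum header


def expected_meta_size(block_bytes):
    """Estimate the size of a block's .meta file from the block's data size."""
    chunks = math.ceil(block_bytes / BYTES_PER_CHECKSUM)
    return META_HEADER_SIZE + chunks * CHECKSUM_SIZE


print(expected_meta_size(35094))  # 283, matching the listing on the slide
```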
  • 27. Do not prematurely optimize. Be careful tuning your datanode filesystems.
      • mkfs -t ext4 -T largefile4 ... (probably bad)
      • mkfs -t ext4 -i 131072 -m 0 ... (better)
      /etc/mke2fs.conf
      [fs_types]
        hadoop = {
          features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
          inode_ratio = 131072
          blocksize = -1
          reserved_ratio = 0
          default_mntopts = acl,user_xattr
        }
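
Why the inode ratio matters: ext4 allocates roughly one inode per `inode_ratio` bytes of filesystem, and each HDFS block consumes two inodes (the data file and its `.meta` file). In the stock `mke2fs.conf`, `-T largefile4` corresponds to an inode ratio of 4 MiB, so a datanode holding many small blocks can run out of inodes long before it runs out of space. The estimate below is a rough sketch that assumes a hypothetical 2 TiB data disk and ignores mke2fs rounding and the inodes used by directories.

```python
TIB = 2**40
MIB = 2**20


def approx_inodes(fs_bytes, inode_ratio):
    """Approximate inode count: one inode per inode_ratio bytes."""
    return fs_bytes // inode_ratio


def max_hdfs_blocks(fs_bytes, inode_ratio):
    """Each HDFS block needs two files (data + .meta), so two inodes."""
    return approx_inodes(fs_bytes, inode_ratio) // 2


disk = 2 * TIB  # hypothetical datanode disk size
for label, ratio in (("-T largefile4", 4 * MIB), ("-i 131072", 131072)):
    print(label, approx_inodes(disk, ratio), max_hdfs_blocks(disk, ratio))
```

Under these assumptions the `largefile4` filesystem tops out at about a quarter of a million HDFS blocks per disk, while `-i 131072` allows about eight million.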
  • 28. Use DNS-friendly names for services.
      hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
      mapred.delta.hadoop.apollogrp.edu:8021
      http://oozie.delta.hadoop.apollogrp.edu:11000/
      hiveserver.delta.hadoop.apollogrp.edu:10000
      Yes, the names are long, but I bet you can figure out how to connect to Bravo Cluster.
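
The point of the convention is that an endpoint can be derived from the service and cluster names alone. A small sketch of that pattern follows; the service names, ports, and domain come from the slide, while the `endpoint` helper and the lowercase cluster label `bravo` are hypothetical.

```python
DOMAIN = "hadoop.apollogrp.edu"

# service name -> (URL scheme or None, port), as listed on the slide
SERVICES = {
    "hdfs": ("hdfs", 8020),
    "mapred": (None, 8021),
    "oozie": ("http", 11000),
    "hiveserver": (None, 10000),
}


def endpoint(service, cluster):
    """Build <service>.<cluster>.<domain>:<port>, with a scheme where one applies."""
    scheme, port = SERVICES[service]
    hostport = "{}.{}.{}:{}".format(service, cluster, DOMAIN, port)
    return "{}://{}/".format(scheme, hostport) if scheme else hostport


print(endpoint("hdfs", "bravo"))
print(endpoint("mapred", "bravo"))
```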
  • 29. Use a parallel, remote execution tool. pdsh/Cluster SSH/mussh/etc. SSH in a for loop is so 2010. FUNC/MCollective.
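
To illustrate what such tools do differently from a serial loop, here is a minimal sketch that fans one command out to many hosts concurrently. It is not a replacement for pdsh or MCollective; the function names are hypothetical, and it assumes key-based SSH access is already set up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run one command on one host over ssh; return (exit code, combined output)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def run_everywhere(hosts, command, run=ssh_run, max_workers=16):
    """Run `command` on all hosts concurrently; return {host: (code, output)}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(run, host, command) for host in hosts}
        # result() re-raises any exception (such as a timeout) from the worker.
        return {host: fut.result() for host, fut in futures.items()}
```

A call such as `run_everywhere(["dn01", "dn02"], "uptime")` (hostnames hypothetical) takes roughly as long as the slowest host rather than the sum of all hosts. The `run` parameter exists so the fan-out logic can be exercised without real SSH connections.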
  • 30. Make your log directories as large as you can. 20-100GB /var/log. Implement log purging cronjobs or your log directories will fill up. Beware: M/R jobs can fill up /tmp as well.
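
A log purging job can be as small as the sketch below, which removes regular files older than a given age. The function name and the retention period are hypothetical; it defaults to a dry run so the list of candidates can be reviewed before anything is deleted.

```python
import os
import time


def purge_old_files(directory, max_age_days, dry_run=True):
    """Delete regular files under `directory` older than `max_age_days`.

    Returns the paths that were (or, in a dry run, would be) removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # leave symlinks and special files alone
                if os.path.getmtime(path) < cutoff:
                    if not dry_run:
                        os.remove(path)
                    removed.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return removed
```

Run from cron against a directory such as `/var/log/hadoop` with, say, `max_age_days=14` and `dry_run=False`. The same function could be pointed at job scratch space under `/tmp`, with care not to remove files that running jobs still need.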
  • 31. Insist on IPMI 2.0 for out-of-band management of server hardware. Serial Over LAN is awesome when booting a system. Standardized hardware/temperature monitoring. Simple remote power control.
  • 32. Spanning-tree is the devil. Enable portfast on your server switch ports or the BMCs may never get a DHCP lease.
  • 33. Apollo has rebuilt its cluster four times. You may end up doing so as well.
  • 34. Apollo Timeline: First build. Cloudera Professional Services helped install CDH. Four nodes. OS built manually via USB CD-ROM. CDH2.
  • 35. Apollo Timeline: Second build. Cobbler. All software deployment is via kickstart. Very little is in Puppet. Config files are deployed via wget. CDH2.
  • 36. Apollo Timeline: Third build. OS filesystem partitioning needed to change. Most software deployment still via kickstart. CDH3b2.
  • 37. Apollo Timeline: Fourth build. HDFS filesystem inodes needed to be increased. Full Puppet automation. Added redundant/hotswap enterprise hardware for headnodes. CDH3u1.
  • 38. Cluster failures at Apollo. Hardware: disk failures (40+); disk cabling (6); RAM (2); switch port (1). Software, cluster-level: NFS (NN -> 2NN metadata). Software, job-level: TT java heap; running out of /tmp or /var/log/hadoop; running out of HDFS space.
  • 39. Know your workload. You can spend all the time in the world trying to get the best CPU/RAM/HDD/switch/cabinet configuration, but you are running on pure luck until you understand your cluster's workload.
  • 40. Questions?