Hadoop and Hive Development at Facebook

Dhruba Borthakur, Zheng Shao
{dhruba, zshao}@facebook.com
Presented at Hadoop World, New York
October 2, 2009
Hadoop @ Facebook
Who generates this data?

  Lots of data is generated on Facebook
    –  300+ million active users
    –  30 million users update their statuses at least once each
       day
    –  More than 1 billion photos uploaded each month
    –  More than 10 million videos uploaded each month
    –  More than 1 billion pieces of content (web links, news
       stories, blog posts, notes, photos, etc.) shared each
       week
Data Usage

  Statistics per day:
    –  4 TB of compressed new data added per day
    –  135TB of compressed data scanned per day
    –  7500+ Hive jobs on production cluster per day
    –  80K compute hours per day
  Barrier to entry is significantly reduced:
    –  New engineers go through a Hive training session
    –  ~200 people/month run jobs on Hadoop/Hive
    –  Analysts (non-engineers) use Hadoop through Hive
Where is this data stored?

  Hadoop/Hive Warehouse
  –  4800 cores, 5.5 PetaBytes
  –  12 TB per node
  –  Two-level network topology
       1 Gbit/sec from node to rack switch
       4 Gbit/sec to top-level rack switch
Data Flow into Hadoop Cloud

  [Diagram; components shown: Web Servers, Scribe MidTier, Network Storage and Servers, Hadoop Hive Warehouse, Oracle RAC, MySQL]
Hadoop Scribe: Avoid Costly Filers

  [Diagram; components shown: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, MySQL]

    http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
HDFS Raid

  Start the same: triplicate every data block
  Background encoding
   –  Combine third replica of blocks from a single file to
      create parity block
   –  Remove third replica
   –  Apache JIRA HDFS-503
  DiskReduce from CMU
   –  Garth Gibson research

  [Figure: a file with three blocks A, B and C; each block is stored three times, and a parity block A+B+C is built from them]

   http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
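
The parity idea on this slide can be sketched in a few lines. The slide writes the parity block as "A+B+C"; the sketch below assumes that means a bytewise XOR over equal-sized blocks. The class and method names are hypothetical and are not taken from HDFS-503.

```java
import java.util.Arrays;

/** Illustrative only: XOR parity over equal-sized blocks (an assumption, not the HDFS-503 code). */
final class XorParitySketch {

    /** Returns the bytewise XOR of all blocks; every block must have the same length. */
    static byte[] parity(byte[]... blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            if (block.length != out.length) {
                throw new IllegalArgumentException("blocks must be equal-sized");
            }
            for (int i = 0; i < out.length; i++) {
                out[i] ^= block[i];
            }
        }
        return out;
    }

    /** Rebuilds one lost block by XOR-ing the parity block with the surviving blocks. */
    static byte[] recover(byte[] parityBlock, byte[]... survivors) {
        byte[][] all = Arrays.copyOf(survivors, survivors.length + 1);
        all[survivors.length] = parityBlock;
        return parity(all);
    }
}
```

Because XOR is its own inverse, any single missing block among A, B and C can be rebuilt from the other two plus the parity block, which is what makes dropping the third replica tolerable in this scheme.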
Archival: Move old data to cheap storage

  [Diagram; components shown: Hive Query, Hadoop Warehouse, NFS, Hadoop Archive Node, Cheap NAS, Hadoop Archival Cluster]

    http://issues.apache.org/jira/browse/HDFS-220
Dynamic-size MapReduce Clusters

  Why multiple compute clouds in Facebook?
    –  Users unaware of resources needed by job
    –  Absence of flexible Job Isolation techniques
    –  Provide adequate SLAs for jobs
  Dynamically move nodes between clusters
    –  Based on load and configured policies
    –  Apache Jira MAPREDUCE-1044
Resource Aware Scheduling (Fair Share Scheduler)
  We use the Hadoop Fair Share Scheduler
    –  Scheduler unaware of memory needed by job
  Memory and CPU aware scheduling
    –  Real-time gathering of CPU and memory usage
    –  Scheduler analyzes memory consumption in real time
    –  Scheduler fair-shares memory usage among jobs
    –  Slot-less scheduling of tasks (in future)
    –  Apache Jira MAPREDUCE-961
Hive – Data Warehouse

  Efficient SQL to Map-Reduce Compiler

  Mar 2008: Started at Facebook
  May 2009: Release 0.3.0 available
  Now: Preparing for release 0.4.0


  Accounts for 95%+ of Hadoop jobs @ Facebook
  Used by ~200 engineers and business analysts at Facebook
   every month
Hive Architecture

  Web UI + Hive CLI + JDBC/ODBC
   –  Browse, Query, DDL
  MetaStore
  Thrift API
  Hive QL
   –  Parser, Planner, Optimizer, Execution
  Map Reduce
  User-defined Map-reduce Scripts
  UDF/UDAF
   –  substr, sum, average
  SerDe
   –  CSV, Thrift, Regex
  FileFormats
   –  TextFile, SequenceFile, RCFile
  HDFS
Hive DDL

  DDL
   –  Complex columns
   –  Partitions
   –  Buckets
  Example
   –  CREATE TABLE sales (
        id INT,
        items ARRAY<STRUCT<id:INT, name:STRING>>,
        extra MAP<STRING, STRING>
      ) PARTITIONED BY (ds STRING)
      CLUSTERED BY (id) INTO 32 BUCKETS;
Hive Query Language

  SQL
   –    Where
   –    Group By
   –    Equi-Join
   –    Sub query in from clause
  Example
   –  SELECT r.*, s.*
      FROM r JOIN (
        SELECT key, count(1) as count
        FROM s
        GROUP BY key) s
      ON r.key = s.key
      WHERE s.count > 100;
Group By

  4 different plans based on:
  –  Does data have skew?
  –  partial aggregation
  Map-side hash aggregation
  –  In-memory hash table in mapper to do partial
     aggregations
  2-map-reduce aggregation
  –  For distinct queries with skew and large cardinality
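
Map-side hash aggregation can be illustrated with a small sketch. It assumes a simple count per key: each mapper keeps an in-memory hash table for its own input split and emits partial counts, and the reducer only sums those partials. The class and method names are hypothetical and do not come from Hive's code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: partial aggregation in the mapper, final merge in the reducer. */
final class MapSideAggregationSketch {

    /** Mapper side: collapse the keys of one input split into per-key partial counts. */
    static Map<String, Long> partialCounts(List<String> keysInSplit) {
        Map<String, Long> table = new HashMap<>();
        for (String key : keysInSplit) {
            table.merge(key, 1L, Long::sum);
        }
        return table;
    }

    /** Reducer side: sum the partial counts emitted by every mapper. */
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }
}
```

The benefit is that a mapper sends at most one record per distinct key to the reducers rather than one record per input row, which matters most when a few keys dominate the data.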
Join

  Normal map-reduce Join
  –  Mapper sends all rows with the same key to a single
     reducer
  –  Reducer does the join
  Map-side Join
  –  Mapper loads the whole small table and a portion of big
     table
  –  Mapper does the join
  –  Much faster than map-reduce join
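
The map-side join described above amounts to a hash join that runs inside each mapper. The sketch below assumes rows are String arrays with the join key in element 0 and one payload value in element 1; the class name and row layout are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hash join of a whole small table against one portion of a big table. */
final class MapSideJoinSketch {

    /** Returns rows of {key, bigValue, smallValue} for every matching pair. */
    static List<String[]> join(List<String[]> smallTable, List<String[]> bigTablePortion) {
        // Build phase: index every small-table row by key, entirely in memory.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : smallTable) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: stream the big-table rows and emit one output row per match.
        List<String[]> joined = new ArrayList<>();
        for (String[] bigRow : bigTablePortion) {
            List<String[]> matches = index.get(bigRow[0]);
            if (matches == null) {
                continue; // inner join: unmatched big-table rows are dropped
            }
            for (String[] smallRow : matches) {
                joined.add(new String[] {bigRow[0], bigRow[1], smallRow[1]});
            }
        }
        return joined;
    }
}
```

No rows are shuffled to reducers in this scheme, which is consistent with the slide's claim that it is much faster than the normal map-reduce join. It only works when the small table fits in each mapper's memory.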
Sampling

  Efficient sampling
   –  Table can be bucketed
   –  Each bucket is a file
   –  Sampling can choose some buckets
  Example
   –  SELECT product_id, sum(price)
      FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
      GROUP BY product_id
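
A rough sketch of why bucketing makes sampling cheap. It assumes a row's bucket is the non-negative hash of its clustering column modulo the bucket count; that rule, the class name and the method names are assumptions of this sketch, not a statement of Hive's exact hash function.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hash-modulo bucketing and selection of a single bucket. */
final class BucketSamplingSketch {

    /** Assigns a row to a 0-based bucket from the hash of its clustering column. */
    static int bucketOf(int id, int numBuckets) {
        // Mask the sign bit so negative hashes still map to a valid bucket.
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numBuckets;
    }

    /** Keeps only the ids that fall in the given 1-based bucket, as in BUCKET x OUT OF y. */
    static List<Integer> sample(List<Integer> ids, int bucket, int numBuckets) {
        List<Integer> kept = new ArrayList<>();
        for (int id : ids) {
            if (bucketOf(id, numBuckets) == bucket - 1) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

In a bucketed table the rows are already split into one file per bucket at write time, so a sample only needs to read that one file. The filter here just shows which rows that file would contain.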
Multi-table Group-By/Insert

FROM users
INSERT OVERWRITE TABLE pv_gender_sum
   SELECT gender, count(DISTINCT userid)
   GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
   SELECT age, count(DISTINCT userid)
   GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
   SELECT country, gender, count(DISTINCT userid)
   GROUP BY country, gender;
File Formats

  TextFile:
   –  Easy for other applications to write/read
   –  Gzip text files are not splittable
  SequenceFile:
   –  Only Hadoop can read it
   –  Supports splittable compression
  RCFile: Block-based columnar storage
   –    Uses SequenceFile block format
   –    Columnar storage inside a block
   –    25% smaller compressed size
   –    On-par or better query performance depending on the query
SerDe

  Serialization/Deserialization
  Row Format
   –    CSV (LazySimpleSerDe)
   –    Thrift (ThriftSerDe)
   –    Regex (RegexSerDe)
   –    Hive Binary Format (LazyBinarySerDe)
  LazySimpleSerDe and LazyBinarySerDe
   –  Deserialize the field when needed
   –  Reuse objects across different rows
   –  Text and Binary format
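
The two lazy-SerDe ideas above, deserializing a field only when it is needed and reusing objects across rows, can be sketched as follows. The sketch assumes a delimited text row with integer fields; the class, its methods and the `parsedFields` counter are hypothetical and are not the LazySimpleSerDe API.

```java
import java.util.regex.Pattern;

/** Illustrative only: one row object reused for many rows, fields parsed on demand. */
final class LazyRowSketch {
    private final Pattern splitter;
    private String raw;        // current serialized row, untouched until a field is read
    private String[] text;     // field boundaries, found on first access
    private Integer[] ints;    // parsed values, filled one field at a time
    int parsedFields = 0;      // counts int conversions, for demonstration only

    LazyRowSketch(char delimiter) {
        this.splitter = Pattern.compile(Pattern.quote(String.valueOf(delimiter)));
    }

    /** Points this same object at the next row instead of allocating a new row object. */
    void reset(String nextRaw) {
        raw = nextRaw;
        text = null; // the arrays are rebuilt on demand; a real implementation could reuse them too
        ints = null;
    }

    /** Converts field `index` to an int only when asked, then caches the result. */
    int getInt(int index) {
        if (text == null) {
            text = splitter.split(raw, -1);
            ints = new Integer[text.length];
        }
        if (ints[index] == null) {
            ints[index] = Integer.parseInt(text[index]);
            parsedFields++;
        }
        return ints[index];
    }
}
```

A query that touches one column never pays the conversion cost for the others, which fits the drop in query times after Lazy Deserialization in the performance table.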
UDF/UDAF

  Features:
   –    Use either Java or Hadoop Objects (int, Integer, IntWritable)
   –    Overloading
   –    Variable-length arguments
   –    Partial aggregation for UDAF
  Example UDF:
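   –  import org.apache.hadoop.hive.ql.exec.UDF;

      public class UDFExampleAdd extends UDF {
        // Hive looks up evaluate() by reflection on the argument types
        public int evaluate(int a, int b) {
          return a + b;
        }
      }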
Hive – Performance

  Date       SVN Revision  Major Changes                Query A  Query B  Query C
  2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
  2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
  3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
  4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
  6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
  8/5/2009   801497        Lazy Binary Format *         21 sec   48 sec   132 sec

  Query A: SELECT count(1) FROM t;
  Query B: SELECT concat(concat(concat(a,b),c),d) FROM t;
  Query C: SELECT * FROM t;
  Times are map-side only (incl. GzipCodec for comp/decompression)
  * These two features need to be tested with other queries.
Hive – Future Works

    Indexes
    Create table as select
    Views / variables
    Explode operator
    In/Exists sub queries
    Leverage sort/bucket information in Join
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 

Hadoop and Hive Development at Facebook

  • 1. Hadoop and Hive Development at Facebook Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009
  • 3. Who generates this data?   Lots of data is generated on Facebook –  300+ million active users –  30 million users update their statuses at least once each day –  More than 1 billion photos uploaded each month –  More than 10 million videos uploaded each month –  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  • 4. Data Usage   Statistics per day: –  4 TB of compressed new data added per day –  135TB of compressed data scanned per day –  7500+ Hive jobs on production cluster per day –  80K compute hours per day   Barrier to entry is significantly reduced: –  New engineers go through a Hive training session –  ~200 people/month run jobs on Hadoop/Hive –  Analysts (non-engineers) use Hadoop through Hive
  • 5. Where is this data stored?   Hadoop/Hive Warehouse –  4800 cores, 5.5 PetaBytes –  12 TB per node –  Two level network topology   1 Gbit/sec from node to rack switch   4 Gbit/sec to top level rack switch
  • 6. Data Flow into Hadoop Cloud [Diagram: Network Storage and Servers; Web Servers; Scribe MidTier; Oracle RAC; Hadoop Hive Warehouse; MySQL]
  • 7. Hadoop Scribe: Avoid Costly Filers [Diagram: Scribe Writers; Realtime Hadoop Cluster; Web Servers; Scribe MidTier; Oracle RAC; Hadoop Hive Warehouse; MySQL] http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
  • 8. HDFS Raid   Start the same: triplicate every data block   Background encoding –  Combine third replica of blocks from a single file to create parity block –  Remove third replica –  Apache JIRA HDFS-503   DiskReduce from CMU –  Garth Gibson research [Diagram: a file with three blocks A, B and C, and the parity block A+B+C] http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
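
The parity idea on the HDFS Raid slide can be illustrated with a plain XOR over the three blocks. This is only a sketch under the assumption that the A+B+C parity block is a byte-wise XOR; the class and method names are hypothetical and are not taken from HDFS-503.

```java
import java.util.Arrays;

// Illustrative sketch only: XOR parity over equal-sized blocks.
// XorParitySketch and xor() are hypothetical names, not HDFS APIs.
class XorParitySketch {

    // XOR the given equal-length blocks byte by byte.
    static byte[] xor(byte[]... blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            if (block.length != out.length) {
                throw new IllegalArgumentException("blocks must have equal length");
            }
            for (int i = 0; i < out.length; i++) {
                out[i] ^= block[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {4, 5, 6};
        byte[] c = {7, 8, 9};

        // Parity block computed in the background from A, B and C.
        byte[] parity = xor(a, b, c);

        // If every copy of B were lost, XOR-ing the survivors with the
        // parity block rebuilds it.
        byte[] rebuiltB = xor(a, c, parity);
        System.out.println(Arrays.equals(b, rebuiltB)); // prints "true"
    }
}
```

For the three-block file on the slide, keeping two replicas of each block plus one parity block means 7 stored blocks instead of 9, roughly 2.33x storage instead of 3x.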
  • 9. Archival: Move old data to cheap storage [Diagram: Hadoop Warehouse; NFS; Hadoop Archive Node; Cheap NAS; Hadoop Archival Cluster; Hive Query] http://issues.apache.org/jira/browse/HDFS-220
  • 10. Dynamic-size MapReduce Clusters   Why multiple compute clouds in Facebook? –  Users unaware of resources needed by job –  Absence of flexible Job Isolation techniques –  Provide adequate SLAs for jobs   Dynamically move nodes between clusters –  Based on load and configured policies –  Apache Jira MAPREDUCE-1044
  • 11. Resource Aware Scheduling (Fair Share Scheduler)   We use the Hadoop Fair Share Scheduler –  Scheduler unaware of memory needed by job   Memory and CPU aware scheduling –  RealTime gathering of CPU and memory usage –  Scheduler analyzes memory consumption in realtime –  Scheduler fair-shares memory usage among jobs –  Slot-less scheduling of tasks (in future) –  Apache Jira MAPREDUCE-961
  • 12. Hive – Data Warehouse   Efficient SQL to Map-Reduce Compiler   Mar 2008: Started at Facebook   May 2009: Release 0.3.0 available   Now: Preparing for release 0.4.0   Accounts for 95%+ of Hadoop jobs @ Facebook   Used by ~200 engineers and business analysts at Facebook every month
  • 13. Hive Architecture [Diagram: Web UI + Hive CLI + JDBC/ODBC (Browse, Query, DDL); Hive QL (Parser, Planner, Optimizer, Execution); MetaStore (Thrift API); Map Reduce with user-defined map-reduce scripts; UDF/UDAF (substr, sum, average); HDFS; FileFormats (TextFile, SequenceFile, RCFile); SerDe (CSV, Thrift, Regex)]
  • 14. Hive DDL   DDL –  Complex columns –  Partitions –  Buckets   Example –  CREATE TABLE sales ( id INT, items ARRAY<STRUCT<id:INT, name:STRING>>, extra MAP<STRING, STRING> ) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
  • 15. Hive Query Language   SQL –  Where –  Group By –  Equi-Join –  Sub query in from clause   Example –  SELECT r.*, s.* FROM r JOIN ( SELECT key, count(1) as count FROM s GROUP BY key) s ON r.key = s.key WHERE s.count > 100;
  • 16. Group By   4 different plans based on: –  Does data have skew? –  partial aggregation   Map-side hash aggregation –  In-memory hash table in mapper to do partial aggregations   2-map-reduce aggregation –  For distinct queries with skew and large cardinality
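
The map-side hash aggregation described on the Group By slide can be sketched as follows. This is a minimal illustration, not Hive's implementation: each mapper keeps an in-memory hash table of partial counts, and the reducer merges the partial results. All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: count(1) ... GROUP BY key with partial
// aggregation in the mapper. Names are hypothetical, not Hive classes.
class MapSideAggSketch {

    // Mapper side: aggregate the keys of one input split in a hash table,
    // so one (key, partialCount) pair per distinct key is emitted instead
    // of one pair per row.
    static Map<String, Long> partialCounts(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // Reducer side: add up the partial counts received from all mappers.
    static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> mapper1 = partialCounts(Arrays.asList("a", "b", "a"));
        Map<String, Long> mapper2 = partialCounts(Arrays.asList("b", "c"));
        System.out.println(mergeCounts(Arrays.asList(mapper1, mapper2)));
    }
}
```

The benefit is in the shuffle: the first mapper above sends two pairs to the reducer rather than three rows, and the saving grows with the number of repeated keys per split.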
  • 17. Join   Normal map-reduce Join –  Mapper sends all rows with the same key to a single reducer –  Reducer does the join   Map-side Join –  Mapper loads the whole small table and a portion of big table –  Mapper does the join –  Much faster than map-reduce join
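
A map-side join as described on the Join slide can be sketched like this: the small table is held entirely in the mapper's memory as a hash table, and each row of the mapper's portion of the big table is probed against it, so no reducer is needed. The sketch assumes unique keys in the small table and an inner equi-join; the names and the row layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: inner equi-join done entirely in the mapper.
// Assumes the small table fits in memory and has unique keys.
class MapSideJoinSketch {

    // smallTable: join key -> value, fully loaded by every mapper.
    // bigSplit: rows {key, value} from this mapper's portion of the big table.
    static List<String> join(Map<Integer, String> smallTable, List<String[]> bigSplit) {
        List<String> joined = new ArrayList<>();
        for (String[] row : bigSplit) {
            int key = Integer.parseInt(row[0]);
            String match = smallTable.get(key); // hash lookup, no shuffle
            if (match != null) {
                joined.add(key + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> countries = new HashMap<>();
        countries.put(1, "US");
        countries.put(2, "CA");

        List<String[]> users = new ArrayList<>();
        users.add(new String[] {"1", "alice"});
        users.add(new String[] {"3", "bob"});   // no match, dropped
        users.add(new String[] {"2", "carol"});

        for (String line : join(countries, users)) {
            System.out.println(line);
        }
    }
}
```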
  • 18. Sampling   Efficient sampling –  Table can be bucketed –  Each bucket is a file –  Sampling can choose some buckets   Example –  SELECT product_id, sum(price) FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) GROUP BY product_id
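
The bucket-based sampling on the Sampling slide can be sketched as below. The sketch assumes a row's bucket is a non-negative hash of the clustering column modulo the bucket count; Hive's actual hash function is not reproduced here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: which rows fall into a sampled bucket.
// The hash used here is an assumption, not Hive's bucketing function.
class BucketSampleSketch {

    // Zero-based bucket index for an id.
    static int bucketOf(int id, int numBuckets) {
        return Math.floorMod(Integer.hashCode(id), numBuckets);
    }

    // Keep the ids in the given bucket. The bucket number is 1-based,
    // mirroring TABLESAMPLE (BUCKET x OUT OF y).
    static List<Integer> sample(List<Integer> ids, int bucket, int numBuckets) {
        List<Integer> out = new ArrayList<>();
        for (int id : ids) {
            if (bucketOf(id, numBuckets) == bucket - 1) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 64; i++) {
            ids.add(i);
        }
        System.out.println(sample(ids, 1, 32)); // prints "[0, 32]"
    }
}
```

The sketch filters rows to show bucket membership. On a bucketed table, each bucket is its own file, so sampling can read just the chosen file rather than scanning every row.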
  • 19. Multi-table Group-By/Insert FROM users INSERT INTO TABLE pv_gender_sum SELECT gender, count(DISTINCT userid) GROUP BY gender INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir' SELECT age, count(DISTINCT userid) GROUP BY age INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir' SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
  • 20. File Formats   TextFile: –  Easy for other applications to write/read –  Gzip text files are not splittable   SequenceFile: –  Only hadoop can read it –  Support splittable compression   RCFile: Block-based columnar storage –  Use SequenceFile block format –  Columnar storage inside a block –  25% smaller compressed size –  On-par or better query performance depending on the query
  • 21. SerDe   Serialization/Deserialization   Row Format –  CSV (LazySimpleSerDe) –  Thrift (ThriftSerDe) –  Regex (RegexSerDe) –  Hive Binary Format (LazyBinarySerDe)   LazySimpleSerDe and LazyBinarySerDe –  Deserialize the field when needed –  Reuse objects across different rows –  Text and Binary format
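
The two ideas listed for the lazy SerDes, deserializing a field only when it is needed and reusing objects across rows, can be illustrated with a deliberately coarse sketch. It assumes comma-delimited text rows and splits the whole row on first field access, which is simpler than per-field laziness; `LazyRowSketch` is a hypothetical class, not part of Hive.

```java
// Illustrative sketch only: a row object that defers parsing until a field
// is requested and is reused for every row. Not Hive's LazySimpleSerDe.
class LazyRowSketch {
    private String raw;      // the undecoded text of the current row
    private String[] fields; // null until a field is first requested
    private int splits = 0;  // how many rows were actually parsed

    // Point the same object at the next row instead of allocating a new one.
    void reset(String rawLine) {
        raw = rawLine;
        fields = null;
    }

    // Parse on first access only; later accesses reuse the parsed fields.
    String getField(int index) {
        if (fields == null) {
            fields = raw.split(",", -1);
            splits++;
        }
        return fields[index];
    }

    int splitCount() {
        return splits;
    }

    public static void main(String[] args) {
        LazyRowSketch row = new LazyRowSketch();
        String[] lines = {"1,alice,US", "2,bob,CA", "3,carol,US"};

        // A query like SELECT count(1) never touches a field,
        // so no row is ever parsed.
        int count = 0;
        for (String line : lines) {
            row.reset(line);
            count++;
        }
        System.out.println(count + " rows, " + row.splitCount() + " parsed");
    }
}
```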
  • 22. UDF/UDAF   Features: –  Use either Java or Hadoop Objects (int, Integer, IntWritable) –  Overloading –  Variable-length arguments –  Partial aggregation for UDAF   Example UDF: –  public class UDFExampleAdd extends UDF { public int evaluate(int a, int b) { return a + b; } }
  • 23. Hive – Performance
      Date       SVN Revision  Major Changes                Query A  Query B  Query C
      2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
      2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
      3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
      4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
      6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
      8/5/2009   801497        Lazy Binary Format *         21 sec   48 sec   132 sec
      QueryA: SELECT count(1) FROM t;
      QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
      QueryC: SELECT * FROM t;
      Map-side time only (incl. GzipCodec for comp/decompression)
      * These two features need to be tested with other queries.
  • 24. Hive – Future Works   Indexes   Create table as select   Views / variables   Explode operator   In/Exists sub queries   Leverage sort/bucket information in Join