Hive Integration: HBase and RCFile
John Sichi and Yongqiang He
Facebook

Session Agenda
  • HBase Integration (John Sichi)
  • RCFile Integration (Yongqiang He)

HBase: Facebook Warehouse Use Case
  • Reduce latency on dimension data availability
  • Diagram labels: HBase (dimension data); partitioned RCFiles (fact data); periodic load; continuous update; Hive queries

HBase: Storage Handler
    CREATE TABLE users(
      userid int, name string, email string, notes string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES (
      "hbase.table.name" = "user_list");
  • INSERT, SELECT, JOIN, GROUP BY, UNION, etc.

HBase: Integration Status
  • Testing at scale
      • 20-node test cluster
      • Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
      • Incremental-loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
  • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)

HBase: Integration Roadmap
  • Retest against HBase trunk with larger (30TB) data
  • Try out new features for accelerating incremental load
      • Bulk load into table with existing data
      • Multiputs
      • Deferred logging
  • Support for "virtual partitions" based on timestamps
  • Support for deletion
  • Push down filters
  • Index join? Optimize scans?

RCFile: Why Columnar Storage
  • Better compression
      • Lightweight compression: RLE, bit-map, etc.
      • CPU, memory, storage
  • Columnar operators
      • Cache conscious (MonetDB)

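As a rough illustration of why storing a column's values contiguously helps lightweight compression, here is a minimal run-length encoding (RLE) sketch in Python. It is not RCFile's actual encoder; the function names `rle_encode` and `rle_decode` and the sample column are hypothetical and chosen only for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical low-cardinality column stored contiguously (assumption).
country_column = ["US", "US", "US", "IN", "IN", "US"]
encoded = rle_encode(country_column)
decoded = rle_decode(encoded)
```

In a row-oriented layout the same values would be interleaved with other fields, so runs like these would rarely form.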
RCFile: Why RCFile
  • Huge data
      • Reduce data storage space required
  • Ad-hoc workloads
  • Storage space vs. speed (data performance)
  • Can we get both with no application changes?
      • Reduce storage space
      • Accelerate performance for arbitrary applications

RCFile: Pros
  • Works with column pruning
      • Only touch needed columns at runtime
  • Lazy decompression
      • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
      • Will only touch col1 and col2
      • col2 is decompressed only when a block contains a col1 value greater than 30

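The column pruning and lazy decompression behaviour described on this slide can be sketched in Python as follows. This is a toy model under stated assumptions, not RCFile's reader: the `LazyColumn` class, the `scan` and `make_group` helpers, the use of zlib, and the sample data are all hypothetical. Each "row group" holds columns compressed separately, and a column is decompressed only on first access.

```python
import zlib


class LazyColumn:
    """One column of one row group, kept compressed until first read."""

    def __init__(self, values):
        text = ",".join(str(v) for v in values)
        self._blob = zlib.compress(text.encode())
        self._values = None
        self.decompress_count = 0  # instrumentation for the example only

    def values(self):
        if self._values is None:
            self.decompress_count += 1
            text = zlib.decompress(self._blob).decode()
            self._values = [int(v) for v in text.split(",")]
        return self._values


def scan(row_groups, threshold):
    """Model of: SELECT col1, col2 FROM t WHERE col1 > threshold."""
    results = []
    for group in row_groups:
        col1 = group["col1"].values()  # predicate column is always read
        hits = [i for i, v in enumerate(col1) if v > threshold]
        if not hits:
            continue  # col2 stays compressed for this row group
        col2 = group["col2"].values()
        results.extend((col1[i], col2[i]) for i in hits)
    return results  # col3 is never touched (column pruning)


def make_group(col1, col2, col3):
    return {name: LazyColumn(vals)
            for name, vals in (("col1", col1), ("col2", col2), ("col3", col3))}


groups = [
    make_group([1, 5, 9], [10, 50, 90], [0, 0, 0]),
    make_group([20, 40, 31], [200, 400, 310], [0, 0, 0]),
]
rows = scan(groups, 30)
```

In this run the first row group has no col1 value above 30, so its col2 is never decompressed, and col3 is never decompressed in either group.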
RCFile: Cons
  • Row construction is the main overhead
      • Each column's data is stored separately, and may be sorted in a different order
  • In-memory operation for RCFile
      • This could be really painful; a lot of room to improve here

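To make the row construction cost concrete, here is a minimal Python sketch of stitching separately stored columns back into rows. It assumes the columns of a row group are already decompressed and aligned by position; the `construct_rows` function and the sample data are hypothetical, and a real reader would also have to handle deserialization and any differing sort order.

```python
def construct_rows(columns, wanted):
    """Rebuild row tuples from per-column value lists, in `wanted` order."""
    selected = [columns[name] for name in wanted]
    # zip walks all selected columns in lockstep, one output tuple per row;
    # this per-row work is what a row-oriented file gets for free.
    return list(zip(*selected))


# Hypothetical row group with three columns of equal length (assumption).
row_group = {
    "userid": [1, 2, 3],
    "name": ["ann", "bob", "cy"],
    "email": ["a@x", "b@x", "c@x"],
}
rows = construct_rows(row_group, ["userid", "name"])
```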
RCFile: Facebook Deployment
  • Default file format in Facebook cluster
  • 20% space savings on average
  • We are transforming old data to the new format

RCFile: Future Work
  • Support built-in indexing (like bloom filters, etc.)
  • More cache-conscious columnar operators
  • Pushing predicates to the file reader

Questions?

Hive Integration: HBase and RCFile (Hadoop Summit 2010)
