Hive Evolution
A Progress Report
November 2010
John Sichi (Facebook)
Agenda
• Hive Overview
• Version 0.6 (just released!)
• Version 0.7 (under development)
• Hive is now a TLP!
• Roadmaps
What is Hive?
• A Hadoop-based system for querying
and managing structured data
– Uses Map/Reduce for execution
– Uses Hadoop Distributed File System
(HDFS) for storage
Hive Origins
• Data explosion at Facebook
• Traditional DBMS technology could
not keep up with the growth
• Hadoop to the rescue!
• Incubation with ASF, then became a
Hadoop sub-project
• Now a top-level ASF project
Hive Evolution
• Originally:
– a way for Hadoop users to express
queries in a high-level language without
having to write map/reduce programs
• Now more and more:
– A parallel SQL DBMS which happens to
use Hadoop for its storage and
execution architecture
Intended Usage
• Web-scale Big Data
– 100’s of terabytes
• Large Hadoop cluster
– 100’s of nodes (heterogeneous OK)
• Data has a schema
• Batch jobs
– for both loads and queries
So Don’t Use Hive If…
• Your data is measured in GB
• You don’t want to impose a schema
• You need responses in seconds
• A “conventional” analytic DBMS can
already do the job
– (and you can afford it)
• You don’t have a lot of time and smart
people
Scaling Up
• Facebook warehouse, July 2010:
– 2250 nodes
– 36 petabytes disk space
• Data access per day:
– 80 to 90 terabytes added
(uncompressed)
– 25000 map/reduce jobs
• 300-400 users/month
Facebook Deployment
Web Servers Scribe MidTier
Production
Hive-Hadoop
Cluster
Sharded MySQL
Scribe-Hadoop
Clusters
Adhoc
Hive-Hadoop
Cluster
Hive replication
Hive Architecture
Metastore
Query Engine
CLI
Hive Thrift API
Metastore
Thrift API
JDBC/ODBC
clients
Hadoop Map/Reduce
+ HDFS Clusters
Web
Management
Console
Physical Data Model
clicks
ds='2010-10-28'
ds='2010-10-29'
ds='2010-10-30'
Partitions
(possibly
multi-level)
Table HDFS Files
(possibly as
hash buckets)
Map/Reduce Plans
Input Files Map
Tasks
Reduce Tasks
Splits
Result Files
Query Translation Example
• SELECT url, count(*) FROM
page_views GROUP BY url
• Map tasks compute partial counts for
each URL in a hash table
– “map side” preaggregation
– map outputs are partitioned by URL and
shipped to corresponding reducers
• Reduce tasks tally up partial counts to
produce final results
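The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's actual operator code: the function names (map_task, shuffle, reduce_task) are hypothetical, and Python's built-in hash() stands in for whatever partitioner the cluster really uses.

```python
from collections import Counter, defaultdict

def map_task(split):
    """Map side: pre-aggregate one input split into partial counts per URL."""
    return Counter(split)

def shuffle(map_outputs, num_reducers):
    """Partition partial counts by URL so a single reducer sees every partial for a URL."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for partial in map_outputs:
        for url, count in partial.items():
            # hash() is an assumed stand-in for the real partitioning function
            reducer_inputs[hash(url) % num_reducers][url].append(count)
    return reducer_inputs

def reduce_task(grouped):
    """Reduce side: tally the partial counts into final counts."""
    return {url: sum(counts) for url, counts in grouped.items()}

def count_by_url(splits, num_reducers=2):
    map_outputs = [map_task(split) for split in splits]
    result = {}
    for grouped in shuffle(map_outputs, num_reducers):
        result.update(reduce_task(grouped))
    return result

# Three input splits of a toy page_views table (url column only)
splits = [["/a", "/b", "/a"], ["/b", "/c"], ["/a"]]
totals = count_by_url(splits)
```

Because each mapper ships one partial count per distinct URL instead of one record per row, the amount of data crossing the network shrinks whenever URLs repeat within a split.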
It Gets Quite Complicated!
Behavior Extensibility
• TRANSFORM scripts (any language)
– Serialization+IPC overhead
• User defined functions (Java)
– In-process, lazy object evaluation
• Pre/Post Hooks (Java)
– Statement validation/execution
– Example uses: auditing, replication,
authorization
UDF vs UDAF vs UDTF
• User Defined Function
• One-to-one row mapping
• Concat('foo', 'bar')
• User Defined Aggregate Function
• Many-to-one row mapping
• Sum(num_ads)
• User Defined Table Function
• One-to-many row mapping
• Explode([1,2,3])
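The three row mappings can be illustrated with plain Python functions. This is a sketch of the shapes only; real Hive UDFs, UDAFs and UDTFs are Java classes, and the names below are illustrative stand-ins rather than Hive APIs.

```python
def concat(a, b):
    """UDF-like: one input row produces exactly one output value."""
    return a + b

def total(values):
    """UDAF-like: many input rows collapse into one output value."""
    return sum(values)

def explode(array):
    """UDTF-like: one input row fans out into many output rows."""
    for item in array:
        yield (item,)

one_to_one = concat("foo", "bar")
many_to_one = total([3, 4, 5])
one_to_many = list(explode([1, 2, 3]))
```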
Storage Extensibility
• Input/OutputFormat: file formats
– SequenceFile, RCFile, TextFile, …
• SerDe: row formats
– Thrift, JSON, ProtocolBuffer, …
• Storage Handlers (new in 0.6)
– Integrate foreign metadata, e.g. HBase
• Indexing
– Under development in 0.7
Release 0.6
• October 2010
– Views
– Multiple Databases
– Dynamic Partitioning
– Automatic Merge
– New Join Strategies
– Storage Handlers
Views: Syntax
CREATE VIEW [IF NOT EXISTS]
view_name
[ (column_name [COMMENT
column_comment], … ) ]
[COMMENT 'view_comment']
AS SELECT …
[ ORDER BY … LIMIT … ]
Views: Usage
• Use Cases
– Column/table renaming
– Encapsulate complex query logic
– Security (Future)
• Limitations
– Read-only
– Obscures partition metadata from
underlying tables
– No dependency management
Multiple Databases
• Follows MySQL convention
– CREATE DATABASE [IF NOT EXISTS]
db_name [COMMENT 'db_comment']
– USE db_name
• Logical namespace for tables
• ‘default’ database is still there
• Does not yet support queries across
multiple databases
Dynamic Partitions: Syntax
• Example
INSERT OVERWRITE TABLE page_view
PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid,
pvs.page_url, pvs.referrer_url, null, null,
pvs.ip, pvs.country
FROM page_view_stg pvs
Dynamic Partitions: Usage
• Automatically create partitions based
on distinct values in columns
• Works as rudimentary indexing
– Prune partitions via WHERE clause
• But be careful…
– Don’t create too many partitions!
– Configuration parameters can be used to
prevent accidents
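A minimal Python sketch of the behavior described above: rows are routed to partitions by the distinct values of one column, a cap guards against creating too many partitions, and a WHERE-style filter prunes partitions without touching their rows. The function names and the max_partitions parameter are hypothetical; they only stand in for the configuration parameters the slide mentions.

```python
from collections import defaultdict

class TooManyPartitions(Exception):
    pass

def dynamic_partition_insert(rows, static_spec, dynamic_col, max_partitions=100):
    """Group rows into partitions keyed by the static spec plus the dynamic column value."""
    partitions = defaultdict(list)
    for row in rows:
        row = dict(row)                      # copy so the caller's row is untouched
        value = row.pop(dynamic_col)         # partition column is not stored in the data
        spec = tuple(static_spec.items()) + ((dynamic_col, value),)
        if spec not in partitions and len(partitions) >= max_partitions:
            raise TooManyPartitions("would create more than %d partitions" % max_partitions)
        partitions[spec].append(row)
    return dict(partitions)

def prune(partitions, **wanted):
    """Keep only partitions whose spec matches the filter; rows are never scanned."""
    return {spec: rows for spec, rows in partitions.items()
            if all(dict(spec).get(k) == v for k, v in wanted.items())}

rows = [
    {"userid": 1, "country": "US"},
    {"userid": 2, "country": "CA"},
    {"userid": 3, "country": "US"},
]
parts = dynamic_partition_insert(rows, {"dt": "2008-06-08"}, "country")
canada_only = prune(parts, country="CA")
```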
Automatic merge
• Jobs can produce many files
• Why is this bad?
– Namenode pressure
– Downstream jobs have to deal with file
processing overhead
• So, clean up by merging results into a
few large files (configurable)
– Use conditional map-only task to do this
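One way to picture the conditional merge step is the Python sketch below. It assumes a simple policy (merge only when the average output file is small, then pack files greedily up to a target size); the policy, the function name plan_merge and both threshold parameters are assumptions for illustration, not a description of Hive's actual implementation.

```python
def plan_merge(file_sizes, avg_size_threshold, target_size):
    """Return groups of files; each group would become one merged output file."""
    if not file_sizes or sum(file_sizes) / len(file_sizes) >= avg_size_threshold:
        # Condition not met: files are already large enough, skip the merge task
        return [[size] for size in file_sizes]
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new output file once adding this input would exceed the target
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

merged = plan_merge([10, 20, 30, 40], avg_size_threshold=50, target_size=60)
untouched = plan_merge([100, 200], avg_size_threshold=50, target_size=60)
```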
Join Strategies Before 0.6
• Map/reduce join
– Map tasks partition inputs on join keys
and ship to corresponding reducers
– Reduce tasks perform sort-merge-join
• Map-join
– Each mapper builds lookup hashtable
from copy of small table
– Then hash-join the splits of big table
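The map-join can be sketched as an in-memory hash join in Python. This is an illustration under simplifying assumptions (rows are dicts, the join is an inner equi-join on one key); map_join is a hypothetical name, not a Hive API.

```python
def map_join(big_split, small_table, key):
    """Join one split of the big table against a full copy of the small table."""
    # Build phase: every mapper builds the same lookup table from the small table
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the big-table split through the lookup table
    for row in big_split:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
views_split = [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}, {"uid": 1, "url": "/c"}]
joined = list(map_join(views_split, users, "uid"))
```

No reduce phase is needed because the big table is never repartitioned; the cost is that every mapper must hold the whole small table in memory.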
New Join Strategies
• Bucketed map-join
– Each mapper filters its lookup table by
the bucketing hash function
– Allows “small” table to be much bigger
• Sorted merge in map-join
– Requires presorted input tables
• Deal with skew in map/reduce join
– Conditional plan step for skew keys
(after main map/reduce join step)
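The bucketed map-join idea can be sketched in Python as follows. The sketch assumes both tables are bucketed on an integer join key with the same bucket count, and uses key modulo bucket count as a stand-in for the bucketing hash function; all names are hypothetical.

```python
def bucket_of(key, num_buckets):
    # Assumed bucketing function: integer key modulo the bucket count
    return key % num_buckets

def bucketize(rows, key, num_buckets):
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets

def bucketed_map_join(big_rows, small_rows, key, num_buckets):
    """Each 'mapper' handles one big-table bucket and loads only the matching small bucket."""
    big_buckets = bucketize(big_rows, key, num_buckets)
    small_buckets = bucketize(small_rows, key, num_buckets)
    joined = []
    for b in range(num_buckets):
        # The lookup table holds roughly 1/num_buckets of the small table
        lookup = {}
        for row in small_buckets[b]:
            lookup.setdefault(row[key], []).append(row)
        for row in big_buckets[b]:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined

big = [{"uid": i % 5, "url": "/p%d" % i} for i in range(10)]
small = [{"uid": u, "name": "user%d" % u} for u in (0, 2, 4, 7)]
result = bucketed_map_join(big, small, "uid", num_buckets=3)
```

Since each mapper only needs its own slice of the lookup table, the "small" table can grow by roughly a factor of the bucket count before memory becomes the limit.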
Storage Handlers
Hive
HDFS
Native
Tables
Storage
Handler
Interface
HBase
Handler
Cassandra
Handler
Hypertable
Handler
Hypertable
API
Cassandra
API
HBase
API
HBase
Tables
Low Latency Warehouse
[Diagram labels: HBase; Other Files/Tables; Periodic Load; Continuous Update; Hive Queries]
Storage Handler Syntax
• HBase Example
CREATE TABLE users(
userid int, name string, email string, notes string)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
"small:name,small:email,large:notes")
TBLPROPERTIES (
"hbase.table.name" = "user_list");
Release 0.7
• In development
– Concurrency
Control
– Stats Collection
– Stats Functions
– Indexes
– Local Mode
– Faster map join
– Multiple DISTINCT
aggregates
– Archiving
– JDBC/ODBC
improvements
Concurrency Control
• Pluggable distributed lock manager
– Default is Zookeeper-based
• Simple read/write locking
• Table-level and partition-level
• Implicit locking (statement level)
– Deadlock-free via lock ordering
• Explicit LOCK TABLE (global)
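Why lock ordering avoids deadlock can be shown with a small Python sketch. It is deliberately simplified: it uses exclusive in-process locks only (no read/write distinction, no Zookeeper), and the LockManager class is a hypothetical stand-in rather than Hive's lock manager interface. The point is that every statement acquires its locks in one global sorted order, so two statements can never each hold a lock the other is waiting for.

```python
import threading

class LockManager:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def acquire_all(self, names):
        """Acquire locks for all named tables/partitions in sorted order."""
        ordered = sorted(set(names))
        for name in ordered:
            self._lock_for(name).acquire()
        return ordered

    def release_all(self, names):
        """Release in reverse order of acquisition."""
        for name in reversed(sorted(set(names))):
            self._locks[name].release()

manager = LockManager()

def run_statement(names, repeats=200):
    # Each 'statement' implicitly locks everything it touches, then releases
    for _ in range(repeats):
        held = manager.acquire_all(names)
        manager.release_all(held)

# Two statements that name the same objects in opposite order
t1 = threading.Thread(target=run_statement, args=(["clicks", "users/ds=2010-10-30"],), daemon=True)
t2 = threading.Thread(target=run_statement, args=(["users/ds=2010-10-30", "clicks"],), daemon=True)
t1.start(); t2.start()
t1.join(timeout=10); t2.join(timeout=10)
```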
Statistics Collection
• Implicit metastore update during load
– Or explicit via ANALYZE TABLE
• Table/partition-level
– Number of rows
– Number of files
– Size in bytes
Stats-driven Optimization
• Automatic map-side join
• Automatic map-side aggregation
• Need column-level stats for better
estimates
– Filter/join selectivity
– Distinct value counts
– Column correlation
Statistical Functions
• Stats 101
– Stddev, var, covar
– Percentile_approx
• Data Mining
– Ngrams, sentences (text analysis)
– Histogram_numeric
• SELECT histogram_numeric(dob_year)
FROM users GROUP BY relationshipstatus
Histogram query results
• “It’s complicated” peaks at 18-19, but lasts into late 40s!
• “In a relationship” peaks at 20
• “Engaged” peaks at 25
• Married peaks in early 30s
• More married than single at 28
• Only teenagers use widowed?
Pluggable Indexing
• Reference implementation
– Index is stored in a normal Hive table
– Compact: distinct block addresses
– Partition-level rebuild
• Currently in R&D
– Automatic use for WHERE, GROUP BY
– New index types (e.g. bitmap, HBase)
Local Mode Execution
• Avoids map/reduce cluster job latency
• Good for jobs which process small
amounts of data
• Let Hive decide when to use it
– set hive.exec.mode.local.auto=true;
• Or force its usage
– set mapred.job.tracker=local;
Faster map join
• Make sure small table can fit in
memory
– If it can’t, fall back to reduce join
• Optimize hash table data structures
• Use distributed cache to push out
pre-filtered lookup table
– Avoid swamping HDFS with reads from
thousands of mappers
Multiple DISTINCT Aggs
• Example
SELECT
view_date,
COUNT(DISTINCT userid),
COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
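What this query computes can be spelled out in Python. The sketch below shows the result semantics only (one set of seen values per DISTINCT column per group, NULLs ignored as in SQL); it says nothing about how Hive plans the query, and the function name is hypothetical.

```python
from collections import defaultdict

def count_distincts(rows, group_col, distinct_cols):
    """Return {group value: (distinct count for each column, in order)}."""
    seen = defaultdict(lambda: {col: set() for col in distinct_cols})
    for row in rows:
        group = seen[row[group_col]]         # make sure the group exists even if all values are NULL
        for col in distinct_cols:
            if row[col] is not None:         # COUNT(DISTINCT ...) ignores NULLs
                group[col].add(row[col])
    return {g: tuple(len(sets[col]) for col in distinct_cols) for g, sets in seen.items()}

page_views = [
    {"view_date": "2010-10-29", "userid": 1, "page_url": "/a"},
    {"view_date": "2010-10-29", "userid": 1, "page_url": "/b"},
    {"view_date": "2010-10-29", "userid": 2, "page_url": "/a"},
    {"view_date": "2010-10-30", "userid": 3, "page_url": None},
]
result = count_distincts(page_views, "view_date", ["userid", "page_url"])
```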
Archiving
• Use HAR (Hadoop archive format) to
combine many files into a few
• Relieves namenode memory
• Archived partition becomes read-only
• Syntax:
ALTER TABLE page_views
{ARCHIVE|UNARCHIVE}
PARTITION (ds='2010-10-30')
JDBC/ODBC Improvements
• JDBC: Basic metadata calls
– Good enough for use with UI’s such as
SQuirreL
• JDBC: some PreparedStatement
support
– Pentaho Data Integration
• ODBC: new driver under
development (based on SQLite)
Hive is now a TLP
• PMC
– Namit Jain (chair)
– John Sichi
– Zheng Shao
– Edward Capriolo
– Raghotham Murthy
– Ning Zhang
– Paul Yang
– He Yongqiang
– Prasad Chakka
– Joydeep Sen
Sarma
– Ashish Thusoo
• Welcome to new
committer Carl
Steinbach!
Developer Diversity
• Recent Contributors
– Facebook, Yahoo, Cloudera
– Netflix, Amazon, Media6Degrees, Intuit
– Numerous research projects
– Many many more…
• Monthly San Francisco bay area
contributor meetups
• East coast meetups? 
Roadmap: Security
• Authentication
– Upgrading to SASL-enabled Thrift
• Authorization
– HDFS-level
• Very limited (no ACL’s)
• Can’t support all Hive features (e.g. views)
– Hive-level (GRANT/REVOKE)
• Hive server deployment for full effectiveness
Roadmap: Hadoop API
• Dropping pre-0.20 support starting
with Hive 0.7
– But Hive is still using old mapred.*
• Moving to mapreduce.* will be
required in order to support newer
Hadoop versions
– Need to resolve some complications with
0.7’s indexing feature
Roadmap: Howl
• Reuse metastore across Hadoop
Howl
Hive Pig Oozie Flume
HDFS
Roadmap: Heavy-Duty Tests
• Unit tests are insufficient
• What is needed:
– Real-world schemas/queries
– Non-toy data scales
– Scripted setup; configuration matrix
– Correctness/performance verification
– Automatic reports: throughput, latency,
profiles, coverage, perf counters…
Roadmap: Shared Test Site
• Nightly runs, regression alerting
• Performance trending
• Synthetic workload (e.g. TPC-H)
• Real-world workload (anonymized?)
• This is critical for
– Non-subjective commit criteria
– Release quality
Resources
• http://hive.apache.org
• user@hive.apache.org
• jsichi@facebook.com
• Questions?
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Ad

Hive Evolution: ApacheCon NA 2010

  • 1. Hive Evolution A Progress Report November 2010 John Sichi (Facebook)
  • 2. Agenda • Hive Overview • Version 0.6 (just released!) • Version 0.7 (under development) • Hive is now a TLP! • Roadmaps
  • 3. What is Hive? • A Hadoop-based system for querying and managing structured data – Uses Map/Reduce for execution – Uses Hadoop Distributed File System (HDFS) for storage
  • 4. Hive Origins • Data explosion at Facebook • Traditional DBMS technology could not keep up with the growth • Hadoop to the rescue! • Incubation with ASF, then became a Hadoop sub-project • Now a top-level ASF project
  • 5. Hive Evolution • Originally: – a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs • Now more and more: – A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
  • 6. Intended Usage • Web-scale Big Data – 100’s of terabytes • Large Hadoop cluster – 100’s of nodes (heterogeneous OK) • Data has a schema • Batch jobs – for both loads and queries
  • 7. So Don’t Use Hive If… • Your data is measured in GB • You don’t want to impose a schema • You need responses in seconds • A “conventional” analytic DBMS can already do the job – (and you can afford it) • You don’t have a lot of time and smart people
  • 8. Scaling Up • Facebook warehouse, July 2010: – 2250 nodes – 36 petabytes disk space • Data access per day: – 80 to 90 terabytes added (uncompressed) – 25000 map/reduce jobs • 300-400 users/month
  • 9. Facebook Deployment (diagram labels: Web Servers; Scribe MidTier; Production Hive-Hadoop Cluster; Sharded MySQL; Scribe-Hadoop Clusters; Adhoc Hive-Hadoop Cluster; Hive replication)
  • 10. Hive Architecture (diagram labels: Metastore; Query Engine; CLI; Hive Thrift API; Metastore Thrift API; JDBC/ODBC clients; Hadoop Map/Reduce + HDFS Clusters; Web Management Console)
  • 12. Map/Reduce Plans (diagram labels: Input Files; Map Tasks; Reduce Tasks; Splits; Result Files)
  • 13. Query Translation Example • SELECT url, count(*) FROM page_views GROUP BY url • Map tasks compute partial counts for each URL in a hash table – “map side” preaggregation – map outputs are partitioned by URL and shipped to corresponding reducers • Reduce tasks tally up partial counts to produce final results
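The translation on slide 13 can be mimicked in a few lines of plain Python. This is only an illustrative sketch, not Hive's actual plan: the function names (`map_task`, `shuffle`, `reduce_task`, `run_query`) and the in-memory "splits" are assumptions made for the example.

```python
from collections import Counter, defaultdict

def map_task(split):
    """Map side: pre-aggregate partial counts per URL in a hash table."""
    return Counter(row["url"] for row in split)

def shuffle(partials, num_reducers):
    """Partition map outputs by URL so each URL lands on exactly one reducer."""
    buckets = defaultdict(list)
    for partial in partials:
        for url, count in partial.items():
            buckets[hash(url) % num_reducers].append((url, count))
    return buckets

def reduce_task(pairs):
    """Reduce side: tally the partial counts into final counts."""
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

def run_query(splits, num_reducers=2):
    """Rough equivalent of: SELECT url, count(*) FROM page_views GROUP BY url."""
    partials = [map_task(split) for split in splits]
    result = {}
    for pairs in shuffle(partials, num_reducers).values():
        result.update(reduce_task(pairs))
    return result

# Two input splits, each handled by its own (simulated) map task.
splits = [
    [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}],
    [{"url": "/a"}, {"url": "/c"}],
]
counts = run_query(splits)
```

Because each map task emits one partial count per URL rather than one record per row, less data crosses the shuffle, which is the point of the map-side preaggregation.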
  • 14. It Gets Quite Complicated!
  • 15. Behavior Extensibility • TRANSFORM scripts (any language) – Serialization+IPC overhead • User defined functions (Java) – In-process, lazy object evaluation • Pre/Post Hooks (Java) – Statement validation/execution – Example uses: auditing, replication, authorization
  • 16. UDF vs UDAF vs UDTF • User Defined Function • One-to-one row mapping • Concat('foo', 'bar') • User Defined Aggregate Function • Many-to-one row mapping • Sum(num_ads) • User Defined Table Function • One-to-many row mapping • Explode([1,2,3])
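The three row mappings on slide 16 can be illustrated with ordinary Python functions. These are analogies only (real Hive UDFs are Java classes), and the names `concat`, `total` and `explode` are stand-ins chosen to mirror the slide's examples.

```python
def concat(a, b):
    """UDF-like: one input row produces exactly one output value."""
    return a + b

def total(values):
    """UDAF-like: many input rows collapse into one output value."""
    return sum(values)

def explode(values):
    """UDTF-like: one input row (holding an array) produces many output rows."""
    for value in values:
        yield (value,)

# One-to-one: two input rows, two output values.
udf_out = [concat(a, b) for a, b in [("foo", "bar"), ("x", "y")]]
# Many-to-one: three input rows, one output value.
udaf_out = total([3, 4, 5])
# One-to-many: one input array, three output rows.
udtf_out = list(explode([1, 2, 3]))
```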
  • 17. Storage Extensibility • Input/OutputFormat: file formats – SequenceFile, RCFile, TextFile, … • SerDe: row formats – Thrift, JSON, ProtocolBuffer, … • Storage Handlers (new in 0.6) – Integrate foreign metadata, e.g. HBase • Indexing – Under development in 0.7
  • 18. Release 0.6 • October 2010 – Views – Multiple Databases – Dynamic Partitioning – Automatic Merge – New Join Strategies – Storage Handlers
  • 19. Views: Syntax CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], … ) ] [COMMENT 'view_comment'] AS SELECT … [ ORDER BY … LIMIT … ]
  • 20. Views: Usage • Use Cases – Column/table renaming – Encapsulate complex query logic – Security (Future) • Limitations – Read-only – Obscures partition metadata from underlying tables – No dependency management
  • 21. Multiple Databases • Follows MySQL convention – CREATE DATABASE [IF NOT EXISTS] db_name [COMMENT 'db_comment'] – USE db_name • Logical namespace for tables • 'default' database is still there • Does not yet support queries across multiple databases
  • 22. Dynamic Partitions: Syntax • Example INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country FROM page_view_stg pvs
  • 23. Dynamic Partitions: Usage • Automatically create partitions based on distinct values in columns • Works as rudimentary indexing – Prune partitions via WHERE clause • But be careful… – Don’t create too many partitions! – Configuration parameters can be used to prevent accidents
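A rough sketch of what dynamic partitioning does with the rows of the slide 22 insert, assuming a static `dt` value and a dynamic `country` column. The `max_partitions` argument is a hypothetical stand-in for the configuration parameters the slide mentions, not an actual Hive setting name.

```python
from collections import defaultdict

def dynamic_partition(rows, static_dt, max_partitions=1000):
    """Group rows into one partition per distinct value of the 'country' column.

    Raises if the statement would create more than max_partitions partitions,
    mimicking a safety limit against accidental partition explosions.
    """
    partitions = defaultdict(list)
    for row in rows:
        path = f"dt={static_dt}/country={row['country']}"
        if path not in partitions and len(partitions) >= max_partitions:
            raise RuntimeError(f"too many dynamic partitions (limit {max_partitions})")
        partitions[path].append(row)
    return dict(partitions)

rows = [
    {"userid": 1, "country": "US"},
    {"userid": 2, "country": "CA"},
    {"userid": 3, "country": "US"},
]
parts = dynamic_partition(rows, "2008-06-08")
```

A later query with `WHERE country = 'US'` would then only need to read the `country=US` directory, which is the "rudimentary indexing" effect the slide describes.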
  • 24. Automatic merge • Jobs can produce many files • Why is this bad? – Namenode pressure – Downstream jobs have to deal with file processing overhead • So, clean up by merging results into a few large files (configurable) – Use conditional map-only task to do this
  • 25. Join Strategies Before 0.6 • Map/reduce join – Map tasks partition inputs on join keys and ship to corresponding reducers – Reduce tasks perform sort-merge-join • Map-join – Each mapper builds lookup hashtable from copy of small table – Then hash-join the splits of big table
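The map-join on slide 25 is essentially a hash join performed inside each mapper. A minimal sketch, assuming both tables are lists of dicts and using a hypothetical `map_join` helper:

```python
def map_join(big_split, small_table, key):
    """Join one split of the big table against a full copy of the small table."""
    # Build phase: each mapper builds a lookup hash table from the small table.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the big-table split and emit matching rows.
    joined = []
    for row in big_split:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined

users = [{"userid": 1, "name": "ann"}, {"userid": 2, "name": "bob"}]
views = [
    {"userid": 1, "url": "/a"},
    {"userid": 3, "url": "/b"},
    {"userid": 1, "url": "/c"},
]
joined = map_join(views, users, "userid")
```

No reduce phase is needed, but every mapper must hold the whole small table in memory, which is the limitation the strategies on the next slide relax.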
  • 26. New Join Strategies • Bucketed map-join – Each mapper filters its lookup table by the bucketing hash function – Allows “small” table to be much bigger • Sorted merge in map-join – Requires presorted input tables • Deal with skew in map/reduce join – Conditional plan step for skew keys (after main map/reduce join step)
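The bucketed map-join idea can be sketched as follows, under the assumption that both tables are bucketed on the join key with the same hash function and bucket count. `NUM_BUCKETS`, `bucket_of` and the other names are illustrative, not Hive internals.

```python
NUM_BUCKETS = 4

def bucket_of(key):
    """Bucketing hash function (integer keys assumed for simplicity)."""
    return key % NUM_BUCKETS

def bucketize(rows, key):
    """Split a table into NUM_BUCKETS hash buckets on the join key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key])].append(row)
    return buckets

def bucketed_map_join(big_rows, lookup_rows, key):
    """Each simulated mapper loads only the lookup bucket matching its input bucket."""
    big_buckets = bucketize(big_rows, key)
    lookup_buckets = bucketize(lookup_rows, key)
    joined = []
    max_lookup_rows = 0  # largest lookup table any single mapper had to hold
    for b in range(NUM_BUCKETS):  # one iteration stands in for one mapper
        lookup = {}
        for row in lookup_buckets[b]:
            lookup.setdefault(row[key], []).append(row)
        max_lookup_rows = max(max_lookup_rows, len(lookup_buckets[b]))
        for row in big_buckets[b]:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined, max_lookup_rows

users = [{"userid": i, "name": f"user{i}"} for i in range(8)]
views = [
    {"userid": 1, "url": "/a"},
    {"userid": 5, "url": "/b"},
    {"userid": 6, "url": "/c"},
    {"userid": 9, "url": "/d"},
]
joined, max_lookup_rows = bucketed_map_join(views, users, "userid")
```

Here no mapper ever holds more than two of the eight `users` rows, which is why the "small" table can be much bigger than in a plain map-join.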
  • 28. Low Latency Warehouse (diagram labels: HBase; Other Files/Tables; Periodic Load; Continuous Update; Hive Queries)
  • 29. Storage Handler Syntax • HBase Example CREATE TABLE users( userid int, name string, email string, notes string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "small:name,small:email,large:notes") TBLPROPERTIES ( "hbase.table.name" = "user_list");
  • 30. Release 0.7 • In development – Concurrency Control – Stats Collection – Stats Functions – Indexes – Local Mode – Faster map join – Multiple DISTINCT aggregates – Archiving – JDBC/ODBC improvements
  • 31. Concurrency Control • Pluggable distributed lock manager – Default is Zookeeper-based • Simple read/write locking • Table-level and partition-level • Implicit locking (statement level) – Deadlock-free via lock ordering • Explicit LOCK TABLE (global)
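The "deadlock-free via lock ordering" bullet can be illustrated with a toy in-process lock manager. This sketch uses exclusive `threading.Lock` objects only and ignores the read/write distinction and the ZooKeeper backend; `LockManager` and its methods are hypothetical names for the example.

```python
import threading

class LockManager:
    """Toy lock manager that always acquires locks in sorted name order."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def acquire_all(self, names):
        # A single global order means no two statements can each hold a lock
        # the other is waiting for, so circular waits cannot form.
        ordered = sorted(set(names))
        for name in ordered:
            self._lock_for(name).acquire()
        return ordered

    def release_all(self, ordered):
        for name in reversed(ordered):
            self._locks[name].release()

manager = LockManager()
finished = []

def statement(label, objects):
    """Simulate one statement that implicitly locks the objects it touches."""
    held = manager.acquire_all(objects)
    try:
        finished.append(label)
    finally:
        manager.release_all(held)

# The two statements name the same tables in opposite orders.
threads = [
    threading.Thread(target=statement, args=("q1", ["page_views", "users"])),
    threading.Thread(target=statement, args=("q2", ["users", "page_views"])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the sort, `q1` could hold `page_views` while waiting for `users` and `q2` the reverse; with it, both statements always request `page_views` first.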
  • 32. Statistics Collection • Implicit metastore update during load – Or explicit via ANALYZE TABLE • Table/partition-level – Number of rows – Number of files – Size in bytes
  • 33. Stats-driven Optimization • Automatic map-side join • Automatic map-side aggregation • Need column-level stats for better estimates – Filter/join selectivity – Distinct value counts – Column correlation
  • 34. Statistical Functions • Stats 101 – Stddev, var, covar – Percentile_approx • Data Mining – Ngrams, sentences (text analysis) – Histogram_numeric • SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
  • 35. Histogram query results • “It’s complicated” peaks at 18-19, but lasts into late 40s! • “In a relationship” peaks at 20 • “Engaged” peaks at 25 • Married peaks in early 30s • More married than single at 28 • Only teenagers use widowed?
  • 36. Pluggable Indexing • Reference implementation – Index is stored in a normal Hive table – Compact: distinct block addresses – Partition-level rebuild • Currently in R&D – Automatic use for WHERE, GROUP BY – New index types (e.g. bitmap, HBase)
  • 37. Local Mode Execution • Avoids map/reduce cluster job latency • Good for jobs which process small amounts of data • Let Hive decide when to use it – set hive.exec.mode.local.auto=true; • Or force its usage – set mapred.job.tracker=local;
  • 38. Faster map join • Make sure small table can fit in memory – If it can’t, fall back to reduce join • Optimize hash table data structures • Use distributed cache to push out pre-filtered lookup table – Avoid swamping HDFS with reads from thousands of mappers
  • 39. Multiple DISTINCT Aggs • Example SELECT view_date, COUNT(DISTINCT userid), COUNT(DISTINCT page_url) FROM page_views GROUP BY view_date
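For readers unfamiliar with the semantics, the result of the slide 39 query can be reproduced in plain Python. This shows only what the query computes, not how Hive plans it; `distinct_counts` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def distinct_counts(rows):
    """Per view_date, count distinct userids and distinct page_urls."""
    users = defaultdict(set)
    urls = defaultdict(set)
    for view_date, userid, page_url in rows:
        users[view_date].add(userid)
        urls[view_date].add(page_url)
    return {d: (len(users[d]), len(urls[d])) for d in users}

# (view_date, userid, page_url)
page_views = [
    ("2010-10-29", 1, "/a"),
    ("2010-10-29", 1, "/b"),
    ("2010-10-29", 2, "/a"),
    ("2010-10-30", 3, "/a"),
]
result = distinct_counts(page_views)
```

Each DISTINCT aggregate needs its own set of seen values per group, which hints at why supporting several of them in one query took extra work.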
  • 40. Archiving • Use HAR (Hadoop archive format) to combine many files into a few • Relieves namenode memory • Archived partition becomes read-only • Syntax: ALTER TABLE page_views {ARCHIVE|UNARCHIVE} PARTITION (ds='2010-10-30')
  • 41. JDBC/ODBC Improvements • JDBC: Basic metadata calls – Good enough for use with UI’s such as SQuirreL • JDBC: some PreparedStatement support – Pentaho Data Integration • ODBC: new driver under development (based on SQLite)
  • 42. Hive is now a TLP • PMC – Namit Jain (chair) – John Sichi – Zheng Shao – Edward Capriolo – Raghotham Murthy – Ning Zhang – Paul Yang – He Yongqiang – Prasad Chakka – Joydeep Sen Sarma – Ashish Thusoo • Welcome to new committer Carl Steinbach!
  • 43. Developer Diversity • Recent Contributors – Facebook, Yahoo, Cloudera – Netflix, Amazon, Media6Degrees, Intuit – Numerous research projects – Many many more… • Monthly San Francisco bay area contributor meetups • East coast meetups? 
  • 44. Roadmap: Security • Authentication – Upgrading to SASL-enabled Thrift • Authorization – HDFS-level • Very limited (no ACL’s) • Can’t support all Hive features (e.g. views) – Hive-level (GRANT/REVOKE) • Hive server deployment for full effectiveness
  • 45. Roadmap: Hadoop API • Dropping pre-0.20 support starting with Hive 0.7 – But Hive is still using old mapred.* • Moving to mapreduce.* will be required in order to support newer Hadoop versions – Need to resolve some complications with 0.7’s indexing feature
  • 46. Roadmap: Howl • Reuse metastore across Hadoop (diagram labels: Howl; Hive; Pig; Oozie; Flume; HDFS)
  • 47. Roadmap: Heavy-Duty Tests • Unit tests are insufficient • What is needed: – Real-world schemas/queries – Non-toy data scales – Scripted setup; configuration matrix – Correctness/performance verification – Automatic reports: throughput, latency, profiles, coverage, perf counters…
  • 48. Roadmap: Shared Test Site • Nightly runs, regression alerting • Performance trending • Synthetic workload (e.g. TPC-H) • Real-world workload (anonymized?) • This is critical for – Non-subjective commit criteria – Release quality