Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS on the order of a few minutes and chaining incremental processing in Hadoop.
This document summarizes Hoodie, an open-source incremental processing framework. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives such as upsert and incremental pull to apply mutations and consume only changed data (see the sketch after this list).
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
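To make these primitives concrete, here is a minimal PySpark sketch of an upsert followed by an incremental pull against a Hudi (Hoodie) table. The table name, paths, key/ordering fields, and the begin commit time are illustrative assumptions rather than details from the deck, and the Hudi Spark bundle is assumed to be on the classpath.

```python
# Minimal sketch of Hudi's upsert + incremental pull primitives from PySpark.
# All names below (paths, fields, commit time) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hoodie-primitives").getOrCreate()

base_path = "hdfs:///warehouse/trips"                               # hypothetical table location
updates_df = spark.read.json("hdfs:///incoming/trip_updates.json")  # hypothetical changed records

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",    # assumed record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # assumed ordering field
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert: existing keys are updated in place, new keys are inserted.
updates_df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Incremental pull: read only the records committed after a given instant,
# instead of re-scanning the whole table.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path)
)
incremental_df.show()
```

Downstream jobs can then chain on the incremental DataFrame, which is what the deck means by chaining incremental processing in Hadoop.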
I promise that understanding NoSQL is as easy as playing with LEGO bricks! Google Bigtable, presented in 2006, is the inspiration for Apache HBase: let's take a deep dive into Bigtable to better understand HBase.
YugaByte DB on Kubernetes - An Introduction (Yugabyte)
This document summarizes YugaByte DB, a distributed SQL and NoSQL database. It discusses how YugaByte DB provides ACID transactions, strong consistency, and high performance at a planet scale. It also describes how to deploy YugaByte DB and an example e-commerce application called Yugastore on Kubernetes. The document outlines the database architecture and components, and provides steps to deploy the system and run a sample workload.
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio... (Maciek Jozwiak)
Spanner is a globally distributed database that provides synchronous replication across data centers. It uses atomic clocks and GPS to synchronize time at a global scale and assign commit timestamps to distributed transactions. The database has a hierarchical data model and supports SQL-like queries while automatically sharding, replicating, and rebalancing data across servers.
Data Security at Scale through Spark and Parquet Encryption (Databricks)
This presentation discusses Parquet encryption at scale with Spark. It covers the goals of Parquet modular encryption, including data privacy, integrity, and performance. It demonstrates writing and reading encrypted Parquet files in Spark and discusses the Apache community roadmap for further integration of Parquet encryption.
Spanner is Google's globally distributed database that provides ACID transactions across data centers. It uses TrueTime to assign timestamps for distributed transactions, ensuring consistency. A Spanner deployment consists of servers organized into zones across datacenters, with a universe master and placement driver coordinating. Transactions are executed across servers and committed using time-based consensus. Evaluation shows Spanner provides low-latency reads and commits at large scale for Google applications.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use cases of the second generation of ‘Hudi’, a self-contained Apache Spark library to build large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build notebooks/dashboards on top using Spark SQL.
An Overview of Spanner: Google's Globally Distributed Database (Benjamin Bengfort)
Spanner is a globally distributed database that provides external consistency between data centers and stores data in a schema based semi-relational data structure. Not only that, Spanner provides a versioned view of the data that allows for instantaneous snapshot isolation across any segment of the data. This versioned isolation allows Spanner to provide globally consistent reads of the database at a particular time allowing for lock-free read-only transactions (and therefore no communications overhead for consensus during these types of reads). Spanner also provides externally consistent reads and writes with a timestamp-based linear execution of transactions and two phase commits. Spanner is the first distributed database that provides global sharding and replication with strong consistency semantics.
This document discusses AlwaysOn availability technologies in SQL Server 2017, including Failover Cluster Instances (FCI) and Availability Groups (AG). It provides an overview of how FCI and AG work to provide high availability and disaster recovery, including capabilities like multi-subnet support. The document also summarizes new features and enhancements to FCI and AG in SQL Server 2014, 2016 and 2017.
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto's state-of-the-art low-latency evaluation with Spark's robust and fault-tolerant execution engine.
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries... (HostedbyConfluent)
In a real-time data ingestion pipeline for analytical processing, efficient and fast data loading into a columnar database such as ClickHouse favors large blocks over individual rows. Therefore, applications often rely on a buffering mechanism such as Kafka to store data temporarily, and on a message processing engine to aggregate Kafka messages into large blocks which then get loaded into the backend database. Due to various failures in this pipeline, a naive block aggregator that forms blocks without additional measures would cause data duplication or data loss. We have developed a solution to avoid these issues, thereby achieving exactly-once delivery from Kafka to ClickHouse. Our solution utilizes Kafka's metadata to keep track of blocks that we intend to send to ClickHouse, and later uses this metadata to deterministically reproduce ClickHouse blocks for retries in case of failures. The identical blocks are guaranteed to be deduplicated by ClickHouse. We have also developed a run-time verification tool that monitors Kafka's internal metadata topic and raises alerts when the required invariants for exactly-once delivery are violated. Our solution has been developed and deployed to the production clusters that span multiple datacenters at eBay.
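As a rough illustration of the idea (not eBay's actual implementation), the sketch below uses the kafka-python client to form blocks whose identity is derived from the consumed offset range, so a retry that re-reads the same offsets reproduces an identical block that the target store can deduplicate. load_block is a hypothetical stand-in for the ClickHouse insert, and the topic, broker, and block size are illustrative.

```python
# Deterministic block formation from Kafka offsets (sketch, kafka-python client).
# The block id is derived from (topic, partition, first offset, last offset), so
# re-reading the same offset range after a failure yields an identical block.
import json
from kafka import KafkaConsumer

def load_block(block_id, rows):
    # Hypothetical stand-in: insert `rows` into ClickHouse, using block_id for dedup.
    print(f"loading block {block_id} with {len(rows)} rows")

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="clickhouse-loader",
    enable_auto_commit=False,          # offsets advance only after a successful load
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

BLOCK_SIZE = 10_000

while True:
    batch = consumer.poll(timeout_ms=1000, max_records=BLOCK_SIZE)
    for tp, records in batch.items():
        if not records:
            continue
        # Same offset range -> same block id and same contents on retry.
        block_id = f"{tp.topic}-{tp.partition}-{records[0].offset}-{records[-1].offset}"
        load_block(block_id, [r.value for r in records])
        consumer.commit()              # commit only after the block is durable downstream
```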
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla... (Markus Michalewicz)
This is the latest version of the Oracle RAC 12c (12.1.0.2) Operational Best Practices presentation as shown during IOUG / Collaborate15. As best practices are a result of true collaboration this will probably be the last version before OOW 2015.
YugaByte DB Internals - Storage Engine and Transactions (Yugabyte)
This document introduces YugaByte DB, a high-performance, distributed, transactional database. It is built to scale horizontally on commodity servers across data centers for mission-critical applications. YugaByte DB uses a transactional document store based on RocksDB, Raft-based replication for resilience, and automatic sharding and rebalancing. It supports ACID transactions across documents, provides APIs compatible with Cassandra and Redis, and is open source. The architecture is designed for high performance, strong consistency, and cloud-native deployment.
Challenges in Building a Data Pipeline (Manish Kumar)
The document discusses challenges in building a data pipeline including making it highly scalable, available with low latency and zero data loss while supporting multiple data sources. It covers expectations around real-time vs batch processing and streaming vs batch data. Implementation approaches like ETL vs ELT are examined along with replication modes, challenges around schema changes and NoSQL. Effective implementations should address transformations, security, replays, monitoring and more. Reference architectures like Lambda and Kappa are briefly outlined.
Efficient Data Storage for Analytics with Apache Parquet 2.0 (Cloudera, Inc.)
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
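A short PySpark sketch (paths are illustrative) of why a columnar format like Parquet helps the scans described above: only the referenced columns are read, and row-group statistics let the reader skip data that cannot match the filter.

```python
# Write row-oriented input as Parquet, then read it back with column pruning
# and a pushed-down filter. Paths and field names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

events = spark.read.json("hdfs:///raw/events")        # hypothetical row-oriented input
events.write.mode("overwrite").parquet("hdfs:///warehouse/events_parquet")

# Only `user_id` and `ts` are scanned, and row groups whose ts statistics
# fall entirely outside the filter range can be skipped.
daily = (spark.read.parquet("hdfs:///warehouse/events_parquet")
         .where(F.col("ts") >= "2024-01-01")
         .groupBy("user_id")
         .count())
daily.show()
```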
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
Dustin Vannoy presented on using Delta Lake with Azure Databricks. He began with an introduction to Spark and Databricks, demonstrating how to set up a workspace. He then discussed limitations of Spark including lack of ACID compliance and small file problems. Delta Lake addresses these issues with transaction logs for ACID transactions, schema enforcement, automatic file compaction, and performance optimizations like time travel. The presentation included demos of Delta Lake capabilities like schema validation, merging, and querying past versions of data.
Exadata architecture and internals presentation (Sanjoy Dasgupta)
The document provides an overview of Oracle's Exadata database machine. It describes the Exadata X7-2 and X7-8 models, which feature the latest Intel Xeon processors, high-capacity flash storage, and an improved InfiniBand internal network. The document highlights how Exadata's unique smart database software optimizes performance for analytics, online transaction processing, and database consolidation workloads through techniques like smart scan query offloading to storage servers.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
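A minimal kafka-python sketch of the produce/consume flow described above; the broker address and topic name are illustrative.

```python
# Producers append JSON records to a topic; consumers read them back in
# partition order. Broker and topic are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "alice", "page": "/home"})
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```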
Introduction to Cassandra: Replication and Consistency (Benjamin Black)
A short introduction to replication and consistency in the Cassandra distributed database. Delivered April 28th, 2010 at the Seattle Scalability Meetup.
Oracle Data Guard ensures high availability, disaster recovery and data protection for enterprise data. This enables production Oracle databases to survive disasters and data corruption. Oracle 18c and 19c offer many new features that will bring many advantages to organizations.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
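To illustrate the generality requirement, the PySpark sketch below reads three very different systems through the same Data Source interface; the paths, JDBC URL, and topic are illustrative, and the JDBC and Kafka connectors must be available on the classpath.

```python
# One read interface fronting different storage systems. All connection
# details below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-api").getOrCreate()

parquet_df = spark.read.format("parquet").load("hdfs:///warehouse/events_parquet")

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://db:5432/shop")
           .option("dbtable", "orders")
           .option("user", "reader")
           .option("password", "secret")
           .load())

kafka_df = (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "page-views")
            .load())

# Each source is exposed as a DataFrame, so downstream code is identical.
parquet_df.printSchema()
```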
NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
Apache Iceberg Presentation for the St. Louis Big Data IDEA (Adam Doyle)
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with Hive and Spark.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
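As a small illustration of the word-count example mentioned above, here is a Python sketch whose map and reduce functions mirror what a Hadoop Streaming mapper/reducer would do; the tiny driver at the end simulates the shuffle locally so the logic can be run as-is.

```python
# Word count in MapReduce style: map emits (word, 1), the shuffle groups by word,
# reduce sums the counts. The sample input lines are illustrative.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum all counts grouped under the same word.
    return word, sum(counts)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    grouped = defaultdict(list)
    for line in lines:                       # map
        for word, one in map_phase(line):
            grouped[word].append(one)        # shuffle: group by key
    for word in sorted(grouped):             # reduce
        print(reduce_phase(word, grouped[word]))
```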
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspect and then correlating it to the skill sets of current Hadoop adopters.
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... (Yael Garten)
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next-generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity-360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
This document summarizes a study that compares the performance of K-Means clustering implemented in Apache Spark MLlib and MPI (Message Passing Interface). The authors applied K-Means clustering to NBA play-by-play game data to cluster teams based on their position distributions. They found that MPI ran faster for smaller cluster sizes and fewer iterations, while Spark provided more stable runtimes as parameters increased. The authors tested different numbers of machines with MPI and found that runtime increased linearly as machines were added, the opposite of their expectation that distributing the work across more machines would yield faster runtimes.
06 how to write a map reduce version of k-means clustering (Subhas Kumar Ghosh)
The document discusses how to write a MapReduce version of K-means clustering. It involves duplicating the cluster centers across nodes so each data point can be processed independently in the map phase. The map phase outputs (ClusterID, Point) pairs assigning each point to its closest cluster. The reduce phase groups by ClusterID and calculates the new centroid for each cluster, outputting (ClusterID, Centroid) pairs. Each iteration is run as a MapReduce job with the library determining if convergence is reached between iterations.
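A compact Python sketch of the iteration described above: the map step emits (ClusterID, Point) pairs for the nearest centre, and the reduce step averages each group to produce the new centroid. The data points and distance function are illustrative.

```python
# One K-Means iteration in MapReduce style: map assigns points to the nearest
# centroid, reduce averages each cluster's points into a new centroid.
import math
from collections import defaultdict

def nearest(point, centroids):
    # Index of the centroid closest to `point` (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    for p in points:
        yield nearest(p, centroids), p          # (ClusterID, Point)

def reduce_phase(cluster_points):
    # New centroid = component-wise mean of the assigned points.
    n = len(cluster_points)
    return tuple(sum(c) / n for c in zip(*cluster_points))

def kmeans_iteration(points, centroids):
    grouped = defaultdict(list)
    for cid, p in map_phase(points, centroids): # map + shuffle (group by ClusterID)
        grouped[cid].append(p)
    return [reduce_phase(grouped[cid]) if cid in grouped else centroids[cid]
            for cid in range(len(centroids))]   # reduce

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(kmeans_iteration(points, [(0, 0), (10, 10)]))
```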
Optimization for iterative queries on MapReduce (makoto onizuka)
This document discusses optimization techniques for iterative queries with convergence properties. It presents OptIQ, a framework that uses view materialization and incrementalization to remove redundant computations from iterative queries. View materialization reuses operations on unmodified attributes by decomposing tables into invariant and variant views. Incrementalization reuses operations on unmodified tuples by processing delta tables between iterations. The document evaluates OptIQ on Hive and Spark, showing it can improve performance of iterative algorithms like PageRank and k-means clustering by up to 5 times.
Seeds Affinity Propagation Based on Text Clustering (IJRES Journal)
The objective is to find, among all partitions of the data set, the best partitioning according to some quality measure. Affinity propagation is a low-error, high-speed, flexible, and remarkably simple clustering algorithm that may be used in forming teams of participants for business simulations and experiential exercises, and in organizing participants' preferences for the parameters of simulations. This paper proposes an efficient Affinity Propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of our approach is (1) to prune unnecessary message exchanges in the iterations and (2) to compute the convergence values of pruned messages after the iterations to determine clusters.
Spark Bi-Clustering - OW2 Big Data Initiative, altic (ALTIC Altic)
This document discusses the OW2 Big Data Initiative and ALTIC's tools and approach for big data, including ETL, data warehousing, reporting, analytics, and BI platforms. It also describes Biclustering, an algorithm for big data clustering using Spark and SOM, and how it can integrate with SpagoBI and Talend for big data analysis.
This document summarizes a lecture on clustering and provides a sample MapReduce implementation of K-Means clustering. It introduces clustering, discusses different clustering algorithms like hierarchical and partitional clustering, and focuses on K-Means clustering. It also describes Canopy clustering, which can be used as a preliminary step to partition large datasets and parallelize computation for K-Means clustering. The document then outlines the steps to implement K-Means clustering on large datasets using MapReduce, including selecting canopy centers, assigning points to canopies, and performing the iterative K-Means algorithm in parallel.
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL (MLconf)
The document discusses clustering algorithms like K-means and how they can be implemented using Apache Spark. It describes how Spark allows these algorithms to be highly parallelized and run on large datasets. Specifically, it covers how K-means clustering works, its limitations in choosing initial cluster centers, and how K-means++ and K-means|| algorithms aim to address this by sampling points from the dataset to select better initial centers in a parallel manner that is scalable for big data.
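For reference, here is a short PySpark MLlib sketch (with synthetic data) of K-Means using the parallel "k-means||" initialization discussed above instead of purely random centres.

```python
# K-Means in Spark MLlib with k-means|| initialization. The four 2-D points
# below are synthetic, illustrative data.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

kmeans = KMeans(k=2, initMode="k-means||", seed=42)
model = kmeans.fit(data)
print(model.clusterCenters())
```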
This document provides an overview and summary of Apache Hivemall, which is a scalable machine learning library built as a collection of Hive UDFs (user-defined functions). Some key points:
- Hivemall allows users to perform machine learning tasks like classification, regression, recommendation and anomaly detection using SQL queries in Hive, SparkSQL or Pig Latin.
- It provides a number of popular machine learning algorithms like logistic regression, decision trees, factorization machines.
- Hivemall is multi-platform, so models built in one system can be used in another. This allows ML tasks to be parallelized across clusters.
- It has been adopted by several companies for applications like click-through prediction, user
This document summarizes key Hadoop configuration parameters that affect MapReduce job performance and provides suggestions for optimizing these parameters under different conditions. It describes the MapReduce workflow and phases, defines important parameters like dfs.block.size, mapred.compress.map.output, and mapred.tasktracker.map/reduce.tasks.maximum. It explains how to configure these parameters based on factors like cluster size, data and task complexity, and available resources. The document also discusses other performance aspects like temporary space, JVM tuning, and reducing reducer initialization overhead.
Data Infused Product Design and Insights at LinkedIn (Yael Garten)
Presentation from a talk given at Boston Big Data Innovation Summit, September 2012.
Summary: The Data Science team at LinkedIn focuses on 3 main goals: (1) providing data-driven business and product insights, (2) creating data products, and (3) extracting interesting insights from our data such as analysis of the economic status of the country or identifying hot companies in a certain geographic region. In this talk I describe how we ensure that our products are data driven -- really data infused at the core -- and share interesting insights we uncover using LinkedIn's rich data. We discuss what makes a good data scientist, and what techniques and technologies LinkedIn data scientists use to convert our rich data into actionable product and business insights, to create data-driven products that truly serve our members.
A Perspective from the intersection Data Science, Mobility, and Mobile Devices (Yael Garten)
Invited talk at Stanford CSEE392I (Seminar on Trends in Computing and Communications) April 24, 2014.
Covered three topics: (1) Data science at LinkedIn. (2) Mobile data science — how is it different, challenges and opportunities. Examples of how data science impacts business and product decisions. (3) Mobile today, and LinkedIn's mobile story.
K-Means clustering is an algorithm that partitions data points into k clusters based on their distances from initial cluster center points. It is commonly used for classification applications on large datasets and can be parallelized by duplicating cluster centers and processing each data point independently. Mahout provides implementations of K-Means clustering and other algorithms that can operate on distributed datasets stored in Hadoop SequenceFiles.
SF Big Analytics meetup: Hoodie From Uber (Chester Chen)
Even after a decade, the name "Hadoop" remains synonymous with "big data", even as new options for processing/querying (stream processing, in-memory analytics, interactive SQL) and storage services (S3/Google Cloud/Azure) have emerged & unlocked new possibilities. However, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplication of data and strain on usability. In this talk, we argue that by adding some missing blocks to the existing Hadoop stack, we are able to provide similar capabilities right on top of Hadoop, at reduced cost and increased efficiency, greatly simplifying the overall architecture in the process. We will discuss the need for incremental processing primitives on Hadoop, motivating them with some real-world problems from Uber. We will then introduce "Hoodie", an open source Spark library built at Uber, to enable faster data for petabyte-scale data analytics and solve these problems. We will deep dive into the design & implementation of the system and discuss the core concepts around timeline consistency and the tradeoffs between ingest speed & query performance. We contrast Hoodie with similar systems in the space, discuss how it is deployed across the Hadoop ecosystem at Uber, and finally share the technical direction ahead for the project.
Speaker: VINOTH CHANDAR, Staff Software Engineer at Uber
Vinoth is the founding engineer/architect of the data team at Uber, as well as the author of many data processing & querying systems at Uber, including "Hoodie". He has a keen interest in unified architectures for data analytics and processing.
Previously, Vinoth was the lead on LinkedIn's Voldemort key-value store and has also worked on the Oracle Database replication engine, HPC, and stream processing.
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... (HostedbyConfluent)
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, in what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, server-less "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management, such as cleaning older file versions, compaction of delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
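As a hedged sketch of what configuring a few of those table services can look like when writing from Spark, the options below follow the names in the Hudi documentation, but the values, fields, and table are illustrative and exact keys can vary between Hudi versions.

```python
# Illustrative Hudi write options enabling cleaning, inline compaction, and
# inline clustering on a merge-on-read table. Values are examples, not tuning advice.
table_service_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Cleaning: keep a bounded number of older file versions.
    "hoodie.cleaner.commits.retained": "10",
    # Compaction: fold delta logs into base files inline after writes.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: periodically re-sort/re-size files for faster queries.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}

# Used the same way as any Hudi write, e.g.:
# updates_df.write.format("hudi").options(**table_service_options) \
#           .mode("append").save("hdfs:///warehouse/trips")
```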
Gruter TECHDAY 2014 Realtime Processing in Telco (Gruter)
Big Telco, Bigger real-time demands: Real-time processing in Telco
- Presented by Jung-ryong Lee, engineer manager at SK Telecom at Gruter TECHDAY 2014 Oct.29 Seoul, Korea
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data (Hakka Labs)
By Doug Daniels (Director of Engineering, Data Dog)
At Datadog, we collect hundreds of billions of metric data points per day from hosts, services, and customers all over the world. In addition to charting and monitoring this data in real time, we also run many large-scale offline jobs to apply algorithms and compute aggregations on the data. In the past months, we've migrated our largest data sets over to Apache Parquet, an efficient, portable columnar storage format.
Optimizing Big Data to run in the Public Cloud (Qubole)
Qubole is a cloud-based platform that allows customers to easily run Hadoop and Spark clusters on AWS for big data analytics. It optimizes performance and reduces costs through techniques like caching data in S3 for faster access, using spot instances, and directly writing query outputs to S3. The document discusses Qubole's features, capabilities, and how it provides an easier way for more users like data scientists and analysts to access and query big data compared to building and managing Hadoop clusters themselves.
IEEE International Conference on Data Engineering 2015 (Yousun Jeong)
SK Telecom developed a Hadoop data warehouse (DW) solution to address the high costs and limitations of traditional DW systems for handling big data. The Hadoop DW provides a scalable architecture using Hadoop, Tajo and Spark to cost-effectively store and analyze over 30PB of data across 1000+ nodes. It offers SQL analytics through Tajo for faster querying and easier migration from RDBMS systems. The Hadoop DW has helped SK Telecom and other customers such as semiconductor manufacturers to more affordably store and process massive volumes of both structured and unstructured data for advanced analytics.
Geek Sync | Guide to Understanding and Monitoring Tempdb (IDERA Software)
You can watch the replay for this Geek Sync webcast in the IDERA Resource Center: http://ow.ly/7OmW50A5qNs
Every SQL Server system you work with has a tempdb database. In this Geek Sync, you’ll learn how tempdb is structured, what it’s used for and the common performance problems that are tied to this shared resource.
Fluentd is an open source log collector that allows flexible collection and routing of log data. It uses JSON format for log messages and supports many input and output plugins. Fluentd can collect logs from files, network services, and applications before routing them to storage and analysis services like MongoDB, HDFS, and Treasure Data. The open source project has grown a large community contributing over 100 plugins to make log collection and processing easier.
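A minimal sketch of an application sending structured events to a local Fluentd agent with the fluent-logger Python package; the tag, host/port, and event fields are illustrative, and Fluentd's configured output plugins decide where the records are routed (MongoDB, HDFS, and so on).

```python
# Emit JSON-structured events to a local Fluentd agent (pip install fluent-logger).
# Tag, host/port, and the event payload are illustrative assumptions.
from fluent import sender

logger = sender.FluentSender("app", host="localhost", port=24224)

# Emits a record tagged app.follow with the current timestamp.
if not logger.emit("follow", {"from": "userA", "to": "userB"}):
    print(logger.last_error)
    logger.clear_last_error()

logger.close()
```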
Deploying any software can be a challenge if you don't understand how resources are used or how to plan for the capacity of your systems. Whether you need to deploy or grow a single MongoDB instance, a replica set, or tens of sharded clusters, you probably share the same challenges in trying to size that deployment.
This webinar will cover what resources MongoDB uses, and how to plan for their use in your deployment. Topics covered will include understanding how to model and plan capacity needs for new and growing deployments. The goal of this webinar will be to provide you with the tools needed to be successful in managing your MongoDB capacity planning tasks.
Building large scale transactional data lake using apache hudi (Bill Liu)
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then deep dive into improving data operations by providing features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Kafka is becoming an ever more popular choice for users to help enable fast data and streaming. Kafka provides a wide landscape of configuration to allow you to tweak its performance profile. Understanding the internals of Kafka is critical for picking your ideal configuration. Depending on your use case and data needs, different settings will perform very differently. Let's walk through the performance essentials of Kafka. Let's talk about how your consumer configuration can speed up or slow down the flow of messages to brokers. Let's talk about message keys, their implications, and their impact on partition performance. Let's talk about how to figure out how many partitions and how many brokers you should have. Let's discuss consumers and what affects their performance. How do you combine all of these choices and develop the best strategy moving forward? How do you test the performance of Kafka? I will attempt a live demo with the help of Zeppelin to show in real time how to tune for performance.
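A hedged kafka-python sketch of a few of those knobs; the broker, topic, and values are illustrative starting points rather than recommendations. Keyed sends pin related messages to one partition, the batching settings trade latency for throughput on the producer, and the fetch settings control how eagerly the consumer pulls data.

```python
# Producer batching/durability settings plus consumer fetch settings.
# All connection details and numbers below are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                     # durability over latency
    linger_ms=20,                   # wait briefly so batches fill up
    batch_size=64 * 1024,           # bytes per partition batch
    compression_type="gzip",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Same key -> same partition, so per-key ordering is preserved.
producer.send("orders", key="user-42", value={"item": "book", "qty": 1})
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
    fetch_min_bytes=1024 * 1024,    # let the broker accumulate ~1 MB per fetch
    fetch_max_wait_ms=500,          # ...but wait at most 500 ms
    max_poll_records=1000,          # records handed back per poll() call
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for records in consumer.poll(timeout_ms=1000).values():
    for r in records:
        print(r.key, r.value)
```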
The document discusses LinkedIn's data ecosystem and the challenge of bridging operational transactional data (OLTP) with analytical processing (OLAP) at scale. It describes LinkedIn's solution called Lumos, which is a scalable ETL framework that uses change data capture, delta processing, and virtual snapshots to frequently refresh petabyte-scale data from OLTP databases into Hadoop for OLAP. Lumos supports requirements like handling multiple data centers, schema evolution, and efficient change capture while ensuring data consistency and low latency refresh times.
Alluxio Day VI
October 12, 2021
https://www.alluxio.io/alluxio-day/
Speaker:
Vinoth Chandar, Apache Software Foundation
Raymond Xu, Zendesk
Make your SharePoint fly by tuning and optimizing SQL Server (serge luca)
This document summarizes a presentation on optimizing SQL Server for SharePoint. It discusses basic SharePoint database concepts, planning for long-term performance by optimizing resources like CPU, RAM, disks and network latency. It also covers optimal SQL Server configuration including installation, database settings like recovery models and file placement. Maintaining databases through tools like DBCC CheckDB and measuring performance using counters and diagnostic queries is also presented. The presentation emphasizes the importance of collaboration between SharePoint and database administrators to ensure compliance and optimize performance.
SQL Server is really the brain of SharePoint, yet the default settings of SQL Server are not optimised for SharePoint. In this session, Serge Luca (SharePoint MVP) and Isabelle Van Campenhoudt (SQL Server MVP) will give you an overview of what every SQL Server DBA needs to know about configuring, monitoring and setting up SQL Server for SharePoint 2013. After a quick description of the SharePoint architecture (sites, site collections, ...), we will describe the different types of SharePoint databases and their specific configuration settings, some do's and don'ts specific to SharePoint, and the disaster recovery options for SharePoint, including (but not only) SQL Server AlwaysOn Availability Groups for high availability and disaster recovery, in order to achieve an optimal level of business continuity.
Benefits of Attending this Session:
Tips & tricks
Lessons learned from the field
Super return on Investment
Based on the popular blog series, join me in taking a deep dive and a behind the scenes look at how SQL Server 2016 “It Just Runs Faster”, focused on scalability and performance enhancements. This talk will discuss the improvements, not only for awareness, but expose design and internal change details. The beauty behind ‘It Just Runs Faster’ is your ability to just upgrade, in place, and take advantage without lengthy and costly application or infrastructure changes. If you are looking at why SQL Server 2016 makes sense for your business you won’t want to miss this session.
This document discusses considerations for large-scale SharePoint deployments on SQL Server. It provides examples of real-world deployments handling over 10TB of content. It covers understanding SharePoint databases, SQL performance tuning, and architectural design best practices. These include separating databases onto unique volumes, optimizing TempDB, maintaining around 100GB per content database, and using RAID 10 for performance. Statistical results are presented from deployments handling over 70 million documents loaded in under 12 days with expected performance.
SharePoint and Large Scale SQL Deployments - NZSPC (guest7c2e070)
This document discusses considerations for large-scale SharePoint deployments on SQL Server. It provides examples of real-world deployments handling over 10TB of content. It discusses database types, performance issues like indexing and backups, and architectural design best practices like separating databases onto unique volumes. It also provides statistics on deployments handling over 70 million documents and 40TB of content across multiple farms and databases.
Fluentd is an open source data collector that allows flexible data collection, processing, and output. It supports streaming data from sources like logs and metrics to destinations like databases, search engines, and object stores. Fluentd's plugin-based architecture allows it to support a wide variety of use cases. Recent versions of Fluentd have added features like improved plugin APIs, nanosecond time resolution, and Windows support to make it more suitable for containerized environments and low-latency applications.
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPathCommunity
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
Perfect for developers, testers, and automation enthusiasts!
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understand the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Mobile App Development Company in Saudi ArabiaSteve Jonas
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
HCL Nomad Web – Best Practices and Administration of Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is celebrated as the next generation of the HCL Notes client and offers numerous advantages, such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed "automatically" in the background, which significantly reduces administrative effort compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser's cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Utilizing Client Clocking
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Impelsys Inc.
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxJustin Reock
Building 10x Organizations with Modern Productivity Metrics
10x developers may be a myth, but 10x organizations are very real, as proven by the influential study performed in the 1980s, ‘The Coding War Games.’
Right now, here in early 2025, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience. It seems that with every new method we invent for the delivery of products, whether physical or virtual, we reinvent productivity philosophies to go alongside them.
But which of these approaches actually work? DORA? SPACE? DevEx? What should we invest in and create urgency behind today, so that we don’t find ourselves having the same discussion again in a decade?
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB
Want to learn practical tips for designing systems that can scale efficiently without compromising speed?
Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development.
As you explore key principles of designing low-latency systems with Rust, you will learn how to:
- Create and compile a real-world app with Rust
- Connect the application to ScyllaDB (NoSQL data store)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
How Can I use the AI Hype in my Business Context?Daniel Lehner
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a Linkedin webinar for Tecnovy on 28.04.2025.
Role of Data Annotation Services in AI-Powered ManufacturingAndrew Leo
From predictive maintenance to robotic automation, AI is driving the future of manufacturing. But without high-quality annotated data, even the smartest models fall short.
Discover how data annotation services are powering accuracy, safety, and efficiency in AI-driven manufacturing systems.
Precision in data labeling = Precision on the production floor.
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...Alan Dix
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and education, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
6. We All Like A Nimble Elephant
Question: Can we get fresh data, directly on a petabyte-scale Hadoop Data Lake?
7. Previously on .. Strata (2016)
Hadoop @ Uber
“Uber, your Hadoop has arrived: Powering Intelligence for Uber’s Real-time marketplace”
8. Partitioned by day trip started
[Diagram: the trips dataset uses day-level partitions keyed by the day the trip started (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16). Every 5 min, new/updated trips arrive; late-arriving updates touch a few older partitions (updated data) via incremental update, while most partitions remain unaffected and only the latest partitions receive new data.]
Motivation
10. Exponential Growth is fun ..
Hadoop @ Uber
Also extremely hard to keep up with …
Common Pitfalls
- Long waits in the queue
- Disks running out of space
- Massive re-computations
- Batch jobs that grow too big and fail
11. Let’s go back 30 years
How did RDBMS-es solve this?
• Update existing row with new value (Transactions)
• Consume a log of changes downstream (Redo log)
• Update again downstream
[Diagram: updates are applied to MySQL (Server A); MySQL (Server B) pulls the redo log and applies the transformation downstream.]
Important Differences
• Columnar file formats
• Read-heavy workloads
• Petabytes & 1000s of servers
Concepts
12. Challenging Status Quo
[Diagram: trip rows are replicated from the database as a changelog via Kafka into HBase (new/updated trip rows). Today, the compacted trips table is rebuilt by batch snapshot recompute in 6-10 hr (500-1000), and Presto-queried derived tables arrive 12-18+ hr later, or in ~8 hr as approximations. With Hoodie.upsert() (upsert + logging), ingestion takes 1 hr (100) today and is targeted at 10 min (50) by Q2 '17; derived tables built via Hoodie.incrPull() (~2 mins to pull) complete in 1-3 hr with 10x less resources, and are accurate.]
Motivation
13. Incremental Processing
Advantages: Increased Efficiency / Leverage Hadoop SQL Engines / Simplify Architecture
Upsert (Primitive #1)
• Modify processed results
• Analogous to updating kv stores in stream processing
Incremental Pull (Primitive #2)
• Log stream of changes, avoid costly scans
• Enable chaining of processing steps into a DAG
For more, see “Case For Incremental Processing on Hadoop” (link)
Hoodie Concepts
14. Introducing Hoodie
Open Source
- https://github.com/uber/hoodie
- eng.uber.com/hoodie
Spark Library For Upserts & Incrementals
- Scales horizontally like any job
- Stores dataset directly on HDFS
Storage Abstraction to
- Apply mutations to dataset
- Pull changelog incrementally
Hoodie Concepts
[Diagram: an incoming changelog is upserted into a large HDFS dataset via Spark; the dataset is queryable as a Hive table (normal queries), and the changelog can be pulled incrementally from Hive/Spark/Presto.]
16. Hoodie: Storage Types & Views
Hoodie Concepts
Views : How is Data read?
Read Optimized View
- Parquet Query Performance
- ~30 mins latency for ~500GB
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incremental Pull
Storage Type : How is Data stored?
Copy On Write
- Purely columnar
- Simply creates new versions of files
Merge On Read
- Near-real time
- Shifts some write cost to reads
- Merges on-the-fly
17. Hoodie: Storage Types & Views
Hoodie Concepts
Storage Type: Supported Views
- Copy On Write: Read Optimized, Log View
- Merge On Read: Read Optimized, Real Time, Log View
18. Storage: Basic Idea
[Diagram: an input changelog is upserted into a Hoodie dataset laid out as day partitions (2017/02/15 - 2017/02/17). An index tags incoming records; with Copy On Write, affected parquet files are rewritten as new versions (File1.parquet → File1_v1.parquet → File1_v2.parquet), e.g. a 200 GB / 30 min batch; with Merge On Read, a smaller 10 GB / 5 min batch is instead appended to a per-file log (File1.avro.log) and compacted later.]
Example dataset and batch characteristics:
● 1825 Partitions (365 days * 5 yrs)
● 100 GB Partition Size
● 128 MB File Size
● ~800 Files Per Partition
● Skew spread - 0.5% (single batch); New Files - 0.005% (single batch)
● 20 seconds to re-write 1 File (shuffle)
● Copy On Write: 7300 Files rewritten, 100 executors, ~24 minutes to write
● Merge On Read: ~8 new Files, 10 executors, ~2 minutes to write
Deep Dive
19. Index and Storage
Index
- Tag ingested record as update or insert
- Index is immutable (record key to File mapping never changes)
- Pluggable
- Bloom Filter (see the tagging sketch after this slide)
- HBase
Storage
- HDFS Block aligned files
- ROFormat - Default is Apache Parquet
- WOFormat - Default is Apache Avro
Deep Dive
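To make the update/insert tagging above concrete (see the Bloom Filter bullet), here is a minimal, illustrative sketch of bloom-filter based tagging in Java. It does not use Hoodie's actual index classes; the BloomIndexSketch name, the per-file filter map, and the use of Guava's BloomFilter are assumptions made purely for illustration.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: tag each incoming record key as INSERT or POSSIBLE_UPDATE
// by probing per-file bloom filters built from the record keys stored in each file.
public class BloomIndexSketch {

  enum Tag { INSERT, POSSIBLE_UPDATE }

  // fileId -> bloom filter over the record keys contained in that file
  private final Map<String, BloomFilter<CharSequence>> fileBloomFilters = new HashMap<>();

  public void addFile(String fileId, List<String> recordKeysInFile) {
    BloomFilter<CharSequence> filter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), recordKeysInFile.size(), 0.01);
    recordKeysInFile.forEach(filter::put);
    fileBloomFilters.put(fileId, filter);
  }

  // No filter matches -> definite insert. A match is only a candidate update,
  // because bloom filters can return false positives; the candidate file's actual
  // keys still need to be checked before treating the record as an update.
  public Tag tag(String recordKey) {
    for (BloomFilter<CharSequence> filter : fileBloomFilters.values()) {
      if (filter.mightContain(recordKey)) {
        return Tag.POSSIBLE_UPDATE;
      }
    }
    return Tag.INSERT;
  }
}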
20. Concurrency
● Multi-row atomicity
● Strong consistency (Same as HDFS guarantees)
● Single Writer - Multiple Consumer pattern
● MVCC for isolation
○ Running queries execute concurrently with ingestion
Deep Dive
21. Data Skew
Why is skew a problem?
- Spark 2GB Remote Shuffle Block limit
- Straggler problem
Hoodie handles data skew automatically
- Index lookup skew
- Data write skew handled by auto sub-partitioning
- Allocate sub-partitions (file IDs) based on historical commit stats (see the sketch after this slide)
- Morph inserts into updates to fix small files
Deep Dive
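As referenced in the sub-partitioning bullet above, here is a rough, hedged sketch of splitting a batch of inserts into file groups using historical commit statistics, so that no single write task becomes a straggler. The InsertBucketing name, the average-record-size input, and the 128 MB target file size are illustrative assumptions, not Hoodie's actual partitioner.

// Illustrative sketch: decide how many insert buckets (future file IDs) a batch needs,
// using the average record size observed in past commits so each bucket stays close
// to the target file size.
public final class InsertBucketing {

  private InsertBucketing() {}

  public static int numInsertBuckets(long numInsertRecords,
                                     long avgRecordSizeBytesFromCommitStats,
                                     long targetFileSizeBytes) {
    long totalBytes = numInsertRecords * avgRecordSizeBytesFromCommitStats;
    // ceiling division, with at least one bucket
    return (int) Math.max(1, (totalBytes + targetFileSizeBytes - 1) / targetFileSizeBytes);
  }

  public static void main(String[] args) {
    // e.g. 10M inserts, ~1 KB per record (from commit stats), 128 MB target files
    int buckets = numInsertBuckets(10_000_000L, 1_024L, 128L * 1024 * 1024);
    System.out.println("Spread inserts over " + buckets + " file groups");
  }
}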
22. Compaction
Essential for Query performance
- Merge Write Optimized row format with Scan Optimized column format
Scheduled asynchronously to Ingestion
- Ingestion already groups updates per File Id
- Locks down versions of log files to compact
- Pluggable strategy to prioritize compactions (see the sketch after this slide)
- Base File to Log file size ratio
- Recent partitions compacted first
Deep Dive
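As referenced above, a hedged sketch of what a pluggable compaction-prioritization strategy could look like: order file groups by how large their pending log is relative to the base file, and prefer recent partitions. The CompactionCandidate descriptor and the tie-breaking rule are invented here for illustration.

import java.util.Comparator;
import java.util.List;

// Illustrative only: compact file groups with a high log-to-base size ratio first,
// breaking ties in favor of more recent (lexicographically larger) day partitions.
public class CompactionPrioritySketch {

  // Hypothetical candidate descriptor (not a Hoodie class).
  static class CompactionCandidate {
    final String partitionPath;  // e.g. "2017/04/16"
    final long baseFileBytes;    // size of the columnar base file
    final long logFileBytes;     // total size of the pending row-based log files

    CompactionCandidate(String partitionPath, long baseFileBytes, long logFileBytes) {
      this.partitionPath = partitionPath;
      this.baseFileBytes = baseFileBytes;
      this.logFileBytes = logFileBytes;
    }

    double logToBaseRatio() {
      return baseFileBytes == 0 ? Double.MAX_VALUE : (double) logFileBytes / baseFileBytes;
    }
  }

  static void prioritize(List<CompactionCandidate> candidates) {
    candidates.sort(
        Comparator.comparingDouble(CompactionCandidate::logToBaseRatio).reversed()
            .thenComparing(c -> c.partitionPath, Comparator.<String>reverseOrder()));
  }
}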
23. Failure recovery
Automatic recovery via Spark RDD
- Resilient Distributed Datasets!!
No Partial writes
- Commit is atomic
- Auto rollback last failed commit
Rollback specific commits
Savepoints/Snapshots
Deep Dive
24. Hoodie Write API
// WriteConfig contains basePath of hoodie dataset (among other configs); a usage sketch follows this listing
HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig)
// Start a commit and get a commit time to atomically upsert a batch of records
String startCommit()
// Upsert the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Bulk load the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> bulkInsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Choose to commit
boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses)
// Rollback
boolean rollback(final String commitTime) throws HoodieRollbackException
Deep Dive
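A minimal usage sketch of the client API listed above, wiring the calls together in the order the signatures suggest (start a commit, upsert, then commit or roll back). Package imports for the Hoodie classes are deliberately omitted because the slides do not show them; treat this as an illustration of the call sequence rather than a drop-in snippet.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hoodie classes (HoodieWriteClient, HoodieWriteConfig, HoodieRecord, WriteStatus)
// are referenced by simple name only; resolve them against the hoodie library
// linked earlier in the deck.
public class UpsertJobSketch {

  public static <T> void ingest(JavaSparkContext jsc,
                                HoodieWriteConfig config,
                                JavaRDD<HoodieRecord<T>> records) throws Exception {
    HoodieWriteClient client = new HoodieWriteClient(jsc, config);

    // 1. Obtain a commit time; files written in this batch are tagged with it.
    String commitTime = client.startCommit();

    // 2. Upsert the batch; each WriteStatus reports per-record success/failure.
    JavaRDD<WriteStatus> statuses = client.upsert(records, commitTime);

    // 3. Atomically publish the batch to queries, or roll it back on failure.
    if (!client.commit(commitTime, statuses)) {
      client.rollback(commitTime);
    }
  }
}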
25. Hoodie Record
HoodieRecordPayload (an example implementation is sketched after this slide)
// Combine Existing value with New incoming value and return the combined value
○ IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);
// Get the Avro IndexedRecord for the dataset schema
○ IndexedRecord getInsertValue(Schema schema);
Deep Dive
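To illustrate the payload abstraction, here is a hedged example of a "latest write wins" payload shaped after the two methods above. It deliberately does not implement Hoodie's actual interface (its package is not shown on the slide); the OverwriteLatestPayload name and the plain Avro IndexedRecord field are assumptions for the example.

import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

// Example "latest write wins" payload: the incoming Avro record simply replaces
// whatever is currently stored for the same record key.
public class OverwriteLatestPayload {

  private final IndexedRecord incoming;

  public OverwriteLatestPayload(IndexedRecord incoming) {
    this.incoming = incoming;
  }

  // Combine the existing value with the new incoming value; here the new value wins.
  public IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
    return incoming;
  }

  // Value to write when the record key has not been seen before (an insert).
  public IndexedRecord getInsertValue(Schema schema) {
    return incoming;
  }
}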
27. Hoodie Views
Hoodie Views
[Chart: trade-off between query execution time and data latency for the Read Optimized and Real Time views.]
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- ~30 mins latency for ~500GB
- Targets existing Hive tables
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Data Pipelines
28. Hoodie Views
[Diagram: the same partitioned dataset (2017/02/15 - 2017/02/17), ingesting a 10 GB changelog every 5 min, is exposed through Hive as three tables: a Read Optimized table over the compacted columnar file versions (File1.parquet, File1_v1.parquet, File1_v2.parquet), a Real Time table that also merges the pending log (File1.avro.log), and an Incremental Log table exposing the stream of changes.]
Hoodie Views
29. Read Optimized View
InputFormat picks only Compacted Columnar Files
Optimized for faster query runtime over data latency
- Plug into getSplits to filter out older versions (see the sketch after this slide)
- All optimizations for reading parquet apply (vectorized reads, etc.)
Data latency is the frequency of compaction
Works out of the box with Presto and Apache Spark
Hoodie Views
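As referenced in the getSplits bullet above, a simplified sketch of keeping only the latest committed version of each file group. The <fileId>_<commitTime>.parquet naming and the string parsing are assumptions made for illustration; the real InputFormat filters using Hoodie's commit metadata rather than file names.

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: pick the newest version per file group, assuming file names
// follow a "<fileId>_<commitTime>.parquet" pattern for the purpose of this sketch.
public class LatestVersionFilterSketch {

  public static List<String> filterLatestVersions(Collection<String> fileNames) {
    Map<String, String> latestByFileId = new HashMap<>();
    for (String name : fileNames) {
      String fileId = fileIdOf(name);
      String current = latestByFileId.get(fileId);
      // Commit times are compared as strings; newer versions replace older ones.
      if (current == null || commitTimeOf(name).compareTo(commitTimeOf(current)) > 0) {
        latestByFileId.put(fileId, name);
      }
    }
    return new ArrayList<>(latestByFileId.values());
  }

  private static String base(String fileName) {
    return fileName.substring(0, fileName.length() - ".parquet".length());
  }

  private static String fileIdOf(String fileName) {
    String b = base(fileName);
    return b.substring(0, b.lastIndexOf('_'));
  }

  private static String commitTimeOf(String fileName) {
    String b = base(fileName);
    return b.substring(b.lastIndexOf('_') + 1);
  }
}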
31. Real Time View
InputFormat merges ROFile with WOFiles at query runtime (see the merge sketch after this slide)
Custom RecordReader
- Logs are grouped per FileID
- Single split is usually a single FileID in Hoodie (Block Aligned files)
Latency is the frequency of ingestion (mini-batches)
Works out of the box with Presto and Apache Spark
- Specialized parquet read path optimizations not supported
Hoodie Views
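As a hedged illustration of the merge the Real Time view performs at query time, here is a minimal sketch that overlays the pending log records of a file group onto its base-file records, with the latest write winning per record key. The Map-based representation is an assumption for the example; the real record reader works on Avro log blocks and parquet readers.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: merge base-file records with pending log records on the fly;
// for a given record key the log (more recent) value wins.
public class RealTimeMergeSketch {

  public static <V> Map<String, V> mergeOnRead(Map<String, V> baseFileRecords,
                                               Map<String, V> logRecords) {
    Map<String, V> merged = new HashMap<>(baseFileRecords);
    merged.putAll(logRecords); // later (log) values overwrite base values per key
    return merged;
  }
}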
32. Incremental Log View
[Diagram: the same day-partitioned trips dataset as in the motivation (2010-2014 through 2017/04/16), receiving new/updated trips every 5 min; the Log View exposes just those incremental changes for Incr Pull, without touching unaffected partitions.]
Hoodie Views
33. Incremental Log View
Pull ONLY changed records in a time range using SQL (see the query sketch after this slide)
- e.g. _hoodie_commit_time > ‘startTs’ AND _hoodie_commit_time <= ‘endTs’
Avoid full table/partition scan
Do not rely on a custom sequence ID to tail
Lookback window restricted based on cleaning policy
Hoodie Views
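As referenced above, a sketch of what such a time-range pull could look like from Spark SQL over the Hive-registered log view. The table name hoodie_trips_log and the commit-time literals are placeholders; only the _hoodie_commit_time predicate comes from the slide.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: pull only the records that changed between two commit times by filtering
// on _hoodie_commit_time, instead of scanning the whole table or partition.
public class IncrementalPullSketch {

  public static Dataset<Row> pullChanges(SparkSession spark, String startTs, String endTs) {
    // hoodie_trips_log is a placeholder name for the Hive-registered log view.
    return spark.sql(
        "SELECT * FROM hoodie_trips_log "
            + "WHERE _hoodie_commit_time > '" + startTs + "' "
            + "AND _hoodie_commit_time <= '" + endTs + "'");
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hoodie-incr-pull-sketch")
        .enableHiveSupport()
        .getOrCreate();
    // Placeholder commit-time strings; substitute real commit times from the dataset's timeline.
    pullChanges(spark, "20170416103000", "20170416110000").show();
  }
}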
37. Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within a few minutes
- Offload analytics to HDFS
Incremental ETL processing
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Use Cases
39. Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within a few minutes
- Offload analytics to HDFS
Incremental ETL processing
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Unified Analytical Serving Layer
- Eliminate your specialized serving layer, if the tolerated latency is > 10 min
- Simplify serving with HDFS for the entire dataset
Use Cases
43. Adoption @ Uber
Use Cases
Powering ~1000 data ingestion feeds
- Every 30 mins today, several TBs per hour
- Towards < 10 min in the next few months
Reduced resource usage by 10x
- In production for last 6 months
- Hardened across rolling restarts, data node reboots
Incremental ETL for dimension tables
- Data warehouse at large
Future
- Self serve incremental pipelines (DeltaStreamer)
44. Comparison
Hoodie fills a big void in Hadoop land
- Upserts & Faster data
Plays well with Hadoop ecosystem & deployments
- Leverage Spark vs re-inventing yet-another storage silo
Designed for Incremental Processing
- Incremental Pull is a ‘Hoodie’ special
Comparison
45. Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem
Comparison: Analytical Storage
Hoodie Views
46. Comparison
Apache Kudu
- Targets both OLTP and OLAP
- Dedicated storage servers
- Evolving Ecosystem support*
Hoodie
- OLAP Only
- Built on top of HDFS
- Already works with Spark/Hive/Presto
Hive Transactions
- Tight integration with Hive & ORC
- No read-optimized view
- Hive based impl
Hoodie
- Hive/Spark/Presto
- Parquet/Avro today, but pluggable
- Power of Spark!
Comparison
47. Comparison
HBase/Key-Value Stores
- Write Optimized for OLTP
- Bad Scan Performance
- Scaling farm of storage servers
- Multi row atomicity is tedious
Hoodie
- Read-Optimized for OLAP
- State-of-art columnar formats
- Scales like a normal job or query
- Multi row commits!!
Stream Processing
- Row oriented processing
- Flink/Spark typically upsert results to OLTP/specialized OLAP stores
Hoodie
- Columnar queries, at higher latency
- HDFS as Sink, Presto as OLAP engine
- Integrates with Spark/Spark Streaming
Comparison
48. Future Plans
Merge On Read (Project #1)
- Active development, productionizing, shipping!
Global Index (Project #2)
- Fast, lightweight index to map key to fileID, globally (not just partitions)
Spark Datasource (Issue #7) & Presto Plugins (Issue #81)
- Native support for incremental SQL (e.g. where _hoodie_commit_time > ... )
Beam Runner (Issue #8)
- Build incremental pipelines that also port across batch or streaming modes
Future
49. Getting Involved
Engage with us on Github
- Look for “beginner-task” tagged issues
- Check out tools/utilities
Uber is hiring for “Hoodie”
- “Software Engineer - Data Processing Platform (Hoodie)”
- https://www.uber.com/careers/list/28811/
Swing by Office Hours after talk
- 2:40pm–3:20pm, Location: Table B
Contributions
52. Hoodie Views
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- ~30 mins latency for ~500GB
- Targets existing Hive tables
Hoodie Concepts
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Data Pipelines
53. Hoodie Storage Types
Define how data is written
- Indexing & Storage of data
- Impl of primitives and timeline actions
- Support 1 or more views
2 Storage Types
- Copy On Write: Purely columnar, simply creates new versions of files
- Merge On Read: Near-real time, shifts some write cost to reads, merges on-the-fly
Hoodie Concepts
Storage Type: Supported Views
- Copy On Write: Read Optimized, Log View
- Merge On Read: Read Optimized, Real Time, Log View
55. Timeline Actions
Commit
- Multi-row atomic publish of data to Queries
- Detailed metadata to facilitate log view of changes
Clean
- Remove older versions of files, to reclaim storage space
- Cleaning modes: Retain Last X file versions, Retain Last X Commits (see the sketch after this slide)
Compaction
- Compact row based log to columnar snapshot, for real-time view
Savepoint
- Roll back to a checkpoint and resume ingestion
Hoodie Concepts
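As referenced in the cleaning-modes bullet above, a small sketch of the "Retain Last X file versions" mode: for each file group, every version beyond the newest X becomes eligible for deletion. The CleanPolicySketch name and the commit-time strings are illustrative; commit-based retention would compare against the commit timeline instead.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the "retain last X file versions" cleaning mode: everything except the
// newest retainLast versions of a file group becomes eligible for deletion.
public class CleanPolicySketch {

  // Returns the versions (identified here by commit time) that can be cleaned up.
  public static List<String> versionsToClean(List<String> commitTimesForFileGroup, int retainLast) {
    List<String> sorted = new ArrayList<>(commitTimesForFileGroup);
    sorted.sort(Comparator.reverseOrder()); // newest first (assumes commit times sort lexicographically)
    return sorted.size() <= retainLast
        ? new ArrayList<>()
        : new ArrayList<>(sorted.subList(retainLast, sorted.size()));
  }
}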
56. Hoodie Terminology
● Basepath: Root of a Hoodie dataset
● Partition Path: Relative path to folder with partitions of data
● Commit: Produce files identified with fileid & commit time
● Record Key:
○ Uniquely identify a record within partition
○ Mapped consistently to a fileid
● File Id Group: Files with all versions of a group of records
● Metadata Directory: Stores a timeline of all metadata actions, which are atomically published
Deep Dive
59. Hoodie Write Path
Deep Dive
[Diagram: a Spark application uses the Hoodie Spark client to tag the incoming stream against a (persistent) index and then save the data layout and metadata in HDFS; on the query side, HoodieInputFormat gets the latest commit and then filters and merges file versions.]
#9: Talk about why updates are needed before going to the previous generation, which used HBase to solve mutations
#18: 2 storage types and 3 views
Copy on Write is the first version of storage
Provides 2 views - RO and LogView
Merge on Read is a strict superset of Copy on Write
Provides RealTime view in addition (1 liner - More recent data with cost of merge pushed on to query execution)
#19: Visualization of Storage Types
Talk about a basic parquet dataset laid out in HDFS
We want to ingest, say, 200GB of data and upsert it into this dataset
How do we support upsert primitive
First we need to tag updates and inserts - introduce index
Introduce multi version - to write out updates
Talk about how / why batch sizes matter - amortization - write amplification
Go over the numbers
30 minutes of queued data takes 30 minutes to ingest - 1 hour SLA
We wanted to take on more workloads by pushing that SLA even further down
Have a differential structure - a log of updates queued for a single file
Stream updates into the log file
compaction happens once in a while - compaction becomes similar to previous ingestion flow
Run through the change in numbers
#20: Index should be super quick - Pure Overhead
Block Aligned Files - Balance compaction and query parallelism
#21: Lets talk about some of the challenges/Features of storing the data in the above format
#22: Explain hotspotting and 2GB Limit
Skew could be during index lookup or during data write
Custom partitioning which takes statistics of commits to determine the appropriate number of subpartitions
Auto Corrections of file sizes
#24: Spark RDD has automatic recovery and retries computations
Avro Log maintains the offset to the block and a partially written block will be skipped
SavePoints to rollback and re-ingest
#25: Talk about SparkContext and Config - Index, Storage Formats, Parallelism
StartCommit - Token
#25: Talk about what a hoodie record is and the record payload abstraction
#27: Talk briefly about metadata storage.
Bring attention towards the views.
#28: A view is a inputformat - 3 different hive tables are essentially registered pointing to the same HDFS dataset
#29: Recap the storage briefly
Introduce one view after next and explain how it works
Explain about hive - query plan generation
#30: Explain InputFormats for each view
Explain how read optimized inputformat works - generate query plan - getsplits - filter
Talk about optimizing for query runtime - chosen when the compaction data latency is good enough
Talk about hive metastore registration
#57: Hoodie further partitions the HDFS directory partitions to a finer granularity
Subpartitioned as <Partition Path, File Id>
Record Key <==> <Partition Path, File Id> is immutable
Dynamic sub partition automatically handles data skew
Fundamental unit of compaction is rewriting a single File Id
Sub partitioning is used for ingestion only
Query engines only see HDFS directory partitions