Hoodie
Incremental Processing Framework
Vinoth Chandar | Prasanna Rajaperumal
Who Are We
Staff Software Engineer, Uber
• LinkedIn : Voldemort k-v store,
Stream processing
• Oracle : Database replication, CEP
Senior Software Engineer, Uber
• Cloudera : Data Pipelines, Log analysis
• Cisco : Complex Event Processing
Agenda
• Hadoop @ Uber
• Motivation & Concepts
• Deep Dive
• Use-Cases
• Comparisons
• Future Plans
Adoption & Scale
~Few Thousand
Servers
Many Many
PBs
~20k
Hive
queries/day
~100k
Presto
queries/day
~100k
Jobs/day
Hadoop @ Uber
~100
Spark
Apps
Hadoop Use-cases
Analytics
• Dashboards
• Ad Hoc-Analysis
• Federated Querying
• Interactive Analysis
Hadoop @ Uber
Data Applications
• ML Recommendations
• Fraud Detection
• Safe Driving
• Incentive Spends
Data Warehousing
• Curated Datafeeds
• Standard ETL
• DataLake => DataMart
Presto Spark Hive
Faster Data! Faster Data! Faster Data!
We All Like A Nimble Elephant
Question: Can we get fresh data, directly on a petabyte-scale
Hadoop Data Lake?
Previously on .. Strata (2016)
Hadoop @ Uber
“Uber, your Hadoop has arrived: Powering Intelligence for Uber’s
Real-time marketplace”
Partitioned by day trip started
[Diagram: day-level partitions 2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16; every 5 min, new/updated trips arrive: new data lands in the latest partition, late-arriving updates touch older partitions, and the rest of the data is unaffected]
Motivation
NoSQL/DB Ingestion: Status Quo
[Diagram: the database changelog flows through Kafka into HBase, where new/updated trip rows are upserted; a batch recompute then snapshots the trips (compacted) table, growing from 6 hr (500 executors) in Jan to 8 hr (800 executors) in Apr to 10 hr (1000 executors) in Aug; replicated trip rows reach Presto-queryable derived tables 12-18+ hr later, while a logging-based approximation arrives in 8 hr]
Motivation
Exponential Growth is fun ..
Hadoop @ Uber
Also extremely hard to keep up with …
Common Pitfalls
- Long waits for queue
- Disks running out of space
- Massive re-computations
- Batch jobs that grow too big and fail
Let’s go back 30 years
How did RDBMS-es solve this?
• Update existing row with new value (Transactions)
• Consume a log of changes downstream (Redo log)
• Update again downstream
Concepts
[Diagram: MySQL (Server A) takes updates; MySQL (Server B) pulls the redo log, applies a transformation, and updates downstream]
Important Differences
• Columnar file formats
• Read-heavy workloads
• Petabytes & 1000s of servers
Challenging Status Quo
[Diagram: the same pipeline with Hoodie in place; the snapshot batch recompute of the trips (compacted) table (6 hr/500, 8 hr/800, 10 hr/1000 executors) becomes Hoodie.upsert() at 1 hr (100 executors) today and 10 min (50 executors) targeted for Q2 ‘17; derived tables that took 12-18+ hr via batch (or an 8 hr logging approximation) are instead fed by Hoodie.incrPull() [2 mins to pull] in 1 hr - 3 hr with 10x less resources, and the results are accurate, not approximate]
Motivation
Incremental Processing
Advantages: Increased Efficiency / Leverage Hadoop SQL Engines/ Simplify Architecture
Hoodie Concepts
Upsert (Primitive #1)
• Modify processed results
• Like kv stores in stream processing
Incremental Pull (Primitive #2)
• Log stream of changes, avoid costly scans
• Enable chaining processing in a DAG
For more, see “Case For Incremental Processing on Hadoop” (link)
Introducing
Hoodie
Open Source
- https://github.com/uber/hoodie
- eng.uber.com/hoodie
Spark Library For Upserts & Incrementals
- Scales horizontally like any job
- Stores dataset directly on HDFS
Storage Abstraction to
- Apply mutations to dataset
- Pull changelog incrementally
Hoodie Concepts
[Diagram: changelogs are upserted via Spark into a large HDFS dataset; the dataset is exposed as a Hive table for normal queries, and the changelog can be pulled incrementally from Hive/Spark/Presto]
Hoodie: Overview
Hoodie Concepts
[Diagram: the Hoodie WriteClient (Spark) stores and indexes data on HDFS as data files plus index and timeline metadata; Hive queries, Presto queries, and Spark DAGs read that data back through the storage type’s supported views]
Hoodie: Storage Types & Views
Hoodie Concepts
Views : How is Data read?
Read Optimized View
- Parquet Query Performance
- ~30 mins latency for ~500GB
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incremental Pull
Storage Type : How is Data stored?
Copy On Write
- Purely columnar
- Simply creates new versions of files
Merge On Read
- Near-real time
- Shifts some write cost to reads
- Merges on-the-fly
Hoodie: Storage Types & Views
Hoodie Concepts
Storage Type | Supported Views
Copy On Write | Read Optimized, Log View
Merge On Read | Read Optimized, Real Time, Log View
Storage: Basic Idea
[Diagram: an input changelog is applied to day partitions 2017/02/15 - 2017/02/17; Copy On Write (200 GB, 30-min batches) uses the index to rewrite File1.parquet as File1_v2.parquet, while Merge On Read (10 GB, 5-min batches) appends File1’s updates to File1.avro.log, compacted later into versioned parquet files]
Copy On Write (back-of-envelope):
● 1825 Partitions (365 days * 5 yrs)
● 100 GB Partition Size, 128 MB File Size, ~800 Files Per Partition
● Skew spread - 0.5% of files touched by a single batch (1825 * 800 * 0.5% ≈ 7300 files)
● 20 seconds to re-write 1 File (shuffle), 100 executors
● 7300 Files rewritten, so 7300 * 20 s / 100 executors ≈ 24 minutes to write
Merge On Read (same layout):
● New Files - 0.005% (single batch)
● 10 executors
● ~8 new Files
● ~2 minutes to write
Deep Dive
Index and Storage
Index
- Tag ingested record as update or insert
- Index is immutable (record key to File mapping never changes)
- Pluggable
- Bloom Filter
- HBase
Storage
- HDFS Block aligned files
- ROFormat - Default is Apache Parquet
- WOFormat - Default is Apache Avro
Deep Dive
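To make the tagging step concrete, here is a minimal sketch of the bloom-filter flavor of the index (hypothetical helper classes, with Guava’s BloomFilter standing in; this is not Hoodie’s actual index code). Each file carries a bloom filter over its record keys, so most files are ruled out without being opened:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class BloomIndexSketch {
  // Hypothetical per-file state: a bloom filter over the record keys stored in that file.
  static class FileSlice {
    final String fileId;
    final BloomFilter<CharSequence> keyFilter;
    final Set<String> keys; // stand-in for actually reading the file's key column
    FileSlice(String fileId, Collection<String> fileKeys) {
      this.fileId = fileId;
      this.keys = new HashSet<>(fileKeys);
      this.keyFilter = BloomFilter.create(
          Funnels.stringFunnel(StandardCharsets.UTF_8), fileKeys.size(), 0.01);
      fileKeys.forEach(keyFilter::put);
    }
  }

  // Tag each incoming record key: non-null fileId means update, null means insert.
  static Map<String, String> tag(List<String> incomingKeys, List<FileSlice> partitionFiles) {
    Map<String, String> keyToFileId = new HashMap<>();
    for (String key : incomingKeys) {
      String fileId = null;
      for (FileSlice f : partitionFiles) {
        // Bloom filter prunes most files cheaply; confirm candidates against real keys.
        if (f.keyFilter.mightContain(key) && f.keys.contains(key)) { fileId = f.fileId; break; }
      }
      keyToFileId.put(key, fileId);
    }
    return keyToFileId;
  }
}

Because a record key maps to a fileId once and never moves, the per-file filter can be built at write time and reused by every subsequent batch.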
Concurrency
● Multi-row atomicity
● Strong consistency (Same as HDFS guarantees)
● Single Writer - Multiple Consumer pattern
● MVCC for isolation
○ Queries run concurrently with ingestion
Deep Dive
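A sketch of the MVCC idea with illustrative types (not Hoodie’s classes): writers publish file versions tagged with their commit time, and a reader pins the latest completed commit when it starts, so in-flight ingestion never surfaces mid-query:

import java.util.*;

public class MvccReadSketch {
  // A data file version, tagged with the commit that produced it.
  record FileVersion(String fileId, String commitTime, String path) {}

  // Per file group, readers see only the newest version whose commit completed
  // at or before the commit pinned when the query started.
  static List<FileVersion> readView(List<FileVersion> allVersions,
                                    SortedSet<String> completedCommits) {
    String pinned = completedCommits.last(); // latest completed commit at query start
    Map<String, FileVersion> latest = new HashMap<>();
    for (FileVersion v : allVersions) {
      if (!completedCommits.contains(v.commitTime())) continue; // skip in-flight/failed writes
      if (v.commitTime().compareTo(pinned) > 0) continue;       // skip commits after the pin
      latest.merge(v.fileId(), v,
          (a, b) -> a.commitTime().compareTo(b.commitTime()) >= 0 ? a : b);
    }
    return new ArrayList<>(latest.values());
  }
}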
Data Skew
Why is skew a problem?
- Spark 2GB Remote Shuffle Block limit
- Straggler problem
Hoodie handles data skew automatically
- Index lookup skew
- Data write skew handled by auto sub partitioning
- Allocate sub-partitions (file ID) based on historical commit stats
- Morph inserts as updates to fix small files
Deep Dive
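A sketch of the auto sub-partitioning idea (hypothetical numbers and types): an average record size learned from historical commit stats converts byte budgets into record counts, inserts are first morphed into updates against existing small files, and the remainder is spread over enough new file IDs to stay near the target file size:

import java.util.*;

public class InsertPackingSketch {
  static final long TARGET_FILE_BYTES = 128L * 1024 * 1024; // block-aligned target size

  // Assign insert counts to existing small files first, then to new file IDs.
  static Map<String, Long> assignInserts(long numInserts,
                                         Map<String, Long> smallFileSizes, // fileId -> bytes
                                         long avgRecordBytes /* from commit stats */) {
    Map<String, Long> assignment = new LinkedHashMap<>();
    long remaining = numInserts;
    for (Map.Entry<String, Long> e : smallFileSizes.entrySet()) {
      if (remaining <= 0) break;
      long capacity = Math.max(0, (TARGET_FILE_BYTES - e.getValue()) / avgRecordBytes);
      long take = Math.min(capacity, remaining);
      if (take > 0) assignment.put(e.getKey(), take); // inserts become "updates" to small files
      remaining -= take;
    }
    long perNewFile = TARGET_FILE_BYTES / avgRecordBytes;
    int newFileNum = 0;
    while (remaining > 0) { // spill the rest into evenly sized new file groups
      long take = Math.min(perNewFile, remaining);
      assignment.put("new-file-" + (newFileNum++), take);
      remaining -= take;
    }
    return assignment;
  }
}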
Compaction
Essential for Query performance
- Merge Write Optimized row format with Scan Optimized column format
Scheduled asynchronously to Ingestion
- Ingestion already groups updates per File Id
- Locks down versions of log files to compact
- Pluggable strategy to prioritize compactions
- Base File to Log file size ratio
- Recent partitions compacted first
Deep Dive
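A sketch of one such pluggable strategy (illustrative types): rank file groups by their log-to-base size ratio, break ties toward more recent partitions, and compact from the top:

import java.util.*;

public class CompactionPrioritySketch {
  record FileGroup(String partitionPath, String fileId, long baseFileBytes, long logBytes) {}

  static List<FileGroup> prioritize(List<FileGroup> candidates) {
    Comparator<FileGroup> byLogToBaseRatio =
        Comparator.comparingDouble(g -> (double) g.logBytes() / Math.max(1, g.baseFileBytes()));
    // Highest log/base ratio first (most read amplification relieved per compaction),
    // then lexicographically later (i.e. more recent) partitions first.
    return candidates.stream()
        .sorted(byLogToBaseRatio.reversed()
            .thenComparing(FileGroup::partitionPath, Comparator.reverseOrder()))
        .toList();
  }
}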
Failure recovery
Automatic recovery via Spark RDD
- Resilient Distributed Datasets!!
No Partial writes
- Commit is atomic
- Auto rollback last failed commit
Rollback specific commits
Savepoints/Snapshots
Deep Dive
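A sketch of why partial writes are impossible (hypothetical marker-file names; the real timeline layout may differ): a commit becomes visible only through one atomic rename, and anything still marked in-flight on restart is rolled back:

import java.io.IOException;
import java.nio.file.*;

public class AtomicCommitSketch {
  // Publish a commit by atomically renaming its in-flight marker to a completed one.
  // Readers only honor "<time>.commit" files, so half-written data stays invisible.
  static void finalizeCommit(Path metaDir, String commitTime) throws IOException {
    Path inflight = metaDir.resolve(commitTime + ".inflight");
    Path completed = metaDir.resolve(commitTime + ".commit");
    Files.move(inflight, completed, StandardCopyOption.ATOMIC_MOVE);
  }

  // On restart: any commit still marked in-flight failed mid-write; roll it back by
  // deleting the data files it produced, then the marker itself.
  static void rollbackInflight(Path metaDir) throws IOException {
    try (DirectoryStream<Path> stale = Files.newDirectoryStream(metaDir, "*.inflight")) {
      for (Path marker : stale) {
        // ... delete the data files recorded in the marker (omitted) ...
        Files.delete(marker);
      }
    }
  }
}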
Hoodie Write API
// WriteConfig contains basePath of hoodie dataset (among other configs)
HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig)
// Start a commit and get a commit time to atomically upsert a batch of records
String startCommit()
// Upsert the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Bulk load the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> bulkInsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Choose to commit
boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses)
// Rollback
boolean rollback(final String commitTime) throws HoodieRollbackException
Deep Dive
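Putting the API together, one ingestion round might look like the following sketch (the HoodieWriteConfig builder calls and WriteStatus.hasErrors() are assumptions for illustration, not shown on the slide):

// Assumes a JavaSparkContext `jsc` and a JavaRDD<HoodieRecord<T>> `records` built from
// the incoming changelog (record key + partition path + payload).
HoodieWriteConfig config = HoodieWriteConfig.newBuilder()   // builder shape is an assumption
    .withPath("hdfs:///tmp/hoodie/trips")                   // basePath of the hoodie dataset
    .build();
HoodieWriteClient client = new HoodieWriteClient(jsc, config);

String commitTime = client.startCommit();                   // begin an atomic batch
JavaRDD<WriteStatus> statuses = client.upsert(records, commitTime);
if (statuses.filter(WriteStatus::hasErrors).isEmpty()) {    // hasErrors() assumed
  client.commit(commitTime, statuses);                      // publish atomically
} else {
  client.rollback(commitTime);                              // nothing partial stays visible
}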
Hoodie Record
HoodieRecordPayload
// Combine existing value with new incoming value and return the combined value
IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);
// Get the Avro IndexedRecord for the dataset schema
IndexedRecord getInsertValue(Schema schema);
Deep Dive
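A minimal last-write-wins payload against the interface as sketched above (illustrative; a real payload would also carry a precombine/ordering field):

import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

// Keeps whichever value arrived last: updates simply overwrite the stored record.
public class OverwriteWithLatestPayload implements HoodieRecordPayload {
  private final IndexedRecord incoming;

  public OverwriteWithLatestPayload(IndexedRecord incoming) { this.incoming = incoming; }

  @Override
  public IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
    return incoming; // ignore the existing value; last write wins
  }

  @Override
  public IndexedRecord getInsertValue(Schema schema) {
    return incoming; // inserts just materialize the incoming record
  }
}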
Hoodie: Overview
Hoodie Concepts
[Diagram: the Hoodie WriteClient (Spark) stores and indexes data on HDFS as data files plus index and timeline metadata; Hive queries, Presto queries, and Spark DAGs read that data back through the storage type’s supported views]
Hoodie Views
Hoodie Views
[Chart: query execution time vs. data latency; the Read Optimized view gives faster queries at higher data latency, the Realtime view gives fresher data at higher query cost]
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- ~30 mins latency for ~500GB
- Targets existing Hive tables
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Data Pipelines
Hoodie Views
[Diagram: the input changelog lands in day partitions 2017/02/15 - 2017/02/17 as 10 GB, 5-min batches; through Hive, the same files back a Read Optimized table (compacted columnar files File1.parquet / File1_v2.parquet, located via the index), a Real Time table (those files merged with File1.avro.log), and an incremental log table of changes]
Read Optimized View
InputFormat picks only Compacted Columnar Files
Optimized for faster query runtime over data latency
- Plug into GetSplits to filter out older versions
- All optimizations done to read parquet apply (vectorized reads etc.)
Data latency is the frequency of compaction
Works out of the box with Presto and Apache Spark
Hoodie Views
Presto Read Optimized Performance
Hoodie Views
Real Time View
InputFormat merges ROFile with WOFiles at query runtime
Custom RecordReader
- Logs are grouped per FileID
- Single split is usually a single FileID in Hoodie (Block Aligned files)
Latency is the frequency of ingestion (mini-batches)
Works out of the box with Presto and Apache Spark
- Specialized parquet read path optimizations not supported
Hoodie Views
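A sketch of the merge the RecordReader performs, with plain collections standing in for the parquet and Avro log readers: load the base file’s records keyed by record key, then replay that file group’s log in commit order so later writes win:

import java.util.*;

public class RealtimeMergeSketch {
  record Rec(String key, String value) {}
  record LogEntry(String commitTime, Rec rec) {}

  // Merge one file group: base (columnar) records overlaid with its (row-based) log.
  static Collection<Rec> mergedView(List<Rec> baseRecords, List<LogEntry> logEntries) {
    Map<String, Rec> view = new LinkedHashMap<>();
    baseRecords.forEach(r -> view.put(r.key(), r));
    logEntries.stream()
        .sorted(Comparator.comparing(LogEntry::commitTime)) // replay in commit order
        .forEach(e -> view.put(e.rec().key(), e.rec()));    // newer log entries win
    return view.values();
  }
}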
Incremental Log View
Hoodie Views
[Diagram: the day-partitioned trips dataset again (2010-2014 through 2017/04/16, new/updated trips every 5 min); the Log View exposes exactly the new and updated records for incremental pull, leaving unaffected data untouched]
Incremental Log View
Pull ONLY changed records in a time range using SQL
- _hoodie_commit_time > ‘startTs’ AND _hoodie_commit_time < ‘endTs’
Avoid full table/partition scan
Do not rely on a custom sequence ID to tail
Lookback window restricted based on cleaning policy
Hoodie Views
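As a concrete (hedged) example of that predicate from Spark, assuming a SparkSession `spark`, the dataset registered as a Hive table named trips, and hypothetical commit-time literals:

// Pull ONLY the records that changed between two commit times; every record is
// stamped with _hoodie_commit_time at write time, so no full scan is needed.
Dataset<Row> changed = spark.sql(
    "SELECT * FROM trips " +
    "WHERE _hoodie_commit_time > '20170416100000' " +   // startTs (example value)
    "AND _hoodie_commit_time <= '20170416110000'");     // endTs (example value)
changed.createOrReplaceTempView("trips_delta");         // chain the delta into the next ETL step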
Use Cases
Near Real-Time Ingestion
Use Cases
Incremental ETL
Use Cases
Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within few minutes
- Offload analytics to HDFS
Incremental ETL processing
- Don't tradeoff correctness to do incremental processing
- Hoodie integration with Scheduler
Unified Analytical Serving Layer
- Eliminate your specialized serving layer, if tolerated latency is > 10 min
- Simplify serving with HDFS for the entire dataset
Use Cases
Lambda Architecture
Use Cases
Unified Analytics Serving
Use Cases
Spectrum Of Data Pipelines
Use Cases
Adoption @ Uber
Use Cases
Powering ~1000 data ingestion feeds
- Every 30 mins today, several TBs per hour
- Towards < 10 min in the next few months
Reduced resource usage by 10x
- In production for last 6 months
- Hardened across rolling restarts, data node reboots
Incremental ETL for dimension tables
- Data warehouse at large
Future
- Self serve incremental pipelines (DeltaStreamer)
Comparison
Hoodie fills a big void in Hadoop land
- Upserts & Faster data
Play well with Hadoop ecosystem & deployments
- Leverage Spark vs re-inventing yet-another storage silo
Designed for Incremental Processing
- Incremental Pull is a ‘Hoodie’ special
Comparison
Source: (CERN Blog) Performance comparison of different file
formats and storage engines in the Hadoop ecosystem
Comparison: Analytical Storage
Hoodie Views
Comparison
Apache Kudu
- Targets both OLTP and OLAP
- Dedicated storage servers
- Evolving Ecosystem support*
Hoodie
- OLAP Only
- Built on top of HDFS
- Already works with Spark/Hive/Presto
Hive Transactions
- Tight integration with Hive & ORC
- No read-optimized view
- Hive based impl
Hoodie
- Hive/Spark/Presto
- Parquet/Avro today, but pluggable
- Power of Spark!
Comparison
Comparison
HBase/Key-Value Stores
- Write Optimized for OLTP
- Bad Scan Performance
- Scaling farm of storage servers
- Multi row atomicity is tedious
Hoodie
- Read-Optimized for OLAP
- State-of-art columnar formats
- Scales like a normal job or query
- Multi row commits!!
Stream Processing
- Row oriented processing
- Flink/Spark typically upsert results to
OLTP/specialized OLAP stores
Hoodie
- Columnar queries, at higher latency
- HDFS as Sink, Presto as OLAP engine
- Integrates with Spark/Spark Streaming
Comparison
Future Plans
Merge On Read (Project #1)
- Active development, productionizing, shipping!
Global Index (Project #2)
- Fast, lightweight index to map key to fileID, globally (not just partitions)
Spark Datasource (Issue #7) & Presto Plugins (Issue #81)
- Native support for incremental SQL (e.g: where _hoodie_commit_time > ... )
Beam Runner (Issue #8)
- Build incremental pipelines that also port across batch or streaming modes
Future
Getting Involved
Engage with us on Github
- Look for “beginner-task” tagged issues
- Checkout tools/utilities
Uber is hiring for “Hoodie”
- “Software Engineer - Data Processing Platform (Hoodie)”
- https://www.uber.com/careers/list/28811/
Swing by Office Hours after talk
- 2:40pm–3:20pm, Location: Table B
Contributions
Questions?
Extra Slides
Hoodie Views
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- ~30 mins latency for ~500GB
- Targets existing Hive tables
Hoodie Concepts
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Data Pipelines
Hoodie Storage Types
Define how data is written
- Indexing & Storage of data
- Impl of primitives and timeline actions
- Support 1 or more views
2 Storage Types
- Copy On Write : Purely columnar, simply creates new versions of files
- Merge On Read : Near-real time, shifts some write cost to reads, merges on-the-fly
Hoodie Concepts
Storage Type | Supported Views
Copy On Write | Read Optimized, Log View
Merge On Read | Read Optimized, Real Time, Log View
Hoodie Timeline
Time-ordered sequence of actions
- Instantaneous views of dataset
- Arrival-order retrieval of data
Hoodie Concepts
Timeline Actions
Commit
- Multi-row atomic publish of data to Queries
- Detailed metadata to facilitate log view of changes
Clean
- Remove older versions of files, to reclaim storage space
- Cleaning modes : Retain Last X file versions, Retain Last X Commits
Compaction
- Compact row based log to columnar snapshot, for real-time view
Savepoint
- Roll back to a checkpoint and resume ingestion
Hoodie Concepts
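A sketch of the “Retain Last X file versions” cleaning mode (illustrative types): group versions by file group, keep the newest X per group, and delete the rest:

import java.util.*;
import java.util.stream.*;

public class CleanSketch {
  record FileVersion(String fileId, String commitTime, String path) {}

  // Returns the paths to delete so each file group keeps only its newest `retain` versions.
  static List<String> versionsToClean(List<FileVersion> versions, int retain) {
    return versions.stream()
        .collect(Collectors.groupingBy(FileVersion::fileId))
        .values().stream()
        .flatMap(group -> group.stream()
            .sorted(Comparator.comparing(FileVersion::commitTime).reversed())
            .skip(retain))                      // everything past the newest `retain`
        .map(FileVersion::path)
        .collect(Collectors.toList());
  }
}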
Hoodie Terminology
● Basepath: Root of a Hoodie dataset
● Partition Path: Relative path to folder with partitions of data
● Commit: Produce files identified with fileid & commit time
● Record Key:
○ Uniquely identify a record within partition
○ Mapped consistently to a fileid
● File Id Group: Files with all versions of a group of records
● Metadata Directory: Stores a timeline of all metadata actions, published atomically
Deep Dive
Hoodie Storage
[Diagram: day partitions 2017/02/15 - 2017/02/17; the 200 GB change log is appended to File1.avro.log, which backs the Realtime View, while compaction produces versioned columnar files (File1_v1.parquet, File1_v2.parquet, ~10 GB) located via the index, which back the Read Optimized View; both are served through Hive]
Hoodie Write Path
[Diagram: the change log goes through index lookup and splits into updates and inserts; updates append to File Id1’s log file across commits (10:06; a failed commit at 10:08; then 10:08 and 10:09 producing versions 1 and 2 on top of the file compacted at 10:05), while inserts open a new, initially empty File Id2; partitions 2017-03-10 through 2017-03-14, current commit time 10:10]
Deep Dive
Hoodie Write Path
Deep Dive
[Diagram: a Spark application drives the Hoodie Spark Client to tag records against the (persistent) index, stream them into the data layout in HDFS, and save timeline metadata; on the read side, HoodieInputFormat gets the latest commit, then filters and merges file versions]
Read Optimized View
Hoodie Views
Spark SQL Performance Comparison
Hoodie Views
Realtime View
Hoodie Views
Incremental Log View
Hoodie Views
Ad

More Related Content

What's hot (20)

An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
Benjamin Bengfort
 
Always on in sql server 2017
Always on in sql server 2017Always on in sql server 2017
Always on in sql server 2017
Gianluca Hotz
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
HostedbyConfluent
 
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Markus Michalewicz
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
Yugabyte
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
Manish Kumar
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Big Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon GershinskyBig Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon Gershinsky
GidonGershinsky
 
Exadata architecture and internals presentation
Exadata architecture and internals presentationExadata architecture and internals presentation
Exadata architecture and internals presentation
Sanjoy Dasgupta
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
Peter Lawrey
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
emreakis
 
Introduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyIntroduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and Consistency
Benjamin Black
 
Why oracle data guard new features in oracle 18c, 19c
Why oracle data guard new features in oracle 18c, 19cWhy oracle data guard new features in oracle 18c, 19c
Why oracle data guard new features in oracle 18c, 19c
Satishbabu Gunukula
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and whenNoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
Lorenzo Alberton
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
Benjamin Bengfort
 
Always on in sql server 2017
Always on in sql server 2017Always on in sql server 2017
Always on in sql server 2017
Gianluca Hotz
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
HostedbyConfluent
 
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Markus Michalewicz
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
Yugabyte
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
Manish Kumar
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Big Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon GershinskyBig Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon Gershinsky
GidonGershinsky
 
Exadata architecture and internals presentation
Exadata architecture and internals presentationExadata architecture and internals presentation
Exadata architecture and internals presentation
Sanjoy Dasgupta
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
Peter Lawrey
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
emreakis
 
Introduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyIntroduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and Consistency
Benjamin Black
 
Why oracle data guard new features in oracle 18c, 19c
Why oracle data guard new features in oracle 18c, 19cWhy oracle data guard new features in oracle 18c, 19c
Why oracle data guard new features in oracle 18c, 19c
Satishbabu Gunukula
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and whenNoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
Lorenzo Alberton
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 

Viewers also liked (20)

Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
larsgeorge
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
Praveen Kumar Donta
 
Ppt hadoop
Ppt hadoopPpt hadoop
Ppt hadoop
Fajar Nugraha
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten
 
Hire Hadoop Developer
Hire Hadoop DeveloperHire Hadoop Developer
Hire Hadoop Developer
Geeks Per Hour
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
Xudong Brandon Liang
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
makoto onizuka
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
ALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
mobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
MLconf
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
Makoto Yui
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
Anil Reddy
 
Data Infused Product Design and Insights at LinkedIn
Data Infused Product Design and Insights at LinkedInData Infused Product Design and Insights at LinkedIn
Data Infused Product Design and Insights at LinkedIn
Yael Garten
 
A Perspective from the intersection Data Science, Mobility, and Mobile Devices
A Perspective from the intersection Data Science, Mobility, and Mobile DevicesA Perspective from the intersection Data Science, Mobility, and Mobile Devices
A Perspective from the intersection Data Science, Mobility, and Mobile Devices
Yael Garten
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
Subhas Kumar Ghosh
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
larsgeorge
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
makoto onizuka
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
ALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
mobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
MLconf
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
Makoto Yui
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
Anil Reddy
 
Data Infused Product Design and Insights at LinkedIn
Data Infused Product Design and Insights at LinkedInData Infused Product Design and Insights at LinkedIn
Data Infused Product Design and Insights at LinkedIn
Yael Garten
 
A Perspective from the intersection Data Science, Mobility, and Mobile Devices
A Perspective from the intersection Data Science, Mobility, and Mobile DevicesA Perspective from the intersection Data Science, Mobility, and Mobile Devices
A Perspective from the intersection Data Science, Mobility, and Mobile Devices
Yael Garten
 
Ad

Similar to Hoodie: Incremental processing on hadoop (20)

SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
Chester Chen
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Hakka Labs
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
Yousun Jeong
 
Geek Sync | Guide to Understanding and Monitoring Tempdb
Geek Sync | Guide to Understanding and Monitoring TempdbGeek Sync | Guide to Understanding and Monitoring Tempdb
Geek Sync | Guide to Understanding and Monitoring Tempdb
IDERA Software
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
Muga Nishizawa
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
MongoDB
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
DataWorks Summit
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.
 
Optimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointOptimize SQL server performance for SharePoint
Optimize SQL server performance for SharePoint
serge luca
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Server
serge luca
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql server
Isabelle Van Campenhoudt
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs Faster
Bob Ward
 
Large Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsLarge Scale SharePoint SQL Deployments
Large Scale SharePoint SQL Deployments
Joel Oleson
 
SharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCSharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPC
guest7c2e070
 
Fluentd Overview, Now and Then
Fluentd Overview, Now and ThenFluentd Overview, Now and Then
Fluentd Overview, Now and Then
SATOSHI TAGOMORI
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
Chester Chen
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Hakka Labs
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
Yousun Jeong
 
Geek Sync | Guide to Understanding and Monitoring Tempdb
Geek Sync | Guide to Understanding and Monitoring TempdbGeek Sync | Guide to Understanding and Monitoring Tempdb
Geek Sync | Guide to Understanding and Monitoring Tempdb
IDERA Software
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
Muga Nishizawa
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
MongoDB
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
DataWorks Summit
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.
 
Optimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointOptimize SQL server performance for SharePoint
Optimize SQL server performance for SharePoint
serge luca
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Server
serge luca
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql server
Isabelle Van Campenhoudt
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs Faster
Bob Ward
 
Large Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsLarge Scale SharePoint SQL Deployments
Large Scale SharePoint SQL Deployments
Joel Oleson
 
SharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCSharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPC
guest7c2e070
 
Fluentd Overview, Now and Then
Fluentd Overview, Now and ThenFluentd Overview, Now and Then
Fluentd Overview, Now and Then
SATOSHI TAGOMORI
 
Ad

Recently uploaded (20)

UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 

Hoodie: Incremental processing on hadoop

  • 1. DATA Incremental Processing Framework Vinoth Chandar | Prasanna Rajaperumal Hoodie
  • 2. Who Are We Staff Software Engineer, Uber • Linkedin : Voldemort k-v store, Stream processing • Oracle : Database replication, CEP Senior Software Engineer, Uber • Cloudera : Data Pipelines, Log analysis • Cisco : Complex Event Processing
  • 3. Agenda • Hadoop @ Uber • Motivation & Concepts • Deep Dive • Use-Cases • Comparisons • Future Plans
  • 4. Adoption & Scale ~Few Thousand Servers Many Many PBs ~20k Hive queries/day ~100k Presto queries/day ~100k Jobs/day Hadoop @ Uber ~100 Spark Apps
  • 5. Hadoop Use-cases Analytics • Dashboards • Ad Hoc-Analysis • Federated Querying • Interactive Analysis Hadoop @ Uber Data Applications • ML Recommendations • Fraud Detection • Safe Driving • Incentive Spends Data Warehousing • Curated Datafeeds • Standard ETL • DataLake => DataMart Presto Spark Hive Faster Data! Faster Data! Faster Data!
  • 6. We All Like A Nimble Elephant Question: Can we get fresh data, directly on a petabyte scale Hadoop Data Lake?
  • 7. Previously on .. Strata (2016) Hadoop @ Uber “Uber, your Hadoop has arrived: Powering Intelligence for Uber’s Real-time marketplace”
  • 8. Partitioned by day trip started 2010-2014 New Data Unaffected Data Updated Data Incremental update 2015/XX/XX Every 5 min Day level partitions Late Arriving Updates 2016/XX/XX 2017/(01-03)/XX 2017/04/16 New/Updated Trips Motivation
  • 9. Aug: 10 hr (1000 executors) Apr: 8 hr (800 executors) Jan: 6 hr (500 executors) Snapshot NoSQL/DB Ingestion: Status Quo Database trips (compacted table) Replicate d Trip Rows HBase New /updated trip rows Changelog 12-18+ hr Kafka upsert Presto Derived Tables logging 8 hr Approximation Motivation Batch Recompute
  • 10. Exponential Growth is fun .. Hadoop @ Uber Also extremely hard, to keep up with … - Long waits for queue - Disks running out of space Common Pitfalls - Massive re-computations - Batch jobs are too big fail
  • 11. Let’s go back 30 years How did RDBMS-es solve this? • Update existing row with new value (Transactions) • Consume a log of changes downstream (Redo log) • Update again downstream Concepts MySQL (Server A) MySQL (Server B) Update Update Pull Redo Log TransformationImportant Differences • Columnar file formats • Read-heavy workloads • Petabytes & 1000s of servers
  • 12. 10 hr (1000) 8 hr (800) 6 hr (500) snapshot Batch Recompute Challenging Status Quo trips (compacted table) 12-18+ hr Presto Derived Tables8 hr Approximation Hoodie.upsert() 1 hr (100) - Today 10 min (50) - Q2 ‘17 1 hr Hoodie.incrPull() [2 mins to pull] 1 hr - 3 hr (10x less resources) Motivation Accurate!!! Database Replicate d Trip Rows HBase New /updated trip rows Changelog Kafka upsert logging
  • 13. Incremental Processing Advantages: Increased Efficiency / Leverage Hadoop SQL Engines/ Simplify Architecture Hoodie Concepts Incremental Pull (Primitive #2) • Log stream of changes, avoid costly scans • Enable chaining processing in DAG For more, “Case For Incremental Processing on Hadoop” (link) Upsert (Primitive #1) • Modify processed results • kv stores in stream processing
  • 14. Introducing Hoodie Open Source - https://github.com/uber/hoodie - eng.uber.com/hoodie Spark Library For Upserts & Incrementals - Scales horizontally like any job - Stores dataset directly on HDFS Storage Abstraction to - Apply mutations to dataset - Pull changelog incrementally Hoodie Concepts Large HDFS Dataset Upsert (Spark) Changelog Changelog Incr Pull (Hive/Spark/Presto) Hive Table (normal queries)
  • 15. Hoodie: Overview Hoodie Concepts Hoodie WriteClient (Spark) Index Data Files Timeline Metadata Hive Queries Hoodie Dataset On HDFS Presto Queries Spark DAGs Store & Index Data Read data Storage Type Views
  • 16. Hoodie: Storage Types & Views Hoodie Concepts Views : How is Data read? Read Optimized View - Parquet Query Performance - ~30 mins latency for ~500GB Real Time View - Hybrid of row & columnar data - ~1-5 mins latency - Brings near-real time tables Log View - Stream of changes to dataset - Enables Incremental Pull Storage Type : How is Data stored? Copy On Write - Purely columnar - Simply creates new versions of files Merge On Read - Near-real time - Shifts some write cost to reads - Merges on-the-fly
  • 17. Hoodie: Storage Types & Views Hoodie Concepts Storage Type Supported Views Copy On Write Read Optimized, LogView Merge On Read Read Optimized, RealTime, LogView
  • 18. Storage: Basic Idea 2017/02/17 File1.parquet Index Index File1_v2.parquet 2017/02/15 2017/02/16 2017/02/17 File1.avro.log 200 GB 30min batch File1 10 GB 5min batch File1_v1.parquet 10 GB 5 min batch ● 1825 Partitions (365 days * 5 yrs) ● 100 GB Partition Size ● 128 MB File Size ● ~800 Files Per Partition ● Skew spread - 0.005 (single batch) ● 20 seconds to re-write 1 File (shuffle) ● 100 executors ● 7300 Files rewritten ● 24 minutes to write ● 1825 Partitions (365 days * 5 yrs) ● 100 GB Partition Size ● 128 MB File Size ● ~800 Files Per Partition ● Skew spread - 0.5 % (single batch) New Files - 0.005 % (single batch) ● 20 seconds to re-write 1 File (shuffle) ● 100 executors 10 executors ● 7300 Files rewritten ~ 8 new Files ● 24 minutes to write ~2 minutes to write Deep Dive Input Changelog Hoodie Dataset
  • 19. Index and Storage Index - Tag ingested record as update or insert - Index is immutable (record key to File mapping never changes) - Pluggable - Bloom Filter - HBase Storage - HDFS Block aligned files - ROFormat - Default is Apache Parquet - WOFormat - Default is Apache Avro Deep Dive
  • 20. Concurrency ● Multi-row atomicity ● Strong consistency (Same as HDFS guarantees) ● Single Writer - Multiple Consumer pattern ● MVCC for isolation ○ Running queries are run concurrently to ingestion Deep Dive
  • 21. Data Skew Why skew is a problem? - Spark 2GB Remote Shuffle Block limit - Straggler problem Hoodie handles data skew automatically - Index lookup skew - Data write skew handled by auto sub partitioning - Allocate sub-partitions (file ID) based on historical commit stats - Morph inserts as updates to fix small files Deep Dive
  • 22. Compaction Essential for Query performance - Merge Write Optimized row format with Scan Optimized column format Scheduled asynchronously to Ingestion - Ingestion already groups updates per File Id - Locks down versions of log files to compact - Pluggable strategy to prioritize compactions - Base File to Log file size ratio - Recent partitions compacted first Deep Dive
  • 23. Failure recovery Automatic recovery via Spark RDD - Resilient Distributed Datasets!! No Partial writes - Commit is atomic - Auto rollback last failed commit Rollback specific commits Savepoints/Snapshots Deep Dive
  • 24. Hoodie Write API // WriteConfig contains basePath of hoodie dataset (among other configs) HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig) // Start a commit and get a commit time to atomically upsert a batch of records String startCommit() // Upsert the RDD<Records> into the hoodie dataset JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime) // Bulk load the RDD<Records> into the hoodie dataset JavaRDD<WriteStatus> bulkInsert(JavaRDD<HoodieRecord<T>> records, final String commitTime) // Choose to commit boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses) // Rollback boolean rollback(final String commitTime) throws HoodieRollbackException Deep Dive
  • 25. Hoodie Record (Deep Dive)
    HoodieRecordPayload

    // Combine the existing value with the new incoming value and return the combined value
    IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);

    // Get the Avro IndexedRecord for the dataset schema
    IndexedRecord getInsertValue(Schema schema);
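    A sketch of one payload implementation against the interface above: last-writer-wins on a timestamp field. The class name and the "updated_at" field are illustrative assumptions, not part of Hoodie.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

// Illustrative payload: keep whichever side carries the newer timestamp.
public class LatestWinsPayload implements HoodieRecordPayload {

    private final GenericRecord incoming;

    public LatestWinsPayload(GenericRecord incoming) {
        this.incoming = incoming;
    }

    @Override
    public IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
        long existingTs = (Long) ((GenericRecord) currentValue).get("updated_at"); // assumed field
        long incomingTs = (Long) incoming.get("updated_at");
        // Last-writer-wins: return whichever version is newer.
        return incomingTs >= existingTs ? incoming : currentValue;
    }

    @Override
    public IndexedRecord getInsertValue(Schema schema) {
        // No existing value: the incoming record is inserted as-is.
        return incoming;
    }
}
```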
  • 26. Hoodie: Overview (Hoodie Concepts)
    [Architecture diagram: Spark DAGs feed the Hoodie WriteClient (Spark), which stores and indexes data - index, data files, timeline metadata - as a Hoodie dataset on HDFS; Hive, Presto, and Spark queries read the data through the storage type's views]
  • 27. Hoodie Views (Hoodie Views)
    [Chart: query execution time vs. data latency; the real-time view trades slower queries for fresher data than the read-optimized view]
    3 logical views of a dataset:
    - Read Optimized View: raw Parquet query performance, ~30 min latency for ~500 GB, targets existing Hive tables
    - Real Time View: hybrid of row & columnar data, ~1-5 min latency, brings near-real-time tables
    - Log View: stream of changes to the dataset, enables incremental data pipelines
  • 28. Hoodie Views (Hoodie Views)
    [Diagram: an input changelog (10 GB / 5 min batches) lands in day partitions (2017/02/15..17); the Read Optimized table in Hive sees compacted files (File1_v1.parquet, File1_v2.parquet), the Real Time table also merges File1.avro.log on the fly, and the Incremental Log table exposes the change stream]
  • 29. Read Optimized View (Hoodie Views)
    InputFormat picks only compacted, columnar files
    Optimized for faster query runtime over data latency
    - Plugs into getSplits to filter out older file versions (see the sketch below)
    - All optimizations for reading parquet apply (vectorization etc.)
    Data latency equals the frequency of compaction
    Works out of the box with Presto and Apache Spark
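    A sketch of the version-filtering idea: given the file listing for a partition, keep only the newest commit per file ID. The `fileId_commitTime.parquet` naming convention used here is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative filtering of a partition listing down to the latest
// committed version of each file id, as a getSplits hook might do.
public class LatestVersionFilter {
    static List<String> latestVersions(List<String> files) {
        Map<String, String[]> newest = new HashMap<>(); // fileId -> {commitTime, fileName}
        for (String f : files) {
            String[] parts = f.replace(".parquet", "").split("_");
            String fileId = parts[0];
            String commitTime = parts[1];
            String[] current = newest.get(fileId);
            // Commit times sort lexicographically, so a string compare works.
            if (current == null || commitTime.compareTo(current[0]) > 0) {
                newest.put(fileId, new String[] {commitTime, f});
            }
        }
        List<String> result = new ArrayList<>();
        for (String[] v : newest.values()) {
            result.add(v[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "file1_1005.parquet", "file1_1010.parquet", "file2_1005.parquet");
        // Only file1_1010.parquet and file2_1005.parquet survive.
        System.out.println(latestVersions(listing));
    }
}
```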
  • 30. Presto Read Optimized Performance Hoodie Views
  • 31. Real Time View (Hoodie Views)
    InputFormat merges the RO file with WO files at query runtime
    Custom RecordReader
    - Logs are grouped per file ID
    - A single split is usually a single file ID in Hoodie (block-aligned files)
    Latency equals the frequency of ingestion (mini-batches)
    Works out of the box with Presto and Apache Spark
    - Specialized parquet read path optimizations are not supported
  • 32. Incremental Log View (Hoodie Views)
    [Diagram: day-level partitions (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16); new/updated trips incrementally update only the affected partitions every 5 minutes, leaving the rest of the data untouched, and the Log View exposes those changes for incremental pull]
  • 33. Incremental Log View (Hoodie Views)
    Pull ONLY the changed records in a time range using SQL
    - _hoodie_commit_time > 'startTs' AND _hoodie_commit_time < 'endTs' (see the example below)
    Avoids full table/partition scans
    Does not rely on a custom sequence ID to tail
    The lookback window is restricted by the cleaning policy
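    A sketch of an incremental pull expressed as a Spark SQL query. `_hoodie_commit_time` is the metadata field referenced above; the table name and timestamp values are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPull {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hoodie-incr-pull")
            .enableHiveSupport()   // read the Hive-registered hoodie table
            .getOrCreate();

        // Pull only the records committed inside the window; no full table
        // or partition scan is needed.
        Dataset<Row> changed = spark.sql(
            "SELECT * FROM trips " +
            "WHERE _hoodie_commit_time > '20170401120000' " +
            "AND _hoodie_commit_time <= '20170401180000'");
        changed.show();
    }
}
```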
  • 39. Use Cases (Use Cases)
    Near real-time ingestion / streaming into HDFS
    - Replicate online state in HDFS within a few minutes
    - Offload analytics to HDFS
    Incremental ETL processing
    - Don't trade off correctness to do incremental processing
    - Hoodie integrates with the scheduler
    Unified analytical serving layer
    - Eliminate your specialized serving layer if the tolerated latency is > 10 min
    - Simplify serving with HDFS for the entire dataset
  • 42. Spectrum Of Data Pipelines Use Cases
  • 43. Adoption @ Uber (Use Cases)
    Powering ~1000 data ingestion feeds
    - Every 30 mins today, several TBs per hour
    - Heading toward < 10 min in the next few months
    Reduced resource usage by 10x
    - In production for the last 6 months
    - Hardened across rolling restarts and data node reboots
    Incremental ETL for dimension tables
    - The data warehouse at large
    Future
    - Self-serve incremental pipelines (DeltaStreamer)
  • 44. Comparison (Comparison)
    Hoodie fills a big void in Hadoop land
    - Upserts & faster data
    Plays well with the Hadoop ecosystem & deployments
    - Leverages Spark rather than re-inventing yet another storage silo
    Designed for incremental processing
    - Incremental pull is a 'Hoodie' special
  • 45. Comparison: Analytical Storage (Comparison)
    [Chart source: "Performance comparison of different file formats and storage engines in the Hadoop ecosystem" (CERN blog)]
  • 46. Comparison (Comparison)
    Apache Kudu vs. Hoodie
    - Kudu: targets both OLTP and OLAP; dedicated storage servers; evolving ecosystem support
    - Hoodie: OLAP only; built on top of HDFS; already works with Spark/Hive/Presto
    Hive Transactions vs. Hoodie
    - Hive Transactions: tight integration with Hive & ORC; no read-optimized view; Hive-based implementation
    - Hoodie: Hive/Spark/Presto; Parquet/Avro today, but pluggable; the power of Spark!
  • 47. Comparison (Comparison)
    HBase / key-value stores vs. Hoodie
    - HBase: write-optimized for OLTP; bad scan performance; scaling a farm of storage servers; multi-row atomicity is tedious
    - Hoodie: read-optimized for OLAP; state-of-the-art columnar formats; scales like a normal job or query; multi-row commits!
    Stream processing vs. Hoodie
    - Stream processing: row-oriented processing; Flink/Spark typically upsert results into OLTP or specialized OLAP stores
    - Hoodie: columnar queries, at higher latency; HDFS as the sink, Presto as the OLAP engine; integrates with Spark/Spark Streaming
  • 48. Future Plans (Future)
    Merge On Read (Project #1)
    - Active development; productionizing, shipping!
    Global Index (Project #2)
    - Fast, lightweight index mapping key to fileID globally (not just within partitions)
    Spark Datasource (Issue #7) & Presto plugins (Issue #81)
    - Native support for incremental SQL (e.g. where _hoodie_commit_time > ...)
    Beam Runner (Issue #8)
    - Build incremental pipelines that also port across batch or streaming modes
  • 49. Getting Involved (Contributions)
    Engage with us on GitHub
    - Look for issues tagged "beginner-task"
    - Check out the tools/utilities
    Uber is hiring for "Hoodie"
    - "Software Engineer - Data Processing Platform (Hoodie)"
    - https://www.uber.com/careers/list/28811/
    Swing by office hours after the talk
    - 2:40pm-3:20pm, Location: Table B
  • 53. Hoodie Storage Types (Hoodie Concepts)
    Define how data is written
    - Indexing & storage of data
    - Implementation of the primitives and timeline actions
    - Each supports 1 or more views
    2 storage types
    - Copy On Write: purely columnar; simply creates new versions of files
    - Merge On Read: near-real time; shifts some write cost to reads; merges on-the-fly
    Storage Type   | Supported Views
    Copy On Write  | Read Optimized, Log View
    Merge On Read  | Read Optimized, Real Time, Log View
  • 54. Hoodie Timeline (Hoodie Concepts)
    Time-ordered sequence of actions
    - Instantaneous views of the dataset
    - Arrival-order retrieval of data
  • 55. Timeline Actions (Hoodie Concepts)
    Commit
    - Multi-row atomic publish of data to queries
    - Detailed metadata to facilitate a log view of changes
    Clean
    - Removes older versions of files to reclaim storage space (see the sketch below)
    - Cleaning modes: retain last X file versions, retain last X commits
    Compaction
    - Compacts the row-based log into a columnar snapshot, for the real-time view
    Savepoint
    - Roll back to a checkpoint and resume ingestion
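    A sketch of the "retain last X file versions" cleaning mode for a single file ID: everything older than the newest N versions is reclaimable. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative cleaning: for one file id, delete all but the newest N versions.
public class CleanerSketch {
    static List<String> versionsToDelete(List<String> commitTimes, int retain) {
        List<String> sorted = new ArrayList<>(commitTimes);
        sorted.sort(Collections.reverseOrder()); // newest commit first (lexicographic)
        return sorted.size() <= retain
            ? Collections.emptyList()
            : sorted.subList(retain, sorted.size()); // reclaimable versions
    }

    public static void main(String[] args) {
        List<String> versions = Arrays.asList("1001", "1005", "1010", "1015");
        // Retain the last 2 versions: 1005 and 1001 can be cleaned.
        System.out.println(versionsToDelete(versions, 2)); // [1005, 1001]
    }
}
```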
  • 56. Hoodie Terminology (Deep Dive)
    - Basepath: root of a Hoodie dataset
    - Partition path: relative path to a folder holding a partition of the data
    - Commit: produces files identified by a file ID & commit time
    - Record key: uniquely identifies a record within a partition; mapped consistently to a file ID
    - File ID group: the files holding all versions of a group of records
    - Metadata directory: stores a timeline of all metadata actions, published atomically
  • 58. Hoodie Write Path (Deep Dive)
    [Diagram: a changelog goes through index lookup and is split into updates and inserts; updates for File Id1 append to its log file across commits (10:06, a failed 10:08, a retried 10:08, and 10:09 producing versions 1 and 2) on top of a 10:05 compacted file in partition 2017-03-11, while inserts create File Id2 in partition 2017-03-14; the 10:10 commit starts empty]
  • 59. Hoodie Write Path (Deep Dive)
    [Diagram: a Spark application uses the Hoodie Spark client to tag the stream against a persistent index and save it into the HDFS data layout plus metadata; HoodieInputFormat gets the latest commit, then filters and merges on read]
  • 61. Spark SQL Performance Comparison Hoodie Views

Editor's Notes

  • #9: Talk about why updates are needed before going to the previous generation, which uses HBase to handle mutations.
  • #18: 2 storage types and 3 views. Copy on Write is the first version of storage; it provides 2 views - RO and LogView. Merge on Read is a strict superset of Copy on Write; it provides the RealTime view in addition (one-liner: more recent data, with the cost of the merge pushed onto query execution).
  • #19: Visualization of the storage types. Talk about a basic parquet dataset laid out in HDFS. We want to ingest, say, 200GB of data and upsert it into this dataset. How do we support the upsert primitive? First we need to tag updates and inserts - introduce the index. Introduce multi-versioning - to write out updates. Talk about how/why batch sizes matter - amortization, write amplification. Go over the numbers: 30 minutes of queued data takes 30 minutes to ingest - a 1 hour SLA. We wanted to take on more workloads by pushing that SLA even further down. Have a differential structure - a log of updates queued for a single file. Stream updates into the log file; compaction happens once in a while, and compaction becomes similar to the previous ingestion flow. Run through the change in numbers.
  • #20: The index should be super quick - it is pure overhead. Block-aligned files - balance compaction and query parallelism.
  • #21: Let's talk about some of the challenges/features of storing the data in the above format.
  • #22: Explain hotspotting and the 2GB limit. Skew can arise during index lookup or during the data write. Custom partitioning takes statistics from past commits to determine the appropriate number of sub-partitions. Auto-correction of file sizes.
  • #24: Spark RDDs have automatic recovery and retry computations. The Avro log maintains the offset to each block, so a partially written block is skipped. Savepoints allow rolling back and re-ingesting.
  • #25: Talk about the SparkContext and config - index, storage formats, parallelism. startCommit returns a commit token.
  • #26: Talk about what a Hoodie record is and the record payload abstraction.
  • #27: Talk briefly about metadata storage. Bring attention towards the views.
  • #28: A view is an InputFormat - 3 different Hive tables are essentially registered, all pointing to the same HDFS dataset.
  • #29: Recap the storage briefly. Introduce one view after the next and explain how each works. Explain Hive query plan generation.
  • #30: Explain the InputFormats for each view. Explain how the read-optimized InputFormat works - generate the query plan, getSplits, filter. Talk about optimizing for query runtime - chosen when the compaction data latency is good enough. Talk about Hive metastore registration.
  • #33: Another way to visualize log view
  • #43: Batch vs. stream is not a dichotomy - it is a spectrum. Workloads that can tolerate minutes-level latency are common. Transition.
  • #51: Image source: https://openclipart.org/detail/50287/eleve-posant-une-question-student-asking-a-question
  • #57: Hoodie partitions each HDFS directory partition further, to a finer granularity, sub-partitioned as <Partition Path, File Id>. The Record Key <==> <Partition Path, File Id> mapping is immutable. Dynamic sub-partitioning automatically handles data skew. The fundamental unit of compaction is rewriting a single File Id. Sub-partitioning is used for ingestion only; query engines only see the HDFS directory partitions.