A Hudi Live Event: Shaping a Database Experience within the Data Lake with Apache Hudi 1.0
The Hudi Platform
Lake Storage (Cloud Object Stores, HDFS, …)
Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling, ...)
Table Services (cleaning, compaction, clustering, indexing, file sizing, ...)
Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, ...)
Table Format (Schema, File listings, Stats, Evolution, …)
Lake Cache* (Columnar, transactional, mutable, WIP, ...)
Metaserver* (Stats, table service coordination, ...)
Transactional Database Layer
Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, ...)
Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, ...)
User Interface
Readers (Snapshot, Time Travel, Incremental, etc.)
Writers (Inserts, Updates, Deletes, Smart Layout Management, etc.)
Programming API
In Industry Today
Trading transactions - Near real-time CDC from 4000+ Postgres tables at 5 mins!
Minute level analytics with 70% CPU savings @ Exabyte scale TikTok recommendations
Package deliveries - real-time event analytics at PB scale
Streaming log ingestion and efficient GDPR deletes using Apache Hudi
150 source systems, ETL processing for 10,000+ tables
Faster data access @ 75% less storage costs
Near real-time grocery delivery tracking
Streaming data lake for device data
Feature Store using Hudi
Building faster analytics for automotive data
Uber rides - 250+PB from 24h+ to minutes latency on 8000+ tables
Real time analytics that power financial decisions
Real-time advertising for 20M+ concurrent viewers
Lakehouse at Fortune 1 Scale
Lake House Architecture @ Halodoc
Faster SLAs with low cost data pipelines
Cost optimized fast analytics for sports solutions
The Community
3800+ members
7000+ Commits
431+ Contributors
6000+ GH Engagers
36 Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
19 PMCs
800B+ Records/Day (from even just 1 user!)
A vibrant OSS Community
4700+ questions answered (in just the last 2 years!)
22800+ responses (in just the last 2 years!)
Opportunities
Deeper Query Engine Integrations
- Query engines prefer separate integrations.
- Need to maintain specific Hudi connectors.
- Improved query planning & execution with Hudi’s advanced capabilities, such as multi-modal indexing.
Generalized Data Model
- Mature SQL support made possible by advancements in engines like Apache Spark & Apache Flink.
- Generalized data model for supporting keys in Hudi tables.
Serverful & Serverless
- Migrate to a hybrid architecture: serverless for data and serverful for table metadata.
- Scales well for metadata.
- Addresses evolving concurrency control needs.
Beyond Structured Data
- Support for complex, unstructured, large blobs with indexing, mutation and change capture.
- Expand to ML/AI modeling, image and video processing applications.
Enhanced self management
- Reverse streaming data
- Snapshot management
- Diagnostic reporters
- Cross Region Replication
- TTL management
Database experience on the Lake
The Database building blocks
Main components of a DBMS.
Courtesy: The seminal database paper: Architecture of a Database System
Reference diagram highlighting existing (green) and new (yellow) Hudi components, along with external components (blue). Check out RFC-69.
LSM Tree Style Timeline
Can we support commits every minute for 10 years?
Can we organize the timeline in a better way so that it scales linearly?
Unlocks infinite time travel, time-travel writes, NB Concurrency
LSM Trees FTW!
https://github.com/google/leveldb
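The LSM-style organization of the timeline can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `LsmTimeline`, the `batch_size` and `fanout` parameters, and the in-memory lists standing in for files are all hypothetical and do not describe Hudi's actual on-disk layout. It shows how merging full levels upward keeps the number of timeline files small while every instant stays reachable for time travel.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LsmTimeline:
    """Toy LSM-style archive of commit instants (hypothetical, not Hudi's layout)."""
    batch_size: int = 4  # instants per flushed level-0 file (assumed value)
    fanout: int = 3      # files per level before they are merged upward (assumed value)
    active: List[int] = field(default_factory=list)
    levels: Dict[int, List[List[int]]] = field(default_factory=dict)

    def commit(self, instant: int) -> None:
        # New instants land in the active timeline and are flushed in batches.
        self.active.append(instant)
        if len(self.active) >= self.batch_size:
            self._flush()

    def _flush(self) -> None:
        # Write the active instants as one sorted level-0 "file".
        self.levels.setdefault(0, []).append(sorted(self.active))
        self.active = []
        self._compact(0)

    def _compact(self, level: int) -> None:
        # When a level fills up, merge its files into one file on the next level.
        files = self.levels.get(level, [])
        if len(files) < self.fanout:
            return
        merged = sorted(i for f in files for i in f)
        self.levels[level] = []
        self.levels.setdefault(level + 1, []).append(merged)
        self._compact(level + 1)

    def file_count(self) -> int:
        return sum(len(files) for files in self.levels.values())

    def as_of(self, instant: int) -> List[int]:
        # Time travel: every instant up to and including the requested one.
        archived = [i for files in self.levels.values() for f in files for i in f]
        return sorted(i for i in archived + self.active if i <= instant)


timeline = LsmTimeline()
for instant in range(1, 37):  # 36 commits
    timeline.commit(instant)
```

With these assumed settings, the 36 commits end up in a single merged file while `timeline.as_of(10)` still returns the first ten instants.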
Non-Blocking Concurrency Control
Are we being too optimistic?
Three generally agreed-upon approaches: Pessimistic, Optimistic and Multi-Version
Architecture of a Database System (Sec 6.2)
Non-Blocking Concurrency Control
Can we avoid the performance and cost penalties due to OCC?
One way is to enhance OCC with sophisticated techniques for early conflict detection
How about a general-purpose non-blocking MVCC-based concurrency control?
Spanner’s TrueTime-like global monotonically increasing timestamps
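The contrast between OCC and a non-blocking, timestamp-ordered approach can be sketched in a few lines. Everything here is an illustrative assumption rather than Hudi's implementation: `next_ts` is a simple counter standing in for a global monotonically increasing timestamp source, `occ_commit` performs a basic overlap check at commit time, and `nb_append` / `nb_read` model writers that append versioned changes which are reconciled later in timestamp order.

```python
import itertools
from typing import Dict, List, Set, Tuple

# Stand-in for a global, monotonically increasing timestamp source (assumption).
_clock = itertools.count(1)


def next_ts() -> int:
    return next(_clock)


class ConflictError(Exception):
    """Raised when an optimistic writer loses the commit race."""


def occ_commit(committed: List[Tuple[int, Set[str]]], start_ts: int, touched: Set[str]) -> int:
    # Optimistic check: abort if a commit that finished after we started
    # touched any of the same file groups. The loser must redo its work.
    for commit_ts, groups in committed:
        if commit_ts > start_ts and groups & touched:
            raise ConflictError(f"overlap on {sorted(groups & touched)}")
    ts = next_ts()
    committed.append((ts, set(touched)))
    return ts


def nb_append(log: List[Tuple[int, str, dict]], key: str, fields: dict) -> int:
    # Non-blocking write: never rejected, just stamped and appended.
    ts = next_ts()
    log.append((ts, key, dict(fields)))
    return ts


def nb_read(log: List[Tuple[int, str, dict]]) -> Dict[str, dict]:
    # Readers reconcile versions by applying changes in timestamp order.
    state: Dict[str, dict] = {}
    for _, key, fields in sorted(log, key=lambda entry: entry[0]):
        state.setdefault(key, {}).update(fields)
    return state


# Two optimistic writers touch the same file group: the second one aborts.
committed: List[Tuple[int, Set[str]]] = []
start_a, start_b = next_ts(), next_ts()
occ_commit(committed, start_a, {"fg-1"})
try:
    occ_commit(committed, start_b, {"fg-1"})
    occ_conflict = False
except ConflictError:
    occ_conflict = True

# The same overlapping writes succeed in the non-blocking model.
log: List[Tuple[int, str, dict]] = []
nb_append(log, "uuid-1", {"fare": 10.0})
nb_append(log, "uuid-1", {"rider": "rider-A"})
nb_append(log, "uuid-1", {"fare": 12.5})
nb_state = nb_read(log)
```

The design point the sketch tries to capture is that the optimistic writer pays for the conflict by discarding its work, whereas the non-blocking writers all complete and ordering is resolved at read or compaction time.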
New Filegroup Reader and Writer
Can we do better?
Positional merging instead of key-based merging
- Improve performance when > 50% of base records are changed
First class support for partial updates
- Reduce write amplification, read amplification
Engine agnostic abstractions
is_partial
schema (can be partial)
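The difference between key-based and positional merging can be shown with a small sketch. The record shape, the `uuid` key field, and both function names are assumptions for illustration and are not Hudi's reader API. In the key-based variant every base row's key must be read and probed against the log; in the positional variant the log addresses base rows by ordinal, so the key column never has to be consulted.

```python
from typing import Dict, List, Optional

Record = Dict[str, object]


def merge_by_key(base: List[Record], log: Dict[str, Optional[Record]]) -> List[Record]:
    # Key-based merge: extract each base row's key and look it up in the log.
    # A value of None in the log marks a delete.
    merged = []
    for row in base:
        key = row["uuid"]
        if key not in log:
            merged.append(row)
        elif log[key] is not None:
            merged.append(log[key])
    return merged


def merge_by_position(base: List[Record], log: Dict[int, Optional[Record]]) -> List[Record]:
    # Positional merge: log entries refer to the row's ordinal in the base file,
    # so the merge does not need to read or compare record keys.
    merged = []
    for position, row in enumerate(base):
        if position not in log:
            merged.append(row)
        elif log[position] is not None:
            merged.append(log[position])
    return merged


base_file = [
    {"uuid": "a", "fare": 1.0},
    {"uuid": "b", "fare": 2.0},
    {"uuid": "c", "fare": 3.0},
    {"uuid": "d", "fare": 4.0},
]
updated_c = {"uuid": "c", "fare": 9.0}

by_key = merge_by_key(base_file, {"b": None, "c": updated_c})
by_position = merge_by_position(base_file, {1: None, 2: updated_c})
```

Both calls produce the same merged rows; the sketch only illustrates the addressing scheme, not the performance difference measured in the benchmark.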
Position-based Merge Benchmark
Good gains on large updates, but still on paper
- Existing implementations, such as Iceberg's, perform poorly because they scan the entire base file.
- Hudi PR #10167 is open to make this a reality, with filter pushdown for positional merging.
Data: MOR tables, 500GB and 1TB with 1000 partitions. 50% of records deleted after initial load.
Data Size   Key-based Query Latency (ms)   Position-based Query Latency (ms)   Gains
500GB       9407                           8686                                12%
1TB         15030                          12534                               20%
Setup: AWS EMR cluster, 1 driver (m5.8xlarge) and 20 executors (m5.4xlarge), Apache Spark 3.3.3
Partial Update Benchmark
Game-changing performance improvements!
Data: 1TB MOR table, with 1000 partitions. 80% random updates in subsequent commit after bulk loading the data. Total 100 fields in schema, but updates are done only for 3 fields.
Metric                     Full Update   Partial Update   Gains
Update latency (s)         2072          1429             1.4x
Total Bytes Written (GB)   891.7         12.7             70.2x
Query latency (s)          164           29               5.7x
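Why partial updates cut write amplification can be seen with a toy record. The sketch below is an assumption-laden illustration, not Hudi's writer: the helper names are hypothetical, the row mirrors the benchmark's shape (100 fields, 3 of them updated), and counting fields is only a rough proxy for bytes written.

```python
from typing import Dict

Row = Dict[str, object]


def full_update_payload(base_row: Row, changes: Row) -> Row:
    # Full update: the writer emits the entire row, changed or not.
    return {**base_row, **changes}


def partial_update_payload(key_field: str, base_row: Row, changes: Row) -> Row:
    # Partial update: the writer emits only the record key and the changed fields.
    return {key_field: base_row[key_field], **changes}


def merge_partial(base_row: Row, payload: Row) -> Row:
    # At read or compaction time, changed fields are overlaid on the base row.
    return {**base_row, **payload}


# A row with 100 fields: the key plus 99 value columns.
base_row: Row = {"uuid": "r1", **{f"f{i}": 0 for i in range(1, 100)}}
changes: Row = {"f1": 7, "f2": 8, "f3": 9}

full_payload = full_update_payload(base_row, changes)
partial_payload = partial_update_payload("uuid", base_row, changes)
merged_row = merge_partial(base_row, partial_payload)
```

Here the full payload carries all 100 fields while the partial payload carries 4, yet merging the partial payload onto the base row reconstructs the same final record.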
Functional Index
Relational databases allow building indexes on functions or expressions
Accelerate queries based on results of computations.
Hide how data is partitioned from how data is queried.
Absorb partitioning into indexes. No more hide-and-evolving partitions!
RFC-63
Functional Index In Action
SQL Script
-- Create a Hudi table keyed by uuid and partitioned by city.
CREATE TABLE hudi_table_func_index (
    ts STRING,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING
) USING HUDI
TBLPROPERTIES (primaryKey = 'uuid')
PARTITIONED BY (city);

INSERT INTO hudi_table_func_index VALUES (...);

-- Build a functional index on hour(ts), backed by column stats.
CREATE INDEX ts_hour ON hudi_table_func_index
USING column_stats(ts) OPTIONS (func = 'hour');

-- The hour(ts) predicate can use the ts_hour index to skip files.
SELECT city, fare, rider, driver
FROM hudi_table_func_index
WHERE city NOT IN ('chennai') AND hour(ts) > 12;
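How a column-stats index on `hour(ts)` could help the query above skip files can be sketched as follows. This is a conceptual illustration only: the file names, the `YYYY-MM-DD HH:MM:SS` timestamp format, and the helper functions are assumptions, and the real index lives in Hudi's metadata rather than in a Python dictionary.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def hour(ts: str) -> int:
    # Assumed timestamp format; the slide only says ts is a STRING column.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour


def build_hour_stats(files: Dict[str, List[str]]) -> Dict[str, Tuple[int, int]]:
    # Functional index: per file, keep min and max of hour(ts), not of ts itself.
    return {
        name: (min(hour(ts) for ts in values), max(hour(ts) for ts in values))
        for name, values in files.items()
    }


def prune_hour_gt(stats: Dict[str, Tuple[int, int]], threshold: int) -> List[str]:
    # For the predicate hour(ts) > threshold, a file can be skipped
    # when even its maximum hour does not exceed the threshold.
    return sorted(name for name, (_, max_hour) in stats.items() if max_hour > threshold)


# Hypothetical data files with their ts values.
data_files = {
    "file-1": ["2023-09-20 03:15:00", "2023-09-20 09:40:00"],
    "file-2": ["2023-09-20 11:00:00", "2023-09-20 14:30:00"],
    "file-3": ["2023-09-20 18:05:00"],
}

hour_stats = build_hour_stats(data_files)
files_to_scan = prune_hour_gt(hour_stats, 12)
```

In this sketch `file-1` is skipped for `hour(ts) > 12` because its largest hour is 9, so only `file-2` and `file-3` would need to be read.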
Come Build With The Community!
Docs: https://hudi.apache.org
Blogs: https://hudi.apache.org/blog
Slack: Apache Hudi Slack Group
LinkedIn: company/apache-hudi
Twitter: https://twitter.com/apachehudi
GitHub: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?
Join Hudi Slack