The Best of Both Worlds: Hybrid Clustering with Delta Lake

Scott Haines, Distinguished Engineer, Nike

Session Overview
Hybrid Clustering with Delta Lake
https://www.youtube.com/watch?v=l8CEyXgi7y

Introduction
Optimization Techniques

Read Optimizations
Goal: Make data access faster and more efficient at read time.
* indexing
* compression
* data layout
* caching

Write Optimizations
Goal: Make writing or inserting data into a system fast and efficient.
* batching (micro-batch) insertion flows
* using append-only style tables (no upsert, no point-delete)
* workflow shared caches (Spark executor cache reuse)

Decision Foundation

Introduction
Table Decision: Indecision.

Hive-Style Partitioning: Where it Shines
● Provides read and write isolation.
● Data skipping on partition columns, available via the file-system path.
● Simple to understand the table layout.

Liquid Clustering: Where it Shines
● Clustering flexibility (can change during the table lifecycle).
● Removes data skew.
● *Eliminates the small file problem.
● Incremental optimization.

The behavior of your table should act as your guide.

Introduction
Tuning our tables can feel like a classic Goldilocks problem.

Hive-Style Partitioning: Performance Tuning
● Run OPTIMIZE to reduce small files (limited to partition boundaries)
● Run OPTIMIZE ZORDER BY to colocate data (non-incremental)

Liquid Clustering: Performance Tuning
● Run OPTIMIZE (eliminates data skew)

Table Optimization
Wait… What About my Third Option?

Table Optimization
Why Not Use Both?

Architecture Patterns
Use Case: Fast and Slow Systems
[Diagram: a timeline running from Newest to Oldest. Current (active) Data sits at the newest end, Historic (closed) Data at the oldest end, separated by the Close of Books Interval.]

Workflow Design
Sources

Table: lifecycle_events
- Supports Partition Overwrite, Upsert
- Batch: Periodic Complete Partition Overwrites
- Retention: 14 days
- Unstable table data period: close of books (7d), up to 14d for ooops

Table: lifecycle_events_historic
- Scheduled job moves data after close of books (Cron: Daily)
- Retention: 3 years
- Stable table data: retention dictated by need (3 years in this example)

Hybrid Optimization
Mixed Optimization Techniques
As a workflow across tables.
- Individual tables are optimized for their specific use cases
- This means selecting the best technique based on the context (use case)
- Leaning on Virtual Tables (Views) to connect the dots
Plays well with the Medallion Architecture. Simplified with Unity Catalog.
^^ Virtual Tables are your friend.

What We’re Going to Build!

Creating the Bronze Capture Table
Foundation: Ingestion Table
Requirements
- Table must have write isolation (for multiple writers)
- Allow merge operations
- Enable simple deletes at partition boundaries
- Stores mutating changes during close of books
- Retention 14 days
“Every week at close-of-books, data will be stored in a long-term historic table”

Creating the Current View
Current: Short-Term Memory
UC Functions
Registered functions that can be applied in creative ways. Easy to reuse and share.
Create a Function: Fetch the Close of Books Date
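
The slides do not show the body of the close-of-books function, so the following is only a minimal sketch of the idea in plain Python rather than a Unity Catalog SQL function. The name `close_of_books_date` and the default 7-day window (taken from the "Close of Books (7d)" note in the workflow design) are assumptions for illustration.

```python
from datetime import date, timedelta

# Assumption: the close-of-books window is 7 days, per the workflow slide.
CLOSE_OF_BOOKS_DAYS = 7


def close_of_books_date(today: date, window_days: int = CLOSE_OF_BOOKS_DAYS) -> date:
    """Return the oldest date that still falls inside the close-of-books window."""
    if window_days < 0:
        raise ValueError("window_days must be non-negative")
    return today - timedelta(days=window_days)


if __name__ == "__main__":
    # Anything on or after this date is still "current" (mutable) data.
    print(close_of_books_date(date(2024, 3, 15)))
```

Passing `window_days=14` would model the extended "ooops" period instead of the regular 7-day close.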

UC Row Filters
- applied at the table globally
- *typically used for “sensitive data access”
Can also be used to simplify the “behavior” of time-bounded tables and only show “what we want to make visible”.
Create a Row Filter Function: Check if the row is “current”
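
A row filter is essentially a boolean predicate evaluated per row. The sketch below models that predicate in plain Python; it is not the Unity Catalog function from the talk. The names `is_current` and `event_date`, and the 7-day default window, are assumptions for illustration.

```python
from datetime import date, timedelta


def is_current(event_date: date, today: date, window_days: int = 7) -> bool:
    """True when the row still falls inside the close-of-books window."""
    cutoff = today - timedelta(days=window_days)
    return event_date >= cutoff


# Hypothetical rows: (event_id, event_date)
rows = [
    (1, date(2024, 3, 1)),   # outside the window
    (2, date(2024, 3, 8)),   # exactly on the cutoff, still current
    (3, date(2024, 3, 15)),  # today
]

today = date(2024, 3, 15)
visible = [row for row in rows if is_current(row[1], today)]
```

Because the filter is attached to the table, every reader sees only the rows for which the predicate holds, without having to repeat the date logic in each query.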

Leaning on Views
Utilizing periodic view refresh to select all data that falls into the “close-of-books” window.
Create or Replace Clamped View: Virtual View of “current” window
^^ pushes down to partition filters
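
To make the "clamped view" idea concrete, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for Delta Lake and Unity Catalog. SQLite has no `CREATE OR REPLACE VIEW`, so the refresh is modeled as drop-and-create. The view name `lifecycle_events_current`, the `event_date` column, and the 7-day window are assumptions for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lifecycle_events (event_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO lifecycle_events VALUES (?, ?)",
    [(1, "2024-03-01"), (2, "2024-03-10"), (3, "2024-03-15")],
)


def refresh_current_view(conn: sqlite3.Connection, today: date, window_days: int = 7) -> None:
    """Rebuild the view so it only exposes rows inside the close-of-books window."""
    # The cutoff is computed in Python, so the interpolated literal is always
    # a well-formed ISO date (views cannot take bound parameters in SQLite).
    cutoff = (today - timedelta(days=int(window_days))).isoformat()
    conn.execute("DROP VIEW IF EXISTS lifecycle_events_current")
    conn.execute(
        "CREATE VIEW lifecycle_events_current AS "
        "SELECT event_id, event_date FROM lifecycle_events "
        f"WHERE event_date >= '{cutoff}'"
    )


refresh_current_view(conn, date(2024, 3, 15))
current = conn.execute(
    "SELECT event_id, event_date FROM lifecycle_events_current ORDER BY event_id"
).fetchall()
```

Baking a literal cutoff into the view on each refresh keeps the predicate a simple comparison on the date column, which is the kind of filter a partitioned table can use to skip whole partitions.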

Table Optimization
Hybrid Clustering with Delta Lake
Historic: Long-Term Memory
Historic Table
Liquid Clustering Enabled
- data is loaded into the historic table at the “end” of the “close-of-books”
- treating the table like an “append-only” table
- allows for highly optimized reads of historic data

Table Optimization
Hybrid Clustering with Delta Lake
Historic: Long-Term Memory
Append Flow
Remember the Kappa Architecture?
A periodic job runs to take the latest close-of-books data and drop it into the long-term historic table.
- While this job is batch-esque, it honors throttling and appends a specific series of rows to our historic table, accounting for data at the daily edges (starting and ending timestamps)
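
One way to account for data at the daily edges in the append flow is to select each day's rows with a half-open interval, so a row stamped exactly at midnight is appended once and only once. The talk does not show the job itself; the helper `rows_to_append`, the `event_ts` field, and the half-open convention are assumptions for illustration, and throttling is not modeled here.

```python
from datetime import datetime, timedelta


def rows_to_append(rows: list[dict], day_start: datetime) -> list[dict]:
    """Select the rows for one day using a half-open window [day_start, day_end)."""
    day_end = day_start + timedelta(days=1)
    return [row for row in rows if day_start <= row["event_ts"] < day_end]


# Hypothetical events around the daily boundaries.
events = [
    {"event_id": 1, "event_ts": datetime(2024, 3, 7, 23, 59, 59)},
    {"event_id": 2, "event_ts": datetime(2024, 3, 8, 0, 0, 0)},   # starting edge
    {"event_id": 3, "event_ts": datetime(2024, 3, 8, 12, 30, 0)},
    {"event_id": 4, "event_ts": datetime(2024, 3, 9, 0, 0, 0)},   # belongs to the next day
]

batch = rows_to_append(events, datetime(2024, 3, 8))
next_batch = rows_to_append(events, datetime(2024, 3, 9))
```

Running the selection for consecutive days yields disjoint batches, which keeps the historic table append-only with no duplicates at the boundaries.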

Creating the Composite View
Best of Both Worlds
View Composition
Using UNION to combine Historic and Current Tables and Views.
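
The composite view can be sketched with Python's built-in `sqlite3` as a stand-in for Delta Lake. The slide says UNION; this sketch uses `UNION ALL` on the assumption that the clamped current view and the historic table never overlap, so no de-duplication pass is needed. The view names, the `event_date` column, and the fixed cutoff are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lifecycle_events (event_id INTEGER, event_date TEXT);
    CREATE TABLE lifecycle_events_historic (event_id INTEGER, event_date TEXT);

    -- Ingestion table: 14-day retention, so it still holds row 2 after it closed.
    INSERT INTO lifecycle_events VALUES
        (2, '2024-03-05'), (3, '2024-03-10'), (4, '2024-03-15');

    -- Historic table: rows already moved after close of books.
    INSERT INTO lifecycle_events_historic VALUES
        (1, '2023-12-01'), (2, '2024-03-05');

    -- Clamped "current" view (assumed cutoff: 2024-03-08).
    CREATE VIEW lifecycle_events_current AS
        SELECT event_id, event_date FROM lifecycle_events
        WHERE event_date >= '2024-03-08';

    -- Composite view: historic (closed) plus current (active) data.
    CREATE VIEW lifecycle_events_all AS
        SELECT event_id, event_date FROM lifecycle_events_historic
        UNION ALL
        SELECT event_id, event_date FROM lifecycle_events_current;
    """
)

combined = conn.execute(
    "SELECT event_id, event_date FROM lifecycle_events_all ORDER BY event_id"
).fetchall()
```

Row 2 exists in both physical tables during the retention overlap, but the clamp on the current side hides it there, so readers of the composite view see it exactly once.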

Creating the Composite View
Best of Both Worlds
View Composition
“If using Unity Catalog Row Filters, you don’t need to create the two virtual tables, since we can automatically ignore rows that fall out of the predicate”

One More Thing…
Achieving Hybrid Dynamic Views
“Whoa. That is neat…”

Questions and Answers Time…
Now. Ask us Anything
It might even become the topic of another webinar!
