Delta Lake: The Definitive Guide, Compliments of Databricks
Delta Lake has revolutionized data architectures by combining the best of data lakes
and warehouses into the lakehouse architecture. This definitive guide by O’Reilly is
an essential resource for anyone looking to harness the full potential of Delta Lake. It
offers deep insights into building scalable, reliable, high-performance data architectures.
Whether you’re a data engineer, scientist, or practitioner, this book will empower you to
tackle your toughest data challenges with confidence and precision.
—Matei Zaharia, associate professor of computer science at UC
Berkeley and cofounder and chief technologist at Databricks
This book not only provides excellent code examples for Delta Lake but also explains
what happens behind the scenes. It’s a resource I’ll continue to rely on as a practical
reference for Delta Lake APIs. Furthermore, it covers the latest exciting innovations
within the Delta Lake ecosystem.
—Ryan Zhu, founding developer of Delta Lake, cocreator of
Delta Sharing, Apache Spark PMC member, Delta Lake maintainer
The authors of this book fuse deep technical knowledge with pragmatism and clear
exposition to allow readers to bring their Spark data lakehouse aspirations to life with the
Delta Lake framework.
—Matt Housley, CTO and coauthor of
Fundamentals of Data Engineering
Open table formats are the future. If you are invested in Delta Lake, this book will take
you from zero to 100, including use cases, integrations, and how to overcome hiccups.
—Adi Polak, author of Scaling Machine Learning with Spark
There are two types of people in data: those who believe they understand what
Delta Lake is and those who read this book.
—Andy Petrella, part of the second group, author of
Fundamentals of Data Observability, and founder of Kensu
Look no further if you want to master all things Delta Lake. Denny, Tristen, Scott, and
Prashanth have gone above and beyond to give you more experience than you could
ever imagine.
—Jacek Laskowski, freelance Data(bricks) engineer
Delta Lake is much more than Apache Parquet with a commit log. Delta Lake: The
Definitive Guide takes the mystery out of streaming, data governance, and design patterns.
—Bartosz Konieczny, waitingforcode.com
Delta Lake: The Definitive Guide
Modern Data Lakehouse Architectures
with Data Lakes
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Delta Lake: The Definitive Guide, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your
own risk. If any code samples or other technology this work contains or describes is subject to open
source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Databricks. See our statement of editorial
independence.
978-1-098-15195-9
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
PySpark Shell 24
JupyterLab Notebook 25
Scala Shell 25
Delta Rust API 26
ROAPI 27
Native Delta Lake Libraries 28
Multiple Bindings Available 29
Installing the Delta Lake Python Package 29
Apache Spark with Delta Lake 29
Setting Up Delta Lake with Apache Spark 30
Prerequisite: Set Up Java 30
Setting Up an Interactive Shell 31
PySpark Declarative API 33
Databricks Community Edition 33
Create a Cluster with Databricks Runtime 33
Importing Notebooks 35
Attaching Notebooks 36
Conclusion 37
Installing the Connector 62
DeltaSource API 62
DeltaSink API 66
End-to-End Example 69
Kafka Delta Ingest 71
Install Rust 72
Build the Project 72
Run the Ingestion Flow 73
Trino 75
Getting Started 75
Configuring and Using the Trino Connector 79
Using Show Catalogs 79
Creating a Schema 80
Show Schemas 80
Working with Tables 81
Table Operations 84
Conclusion 88
6. Building Native Applications with Delta Lake. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Getting Started 116
Python 116
Rust 127
Building a Lambda 131
What’s Next 137
9. Architecting Your Lakehouse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
The Lakehouse Architecture 188
What Is a Lakehouse? 188
Learning from Data Warehouses 188
Learning from Data Lakes 189
The Dual-Tier Data Architecture 190
Lakehouse Architecture 192
Foundations with Delta Lake 193
Open Source on Open Standards in an Open Ecosystem 193
Transaction Support 195
Schema Enforcement and Governance 197
The Medallion Architecture 201
Exploring the Bronze Layer 202
Exploring the Silver Layer 205
Exploring the Gold Layer 208
Streaming Medallion Architecture 210
Conclusion 212
10. Performance Tuning: Optimizing Your Data Pipelines with Delta Lake. . . . . . . . . . . . . 213
Performance Objectives 214
Maximizing Read Performance 214
Maximizing Write Performance 216
Performance Considerations 217
Partitioning 218
Table Utilities 220
Table Statistics 226
Cluster By 236
Bloom Filter Index 240
Conclusion 242
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Foreword by Michael Armbrust
The Delta protocol was first conceived when I met Dominique Brezinski at Spark
Summit 2017. As he described to me the scale of data processing that he was envi‐
sioning, I knew that, through our collaborative approach to running Apache Spark,
Databricks had already laid down the building blocks of the cloud-scale computing
environment necessary to make him successful. Yet I also knew that these fundamen‐
tals would inevitably prove to be insufficient without us introducing a novel system
to manage the complexities of transactional access to the ever-growing lake of data
that Dom had been collecting in his private cloud. Recognizing that Apache Spark
itself could serve as the engine of scalable transaction consistency enforcement was
the key insight that underpins the ongoing success of Delta Lake. That is, to simplify
and scale, we treated the metadata the same way we processed and queried the data.
Translating this single insight and the resulting protocol into Delta Lake, a compre‐
hensive toolset for developers to use in any streaming data management solution, has
been a long road, with many collaborations along the way. Becoming an open source
project allowed Delta Lake to evolve through community input and contributions.
The robust ecosystem that has resulted now includes multiple implementations of the
Delta protocol, in multiple frameworks, such as Flink, Trino, Presto, and Pulsar, and
in multiple languages, including Rust, Go, Java, Scala, Hive, and Python.
To celebrate and further build on this vibrant open source community, I’m now
excited to present Delta Lake: The Definitive Guide. This guide details Delta Lake’s
architecture, use cases, and best practices, catering to data engineers, scientists, and
analysts alike. It encapsulates years of innovation in data management, offering a
comprehensive resource for unlocking Delta Lake’s full potential. As you explore this
book, you’ll gain the knowledge to leverage Delta Lake’s capabilities in your projects.
I’m eager to see how you’ll use it to drive innovation and achieve your data goals.
Welcome to the shore of the Delta Lake. The water is great—let’s take a swim!
— Michael Armbrust
Creator of Delta Lake, Spark PMC Member,
Delta Lake TSC and Maintainer
Delta Lake emerged from Michael and my discussions about the challenges I encoun‐
tered when building a high-scale streaming ETL system using Apache Spark, EC2,
and S3. We faced the same challenges at Apple in processing vast amounts of data
for intrusion monitoring and threat response. We needed to build a system that
could do not only streaming ingestion but also streaming detection and support
performant queries over a long retention window of large datasets. From these
requirements Delta Lake was created to support ACID transactions and seamless
integration of batch and streaming processes, allowing us to handle petabytes of daily
data efficiently.
This guide reveals Delta Lake’s architectural fundamentals, practical applications, and
best practices. Whether you’re a data engineer, scientist, or business leader, you’ll find
valuable insights to leverage Delta Lake effectively.
I’m excited for you to explore this guide and witness how Delta Lake can propel your
own innovations. Together, we’re shaping the future of data management, enabling
the construction of reliable and performant data lakehouses.
— Dominique Brezinski
Distinguished Engineer, Apple
Delta Lake Technical Steering Committee Member
Preface
Welcome to Delta Lake: The Definitive Guide! Since it became an open source project
in 2019, Delta Lake has revolutionized how organizations manage and process their
data. Designed to bring reliability, performance, and scalability to data lakes, Delta
Lake addresses many of the inherent challenges traditional data lake architectures
face.
Over the past five years, Delta Lake has undergone significant transformation. Origi‐
nally focused on enhancing Apache Spark, Delta Lake now boasts a rich ecosystem
with integrations across various platforms, including Apache Flink, Trino, and many
more. This evolution has enabled Delta Lake to become a versatile and integral
component of modern data engineering and data science workflows.
Delta Lake has evolved significantly since its inception, growing beyond its initial focus on Apache Spark to
embrace a wide array of integrations with multiple languages and frameworks. To
reflect this diversity, we’ve included code examples featuring Flink, Kafka, Python,
Rust, Spark, Trino, and more. This broad coverage ensures that you’ll find relevant
examples regardless of your preferred tools and languages.
While we cover the fundamental concepts, we’ve also included our personal experien‐
ces and lessons learned. More importantly, we go beyond theory to offer practical
guidance on running a production lakehouse successfully. We’ve included best prac‐
tices, optimization techniques, and real-world scenarios to help you navigate the
challenges of implementing and maintaining a Delta Lake–based system at scale.
Whether you’re a data engineer, architect, or scientist, our goal is to equip you with
the knowledge and tools to leverage Delta Lake effectively in your data projects. We
hope this guide serves as your companion in building robust, efficient, and scalable
lakehouse architectures.
Chapter 4, “Diving into the Delta Lake Ecosystem”
We delve into the Delta Lake ecosystem, discussing the many frameworks, serv‐
ices, and community projects that support Delta Lake. This chapter includes code
samples for the Flink DataStream Connector, Kafka Delta Ingest, and Trino.
Chapter 5, “Maintaining Your Delta Lake”
While Delta Lake provides optimal reading and writing out of the box, develop‐
ers reading this book will want to further tweak Delta Lake configuration and
settings to get even more performance. This chapter looks at using table proper‐
ties, optimizing your table with Z-Ordering, table tuning and management, and
repairing/restoring your table.
Chapter 6, “Building Native Applications with Delta Lake”
The delta-rs project was built from scratch by the community starting in 2020.
Together, we built a Delta Rust API using native code, thus allowing developers
to take advantage of Delta Lake’s reliability without needing to install or maintain
the JVM (Java virtual machine). In this chapter, we will dive into this project and
its popular Python bindings.
Chapter 11, “Successful Design Patterns”
To help you build a successful production environment, we look at slashing
compute costs, efficient streaming ingestion, and coordinating complex systems.
Chapter 12, “Foundations of Lakehouse Governance and Security”, and Chapter 13,
“Metadata Management, Data Flow, and Lineage”
Next, we have detailed chapters on lakehouse governance! From access control
and the data asset model to unifying data warehousing and lake governance, data
security, metadata management, and data flow and lineage, these two chapters set
the foundation for your governance story.
Chapter 14, “Data Sharing with the Delta Sharing Protocol”
Delta Sharing is an open protocol for secure, real-time data sharing across organ‐
izations and computing platforms. It allows data providers to share live data
directly from their Delta Lake tables without the need for data replication or
copying to another system. In this chapter, we explore these topics further.
Our unique network of experts and innovators share their knowledge and expertise
through books, articles, and our online learning platform. O’Reilly’s online learning
platform gives you on-demand access to live training courses, in-depth learning
paths, interactive coding environments, and a vast collection of text and video from
O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at https://oreil.ly/DeltaLakeDefGuide.
For news and information about our books and courses, visit https://oreilly.com.
Find us on LinkedIn: https://linkedin.com/company/oreilly-media.
Watch us on YouTube: https://youtube.com/oreillymedia.
Acknowledgments
This book has truly been a team effort and a labor of love. As authors, we were driven
by a strong desire to share our lessons learned and best practices with the community.
The journey of bringing this book to life has been immensely rewarding, and we are
deeply grateful to everyone who contributed along the way.
First and foremost, we would like to extend our heartfelt thanks to some of the early
contributors who played a pivotal role in making Delta Lake a reality. Our sincere
gratitude goes out to Ali Ghodsi, Allison Portis, Burak Yavuz, Christian Williams,
Dominique Brezinski, Florian Valeye, Gerhard Brueckl, Matei Zaharia, Michael Arm‐
brust, Mykhailo Osypov, QP Hou, Reynold Xin, Robert Pack, Ryan Zhu, Scott Sandre,
Tathagata Das, Thomas Vollmer, Venki Korukanti, and Will Jones; your vision and
dedication laid the foundation for this project, and without your efforts, this book
would not have been possible.
We are also incredibly thankful to the numerous reviewers who provided us with
invaluable guidance. Their diligent and constructive feedback, informed by their
technical expertise and perspectives, has shaped this book into a valuable resource for
learning about Delta Lake. Special thanks to Adi Polak, Aditya Chaturvedi, Andrew
Bauman, Andy Petrella, Bartosz Konieczny, Holden Karau, Jacek Laskowski, Jobinesh
Purushothaman, Matt Housley, and Matt Powers; your insights have been instrumen‐
tal in refining our work.
A massive shout-out goes to R. Tyler Croy, who started as a reviewer of this book and
eventually joined the author team. His contributions have been invaluable, and his
work on Chapter 6, “Building Native Applications with Delta Lake”, is a testament to
his dedication and expertise. Thank you, Tyler, for your unwavering support and for
being an integral part of this journey.
Last but certainly not least, we want to thank the Delta Lake community. As of this
book’s release, it has been a little more than five years since Delta Lake became open
source. Throughout this time, we have experienced many ups and downs, but we
have grown together to create an amazing project and community. Your enthusiasm,
collaboration, and support have been the driving force behind our success.
Thank you all for being a part of this incredible journey!
Denny
On a personal note, I would like to express my deepest gratitude to my wonder‐
ful family and friends. Your unwavering support and encouragement have been
my anchor throughout this journey. A special thank-you to my amazing children,
Katherine, Samantha, and Isabella, for your patience and love. And to my partner and
wonderful wife, Hua-Ping, I could not have done this without you or your constant
support and patience.
Tristen
I could not have made it through the immense effort (and many hours) required to
pour myself into this book without the many people who helped me become who
I am today. I want to thank my wife, Jessyca, for her loving and patient endurance,
and my children, Jake, Zek, and Ada, for always being a motivation and a source of
inspiration to keep going the distance. I would also like to thank my good friend
Steven Yu for helping to guide and encourage me over the years we’ve known each
other; my parents, Kirk and Patricia, for always being encouraging; and the numerous
colleagues with whom I have shared many experiences and conversations.
Scott
Getting to the end of a book as an author is a fascinating journey. It requires patience
and dedication, but even more, you leave part of yourself behind in the pages you
write, and in a very real sense you leave the world behind as you write. Finding the
time to write is a balancing act that tries the patience of your friends and family. To
my wife, Lacey: thanks for putting up with another book. To my dogs, Willow and
Clover: I’m sorry I missed walks and couch time. To my family: thanks for always
being there, and for pretending to get excited as I talk about distributed data (your
glassy eyes give you away every time). To my friends: I owe all of you more personal
time now and promise to drive up to the Bay Area more often. Last, I lost my little
sister Meredith while writing this book, and as a means of memorializing her, I’ve
hidden inside jokes and things that would have made her laugh throughout the book
and in the examples and data. I love you, Meredith.
Prashanth
I extend my deepest gratitude to my wife, Kavyasudha, for her unwavering support,
patience, and love throughout the journey. Your belief in me, even during the most
challenging times, has been my anchor. To our curious and joyful child, Advaith,
thank you for your infectious laughter and understanding, which have provided end‐
less motivation and joy. Your curiosity and energy remind me daily of the importance
of perseverance and passion. To both of you, I extend all my love and appreciation.
CHAPTER 1
Introduction to the Delta Lake
Lakehouse Format
This chapter explains Delta Lake’s origins and how it was initially designed to address
data integrity issues around petabyte-scale systems. If you are familiar with Delta
Lake’s history and instead want to dive into what Delta Lake is, its anatomy, and
the Delta transaction protocol, feel free to jump ahead to the section “What Is Delta
Lake?” on page 6 later in this chapter.
Data warehousing
Data warehouses are purpose-built to aggregate and process large amounts of struc‐
tured data quickly (Figure 1-1). To protect this data, they typically use relational
databases to provide ACID transactions, a step that is crucial for ensuring data
integrity for business applications.
Figure 1-1. Data warehouses are purpose-built for querying and aggregating structured
data
Data lakes
Data lakes are scalable storage repositories (HDFS, cloud object stores such as Ama‐
zon S3, ADLS Gen2, and GCS, and so on) that hold vast amounts of raw data
in their native format until needed (see Figure 1-2). Unlike traditional databases,
Figure 1-2. Data lakes are built for storing structured, semistructured, and unstructured
data on scalable storage infrastructure (e.g., HDFS or cloud object stores)
While data lakes could handle all your data for data science and machine learning,
they are an inherently unreliable form of data storage. Instead of providing ACID
protections, these systems follow the BASE model—basically available, soft-state,
and eventually consistent. The lack of ACID guarantees means that processing
failures can leave your storage in an inconsistent state with orphaned files.
Figure 1-3. Lakehouses are the best of both worlds between data warehouses and data
lakes
Delta Lake, Apache Iceberg, and Apache Hudi are the most popular open source
lakehouse formats. As you can guess, this book will focus on Delta Lake.1
1 To learn more about lakehouses, see the 2021 CIDR whitepaper “Lakehouse: A New Generation of Open
Platforms That Unify Data Warehousing and Advanced Analytics”.
Figure 1-4. Delta Lake provides a scalable, open, general-purpose transactional data
format for your lakehouse
However, as it has evolved, Delta Lake has been optimally designed to work with
numerous workloads (small data, medium data, big data, etc.). It has also been
designed to work with multiple frameworks (e.g., Apache Spark, Apache Flink,
Trino, Presto, Apache Hive, and Apache Druid), services (e.g., Athena, BigQuery,
Databricks, EMR, Fabric, Glue, Starburst, and Snowflake), and languages (.NET, Java,
Python, Rust, Scala, SQL, etc.).
Figure 1-5. Delta Lake table layout for the transaction log and data files (adapted from
an image by Denny Lee)2
2 Denny Lee, “Understanding the Delta Lake Transaction Log at the File Level”, Denny Lee (blog), November
26, 2023.
Figure 1-6. (left) Creating a new Delta table by adding Parquet files and their relation‐
ship with the Delta transaction log; (right) deleting rows from this Delta table by
removing and adding files and their relationship with the Delta transaction log
• If a user were to read the Parquet files without reading the Delta transaction
log, they would read duplicates because of the replicated rows in all the files
(1.parquet, 2.parquet, 3.parquet).
• The remove and add actions are wrapped in the single transaction log
000...00001.json. When a client queries the Delta table at this time, it reads
both of these actions to resolve the filepaths for that snapshot. For this transaction, the
filepath would point only to 3.parquet.
• Note that the remove operation is a soft delete, or tombstone: the physical
removal of the files (1.parquet, 2.parquet) has yet to happen. The physical
removal of files happens when the VACUUM command is executed (a short sketch follows this list).
• The previous transaction 000...00000.json has the filepath pointing to the original
files (1.parquet, 2.parquet). Thus, when querying for an older version of the Delta
table via time travel, the transaction log points to the files that make up that older
snapshot.
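To make the tombstone behavior concrete, here is a brief, hedged sketch of triggering the physical cleanup from PySpark; the table name is illustrative, and the 168-hour retention shown matches the commonly documented default:
# Python
# Physically remove files that are no longer referenced by the current
# table version and are older than the retention window
# (the table name here is hypothetical)
spark.sql("VACUUM my_table RETAIN 168 HOURS")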
Figure 1-8. Delta Lake avoids the partial files scenario because of its transaction log
At t1, job 1 fails with the creation of 3.parquet. However, because the job failed, the
transaction was not committed to the transaction log. No new files are recorded in the log, so readers querying the table never see the partial 3.parquet file.
Table Features
Originally, Delta tables used protocol versions to map to a set of features to ensure
user workloads did not break when new features in Delta were released. For example,
if a client wanted to use Delta’s Change Data Feed (CDF) option, users were required
to upgrade their protocol versions and validate their workloads to access new features
(Figure 1-9). This ensured that any readers or writers incompatible with a specific
protocol version were blocked from reading or writing to that table to prevent data
corruption.
The advantage of the table features approach is that connectors (or integrations) can selectively
implement only the features they are interested in, instead of having to support all of
them. A quick way to view what table features are enabled is to run the query SHOW
TBLPROPERTIES:
SHOW TBLPROPERTIES default.my_table;
The output would look similar to the following:
Key (String) Value (String)
delta.minReaderVersion 3
delta.minWriterVersion 7
delta.feature.deletionVectors supported
delta.enableDeletionVectors true
delta.checkpoint.writeStatsAsStruct true
delta.checkpoint.writeStatsAsJson false
Delta Kernel
As previously noted, Delta Lake provides ACID guarantees and performance across
many frameworks, services, and languages. As of this writing, every time new features
are added to Delta Lake, the connector must be rewritten entirely, because there is
a tight coupling between the metadata and data processing. Delta Kernel simplifies
the development of connectors by abstracting out all the protocol details so the
connectors do not need to understand them. Kernel itself implements the Delta
transaction log specification (per the previous section). This allows the connectors to
build only against the Kernel library, which provides the following advantages:
Modularity
Creating Delta Kernel allows for more easily maintained parity between Delta
Lake Rust and Scala/JVM, enabling both to be first-class citizens. All metadata
(i.e., transaction log) logic is coordinated and executed through the Kernel
library. This way, the connectors need only focus on how to perform operations in their
respective frameworks, services, and languages. For example, the Apache Flink/Delta
Lake connector needs to focus only on reading or modifying the specific files
provided by Delta Kernel. The end client does not need to understand the
semantics of the transaction log.
Extensibility
Delta Kernel decouples the logic for the metadata (i.e., transaction log) from
the data. This allows Delta Lake to be modular, extensible, and highly portable
(for example, you can copy the entire table with its transaction log to a new
location for your AI workloads). This also extends (pun intended) to Delta Lake’s
extensibility, as a connector is now, for example, provided the list of files to read
instead of needing to query the transaction log directly. Delta Lake already has
many integrations, and by decoupling the logic around the metadata from the
data, it will be easier for all of us to maintain our various connectors.
Delta Kernel achieves this level of abstraction through the following requirements:
It provides narrow, stable APIs for connectors.
For a table scan query, a connector needs to specify only the query schema, so
that the Kernel can read only the required columns, and the query filters for
Kernel to skip data (files, rowgroups, etc.). APIs will be stable and backward
compatible. Connectors should be able just to upgrade the Delta Kernel version
without rewriting their client code—that is, they automatically get support for an
updated Delta protocol via Table Features.
Delta UniForm
As noted in the section “Lakehouses (or data lakehouses)” on page 4, there are
multiple lakehouse formats. Delta Universal Format, or UniForm, is designed to
simplify the interoperability among Delta Lake, Apache Iceberg, and Apache Hudi.
Fundamentally, lakehouse formats are composed of metadata and data (typically in
Parquet file format).
What makes these lakehouse formats different is how they create, manage, and
maintain the metadata associated with this data. With Delta UniForm, the metadata
of other lakehouse formats is generated concurrently with the Delta format. This way,
whether you have a Delta, Iceberg, or Hudi client, it can read the data, because all
of their APIs can understand the metadata. Delta UniForm includes the following
support:
• Apache Iceberg support as part of Delta Lake 3.0.0 (October 2023)
• Apache Hudi support as part of Delta Lake 3.2.0 (May 2024)
For the latest information on how to enable these features, please refer to the Delta
UniForm documentation.
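As a brief, hedged sketch (the property names below follow the Delta Lake 3.x documentation; consult the UniForm documentation for your version), UniForm can be enabled through table properties at creation time:
# Python
# Create a Delta table that also generates Iceberg metadata alongside
# the Delta transaction log (table and column names are illustrative)
spark.sql("""
    CREATE TABLE my_table (id BIGINT, name STRING)
    USING DELTA
    TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")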
Conclusion
In this chapter, we explained the origins of Delta Lake, what it is and what it does,
its anatomy, and the transaction protocol. We emphasized that the Delta transaction
log is the single source of truth and thus is the single source of the relationship
between its metadata and data. While still early, this has led to the development of
Delta Kernel as the foundation for simplifying the building of Delta connectors for
Delta Lake’s many frameworks, services, and community projects. The core difference
between the different lakehouse formats is their metadata, so Delta UniForm unifies
them by generating all formats’ metadata.
In this chapter, we will show you how to set up Delta Lake and walk you through the
simple steps to start writing your first standalone application.
There are multiple ways you can install Delta Lake. If you are just starting, using
a single machine with the Delta Lake Docker image is the best option. If you want
to skip the hassle of a local installation, the Databricks Community Edition, which
includes the latest version of Delta Lake, is free. Various free trials of Databricks,
which natively provides Delta Lake, are also available; check your cloud provider’s
documentation for additional details. Other options discussed in this chapter include
the Delta Rust Python bindings, the Delta Rust API, and Apache Spark. In this
chapter, we also create and verify the Delta Lake tables for illustrative purposes. Delta
Lake table creation and other CRUD operations are covered in depth in Chapter 3.
Please note this Docker image comes preinstalled with the following:
Apache Arrow
Apache Arrow is a development platform for in-memory analytics and aims
to provide a standardized, language-independent columnar memory format for
flat and hierarchical data, as well as libraries and tools for working with this
format. It enables fast data processing and movement across different systems
and languages, such as C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python,
R, Ruby, and Rust.
DataFusion
Created in 2017 and donated to the Apache Arrow project in 2019, DataFusion
is a fast, extensible query engine for building high-quality data-centric systems
written in Rust that uses the Apache Arrow in-memory format.
ROAPI
ROAPI is a no-code solution to automatically spin up read-only APIs for Delta
Lake and other sources; it builds on top of Apache Arrow and DataFusion.
Rust
Rust is a statically typed, compiled language that offers performance akin to C
and C++, but with a focus on safety and memory management. It’s known for its
unique ownership model that ensures memory safety without a garbage collector,
making it ideal for systems programming in which control over system resources
is crucial.
We will discuss each of the following interfaces in detail, including how to create and
read Delta Lake tables with each one:
PySpark Shell
First, open a bash shell and run a container from the built image with a bash
entrypoint.
Next, launch a PySpark interactive shell session:
# Bash
$SPARK_HOME/bin/pyspark --packages io.delta:${DELTA_PACKAGE_VERSION} \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Let’s run some basic commands in the shell:
# Python
# Create a Spark DataFrame
data = spark.range(0, 5)
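To confirm that the shell is wired up for Delta Lake, here is a short sketch that writes the DataFrame out as a Delta table and reads it back; the /tmp path is illustrative:
# Python
# Write the DataFrame out in Delta format (the path is illustrative)
data.write.format("delta").save("/tmp/delta-table")

# Read the Delta table back and display its rows
spark.read.format("delta").load("/tmp/delta-table").show()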
JupyterLab Notebook
Open a bash shell and run a container from the built image with a JupyterLab
entrypoint:
# Bash
docker run --name delta_quickstart --rm -it \
-p 8888-8889:8888-8889 delta_quickstart
The command will output a JupyterLab notebook URL. Copy the URL and launch a
browser to follow along in the notebook and run each cell.
Scala Shell
First, open a bash shell and run a container from the built image with a bash
entrypoint. Next, launch a Scala interactive shell session:
# Bash
$SPARK_HOME/bin/spark-shell --packages io.delta:${DELTA_PACKAGE_VERSION} \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
For all the following instructions, make sure to install the version
of Spark or PySpark that is compatible with Delta Lake 2.3.0. See
the release compatibility matrix for details.
PySpark shell
The PySpark shell, also known as the PySpark CLI, is an interactive environment that
facilitates engagement with Spark’s API using the Python programming language.
It serves as a platform for learning, testing PySpark examples, and conducting data
analysis directly from the command line. The PySpark shell operates as a Read-Eval-
Print Loop (REPL), providing a convenient environment for swiftly testing PySpark
statements.
Install the PySpark version that is compatible with the Delta Lake version by running
the following in the command prompt:
# Bash
pip install pyspark==<compatible-spark-version>
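Once a compatible PySpark is in place, the following is a minimal sketch of building a Delta-enabled SparkSession in a standalone script; it assumes the delta-spark Python package has also been installed (pip install delta-spark) with a version matching your PySpark:
# Python
import pyspark
from delta import configure_spark_with_delta_pip

# Configure the session builder with the Delta SQL extension and catalog
builder = (
    pyspark.sql.SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Resolve the matching Delta Lake JARs and create the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()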
Figure 2-1. Selecting a Databricks Runtime for a new cluster in Databricks Community
Edition
You can create only one cluster at a time with Databricks Commu‐
nity Edition. If a cluster already exists, you will need to either use it
or delete it before you can create a new cluster.
Importing Notebooks
For brevity and ease of understanding, we will use the Jupyter Notebook we saw
in the section “JupyterLab Notebook” on page 25. This notebook is available in the
delta-docs GitHub repository. Please copy the notebook link and keep it handy, as
you will import the notebook in this step.
Go to Databricks Community Edition and click on Workspace, and then click on the
three stacked dots at top right, as shown in Figure 2-3.
In the dialog box, click on the URL radio button, paste in the notebook URL,
and click Import. This will render the Jupyter Notebook in Databricks Community
Edition.
Attaching Notebooks
Now select the Delta_Lake_DLDG cluster you created earlier to run this notebook, as
shown in Figure 2-4.
Figure 2-4. Choosing the cluster you want to attach to the notebook
Conclusion
In this chapter, we explored the various approaches you can take to get started
with Delta Lake, including Delta Docker, Delta Lake for Python, Apache Spark with
Delta Lake, PySpark Declarative API, and finally Databricks Community Edition. We
showed how easily you can run a simple notebook or a command shell to write to
and read from Delta Lake tables. The next chapter will cover writing and reading
operations in more detail.
Finally, we showed you how to use any of these approaches to install Delta Lake and
the many different ways in which Delta Lake is available. You also learned how to use
SQL, Python, Scala, Java, and Rust programming languages through the API to access
Delta Lake tables. In the next chapter, we’ll cover the essential operations you need to
know to use Delta Lake.
CHAPTER 3
Essential Delta Lake Operations
This chapter explores the essential operations of using Delta Lake for your data
management needs. Since Delta Lake functions as the storage layer and participates
in the interaction layer of data applications, it makes perfect sense to begin with the
foundational operations of persistent storage systems. You know that Delta Lake pro‐
vides ACID guarantees already,1 but focusing on CRUD operations (see Figure 3-1)
will point us more toward the question “How do I use Delta Lake?”2 This would be
a woefully short story (and consequently this would be a short book) if that was all
that you needed to know, however, so we will look at several additional things that
are vital to interacting with Delta Lake tables: merge operations, conversion from
so-called vanilla Parquet files, and table metadata.
Except where specified, SQL will refer to the Spark SQL syntax
for simplicity’s sake. If you are using Trino or some other SQL
engine with Delta Lake, you can find additional details either in
Chapter 4, which explores more of the Delta Lake ecosystem, or
in the relevant documentation.3 The Python examples will all use
the Spark-based Delta Lake API for the same reason. Equivalent
examples are presented for both throughout. It is also possible to
leverage the equivalent operations using PySpark, and examples of
that are shown where it makes sense to do so.
Figure 3-1. Create, read, update, and delete (CRUD) operations are among the most
fundamental operations required for any persistent storage data system
We can perform operations with Delta Lake tables using the top-level directory
path of a Delta Lake table or by accessing it via a catalog, like the Hive Metastore
commonly used with Apache Spark, or the more advanced Unity Catalog.4 You will
see both methods used throughout this chapter; your choice of which method to
use will depend primarily on personal preference and the features of the systems
you are working with. Generally speaking, if you have a catalog available in the
environment you use, it simplifies both the readability of your code and potential
future transactions (imagine if you change a table’s location). Note that if you use a
catalog, you can set a location for the database object or individually for each table.
Create
Before much else can be done, you need to create a table so there’s something to
interact with. The actual creation operation can occur in different forms, as many
engines will handle something like a nonexistent table simply by creating it as part
of the processing during certain actions (such as an append operation in Spark SQL).
“What gets created during this process?” you might ask. At its core, Delta Lake could
not exist without Parquet, so one of the things you will see created is the Parquet file
directory and data files, as if you had used Parquet to create the table. However, one
of the new things you should notice is a new directory called _delta_log.
4 For a technical review of the Hive Metastore, including its design and the interaction operations that are
fundamental to its operation, see the Apache Hive documentation.
All the examples and other supporting code for this chapter can be
found in the book’s GitHub repository.
In Python, you will start with the TableBuilder object yielded by the method Delta
Table.create and then add attributes like the table name and the definitions of the
columns to be included. The execute command combines the definition into a query
plan and puts it into action:
# Python
from pyspark.sql.types import *
from delta.tables import *
delta_table = (
DeltaTable.create(spark)
.tableName("exampleDB.countries")
.addColumn("id", dataType=LongType(), nullable=False)
.addColumn("country", dataType=StringType(), nullable=False)
.addColumn("capital", dataType=StringType(), nullable=False)
.execute()
)
In either the Python or the SQL method of defining the table, the process itself is
essentially just a matter of creating a named table object with a specification of the
column names and types. One other element you might have noticed in the Python
dialect is that we also have the option to specify nullability in Apache Spark. This
setting will be ignored for Delta Lake tables, as it applies only to JDBC sources. An
additional item you might commonly include during a create statement is the IF NOT
5 Type support does vary by engine to some degree, though most engines support most data types. For an
example of the types supported by Azure Databricks, see the documentation page.
EXISTS qualifier in SQL or the alternative method createIfNotExists in Python.
Their use is purely at your discretion.
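For reference, here is a hedged sketch of the equivalent SQL definition issued through spark.sql; the BIGINT/STRING types and the IF NOT EXISTS qualifier mirror the Python example above:
# Python
# Define the same table in Spark SQL, skipping creation if it already exists
spark.sql("""
    CREATE TABLE IF NOT EXISTS exampleDB.countries (
      id BIGINT,
      country STRING,
      capital STRING
    ) USING DELTA
""")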
Many of the examples throughout this chapter take the use of table
objects accessed through a catalog for granted, but most of the
essential operations here are well supported with direct file access
methods. One of the key differences for Spark SQL is that it uses
the path accessor delta.`<TABLE>` (note the backticks) in place of
a table name. With the DeltaTable API, you will typically just swap
out the forPath method in place of forName. In PySpark you’ll
have to turn to alternative methods at times as well, such as using
save with a path argument in place of saveAsTable with a table
name. Refer to Delta Lake’s Python documentation for additional
details that might need to be configured for path-based access
in some cases (e.g., cloud provider–specific security configuration
arguments).
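A short sketch of the two access styles side by side; the filesystem path is illustrative:
# Python
from delta.tables import DeltaTable

# Catalog-based access by table name
by_name = DeltaTable.forName(spark, "exampleDB.countries")

# Path-based access to the table's storage location (the path is hypothetical)
by_path = DeltaTable.forPath(spark, "/data/exampleDB/countries.delta")

# Spark SQL path accessor, equivalent to querying the named table
spark.sql("SELECT * FROM delta.`/data/exampleDB/countries.delta`")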
INSERT INTO
When you have an empty Delta Lake table, you can load data into it using the INSERT
INTO command. The idea is to define where you are inserting data and then what you
are inserting by providing the VALUES for each row with all the specific info of the
columns:
-- SQL
INSERT INTO exampleDB.countries VALUES
(1, 'United Kingdom', 'London'),
(2, 'Canada', 'Toronto')
With PySpark DataFrame syntax, you just need to specify that inserting records into
a specific table is the destination of a write operation with insertInto (note that
columns are aligned positionally, so column names will be ignored with this method):
df = spark.createDataFrame(data, schema=schema)
(
df
.write
.format("delta")
.insertInto("exampleDB.countries")
)
There might be cases in which you already have the required data (with the same
schema and headers) in other formats such as CSV or Parquet. You can specify that
the source is a file and select from it or even directly specify another table. This
data, via a SELECT statement, can be swapped out with the VALUES argument in the
INSERT INTO operation. You need to specify which columns you are selecting from
the new data source or specify that you are selecting an entire table with SELECT
TABLE <table name> instead:
-- SQL
INSERT INTO exampleDB.countries
SELECT * FROM parquet.`countries.parquet`;
This provides one way of appending preexisting data into a Delta Lake table. Another
way will be through the append mode option for Spark DataFrame write operations.
Append
In addition to the insertInto method for a DataFrame, we can add new data to a
Delta Lake table using append mode. In SQL, this just happens as part of the INSERT
INTO operation, but for the DataFrameWriter you will explicitly set writing mode
with the syntax .mode(append), or with its longer specification .option("mode",
"append"). This informs the DataFrameWriter that you are only adding additional
records to the table. When a DataFrame is written with the mode set to append and
the table already exists, data gets appended to it; however, if the table didn’t exist
before, it will be created:
# Python
# Sample data
data = [(3, 'United States', 'Washington, D.C.') ]
# Create a DataFrame from the sample data and schema
df = spark.createDataFrame(data, schema=schema)

# Append the new record to the existing Delta table
(
    df
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("exampleDB.countries")
)
If the mode is not set, Delta Lake assumes you are creating a table
by default, but if a table with that name already exists, you will
receive the following error message:
AnalysisException: [TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create
table or view `exampleDB`.`countries` because it already exists.
For PySpark users, this is the most common method of appending data to a table
because it provides a little more flexibility in the event you might like to specify
different write modes at different points in development. It also uses the table specifi‐
cation to align column names from the incoming DataFrame, unlike the insertInto
method.
countries.delta/_delta_log
└── 00000000000000000000.json
These files provide a record of all the operations that happen to the table and
make some kind of change (i.e., not read operations). Each creation, insertion, or
append action will add another JSON file to the transaction log and increment the
version number of the table. The exact structure of the transaction log varies by
implementation, but some of what you will commonly find within the transaction
records is information about the creation of the table (such as what processing engine
was used to create it, the number of records, or other metrics from write operations
to the table), records of maintenance operations, and deletion information.7
While it may seem like a small thing, you should understand that the transaction log
is the core component that makes Delta Lake work. Some might even go so far as
to say that the transaction log is Delta Lake. The record of transactions and the way
processing engines interact with it are what set Delta Lake apart from Parquet and
provide ACID guarantees, the possibilities of exactly-once stream processing, and all
the other magic Delta Lake provides to you. One example of the magic that comes
from the transaction log is time travel, which is described in the next section.
The details may differ depending on where and how you are using Delta Lake, but
the key takeaway is that you need to know that the transaction log exists and where
to find it. Owing to the richness of the information often included in the transaction
log, you may find it an invaluable tool for investigating processes, diagnosing errors,
and monitoring the health of your data pipelines. Don’t neglect the information
available at your fingertips!
6 Chapter 1 includes a detailed review of the transaction log, but this is a critical concept.
7 Matthew Powers provides a handy reference to many implementations of the Delta Lake transaction log, if
you want to compare what information might be available to each.
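One convenient way to surface that information without reading the JSON files by hand is the table history API; here is a short sketch using the table created earlier in this chapter:
# Python
from delta.tables import DeltaTable

# Each committed transaction appears as one row in the table history
delta_table = DeltaTable.forName(spark, "exampleDB.countries")
delta_table.history().select("version", "timestamp", "operation").show()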
Read
Reading is such a fundamental operation in data processing that one could almost
assume there is no need to look into it. However, there are several things concerning
reading from Delta Lake tables that are worth focusing on, including a high-level
understanding of how partition filtering works (which is explored much more deeply
in Chapters 5 and 10) and how the transaction log allows querying views of the data
from previous versions with time travel.
Reading a Delta Lake table and filtering it with the DeltaTable API looks like this:
# Python
from delta.tables import DeltaTable

# Load the table and convert it to a DataFrame so standard filters apply
delta_table_df = DeltaTable.forName(spark, "exampleDB.countries").toDF()

# Filter on a column value, just as with any other Spark DataFrame
delta_table_df.filter(delta_table_df.capital == 'London')
8 The concept of partitioning is a supported part of the Parquet file structure. For an in-depth exploration of
partitioning, we suggest checking out the Spark documentation covering Parquet files.
The DeltaTable API used throughout this chapter does not directly
support time travel. However, that feature is still available to
Python users via PySpark. The API supports restoration actions,
which are covered in “Repairing, Restoring, and Replacing Table
Data” on page 108. You will also find some more advanced oper‐
ations regarding deletion of data. In light of this limitation, equiv‐
alent expressions with PySpark are presented alongside the SQL
expressions for time travel.
To view a previous version of a table in SQL, just add a qualifier to the query. There
are two different options for specifying this. One is to specify the VERSION AS OF
with a particular version number. For example, if you want to see which values of id
existed as part of a specific version of the table, you might combine a DISTINCT query
with time travel to version 1 of the table:
-- SQL
SELECT DISTINCT id FROM exampleDB.countries VERSION AS OF 1
# Python
(
spark
.read
.format("delta")  # specify the Delta source explicitly
.option("versionAsOf", "1")
.load("countries.delta")
.select("id")
.distinct()
)
Or if you want to see how many records existed before the current date without
having to check the version number, you can use TIMESTAMP AS OF instead and
specify the current date:9
-- SQL
SELECT count(1) FROM exampleDB.countries TIMESTAMP AS OF "2024-04-20"
# Python
(
spark
.read
.format("delta")
.option("timestampAsOf", "2024-04-20")
.load("countries.delta")
.count()
)
While extremely useful as a feature, time travel is just a by-product of proper version‐
ing on your table. It does exemplify the protections that you get for your data in
terms of transaction guarantees and atomicity, though. So really, time travel is not
the full benefit but a window into the protections provided by ACID transactions
available in Delta Lake.10
9 This “same day” behavior does depend on the way Spark converts a date to a timestamp—i.e., in the examples,
2024-04-20 = 2024-04-20 00:00:00.
10 To read more about retention timelines and versioning, see Chapter 5.
Using UPDATE makes it easy to fix a specific value in a table. You can also use this to
update many values in your table by using a less specific filtering clause. Omitting
the WHERE clause completely would allow you to update values across the entire
Delta Lake table. Each update action will increment the version of the table in the
transaction log.
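Here is a short sketch issued through spark.sql; the predicate and the new value are illustrative and happen to correct the capital inserted earlier in this chapter:
# Python
# Fix a single row: change Canada's capital from 'Toronto' to 'Ottawa'
spark.sql("""
    UPDATE exampleDB.countries
    SET capital = 'Ottawa'
    WHERE country = 'Canada'
""")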
Delete
Deleting data from a table is the last of the CRUD operations to be explored here.
Deletions can happen for many reasons, but a few of the most common ones are to
remove specific records (e.g., right to be forgotten11), to replace erroneous or stale
data (e.g., daily table refresh), or to trim a table time window (most often when the
same data might be available elsewhere but you wish to keep a reporting table or
similar to a trimmed length for performance or as part of the basis for calculations).
For some of these, you would want to give explicit commands to remove values; in
other cases, you might be able to let the system handle the deletion on your behalf.
11 The right to be forgotten or right to erasure is part of the EU’s General Data Protection Regulation (GDPR). If
this law applies to your data practices, we suggest you thoroughly review the EU’s “Complete Guide to GDPR
Compliance”.
The two usual ways to achieve this are to use the DELETE command or to specify
overwriting behavior.
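Here is a minimal sketch of the explicit DELETE command, shown both through spark.sql and the DeltaTable API; the predicate is illustrative:
# Python
from delta.tables import DeltaTable

# Remove a single record with Spark SQL
spark.sql("DELETE FROM exampleDB.countries WHERE id = 1")

# The equivalent removal with the DeltaTable API
DeltaTable.forName(spark, "exampleDB.countries").delete("id = 1")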
Deleting many records from a table works similarly to deleting a single record.
In cases in which the value in the expression matches multiple records, all those
records will get deleted. You could also use inequality-based expressions to delete
based on thresholds. An example of this kind of expression might look something
like "transaction_date <= date_sub(current_date(), 7)", which would trim the
table values to have only values within the last week. Deleting large amounts of data
from a table can often be associated with replacing the data in that table with a whole
new set of records. Rather than doing this as a two-step operation, there may be cases
in which you would like to overwrite the data instead.
Overwrite mode
Overwriting data by changing the output mode on a Spark DataFrameWriter can be
a quick and efficient method for wholly replacing part or all of a dataset in Delta
Lake. The overwrite mode parameter is a mirror of the append mode parameter used
to add data to a Delta Lake table. In this case, instead of data being added to the
table’s preexisting data, the contents of the current DataFrame will just replace what
is already in the table. All of the prior data will be removed and only the current data
will be available going forward, unless you restore the table to the prior version:
# Python
(
spark
.createDataFrame(
[
(1, 'India', 'New Delhi'),
(4, 'Australia', 'Canberra')
],
schema=["id", "country", "capital"]
)
.write
.format("delta")
.mode("overwrite") # specify the output mode
.saveAsTable("exampleDB.countries")
)
Using this method gives you the ability to switch between the two different out‐
put modes by changing just one word, which can be particularly useful during
development and testing. You can do something similar in Spark SQL with INSERT
OVERWRITE.
INSERT OVERWRITE
As a companion to INSERT INTO, INSERT OVERWRITE can be used in the same way
as the overwrite mode with PySpark DataFrame syntax.13 These two query-based
commands function in the same way as append and overwrite modes in PySpark;
that is, they allow you to switch between the INTO and OVERWRITE parameters without
making other changes to your queries:
-- SQL
INSERT OVERWRITE exampleDB.countries
VALUES (3, 'U.S.', 'Washington, D.C.');
As with the overwrite mode or the replace method, using INSERT OVERWRITE will
remove all previous data from the target table. This means you should exercise
caution when using it and make sure you know what you are overwriting. As with
the INSERT INTO command, you have a large amount of freedom with regard to the
contents you want to insert into the target table. You can use specific values, other
tables, or files as a source for writing over a target table.
13 Due to the way Trino interacts with files, it does not directly support INSERT OVERWRITE.
14 For a more in-depth exploration of its history and implementation comparisons across multiple SQL dialects,
we recommend the “Merge (SQL)” Wikipedia article as a jumping-off point.
15 For a dedicated exploration of Delta Lake merge semantics, we suggest Nick Karpov’s blog post.
With SQL, you simply combine the actions to build your entire MERGE query and
execute it as a single statement. You start by specifying the target to merge into,
the source to merge from, and the conditions on which you want to base your
matching logic. Then, for an upsert, you will just define the update operation and
insert operation details:
-- SQL
MERGE INTO exampleDB.countries A
USING (select * from parquet.`countries.parquet`) B
ON A.id = B.id
WHEN MATCHED THEN
UPDATE SET
id = A.id,
country = B.country,
capital = B.capital
WHEN NOT MATCHED
THEN INSERT (
id,
country,
capital
)
VALUES (
B.id,
B.country,
B.capital
)
With the DeltaTable API, you will use a new class called the DeltaMergeBuilder to
specify these conditions and actions. Unlike in the SQL syntax, each combination of
matching status and subsequent action to take has its own method to use. You can
find the full list of supported combinations in the documentation. We recommend
you combine multiple actions and just chain them together into a single transaction
to help you break down the logical path of any particular record. Here is what it
might look like if you wanted to do an upsert operation with a DataFrame containing
new records; notice that, starting with the DeltaTable object, you first apply MERGE
to specify the new record source and the matching conditions and then apply
whenMatchedUpdate and whenNotMatchedInsert to cover both cases:
# Python
idf = (
spark
.createDataFrame([
(1, 'India', 'New Delhi'),
(4, 'Australia', 'Canberra')],
schema=["id", "country", "capital"]
)
)
# Complete the upsert: update matched rows, insert unmatched ones
delta_table.alias("target").merge(
    source = idf.alias("source"),
    condition = "target.id = source.id"
).whenMatchedUpdate(set =
    {"country": "source.country", "capital": "source.capital"}
).whenNotMatchedInsert(values =
    {"id": "source.id", "country": "source.country", "capital": "source.capital"}
).execute()
Overall, using MERGE can help you simplify what otherwise would require several
distinct queries with different kinds of join logic and associated actions.
Parquet Conversions
Even in cases where you establish Delta Lake as the file format underlying all your
data activities, you are still likely to encounter datasets coming from legacy systems,
third-party providers, or other sources that use different formats. For a couple of file
types, namely the Parquet and the Parquet-based Iceberg formats, there is a simple
conversion method you can use to simplify some of your operations. The CONVERT
TO DELTA command is the recommended approach for transforming an Iceberg or
Parquet directory into a Delta table.
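Here is a hedged sketch of the Parquet case through spark.sql; the paths and partition column are illustrative, and a partitioned directory requires the PARTITIONED BY clause:
# Python
# Convert an unpartitioned Parquet directory in place
spark.sql("CONVERT TO DELTA parquet.`countries.parquet`")

# For a partitioned directory, the partition schema must be supplied
# (the path and column here are hypothetical)
spark.sql("""
    CONVERT TO DELTA parquet.`/data/events`
    PARTITIONED BY (event_date DATE)
""")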
Iceberg conversion
Like Delta Lake, Apache Iceberg is composed of Parquet files internally. Is it possible
to again use CONVERT TO DELTA in SQL, or convertToDelta, to convert Iceberg
files? Partly yes and partly no. The DeltaTable API does not support the Iceberg
conversion. Spark SQL, however, can support the conversion with CONVERT TO
DELTA, but you will also need to install support for the Iceberg format in your Spark
environment:
-- SQL
CONVERT TO DELTA iceberg.`countries.iceberg`
You should be able to accomplish this by installing an additional JAR file (delta-
iceberg) to the cluster you are using.16 Unlike with Parquet files, when converting
Iceberg you will not need to specify the partitioning structure of the table, as it will
infer this information from the source.17
There’s one more thing you should know about this conversion process. An interest‐
ing side effect exists in converted Iceberg tables. Since both Iceberg and Delta Lake
maintain distinctly separate transaction logs, none of the new files added through
interactions via Delta Lake will be registered on the Iceberg side. However, since
the Iceberg log is not removed, the new Delta Lake table will still be readable and
accessible as an Iceberg table.18
16 For further exploration of using Apache Iceberg with Apache Spark, we suggest starting with the official
quickstart guide.
17 There are some caveats to being able to convert Iceberg based on different feature usage. You should look at
the documentation to check whether this might affect your specific situation.
18 This is different from Delta UniForm (Universal Format), which we discussed in Chapter 1.
Conclusion
The essential operations of Delta Lake provide a robust interaction layer for creating,
reading, updating, and deleting data in tables, going well beyond traditional data lake
capabilities. With ACID transactions, time travel, merge operations, and easy conver‐
sion from Parquet and Iceberg formats, Delta Lake offers a powerful storage and data
management layer. By understanding the essential operations covered in this chap‐
ter—from basic CRUD actions to more advanced merge logic and transaction log
introspection—you can effectively use Delta Lake to build reliable, high-performance
data pipelines and applications.
CHAPTER 4
Diving into the Delta Lake Ecosystem
Over the last few chapters, we’ve explored Delta Lake from the comfort of the Spark
ecosystem. The Delta protocol, however, offers rich interoperability not only across
the underlying table format but within the computing environment as well. This
opens the doors to an expansive universe of possibilities for powering our lakehouse
applications, using a single source of table truth. It’s time to break outside the box and
look at the connector ecosystem.
The connector ecosystem is a set of ever-expanding frameworks, services, and
community-driven integrations enabling Delta to be utilized from just about any‐
where. The commitment to interoperability enables us to take full advantage of the
hard work and effort the growing open source community provides without sacrific‐
ing the years we’ve collectively poured into technologies outside the Spark ecosystem.
In this chapter, we’ll discover some of the more popular Delta connectors while
learning to pilot our Delta-based data applications from outside the traditional Spark
ecosystem. For those of you who haven’t done much work with Apache Spark, you’re
in luck, since this chapter is a love song to Delta Lake without Apache Spark and a
closer look at how the connector ecosystem works.
We will be covering several of these integrations in this chapter, including Apache Flink, kafka-delta-ingest, and Trino.1
1 For the full list of evolving integrations, see “Delta Lake Integrations” on the Delta Lake website.
In addition to the four core connectors in this chapter, support for Apache Pulsar,
ClickHouse, FINOS Legend, Hopsworks, Delta Rust, Presto, StarRocks, and general
SQL import to Delta is also available at the time of writing.
What are connectors, you ask? We will learn all about them next.
Connectors
As people, we don’t like to set limits for ourselves. Some of us are more adventurous
and love to think about the unlimited possibilities of the future. Others take a more
narrow, straight-ahead approach to life. Regardless of our respective attitudes, we are
bound together by our pursuit of adventure, search for novelty, and desire to make
decisions for ourselves. Nothing is worse than being locked in, trapped, with no way
out. From the perspective of the data practitioner, it is also nice to know that what
we rely on today can be used tomorrow without the dread of contract renegotiations!
While Delta Lake is not a person, the open source community has responded to the
various wants and needs of the community at large, and a healthy ecosystem has risen
up to ensure that no one will have to be tied directly to the Apache Spark ecosystem,
the JVM, or even the traditional set of data-focused programming languages like
Python, Scala, and Java.
The mission of the connector ecosystem is to ensure frictionless interoperability with
the Delta protocol. Over time, however, fragmentation across the current (delta <
3.0) connector ecosystem has led to multiple independent implementations of the
Delta protocol and divergence across the current connectors. To streamline support
for the future of the Delta ecosystem, Delta Kernel was introduced to provide a
common interface and expectations that simplify true interoperability within the
Delta ecosystem.
There are a healthy number of connectors and integrations that enable interoperabil‐
ity with the Delta table format and protocols, no matter where we trigger operations
from. Interoperability and unification are part of the core tenets of the Delta project
and helped drive the push toward UniForm (introduced along with Delta 3.0), which
provides cross-table support for Delta, Iceberg, and Hudi.
Apache Flink
Apache Flink is “a framework and distributed processing engine for stateful compu‐
tations over unbounded and bounded data streams...[that] is designed to run in all
common cluster environments [and] perform computations at in-memory speed and
at any scale.” In other words, Flink can scale massively and continue to perform
efficiently while handling ever-increasing load in a distributed way, and while
also adhering to exactly-once semantics (if specified in the CheckpointingMode) for
stream processing, even in the case of failures or disruptions at runtime to a data
application.
If you haven’t worked with Flink before and would like to, there
is an excellent book by Fabian Hueske and Vasiliki Kalavri called
Stream Processing with Apache Flink (O’Reilly) that will get you up
to speed in no time.
The assumption from here going forward is that we either (a) understand enough
about Flink to compile an application or (b) are willing to follow along and learn as
we go. With that said, let’s look at how to add the delta-flink connector to our Flink
applications.
The full Java application referenced in the following sections is
located in the book’s Git repository under /ch04/flink/dldg-flink-
delta-app/.
As a follow-up for the curious reader, unit tests for the application
provide a glimpse at how to use the Delta standalone APIs. You can
walk through these under /src/test/ within the Java application.
The connector ships with classes for reading and writing to Delta Lake. Reading is
handled by the DeltaSource API, and writing is handled by the DeltaSink API. We’ll
start with the DeltaSource API, move on to the DeltaSink API, and then look at an
end-to-end application.
DeltaSource API
The DeltaSource API provides static builders to easily construct sources for bounded
or unbounded (continuous) data flows. The big difference between the two variants is
specific to the bounded (batch) or unbounded (streaming) operations on the source
Delta table. This is analogous to the batch or microbatch (unbounded) processing
with Apache Spark. While the behavior of these two processing modes differs, the builder interfaces used to construct each source are nearly identical.
Bounded mode
To create the DeltaSource object, we’ll be using the static forBoundedRowData
method from the DeltaSource class. This builder takes the path to the Delta table and
an instance of the application’s Hadoop configuration, as shown in Example 4-1.
The object returned in Example 4-1 is a builder. Using the various options on the
builder, we specify how we’d like to read from the Delta table, including options to
slow down the read rates, filter the set of columns read, and more.
Builder options. The following options can be applied directly to the builder:
columnNames (string ...)
This option provides us with the ability to specify the column names on a table
we’d like to read while ignoring the rest. This functionality is especially useful
on wide tables with many columns and can help alleviate unnecessary memory
pressure for columns that will go unused anyway:
% builder.columnNames("event_time", "event_type", "brand", "price");
builder.columnNames(
Arrays.asList("event_time", "event_type", "brand", "price"));
startingVersion (long)
This option provides us with the ability to specify the exact version of the
Delta table’s transaction to start reading from (in the form of a numeric Long).
This option and the startingTimestamp option are mutually exclusive, as both
provide a means of supplying a cursor (or transactional starting point) on the
Delta table:
% builder.startingVersion(100L);
startingTimestamp (string)
This option provides the ability to specify an approximate timestamp to begin
reading from in the form of an ISO-8601 string. This option will trigger a scan
of the Delta transaction history looking for a matching version of the table that
was generated at or after the given timestamp. In the case where the entire table is
newer than the timestamp provided, the table will be fully read:
% builder.startingTimestamp("2023-09-10T09:55:00.001Z");
The timestamp string can represent time with low precision—for example, as a
simple date like "2023-09-10"—or with millisecond precision, as in the previous
example. In either case, the operation will result in the Delta table being read
from a specific point in table time.
parquetBatchSize (int)
This option takes an integer controlling how many rows to return per internal
batch, or generated split within the Flink engine:
% builder.option("parquetBatchSize", 5000);
Generating the bounded source. Once we finish supplying the options to the builder,
we generate the DeltaSource instance by calling build:
% final DeltaSource<RowData> source = builder.build();
With the bounded source built, we can now read batches of our Delta Lake records
off our tables—but what if we wanted to continuously process new records as they
arrived? In that case, we can just use the continuous mode builder!
Continuous mode
To create this variation of the DeltaSource object, we’ll use the static forContinuous
RowData method on the DeltaSource class. The builder is shown in Example 4-2, and
we provide the same base parameters as were provided to the forBoundedRowData
builder, which makes switching from batch to streaming super simple.
Generating the continuous source. Once we finish configuring the builder, we generate
the DeltaSource instance by calling build:
% final DeltaSource<RowData> source = builder.build();
We have looked at how to build the DeltaSource object and have seen the connector
configuration options, but what about table schema or partition column discovery?
Luckily, there is no need to go into too much detail about those, since both are
automatically discovered using the table metadata.
If we supply a subset of column names using the DeltaSource builder method (columnNames),
then only that subset of columns will be read from the underlying Delta table. In
both cases, the DeltaSource connector will discover the Delta table column types
and convert them to the corresponding Flink types. This process of conversion from
the internal Delta table data (Parquet rows) to the external data representation (Java
types) provides us with a seamless way to work with our datasets.
We now have a live data source for our Flink job supporting Delta. We can choose
to add additional sources, join and transform our data, and even write the results of
our transforms back to Delta using the DeltaSink, or anywhere else our application
requires us to go.
Next, we’ll look at using the DeltaSink and then connect the dots with a full end-to-
end example.
DeltaSink API
The DeltaSink API provides a static builder to egress to Delta Lake easily. Following
the same pattern as the DeltaSource API, the DeltaSink API provides a builder class.
Construction of the builder is shown in Example 4-4.
The builder pattern for the delta-flink connector should already feel familiar at this
point. The only difference with crafting this builder is the addition of the RowType
reference.
RowType
Similar to the StructType from Spark, the RowType stores the logical type informa‐
tion for the fields within a given logical row. At a higher level, we can think about this
in terms of a simple DataFrame. It is an abstraction that makes working with dynamic
data simpler.
More practically, if we have a reference to the source, or to a transformation that
occurred prior to the DeltaSink in our DataStream, then we can derive the RowType
dynamically. With a little casting, we can convert a TypeInformation<RowData>
reference into a RowType, as seen in Example 4-5.
Example 4-5. Extracting the RowType via TypeInformation
The getRowType method converts the provided typeInfo object into Internal
TypeInfo and uses toLogicalType, which can be cast back to a RowType. In Exam‐
ple 4-6 we see how to use this method to gain an understanding of the power of
Flink’s RowData.
Example 4-6. Extracting the RowType from our DeltaSource
% DeltaSource<RowData> source = …
TypeInformation<RowData> typeInfo = source.getProducedType();
RowType rowTypeForSink = getRowType(typeInfo);
If we have a simple streaming application, chances are we’ve managed to get along
nicely for a while without spending a lot of time manually crafting plain old Java
objects (POJOs) and working with serializers and deserializers; or maybe we’ve
decided to use alternative mechanisms for creating our data objects, such as Avro
or Protocol Buffers. It’s also possible that we’ve never had to work with data outside of
traditional database tables. No matter what the use case, working with columnar data
means we have the luxury of simply reading the columns we want in the same way
that we would with a SQL query.
Take the following SQL statement:
% select name, age, country from users;
While we could read all columns on a table using select *, it is always better to take
only what we need from a table. This is the beauty of columnar-oriented data. Given
the high likelihood that our data application won’t need everything, we save compute
cycles and memory overhead and provide a clean interface between the data sources
we read from.
The ability to dynamically read and select specific columns—known as SQL projec‐
tion—via our Delta Lake table means we can trust in the table’s schema, which is not
something we could always say of just any data living in the data lake. While a table
schema can and will change over time, we won’t need to maintain a separate POJO
to represent our source table. This might not seem like a large lift, but the lower the
number of moving parts, the simpler it is to write, release, and maintain our data
applications. We only need to express the columns we expect to have, which speeds
up our ability to create flexible data processing applications, as long as we can trust
that the Delta tables we read from use backward compatible schema evolution. See
Chapter 5 for more information on schema evolution.
Builder options
The following options can be applied directly to the builder:
withPartitionColumns (string ...)
This builder option takes an array of strings naming the columns to partition the
Delta table by. The columns must exist physically in the stream's rows.
withMergeSchema (boolean)
This builder option must be set to true in order to opt into automatic schema
evolution. The default value is false.
In addition to discussing the builder options, it is worth covering the semantics of
exactly-once writes using the delta-flink connector.
Exactly-once guarantees
The DeltaSink does not immediately write to the Delta table. Rather, rows are
appended to flink.streaming.sink.filesystem.DeltaPendingFile (not to be
confused with Delta Lake itself), as these files provide a mechanism to buffer writes
until the next Flink checkpoint completes, at which point the buffered files are
committed to the Delta transaction log. Checkpointing is enabled on the
StreamExecutionEnvironment:
% StreamExecutionEnvironment
.getExecutionEnvironment()
.enableCheckpointing(2000, CheckpointingMode.EXACTLY_ONCE);
Using the checkpoint config above, we’d create a new transaction every two seconds
at most, at which point the DeltaSink would use our Flink application appId and the
checkpointId associated with the pending files. This is similar to the use of txnAppId
and txnVersion for idempotent writes and will likely be unified in the future.
End-to-End Example
Now we’ll look at an end-to-end example that uses the Flink DataStream API to read
from Kafka and write to Delta. The application source code and Docker-compatible
environment are provided in the book’s GitHub repository under /ch04/flink/, includ‐
ing steps to initialize the ecomm.v1.clickstream Kafka topic, write (produce) records
to be consumed by the Flink application, and ultimately write those records into
Delta. The results of running the application can be seen in Figure 4-1, which shows
the Flink UI and represents the end state of the application.
Let’s define our DataStream using the KafkaSource connector and the DeltaSink
from earlier in this section within the scope of Example 4-8.
Example 4-8. KafkaSource to DeltaSink DataStream
return stream
.map((MapFunction<Ecommerce, RowData>) Ecommerce::convertToRowData)
.setParallelism(1)
.sinkTo(sink)
.name("delta-sink")
.setDescription("writes to Delta Lake")
.setParallelism(1);
}
The example takes binary data from Kafka representing ecommerce transactions in
JSON format. Behind the scenes, we deserialize the JSON data into ecommerce rows
and then transform from the JVM object into the internal RowData representation
required for writing to our Delta table. Then we simply use an instance of the
DeltaSink to provide a terminal point for our DataStream.
Next, we call execute after adding some additional descriptive metadata to the
resulting DataStreamSink, as we’ll see in Example 4-9.
env.execute("kafka-to-delta-sink-job");
}
We’ve just scratched the surface on how to use the Flink connector for Delta Lake,
and it is already time to take a look at another connector.
In a similar vein as our end-to-end example with Flink, we'll next explore how to
ingest the same ecommerce data from Kafka; this time, however, we'll use the
Rust-based kafka-delta-ingest library.2 Getting up and running involves four steps:
1. Install Rust.
2. Build the project.
3. Create your Delta table.
4. Run the ingestion flow.
Install Rust
This can be done using the rustup toolchain:
% curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Once rustup is installed, running rustup update will ensure we are on the latest
stable version of Rust available.
2 The full ingestion flow application is available in the book’s GitHub repository under ch04/rust/kafka-delta-
ingest.
Now all that is left to do is to run the ingestion application. If we are running
the application using our environment variables, then the simplest command would
provide the Kafka topic and the Delta table location. The command signature is as
follows:
% cargo run ingest <topic> <delta_table_location>
Next, we’ll see a complete example:
% cargo run \
ingest ecomm.v1.clickstream file:///dldg/ecomm-ingest/ \
--allowed_latency 120 \
--app_id clickstream_ecomm \
--auto_offset_reset earliest \
--checkpoints \
--kafka 'localhost:9092' \
--max_messages_per_batch 2000 \
--transform 'date: substr(meta.producer.timestamp, `0`, `10`)' \
--transform 'meta.kafka.offset: kafka.offset' \
--transform 'meta.kafka.partition: kafka.partition' \
--transform 'meta.kafka.topic: kafka.topic'
Trino
Trino is a distributed SQL query engine designed to seamlessly connect to and
interoperate with a myriad of data sources. It provides a connector ecosystem that
supports Delta Lake natively.
Getting Started
All we need to get started with Trino and Delta Lake is any version of Trino newer
than version 373. At the time of writing, Trino is currently at version 459.
Connector requirements
While the Delta connector is natively included in the Trino distribution, there are still
additional things we need to consider to ensure a frictionless experience.
Connecting to OSS or Databricks Delta Lake:
• Delta Tables written by Databricks Runtime 7.3 LTS, 9.1 LTS, 10.4 LTS, 11.3 LTS,
and >= 12.2 LTS.
• Deployments using AWS, HDFS, Azure Storage, and Google Cloud Storage
(GCS) are fully supported.
• Network access from the coordinator and workers to the Delta Lake storage.
• Access to the Hive Metastore (HMS).
• Network access to HMS from the coordinator and workers. Port 9083 is the
default port for the Thrift protocol used by HMS.
To run the examples locally, the Docker Compose environment in the book's GitHub
repository includes the following components:
• Trino image
• Hive Metastore (HMS) service (standalone)
• Postgres or supported relational database management system (RDBMS) to store
the HMS table properties, columns, databases, and other configurations (can
point to managed RDBMS like RDS for simplicity)
• Amazon S3 or MinIO (for object storage for our managed data warehouse)
The Trino service itself can be defined in Docker Compose as follows:
services:
trinodb:
image: trinodb/trino:426-arm64
platform: linux/arm64
hostname: trinodb
container_name: trinodb
volumes:
- $PWD/etc/catalog/delta.properties:/etc/trino/catalog/delta.properties
- $PWD/conf:/etc/hadoop/conf/
ports:
- target: 8080
published: 9090
protocol: tcp
mode: host
environment:
- AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-us-west-1}
networks:
- dldg
The example in the next section assumes we have the following resources available
to us:
• Amazon S3 or MinIO (bucket provisioned, with a user, and roles set to allow read,
write, and delete access). Using local MinIO to mock S3 is a simple way to try
things out without any upfront costs. See the docker compose examples in the
book’s GitHub repository under ch04/trinodb/.
Next, we’ll learn how to configure the Delta Lake connector so that we can create
a Delta catalog in Trino. If you want to learn more about using the Hive Metastore
(HMS), including how to configure the hive-site.xml, how to include the required
JARs for S3, and how to run HMS, you can read through “Running the Hive Meta‐
store”. Otherwise, skip ahead to “Configuring and Using the Trino Connector” on
page 79.
The hive-site.xml for the standalone metastore provides the connection details for
the backing RDBMS and the S3 (or MinIO) warehouse location:
<configuration>
<property>
<name>hive.metastore.version</name>
<value>3.1.0</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://RDBMS_REMOTE_HOSTNAME:3306/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>RDBMS_USERNAME</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>RDBMS_PASSWORD</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3a://dldgv2/delta/</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>S3_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>S3_SECRET_KEY</value>
</property>
<property>
<name>fs.s3.path-style-access</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
</configuration>
The configuration provides the nuts and bolts we need to access the metadata data‐
base, using the JDBC connection URL, username, and password properties, as well
as the data warehouse, using the hive.metastore.warehouse.dir and the properties
prefixed with fs.s3a.
Next, we need to create a Docker Compose file to run the metastore, which we do in
Example 4-12.
version: "3.7"
services:
metastore:
image: apache/hive:3.1.3
platform: linux/amd64
hostname: metastore
container_name: metastore
volumes:
- ${PWD}/jars/hadoop-aws-3.2.0.jar:/opt/hive/lib/
- ${PWD}/jars/mysql-connector-java-8.0.23.jar:/opt/hive/lib/
- ${PWD}/jars/aws-java-sdk-bundle-1.11.375.jar:/opt/hive/lib/
- ${PWD}/conf:/opt/hive/conf
environment:
- SERVICE_NAME=metastore
- DB_DRIVER=mysql
- IS_RESUME="true"
expose:
- 9083
ports:
- target: 9083
published: 9083
protocol: tcp
mode: host
networks:
- dldg
With the metastore in place, the final piece is the Delta Lake catalog properties file
for Trino, shown in Example 4-13:
connector.name=delta_lake
hive.metastore=thrift
hive.metastore.uri=thrift://metastore:9083
delta.hive-catalog-name=metastore
delta.compression-codec=SNAPPY
delta.enable-non-concurrent-writes=true
delta.target-max-file-size=512MB
delta.unique-table-location=true
delta.vacuum.min-retention=7d
The property file from Example 4-13 can be saved as delta.properties. As long as the
file is copied into the Trino catalog directory (/etc/trino/catalog/), then we’ll be able to
read, write, and delete from the underlying hive.metastore.warehouse.dir, and do
a whole lot more.
Let’s look at what’s possible.
trino> show catalogs;
Catalog
--------------------
delta
...
(6 rows)
As long as we see delta in the list, our catalog is correctly configured, and we can
move on to creating a schema.
Creating a Schema
The notion of a schema is a bit overloaded. We have schemas that represent the
structured data describing the columns of our tables, but we also have schemas
representing traditional databases. Using create schema enables us to generate a
managed location within our data warehouse that can act as a boundary for access
and governance, as well as to separate the physical table data among bronze, silver,
and golden tables. We’ll learn more about the medallion architecture in Chapter 9,
but for now let’s create a bronze_schema to store some raw tables:
trino> create schema delta.bronze_schema;
CREATE SCHEMA
Show Schemas
This command allows us to query a catalog to view available schemas:
trino> show schemas from delta;
Schema
--------------------
default
information_schema
bronze_schema
(3 rows)
Data types
There are a few caveats to creating tables using Trino, especially when it comes
to type mapping differences between Trino and Delta Lake. The table shown in
Table 4-3 can be used to ensure that the appropriate types are used and to steer clear
of incompatibility if our aim is interoperability.
Trino won’t automatically discover partitions, which could be a problem when it
comes to the performance of SQL queries.
Creating tables
We can create tables using the longform <catalog>.<schema>.<table> syntax, or the
shortform syntax <table> after calling use delta.<schema>. Example 4-14 provides
an example using the shortform create.
Listing tables
Using show tables will allow us to view the collection of tables within a given
schema in the Delta catalog:
trino:bronze_schema> show tables;
Table
----------------------
ecomm_v1_clickstream
(1 row)
Inspecting tables
If we are not the owners of a given table, we can use describe to learn about the table
through its metadata:
trino> describe delta.bronze_schema."ecomm_v1_clickstream";
Column | Type | Extra | Comment
---------------+--------------+-------+---------
event_date | date | |
event_time | varchar | |
event_type | varchar | |
product_id | integer | |
category_id | bigint | |
category_code | varchar | |
brand | varchar | |
price | decimal(5,2) | |
user_id | integer | |
user_session | varchar | |
(10 rows)
Using INSERT
Rows can be inserted directly using the command line, or through the use of the
Trino client:
trino> INSERT INTO delta.bronze_schema."ecomm_v1_clickstream"
VALUES
(DATE '2023-10-01', '2023-10-01T19:10:05.704396Z', 'view', ...),
(DATE('2023-10-01'), '2023-10-01T19:20:05.704396Z', 'view', ...);
INSERT: 2 rows
Querying Delta tables
Using the select operator allows you to query your Delta tables:
trino> select event_date, product_id, brand, price
-> from delta.bronze_schema."ecomm_v1_clickstream";
event_date | product_id | brand | price
------------+------------+---------+--------
2023-10-01 | 44600062 | nars | 35.79
2023-10-01 | 54600062 | lancome | 122.79
(2 rows)
Updating rows
The standard update operator is available:
trino> UPDATE delta.bronze_schema."ecomm_v1_clickstream"
-> SET category_code = 'health.beauty.products'
-> where category_id = 2103807459595387724;
Table Operations
There are many table operations to consider for optimal performance, and for declut‐
tering the physical filesystem in which our Delta tables live. Chapter 5 covers the
common maintenance and table utility functions, and the following section covers
what functions are available within the Trino connector.
Vacuum
The vacuum operation will clean up files that are no longer required in the current
version of a given Delta table. In Chapter 5 we go into more detail about why vac‐
uuming is required, as well as the caveats to keep in mind to support table recovery
and rolling back to prior versions with time travel.
With respect to Trino, the Delta catalog property delta.vacuum.min-retention pro‐
vides a gating mechanism to protect a table in case of an arbitrary call to vacuum with
a low number of days or hours:
trino> CALL delta.system.vacuum('bronze_schema', 'ecomm_v1_clickstream', '1d');
Table optimization
Depending on the size of the table parts created as we make modifications to our
tables with Trino, we run the risk of creating too many small files representing our
tables. A simple technique to combine the small files into larger files is bin-packing
optimize (which we cover in Chapter 5 and in the performance-tuning deep dive in
Chapter 10). To trigger compaction, we can call ALTER TABLE with EXECUTE:
trino> ALTER TABLE delta.bronze_schema."ecomm_v1_clickstream" EXECUTE optimize;
We can also provide more hints to change the behavior of the optimize operation.
The following will ignore files greater than 10 MB:
trino> ALTER TABLE delta.bronze_schema."ecomm_v1_clickstream"
-> EXECUTE optimize(file_size_threshold => '10MB')
The following will only attempt to compact table files within the partition
(event_date = '2023-10-01'); note that Trino uses single quotes for string and date
literals, while double quotes denote identifiers:
trino> ALTER TABLE delta.bronze_schema."ecomm_v1_clickstream" EXECUTE optimize
WHERE event_date = DATE '2023-10-01'
Metadata tables
The connector exposes several metadata tables for each Delta Lake table that contain
information about their internal structure. We can query these tables to learn more
about our tables and to inspect changes and recent history.
Table history
Each transaction is recorded in the <table>$history metadata table:
trino> describe delta.bronze_schema."ecomm_v1_clickstream$history";
Column | Type | Extra | Comment
----------------------+-----------------------------+-------+---------
version | bigint | |
timestamp | timestamp(3) with time zone | |
user_id | varchar | |
user_name | varchar | |
operation | varchar | |
operation_parameters | map(varchar, varchar) | |
cluster_id | varchar | |
read_version | bigint | |
isolation_level | varchar | |
is_blind_append | boolean | |
We can query the metadata table. Let’s look at the last three transactions for our
ecomm_v1_clickstream table:
trino> select version, timestamp, operation
-> from delta.bronze_schema."ecomm_v1_clickstream$history";
version | timestamp | operation
---------+-----------------------------+--------------
0 | 2023-10-01 19:47:35.618 UTC | CREATE TABLE
1 | 2023-10-01 19:48:41.212 UTC | WRITE
2 | 2023-10-01 23:01:13.141 UTC | OPTIMIZE
(3 rows)
Deleting tables
Using the DROP TABLE operation, we can permanently remove a table that is no longer
needed:
trino> DROP TABLE delta.bronze_schema."ecomm_lite";
There is a lot more that we can do with the Trino connector that is out of scope for
this book; for now we will say goodbye to Trino and conclude this chapter.
Conclusion
During the time we spent together in this chapter, we learned how simple it can be
to connect our Delta tables as either the source or the sink for our Flink applications.
We then learned to use the Rust-based kafka-delta-ingest application to simplify the
data ingestion process that is the bread and butter of most data engineers working
with high-throughput streaming data. By reducing the level of effort required to
simply read a stream of data and write it into our Delta tables, we end up in a
much better place in terms of cognitive burden. When we start to think about all
data in terms of tables—bounded or unbounded—the mental model can be applied
to tame even the most wildly data-intensive problems. On that note, we concluded
the chapter by exploring the native Trino connector for Delta. We discovered how
simple configuration opens up the doors to analytics and insights, all while ensuring
we continue to have a single source of data truth residing in our Delta tables.
CHAPTER 5
Maintaining Your Delta Lake
The process of keeping our Delta Lake tables running efficiently over time is akin
to any kind of preventative maintenance for a car or motorcycle or any alternative
mode of transportation (a bike, a scooter, rollerblades). We wouldn’t wait for our tires
to go flat before assessing the situation and finding a solution—we’d take action. We
would start with simple observations, look for leaks, and ask ourselves, “Does the
tire need to be patched? Could the problem be as simple as adding more air, or is
this situation more dire, and the whole tire will need to be replaced?” The process of
monitoring the situation, finding a remedy when we detect a problem, and applying
the solution can be applied to our Delta Lake tables as well and is all part of the
general process of maintaining the tables. In essence, we just need to think in terms of
cleaning, monitoring, tuning, repairing, and replacing.
In the sections that follow, we’ll learn to take advantage of the Delta Lake utility
methods and learn about their associated configurations (aka table properties). We’ll
walk through some common methods for cleaning, tuning, repairing, and replacing
our tables, in order to lend a helping hand while optimizing the performance and
health of our tables, and ultimately build a firm understanding of the cause-and-
effect relationships among the actions we take.
Delta Lake Table Properties Reference
The metadata stored alongside our table definitions includes TBLPROPERTIES. The
common properties are presented in Table 5-1 and are used to control the behavior of
our Delta tables. These properties enable automated preventative maintenance. When
combined with the Delta Lake table utility functions, they also provide incredibly
simple control over otherwise complex tasks. We simply add or remove properties to
control the behavior of our tables.
The beauty behind using table properties is that they affect only the metadata of our
tables and in most cases don’t require any changes to the physical table structure.
Additionally, being able to opt in, or opt out, allows us to modify Delta Lake’s
behavior without the need to go back and change any existing pipeline code, and in
most cases without needing to restart, or redeploy, our streaming applications (the
batch applications will simply read the revised properties on their next run).
There are other use cases that fall under the maintenance umbrella and require inten‐
tional action by humans and the courtesy of a heads-up to downstream consumers.
As we close out this chapter, we’ll look at using REPLACE TABLE to add partitions. This
process can break active readers of our tables, as the operation rewrites the physical
layout of the Delta Lake table.
To follow along, the rest of the chapter will be using the covid_nyt dataset included
in the book’s GitHub repo, along with the companion Docker environment. To get
started, execute the following:
$ export DLDG_DATA_DIR=~/path/to/delta-lake-definitive-guide/datasets/
$ export DLDG_CHAPTER_DIR=~/path/to/delta-lake-definitive-guide/ch05
$ docker run --rm -it \
--name delta_quickstart \
-v $DLDG_DATA_DIR/:/opt/spark/data/datasets \
-v $DLDG_CHAPTER_DIR/:/opt/spark/work-dir/ch05 \
-p 8888-8889:8888-8889 \
delta_quickstart
This command will spin up the JupyterLab environment locally. Using the URL
provided to you in the output, open up the JupyterLab environment and click
into /ch05/ch05_notebook.ipynb to follow along.
Example 5-1. Creating a Delta Lake table with default table properties
$ spark.sql("""
CREATE TABLE IF NOT EXISTS default.covid_nyt (
date DATE
) USING DELTA
TBLPROPERTIES('delta.logRetentionDuration'='interval 7 days');
""")
It is worth pointing out that the covid_nyt dataset has six columns.
In Example 5-1, we are purposefully being lazy, since we can steal
the schema of the full covid_nyt table while we import it in the
next step. This will teach us how to evolve the schema of the
current table by filling in missing columns in the table definition.
The inputFiles command will return an empty list. That is expected but also feels a
little lonely. Let’s go ahead and bring some joy to this table by adding some data. We’ll
execute a simple read-through operation of the covid_nyt Parquet data directly into
our managed Delta Lake table (the empty table from before).
From your active session, execute the following block of code to merge the covid_nyt
dataset into the empty default.covid_nyt table:
$ from pyspark.sql.functions import to_date
(spark.read
.format("parquet")
.load("/opt/spark/work-dir/rs/data/COVID-19_NYT/*.parquet")
.withColumn("date", to_date("date", "yyyy-MM-dd"))
.write
.format("delta")
.saveAsTable("default.covid_nyt")
)
This time the AnalysisException is thrown due to a schema mismatch. This is how
the Delta protocol protects us (the operator) from blindly making changes when
there is a mismatch between the expected (committed) table schema that currently
has one column and our local schema (from reading the covid_nyt Parquet data)
that is currently uncommitted and has six columns. This exception is another guard‐
rail in place to block the accidental pollution of our table schema, a process known as
schema enforcement.
We updated the write mode to an append operation. This was necessary given
that we created the table in a separate transaction, and the default write mode
(errorIfExists) short-circuits the operation when the Delta Lake table already
exists.
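A hedged sketch of the corrected write follows, combining append mode with Delta
Lake's mergeSchema option so that the five columns missing from the one-column
table definition are added during the write; treat the exact option placement as an
assumption rather than the chapter's verbatim example:
$ from pyspark.sql.functions import to_date
(spark.read
  .format("parquet")
  .load("/opt/spark/work-dir/rs/data/COVID-19_NYT/*.parquet")
  .withColumn("date", to_date("date", "yyyy-MM-dd"))
  .write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")   # opt in to schema evolution for this write
  .saveAsTable("default.covid_nyt")
)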
You’ll see the complete table metadata after executing the DESCRIBE command,
including the columns (and comments) and partitioning (in our case, none), as well
as all available tblproperties. Using DESCRIBE is a simple way of getting to know our
table, or frankly any table you’ll need to work with in the future.
To view (or confirm) the changes from the prior transaction, you can call SHOW
TBLPROPERTIES on the covid_nyt table:
$ spark.sql("show tblproperties default.covid_nyt").show(truncate=False)
Or you can execute the detail() function on the DeltaTable instance from earlier:
$ dt.detail().select("properties").show(truncate=False)
To round out this section, we’ll learn how to remove unwanted table properties; then
we can continue our journey by learning to clean and optimize our Delta Lake tables.
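As a quick illustration, adding or removing a table property is a metadata-only
transaction issued with ALTER TABLE; here is a minimal sketch against the covid_nyt
table, where the property values are arbitrary examples:
$ spark.sql("""
  ALTER TABLE default.covid_nyt
  SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 30 days')
""")
$ spark.sql("""
  ALTER TABLE default.covid_nyt
  UNSET TBLPROPERTIES ('delta.logRetentionDuration')
""")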
For fun, we are going to re-create a very real small files problem and then figure out
how to optimize the table. To follow along, head back to the session from earlier in
the chapter, as we’ll continue to use the covid_nyt dataset in the following examples.
If you want to view the physical table files, you can run the follow‐
ing command:
WAREHOUSE_DIR=/opt/spark/work-dir/ch05/spark-warehouse
FILE_PATH=$WAREHOUSE_DIR/nonoptimal_covid_nyt/*parquet
ls -lh $FILE_PATH
We now have a table we can optimize. Next we’ll introduce OPTIMIZE. As a utility,
consider it to be your friend. It will help you painlessly consolidate the many small
files representing our table into a few larger files, and all in the blink of an eye.
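A minimal sketch of triggering compaction with the Python DeltaTable API, capturing
the returned metrics as results_df (assuming the nonoptimal_covid_nyt table created
for this exercise):
$ from delta.tables import DeltaTable
results_df = (
  DeltaTable.forName(spark, "default.nonoptimal_covid_nyt")
  .optimize()
  .executeCompaction()
)
results_df.show(truncate=False)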
The results of running the optimize operation are returned locally in a DataFrame
(results_df) and are available via the table history as well. To view the OPTIMIZE
stats, we can use the history method on our DeltaTable instance:
$ from pyspark.sql.functions import col
(
  DeltaTable.forName(spark, "default.nonoptimal_covid_nyt")
  .history(10)
  .where(col("operation") == "OPTIMIZE")
  .select("version", "timestamp", "operationMetrics")
  .show(truncate=False)
)
2 For the Spark ecosystem, Delta Lake >= 3.1.0 includes the option for auto compaction, using delta.auto
Optimize.autoCompact.
Z-Order Optimize
Z-Ordering is a technique for colocating related information in the same set of files.
The related information is the data residing in your table’s columns. Consider the
covid_nyt dataset. If we knew we wanted to quickly calculate the death rate by state
over time, then utilizing ZORDER BY would allow us to skip opening files in our tables
that don’t contain relevant information for our query. This colocality is automatically
used by the Delta Lake data-skipping algorithms. This behavior dramatically reduces
the amount of data that needs to be read.
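A minimal sketch of applying Z-ordering on the state column discussed above, again
using the Python DeltaTable API; the table and column names assume the covid_nyt
schema described earlier in the chapter:
$ from delta.tables import DeltaTable
(
  DeltaTable.forName(spark, "default.covid_nyt")
  .optimize()
  .executeZOrderBy("state")
)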
There are a few rules to keep in mind when tuning ZORDER BY.3
3 For the complete list of rules, you can always reference the Databricks documentation.
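# Note: recovery_table, partition_col, and partition_to_fix are assumed to be
# defined earlier in the chapter's full example (a backup copy of the table and
# the partition value being repaired).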
(recovery_table
.where(col(partition_col) == partition_to_fix)
.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", f"{partition_col} == {partition_to_fix}")
.saveAsTable("silver.covid_nyt_by_date")
)
This code showcases the replace overwrite pattern, as it can either replace missing
data or overwrite the existing data conditionally in a table. This option allows you to
fix tables that may have become corrupt or to resolve issues where data was missing
and has become available. The replaceWhere with insert overwrite isn’t bound only
to partition columns and can be used to conditionally replace data in your tables.
Cleaning Up
When we delete data from our Delta Lake tables, this action is not immediate. In fact,
the operation itself simply removes the reference from the Delta Lake table snapshot,
so it is as if the data is now invisible. This operation means that we have the ability to
“undo” in cases where data is accidentally deleted. We can clean up the artifacts, the
deleted files, and truly purge them from the Delta Lake table using a process called
vacuuming.
Vacuum
The vacuum command will clean up deleted files or versions of the table that are no
longer current, which can happen when you use the overwrite method on a table. If
you overwrite the table, all you are really doing is creating new pointers to new files
that are referenced by the table metadata. So if you overwrite a table often, the size of
the table on disk will keep growing, since each overwrite leaves the previous files
behind. With this in mind, it is best to utilize vacuum
to enable short-lived time travel (up to 30 days is typical), and to employ a different
strategy for storing strategic table backups. We’ll look at the common scenario now.
Luckily, there are some table properties, such as delta.deletedFileRetentionDuration
and delta.logRetentionDuration, that help us control how long deleted files and older
transaction log entries are retained as changes occur over time; these rules govern the
vacuuming process.
With the table properties set on our table, the vacuum command does most of the
work for us. The following code example shows how to execute the vacuum operation:
$ (
  DeltaTable.forName(spark, "default.nonoptimal_covid_nyt")
  .vacuum()
)
Other issues that may arise will be covered in Chapter 7, where we tackle streaming
data in and out of our Delta Lake tables.
The vacuum command will not run itself. When you are planning
to bring your table into production and want to automate the
process of keeping the table tidy, you can set up a cron job to
call vacuum on a normal cadence (daily, weekly). It is also worth
pointing out that vacuum relies on the timestamps of the files when
they were written to disk, so if the entire table was imported, the
vacuum command will not do anything until you hit your retention
thresholds. This is due to the way that the filesystem marks file cre‐
ation time versus the actual time the files were originally created.
Dropping tables
Dropping a table is an operation with no undo. If you execute delete from {table},
you are essentially truncating the table and can still utilize time travel to undo the
operation. However, if you really want to remove all traces of a table with DROP TABLE,
there is no way to recover it afterward, so remember to plan ahead by creating a table
copy (or clone) if you want a recovery strategy.
Conclusion
This chapter introduced you to the common utility functions provided within the
Delta Lake project. We learned how to work with table properties, explored the table
properties we’d most likely encounter, and learned how to optimize our tables to fix
the small files problem. This led to our learning about partitioning and about restor‐
ing and replacing data within our tables. We explored using time travel to restore our
tables, and we concluded the chapter with a dive into cleaning up after ourselves and,
lastly, permanently deleting tables that are no longer necessary. While not every use
case can fit cleanly into a book, we now have a great reference for common problems
and their required solutions in maintaining your Delta Lake tables and keeping them
running smoothly over time.
CHAPTER 6
Building Native Applications
with Delta Lake
By R. Tyler Croy
Delta Lake was created on the Java platform, but since the protocol became open
source, it has been implemented in a number of different languages, allowing
for new opportunities to use Delta Lake in native applications without requiring
Apache Spark. The most mature implementation of the Delta Lake protocol after the
original Spark-based library is delta-rs, which produces the deltalake library for
both Python and Rust users.
In this chapter you will learn how to build a Python- or Rust-based application
for loading, querying, and writing Delta Lake tables using these libraries. Along the
way we will review some of the tools in the larger Python and Rust ecosystems
that support Delta Lake, giving users substantial flexibility and performance when
building data applications. Unlike its Spark-based counterpart, the deltalake library
has no specific infrastructure requirements and can easily run in your command line,
a Jupyter Notebook, an AWS Lambda, or anywhere else Python or compiled Rust
programs can be executed. This extreme portability comes with a trade-off: there
is no “cluster,” and therefore native Delta Lake applications generally cannot scale
beyond the computational or memory resources of a single machine.1
To demonstrate the utility of this “low overhead” approach to utilizing Delta Lake, in
this chapter you will create an AWS Lambda, which will receive new data via its trig‐
ger, query an existing Delta Lake table to enrich its data, and store the new results in
a new silver Delta Lake table. The pricing model of AWS Lambda incentivizes short
1 Some really interesting efforts such as Ballista are underway that will enable users to build Python- or
Rust-based programs that run on a cluster, but they are still early in their maturity.
execution time and low memory utilization, which makes deltalake a powerful tool
for building fast and cheap data applications. While the examples in this chapter run
on AWS, the deltalake libraries for Python and Rust support a number of different
storage backends from cloud providers such as Azure and Google Cloud Platform or
on-premises tools like MinIO, HDFS, and more.
Getting Started
To develop native Delta Lake applications, you will need to have Python 3 installed
when building Python applications. Chances are your workstation either has Python
3 preinstalled or has it readily available as part of “developer tooling” packages. The
Rust toolchain, on the other hand, is necessary only when building Rust-based Delta
Lake applications2 and should be installed following the official documentation for
the compiler and associated tooling, such as cargo.
Python
This example will largely be developed in the terminal on your workstation using
virtualenv to manage the project-specific dependencies of the Lambda function:
% cd ~/dldg # Choose the directory of your choice
% virtualenv venv # Configure a Python virtualenv for managing deps
# in the ./venv/ directory
% source ./venv/bin/activate # Activate the virtualenv in this shell
Once the virtualenv has been activated, the deltalake package can be installed with
pip. It is also helpful to install the pandas package to do some data querying. The
following example demonstrates some basic deltalake and pandas invocations to
load and display a test dataset that is partitioned between two separate columns (c1,
c2) containing a series of numbers:
% pip install 'deltalake>=0.18.2' pandas
% python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable('./deltatbl-partitioned')
>>> dt.files()
2 The Rust compiler toolchain can easily be installed with the rustup installer.
Reading data from Delta Lake tables is very easy thanks to the to_pandas() function,
which loads data from the DeltaTable and produces a DataFrame that can be used
to further query or inspect the data stored in the Delta Lake table. With a Pandas
DataFrame, a wide world of data analysis is available in your terminal or notebook; to
learn more about Pandas specifically, check out Python for Data Analysis (O’Reilly).
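A minimal sketch of pulling the same partitioned test table into Pandas for a quick
look; the summary call shown is just one example of what you might do next:
>>> df = dt.to_pandas()
>>> df.describe()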
Getting started with Pandas is simple, but when reading large datasets, to_pandas()
has some limitations; those will be covered in the next section.
Underneath the covers, the to_pandas() call performs roughly the following steps:
1. Collect references to the necessary data files—in essence, the .parquet files
returned from dt.files().
2. Retrieve those data files from storage (the local filesystem in this example).
3. Deserialize and load those data files into memory.
4. Construct the pandas.DataFrame object using the data loaded in memory.
Steps 2 and 3 pose scaling limitations as the size of the data in the Delta Lake table
grows. Modern workstations have lots of memory, which often means that loading a
few gigabytes of data into memory is not that much of a concern, but the retrieval
of that data can be a problem. For example, if the Delta Lake table is stored in AWS
S3 but your Python terminal is running on your laptop, loading a few gigabytes over
the network can take a significant amount of time.
The filtering semantics provided by the to_pandas() function (its partitions, columns,
and filters parameters) are no substitute for the expressive power of the Pandas
DataFrame API, but they offer a very useful mechanism for constraining the amount
of data retrieved from a Delta Lake table and
loaded into memory. Both are important to consider in the example discussed later in
the “Building a Lambda” on page 131 section, in which the resource constraints of the
AWS Lambda environment reward fast and lightweight runtimes.
File statistics. The Delta protocol allows for optional file statistics that can enable
further optimization by query engines. When writing a .parquet file, most writers will
put this additional metadata into the Delta transaction log, capturing each column’s
minimum and maximum values. The deltalake Python library can utilize this infor‐
mation to skip files that don’t contain values in the specified column(s). This can
be especially useful for append-only tables that have predictable and sequential data
within a given partition.
Using an example dataset3 that is partitioned by year but contains multiple Parquet
files within each partition, the transaction log includes the following entry:
{
"add": {
"path": "year=2022/0-ec9935aa-a154-4ba4-ab7e-92a53369c433-2.parquet",
"partitionValues": {
"year": "2022"
},
"size": 3025,
"modificationTime": 1705178628881,
"dataChange": true,
"stats": "{\"numRecords\": 4, \"minValues\": {\"month\": 9,
\"decimal date\": 2022.7083, \"average\": 415.74,
\"deseasonalized\": 419.02, \"ndays\": 24, \"sdev\": 0.27,
\"unc\": 0.1}, \"maxValues\": {\"month\": 12, \"decimal date\": 2022.9583,
\"average\": 418.99, \"deseasonalized\": 419.72, \"ndays\": 30,
\"sdev\": 0.57, \"unc\": 0.22}, \"nullCount\": {\"month\": 0,
\"decimal date\": 0, \"average\": 0, \"deseasonalized\": 0, \"ndays\": 0,
3 Scripts to download this sample data can be found in this book’s GitHub repository. This particular example
uses data from NOAA that tracks global CO2 concentrations.
The stats portion contains the relevant information for the file statistics–based opti‐
mization. Inspecting minValues and maxValues shows that 0-ec9935aa-a154-4ba4-
ab7e-92a53369c433-2.parquet contains data only for the months September to
December in the year 2022. The following Pandas invocation will create a DataFrame
that has loaded data only from this specific file utilizing the partition column and the
month column. The file statistics help the underlying engine avoid loading every file
in the year=2022/ partition; instead, it selects only the one containing values where
the month is greater than or equal to 9, leading to a much faster and more efficient
execution of data retrieval:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable('./data/gen/filestats')
>>> len(dt.files())
198
>>> df = dt.to_pandas(filters=[('year', '=', 2022), ('month', '>=', 9)])
>>> df
year month decimal date average deseasonalized ndays sdev unc
0 2022 9 2022.7083 415.91 419.36 28 0.41 0.15
1 2022 10 2022.7917 415.74 419.02 30 0.27 0.10
2 2022 11 2022.8750 417.47 419.44 25 0.52 0.20
3 2022 12 2022.9583 418.99 419.72 24 0.57 0.22
Rather than loading every one of the files in the Delta Lake table to produce a
DataFrame for experimentation, filters utilizes Delta’s partitioning and file statistics
to load a single file from storage for this example.
The Delta Lake transaction log provides a wealth of information that the deltalake
native Python library utilizes to provide fast and efficient reads of tables; more
examples can be found in the online documentation. Reading existing Delta Lake
tables is exciting, but for many Python users, the writing of Delta Lake tables helps
unlock new superpowers in the Python-based data analysis or machine learning
environment.
Writing data
Numerous examples for performing data analysis or machine learning in Python
start with loading data into a DataFrame of some form (typically Pandas) from
a CSV- or TSV-formatted dataset. Comma-separated values (CSV) files are fairly
ubiquitous and easy to produce, but they lack the schema and transactional guarantees
of a Delta Lake table.
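A hedged sketch of turning such a file into a Delta Lake table with the
write_deltalake function; the CSV path and table location here are hypothetical
placeholders:
>>> import pandas as pd
>>> from deltalake import write_deltalake
>>> df = pd.read_csv('./data/co2_monthly.csv')    # hypothetical input file
>>> write_deltalake('./data/co2_delta', df)       # creates a new Delta Lake table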
The ability to write a Delta Lake table easily can accelerate local development or
model training; in addition, it can enable building simple and fast ingestion appli‐
cations in environments such as AWS Lambda, which will be covered later in the
chapter.
Merging/updating
The DeltaTable object contains a number of simple functions for common merge
or update tasks on the Delta Lake table, such as delete, merge, and update. These
functions can be used in much the same way as delete, merge, and update operations
in a relational database, but underneath the covers the Delta transaction log is doing a
lot of important work to keep track of the data being modified.
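For instance, here is a hedged sketch of a delete using the file statistics table loaded
earlier; the predicate is arbitrary, and the operation records remove and add actions
in the transaction log, which are discussed next:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable('./data/gen/filestats')
>>> dt.delete("month = 12")    # rewrites the files containing matching rows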
The resulting transaction log entry contains two key actions: remove and add, each
referencing its respective files. There is an additional action called commitInfo that is optional in
the Delta Lake table protocol but may contain additional information about what
triggered this particular transaction. In this case, it describes the DELETE operation
with its predicate, giving us insight into why the remove and add were necessary.
Whether the operation is a delete, update, or merge, when data is changed in the
Delta Lake table, there is typically a removal of outdated Parquet files and a creation
of new Parquet files with the modified data. This is the case except when using the
newer deletion vectors feature, which the Python or Rust libraries do not support at
the time of this writing.4
Rust
Following the same patterns as the Python examples, start by opening a contrived
Delta Lake table:
#[tokio::main]
async fn main() {
println!(">> Loading `deltatbl-partitioned`");
let table = deltalake::open_table("../data/deltatbl-partitioned").await
.expect("Failed to open table");
println!("..loaded version {}", table.version());
for file in table.get_files_iter() {
println!(" - {}", file.as_ref());
}
}
5 One of the earliest open source applications developed with delta-rs, kafka-delta-ingest, has been running in
production environments for years without incident or substantial change in system resource requirements.
#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    let table = deltalake::open_table("../data/deltatbl-partitioned")
        .await
        .unwrap();
    ctx.register_table("demo", Arc::new(table)).unwrap();
    // Query the registered Delta table with DataFusion SQL and print the results
    let df = ctx.sql("SELECT * FROM demo LIMIT 5").await.unwrap();
    df.show().await.unwrap();
}
Writing data
At a fundamental level, a Delta Lake table consists of data files, typically in the
Apache Parquet format, and transaction log files in a JSON format. The deltalake
Rust crate supports writing both data and transaction log files, or writing only
transactions. For example, kafka-delta-ingest translates streams of JSON data into
Apache Parquet before creating a transaction to add the data to the configured Delta
Lake table. Other Rust applications may use Parquet data files created by an external
system, such as oxbow, which only needs to manage the Delta Lake table’s transaction
log.
• Transaction operations that allow direct interaction with the Delta log
• A DataFusion-based writer for inserting and/or merging records
• A simple high-level JSON writer that accepts serde_json::Value types
• A RecordBatch writer that allows developers to turn Arrow RecordBatches into
Apache Parquet files written into Delta Lake tables
For most use cases, the decision about what type of writer is required will come down
to whether the write should be an append or a merge.
For append-only writers, DataFusion is not necessary, and the deltalake package’s
RecordBatchWriter can be used to issue append-only writes to a DeltaTable.
6 The writers are available as of the 0.19 release of the deltalake crate, but this may change as the project
moves toward a 1.0 release.
Building a Lambda
Serverless functions represent an ideal use case for building native applications for
Delta Lake, such as with AWS Lambda. The billing model for Lambda encourages
low memory usage and fast execution time, which makes it a great platform for com‐
pact and efficient data processing applications. This section will adapt some of this
chapter’s previous examples to run within AWS Lambda to handle data ingestion or
processing using deltalake. Other cloud providers have similar serverless offerings,
such as Azure Functions and Google Cloud Run. The concepts in this section can be
ported into those environments, but some of the interfaces may change.
For most applications, AWS Lambda is triggered by an external event such as an
inbound HTTP request, an SQS message, or a CloudWatch event. Lambda will then
translate this external event into a JSON payload, which the Lambda function will
receive and can act upon. Imagine, for example, an application that receives an HTTP
POST with a JSON array containing thousands of records that should be written to
S3, as sketched out via the request flow diagram in Figure 6-1. Upon invocation, the
Lambda receives the JSON array, which it can then append to a preconfigured Delta
Lake table. Lambdas should conceptually be simple and complete their task as quickly
and efficiently as possible.
Python
Lambdas can be written in Python directly within the AWS Lambda web UI. Unfortu‐
nately, the default Python runtime has only minimal packages built in, and developers
wishing to include deltalake will need to package their Lambdas either with layers
or as containers. AWS provides an “AWS SDK with Pandas” layer that can be used to
get started, but some care must be taken to include the deltalake dependency due
to the 250 MB size limitation of Lambda layers. How the Lambda is packaged doesn’t
have a significant impact on its execution, so this section will not focus heavily on
packaging and uploading the Lambda. Please refer to the book’s GitHub repository, as
it contains examples that use layers and container-based approaches, along with the
infrastructure code necessary to deploy the examples.
The hello-delta-rust example demonstrates the simplest possible Delta Lake appli‐
cation in Lambda. This example looks only at the table’s metadata, rather than
querying any of the data.
The lambda_function.py simply opens the Delta Lake table and returns metadata to
the HTTP client:
import os
from deltalake import DeltaTable
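The body of the handler itself is only a few lines. The following is a minimal sketch of what it might look like; the exact response fields are illustrative:
# Python
def lambda_handler(event, context):
    # The table location is assumed to be configured on the function's environment
    dt = DeltaTable(os.environ['TABLE_URL'])
    return {
        'version': dt.version(),
        'files': dt.files(),
    }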
This simple Python code, which creates a DeltaTable object and then performs operations on the table (dt), demonstrates how easy interacting with Delta Lake from a Lambda can be. So long as the function returns a list or dict to the caller of lambda_handler, AWS Lambda will handle returning the information to the caller as JSON over HTTP.
The examples from the section “Reading large datasets” on page 117, which used
Pandas or PyArrow for querying data in Python, can be reused inside the Lambda
environment.
Similarly, the examples that cover writing data in Python can be reused in a
Lambda. However, the Lambda execution environment is inherently parallelized,
which presents concurrent write challenges when using AWS S3; these challenges and
solutions are discussed later in this chapter. First we need the application, which will
take the JSON array described above and append that to a Delta Lake table. Execution
begins with the lambda_handler function, which is the entrypoint for AWS Lambda
to execute your uploaded code:
# Python
import json
import os

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake


def lambda_handler(event, context):
    table_url = os.environ['TABLE_URL']
    try:
        # schema() is a helper defined elsewhere in the module that returns
        # the expected table schema
        input = pa.RecordBatch.from_pylist(json.loads(event['body']))
        dt = DeltaTable(table_url)
        write_deltalake(dt, data=input, schema=schema(), mode='append')
        status = 201
        body = json.dumps({'message' : 'Thanks for the data!'})
    except Exception as err:
        status = 400
        body = json.dumps({'message' : str(err),
                           'type' : type(err).__name__})
    return {
        'statusCode' : status,
        'headers' : {'Content-Type' : 'application/json'},
        'isBase64Encoded' : False,
        'body' : body,
    }
Rust
Building AWS Lambdas in Rust is similarly straightforward to building their Python
counterparts. Unlike Python, however, Rust can be compiled to native code and does
not require a “runtime” in AWS Lambda; instead, a custom-formatted bootstrap.zip
file containing the compiled executable must be uploaded to AWS. Additional tools
such as cargo-lambda should be installed on your workstation to provide genera‐
tors and the build/cross-compiling functionality needed to build the bootstrap.zip
files required by Lambda. The following examples and those in the book’s GitHub
Figure 6-2. Coordination process for two concurrent writers using the S3DynamoDBLogStore process
DynamoDB lock. Applications with older dependencies may still rely on dynamo
db_lock, but since this approach is deprecated, this section will not dive too deeply
into its function and design. At a high level, a DynamoDB table is configured as a
simple key-value store alongside the Python or Rust application. Prior to executing
a write operation, the deltalake library will check DynamoDB for the presence of a
lock item—essentially a key representing the table it wishes to write against. If that
key does not exist, the library will:
What’s Next
The native data processing ecosystem is blossoming, with dozens of great tools in
Python and Rust being developed and coming to maturity. Most of this innovation
is being done by passionate and inspired developers in the larger open source
ecosystem.
Delta Lake plays a pivotal role via the deltalake Python package or Rust crate,
allowing data applications to benefit from the optimized storage and transactional
nature of Delta. The list of integrations and great tools continues to grow; following is
a list of interesting projects that are worth learning more about:
7 Check out the blog post “Concurrency Limitations for Delta Lake on AWS” for more details on the Dyna‐
moDB lock’s limitations.
Now more than ever, the world is infused with real-time data sources. From ecom‐
merce, social network feeds, and airline flight data to network security and IoT
devices, the volume of data sources is increasing alongside the speed with which
you’re able to access it. One problem with this is that, while some event-level opera‐
tions make sense, much of the information we depend on lives in the aggregation
of that information. So we are caught between the dueling priorities of (a) reducing
the time to insights as much as possible and (b) capturing enough meaningful and
actionable information from aggregates. For years we’ve seen processing technologies
shifting in this direction, and it was this environment in which Delta Lake originated.
What we got from Delta Lake was an open lakehouse format that supports seamless
integrations of multiple batch and stream processes while delivering the necessary
features like ACID transactions and scalable metadata processing that are commonly
absent in most distributed data stores. With that in mind, in this chapter we dig into
some of the details for stream processing with Delta Lake—namely, the functionality
that is core to streaming processes, configuration options, specific usage methods,
and the relationship of Delta Lake to Databricks Delta Live Tables.
We'll also cover a couple of related features used in Databricks, such as Delta Live Tables and how it relates to Delta Lake, and then review how to use the Change Data Feed functionality available in Delta Lake.
1 For a review of the lambda architecture pattern, we suggest starting with the Wikipedia page. It is essentially
a parallel path architecture with a stream processing component and a batch processing component, both
reading from the same source. The streaming process provides a faster view of the data, and the batch process
ensures eventual accuracy.
Figure 7-1. The biggest difference between batch and stream processing is latency; we can
handle the files or messages individually as each becomes available or as a group
From a practical standpoint, the way we think about other related concepts such as
processing time and table maintenance is affected by our choice between batch and
streaming. If a batch process is scheduled to run at certain times, then we can easily
measure the amount of time the process runs and how much data was processed
and then chain it together with additional processes to handle table maintenance
operations. We do need to think a little differently when it comes to measuring
and maintaining stream processes, but many of the features we’ve already looked
at—such as autocompaction and optimized writes, for example—can work in both
realms. In Figure 7-2, we can see how, with modern systems, batch and streaming
can converge, and we can focus instead on latency trade-offs once we depart from
traditional frameworks. By choosing a framework with a reasonably unified API that minimizes the programming differences between batch and streaming use cases, and by running it on top of a storage format like Delta Lake that simplifies maintenance operations and supports either method of processing, we wind up with a system that is both robust and flexible enough to handle all our data processing tasks. We also reduce the need to juggle multiple tools and avoid the complications that come with running multiple systems. This makes Delta Lake the ideal storage solution for streaming workloads. Next, we'll consider some of the specific terminology for stream processing applications and follow up with a review of a few of the different framework integrations available for use with Delta Lake.
Streaming terminology
In many ways, streaming processes are much the same as batch processes, with the difference being mostly one of latency and cadence. This does not mean, however,
that streaming processes don’t come with some of their own lingo. Some terms, such
as source and sink, vary only a little from batch usage, while terms like checkpoint
and watermark don’t really apply to batch. It’s useful to have some working familiarity
with these terms, but you can dig into them at a greater depth in Stream Processing
with Apache Flink by Fabian Hueske and Vasiliki Kalavri (O’Reilly) or Learning Spark.
Source. A stream processing source is any of a variety of sources of data that can be
treated as an unbounded dataset. Sources for data stream processing are varied and
ultimately depend on the nature of the processing task in mind. There are a number
of different message queue and pub/sub connectors used as data sources across the
Spark and Flink ecosystems. These include many common favorites such as Apache
Kafka, Amazon Kinesis, ActiveMQ, RabbitMQ, Azure Event Hubs, and Google’s Pub/
Sub. Both systems can also generate streams from files, for example, by monitoring
cloud storage locations for new files. We will see shortly how Delta Lake fits in as a
streaming data source.
Checkpoint. It is usually important to make sure that you have implemented check‐
pointing in a streaming process. Checkpointing keeps track of the progress made
in processing tasks and is what makes failure recovery possible without restarting
processing from the beginning every time. This is accomplished by keeping some
tracking record of the offsets for the stream, as well as any associated stateful infor‐
mation. In some processing engines, such as Flink and Spark, there are built-in
mechanisms to make checkpointing operations simpler to use. We refer you to the
respective documentation for usage details.
All the examples and some other supporting code for this chapter
can be found in the GitHub repository for the book.
Let’s consider an example from Spark. When we start a stream writing process and
define a suitable checkpoint location, it will in the background create a few directories
at the target location. In the following example, we find a checkpoint written from a
process we called “gold” (and named the directory similarly):
tree -L 1 /…/ckpt/gold/
/…/ckpt/gold/
├── __tmp_path_dir
├── commits
├── metadata
├── offsets
└── state
The metadata directory will contain some information about the streaming query,
and the state directory will contain snapshots of the state information (if any) related
to the query. The offsets and commits directories track at a microbatch level the
progress of streaming from the source and writing to the sink, which for Delta Lake
amounts to tracking the input or output files, respectively, as we’ll see more of shortly.
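For reference, a checkpoint location like the one shown above is declared when the stream is started. Here is a minimal sketch; the source, sink, and checkpoint paths are illustrative:
# Python
(spark
    .readStream
    .format("delta")
    .load("/files/delta/silver_table")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/ckpt/gold/")
    .outputMode("append")
    .start("/files/delta/gold_table")
)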
Apache Flink
Apache Flink is one of the major distributed, in-memory processing engines that sup‐
ports both bounded and unbounded data manipulation. Flink supports many prede‐
fined and built-in data stream sources and sink connectors.3 On the data source side,
we see many message queues and pub/sub connectors supported, such as RabbitMQ,
Apache Pulsar, and Apache Kafka (see the Flink documentation for more detailed
streaming connector information). While some, such as Kafka, are supported as an
output destination, it’s probably most common to instead see something like writing
to file storage or Elasticsearch or even a JDBC connection to a database as the goal.
You can find more information about Flink connectors in their documentation.
With Delta Lake, we gain yet another source and destination for Flink, but one
that can be critical in multitool hybrid ecosystems or can simplify logical processing
transitions. For example, with Flink, we can focus on event stream processing and
then write directly to a Delta table in cloud storage, where we can access it for
subsequent processing in Spark. Alternatively, we could reverse this situation entirely
and feed a message queue from records in Delta Lake. A more in-depth review of the
connector, including both implementation and architectural details, is available as a
blog post on the delta.io website.
Apache Spark
Apache Spark similarly supports many input sources and sinks.4 Since Apache Spark
tends to hold more of a place on the large-scale ingestion and ETL side, we do see a
little bit of a skew in the direction of input sources available, in contrast to the more
event-processing-centered Flink system. In addition to file-based sources, there is a
strong native integration with Kafka in Spark, as well as several separately maintained
2 To explore watermarks in more detail, we suggest the “Event-Time and Stateful Processing” chapter of
Spark: The Definitive Guide by Bill Chambers and Matei Zaharia (O’Reilly).
3 We understand many readers are more familiar with Apache Spark. For an introduction to concepts more
specific to Apache Flink, we suggest the “Learn Flink” page of the Flink documentation.
4 Apache Spark source and sink documentation can be found in the “Structured Streaming Programming
Guide”, which is generally seen as the go-to source for all things streaming with Spark.
Delta-rs
The Rust ecosystem also has additional processing engines and libraries of its own,
and thanks to the implementation called delta-rs, we get further processing options
that can run on Delta Lake. This area is one of the newer sides and has seen some
intensive build-out in recent years. Polars and DataFusion are just a couple of the
other options for stream data processing, and both couple with delta-rs reasonably
well. This is a rapidly developing area that we expect to see a lot more growth in
going forward.
One other benefit of the delta-rs implementation is that there is a direct Python
integration, which opens up additional possibilities for data stream processing tasks.
This means that for smaller-scale jobs, it is possible to use a Python API (such as
AWS Boto3, for example) for services that otherwise require larger-scale frameworks
for interaction and thus cause unneeded overhead. While you may not be able to
leverage some of the features from the frameworks that more naturally support
streaming operations, you could benefit from a significant reduction in infrastructure
requirements and still get lightning-fast performance.
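As a small illustration, a lightweight job can read a Delta table directly with Polars, which delegates to delta-rs under the hood, with no JVM or cluster involved; the table path here is illustrative:
# Python
import polars as pl

# Reads the Delta table through the deltalake (delta-rs) bindings
df = pl.read_delta("../data/deltatbl-partitioned")
print(df.head())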
The net result of the delta-rs implementation is that Delta Lake gives us a format
through which we can simultaneously make use of multiple processing frameworks
and engines without relying on an additional RDBMS and still operate outside of
more Java-centered stacks. This means that, even when working in disparate systems,
we can build data applications confidently without sacrificing the built-in benefits we
gain through Delta Lake.
5 You can find detailed descriptions, including error messages, in the “Concurrency Control” section of the
Delta Lake documentation.
Similarly, you can chain together a readStream definition (similarly formatted) and
a writeStream definition to set up a whole input-transformation-output flow (trans‐
formation code omitted here for brevity):
# Python
(spark
    .readStream
    .format("delta")
    .load("/files/delta/user_events")
    …
    # other transformation logic
    …
    .writeStream
    .format("delta")
    .outputMode("append")
    .start("/<delta_path>/")
)
Example
Let’s consider an example. Suppose you have a Delta Lake table called user_events
with date, user_email, and action columns, and it is partitioned by the date col‐
umn. Let’s also suppose that we are using the user_events table as a streaming source
for a step in our larger pipeline process and that we need to delete data from it due to
a GDPR-related request.
When you delete at a partition boundary (that is, the WHERE clause of the query filters
data on a partition column), the files are already in directories based on those values,
so the delete just drops any of those files from the table metadata.
So if you just want to delete data from some entire partitions aligning to specific
dates, you can add the ignoreDeletes option to the readStream:
# Python
streamingDeltaDf = (
    spark
    .readStream
    .format("delta")
    .option("ignoreDeletes", "true")
    .load("/files/delta/user_events")
)
If you want to delete data based on a nonpartition column like user_email instead,
then you will need to use the ignoreChanges option:
# Python
streamingDeltaDf = (
    spark
    .readStream
    .format("delta")
    .option("ignoreChanges", "true")
    .load("/files/delta/user_events")
)
Unlike with some of our other settings, we cannot use both options simultaneously here; we have to choose one or the other. If either setting is added to an existing streaming query that already has a checkpoint defined, it will be ignored, as these options take effect only when starting a new query.
Another thing you will want to note is that even though you can start from any
specified place in the source using these options, the schema will reflect the latest
available version. This means that incorrect values or failures can occur if there is
an incompatible schema change between the specified starting point and the current
version.
Considering our user_events dataset again, suppose you want to read changes
occurring since version 5. Then you would write something like the following:
# Python
(spark
    .readStream
    .format("delta")
    .option("startingVersion", "5")
    .load("/files/delta/user_events")
)
• The data drop issue happens only when the initial Delta snapshot of a stateful
streaming query is processed in the default order.
• withEventTimeOrder is another of those settings that takes effect only at the
beginning of a streaming query, so it cannot be changed after the query is started
and while the initial snapshot is still being processed. If you want to modify the
withEventTimeOrder setting, you must delete the checkpoint and make use of
the initial processing position options to proceed.
• If you are running a stream query with withEventTimeOrder enabled, you can‐
not downgrade it to a version that doesn’t support this feature until the initial
snapshot processing is completed. If you need to downgrade versions, you can
either wait for the initial snapshot to finish or delete the checkpoint and restart
the query.
• There are a few rarer scenarios in which you cannot use withEventTimeOrder:
— If the event time column is a generated column and there are nonprojection
transformations between the Delta source and the watermark
— If there is a watermark with multiple Delta sources in the stream query
• Due to the potential for increased shuffle operations, the performance of the
processing for the initial snapshot may be impacted.
Using the event time ordering triggers a scan of the initial snapshot to find the
corresponding event time range for each microbatch. This suggests that for better
performance we want to be sure that our event time column is among the columns
we collect statistics for. This way our query can take advantage of data skipping, and
we get faster filter action. You can increase the performance of the processing in
cases where it makes sense to partition the data in relation to the event time column.
Performance metrics should indicate how many files are being referenced in each
microbatch.
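Enabling the option is a single line on the readStream definition. The following sketch assumes an event_time column on the user_events table used earlier and pairs the option with a watermark:
# Python
(spark
    .readStream
    .format("delta")
    .option("withEventTimeOrder", "true")
    .load("/files/delta/user_events")
    .withWatermark("event_time", "10 minutes")
)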
Idempotent writes
Let’s suppose that we are leveraging foreachBatch from a streaming source and are
writing to just two destinations. What we would like to do is take the structure of the
foreachBatch transaction and combine it with some nifty Delta Lake functionality to
make sure we commit the microbatch transaction across all the tables without wind‐
ing up with duplicate transactions in some of the tables (i.e., we want idempotent
writes to the tables). We have two options we can use to help get to this state:
txnAppId
This should be a unique string identifier and acts as an application ID that you
can pass for each DataFrame write operation. This identifies the source for each
write. You can use a streaming query ID or some other meaningful name of your
choice as txnAppId.
txnVersion
This is a monotonically increasing number that acts as a transaction version and
functionally becomes the offset identifier for a writeStream query.
By including both of these options, we create a unique source and offset tracking
at the write level, even inside a foreachBatch operation writing to multiple destina‐
tions. This allows, at a table level, for the detection of duplicate write attempts that
can be ignored. This means that if a write is interrupted during the processing of
just one of multiple table destinations, we can continue the processing without dupli‐
cating write operations to tables for which the transaction was already successful.
When the stream restarts from the checkpoint, it will start again with the same
microbatch, but then in the foreachBatch, with the write operations now being
checked at a table level of granularity, we write only to the table or tables that were
not able to complete successfully before, because we will have the same txnAppId and
txnVersion identifiers.
In the case that you want to restart processing from a source and
delete/recreate the streaming checkpoint, you must provide a new
appId as well before restarting the query. If you don’t, then all of
the writes from the restarted query will be ignored because it will
contain the same txnAppId, and the batch ID values will restart, so
the destination table will see them as duplicate transactions.
If we wanted to update the function from our earlier example to write to multiple
locations with idempotency using these options, we could specify the options for each
destination like this:
# Python
app_id = ... # A unique string used as an application ID.
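A minimal sketch of such a foreachBatch function follows; the destination paths are illustrative, and the batch ID supplied by Structured Streaming serves as the txnVersion:
# Python
def write_to_two_tables(batch_df, batch_id):
    # The same txnAppId/txnVersion pair makes each table-level write idempotent
    (batch_df
        .write
        .format("delta")
        .option("txnAppId", app_id)
        .option("txnVersion", batch_id)
        .mode("append")
        .save("/files/delta/destination_one"))
    (batch_df
        .write
        .format("delta")
        .option("txnAppId", app_id)
        .option("txnVersion", batch_id)
        .mode("append")
        .save("/files/delta/destination_two"))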
Merge
There is another common case in which we tend to see foreachBatch used for stream
processing. Think about some of the limitations we have seen where we might allow
large amounts of unchanged records to be reprocessed through the pipeline, or where
we might otherwise want more advanced matching and transformation logic, such as
processing CDC records. To update values, we need to merge changes into an existing
table rather than simply append the information. The bad news is that the default
behavior in streaming kind of requires us to use append-type behaviors (unless we
leverage foreachBatch, that is).
We looked at the merge operation in Chapter 3 and saw that it allows us to use
matching criteria to update or delete existing records and append others that don’t
match the criteria—that is, we can perform upsert operations. Since foreachBatch
lets us treat each microbatch like a regular DataFrame, then at the microbatch level
we can actually perform these upsert operations with Delta Lake. You can upsert data
from a source table, view, or DataFrame into a target Delta table by using the MERGE
SQL operation or its corollary for the Scala, Java, and Python Delta Lake API. It even
supports extended syntax beyond the SQL standards to facilitate advanced use cases.
A merge operation on Delta Lake typically requires two passes over the source data.
If you use nondeterministic functions such as current_timestamp or random in a
source DataFrame, then multiple passes on the source data can produce different
values in rows, causing incorrect results. You can avoid this by using more concrete
functions or values for columns or by writing out results to an intermediate table.
Caching the source data may help as well, because a cache invalidation can cause the
source data to be partially or completely reprocessed, resulting in the same kind of
value changes (for example, when a cluster loses some of its executors when scaling
down). We’ve seen cases in which this can fail in surprising ways when trying to
do something like using a salt column to restructure DataFrame partitioning based
on random number generation (e.g., Spark cannot locate a shuffle partition on disk
because the random prefix is different than expected on a retried run). The multiple
passes for merge operations increase the possibility of this happening.
Let’s consider an example of using merge operations in a stream using foreachBatch
to update the most recent daily retail transaction summaries for a set of customers.
In this case, we will match on a customer ID value and include the transaction date,
number of items, and dollar amount. In practice what we do to use the mergeBuilder
API here is to build a function to handle the logic for our streaming DataFrame.
Inside the function, we’ll provide the customer ID as a matching criteria for the
6 For additional details and examples on using merge in foreachBatch, e.g., for SCD Type II merges, see the
Delta Lake documentation.
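A condensed sketch of the approach might look like the following; the customer_daily_summary target table and the customer_id matching column are assumed for illustration:
# Python
from delta.tables import DeltaTable

def upsert_daily_summary(microbatch_df, batch_id):
    # Upsert each microbatch into the target table rather than appending
    target = DeltaTable.forName(spark, "customer_daily_summary")
    (target.alias("t")
        .merge(microbatch_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
The function is then handed to foreachBatch on the writeStream definition, so each microbatch is merged rather than simply appended.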
Metrics
As we’ve seen, there are cases in which we want to manually set starting and ending
boundary points for processing with Delta Lake, and these are generally aligned to versions or timestamps. Within those boundaries we can have differing numbers of files and so forth, and one concept we have seen to be particularly important to streaming processes is tracking the offsets, or the progress, through those files.
In the metrics reported out for Spark Structured Streaming, we see several details
tracking these offsets.
When running the process on Databricks as well, there are some additional metrics
that help to track backpressure—that is, how much outstanding work there is to be
done at the current point in time. The performance metrics we see get output are
numInputRows, inputRowsPerSecond, and processedRowsPerSecond. The backpres‐
sure metrics are numBytesOutstanding and numFilesOutstanding. These metrics are
fairly self-explanatory by design, so we won’t explore them individually.
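These metrics are reported with each progress update of a streaming query, so you can inspect them directly from the query handle; a minimal sketch, assuming query is an already running StreamingQuery:
# Python
# lastProgress holds the most recent microbatch's metrics, including
# numInputRows, inputRowsPerSecond, and processedRowsPerSecond
print(query.lastProgress)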
Auto Loader
Databricks has a somewhat unique Spark Structured Streaming source known as
Auto Loader, though it is really better thought of as the cloudFiles source. On the
whole, the cloudFiles source is more of a streaming source definition in Structured
Streaming on Databricks, but it has rapidly become an easier entrypoint for stream‐
ing for many organizations in which Delta Lake is commonly the destination sink.
This is partly because it provides a natural way to incrementalize batch processes so
as to integrate some of the benefits of stream processing, such as offset tracking.
The cloudFiles source actually has two different methods of operation: one is to
directly run file-listing operations on a storage location, and the other is to listen on
a notifications queue tied to a storage location. Whichever method is used, it will
quickly become apparent that this is a scalable and efficient mechanism for regular
ingestion of files from cloud storage, as the offsets it uses for tracking progress are the
actual filenames in the specified source directories. Refer to the section “Delta Live
Tables” on page 163 for an example of the most common usage.
One fairly standard application of Auto Loader is to use it as a part of the medallion
architecture design, with a process ingesting files and feeding the data into Delta
Lake tables with additional levels of transformation, enrichment, and aggregation
up to gold layer aggregate data tables. This is quite commonly done with additional
data layer processing taking place, with Delta Lake as both the source and the sink
of streaming processes, which provides low-latency, high-throughput, end-to-end
data transformation pipelines. This process has become somewhat of a standard
for file-based ingestion and has eliminated some of the need for more complicated
processes based on lambda architecture—so much so that Databricks also built a
framework largely centered around this approach.
# Python
import dlt


@dlt.table
def autoloader_dlt_bronze():
    return (
        spark
        .readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("<data path>")
    )


@dlt.table
def delta_dlt_silver():
    return (
        dlt
        .read_stream("autoloader_dlt_bronze")
        …
        <transformation logic>
        …
    )


@dlt.table
def live_delta_gold():
    return (
        dlt
        .read("delta_dlt_silver")
        …
        <aggregation logic>
        …
    )
7 Joe Reis and Matt Housley, Fundamentals of Data Engineering: Plan and Build Robust Data Systems (O’Reilly),
163, 256.
Specifying boundaries for batch processes. Since batch operations are a bounded pro‐
cess, we need to tell Delta Lake what bounds we want to use to read the change feed.
You can provide either version numbers or timestamp strings to set both the starting
and ending boundaries. The boundaries you set will be inclusive in the queries—that
is, if the final timestamp or version number exactly matches a commit, then the
changes from that commit will be included in the change feed. If you want to read the
changes from any particular point all the way up to the latest available changes, then
only specify the starting version or timestamp.
When setting boundary points, you need to use either an integer to specify a version or a string in the format yyyy-MM-dd[ HH:mm:ss[.SSS]] for timestamps, in a similar way to how we set time travel options. If the timestamp or version you provide is earlier than the point at which the change data feed was enabled, an error will be thrown letting you know that the change data feed was not enabled:
# Python
# versions as ints or longs
(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .option("endingVersion", 10)
    .table("myDeltaTable")
)
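The same boundaries can be expressed with timestamp strings instead of version numbers; the dates below are illustrative:
# Python
# timestamps as formatted strings
(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2023-01-01 00:00:00")
    .option("endingTimestamp", "2023-01-31 23:59:59")
    .table("myDeltaTable")
)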
Schema
At this point, you might wonder exactly how the data we are receiving in a change
feed looks as it comes across. We get all the same columns in our data as before. This
makes sense, because otherwise it wouldn’t match up with the schema of the table.
We do, however, get some additional columns so we can understand things like the
change type taking place. We get these three new columns in the data when we read it
as a change feed:
Change type
The _change_type column is a string type column that, for each row, will
identify whether the change taking place is an insert, an update_preimage,
an update_postimage, or a delete operation. In this case, the preimage is the
matched value before the update, and the postimage is the matched value after
the update.
Commit version
The _commit_version column is a long integer type column noting the Delta
Lake file/table version from the transaction log that the change belongs to. When
reading the change feed as a batch process, it will be at or in between the
boundaries defined for the query. When read as a stream, it will be at or greater
than the starting version and will continue to increase over time.
Commit timestamp
The _commit_timestamp column is a timestamp type column (formatted
as yyyy-MM-dd[ HH:mm:ss[.SSS]]) noting the time at which the version in
_commit_version was created and committed to the log.
CHAPTER 8
Advanced Features
In this chapter the focus is a bit less on how to interact with and use Delta Lake tables
than you may have found in other chapters. Instead, the main focus here is a handful
of advanced features that you’ll find useful. At heart, these Delta Lake features have
more to do with metadata than anything else. The first thing we’ll look at is how
you can use generated columns as part of table definitions to reduce the amount of
insertion or transformation work required for data loading operations. After that,
we’ll look at how Delta Lake metadata helps drive higher data quality standards
and provides richer information to users through constraints and comments. Last,
we’ll share some insight into how deletion vectors can speed up many operations
against applicable tables. Each of these features shows how the power of Delta Lake is
enhanced through well-thought-out uses of table metadata and the transaction log.
All the examples and some other supporting code for this chapter
can be found in the GitHub repository for the book.
You can include two types of generation expressions in a table definition that allow
you to control whether values will always be generated or are generated by default.
Columns that are always generated cannot be overwritten, whereas you can specify
values during insertion operations for columns that are generated by default. Usually,
the choice is to always generate columns because that option is simpler, but you may
have cases in which you wish to explicitly be able to override a generated value with
a specific value. For example, suppose you want to set a transaction at the beginning
of each month to increment the beginning values of keys to the next thousand or
million; you would then use generate by default so you could manually set that initial
monthly transaction. Regardless, if you want to generate columns, you need to add
the generation expression in your original table definition. In the following example,
you can apply a Spark SQL function to an incoming date column to extract the
year as a column. This can also be done to typecast columns or even to create more
complex data structures such as structs out of incoming columns:
-- SQL
CREATE TABLE if not exists summary_cases(
    date DATE,  -- referenced by the year generation expression below
    state STRING,
    fips INT,
    cases INT,
    deaths INT,
    county STRING,
    year INT GENERATED ALWAYS AS (YEAR(date))
)
USING DELTA
One of the most common applications of generated columns is to create identity or
surrogate key columns.1 In the past you’ve been able to do this with other methods,
such as leveraging external libraries to create UUIDs or using hashing methods to
create unique keys. Delta Lake offers some advantages over these methods. By baking
the ability to generate columns into the foundation of the format, you can avoid
running into issues that stem from the nondeterminism of many of these previous
methods and get simpler ID columns that are more human-readable than the results
of the hashing methods.
Defining identity columns is just a slight extension of the generation expressions,
except that there is no required SQL statement to perform some transformation.
Instead, using the IDENTITY keyword triggers some actions behind the scenes that
make this work. What you get is, in essence, a bit of automated tracking that main‐
tains the incremental nature of the identity column(s):
1 If you’d like to further explore the use of various kinds of key-based relationships, we recommend Learning
SQL, 3rd ed., by Alan Beaulieu (O’Reilly), Deciphering Data Architectures by James Serra (O’Reilly), and Data
Management at Scale, 2nd ed., by Piethein Strengholt (O’Reilly).
-- SQL
id BIGINT GENERATED ALWAYS AS IDENTITY
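Placed in context, a complete (and purely illustrative) table definition using this clause might look like the following sketch:
# Python
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_name STRING
    )
    USING DELTA
""")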
These identity columns serve as surrogate keys that can be leveraged throughout your
downstream applications to create primary and foreign key relationships, or poten‐
tially even for slowly changing dimension (SCD) types of tables. It’s relevant to note
that Databricks has a feature to make these primary and foreign key relationships
enforceable via Unity Catalog.2
At an implementation level, there are a few secrets to the recipe to be learned from
the Delta Lake protocol definitions for identity columns. The main takeaway is that
whenever overwriting values is disallowed, a simple monotonic function generates
the values for the column. This means you can also feel assured that the generation
of values is an efficient operation, as it primarily relies on table metadata and simple
integer mathematics.3
There are a couple of things under the covers that you will want to be aware of. First,
when columns are generated using ALWAYS, there is a constraint applied to the table
(you will find more information on constraints later in this chapter). This means
that attempting to provide values for generated columns during insertion operations
will yield an error for your transaction. Second, using generated columns poses some
limitations on usage; for example, you cannot partition a table by a generated iden‐
tity column, and concurrent transactions are disallowed. Last, for identity columns
specifically, you must use the BIGINT type, whereas with other generated columns the
type definition is more flexible, depending on your actual application.
2 See the Databricks documentation for usage examples and additional details.
3 For an extended discussion of the benefits and use of surrogate keys, see the blog post “Identity Columns to
Generate Surrogate Keys Are Now Available in a Lakehouse Near You!” by Franco Patano.
Tags are map objects that contain additional metadata about trans‐
actional operations. They are an optional field in add or remove
files, deletion vector files, and CDC files. When using the check‐
point V2 spec, both the checkpoint and the associated sidecar files
can also have tags. Note that remove actions in the checkpoint are
tombstones used only by VACUUM and do not contain the stats field
or the tags field. These are mostly intended for use at the imple‐
mentation level to support or add new features to a particular Delta
Lake implementation. A common use of tags is to annotate table
properties in different processing engines. These are not explored
in depth here, as most users will not use them, but we want to
mention them, as they are distinctly different from the tags used in
catalogs.
Comments
Comments should be used often and well. There are many kinds of comments you
might wish to include for different kinds of informational purposes. They can convey
important information about ownership or column design information. Possibilities
for types of constructive comments might include:
Instructive
Sometimes when creating different datasets, we make decisions about the layout
that may not be transparent to end users. If a table does not have a unique key
column but requires a combination of multiple columns to have a unique key, we
might wish to capture in the comments for those columns which columns they
can be combined with to have a unique key.
Explanatory
In some cases, it might be useful to annotate the origin of data residing in a
column, its security classification level, its intended users, or information about
the derivation of calculated fields. Denoting the data origin is even more valuable
when Delta Lake is used outside of environments that automatically capture
lineage information. All of these provide enriched information to consumers on
demand and can increase the delivered value of data products. This can be par‐
ticularly useful in cases in which a table includes nonstandard key performance
indicators (KPIs) with a reference to design documentation.
What you include in the comments is ultimately up to you and your organization. We
recommend that you come up with a standard definition for usage and stick with it,
as the many benefits that may potentially be gained from the additional information
can greatly improve the experience.
Here is a quick example showing how you can easily add column comments to a table at the time of creation. This allows you to provide clarifying or explanatory quick notes for
all columns in the table at the same time; all you need to do is include the comments
as part of the table schema definition:
-- SQL
CREATE TABLE example_table (
id INT COMMENT 'uid column',
content STRING COMMENT 'payload column for text'
)
USING delta
Sometimes your initial comments may not be as clear to table consumers as intended.
In those cases, you can update individual column comments to refine them. This
also gives you the flexibility you might need to include additional information not
available at the time of table creation:
-- SQL
ALTER TABLE example_table
ALTER COLUMN id
COMMENT 'unique id column'
One last area that can be rather useful in many cases is adding transactional com‐
ments to table changes. You can set this option during individual operations as part
of the table options when using the Python API, set it for a session in SQL and
reuse it until you are done with those updates, or change it as many times as needed
throughout the session.
When using the Python API as a table option, you just want to set the userMetadata
option with your custom metadata:
# Python
(spark
    .read
    .table(<source>)
    .write
    .format("delta")
    .option("path", <destination>)
    .option("userMetadata", "custom commit metadata for the creation operation")
    .save()
)
Tables using writer version 7 and above need to have the feature
name checkConstraints in the writerFeatures. Versions 3–6,
however, always support CHECK constraints.
CHECK constraints are stored in your table metadata just like userMetadata and
column comments. They are stored as key-value pair objects. You can see the
value of any constraint for a particular table by name under the attribute delta
.constraints.<name>. The value is stored as a SQL expression that will return a
Boolean value. Because of this expression’s nature, the columns specified in the
expression must exist in the table. All rows in the table must satisfy the constraint
expression by returning true when expressions are evaluated.
When you add a constraint to a table, it will check the existing table data to make
sure it is compliant. In cases where it is not, the ALTER TABLE execution will fail.
Similarly, when writing data to the table after a constraint has been added, every row being written must satisfy the constraint expression, or the write will fail.
4 For an extended overview, we suggest Matt Powers’s blog post on constraints for Delta Lake.
It’s important to note here that only CHECK constraints added via ALTER TABLE com‐
mands will be represented in the table metadata, but you can feel assured that the null
constraints set at creation will also be effective. Setting constraints via ALTER TABLE
is also relatively straightforward. Consider the following example you could use to
ensure you have nonnegative ID column values:
-- SQL
ALTER TABLE example_table
ADD CONSTRAINT id CHECK (id > 0)
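Once added, you can confirm that the constraint is recorded in the table metadata under delta.constraints.<name>; a quick way to check is:
# Python
# The CHECK constraint appears as the property delta.constraints.id
spark.sql("SHOW TBLPROPERTIES example_table").show(truncate=False)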
Whichever way you choose to set various constraints, they are an effective way to
increase data quality and enhance confidence in your data platform.
Deletion Vectors
Sometimes we can look at a problem and think of different ways to solve it. A
feature in Delta Lake called deletion vectors is a great example of this idea. Chapter 10
provides a look at several ways you can optimize for either the table readers or the
table writers for performance and the trade-offs you might need to make in that
process. While deletion vectors certainly have a place in that discussion, they also
deserve treatment and investigation as one of the advanced features in Delta Lake,
so they are referenced here instead. The reason for this is that the way they work
introduces a new concept that deserves a bit of explanation. Another reason is that
the term deletion vector defines more the form and function of what the process does
rather than how it helps you as a feature. One of the key benefits is that it gives
you the ability to do a Merge-on-Read (MoR) operation. It dramatically reduces the
5 For an explanation of the relationship between constraints and nullability in Spark, as well as additional
examples, see Matt Powers’s blog post.
Merge-on-Read
What does Merge-on-Read mean? It means that, rather than going through the oper‐
ation of rewriting a file at the time of deleting a record or set of records from a
particular file, you instead make some kind of a note that the record or records are
deleted. Thus you get to postpone the performance impact of actually performing
the delete operation until a later time. Usually you will do this when you can run an
OPTIMIZE operation or a more complicated UPDATE statement. With columnar files
(Parquet, Delta, Iceberg, etc.), row-level deletions invoke relatively expensive rewrite
operations of entire files containing those rows. Of course, if someone were to read
the table after a Merge-on-Read operation has been initiated, the merge happens during that read operation. That is rather the point: it lets us minimize the performance impact of a simple delete operation and defer it to a later time, when we are already filtering the same set of files as we read them. In other cases, we can avoid performing the deletions at all in situations where they don't need to happen straight away, which further allows us to push multiple (or many) deletes into a single large batch later.
Deletion vectors are a way to get this kind of Merge-on-Read behavior.6 Put simply,
deletion vectors are just a file (or multiple files) adjacent to a data file that allows you
to know which records are to be deleted out of the data file and to save the delete
(rewrite) operation for a later point in time that is more efficient and convenient.
Adjacent in this case is relative: deletion vector files are part of the larger set of files
that make up a Delta Lake table, but in partitioned tables you will notice that the
deletion vector files sit at the top directory level rather than within the partition
directories. You can observe this in the coming examples.
We might call a deletion vector file a sidecar file since it is a file that
sits alongside the other files in a table. In Delta Lake, however, we
would want to distinguish this from sidecar files that are a formal
component of the V2 checkpoint specification and that specify add
or remove file operations.
For most cases in which performance is being optimized for the Delta Lake writer
operations, deletion vectors present a unique opportunity to reduce latency in the
write operations, as their use avoids cases of rewriting files where otherwise there is
no data change. This does come at a small cost of an additional filtering operation
6 There’s an excellent blog post by Nick Karpov exploring deletion vectors in great detail.
This example should help you to understand the nature of how the deletion vectors
are operating. It’s worth mentioning here that the original table creation does not
need to occur in an environment that supports deletion vectors, but once the feature
is enabled, read and write operations will be subject to the aforementioned version
constraints.
First, create the reduced-size table; this makes it easier to view all the files simultane‐
ously:
# Python
from pyspark.sql.functions import col

# Write a reduced, Florida-only copy of the source data
(
    spark
    .read
    .load("rs/data/COVID-19_NYT/")
    .filter(col("state") == "Florida")
    .write
    .format("delta")
    .save("nyt_covid_19/")
)

# Register the reduced data as a managed table, partitioned by county
# (matching the directory layout inspected below)
(
    spark
    .read
    .load("nyt_covid_19/")
    .write
    .mode("overwrite")
    .partitionBy("county")
    .format("delta")
    .saveAsTable("nyt")
)
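Deletion vectors are enabled on the table itself through a table property. A minimal sketch of turning the feature on for the nyt table created above:
# Python
spark.sql("""
    ALTER TABLE nyt
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
As noted below, this property change itself commits a new version to the table's _delta_log.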
Next, identify a single record from the table as a deletion target (for partition-level
delete operations, you will be able to use any partition value):
# Python
spark.sql("""
select
date,
county,
state,
count(1) as rec_count
from
nyt
where
county="Pinellas"
and
date="2020-03-11"
group by
date,
county,
state
order by
date
""").show()
+----------+--------+-------+---------+
|      date|  county|  state|rec_count|
+----------+--------+-------+---------+
|2020-03-11|Pinellas|Florida|        1|
+----------+--------+-------+---------+
Using tree or a file browser, verify the table structure before making any further
changes. Since the table data was partitioned by county, you will see four resulting
partition directories. Also, when the deletion vectors feature was enabled, it incremented the table version and added a transaction to the _delta_log subdirectory.
This allows for traceability across table transactions, which is useful if something
downstream is not working right later:
# BASH
!tree spark-warehouse/nyt/
spark-warehouse/nyt/
├── county=Hillsborough
│ └── part-00000-6cf1fac7-1237-48b5-a7ca-ce824054a997.c000.snappy.parquet
├── county=Pasco
│ └── part-00003-dc22f540-c7f7-449c-8dc1-816f0f357075.c000.snappy.parquet
├── county=Pinellas
│ └── part-00001-42060e31-83e8-48d2-9174-02325ca5e686.c000.snappy.parquet
├── county=Sarasota
│ └── part-00002-dfb35d92-25bc-4caf-8aa0-1228143444a7.c000.snappy.parquet
└── _delta_log
├── 00000000000000000000.json
└── 00000000000000000001.json
Apply a single deletion against the previously identified record:7
# Python
spark.sql("""
delete from
nyt
where
county='Pinellas'
and
date='2020-03-11'
""").show()
7 Adding the show command at the end of the delete operations yields the number of affected rows in your
output; otherwise you don’t see this until checking the transaction log.
spark.sql("""
delete
from
nyt
where
date='2020-03-13'# This has records in multiple partitions
""").show()
Inspect the files again. Notice that only one new deletion vector appears in this case:
# BASH
!tree spark-warehouse/nyt/
spark-warehouse/nyt/
├── county=Hillsborough
│ └── part-00000-6cf1fac7-1237-48b5-a7ca-ce824054a997.c000.snappy.parquet
├── county=Pasco
│ └── part-00003-dc22f540-c7f7-449c-8dc1-816f0f357075.c000.snappy.parquet
├── county=Pinellas
│ └── part-00001-42060e31-83e8-48d2-9174-02325ca5e686.c000.snappy.parquet
├── county=Sarasota
Successful engineering initiatives begin with a clear vision and sense of purpose (what
we are doing and why) as well as with a solid design and architecture (how we plan
to achieve the vision). Combining a thoughtful plan with the right building blocks
(tools, resources, and engineering capabilities) ensures that the final result reflects
the mission and performs well at scale. Delta Lake provides key building blocks
that enable us to design, construct, test, deploy, and maintain enterprise-grade data
lakehouses.
Our goal for this chapter is not just to offer a collection of ideas, patterns, and
best practices but to offer you a field guide. We’ve provided the right information,
reasoning, and mental models so that the lessons learned here can coalesce into clear
blueprints for architecting your own data lakehouse. Whether you are new to the
concept of the lakehouse, unfamiliar with the medallion architecture for incremental
data quality, or attempting your first foray into working with streaming data, we’ll
take this journey together.
What we’ll learn:
The Lakehouse Architecture
If successful engineering initiatives begin with a clear vision and purpose, and our
goal is ultimately to lay the foundation for our own data lakehouses, then we’ll need
to first define what a lakehouse is.
What Is a Lakehouse?
The lakehouse is an open data management architecture that combines the flexibility, cost
efficiency, and scale of the data lake with the data management, schema enforcement, and
ACID transactions of the traditional data warehouse.
—Databricks
There is a lot to unpack from this definition—namely, assumptions are being made
that require some hands-on experience, or shared mental models, from both an
engineering and a data management perspective. Specifically, the definition assumes a
familiarity with data warehouses and data lakes, as well as with the trade-offs people
must make when selecting one technology over another. The following section will
cover the pros and cons of each choice and describe how the lakehouse came to be.
The history and myriad use cases shared across the data warehouse and data lake
should be second nature for anyone who has previously worked in roles spanning the
delivery and consumption spaces. For those of you who are just setting out on your
data journey, are transitioning from data warehousing, or have only worked with data
in a data lake, this section is also for you.
To understand where the lakehouse architecture evolved from, we’ll need to be able to
answer the following:
• If the lakehouse is a hybrid architecture combining the best of the data lake and
the data warehouse, then mustn’t it be better than the sum of its parts?
• Why does the flexibility, cost efficiency, and unbounded data scaling inspired by
traditional data lakes matter for all of us today?
• Why do the benefits of the data lake only truly matter when coupled with the
benefits of schema enforcement and evolution, ACID transactions, and proper
data management, as inspired by traditional data warehouses?
1. Extract operational data from siloed sources for writing into landing zones
(/raw/*).
2. Read, clean, and transform the data from /raw and write the changes to /cleansed.
3. Read from /cleansed (could do additional joining and normalizing with other
data) before writing out to the data warehouse.
As long as the workflow completes, the data in the data lake will always be in sync
with the warehouse. This pattern also enables support for unloading or reloading
tables to save cost in the data warehouse. This makes sense in hindsight.
To support direct read access to the data, the data lake is required for machine learning use cases, while the data warehouse is required to support business and analytical processing. However, the added complexity inadvertently
puts a greater burden on data engineers to manage multiple sources of truth as well as
the cost of maintaining multiple copies of all the same data (one or more times in the
data lake, and once in the data warehouse) and the headache of figuring out what data
is stale, and where and why.
If you have ever played the game Two Truths and a Lie, this is the architectural
equivalent, but rather than it being a fun game, the stakes are much higher; this is,
after all, our precious operational data. Having two sources of truth by definition
means that the systems can (and probably will) be out of sync, each telling its own
version of the truth. This also means each source of truth is also lying. They just
aren’t aware.
So the question is still up in the air: what if you could achieve the best of both worlds
and efficiently combine the data lake and the data warehouse? Well, that is where the
data lakehouse was born.
Figure 9-2. The data lakehouse provides a common interface for BI and reporting while
ensuring that data science and machine learning workflows are supported in a single,
unified way
• Transaction support
• Schema enforcement and governance/audit log and data integrity
By merging the best of both worlds, we gain a single system that data teams can
use to move faster, as they can work with data for its explicit purpose without needing
to access multiple systems (which always increases complexity). The dissolution of
boundaries between the data warehouse and the data lake also makes it easier to
utilize a single source of table truth. When compared against the dual-tier architec‐
ture, this is a huge win. This also prevents the problem of figuring out which side
(warehouse or lake) has the correct data, who isn’t in sync, and all the costly work
involved to come up with a straight answer. The benefits also ensure teams have the
most complete and up-to-date data available for data science, machine learning, and
business analytics projects.
You can enable Iceberg or Hudi support using the Delta table properties:
% 'delta.universalFormat.enabledFormats' = 'iceberg, hudi'
You can create a table with support for Iceberg and Hudi as follows:
% CREATE TABLE T(c1 INT) USING DELTA TBLPROPERTIES(
'delta.universalFormat.enabledFormats' = 'iceberg, hudi');
Or to add support for Iceberg after table creation:
ALTER TABLE T SET TBLPROPERTIES(
'delta.columnMapping.mode' = 'name',
'delta.enableIcebergCompatV2' = 'true',
'delta.universalFormat.enabledFormats' = 'iceberg');
UniForm works by asynchronously generating the metadata for our Iceberg or Hudi
tables after each successful Delta transaction.
Transaction Support
Support for transactions is critical whenever data accuracy and sequential insertion
order are important. Arguably this is required for nearly all production cases, and we should treat it as a minimum bar to meet at all times. Transactions do add checks and balances, and if, for example, multiple writers are making changes to a table, there will always be the possibility of collisions. Understanding the behavior of the distributed Delta transaction protocol means we know exactly which write should win and how, and we can guarantee that the insertion order of the data is exact for reads.
Serializable writes
Delta provides ACID guarantees for transactions while enabling multiple concurrent
writers using a technique called write serialization. When new rows are simply being
appended to the table, as with INSERT operations, the table metadata doesn’t need
% ./run-some-batch-job.py \
--startTime x \
--recordsPerBatch 10000 \
--lastRecordId z
With Delta Lake, we can ignore the startTime and lastRecordId and simply use the
startingVersion of the transaction log. This provides a specific point for us to read
from. Example 9-2 shows the modified job.
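As a minimal sketch of such an incremental read (this is not Example 9-2 itself; the table path and starting version are assumptions for illustration):

# Python
# Resume from a known version of the transaction log rather than tracking
# startTime/lastRecordId ourselves; path and version number are assumptions.
incremental_stream = (
    spark.readStream.format("delta")
    .option("startingVersion", 10)
    .load("/data/delta/ecomm_raw"))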
While there may not be a clear “Aha!” moment with this example, the power of
incremental processing with Delta is that there is a transaction log that informs us of
all the changes that happened on a table since our last run.
Schema-on-write
Because Delta Lake supports schema-on-write and declarative schema evolution,
the onus of being correct falls to the producers of the data for a given Delta Lake
table. However, this doesn’t mean that anything goes just because you wear the
producer-of-the-data hat. Remember that data lakes only become data swamps due to
a lack of governance. With Delta Lake, the initial successful transaction committed
automatically sets the stage for identifying the table columns and types. With a
governance hat on, we now must abide by the rules written into the transaction
log. This may sound a little scary, but rest assured, it is for the betterment of the
data ecosystem. With clear rules around schema enforcement and proper procedures
in place to handle schema evolution, the rules governing how the structure of a
table is modified ultimately protect the consumers of a given table from problematic
surprises.
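As a rough sketch of what that enforcement looks like in practice (the DataFrame name and table path are assumptions), a producer whose write adds columns not present in the table schema must explicitly opt in to schema evolution, or the transaction is rejected:

# Python
# new_df and the path are assumptions; without mergeSchema the mismatched
# append fails, which is the enforcement doing its job.
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/delta/ecomm_raw"))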
Schema-on-read
Data lakes use the schema-on-read approach because there is no consistent form
of governance or metadata native to the data lake, which is essentially a glorified
distributed filesystem. While schema-on-read is flexible, its flexibility is also why data
lakes are categorized as being like the Wild West—ungoverned, chaotic, and, more
often than not, problematic.
What this means is that when there is data in some location (directory root) with
some filetype (JSON, CSV, binary, Parquet, text, or other), with the ability of files
being written to a specific location to grow unbounded, there is a high potential for
problems to grow as the dataset ages.
As a consumer of the data in the data lake at a specific location, you may be able
to extract and parse the data, if you’re lucky—it may even have some kind of docu‐
mentation, if you’re really lucky—and with enough lead time and compute, you can
probably accomplish your job. Without proper governance and type safety, however, the data lake can quickly grow to multiple terabytes (or petabytes, if you love burning money) of what is essentially data garbage, kept around only because storage is cheap. While this is an extreme statement, it is also a reality in many data organizations.
1 This is true for common append-style writes to the table. Other operations such as overwriting a table or
deleting the table can affect streaming applications.
Figure 9-3. The Delta Sharing Protocol is the industry’s first open protocol for secure data
sharing, making it simple to share data with other organizations, regardless of which
computing platforms they use
Figure 9-4. The medallion architecture is a procedural framework providing quality gates
and tiers from the point of ingestion onward to the purpose-built curated data product
The medallion architecture provides a flexible framework for dealing with progres‐
sive enhancement of data in a structured way. It is worth pointing out that, while
it is common to see three tiers (bronze, silver, gold), there is no rule stating that
all use cases require three tiers. It may be that more mature data practitioners will
have a two-tier system in which golden tables are joined with other golden tables to
create even more golden tables. So the separation between silver and gold or between
2 Remember that anything containing user data must be captured and processed according to the end-user
agreed-upon consent and according to data governance bylaws and standards.
The extreme minimal approach applied in Example 9-3 takes only the information
needed to preserve the data as close to its raw form as possible. This technique puts
the onus on the silver layer to extract and transform the data from the value column.
While we are creating a minor amount of additional work, this bare-bones approach preserves the ability to reprocess (reread) the raw data exactly as it landed from Kafka without worrying about the data expiring (which can lead to data loss); most Kafka retention periods, after which records are deleted, fall between 24 hours and 7 days.
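As a hedged sketch (this is not the book's Example 9-3; the broker address, topic, checkpoint path, and table name are all assumptions), such a minimal bronze ingestion from Kafka might look like this:

# Python
bronze_query = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "ecomm_events")                # assumed topic
    .load()
    # keep the payload as close to raw as possible; parsing happens in silver
    .select("key", "value", "topic", "partition", "offset", "timestamp")
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/ecomm_raw")     # assumed path
    .toTable("bronze.ecomm_raw"))                       # assumed table name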
In cases in which we are reading from an external database, such as Postgres, the
minimum schema is simply the table DDL. We already have explicit guarantees and
row-wide expected behavior given the schema-on-write nature of the database, and
thus we can simplify the work required in the silver layer when compared to the
example shown in Example 9-3.
As a rule of thumb, if the data source has a type-safe schema (Avro, Protobuf), or
the data source implements schema-on-write, then we will typically see a significant
reduction in the work required in the bronze layer. This doesn’t mean we can blindly
write directly to silver either, since the bronze layer is the first guardian blocking
unexpected or corrupt rows of data from its progression toward gold. In cases where we are importing non-type-safe data, as with CSV or JSON, the bronze tier is incredibly important for weeding out corrupt and otherwise problematic data.
We then append the _corrupt field to our schema. This provides a container for our bad data to sit in: for any given row, the _corrupt column is either null or contains a value. The data can then be read using a filter such as where(col("_corrupt").isNotNull()), or its inverse, to separate the good rows from the bad.
While the bronze layer may feel limited in scope and responsibility, it plays an
incredibly important role in debugging and recovery, and as a source for new ideas in
the future. Due to the raw nature of the bronze layer tables, it is also inadvisable to
broadcast the availability of these tables widely. There is nothing worse than getting
paged or called into an incident for issues arising from the misuse of raw tables.
Example 9-5. Filtering, dropping, and transformations—all the things needed for writing
to silver
% medallion_stream = (
    delta_source.readStream.format("delta")
    .options(**reader_options)
    .load()
    .transform(transform_from_json)
    .transform(transform_for_silver)
    .writeStream.format("delta")
    .options(**writer_options)
    .option("mergeSchema", "false"))

streaming_query = (
    medallion_stream
    .toTable(f"{managed_silver_table}"))
The pipeline shown in Example 9-5 reads from the bronze Delta table (from Exam‐
ple 9-3) and decodes the binary data received (from the value column), while also
enabling permissive mode, which we explored in Example 9-4:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType

def transform_from_json(input_df: DataFrame) -> DataFrame:
    return input_df.withColumn("ecomm",
        from_json(
            col("value").cast(StringType()),
            known_schema,
            options={
                'mode': 'PERMISSIVE',
                'columnNameOfCorruptRecord': '_corrupt'
            }
        ))
Then a second transformation is required as we make preparations for writing into
the silver layer. This is a minor secondary transformation removing any corrupt rows
and applying aliasing to declare the ingestion date and timestamp, which could be different from the event timestamp and date:
def transform_for_silver(input_df: DataFrame) -> DataFrame:
    return (
        input_df.select(
            col("event_date").alias("ingest_date"),
            col("timestamp").alias("ingest_timestamp"),
            col("ecomm.*")
        )
        .where(col("_corrupt").isNull())
        .drop("_corrupt"))
Figure 9-5. Each layer of the medallion architecture can be simple or complex; it can
be easier to visualize the transformation of data across a lineage in terms of what is
internal (left-hand side of the figure) and what is external (right-hand side)
Being able to view the lineage between bronze, silver, and gold can help provide
additional context as the number of tables and views increases, and as the total data
products and their owners naturally grow over time. We cover lineage in more detail
in Chapter 13.
% pyspark
# note: importing min/max here shadows the Python builtins within this snippet
from pyspark.sql.functions import avg, col, count, desc, max, min

silver_table = spark.read.format("delta")...
top5 = (
    silver_table
    .groupBy("ingest_date", "category_id")
    .agg(
        count(col("product_id")).alias("impressions"),
        min(col("price")).alias("min_price"),
        avg(col("price")).alias("avg_price"),
        max(col("price")).alias("max_price")
    )
    .orderBy(desc("impressions"))
    .limit(5))

(top5
    .write.format("delta")
    .mode("overwrite")
    .options(**view_options)
    .saveAsTable(f"gold.{topN_products_daily}"))
Example 9-6 shows how to do daily aggregations. It is typical for reporting data to
be stored in the gold layer. This is the data we (and the business) care most about.
It is our job to ensure that we provide purpose-built tables (or views) to ensure that
business-critical data is available, reliable, and accurate.
For foundational tables—and really with any business-critical data—surprise changes
are upsetting and may lead to broken reporting as well as to inaccurate runtime
inference for machine learning models. This can cost the company more than just
money; it can be the difference in whether or not the company retains its customers
and reputation in a highly competitive industry.
The gold layer can be implemented using physical tables or virtual tables (views).
This provides us with ways of optimizing our curated tables that result in either a
full physical table when not using a view, or simple metadata providing any filters,
column aliases, or join criteria required when interacting with the virtual table. The
performance requirements will ultimately dictate the usage of tables versus views, but
a view is good enough to support the needs of many gold layer use cases.
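As a sketch of the virtual approach (all table and view names here are assumptions rather than the book's examples), a gold view stores only metadata while the underlying silver table holds the data:

# Python
spark.sql("""
  CREATE OR REPLACE VIEW gold.top5_categories_daily AS
  SELECT ingest_date, category_id, count(product_id) AS impressions
  FROM silver.ecomm_events
  GROUP BY ingest_date, category_id
  ORDER BY impressions DESC
  LIMIT 5
""")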
Now that we’ve explored the medallion architecture, the last stop on our journey will
be to dive into patterns for decreasing the effort level and time requirements from the
point of data ingestion to the time when the data becomes available for consumption
by downstream stakeholders at the gold edge.
While we’ve already looked at patterns to refine data using the medallion architecture
to remove imperfections, adhere to explicitly defined schemas, and provide data
checks and balances, what we didn’t cover was how to provide a seamless flow for
transformations from bronze to silver and silver to gold.
Time tends to get in the way more often than not: with too little time, there is not enough information to make informed decisions, and with too much time, there is a tendency to become complacent and sometimes even a little lazy. Time is therefore something of a Goldilocks problem, especially when we concern ourselves with reducing the end-to-end latency of data traversing our lakehouse. In the next section, we will look at common patterns for reducing the latency of each tier within the medallion architecture, focusing on end-to-end streaming.
As we've seen throughout the book, the Delta protocol supports both batch and streaming access to tables. We can build our pipelines to take specific steps ensuring that the datasets they output meet our quality standards. That, in turn, lets downstream consumers trust their upstream sources and lets us drastically reduce the end-to-end latency from data ingestion (bronze), through refinement (silver), and ultimately to the business or data product owners in the gold layer.
By crafting our pipelines to block and correct data quality problems before they
become more widespread, we can use the lessons learned across Examples 9-3
through 9-6 to stitch together end-to-end streaming workflows.
Figure 9-6 provides an example of the streaming workflow. Data arrives from our
Kafka topic, as we saw in Example 9-3. The dataset is then appended to our bronze
Delta table (ecomm_raw), which enables us to pick up the incremental changes in
our silver application. The example providing the transformations was shown in
Example 9-5. Last, either we create and replace temporary views (or materialized
views in Databricks), or we create another golden application with the responsibility
Figure 9-6. Streaming medallion architecture as viewed from the workflow level
There are many ways to orchestrate end-to-end workflows using scheduled jobs or
full-fledged frameworks like Apache Airflow, Databricks Workflows, or Delta Live
Tables. The end result provides us with reduced latency from the edge all the way to
our most important and business-critical golden tables.
Up to this point, you’ve explored various ways of working with Delta Lake. You’ve
seen many of the features that make Delta Lake a better and more reliable choice
as a storage format for your data. Tuning your Delta Lake tables for performance,
however, requires a solid understanding of the basic mechanics of table maintenance,
which was covered in Chapter 5, as well as a bit of knowledge about and practice
at manipulating or implementing some of the internal and advanced features intro‐
duced in Chapter 8. This performance side becomes the focus now, as we’ll look at
the impact of pulling the levers of some of those features in a bit more detail. We
encourage you to review the topics laid out in Chapter 5 if you have not recently used
or reviewed them.
In general, you will often want to maximize reliability and the efficiency with which
you can accomplish data creation, consumption, and maintenance tasks without
adding unnecessary costs to your data processing pipelines. By taking the time to
optimize your workloads properly, you can balance the overhead costs of these tasks
with various performance considerations to align with your objectives. What you
should be able to gain here is an understanding of how tuning some of the features
you’ve already seen can help to achieve your objectives.
First, there’s some background work to provide a bit of clarity about the nature
of your objectives. After that, there is an exploration into several of Delta Lake’s
features and how they impact these objectives. While Delta Lake can generally be
used suitably with limited changes, when you think about the requirements put on
modern data stacks, you should realize that you could always do better. In the end,
taking on performance tuning involves striking balances and considering trade-offs
to gain advantages where you need them. Because of this, it is best to make sure
you think about what other settings are affected when you consider modifying some
parameters.
Performance Objectives
One of the biggest factors you need to consider is whether you want to optimize primarily for data producers or for data consumers. As discussed in Chapter 9, the medallion
architecture is an example of a data architecture that allows you to optimize for both
reading and writing where needed through data curation layers. This separation of
processes helps you to streamline the process at the point of data creation and at
the point of consumption by focusing on the goals of each at different points in the
pipeline. Let’s first consider some of the different objectives toward which you might
want to orient your tuning efforts.
Point queries
A point query is a query submitted by a data consumer, or user, with the intention of
returning a single record from a dataset. For example, a user may access a database
to look up individual records on a case-by-case basis. Such users are less likely to
use advanced query patterns involving SQL-based join logic or advanced filtering
conditions. Another example is a robust web-server process retrieving results pro‐
grammatically and dynamically on a case-by-case basis. These queries are more likely
to be evaluated with higher levels of scrutiny concerning perceived performance
1 If you wish to read more about data modeling and ER diagrams, check out Appendix A in Learning SQL,
3rd ed., by Alan Beaulieu (O’Reilly), or see the Wikipedia pages for data modeling and the entity–relationship
model.
metrics. In both scenarios there is a human at the other end who is impacted by
the query’s performance, so you want to avoid any delays in record lookup without
incurring high costs. This could mean that in some scenarios—such as the latter one,
potentially—a high-performance, dedicated transactional system is required to meet
latency requirements; this is often not the case, however, and through the tuning
methods seen here you may be able to meet targets adequately without the need for
secondary systems.
One of the things you’ll consider is how things like file sizes, keys or indexing, and
partitioning strategies can impact point query performance. As a rule of thumb, you
should tend to steer toward smaller file sizes and try to use features such as indexes
that reduce latency when searching for a needle in a haystack, even if the haystack
is an entire field. You’ll also see how statistics and file distribution impact lookup
performance.
Range queries
A range query retrieves a set of records instead of retrieving a single record result
like in a point query (which you can think of as just a special case with narrow
boundaries). Rather than having an exact filter-matching condition, you’ll find that
range queries look for data within boundaries. Some common phrases that suggest
such situations might be:
• Between
• At least
• Prior to
• Such that
Lots of other phrases are possible, but the general idea is that many records could
satisfy such a condition (though it’s still possible to wind up with just a single
record). You will still encounter range queries when you use exact matching criteria
describing broad categories, such as selecting cats as the type of animal from a list of
pet species and breeds—you would have only one species but many different breeds.
In other words, the result you look to obtain with a range query will generally be
greater than one. Usually, you wouldn’t know the specific number of records without
adding some ordering element and further restricting the range.
Aggregations
On the surface, an aggregation query is similar to a range query, except that, instead of
selecting down to a particular set of records, you’ll use additional logical operations
to perform some operation on each group of records. Borrowing from the pets
example, you might want to get a count of the number of breeds per species or some
other summary type of information. In such cases, you'll often see some type of grouping column paired with summary functions such as counts, sums, or averages.
Trade-offs
As has been noted, many of the constraints on your write processes will be deter‐
mined by the producer systems. If you are thinking of large file-based ingestion or
event- or microbatch-level stream processing, then the size and number of transac‐
tions will vary considerably. Similarly, whether you are working with a single-node Python application or a larger distributed framework, you will see similar variance. You
will also need to consider the amount of time required for processing, as well as
the cadence. Many of these things must be balanced, and so again, the medallion
architecture lends a hand, because you can separate some of these concerns by
optimizing for your core data-producing process at the bronze level and for your data
consumers at the gold level, with the silver level forming a kind of bridge between
them. Refer back to Chapter 9 if you want to review the medallion architecture.
Conflict avoidance
How frequently you perform write operations can limit when you can run table
maintenance operations—for example, when you are using Z-Ordering. If you are
using Structured Streaming with Apache Spark to write microbatch-level transactions
to a Delta Lake table partitioned by the hour, then you have to consider the impacts
of running other processes against that partition while it is still active.2 How you
choose options like autocompaction and optimized writes also impacts when or
whether you need to run additional maintenance operations. Building indexes takes
time to compute and could conflict with other processes too. It’s up to you to make
sure you avoid conflicts when needed, though it is much easier to do so than it was
with things like read/write locks involved in every file access.
Performance Considerations
So far you’ve seen some of the criteria on which you’ll want to base much of your
decision making as far as how you interact with Delta Lake. You have many different
tools built in, and how you use them usually will depend on how a particular table is
interacted with. Our goal now is to look at the different levers you can pull and think
about how the way you set different parameters can be better for any of the above
cases. Some of this will review concepts discussed in Chapter 6 in the context of data
producer/consumer trade-offs.
2 You can find detailed descriptions, including error messages, in the “Concurrency Control” section of the
Delta Lake documentation.
Structure
The easiest way to think about what partitioning does is that it breaks a set of
files into sorted directories tied to your partitioning column(s). Suppose you have a
customer membership category column in which every customer record will fall into
either a “paid” membership or a “free” membership, as in the following example. If
you partition by this membership_type column, then all the files with “paid” member
records will be in one subdirectory, while all the files with “free” member records will
be in a second directory:
# Python
from deltalake.writer import write_deltalake
import pandas as pd
df = pd.DataFrame(data=[
    (1, "Customer 1", "free"),
    (2, "Customer 2", "paid"),
    (3, "Customer 3", "free"),
    (4, "Customer 4", "paid")],
    columns=["id", "name", "membership_type"])

write_deltalake(
    "/tmp/delta/partitioning.example.delta",
    data=df,
    mode="overwrite",
    partition_by=["membership_type"])
3 For a more in-depth look at the Hive side of data layouts, see Programming Hive by Edward Capriolo, Dean
Wampler, and Jason Rutherglen (O’Reilly).
All the examples and some other supporting code for this chapter
can be found in the GitHub repository for the book.
/tmp/delta/partitioning.example.delta
├── _delta_log
│ └── 00000000000000000000.json
├── membership_type=free
│ └── 0-9bfd1aed-43ce-4201-9ef0-1d6b1a42db8a-0.parquet
└── membership_type=paid
└── 0-9bfd1aed-43ce-4201-9ef0-1d6b1a42db8a-0.parquet
The following section can help you figure out when (or when not to) partition tables
and the impact such decisions bear on other performance features, but understanding
the larger partitioning concept is important, as even if you don’t choose to partition
tables yourself, you could inherit ownership of partitioned tables from someone
who did.
Pitfalls
There are some cautions laid out for you here with regard to the partitioning struc‐
ture in Delta Lake (remember the table partitioning rules from Chapter 5!). Your
decision about the actual file sizes to use will be impacted by what kind of data
consumers will use the table, but the way you partition your files has downstream
consequences too. Generally, you will want to make sure that the total amount of
data in a given partition is at least 1 GB, and you don’t want partitioning at all for
total table sizes under 1 TB. Anything less, and you can incur large amounts of
unnecessary overhead with file and directory listing operations, especially if you are
using Delta Lake in the cloud.4 This means that if you have a high cardinality column,
you should not use it as a partitioning column in most cases unless the sizing is
still appropriate. In cases in which you need to revise the partitioning structure, you
should use methods such as those outlined in Chapter 5 to replace the table with a
more optimized layout. Overpartitioning tables is a problem that has been seen as
4 See more on this in the whitepaper “Delta Lake: High-Performance ACID Table Storage over Cloud Object
Stores”.
File sizes
One direct implication of overpartitioning is that file sizes often turn out to be
too small. File sizes of about 1 GB are recommended to handle large-scale data
processes with relative ease. There are many cases, however, in which leveraging
smaller file sizes, typically in the 32 MB to 128 MB range, can have performance
benefits for read operations. A decision about the optimal file size comes down to
the nature of the data consumer. High-volume append-only tables in the bronze
layer generally function better with larger file sizes, as the larger sizes maximize
throughput per operation with little regard to anything else. The smaller sizes will
help a lot more with finer-grained read operations such as point queries, or in cases
in which you have lots of merge operations, because of the higher number of file
rewrites generated.
In the end, file size will often wind up being determined by the way you apply
maintenance operations. When you run OPTIMIZE, and in particular when you run it
with the included Z-Ordering option, you’ll see that it affects your resulting file sizes.
You do, however, have a couple of base options for trying to control the file sizes.
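One of those options is the delta.targetFileSize table property, discussed further in the autotuning section below. As a sketch (the table name is hypothetical, and support for this property varies by engine, so check your runtime's documentation):

# Python
spark.sql("""
  ALTER TABLE example.events
  SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')
""")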
Table Utilities
You’re probably pretty familiar with some version of the small files problem. While
it was originally a condition largely affecting elephantine MapReduce processing,
the underlying nature of the problem extends to more recent large-scale distributed
processing systems as well.5 In Chapter 5, you saw the need to maintain your Delta
Lake tables and some of the tools available to do that. One of the scenarios covered
was that for streaming use cases, where the transactions tend to be smaller, you
need to make sure you rewrite those files into bigger ones to avoid a similar small
files problem. Here you’ll see how leveraging these tools can affect read and write
performance while interacting with Delta Lake.
OPTIMIZE
The OPTIMIZE operation on its own is intended to reduce the number of files con‐
tained in a Delta Lake table (recall the exploration in Chapter 5). This is true in
particular of streaming workloads, where you may have microbatches creating files
and commits measured in just a couple of MB or less, and thus you can wind up with
many comparatively small files. Compaction is a term used to describe the process of
packing smaller files together, and it’s one that is often used when talking about this
5 If you’re not familiar with this problem, the blog post “The Small Files Problem” is probably worth a read.
operation. One of the most common performance implications of compaction is the
failure to do it. While there could be some minute benefits to such small files (like
rather fine-grained column statistics), these are generally heavily outweighed by the
costs of listing and opening many files.
How it works is that when you run OPTIMIZE, you kick off a listing operation that
lists all the files that are active in the table and their sizes. Then any files that can
be combined will be combined into files around the target size of 1 GB. This helps
to reduce issues that might occur from, for example, several concurrent processes
committing smaller transactions to the same Delta Lake destination. In other words,
OPTIMIZE is a mechanism to help avoid the small files problem.
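As a minimal sketch (hypothetical table and partition column; the WHERE clause must reference partition columns), you can scope compaction to recent data so each run does not rewrite the entire table:

# Python
spark.sql("OPTIMIZE example.events WHERE ingest_date >= '2024-01-01'")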
Remember, there is some overhead to the operation; it has to read multiple files
and combine them into the files that eventually get written, so it is a heavy I/O
operation. Removing the file overhead is part of what helps to improve the read time
for downstream data consumers. If you are using an optimized table downstream as a
streaming source, as you explored in Chapter 9, the resulting files are not data change
files and are ignored.
It’s important to recall that there are some file size settings with OPTIMIZE that
you can tweak to tune performance more to your preference. These settings and
their behavior are covered in depth in Chapter 5. Next, we take a deeper look at
Z-Ordering, which is instructive even if you’re planning on using liquid clustering, as
the underlying concepts are strongly related.
Z-Ordering
Sometimes the way you insert files or model the data you’re working with will
provide a kind of natural clustering of records. Say you insert one file to a table from
something like customer transaction records, or you aggregate playback events from
a video device every 10 minutes. Then say you want to go back an hour later to
compute some KPIs from the data. How many files will you have to read? You already
know it’s six because of the natural time element you’re working with (assuming you
used event or transaction times). You might describe the data as having a natural,
linear clustering behavior. You can apply the same description to any cases in which a
natural sort order is inherent to the data. You could also artificially create a sorting or
partitioning of the data by alphabetizing, using unique universal identifiers (UUIDs),
or using a file insertion time and then reordering as needed.
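When the columns you filter on most do not share a single natural ordering, you can impose one with Z-Ordering. As a sketch (hypothetical table and column names):

# Python
# ZORDER BY is applied as part of the OPTIMIZE operation
spark.sql("OPTIMIZE example.events ZORDER BY (device_id, event_time)")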
Z-Ordering attempts to create clusters of similar size in memory,
which typically will be directly correlated with the size on disk,
but there are situations in which this can become untrue. In those
cases, task skewing can occur during the compaction process.
For example, if you have a string column containing JSON values,
and this column has significantly increased in size over time, then
when Z-Ordering by date, both the task durations and the resulting
file sizes can become skewed during later processing.
Except for the most extreme cases, this should generally not signifi‐
cantly affect downstream consumers or processes.
One thing you might notice if you experiment with and without Z-Ordering of
files in your table is that it changes the distribution of the sizes of the files. While
OPTIMIZE, when left to its defaults, will generally create files that are fairly uniform in
size, the clustering behavior you put in place means that file sizes can become smaller
(or larger) than the built-in file size limiter (or one specified when available). This
preference for the clustering behavior over strict file sizing is intended to provide the
best performance by making sure the data gets colocated as desired.8
8 There is a more detailed example of Z-Ordering later in this chapter, but if you’re in a hurry, the blog post
“Optimize by Clustering not Partitioning Data with Delta Lake” by Denny Lee is a good and fast end-to-end
walkthrough.
While this feature can improve the way you use OPTIMIZE with Delta Lake, it will not allow the option of including a ZORDER on the files. You may still need additional processes, even when autoCompact is used, to provide the best performance for downstream data consumers.
You can control the target output size of autoCompact with spark.databricks
.delta.autoCompact.maxFileSize. While the default of 128 MB is often sufficient in
practice, you might wish to tune this to a higher or lower number to balance between
the impacts of rewriting multiple files during processing, whether or not you plan to
run periodic table maintenance operations, and your desired target end state for file
sizes.
The number of files required before compaction will be initiated is set through
spark.databricks.delta.autoCompact.minNumFiles. The default number is 50.
This just makes sure you have a lower threshold to avoid any negative impact of
additional operations on small tables with small numbers of files. Tables that are
small but have many append and delete operations might benefit from setting this
number lower, as this would create fewer files but would have less performance
impacts due to the smaller size. A higher setting might be beneficial for rather
large-scale processes where the number of writes to Delta Lake in a single transaction
is generally higher. This would avoid running an OPTIMIZE step for every write
stage, which could become burdensome in terms of added operational costs for each
transaction.
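As a sketch of those session-level settings (the enabled flag comes from the Databricks documentation rather than the text above; the values shown are the defaults just discussed):

# Python
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", "134217728")  # bytes = 128 MB
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "50")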
9 You can find additional configuration options in the Databricks documentation for this feature.
the actual write operation to get your results compacted down to just n files being
written. Optimized writes are a way to avoid needing to do this.
If you set delta.optimizeWrites to true on your table—or similarly, if you
set spark.databricks.delta.optimizeWrites.enabled to true in your Databricks
SparkSession—you get this different behavior. The latter setting will apply the for‐
mer option setting to all newly created tables from the SparkSession. You might
be wondering how this magical automation gets applied behind the scenes. How it
works is that before the write part of the operation happens, you will get additional
shuffle operations (as needed) to combine memory partitions so that fewer files can
be added during the commit. This is beneficial on partitioned tables because the
partitioning tends to make files even more granular. The added shuffle step can add
some latency into write operations, so for data producer–optimized scenarios you
might want to skip it, but it provides some additional compaction automatically,
similar to autoCompact above, except that it occurs prior to the write operation rather
than happening afterward. Figure 10-1 illustrates a case in which the distribution
of the data across multiple executors would result in multiple files written to each
partition (at left) and how the added shuffle improves the arrangement (at right).
Figure 10-1. A comparison of how optimized writes add a shuffle before writing files
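As a sketch using the property names given above (the table name is hypothetical, and these names have varied across releases, so verify them against your engine's documentation):

# Python
# table-level property for an existing table
spark.sql("ALTER TABLE example.events SET TBLPROPERTIES ('delta.optimizeWrites' = 'true')")
# session-level setting applied to newly created tables
spark.conf.set("spark.databricks.delta.optimizeWrites.enabled", "true")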
Vacuum
Because things like failed writes are not committed to the transaction log, you need
to make sure you vacuum even append-only tables that don’t have OPTIMIZE run on
them. Write failures do occur from time to time, whether due to some cloud provider failure or perhaps because of something else, and the resulting stubs still live inside the table directory until they are vacuumed away.
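A minimal sketch (hypothetical table name; 168 hours matches the default seven-day retention threshold):

# Python
spark.sql("VACUUM example.events RETAIN 168 HOURS")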
Databricks autotuning
Databricks includes a couple of scenarios in which the respective options, when
enabled, automatically adjust the delta.targetFileSize setting. One case is based
on workload types, and the second is on the table size.
In Databricks Runtime (DBR) 8.2 and later, when delta.tuneFileSizesForRewrites
is set to true, the runtime will check whether nine out of the last ten operations against the table were merge operations. If so, the target file size will be reduced to improve write efficiency (at least some of the reasoning has
to do with statistics and file skipping, which will be covered in the next section).
From DBR 8.4 onward, the table size is accounted for in determining this setting. For
tables less than about 2.5 TB, the delta.targetFileSize setting will be put at a lower
value of 256 MB. If the table is larger than 10 TB, the target will be set at a larger size
of 1 GB. For sizes that fall in the intermediate range between 2.5 TB and 10 TB, there
is a linearly increasing scale for the target, from 256 MB up to the 1 GB value. Please
refer to the documentation for additional details, including a reference table for this
scale.
Table Statistics
Up to this point, most of the focus has been centered around the layout and distribu‐
tion of the files in your tables. The reason for this has a great deal to do with the
underlying arrangement of the data within those files. The primary way to see what
that data looks like is based on the file statistics in the metadata. Now you will see
how you get statistics information and why it matters to you. You’ll see what the
process looks like, what the stats look like, and how they influence performance.
10 Matthew Powers and Nick Karpov’s blog post on the vacuum command provides a more in-depth exploration
of vacuuming, with examples and exploration of some of the nuances.
How statistics help
Statistics about our data can be pretty useful. You’ll see more about what this means
and what it looks like in a moment, but first, let’s think about some reasons why you
might want statistics on the files in your Delta Lake. Suppose that you have a table
with a “color” field that takes 1 of 100 possible values, and each color value occurs in
exactly 100 rows. This gives you 10,000 total rows. If these color values are randomly
distributed throughout the rows, then finding all the “green” records would require
scanning the whole set. Suppose you now add some more structure to the set by
breaking it into ten files. In this case, you might guess that there are green records
in each of the ten files. How could you know whether that is true without scanning
all ten files? This is part of the motivation for having statistics on your files—namely,
that if you do some counting operations at the time of writing the files or as part of
your maintenance operations, then you can know from your table metadata whether
or not specific values occur within files. If your records are sorted, this impact gets
even bigger, because then you can drastically reduce the number of files that need to
be read to find all your green records, or to find the row numbers between 50 and
150, as you can see in Figure 10-2. While this example is just conceptual, it should
help to convince you why table statistics are important—but before you turn to a
more detailed practical example, see first how statistics operate in Delta Lake.
Figure 10-2. The arrangement of the data can affect the number of files read
# Python
import json

basepath = "/tmp/delta/partitioning.example.delta/"
fname = basepath + "_delta_log/00000000000000000000.json"
with open(fname) as f:
    for i in f.readlines():
        parsed = json.loads(i)
        if 'add' in parsed.keys():
            stats = json.loads(parsed['add']['stats'])
            print(json.dumps(stats))
When you run this, you will get a collection of the statistics generated for each of the
created files added to the Delta Lake table:
{
"numRecords": 2,
"minValues": {"id": 2, "name": "Customer 2"},
"maxValues": {"id": 4, "name": "Customer 4"},
"nullCount": {"id": 0, "name": 0}
}
{
"numRecords": 2,
"minValues": {"id": 1, "name": "Customer 1"},
"maxValues": {"id": 3, "name": "Customer 3"},
"nullCount": {"id": 0, "name": 0}
}
In this case, you see all the data values since the table has only four records, and there
were no null values inserted, so those metrics are returned as zeros.
In Databricks (DBR 8.3 and above), you can additionally run an ANALYZE TABLE
command to collect additional statistics, such as the number of distinct values,
average length, and maximum length. These added statistics values can yield further
performance improvements, so be sure to leverage them if you’re using a compatible
compute engine.
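As a sketch, using the table name from the liquid clustering example later in this chapter:

# Python
# collects extended column-level statistics on a compatible engine (e.g., Databricks)
spark.sql("ANALYZE TABLE example.wikipages COMPUTE STATISTICS FOR ALL COLUMNS")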
If you’ll recall from Chapter 5, one of the settings you have available to you is
delta.dataSkippingNumIndexedCols, which, with a default value of 32, determines
how many columns statistics will be collected on. If you have a situation in which
you are unlikely to run SELECT queries against the table, as in a bronze to silver layer
stream process, for example, you can reduce this value to avoid additional overhead
from the write operations. You could also increase the number of columns indexed in
cases where query behavior against wider tables varies considerably more than would
make sense to ZORDER BY (anything more than a few columns is usually not very
beneficial). One other item to note here is that you can alter the table order to directly
place larger valued columns after the number of indexed columns using ALTER TABLE
CHANGE COLUMN (FIRST | AFTER).11
If you want to make sure statistics are collected on columns you add after the initial
table is created, you would use the FIRST parameter. You can reduce the number of
columns and move a long text column, for example, after something like a timestamp
column to avoid trying to collect statistics on the large text column and ensure that
you still include your timestamp information to take advantage of filtering better.
Setting each is fairly straightforward, but note that the after argument requires a
named column:
-- SQL
ALTER TABLE
  delta.`example`
SET TBLPROPERTIES ("delta.dataSkippingNumIndexedCols"=5);

ALTER TABLE
  delta.`example`
CHANGE articleDate first;

ALTER TABLE
  delta.`example`
CHANGE textCol after revisionTimestamp;
11 There is an example in the section covering the CLUSTER BY command that demonstrates this practice.
Z-Ordering revisited
File skipping creates great performance improvements by reducing the number of
files that need to be read for many kinds of queries. You might ask, though: how
does adding the clustering behavior from ZORDER BY affect this process? This is
fairly straightforward. Remember, Z-Ordering creates clusters of records using a
space-filling curve. The implication of doing this is that the files in your tables are
arranged according to the clustering of the data. This means that when statistics are
collected on the files, you get boundary information that aligns with how your record
clusters are segregated in the process. So now when seeking records that align with
your Z-Ordered clusters, you can further reduce the number of files that need to be
read.
You might further wonder how the clusters in the data get created in the first
place. Consider the goal of optimizing the read task for a more straightforward case.
Suppose you have a dataset with a timestamp column. If you wanted to create some
same-sized files with definite boundaries, then a straightforward answer appears. You
can sort the data linearly by the timestamp column and then just divide it into chunks
that are the same size. What if you want to use more than one column, though, and
create real clusters according to the keys, instead of just some linear sort you could
have done on your own?
The more advanced task of using space-filling curves on multiple columns is not
that hard to understand once you see the idea, but it’s not as simple as the linearly
sorted case either. At least not yet it isn’t. That’s actually part of the idea. You need
to perform some additional work to construct a way to be able to similarly range
partition data across multiple columns. To do this, you need a mapping function that
can translate multiple dimensions onto a single dimension so that you can do the
dividing step, just like in the linear ordering case. The actual implementation used in
Delta Lake might be a little tricky to digest out of context, but consider this snippet
from the Delta Lake repository:
// Scala
object ZOrderClustering extends SpaceFillingCurveClustering {
  override protected[skipping] def getClusteringExpression(
      cols: Seq[Column], numRanges: Int): Column = {
    assert(cols.size >= 1, "Cannot do Z-Order clustering by zero columns!")
    val rangeIdCols = cols.map(range_partition_id(_, numRanges))
    interleave_bits(rangeIdCols: _*).cast(StringType)
  }
}
This takes the multiple columns passed to the Z-Order modifier and then alternates
the column bits to create a new temporary column that provides a linear dimension
you can now sort on and then partition as a range. Now that you know how it works,
consider a more discrete example that demonstrates this approach.
Figure 10-3. With files laid out in a linear fashion, you wind up reading extra records
First, find the rows that match the conditions x = 5 or x = 6. Then find the columns
matching the conditions y = 5 or y = 6. The points where they intersect are the target
values you want, but if the condition matches for a file, you have to read the whole
file. So for the files you read (the ones that contain matching conditions), you can
sort the data into two categories: data that matches your conditions specifically, and
extra data in the files that you still have to read anyway.
As you can see, you have to read the entirety of the files (rows) where x = 5 or x = 6
to capture the values of y that match as well, which means nearly 80% of our read
operation was unnecessary.
Now update your set to be arranged with a space-filling Z-Order curve instead. In
both cases, you have a total of nine data files, but now the layout of the data (as
shown in Figure 10-4) is such that by analyzing the metadata (checking the min/max
values per file), you can skip additional files and avoid a large chunk of unnecessary
records being read.
After applying the clustering technique to the example, you only have to read a single
file. This is partly why Z-Ordering goes alongside an OPTIMIZE action. The data needs
to be sorted and arranged according to the clusters. You might wonder if you still
need to partition the data in these cases since the data is organized efficiently. The
short answer is yes, as you may still want to partition the data, for example, in cases
where you are not using liquid clustering and might run into concurrency issues.
When the data is partitioned, OPTIMIZE and ZORDER will only cluster and compact
data already colocated within the same partition. In other words, clusters will be
created only within the scope of data inside a single partition, so the benefits of
ZORDER still directly rely on a good choice of partitioning scheme.
The method for determining the closeness, or cluster membership, relies on inter‐
leaving the column bits and then range partitioning the dataset.12
12 There is a version of this written in Python to encourage additional exploration in the Chapter 10 section of
the book’s repository.
5. Range partition the new one-dimensional column.
6. Plot the points by coordinates and bin identifier.
The results are shown in Figure 10-5. They don’t show quite the same behavior
as Figure 10-4, which is very neat and orderly, but they do clearly show that even
with a self-generated and directly calculated approach, you could create your own
Z-Ordering on a dataset.
From a mathematical perspective, there are more details and even some enhance‐
ments that could be considered, but this algorithm is already built into Delta Lake, so
for the sake of our sanity, this is the current limit of our rigor.13
13 For more technical details, refer to Mohamed F. Mokbel, Walid G. Aref, and Ibrahim Kamel, “Performance
of Multi-dimensional Space-Filling Curves”, in Proceedings of the 10th ACM International Symposium on
Advances in Geographic Information Systems (GIS ’02) (New York: Association for Computing Machinery,
2002), 149–54.
Cluster By
The end of partitioning? That’s the idea. The newest and best-performing method
for taking advantage of data skipping came in Delta Lake 3.0. Liquid clustering
takes the place of traditional Hive-style partitioning with the introduction of the
CLUSTER BY parameter during table creation. Like ZORDER, CLUSTER BY uses a space-
filling curve to determine the best data layout but changes to other curve types
that yield more efficiency. Figure 10-6 shows how different partitions may either get
coalesced together or be broken down in different combinations within the same
table structure.
Figure 10-6. An example file layout resulting from applying liquid clustering on a
dataset14
Where it starts to get different is in how you use it. Liquid clustering must be declared
during table creation to enable it, and it is incompatible with partitioning, so you
can’t define both. When set, it creates a table property, clusteringColumns, which
can be used to validate that liquid clustering is in effect for the table. Functionally, it
14 This example comes from a fuller walk-through highlighting how liquid clustering works both to split apart
larger partitions as well as to coalesce smaller ones. For the full example, check out Denny Lee’s blog post
“How Delta Lake Liquid Clustering Conceptually Works”.
operates similarly to ZORDER BY in that it still helps to know which columns might
yield the greatest filtering benefit for queries, so you should still make sure to keep your optimization goals in sight.
You also will not be able to ZORDER the table independently, as the action takes place
primarily during compaction operations. A small side benefit worth mentioning
is that liquid clustering reduces the specific information needed to run OPTIMIZE
against a set of tables because there are no extra parameters to set, allowing you to
even loop through a list of tables to run OPTIMIZE without worrying about matching
up the correct clustering keys for each table. You also get row-level concurrency—a
must-have feature for a partitionless table—which means that most of the time you
can stop trying to schedule processes around one another and reduce downtime,
since even OPTIMIZE can be run during write operations. The only conflicts that
happen are when two operations try to modify the same row at the same time.
File clustering, like that shown in Figure 10-6, gets applied to compaction in two
different ways. For normal OPTIMIZE operations, it will check for changes to the
layout distribution and adjust as needed. This newer clustering enables a best-effort
application of clustering the data during write processes, which makes it far more
reliably incremental to apply. This means less work is required to rewrite files during
compaction, which also makes that process more efficient as well. This feature is
called eager clustering. This means that for data under the threshold (512 GB by
default), new data appended to the table will be partially clustered at the time of the
write (the best-effort part). In some cases, the size of these files will vary from the
larger table until a larger amount of data accumulates and OPTIMIZE is run again. This
is because the file sizes are still driven by the OPTIMIZE command.
Explanation
CLUSTER BY uses a different space-filling curve than ZORDER, but without the presence
of partitions, it creates clusters across the whole table. Using it is fairly straightfor‐
ward, as you simply include a CLUSTER BY argument as part of your table creation
statement. You must do so at creation or else the table will not be compatible as a liquid clustering table; clustering cannot be added afterward. You can, however, later update the columns chosen for the operation or even remove all columns from the clustering by using an ALTER TABLE statement with CLUSTER BY (use NONE instead of providing a column name or names for the latter case, as shown in the sketch that follows).
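As a sketch (the table name reuses the example later in this chapter):

# Python
# change the clustering keys after creation, or remove clustering entirely
spark.sql("ALTER TABLE example.wikipages CLUSTER BY (articleDate)")
spark.sql("ALTER TABLE example.wikipages CLUSTER BY NONE")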
Hopefully, it has become apparent that liquid clustering offers several advantages over
Hive-style partitioning and Z-Ordering tables whenever it’s a good fit. You get faster
write operations with similar read performance to other well-tuned tables. You can
avoid problems with partitioning. You get more consistent file sizes, which makes
downstream processes more resistant to task skewing. Any column can be a cluster‐
ing column, and you gain much more flexibility to shift these keys as required. Last,
thanks to row-level concurrency, conflicts with processes are minimized, allowing
workflows to be more dynamic and adaptable.
Example
In this example, you’ll see the Wikipedia articles dataset found in the /databricks-
datasets/ directory available in any Databricks workspace. This Parquet directory has
roughly 11 GB of data (disk size) across almost 1,100 gzipped files.
Start by creating a DataFrame to work with, add a regular date column to the set, and
then create a temporary view to work with in SQL afterward:
# Python
articles_path = (
    "/databricks-datasets/wikipedia-datasets/" +
    "data-001/en_wikipedia/articles-only-parquet")

parquetDf = (
    spark
    .read
    .parquet(articles_path)
)
parquetDf.createOrReplaceTempView("source_view")
With a temporary view in place to read from, you can create a table simply by adding
the CLUSTER BY argument to a regular CTAS statement to define the table:
-- SQL
CREATE TABLE
example.wikipages
CLUSTER BY
(id)
AS (SELECT *,
date(revisionTimestamp) AS articleDate
FROM source_view
)
You still have a normal statistics collection action to think about, so you probably
want to exclude the actual article text from that process, but you also created the
articleDate column, which you probably want to use for clustering. To do this, you
can add the following steps: reduce the number of columns you collect statistics on to
only the first five, move both the articleDate and text columns, and then define the
new CLUSTER BY column. You can do all of these using ALTER TABLE statements:
-- SQL
ALTER TABLE example.wikipages
SET tblproperties ("delta.dataSkippingNumIndexedCols"=5);
ALTER TABLE example.wikipages CHANGE articleDate first;
ALTER TABLE example.wikipages CHANGE `text` after revisionTimestamp;
ALTER TABLE example.wikipages CLUSTER BY (articleDate);
After this step, you can run your OPTIMIZE command, and everything else will be
handled for you. Then you can use a simple query like the following for testing:
-- SQL
SELECT
  year(articleDate) AS PublishingYear,
  count(distinct title) AS Articles
FROM
  example.wikipages
WHERE
  month(articleDate)=3
AND
  day(articleDate)=4
GROUP BY
  year(articleDate)
A deeper look
A Bloom filter index is created at the time of writing files, so this has some implica‐
tions to consider if you want to use the option. In particular, if you want all the
data indexed, then you should define the index immediately after defining a table
but before you write any data into it. The trick to this part is that defining the index
correctly requires you to know the number of distinct values of any columns you
want to index ahead of time. This may require some additional processing overhead,
but for the example, you can add a COUNT DISTINCT statement and get the value
15 If you wish to dive more deeply into the mechanisms and calculations used to create Bloom filter indexes,
consider starting with the “Bloom filter” Wikipedia article.
as part of the process to accomplish this using only metadata (another Delta Lake
benefit). Use the same table from the CLUSTER BY example, but now insert a Bloom
filter creation process right after the table definition statement (before you run the
OPTIMIZE process):
# Python
from pyspark.sql.functions import countDistinct

cdf = spark.table("example.wikipages")
raw_items = cdf.agg(countDistinct(cdf.id)).collect()[0][0]
num_items = int(raw_items * 1.25)

spark.sql(f"""
  create bloomfilter index
  on table example.wikipages
  for columns (id options (fpp=0.05, numItems={num_items}))
""")
Here the previously created table is loaded, and you can bring in the Spark SQL func‐
tion countDistinct to get the number of items for the column you want to add an
index for. Since this number determines the overall hash length, it’s probably a good
idea to pad it—for example, where raw_items is multiplied by 1.25, there was an
additional 25% added to get num_items to allow for some growth to the table (adjust
according to your projected needs). Then define the Bloom filter index itself using
SQL. Note that the syntax of the creation statement details exactly what you wish to
do for the table and is pretty straightforward. Then specify the column(s) to index
and set a value for fpp (more details are in the following section on configuration)
and the number of distinct items you want to be able to index (as already calculated).
Configuration
The fpp value in the parameters is short for false positive probability. This number
sets a limit on what rate of false positives is acceptable during reads. A lower value
increases the accuracy of the index but takes a little bit of a performance hit. This is
because the fpp value determines how many bits are required for each element to be
stored, so increasing the accuracy increases the size of the index itself.
The less commonly used configuration option, maxExpectedFpp, is a threshold value
set to 1.0 by default, which disables it. Setting any other value in the interval [0, 1)
sets the maximum expected false positive probability. If the calculated fpp value
exceeds the threshold, the filter is deemed to be more costly to use than it is benefi‐
cial, and so it is not written to disk. Reads on the associated data file would then fall
back to normal Spark operation, since no index remains for it.
Conclusion
When you set out to refine the way you engineer data tables and pipelines with
Delta Lake, you may have a clear optimization target, or you might have conflicting
objectives. In this chapter, you saw how partitioning and file sizes influence the
statistics generated for Delta Lake tables. Further, you saw how compaction and
space-filling curves can influence those statistics. In any case, you should be well
equipped with knowledge about the different kinds of optimization tools you have
available to you in working with Delta Lake. Most specifically, note that file statistics
and data skipping are probably the most valuable tools for improving downstream
query performance, and you have many levers you can use to impact those statistics
and optimize for any situation. Whatever your goal is, this should prove to be a
valuable reference as you evaluate and design data processes with Delta Lake.
CHAPTER 11
Successful Design Patterns
High-Speed Solutions
Streaming media services usually capture data from individual end-user devices,
which include several different components. To run such services successfully, you
may require varying kinds of information about device health, application status,
playback event information, and interaction information. This usually translates to a
need for building high-throughput stream processing applications and solutions.
One of the most critical components of these streaming applications is ensuring the
capture of the data with reliability and efficiency. Chapter 7 demonstrated several
implementation methods and their benefits, showing how Delta Lake can play a critical
role in exactly these kinds of data capture tasks. Delta Lake is often the destination
for many of these ingestion processes because it has ACID transaction guarantees and
additional features like optimized writes that make high-volume stream processing
better and easier.
Let’s say you want to monitor the Quality of Service (QoS) across all your users in
near real time. To accomplish this task, you usually need not just playback event
information but also the relevant context from each user’s session, a sequence of
interactions bound together over some time span. Sessionization is often an impor‐
tant cornerstone of many downstream operations beyond ingestion and typically falls
into the data engineering stages of a larger data process, as shown in Figure 11-1.
With session information and other system information in Delta Lake, you can power
downstream analytics use cases such as Quality of Service measurement or trending
item recommendations while maintaining a low turnaround time in processing.
Building out these pipelines is often fairly complex and will involve the interaction
of multiple pipelines and processes. At the core, you will find that each component
boils down to the idea of needing to build a robust data processing pipeline to serve
multiple business needs.
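To make the sessionization idea concrete, a minimal sketch in Spark Structured Streaming could look like the following; the events stream, the column names, and the 30-minute inactivity gap are illustrative assumptions rather than details of any particular QoS implementation:
# Python
from pyspark.sql.functions import col, session_window

# events is an assumed streaming DataFrame of playback/interaction events
sessions = (
    events
    .withWatermark("event_time", "1 hour")
    .groupBy(
        col("user_id"),
        session_window(col("event_time"), "30 minutes"))
    .count()
)
Each resulting row represents one user session, bounded by 30 minutes of inactivity, which can then be written to a Delta table for downstream QoS analysis.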
1 For an extended exploration of a QoS solution end-to-end, we recommend the blog post “How to Build a
Quality of Service (QoS) Analytics Solution for Streaming Video Services” and its accompanying notebooks
from Databricks.
Figure 11-2. Comcast’s smart remote control provides an alternative interface for
entertainment
Before we explore how Comcast is building its solutions on Delta Lake, it might be
useful to review more specific information about the scale of its operations. Comcast
drives interactions through its voice remote, and its customers used this remote 14
billion times in 2018–2019 (Figure 11-3 illustrates the relative scale to data process‐
ing).2 Users expect many things in their experience with the applications, such as
getting accurate searches and feeling enabled to find the right content for consump‐
tion. Each user’s individual experience should also have elements of personalization
that make the experience their own. With the voice remote, users can interact with
the whole system; anything they want is just a quick phrase away. On top of this,
Comcast uses user data to create personalized experiences.
2 For additional detail, see the Databricks videos “Comcast Makes Home Entertainment Accessible to Everyone
with Voice, Data and AI” and “Winning the Audience with AI: How Comcast Built an Agile Data and AI
Platform at Scale”.
Consider the technical components essential to running such services behind the
scenes. First, receiving voice commands as input (something that’s exploded in popu‐
larity more recently) is a technically challenging problem. There’s the transformation
of voice to a digital signal, which then has to be mapped to each needed command.
There’s often an additional component to this mapping of correcting for intent. Is it
more likely for someone to be searching for a show called How It’s Made or to be
asking about other shows that explain how some particular thing is made? If it is
a search command, there is still a need to find similar content through a matching
algorithm. All of this gets wrapped together into a single interface point in a setting
in which the user experience needs to be measured against accuracy, so getting bits
of data about these processes and enabling analytics to assess immediate problems or
long-term trends is also critical.
So now we have voice inputs that have to be converted to embedding vectors (vectors
of numeric data capturing semantic meaning as “tokens”), as well as contextual data
(this could be what type of page the user is on, other recent searches, date-time
parameters, etc.) for each interaction with the remote.3 The goal is to collect all
this and provide inference back through the user interface (UI) in nearly real time.
From a functional standpoint, there’s also a large amount of telemetry information
that needs to be collected to maintain insights into things such as device health,
connectivity status, viewing session data, and other similar concerns.
Once the problem of getting this data from individual devices to a centralized
processing platform is solved, there are still additional challenges in deciding how
to standardize the data sources, as multiple versions of devices may have differing
available information, or usage regions may have differing collection laws that mean
fuller or lesser contents of captured events. Downstream from standardization, there
is still a need to organize the data and create actionable steps in a fit-for-function
format.
3 For a more robust treatment of embeddings, see Marcos Garcia, “Embeddings in Natural Language Process‐
ing: Theory and Advances in Vector Representations of Meaning”, Computational Linguistics 47, no. 3 (2021):
699–701.
Earlier attempts
To support the voice remote, Comcast needed to be able to analyze queries and look
at user journeys to do things like measure the intention of a query. At a rate of
up to 15 million transactions per second, Comcast needed to enable sessionization
across billions of sessions on multiple petabytes of data. Running on native AWS
services, Comcast kept overrunning service limits and had to keep increasing
concurrency, eventually running 32 concurrent job runs across 640 virtual machines to
reach the scale it needed for sessionization. The processing flow is shown in
Figure 11-4. This led Comcast to seek a scalable, reliable, and performant solution.
Figure 11-4. To scale the earlier data ingestion pipeline, Comcast had to crank up the
concurrency
Figure 11-5. Delta Lake provides the foundation for optimized ingestion and
sessionization
If this was the whole story, you would probably already be convinced of the value
Delta Lake can bring to ease processing burdens. What’s great is that it’s not the
whole story. In its Databricks environment, Comcast was able to readily access this
sessionized data for multiple downstream purposes.
4 There is some good discussion of hot-spot keys in key-value stores in the section “Partitioning of Key-Value
Data” in Martin Kleppmann’s Designing Data-Intensive Applications (O’Reilly).
5 AWS states in its performance guidance for S3 that sequential prefixes can also be effective.
6 Databricks, “Customer Story: Comcast”.
Since Comcast is using MLflow, it gets additional side benefits from Delta Lake in
its machine learning processes. With the data source tracking available in the experi‐
ment for a project, MLflow can track information about the Delta Lake table being
used for the experiment without having to make a copy of the data, in the same way
as you would with a CSV file or other data sources.7 Figure 11-6 shows where MLflow
sits in the data life cycle. Since Delta Lake also has time travel capabilities, machine
learning experiments can have enhanced reproducibility, which would benefit anyone
maintaining data science products in production.
Figure 11-6. Delta Lake helps enable reliable end-to-end MLOps processes
7 To compare the entire capabilities for tracking different kinds of files in MLflow experiments, we suggest you
look at the “mlflow.data” section of their documentation.
Figure 11-7. Performance comparison results for query running times in Databricks SQL
on Delta Lake versus Redshift
In the end, it’s looking to be highly advantageous for Comcast to continue innovating
with Delta Lake. It has so far experienced huge savings gains in its data ingestion
processes and has a promising outlook on improving reporting. This should allow
Comcast to further improve end-user experiences for its smart remotes and increase
overall satisfaction rates.
8 Molly Nagamuthu and Suraj Nesamani, “SQL Analytics Powering Telemetry Analysis at Comcast”, posted
September 16, 2021, by Databricks, YouTube.
Streaming Ingestion
Stream processing applications for ingestion tasks are relatively common. We have
a large array of streaming frameworks out there to choose from. Among the most
common ones are the open source Apache Kafka, Kinesis from AWS, Event Hubs in
Azure, and Google’s Pub/Sub.
While there is certainly a wide variety of applicability covering interesting subjects
like real-time telemetry monitoring of IoT devices and fraudulent transaction moni‐
toring or alerting, one of the most common cases for stream processing is large-scale
and dynamic data ingestion.9 For many organizations, collecting data about activi‐
ties by end users on mobile applications or point-of-sale (POS) data from retailers
directly translates to success in supporting mission-critical business analytics applica‐
tions. Acquiring large amounts of data from widely dispersed sources quickly and
correctly allows businesses to become more rapidly adaptable to changing conditions
as well (Figure 11-8 shows a unified architecture across many streaming sources).
Much of this flexibility, achieved by enabling real-time processes and artificial
intelligence applications, is fueled by dynamic and resilient data
pipelines that often fall into this category.10 In all of these, there's usually an element
of capturing inbound data for later analytical or evaluation purposes, so while there
might be additional components in some processing pipelines, at the end of the day
this process applies to most stream processing applications.11
9 Many teams document their own journey of landing streaming data sources in Delta Lake; for example, the
Michelin team captured a step-by-step implementation guide to building a Kafka + Avro + Spark + Delta Lake
IoT data ingestion pipeline in a Microsoft Azure environment.
10 The term artificial intelligence is used here in the classical software development sense of “narrow AI,” mean‐
ing the application of machine learning algorithms to make automated business decisions without human
interaction—see the definitions of artificial intelligence posted by the Stanford Institute of Human-Centered
Artificial Intelligence.
11 Refer to the discussion of the medallion architecture in Chapter 7 or 9 for more details on implementing
stream processing applications and Delta Lake.
Consider the case of IoT data coming in from devices. If you send all the data into
Kafka, you can build a Spark application to consume that stream and capture all the
original data as it is received, following the model of the medallion architecture. Then
you can create business-level reporting and send those results out to be consumed
in a downstream application. Naturally, there are many variations on this approach,
but the general pipeline model is similar, as shown in Figure 11-9. At Scribd, this
application was so common that they built a new framework around implementing
this process.
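Before looking at how Scribd reframed the problem, here is a minimal sketch of that ingestion step in Spark Structured Streaming; the broker address, topic, checkpoint path, and table name are placeholders:
# Python
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "iot-events")
    .option("startingOffsets", "earliest")
    .load()
)

# Land the payload as-is in the bronze tier; parsing and cleansing happen downstream
bronze = raw.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "topic", "partition", "offset", "timestamp")

(
    bronze.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/lakehouse/bronze/iot_events/_checkpoints")
    .toTable("bronze.iot_events")
)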
12 This architecture diagram comes from the Databricks blog post “Simplifying Streaming Data Ingestion into
Delta Lake” (accessed December 7, 2023).
13 Christian Williams, “Streaming Data into Delta Lake with Rust and Kafka”, posted July 19, 2022, by Data‐
bricks, YouTube.
This also led to thoughtful reflection on how the team might approach the problem.
Would it be possible to do this without Spark or to find some more minimal overhead
method? How would the team still maintain its standardization on Delta Lake, since
that made stewardship so much easier?
Figure 11-10. Scribd’s kafka-delta-ingest in tandem with delta-rs for efficient ingestion
Undertaking such an endeavor was not without risks or potential blocking issues. The
risk of corrupting the Delta log posed one challenge, as did the need to manually
control offset tracking in Kafka to avoid duplicate or dropped records. Scribd also
needs to support multiple writers to tables, and furthermore, some limitations in
AWS S3 require specific handling (e.g., S3 lock coordination).15
Scribd runs anywhere from 70 to 90 of these kafka-delta-ingest and delta-rs pipelines
in production. It runs serverless computation of these pipelines through AWS Fargate
and monitors everything in Datadog. Some of the things it monitors include message
deserialization logs and several metrics: the number of transformations and failures,
the number of Arrow batches in memory, the sizes of Parquet data files written, and
the current time lag in Kafka streams.
14 Christian Williams, “Kafka to Delta Lake, as Fast as Possible”, Scribd Technology (blog), Scribd, May 19, 2021.
15 Some of these S3 issues are discussed in the D3L2 web series episode “The Inception of Delta Rust” on
YouTube.
Figure 11-11. Some of the cost-saving examples Scribd shared during Data+AI Summit
2022 that show the cost of running a process originally in Spark and then using delta-rs16
16 Note that the Rust resources show individual vCPU and memory allocation, whereas the Spark resources
show clusters composed of multiple EC2 instances; r5.large EC2 instances each have two vCPUs and 16 GB of
RAM. Amazon EC2 R5 instance metrics can be found on the AWS website.
Figure 11-12. Retail merchant credit transactions present just one area in which we
might see complex system interactions
17 If you want to spend more time exploring CDC, also known as logical log replication, we recommend
Designing Data-Intensive Applications by Martin Kleppmann (O’Reilly).
18 Ivan Peng and Phani Nalluri, “Unlocking Near Real Time Data Replication with CDC, Apache Spark Stream‐
ing, and Delta Lake”, posted July 26, 2023, by Databricks, YouTube.
The design that arose from these requirements (see Figure 11-13) is a streaming CDC
framework built on Spark Structured Streaming that replicates change feeds into a
unified source of truth built on Delta Lake that supports downstream integrations
across a wide range of query interfaces. Features such as merge support and ACID
transactions helped make Delta Lake a critical component of the design.
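The framework's internals aren't reproduced here, but the core pattern it relies on, applying a change feed to a Delta table with foreachBatch and MERGE, can be sketched as follows; the table, key, and column names (order_id, commit_ts, op) are placeholders:
# Python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def apply_changes(batch_df, batch_id):
    # Keep only the most recent change per key within the microbatch
    w = Window.partitionBy("order_id").orderBy(F.col("commit_ts").desc())
    latest = (
        batch_df
        .withColumn("rn", F.row_number().over(w))
        .where("rn = 1")
        .drop("rn")
    )
    target = DeltaTable.forName(spark, "silver.orders")
    (
        target.alias("t")
        .merge(latest.alias("s"), "t.order_id = s.order_id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll(condition="s.op != 'DELETE'")
        .execute()
    )

# change_feed is an assumed streaming DataFrame of CDC records
(
    change_feed.writeStream
    .foreachBatch(apply_changes)
    .option("checkpointLocation", "/lakehouse/_checkpoints/orders_cdc")
    .start()
)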
The success of this design could be measured in many ways, but there are several
aspects that the team highlights. The system supports 450 streams (one-to-one with
tables) running 24/7 on more than one thousand EC2 nodes. This translates to about
800 GB ingested daily from Kafka, with a total daily processing volume of about 80
TB. The design far exceeded the initial requirements and attained a data freshness of
less than 30 minutes. The team has enabled the self-service creation of tables for data
users in the environment that become available in less than an hour.
Figure 11-14. The starting state of processes at DoorDash before the move to Delta Lake
Figure 11-15 shows exactly what this change at DoorDash enabled: easy integration
with its current tooling with the addition of ACID guarantees at a massive scale.
Previously this process was taking place with regular Parquet files, adding additional
complications in the form of write locks and other challenges. Additionally, the
quality-of-life improvements gained through easy-to-use compaction operations and
the ability to do these operations while stream processing applications are still
19 Fabian Paul, Pawel Kubit, Scott Sandre, Tathagata Das, and Denny Lee, “Writing to Delta Lake from Apache
Flink”, Delta Lake (blog), April 27, 2022.
20 Allen Wang, “Building Scalable Real Time Event Processing with Kafka and Flink”, DoorDash Engineering
(blog), DoorDash, August 2, 2022; Allison Cheng, “Flink + Delta: Driving Real-Time Pipelines at DoorDash”,
posted July 26, 2023, by Databricks, YouTube.
Figure 11-15. The resulting state of the data ecosystem at DoorDash after the move to
Delta Lake
The moral of the story of the DoorDash decision to adopt Delta Lake is this: even
for data systems with multiple types of tooling operating at massive scale and with
a need to support things like efficiently capturing data from real-time event streams
or the changes coming through operational databases, Delta Lake provides reliability
and usability, making it a winning choice.
Conclusion
Data applications come in many different forms and formats. Authoring those data
applications can be complex and painful. In this chapter you’ve seen a few ways to
alleviate this pain through the many benefits of Delta Lake. In particular, the features
of Delta Lake help create a robust data environment that supports broad tooling
choices, reduces costs, and improves your quality of life as a developer.
We do many things every day without consciously thinking about them. These rote
actions, or automatic behaviors, are based on our daily routines and on information
we’ve grown to trust over time. Our routines can be simple or complex, with actions
grouped and categorized into different logical buckets. Consider, for example, the
routine of locking up before leaving for the day; this is a common behavior for miti‐
gating risk, because we simply can’t trust everyone to have our best interests in mind.
Think about this risk mitigation as a simple story: to prevent unauthorized access to
a physical location (entity: home, car, office), access controls (locking mechanism) have
been introduced to secure a physical space (resource) and provide authorized admittance
only when trust can be confirmed (key, credentials).
In the simplest sense, the only thing preventing intrusion is the key. While a key
grants access to a given physical space via a lock, the bearer of a given key must
also know the physical location of a protected resource; otherwise, the key has no
use. This is an example of site security, and as a mental model, it is useful when
constructing a plan for the layered governance and security model for resources
contained within our lakehouse. After all, the lakehouse is a safe space that protects
what we hold near and dear only if we collectively govern the resources contained
within.
But what exactly is the governance of a data resource, and how do we get started
when there are many components of the governance landscape?
This chapter provides a foundation for architecting a scalable
data governance strategy for the data assets (resources) contained
within the lakehouse. While we aim to cover as much surface area
here as possible, consider this a referential chapter just scratching
the surface of the myriad components of lakehouse data gover‐
nance. For example, we won’t cover governance with respect to
compliance and enforcement of region-specific rules and regula‐
tions (GDPR, CCPA, right-to-be-forgotten policies, and so on),
nor will we cover general governance from a nonengineering or
nontechnical perspective.
Lakehouse Governance
Before we dive deeper into lakehouse governance, it is important to introduce the
many components of governance today. The reason for this is that governance is
an overloaded term that means many different things, depending on who you ask.
Therefore, in order to go beyond basic access controls and traditional database-level
governance, we need to introduce the systems and services that can come together to
provide a comprehensive governance solution for our lakehouse.
There are many components to lakehouse governance, as seen in Figure 12-1, but at a
high level, we can simply break them down between identity and access management
or IAM (1) and catalog services (2–8). This allows us to build a working model that is
easier to adopt.
For example, unless we understand who (or what) is requesting access to our data
(identity services), we cannot manage the permissions enabling access—seen as the
union between identity services and policies. Furthermore, without integrating the
policies and rules contained within our IAM services (1) with the physical filesystem
(3), we will not be capable of governing the databases (schemas), tables, views,
and other assets stored in our catalog metastores (2). Given the modern separation
between metadata management (2) and physical filesystem resources (3), the foun‐
dation for any lakehouse governance begins with the basic delegation of access to
resources in a unified and controlled way.
Modern lakehouse governance includes (4) robust auditing across data manage‐
ment operations on a per-action basis, commonly captured through event logs for
state changes made via IAM (1) for the resources registered within the catalog or
metastore (2).
1 The term data catalog can mean a metastore like Hive, or it can also encapsulate a full “enterprise” data
catalog. This chapter caters to the engineering side of the house, and so we won’t be discussing the integration
of “data catalogs” for use by nonengineering personas in a typical enterprise.
To expand the scope of table-level data life cycle management, the simple diagram in
Figure 12-2 provides a lens into common steps, from data creation to archiving and
ultimately to destruction.
Similar to other common cycles, such as the software development cycle, the com‐
mon data life cycle starts with (1) creation and continues to (2) storage, (3) usage,
(4) sharing, (5) archiving, and ultimately (6) destruction. This life cycle encapsulates
a complete history of actions and operations (a timeline) occurring at the resource
level.
These observable moments in time are critical for the purposes of data governance,
as well as for the maintenance and usability of the table from an engineering perspec‐
tive. Each table is a governable resource referred to as a data asset.
We learned about the medallion architecture for data quality in Chapter 9. This novel
design pattern introduced the three-tiered approach for data refinement, from bronze
to silver and into gold. This architecture plays a practical role when we’re thinking
about managing the life cycle of our data assets over time and when we’re considering
how long to retain data at a specific tier.
Aided by Figure 12-3, we can visualize the value of data assets as they are refined over
time and across the logical data quality boundaries represented by bronze, silver, and
gold.
Figure 12-3 shows the source tables and lineage of transformations for a curated data
product named (table G). Working backward from the gold data assets, we see that
there is a decrease in the value of the tables as we retrace the lineage back through
the silver tier (D–F), concluding with our bronze data assets (A–C). Why is the single
table worth more conceptually than the collection of the prior six tables?
Simply put, the complexity to build, manage, monitor, and maintain the collection
of data asset dependencies for table G represents a higher cost than that of the
individual parts. Consider that the raw data represented by the bronze data assets
(A–C) is expected to survive only as long as necessary in order to be accessed and
further refined, joined with, or generally utilized by the direct downstream data
consumers (D–F), and that the same expectations are in turn made of our silver-tier
data assets by the gold tier—they must exist only as long as they are needed,2 and they
must provide a simplification and general increase in data quality the further down
the lineage chain they go.
A helpful way of thinking about the end-to-end lineage is through the lens of data
products.
2 Drawing a line between data value and data hoarding is difficult. If there is value yet to be discovered, then I
would suggest keeping that data in bronze, or archiving it for a later point in time.
Figure 12-4. Data products are the sum of all their parts (adapted from Zhamak
Dehghani)
Zhamak Dehghani introduced the novel idea of data products as part of her archi‐
tectural paradigm the data mesh, where she proposed a rule that any curated data
product must be purpose-built and capable of being used as is without requiring
additional joins to other tables.3 Essentially, the expense and effort of producing the
data product should be paid in full on behalf of the consumers of the data product
itself. This rule also helps tie together the simple idea that a data product is tied to a
service, and that service is the production of useful, fit-for-purpose data. You can still
The set of operations and actions that a principal can execute on a data asset falls
under the umbrella of data definition language (DDL), which contains the CREATE,
ALTER, and DROP operations, and data manipulation language (DML), which
enables the INSERT and UPDATE actions. The ability to execute one or more of these
actions and operations is managed using data control language (DCL) by way of
GRANT and REVOKE statements.
Nowadays, data assets have evolved to also encapsulate other resources that require
access and use control (authorization) policies governing how they can be interacted
with—for example, dashboards, queries (which in turn power dashboards), note‐
books, machine learning models, and more.
While the size and scale of data operations continues to grow across the globe,
the paradigm of using simple GRANT and REVOKE privileges to control both access
and authorization of data assets is still the simplest path toward adopting a unified
governance strategy. Challenges arise almost immediately as we begin to consider
interoperability with systems and services that simply don’t speak SQL.
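For the systems that do speak SQL, the statements themselves remain simple. The sketch below assumes a metastore that supports SQL-based access control (such as Unity Catalog); the group and table names are placeholders:
# Python
spark.sql("GRANT SELECT ON TABLE consumer.prod.orders TO `analysts`")
spark.sql("GRANT MODIFY ON TABLE consumer.prod.orders TO `order_pipeline`")

# Revoking works the same way in reverse
spark.sql("REVOKE MODIFY ON TABLE consumer.prod.orders FROM `order_pipeline`")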
Permissions Management
Just below the surface of the lakehouse lies the data lake. As we all know by now, the
data lake is a data management paradigm that assists in the organization of raw data
using primitives from the traditional filesystem. In most cases, cloud object stores are
used, and at the root of these elastic systems are buckets containing objects in a flat
structure.
Buckets encapsulate a resource root "/" representing a logical structure similar to
the standard filesystem, but within a cloud object store. Figure 12-6 shows the
breakdown of the bucket into its constituent parts. For example, just off the root we
have top-level directories (paths and partitions) and their underlying files.
Figure 12-6. Data lakes are commonly built using cloud object stores. The primitives
for these collections begin with the bucket, or root of the filesystem, and descend
in an orderly fashion across directories and their subcollections of files or additional
directories.
In addition to all other types of unstructured and structured data, the data lake stores
our managed (or unmanaged) Delta tables. So we have many possible kinds of files
stored behind the scenes.
Understanding how to secure the underlying filesystem from unauthorized access is
critical for lakehouse governance, and luckily, SQL-like permissions share a similar
data management paradigm to that of the classic operating system (OS) filesystem
permissions—access to files and directories is controlled using users, groups (akin to
roles), and permissions granting read, write, and execute actions.
Filesystem Permissions
The OS running on our laptops and the OS running remotely on servers we’ve provi‐
sioned share similar access and delegation patterns. For example, it is the responsibil‐
ity of the OS to oversee the distribution of finite resources (compute, RAM, storage)
among many short- and long-lived processes (operations). Each process is itself the
result of executing a command (action), and the execution is associated with a user,
group, and set of permissions. Using this model, the OS is able to construct simple
rules of governance.
Let’s look at the ls command as a practical example:
% ls -lah /lakehouse/bronze/
The output of the command is a listing of filesystem resources (files, directories) as
well as their metadata. The metadata includes the resource type (file or directory), the
access mode (permissions), references (resources relying on this resource), ownership
(user), and group association, as well as the file size, the last modified date, and the
filename or directory name:
File type
This is represented by a single character. Files are represented by a –, while
directories are represented with d.
4 See “Convert a Parquet Table to a Delta Table” in the Delta Lake documentation.
• What is the identity (user) of a given runtime process, and how does that apply to
the traditional user permissions model?
• How can we enable access to one or more cloud-based resources?
• Once identified, in what ways can we authorize specific actions and operations to
occur for a given user?
The paradigm shifts away from classic filesystem permissions (user, group, permis‐
sion) and into a more flexible system called identity and access management, or IAM
for short.
Identity
Each identity represents a user (human) or a service (API, pipeline job, task, etc.).
Identities encapsulate both individual users as well as service principals, who are
jokingly referred to as headless users, since they are not human but still represent a
system doing things on behalf of a user. An identity acts like a passport, certifying the
legitimacy of the user. In addition, the identity is used to connect the user to a set of
permissions through the use of policies.
It is common to see access tokens issued for individual users, and for both long-lived
tokens and certificates (certs) to be issued for service principals.
Authentication
While an identity might be legitimate, the whole point of authentication is to test to
be absolutely certain. Most systems issue (generate) keys or tokens for only a specific
period of time; this forces the identity to reauthenticate from time to time, proving
Authorization
The identity and authentication mechanics come together to provide a guarantee
that a user isn’t simply an imposter. These two concepts are tightly coupled to the
authorization process. Authorization is akin to GRANT permissions. We can assume
that we know the identity of a user (since they have passed the test and proved that
they are who they say they are), as they were able to gain entrance to the physical
location of our resource (using a key, cert, or token to access data assets in the
lakehouse). The authorization process is the bridge between the user and a set of
policy files that describe what a user is allowed to do within a given system.
Access management
In a nutshell, access management is all about providing methods to control access
to data and enforce security checks and balances, and it is the cornerstone of gover‐
nance. Access controls provide a means of identifying what kinds of operations and
actions can be executed on a given resource (data asset, file, directory, ML model)
and provide capabilities to approve or deny based on policies.
The entire process of creating a user (identity), issuing credentials (tokens), and
authenticating and authorizing access to resources is really no different than the
GRANT mechanisms—the reverse being REVOKE, which would invalidate active creden‐
tials. No process is complete without the ability to also remove an identity, which
completes the full-access life cycle.
IAM provides the missing capabilities enabling the implementation of GRANT-like
permissions management for our lakehouse through the use of identities and access
policies.
In the next section, we’ll look at access policies and see how role-based access controls
help simplify data access management through the use of personas (or actors), and
we’ll learn about creating and using policies as code.
Data Security
There are many pieces to the governance story, and in order to effectively scale a
solution, there are important rules and ways of working that must be established up
front—or carefully integrated into an existing solution.
For example, you might be familiar with the duck test: if it looks like a duck, swims like
a duck, and quacks like a duck, then it probably is a duck. This refers to our ability to
So remember: we always need to keep in mind the who, what, where, and how, as well
as the if.
Think about it this way: If we grant access to a given identity (who), then (what)
operations are necessary to accomplish a given set of tasks (how), and in what environ‐
ment (where) do they need user-level access versus headless user access? And last, what
potential risks are involved in granting read-level versus read-write-level access?
Additionally, aside from the considerations around whether access should be granted,
the other question that must always be back of mind is whether the identity is allowed
to view (read) all the data residing in the table. It is common to have data that is
divided into groups based on the security and privacy considerations for the data
access.
We will look at data classification patterns next.
Data classification
The following classifiers are a useful way to identify what kind of data is stored within
a resource at a specific location in the data lake.
As a simple abstraction, let’s think about data classification in terms of the stop-light
pattern. A stop light signals to a driver to continue (green), slow down (yellow), or
stop (red). As an analogy, when thinking about governing access to our data assets,
the stop-light pattern provides a simple mental model to tag or label (identify) data
that can be green, yellow, or red.
For example, access to data classified as “green” could be automated, assuming there
are appropriate checks in place to ensure the resources are not leaking sensitive
data. A practical example for “green” would be the earthquake and hazard data made
generally available by the United States Geological Survey.
Access to data classified as “yellow” or “red” would require the grantee to consider
who would have access, why they would need access and for how long, and how the
access could benefit or harm the organization. When in doubt, always consider the if.
If we grant access to this data, do we trust the grantee(s) to do the right thing?
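One lightweight way to make the stop-light classification actionable is to record it directly on the table as a property that downstream tooling and policies can read. The property key below is purely illustrative, following the catalog.table.* convention used in Chapter 13:
# Python
spark.sql("""
  ALTER TABLE consumer.prod.orders
  SET TBLPROPERTIES ('catalog.table.classification' = 'yellow')
""")

# Tooling can then read the classification back before granting access
spark.sql("SHOW TBLPROPERTIES consumer.prod.orders").show(truncate=False)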
Establishing rules and common ways of working can help to ensure that data is
classified in a common way, reducing decision making to a scientific process:
General access
This classification assumes the data is available to a general audience. For exam‐
ple, let’s say Complete Foods believes it can sell more groceries by enabling
services like Instacart, Uber Eats, and DoorDash to access our inventory data. By
enabling open access—sign up, get a token, and hit the Delta sharing endpoint—
we can ensure that any external organization can access specific tables associated
with the general access role limited to read-only.
Stop-light pattern: Green-level access
Restricted access
This classification assumes data is read-only, with approval on a need-to-know
(use) basis. Continuing the Complete Foods example from before, while external
access to the inventory data (via the general-access classification) enables a mutu‐
ally beneficial relationship to extend the reach of our grocery business and brand,
there is data that represents our competitive advantage that must remain internal
only, or restricted to external domains.
├── s3://com.common_foods.[dev|prod]
└── common_foods
├── consumer
│ ├── _apps
│ │ └── clickstream
│ │ ├── app.yaml
│ │ └── v1.0.0
│ │ ├── _checkpoints
│ │ │ ├── commits
│ │ │ ├── metadata
│ │ │ ├── offsets
│ │ │ └── state
│ │ ├── config
│ │ │ └── app.properties
│ │ └── sources
│ │ └── clickstream_app_v1.0.0.whl
│ └── clickstream
│ ├── _delta_log
│ ├── event_date=2024-02-17
│ └── event_date=2024-02-18
├── {table}
The lakehouse namespace pattern allows us to colocate our data applications along‐
side the physical Delta tables they produce. This reduces the number of policies
required to manage the basics such as team-based access, line-of-business-level data
management, and other concerns, like which environment to provide access to. When
everything is done correctly, the development environment can act as a proving
ground for new ideas, primed with mock data and built using anonymized produc‐
tion data (there is higher risk here, so remember the who, what, where, how, why, and
if rules), and having two environments separated by a physical bucket makes it easier
to follow the stop-light pattern, since dev and staging are traditionally all-access,
while our production environment is almost always justifiably yellow- or red-level
access, at least when it comes to personal data.
Create an S3 bucket. The S3 bucket will act as a container encapsulating our produc‐
tion lakehouse. Using the Amazon CLI (shown in Example 12-2), we set up the
bucket and call it production.v1.
% ACCOUNT_ID="123456789012" && \
aws s3control create-access-grants-instance \
--account-id $ACCOUNT_ID
Create the trust policy. A trust policy must be created to allow the AWS service (identi‐
fied by the service access-grants.s3.amazonaws.com) permissions to generate temporary
IAM credentials using the GetDataAccess action on an S3 resource. The trust policy
is shown in Example 12-4.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "access-grants.s3.amazonaws.com"
},
"Action": [
"sts:AssumeRole",
"sts:SetSourceIdentity",
Create the S3 data access policy. The last step is simply to associate the generic read and
write permissions on our S3 bucket:
% aws iam put-role-policy --role-name s3ag-location-role \
--policy-name s3ag-location-role --policy-document file://iam-policy.json
The iam-policy.json file is included in the book’s GitHub materials for this chapter.
Now that we have established the S3 Access Grants, we can move on to simplifying
how we manage read and read-write permissions, or even admin-level permissions,
for resources in our lakehouse.
Read. This will authorize read-only capabilities on a resource, or the ability to view
metadata about a given data asset, including the table properties, ownership, lineage,
and other related data. This capability is required to view the row-level data within
a table, list the resources contained within a bucket prefix (filesystem path), or read
table-level metadata. Example 12-5 shows how to use SQL Grants to enable READ for
our analystRole.
$ export ACCOUNT_ID="123456789012"
ReadWrite. In addition to the actions provided by read, the write capabilities add
modify capabilities enabling the actor (identity) to insert (write) new data, update
table metadata, and delete rows from a table. The simple policy is shown in
Example 12-6.
$ export ACCOUNT_ID="123456789012"
export GRANT_ROLE="role/developerRole"
Admin. In addition to the capabilities managed by readwrite, the admin role author‐
izes an actor to create—or delete—a data asset located at a specific location. For
example, it is common to restrict destructive capabilities to only service principals;
similarly, creating resources most often also means additional orchestration to man‐
age and monitor a resource. Since headless users can act only on behalf of a user,
this means they can only run workflows and commands and execute actions and
operations that already exist. In other words, the service principal can trigger a
specific action based on some external event, reducing the surface area of accidental
“oops.” It is best to use traditional IAM policies to control access to create and destroy
lakehouse resource locations.
Limitations of RBAC. There are, of course, limitations when simply using roles alone
to manage access; mainly what tends to happen is an explosion of roles. This can be
considered “sprawl,” and it is an unforeseen side effect of success. Let’s be honest: if
there are only four lines of business, and you have four supporting roles (developer,
analyst, scientist, business), then you are looking at a max of 4 × 4 × n (with n being
the number of tables within a line of business that require special rules to govern
access) to handle the requirements of general governance across the company. What
happens when you go from four lines of business to twenty? What about fifty? It
is the what-ifs that define what to do next. If we are lucky and the company has
taken off, and we’ve hired well and managed to maintain a robust set of engineering
disciplines and practices, then we could technically begin to pivot into attribute-based
access control (ABAC). This is also known as tag-based policies and can also live
under the umbrella of fine-grained access controls.
Example 12-7. Using dynamic views and tags for fine-grained access controls
-- SQL
CREATE VIEW consumer.prod.orders_redacted AS
SELECT
  order_id,
  region,
  items,
  amount,
  CASE
    WHEN has_tag_value('pii')
      AND is_account_group_member('consumer_privileged')
    THEN user
    ELSE named_struct(
      'user_id', sha1(user.user_id), 'email', null, 'age', null)
  END AS user
FROM consumer.prod.orders
Conclusion
The way we govern, secure, and store the precious assets inside our lakehouse can be
complicated, complex, or simple; it all depends on size and scale (or the number of
tables and other data assets) and at what point in time we realize the need for a more
complete governance solution. No matter the point in the journey, start small—begin
by creating separation between data catalogs at the bucket level to separate all-access
data from highly sensitive data. Layer into your solution ways of synchronizing what
people need from the data and what systems and services will need from that same
data, and roll this into your strategy for who, what, and when.
In the next chapter, we will continue to look at metadata management, data flow, and
lineage and round out what we started in this chapter.
Metadata Management
Have you ever been lost in the woods, or been driving in a new place without GPS
or even an old-school map? Being lost is something we all have in common, and
the same feeling can be expressed by data teams who are just trying to get to a set
of tables they know should exist. But where are those tables? Metadata management
systems provide the missing components between being lost and having directions.
In our case, the location we are trying to get to is a set of known tables within one
or more data products that we can trust to provide us with the correct information to
solve our data problem. The metastore and services built on top of this metadata, like
any data discovery services, act as a compass to help us reach our waypoint or final
destination. The metadata, which is our data about our data, is required to solve our
problem and can provide assistance when we are trying to arrive at the correct data
destination.
What Is Metadata Management?
Just as in data management, the life cycle of our metadata provides a way to keep
track of the data assets we hold near and dear, as well as notes, descriptions, com‐
ments, and tags. The centralized metadata layer—a foundational component of our
lakehouse data catalogs—provides a representation of an organization’s information
architecture. This includes the hierarchy represented by our catalog(s) and databases
(schemas) and the tables and views contained therein. This basic hierarchy was
presented in Chapter 12, when introducing the prefix patterns for organizational
success. The role of the metadata layer is to provide the necessary descriptive data to
produce a macro view across the entire lakehouse regarding the current state of all
data assets, and to provide a compass pointing to those data assets available for use.
Data Catalogs
Depending on where you sit within your organization, you may find there are many
interpretations of what a data catalog is. Essentially, in its most basic form, a data
catalog is a tool that enables a user to locate the high-quality data they need to
get their job done. At a minimum, the data catalog provides information about the
components of a data product—the catalog, database (schema), tables, and views—
along with a simple search component called the data discovery layer (or service).
The data catalog is used in the same way someone shopping at IKEA would use integrated
search to locate something they want, be it a couch, table, or chair; this is very different from
how someone would look through a paper catalog—there are expectations. For data, people
have a general idea of what they need, and a good data catalog makes the journey simple.
—Andy Petrella
There are many different ways to solve the problem of looking things up, and what
we are solving for and the definition of the problem should be actionable and based
on real customer use cases.
For instance, we could create a manual list of all tables; the solution could be a simple
shared spreadsheet—with the known limitation of the shared spreadsheet being the
need to ensure that someone keeps the metadata up to date. This book is about
solving problems, so the prior example is more an example of what not to do, but it
might also be the simplest solution, depending on the size of your organization, and
it ticks the boxes of enabling a user to search (filter) the spreadsheet (basic metastore)
to narrow the set of tables and hopefully find (locate) what they need.
Figure 13-1 provides a high-level overview of the Hive Metastore. The metastore
itself is a set of tables (>70) that enable the magic that provides us with a catalog
of the where and what of our tables. However, the metastore is responsible only for
storing the referential database and table data, including the location of these data
assets with respect to their path on our cloud storage. The metadata also includes the
table type, partitions, columns, and the schema.
While the basic information about the table is nice to have, we are missing a consid‐
erable amount of information that is really needed to operate our lakehouse at scale—
not to mention, this is a book on Delta Lake and not on all table types or supported
protocols. So we can safely ignore most of what the Hive Metastore provides, given
that each Delta table contains a reference to its own metadata.
What the Hive Metastore provides to Delta for our lakehouse is the ability to identify
the databases (schemas) and tables contained at a given cloud-storage prefix without
requiring the object tree to be manually listed (which can be an expensive operation).
Given that we have the Delta log (recording the table history), as well as the ability to
fetch isolated snapshots of our tables (using time travel, or just for the current version
of the table), we have limited use for the Hive Metastore outside of the general
“listing” of catalogs, schemas, and the tables that reside within a known instance of
the metastore.
Unity Catalog
Unity Catalog is a universal catalog for data and AI. There are two versions of Unity
Catalog at the time of writing—the internal proprietary version within Databricks
and the open source software (OSS) version. The OSS version is interoperable with
the Databricks version and provides the following key features: the metastore, a
three-tiered namespace, governed assets, managed and unmanaged volumes, interop‐
erability, and true system openness:
Metastore
Unity Catalog utilizes a centralized metadata layer called the metastore. This
provides the ability to catalog and share data assets across the lakehouse, within
regions, and even across clouds. Additionally, the metastore provides a three-
tiered namespace in which data can be organized.
Three-tiered namespace
The namespace within Unity Catalog provides the following convention:
{catalog}.{database/schema}.{table}. The namespace is a component of the
metastore and enables us to organize our data and assets hierarchically.
The hierarchy is used for more than simple organization; it enables our data
applications to read and join across boundaries that traditionally required copy‐
ing data between Hive tables due to limitations of the two-tiered Hive name‐
space. Enabling a single job to read from multiple catalogs makes it simple for
our data applications to join data between many tables residing across many
catalogs.
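As a small illustration of what this unlocks, a single job can read and join across catalogs with nothing more than the three-tiered names; the catalog, schema, and table names here are hypothetical:
# Python
orders = spark.table("consumer.prod.orders").alias("o")
inventory = spark.table("supply_chain.prod.inventory").alias("i")

# Join across two catalogs without copying data between metastores
low_stock = (
    orders
    .join(inventory, "sku")
    .where("i.on_hand < o.quantity")
)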
Data Lineage
The purpose of data lineage is to record the movements, transformations, and refine‐
ments along a data journey, from the point of initial ingestion (data inception) within
the lakehouse to the data’s final destination—which can take the form of insights
and other BI capabilities—or to provide a solid foundation for mission-critical ML
models. Consider data lineage to be a sort of flight recorder, capturing important
moments in time across our critical data applications—producing our data assets—
with the purpose of being used to provide a measure of data quality, consistency, and
overall compliance and to track the many data dependencies along that processing
line.
The lineage of the many data sources and associated data applications comes together
to provide an observable lens into the dependencies for our data products at runtime.
In addition to helping with understanding the dependency graph, data lineage helps
to ensure data teams understand the when, where, and why if any problems are
experienced at runtime. Even with the best of intentions, things do inevitably go
wrong, and flying blind is never a good look!
1 Andy Petrella, Fundamentals of Data Observability: Implement Trustworthy End-to-End Data Solutions
(O’Reilly), 44.
Figure 13-2. A starting point for data flow visualizations using data lineage
At the most rudimentary level, data lineage can be captured as a graph of sources
to tables (or other data assets). However, this would ignore the fact that there are
data applications (2) running to produce all tables other than the initial ingestion
sources (1)—with respect to Figure 13-2. Therefore, we have both the concept of data
lineage and that of data application lineage to consider.
Leaning on the data lineage to view the data flow allows us to quickly visualize “what
changed” or to see “what is no longer behaving as expected,” which can help to
mitigate risk. To understand what changed, we need to go back to data application
lineage (or workflow lineage).
Consider the fact that data doesn’t simply exist in the lakehouse but requires a
process (Job) to execute (Run) in order to ingest an initial table (Dataset) or to make
modifications from one or more upstream tables (Datasets) in order to produce a
new table or set of tables. This pattern and operating model essentially tracks the
operational behavior of any data pipeline or simple data flow, as viewed through the
lens of a data application (like we saw in Figure 13-2).
Example 13-1. Setting up the OpenLineage client to send start and complete events
from datetime import datetime
from typing import List

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient.from_environment()
producer = 'common_foods.consumer.clickstream'
job_name = 'consumer.clickstream.orders'
datasets = {
    'clickstream': Dataset(namespace='consumer', name='consumer.clickstream')
}

def emit_complete(
        client, run, job, producer,
        inputs: List[Dataset], outputs: List[Dataset]):
    run_event = RunEvent(
        RunState.COMPLETE,
        datetime.now().isoformat(),
        run, job, producer,
        inputs, outputs,
    )
    client.emit(run_event)
The code in Example 13-1 requires manual effort to construct string names and
naming conventions in order to identify the data producer and the datasets, and
to handle the construction of the Dataset, Job, Run, and RunEvent identifiers. Over
time, it is much easier to use standard libraries and runtime environment variables,
or common configurations, to streamline the generation of these lineage objects and
remove the requirements of manual engineering effort—this helps to mitigate the risk
that jobs end up reusing names and breaking the lineage. Just like with the “what not
to do” covered in “Data Catalogs” on page 298, problems will arise when we ignore
automation or convention-based engineering.
% python
_app: Application = Application.fromEnv()

@lineage.record(
    app=_app,
    git=_app.git,
)
def run(df: DataFrame) -> StreamingQuery:
    …
While the code in Example 13-2 is just a snippet, it provides enough information
to facilitate the generation of the Dataset, Job, Run, and RunEvent objects needed to
track lineage via OpenLineage.
The way that data flows through the lakehouse and between our Delta Lake tables
by way of our data applications ultimately provides the building blocks to create
high-trust data products in a dynamic way—just like water moving between streams,
rivers, and deltas and into reservoirs. Just like in nature, there will always be ebbs
and flows, and ultimately certain areas that used to provide many downstreams will
eventually dry up—but with the end of any data product, or the deprecation of
an older source of data truth, there will always be new sources and new ways of
connecting the data dots.
This is the beauty of capturing data lineage: when it is done correctly, the informa‐
tion provides a real-time or last “active” state of the what, when, and how, using a
narrow or wide lens. This additional lineage-based metadata can then be combined
Data Sharing
What does it mean to share data or a data asset? In the simplest way, we provide the
ability for a known identity (a stakeholder, customer, system, or service) to consume
a collection of data by reading it directly from our single source of data truth. For
our Delta tables, this means providing the capabilities to a known identity to read the
Delta transaction log and generate a snapshot of the table so they can execute a table
read.
There are many reasons why we would want to make our data available to others—for
example, we may be able to monetize our data to provide insights not available to
other companies (as long as it abides by data use laws and isn’t creepy), or we may
need to provide data to our partners or suppliers, which is often the case in retail.
And in the case of data that isn’t exiting our company, sharing data between internal
lines of business is critical to ensuring that everyone references the same sources of
data truth.
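For data that does leave the organization, the open Delta Sharing protocol lets a recipient read a shared table directly. A minimal sketch with the delta-sharing Python connector follows; the profile file and the share, schema, and table names are placeholders supplied by the data provider:
# Python
import delta_sharing

# The profile file holds the sharing server endpoint and the bearer token
profile = "/path/to/config.share"
table_url = f"{profile}#retail_share.inventory.store_items"

# Read the current snapshot of the shared Delta table
df = delta_sharing.load_as_pandas(table_url)
print(df.head())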
% spark.sql(f"""
ALTER TABLE delta.`{table_path}`
SET TBLPROPERTIES (
'catalog.table.gov.retention.enabled'='true',
'catalog.table.gov.retention.date_col'='event_date',
'catalog.table.gov.retention.policy'='interval 28 days'
)
""")
% python
import re

from pyspark.sql.functions import lit, make_dt_interval

def convert_to_interval(interval: str):
    target = interval.lower().lstrip()
    if target.startswith("interval"):
        target = target.replace("interval", "").lstrip()
    number, interval_type = re.split(r"\s+", target)
    amount = int(number)
    # Place the amount into the matching slot for make_dt_interval
    dt_interval = {"days": 0, "hours": 0, "mins": 0, "secs": 0}
    unit = {"day": "days", "hour": "hours",
            "minute": "mins", "second": "secs"}[interval_type.rstrip("s")]
    dt_interval[unit] = amount
    return make_dt_interval(
        days=lit(dt_interval["days"]),
        hours=lit(dt_interval["hours"]),
        mins=lit(dt_interval["mins"]),
        secs=lit(dt_interval["secs"])
    )
The Python function from Example 13-4 can now be used to extract the
catalog.table.gov.retention.policy rule in the form of an Interval from a Delta
table. Next, we will use our new convert_to_interval function to take a Delta
table and return the earliest date that is acceptable to retain. This can be used to
automatically delete older data from the table, or even just to mark the table as out of
compliance. The final flow is shown in Example 13-5.
% python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_date

table_path = "..."
dt = DeltaTable.forPath(spark, table_path)
props = dt.detail().first()['properties']

# Table properties are stored as strings, so compare rather than cast to bool
table_retention_enabled = (
    props.get('catalog.table.gov.retention.enabled', 'false').lower() == 'true')
table_retention_policy = props.get(
    'catalog.table.gov.retention.policy', 'interval 90 days')

interval = convert_to_interval(table_retention_policy)
rules = (
    spark.sql("select current_timestamp() as now")
    .withColumn("retention_interval", interval)
    .withColumn("retain_after", to_date(col("now") - col("retention_interval")))
)
rules.show(truncate=False)
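To close the loop, a minimal sketch of enforcing the policy (an assumption rather than one of the chapter's numbered examples) can delete any rows older than the computed retain_after date, using the date column named in the table properties:
% python
from pyspark.sql.functions import lit

retain_after = rules.first()["retain_after"]
date_col = props.get("catalog.table.gov.retention.date_col", "event_date")

if table_retention_enabled:
    # Delete expired rows; alternatively, skip the delete and simply flag the
    # table as out of compliance so the owning team can follow up.
    dt.delete(col(date_col) < lit(retain_after))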
Audit Logging
Audit is another critical component and important lens required for compliance within the lakehouse. Because each data asset has a specific set of rules (policies) and entitlements that must be enforced for compliance’s sake, we must provide a simple way to query the access and permissions change log, as well as the general audit log of resources being created or removed from the lakehouse.
Thinking along the lines of what operations need to be recorded, we can capture specific actions within the lakehouse much like a flight recorder does, similar to how we record data as it flows to generate end-to-end lineage. Rather than tracking the journey in terms of the data life cycle and how the data flows through the data network making up the lakehouse, we are recording activity regarding the state changes for our data management.
In Chapter 12 we explained that audit logging can be as simple as capturing changes
in the behavior of the lakehouse—for example, when there are changes to the roles or
policies for critical operations on highly controlled resources like catalogs, databases,
and tables.
Additionally, it is important to track operations for data in flight to provide a source
of data (metrics) to help identify anomalies that can in turn help mitigate risks and
identify threats or the potential for bad actors to take advantage of holes in security.
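For the table-level portion of that flight recorder, the Delta transaction log already records who changed what and when. A minimal sketch (an assumption, not one of the chapter's examples) of surfacing it as an audit view follows:
% python
# Use the Delta table history as a lightweight audit trail of state changes.
from delta.tables import DeltaTable

audit = (
    DeltaTable.forPath(spark, table_path)
    .history()  # one row per commit: version, timestamp, operation, userName, ...
    .select("version", "timestamp", "userName", "operation", "operationParameters")
    .where("operation IN ('SET TBLPROPERTIES', 'CREATE TABLE', 'DELETE', 'VACUUM END')")
)
audit.show(truncate=False)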
For general compliance monitoring, table properties can again describe each table’s expectations and refresh SLAs:
% spark.sql(f"""
ALTER TABLE delta.`{table_path}`
SET TBLPROPERTIES (
'catalog.table.deprecated'='false',
'catalog.table.expectations.sla.refresh.frequency'='interval 1 hour',
'catalog.table.expectations.checks.frequency'='interval 15 minutes',
'catalog.table.expectations.checks.alert_after_num_failed'='3'
)
""")
Using the techniques introduced in Examples 13-3 through 13-5, we can leverage a simple pattern to automatically run checks for a given table. The theory here is that unless a table is deprecated, there should be a known data service-level agreement (DSLA) or, at a minimum, specific data-level objectives (DLOs) and data-level indicators (DLIs). With respect to our data assets (tables specifically), our downstream consumers tend to want to know the frequency with which data “becomes” available, or how often it is refreshed.
When deciding between batch processing and microbatch processing, it usually comes down to the expectations of one or more upstream data sources. If nearly all sources refresh in under 15 minutes but one source updates only daily, and you need all of the data to provide specific data answers, you’ll always be stuck in batch processing mode or be wasting money waiting on the laggard dataset. Making it easier to understand the average update frequency for a given table (without requiring meetings) can empower engineers and analysts to decide whether streaming or batch processing makes the most sense to solve a problem.
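One lightweight way to surface that average update frequency is to derive it from the table’s own commit history. The following is a minimal sketch (an assumption, with illustrative operation names) rather than a prescribed pattern:
% python
# Estimate how often a table refreshes by measuring the gap between commits.
from delta.tables import DeltaTable
from pyspark.sql.functions import avg, col, lag, unix_timestamp
from pyspark.sql.window import Window

history = (
    DeltaTable.forPath(spark, table_path)
    .history()
    .where(col("operation").isin("WRITE", "MERGE", "STREAMING UPDATE"))
)

w = Window.orderBy("timestamp")
cadence = (
    history
    .withColumn("prev_ts", lag("timestamp").over(w))
    .withColumn("gap_seconds",
                unix_timestamp("timestamp") - unix_timestamp("prev_ts"))
    .agg(avg("gap_seconds").alias("avg_seconds_between_refreshes"))
)
cadence.show()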
Then when things go wrong, or when your pipelines stall due to “no new data” from your upstreams, you can check the DLOs for the laggard tables to understand what might have changed. Hopefully, if we’ve also incorporated data application lineage, tracing the stall back to the upstream application and its owners becomes a much quicker exercise.
For data discovery, a solution to the problem can be as simple as adding the table
metadata (ownership and rules, as well as immediate upstream and downstream
lineage) to an ElasticSearch index. If we wanted to layer in additional capabilities
to the discovery engine—whether catalogs, databases/schemas, or other data asset
types—we would only need to modify the types of metadata in our index and modify
the search parameters to handle more complex discovery. Depending on the size and
number of assets being maintained, the solution could be scaled accordingly, but for
fewer than one million data assets, a simple ElasticSearch index would take us a very
long way.
Considering what sorts of answers the customers of the lakehouse would be searching
for can help inform what it means to be successful. In some cases, having validated
“highly reliable” tables or “verified” owners is a useful step to reduce the number of
tables matching the search criteria. As long as the process to get a specific tag or
badge is a controlled process (meaning not just anyone can add their own tag), then
the customers will trust that the process can’t be gamed. If nothing else, think about
how to balance complexity in terms of moving parts for the data discovery solution:
How many sources of metadata need to be indexed, and how often? Is there a simple
way to be notified when things change? Can we automate the process?
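As a minimal sketch of that starting point (assuming the elasticsearch Python client; the index name and document fields here are illustrative), pushing a table's descriptive metadata into the discovery index can be as small as:
% python
from delta.tables import DeltaTable
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative cluster endpoint
detail = DeltaTable.forPath(spark, table_path).detail().first()

doc = {
    "table_path": table_path,
    "name": detail["name"],
    "description": detail["description"],
    "properties": detail["properties"],        # ownership, rules, tags
    "last_modified": str(detail["lastModified"]),
}
es.index(index="lakehouse-catalog", id=table_path, document=doc)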
Conclusion
This chapter explored the value of metadata within the context of the lakehouse.
Specifically, we looked at how metadata management acts as a critical component of
the lakehouse platform and at how to utilize basic data asset information to capture
more complex data flows through the use of data lineage. We spent time investigating
how data lineage can be enhanced with data application lineage to enable context-
aware insights, and we concluded with a brief overview of data discovery. In the
next and final chapter, we will be looking at how data sharing with Delta Sharing
completes the final component required for comprehensive lakehouse governance
and security.
When we rely on traditional data export workflows to share data, we must also deal with the expectations of all the teams represented by each downstream location to which we are actively exporting data.
Imagine that parts (partitions) of our table are exported to support 40 separate
external locations, with each location representing a different cloud storage bucket, or
prefix within a given bucket. Now, for each of our 40 separate locations, we include
the added constraints of minimal permissions, and the sneaky problem of invalid
(revoked) access permissions. What happens if we need to replace data in one or
more partitions due to system failures or faults? Things tend to go wrong the more
complex a system grows. Not to mention, there is a cost associated with reprocessing the data for each downstream location, and that cost is incurred on both sides, as egress and ingress—when all along there has been an active single source of data truth represented by the original Delta table.
The problem described above is the issue with distributed synchronization—given
that we can’t assume that each export job will always succeed, we therefore must also
carry state alongside each of our simple export jobs. So the simple act of periodic
data export can easily become a complex and fragile process. Now for the good news:
this chapter introduces the Delta Sharing Protocol, which is purpose-built to provide
a secure and reliable way to share our Delta tables, regardless of where each table
originates, and regardless of which cloud storage provider is used to store the table.
Figure 14-1. The relationship between the data provider and the data recipient
The relationship between the data provider and the data recipient can be thought
of as being the same as the relationship between the data producer and the data
consumer. On one side, the owner of the table or view is responsible for delegating
a share. This share represents a presigned acknowledgment that the consumer of the
data (the recipient) can access the Delta tables contained within the configuration of
the respective share. Now let’s look more closely at the notion of shares and recipients
through the lens of data providers and recipients.
Data Providers
Data providers are responsible for managing access to their data products through
the use of a share. A share represents a logical grouping of schemas, and of the
tables or views accessible within each schema, to be shared with the recipients. Each
recipient is an abstraction over an identity, known as a principal, which can act on
behalf of a user, system, or service to provide read-only access to the tables or views
allowed by a share (which we will go into in the next section).
Each share can be shared with one or more recipients, and each recipient can access
all resources contained within a share. To put this information into perspective, an
example share configuration is presented in Example 14-1. The share itself is configured in a similar way to an IAM-based policy file, providing the specific location of the tables or views that the recipient can access while reducing the complexity of managing cross-cloud (or on-prem) identity and access management (IAM). Lakehouse security and governance are covered in earlier chapters, if these concepts are new and a refresher is required.
version: 1
shares:
- name: "consumer_marketing_analysts_secure-read"
  schemas:
  # illustrative completion; the remainder of Example 14-1 is truncated here
  - name: "consumer_marketing"
    tables:
    - name: "clickstream_hourly"
Example 14-1 enables the recipient—in this case, the consumer marketing analysts—
to access hourly clickstream data. The configuration itself can contain many different
shares representing many different policies for many different recipients, and for each
uniquely identified share, a collection of one or more schemas can be configured,
with one or more tables or views per schema. This pattern enables us to simplify
access controls through the use of logical groups. We will be looking into how this
configuration is used later in the chapter when we explore the Delta Sharing server.
Data Recipients
The recipient of a share is a principal identified by a bearer token. While we go into
much more detail regarding identity and access management in earlier chapters, it
is worth pointing out that a principal represents a known identity. The identity can be at the user level, represent a logical group like a team or even an entire department or business unit, or be strictly headless—meaning it represents a nonhuman system or service acting on behalf of a human (hence the headlessness).
All of the information required to authenticate against the Delta Sharing server is
packaged for the recipient in a simple profile file. Example 14-2 introduces us to the
format of the profile, which is represented by a JSON object.
{
"shareCredentialsVersion": 1,
"endpoint": "https://commonfoods.io/delta-sharing/",
"bearerToken": "<token>",
"expirationTime": "2023-08-11T00:00:00.0Z"
}
The profile contains all the information necessary to authenticate with the Delta
Sharing server from the delta-sharing client:
shareCredentialsVersion
The file format version of the profile file. This version will be increased whenever
non-forward-compatible changes are made to the profile format. When a client
is running an unsupported profile file format version, it should show an error
message instructing the user to upgrade to a newer version of their client.
endpoint
The URL of the sharing server.
bearerToken
The bearer token to access the server. This is just an opaque OAuth 2.0 token.
The contents of the token can be as simple as a hash, or it can hold meaning, as
with JWT tokens. It all depends on the authentication mechanism used and on
whether we’re using unstructured or structured tokens.
expirationTime
The expiration time of the bearer token in ISO-8601 format. This field is
optional, and if it is not provided, the bearer token can be seen as never expiring.
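To see how these fields are consumed in practice, a recipient with the profile file saved locally can already read shared data with very little ceremony. The one-liner below is a minimal sketch (the profile path is illustrative; the share coordinates reuse the delta_sharing share shown later in this chapter):
% python
import delta_sharing

profile = "/path/to/open-datasets.share"  # a recipient profile like Example 14-2
table_url = f"{profile}#delta_sharing.default.owid-covid-data"

# load_as_pandas pulls the shared table down as a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())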
In the next section we will look at the Delta Sharing server. This service implements
the Delta Sharing Protocol and offers a simple-to-use REST API powering the
sharing service as well as the introspection API used by the Delta Sharing clients
themselves.
The REST APIs are intended to ensure that the Delta Sharing
Protocol can be implemented easily by folks building Delta Sharing
clients. If you are interested in using the Delta Sharing clients and
want to skip the REST APIs section, then just move ahead to the
section “Delta Sharing Clients” on page 332, as the rest of this section covers the REST API methods, all of which are encapsulated
by most of the Delta Sharing clients.
Imagine that the sharing service is split across several logical data domains: consumer, commercial, analytics, and insights. Now we can use name-based routing via the sharing prefix to
direct each request to the appropriate sharing endpoint, enabling each data domain to
fulfill a specific share-based request:
% https://{endpoint}/<consumer|commercial|analytics|insights>/{api-route}
Consider when we first introduced the recipient profile files. Whereas we previously
had a common route prefix named delta-sharing under the endpoint property of
the recipient profile file, we can now be more consistent with respect to where the
share lives within the distributed ecosystem:
{
"shareCredentialsVersion": 1,
"endpoint": "https://sharing.commonfoods.io/consumer/",
"bearerToken": "<token>",
"expirationTime": "2023-08-11T00:00:00.0Z"
}
Now the recipient profile file is specifically pointing to the consumer prefix. In the
case where we need to redirect or modify the prefix again in the future, we can
use simple DNS, or force the recipient to reauthenticate and receive a new profile
pointing to the new location endpoint.
When we use the sharing service to distribute requests across logical data domains,
we end up embracing the decentralized nature of how data is distributed across
natural organizational boundaries. This also makes it easier to scale based on specific
workloads, rather than needing to arbitrarily scale up to meet “any” demands.
Next, we’ll move on to the actual API methods and see how a recipient can explore the
capabilities associated with their unique share.
List Shares
REST APIs commonly provide a list resource—the request parameters are shown in
Table 14-1. In this case, the resource provides the means to view the variable number
of shares that have been configured and assigned to the recipient identified by the
provided bearer token on the request. Running the code in Example 14-3, we see how
simple it is to explore what data assets we have access to, beginning with the most
basic concept of the Delta Sharing Protocol—the humble share.
Example 14-3. Using the Delta Sharing Protocol to list configured shares
% export DELTA_SHARING_URL="https://sharing.delta.io"
export DELTA_SHARING_PREFIX="delta-sharing"
export DELTA_SHARING_ENDPOINT="$DELTA_SHARING_URL/$DELTA_SHARING_PREFIX"
export BEARER_TOKEN="faaie590d541265bcab1f2de9813274bf233"
export REQUEST_URI="shares"
export REQUEST_URL="$DELTA_SHARING_ENDPOINT/$REQUEST_URI"
export QUERY_PARAMS="maxResults=10"
curl -XGET \
--header "Authorization: Bearer $BEARER_TOKEN" \
--url "$REQUEST_URL?$QUERY_PARAMS"
The response from the sharing service will provide us with a list of the one or many
shares that have been configured for us, the recipient. The response to our request is
as follows:
{
"items":[
{"name":"delta_sharing"}
]
}
The object returned is a collection identified by items, with a single item representing
a share with the name of delta_sharing. The protocol also allows the share record to
contain an id field:
% {
"name": "<unique_share_name>",
"id": "<uuid_or_hash>"
}
If the optional id field is present, the value of the id must be immutable for the
lifetime of the share.
Using the shares as a starting point, we can introspect what is available in a given
share using the share introspection endpoint—in this case, we are going to see what
the delta_sharing share entails.
Get Share
Each share can contain one or more schemas, and within each schema, one or more
tables or views (or other data assets) can be configured. To view a share, we must first
use the list shares API to understand what shares are available for us to view. Next,
we just need to send our request to the API endpoint. Example 14-4 shows the full
request, while Table 14-2 shows the API request parameters required to complete the
request.
% ...
export REQUEST_URI="shares/delta_sharing"
export REQUEST_URL="$DELTA_SHARING_ENDPOINT/$REQUEST_URI"
curl -XGET \
--header "Authorization: Bearer $BEARER_TOKEN" \
--url "$REQUEST_URL"
The result of issuing the request to the get share endpoint isn’t much different from
the list shares endpoint:
% {
"share":{
"name":"delta_sharing"
}
}
The only change from the list shares endpoint is that the result is now a single object rather than an array of items. The results of this request are unique to the shares configured for a recipient.
Next we will look at how to introspect the schemas associated with the share itself.
% ...
export REQUEST_URI="shares/delta_sharing/schemas"
export REQUEST_URL="$DELTA_SHARING_ENDPOINT/$REQUEST_URI"
export QUERY_PARAMS="maxResults=10"
curl -XGET \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $BEARER_TOKEN" \
--url "$REQUEST_URL?$QUERY_PARAMS"
As we observed with the list shares endpoint, the list schemas endpoint provides
capabilities to paginate over an arbitrary number of schemas. While pagination may
not be required in all cases, the way pagination works is the same for all list resources:
% {
"items":[
{"name":"default","share":"delta_sharing"}
],
"nextPageToken": "..."
}
As we traverse the hierarchical tree from the share, now to the schemas, we are
essentially unwrapping the exact same structure that represents our actual share itself.
For context, look back at Example 14-1, where we learned to configure a share.
Next, we will learn to list the tables available underneath a specific schema, using the
default schema returned from the request in Example 14-5.
% ...
export REQUEST_URI="shares/delta_sharing/schemas/default/tables"
export REQUEST_URL="$DELTA_SHARING_ENDPOINT/$REQUEST_URI"
export QUERY_PARAMS="maxResults=4"
curl -XGET \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $BEARER_TOKEN" \
--url "$REQUEST_URL?$QUERY_PARAMS"
We see that the service returned four tables in the result and is honoring the maxResults query parameter. Because the nextPageToken is included in the response object, we can now return this to the service in order to fetch the next set of tables, as we see in Example 14-7. If there were no more results, the absence of the nextPageToken would tell us that we are at the end of the list.
% ...
export QUERY_PARAMS="maxResults=4&nextPageToken=CgE0Eg1kZWx0YV9zaGFyaW5nGgdkZWZhdWx0"
curl \
--request GET \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $BEARER_TOKEN" \
--url "$REQUEST_URL?$QUERY_PARAMS"
Given that the share is set up with a single schema (default), and underneath that schema there is a total of only seven tables—and because we are limiting the maxResults per request to just four tables—it takes us two requests to get the full list of tables:
% {
"items":[
{"name":"nyctaxi_2019","schema":"default","share":"delta_sharing"},
{"name":"nyctaxi_2019_part","schema":"default","share":"delta_sharing"},
{"name":"owid-covid-data","schema":"default","share":"delta_sharing"}
]
}
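Because every list resource paginates the same way, a client can simply loop until the token disappears. The sketch below (an assumption that uses the Python requests library, with illustrative endpoint and token values, and the same token parameter as Example 14-7) shows the general pattern:
% python
import requests

endpoint = "https://sharing.delta.io/delta-sharing"  # illustrative
headers = {"Authorization": "Bearer <token>"}          # illustrative

tables, page_token = [], None
while True:
    params = {"maxResults": 4}
    if page_token:
        # echo the token returned by the previous response, as in Example 14-7
        params["nextPageToken"] = page_token
    resp = requests.get(
        f"{endpoint}/shares/delta_sharing/schemas/default/tables",
        headers=headers, params=params).json()
    tables.extend(resp.get("items", []))
    page_token = resp.get("nextPageToken")
    if not page_token:  # no token means we have reached the end of the list
        break

print([t["name"] for t in tables])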
Now there is a better way of quickly viewing all tables available to us, without
requiring us to first descend the hierarchical tree from the shares to the schemas of an
individual share, and then again descend into one or more schemas per share to view
the configured tables. We can simply use the next API to query all tables available
to us.
List All Tables in Share
To quickly view all configured tables for our share, we use the list all tables endpoint.
Example 14-8 shows the full request, while Table 14-5 provides the API request
parameters required to complete the request.
% ...
export REQUEST_URI="shares/delta_sharing/all-tables"
export REQUEST_URL="$DELTA_SHARING_ENDPOINT/$REQUEST_URI"
export QUERY_PARAMS="maxResults=10"
curl -XGET \
--header "Authorization: Bearer $BEARER_TOKEN" \
--url "$REQUEST_URL?$QUERY_PARAMS"
The source code for the following examples is located in the book’s
GitHub repository under /ch14/delta-sharing/.
PySpark client
Getting started with the PySpark client requires the delta-sharing Python package,
which can be installed locally using pip install delta-sharing. In addition to the
Python wrappers, if you want to be able to run a local pytest, you will also need
to bring the necessary JARs to your local SparkSession. We will walk through the
end-to-end use case now, starting with Example 14-9, where we create an instance of
the SharingClient and generate the share URL, which encapsulates the profile file as
well as the share, schema, and table we will be reading.
Example 14-9. Generating the share URL to use the Delta Sharing client
%
from delta_sharing import SharingClient
from delta_sharing.protocol import Schema, Share, Table

profile_path = ...
sharing_client = SharingClient(f"{profile_path}/open-datasets.share")

shares = sharing_client.list_shares()
first_share: Share = shares[0]

schemas = sharing_client.list_schemas(first_share)
first_schema: Schema = schemas[0]

tables = sharing_client.list_tables(first_schema)
The code from Example 14-9 instantiates the SharingClient by passing a reference
to the location on the filesystem where we’ve stored our recipient profile file. We then
fetch the list of available shares and, for simplicity’s sake, take the first entity from the results list (a Share object) and use it to fetch our schemas. We repeat the same pattern, taking the first Schema from the results list to fetch the tables available to us. Consider this series of operations to be a simple hierarchical traversal from the share down to the schema.
Last, we retrieve the list of tables and take the Table object representing the lending_club remote Delta table, since that is the table we will be querying. The Table object provides us with everything we need to generate the table_url, which is required by the sharing client to query the remote table.
The function in Example 14-10 is from the book’s source code, and it provides a
simple way to generate the full table_url required for reading the remote Delta
table.
This method allows us to pass a Table object to an instance of our Sharing helper
class, and the end result is the Delta Sharing table URL required to query the
table. The URL is the concatenation of the path to the share profile file along
with the <share>.<schema>.<table>. For example, executing the function from
Example 14-10—self.table_url(lending_club)—yields the following table_url:
…/delta-sharing/profiles/open-datasets.share#delta_sharing.default.lending_club
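Although the helper itself lives in the book's repository, the concatenation it performs is small enough to sketch here (an assumption of its shape rather than the exact implementation):
% python
def table_url(profile_path: str, table) -> str:
    # Join the profile file location with the <share>.<schema>.<table> coordinates.
    return f"{profile_path}#{table.share}.{table.schema}.{table.name}"

# e.g., table_url(f"{profile_path}/open-datasets.share", lending_club)
# -> .../open-datasets.share#delta_sharing.default.lending_club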
Armed with our SparkSession and the table_url from Example 14-10, we can
now read from the remote Delta table using the new deltaSharing format on our
DataFrameReader. The code in Example 14-11 shows us how to do that.
from pyspark.sql.functions import col

df = (
    spark.read
    .format("deltaSharing")
    .option("responseFormat", "parquet")
    .option("startingVersion", 1)
    .load(table_url)
    .select(
        col("loan_amnt"),
        col("funded_amnt"),
        col("term"),
        col("grade"),
        col("home_ownership"),
        col("annual_inc"),
        col("loan_status")
    )
)
Behind the scenes, the delta-sharing Python library and the underlying delta-sharing-spark Scala library work together to negotiate the network calls to the Delta Sharing service, utilizing the table version API (startingVersion = 1), which, if implemented on the sharing service, allows our remote procedure call to time travel to a specific version of the remote Delta table. We are also using the responseFormat option on the reader; the available options at the time of writing are either parquet or delta.
However, ignoring what is happening behind the scenes, the process is fairly transparent with respect to how we write our data applications. Given that we can utilize the full set of DataFrame functions, there is no significant difference, except that we now
can directly query a remote Delta table with the benefits of cloud-agnostic IAM, without the complications presented in Chapter 12.
Example 14-12. Reading a remote Delta table using the Scala Delta Sharing extensions
% import org.apache.spark.sql.functions.{col}
val df = spark.read
.format("deltaSharing")
.load(table_url)
.select("iso_code", "location", "date")
.where(col("iso_code").equalTo("USA"))
.limit(100)
The same DataFrameReader options are available for the PySpark and Spark Scala
clients. The only difference between this and the code in Example 14-11 is the
addition of the where and limit clauses.
Depending on how the Delta Sharing service has been implemented, the where clause can be handled as a direct predicate pushdown, or the filtering can be handled by the client. Given that the Delta Sharing server is based on an open standard, the service implementation should be checked if we are experiencing less-than-ideal query times. The mechanism for modifying the service response is through the use of hints: there are jsonPredicateHints as well as limitHints. These are applied on a best-effort basis and will evolve as the Delta Sharing Protocol does.
Table 14-6. Configuration options for the Delta Sharing reader

readChangeFeed (Boolean)
    Enables the Delta Sharing client to read the change data feed.

maxVersionsPerRpc (String)
    When incrementally processing table changes (readChangeFeed=true) using startingVersion and endingVersion, this option provides a mechanism to control the volume of data read per remote procedure call.

startingVersion (Int)
    Supports TimeTravel on the remote shared table.

endingVersion (Int)
    Supports reading of bounded sets. For example, if you want to read from table version 1 to 10, you can set startingVersion to 1 and endingVersion to 10; in this way, you can meter the volume of data being read for a given operation.

startingTimestamp (Timestamp)
    Read the shared table from the closest transaction available to the provided startingTimestamp. The timestamp must be parsable as a TimestampType—for example, 2024-05-26 04:30:00.

endingTimestamp (Timestamp)
    Set the bounds for the table read to the closest transaction available to the provided endingTimestamp. The timestamp must be parsable as a TimestampType—for example, 2024-05-26 05:30:00.

responseFormat (String)
    Changes the format of the read operation. The supported options are delta and parquet. To handle reading from tables with deletionVectors or columnMapping support, the responseFormat must be delta. This list will continue growing to support additional UniForm types in the future.

maxFilesPerTrigger (Int)
    How many new files to be considered in every microbatch. The default is 1,000. (streaming only)

maxBytesPerTrigger (String)
    How much data gets processed in each microbatch. This option sets a “soft max,” meaning that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. If you use Trigger.Once for your streaming, this option is ignored. This is not set by default. (streaming only)

ignoreChanges (Boolean)
    Reprocess updates if files had to be rewritten in the source table due to a data-changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted; therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes; thus if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table. (streaming only)

ignoreDeletes (Boolean)
    Ignore transactions that delete data at partition boundaries. (streaming only)

skipChangeCommits (Boolean)
    If set to true, transactions that delete or modify records on the source table are ignored. (streaming only)
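To see a few of these options working together, here is a minimal sketch (an assumption, reusing the table_url from earlier in the chapter) of incrementally reading a shared table as a stream:
% python
stream = (
    spark.readStream
    .format("deltaSharing")
    .option("readChangeFeed", "true")    # read the change data feed
    .option("startingVersion", 1)        # begin from table version 1
    .option("maxFilesPerTrigger", 500)   # cap the files per microbatch
    .load(table_url)
)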
Next, we will close this chapter with a listing of all the additional community-driven
Delta Sharing connectors. These clients are lovingly built and shared to continue to
extend the mission to bring Delta everywhere.
Conclusion
This chapter introduced us to Delta Sharing and showed us how we can move beyond
traditional data export workflows to reduce complexity with secure, trust-based data
sharing from a single source of data truth. When we reduce data sharing complexity,
we in turn remove the common headaches related to distributed synchronization
and the problem of many sources of fragmented truth. As long as we abide by the appropriate best practices with respect to table-level backward compatibility (see Chapter 5 for details on schema evolution), we can rest easy at night knowing
that the tables and views we’ve worked so hard to produce can bring joy and delight
to the recipients of our shares.
Index
A defined, 144
ABAC (attribute-based access control), 294 delta-flink connector, 18, 61-71, 261
access controls, 266 real-time processes, 261
cloud object store, 281 Scala, dropping support for, 62
fine-grained, 295 Apache Hudi, 19, 194
identity and access management, 282-283 Apache Iceberg, 19, 56, 194
Access Grants instance, S3, 292 Apache Kafka (see Kafka)
ACID (atomicity, consistency, isolation, and Apache Parquet file format (see Parquet files)
durability) transactions, 2, 3, 8 Apache Pulsar library, 145
write serialization and, 195 Apache Spark, 29-32
active lineage, 268 defined, 144
add function, 125 Delta Sharing with, 332-336
admin role, 294 PySpark client, 332-335
aggregation queries, 215 Spark Scala client, 335
AI (artificial intelligence), 303 Spark SQL client, 336
Airflow platform, 190 documentation, 31
allowed_latency argument, 74 release compatibility matrix, 30
ALTER action, DDL, 277 setting up Delta Lake with, 30
ALTER TABLE CHANGE COLUMN com‐ setting up interactive shell, 31-32
mand, 229 PySpark shell, 31
ALTER TABLE command, 85, 95, 96, 167, 237, Spark Scala shell, 32
239, 267 Spark SQL shell, 31
ALTER TABLE {t} ADD COLUMN(S) com‐ setting up Java for, 30
mand, 94 streaming, 156-162
Amazon S3, 76, 79 idempotent stream writes, 156-161
creating Access Grants instances, 292 performance metrics, 161-162
creating buckets, 291 Apache Spark SQL, 300
creating data access policies, 293 Apache Spark Structured Streaming, 8
AnalysisException, 93 APIs (application programming interfaces), for
analyst role, 285 connectors, narrow and stable, 18
ANALYZE TABLE command, 216, 229 append mode option, 43, 93, 123
Apache Arrow, 22 append operations, 42
Apache Avro, 280 Apple’s information security team, 5
Apache Flink app_id argument, 74
Arcuate connector, 338 BIGINT data type, 81, 175
Armbrust, Michael, 5 BINARY data type, 81
ARN (Amazon Resource Name), 292 blog, Delta Lake, 136
ARRAY data type, 81 Bloom filter indexes, 240-242
Arrow, Apache, 22 BOOLEAN data type, 81
atomic commits, 12 bounded more, DeltaSource API, 63-64
atomicity, consistency, isolation, and durability builder options, 63
transactions (see ACID transactions) generating bounded source, 64
attribute-based access control (ABAC), 294 brokers, Kafka, 71
audit history, 9 bronze layer, medallion architecture, 202-205
audit logging, 267, 314 buckets, 279, 291
audit trail, creating, 166 business intelligence, 7
authentication, 282 business role, 285
authorization, 283 business-based data discovery, 269
Auto Loader, 162 BYTE data type, Delta, 81
autoCompact option, 223
autocompaction, 223-224
auto_offset_reset argument, 74
C
C++ connector, 338
Avro, Apache, 280 cargo utility, 73
AWS Fargate, 256 cargo-lambda tool, 134
AWS Lambdas (see Lambdas) catalog services, 264
AWS S3, concurrent Lambda writes on, catalog.engineering.comms.[email|slack] prop‐
135-137 erties, 315
AWS_ACCESS_KEY_ID environment variable, catalog.engineering.comms.email table prop‐
73 erty, 98
AWS_COPY_IF_NOT_EXISTS environment catalog.engineering.comms.slack table prop‐
variable, 137 erty, 98
AWS_DEFAULT_REGION environment vari‐ catalog.table.gov.retention.* properties, 315
able, 74 catalog.table.gov.retention.enabled boolean, 312
AWS_ENDPOINT_URL environment variable, catalog.table.classification table property, 98
73 catalog.team_name table property, 98
AWS_S3_ALLOW_UNSAFE_RENAME envi‐ <catalog>.<schema>.<table> syntax, 82
ronment variable, 137 catalogs, Trino, 79
AWS_SECRET_ACCESS_KEY environment CDC (change data capture), 150, 259-260
variable, 74 CDF (Change Data Feed), 16, 86, 164-170
Azure Event Hubs library, 145 capturing as Delta table, 166
azurite, 73 enabling, 166
reading, 167-169
B specifying boundaries for batch pro‐
backpressure, metrics for tracking, 161 cesses, 167
BASE (basically available, soft-state, and even‐ specifying boundaries for streaming pro‐
tually consistent) model, 3 cesses, 168
bash shell, starting Docker container with, 23 schema, 169
batch processing, 7, 316 Trino connector, 86
specifying boundaries for, 167 use cases, 165
streaming versus, 140-141 change data capture (CDC), 150, 259-260
bearerToken, 323 Change Data Feed (see CDF)
big data, limitations of data warehouses in scal‐ change_data_feed_enabled table property, 82,
ing for, 2 86, 87
CHECK constraints, 178 kafka-delta-ingest connector, 71-75
checkpointing, 10, 69, 143 building, 73
CheckpointingMode, 61 building projects, 72
checkpoints argument, 74 installing Rust, 72
checkpoint_interval property, CREATE TABLE, running ingestion flow, 73-75
82 setting up environment, 72
classification of data, 287 overview of, 60
cleanup tasks (see VACUUM command) simplifying development of with Delta Ker‐
cloud object stores, 279, 281 nel, 18
cloudFiles source, Databricks, 162 Trino connector, 75-87
CLUSTER BY parameter, 236-240 configuring, 79
clusteringColumns table property, 236 connecting to OSS or Databricks, 75
clusters creating schema, 80
creating using Databricks Runtime, 33 requirements for, 75
imposing order on, 222, 231 running Hive Metastore, 77-79
in Kafka, 71 show catalogs command, 79
code in this book, GitHub repository for, 41 table operations, 81-87
collect() method, 107 viewing schema, 80
columnNames (string . . .) option, 63 working locally with Docker, 76-77
columns parameter, to_pandas() function, 119 constraints, 175, 178-179
columns, generated, 17, 173-175 consumer_group_id argument, 74
column_mapping_mode property, CREATE consumer_privileged account group, 295
TABLE, 82 contextual data, 247
Comcast, 5 continuous mode, DeltaSource API, 64-65
Comcast Xfinity Voice Remote, 245-251 builder options, 65
comma-separated values (CSV) datasets, 121 generating continuous source, 65
COMMENT action, DDL, 277 CONVERT TO DELTA command, 55
comments, 175-178 convertToDelta command, 55
commitInfo, 125 convert_to_interval function, 313
compaction, of files, 220 corrupted data, 190
complex system coordination, 257-262 countDistinct function, 241
compute cost reduction, 243-251 covid19_nyt Delta Lake table
high-speed solutions, 244 listing schema and data from, 26
smart device integration, 245-251 querying, 28
compute, separation between storage and, 198 CREATE action, DDL, 277
concat function, 134 create operations, 41-42
concurrency, increasing, 248 create schema, 80
configure_spark_with_delta_pip utility func‐ CREATE TABLE AS command, 84
tion, 33 CREATE TABLE AS SELECT (CTAS) state‐
confluent schema registry, 73 ment, 44, 238
connectors, 59-88 CREATE TABLE command, 81, 92
for APIs, narrow and stable, 18 createIfNotExists method, 42
Delta Sharing community connectors, 338 createOrReplace command, 51
delta-flink connector, 61-71 creation operations, 40-45
DeltaSink API, 66-69 transaction log, 45
DeltaSource API, 62-66 writing to tables, 42-45
end-to-end example, 69-71 Croy, R. Tyler, 5
installing, 62 CRUD (create, read, update, and delete) opera‐
overview of, 61 tions, 39-52
create, 40-45 shortcomings of, 4
delete, 49-52 data life cycle, 270
read, 46-48 automating, 311-314
update, 49 retention policies, 312
CSV (comma-separated values) datasets, 121 using table properties, 311
CTAS (CREATE TABLE AS SELECT) state‐ data lineage, 268, 304-318
ment, 44, 238 audit logging, 314
current_timestamp function, 159 automating data life cycles, 311-314
retention policies, 312-314
D table properties, 311
automating using OpenLineage, 307-311
Damji, Jules, 5
Das, Tathagata, 5 data application or workflow lineage, 306
data asset model, 275-278 data discovery, 317
data assets data sharing, 311
governance of, 271 monitoring and alerting, 315-317
life cycle of, 271 data quality and pipeline degradations,
relationship of data products to, 273 316
data catalogs, 266, 298 general compliance monitoring, 315
data classification patterns, 286-288 overview of, 305-311
data control language (DCL), 276, 277 simplifying with decorators and abstrac‐
data definition language (DDL), 276, 277 tions, 310
data discovery, 269, 317 data management operations, 39-56
data engineering, 7 creation operations, 40-45
data files, 10, 11, 12 delete, 49-52
data governance, 263-296 merge, 53
data assets Parquet conversions, 55
data asset model, 275-278 read, 46-48
life cycle tracking, 274 data manipulation language (DML), 9, 276, 278
relationship of data products to, 273 data mesh, 273
emergence of, 270-275 data processing failure scenario, 14
maintaining high trust, 274 data products, 273
overview of, 264-270 data providers, 321
unifying, between data warehouses and data quality and pipeline degradations, 316
lakes, 278-296 data quality frameworks, 208
cloud object store access controls, 281 data science, 7
data security, 283-294 data service level agreement (DSLA), 316
filesystem permissions, 280 data sharing, 268, 311, 319-338
fine-grained access controls, 295 data providers, 321
identity and access management, data recipients, 322-323
282-283 Delta Sharing Clients, 332-338
permissions management, 279 Apache Spark, 332-336
data lakehouses (see lakehouses) community connectors, 338
data lakes stream processing, 336-337
corrupted data in, 190 Delta Sharing Protocol, 323-332
costs of, 190 get share, 327
file format flexibility, 190 list shares, 325-327
lakehouses versus, 189 REST APIs, 324
modernizing, 7 REST URI, 324
overview of, 2-4 overview of, 320-323
safely and reliably, 200 overwriting data in tables, 51-52
data skipping, 229-230 DELETE statement, 50
data warehouses, 7 deletion vectors, 13, 179-185
benefits of data lakes over, 3 example, 181-185
lakehouses versus, 188 Merge-on-Read (MoR), 180
limitations in scaling for big data scenarios, Delta Kernel, 18-19
2 connectors and, 60
overview of, 2 resources about, 19
unifying governance between lakes and, Delta Lake, 1-20
278-296 advanced features of, 173-186
data-level indicators (DLIs), 299 comments, 175-178
data-level objectives (DLOs), 299 constraints, 175, 178-179
Databricks, 9 deletion vectors, 179-185
data masking within, 295 generated columns, 173-175
medallion architecture, 147 blog, 136
metrics for tracking backpressure, 161 choice of name for, 5
streaming processing applications using, connectors, 59-88
252 delta-flink connector, 61-71
use by Comcast, 251 kafka-delta-ingest connector, 71-75
Databricks Community Edition, 33-37 overview of, 60
attaching notebooks, 36 Trino connector, 75-87
Auto Loader, 162 data lineage, 304-318
autotuning, 226 audit logging, 314
creating clusters, 33-35 automating data life cycles, 311-314
Delta Live Tables, 163-164 automating using OpenLineage, 307-311
importing notebooks, 35 data application or workflow lineage,
streaming, 162-164 306
Trino connector, 75 data discovery, 317
Datadog, 256 data sharing, 311
DataFrame object, 47, 117 monitoring and alerting, 315-317
DataFrameWriter object, 43, 51 overview of, 305-311
DataFusion, 22 data management operations, 39-57
datafusion feature flag, 127, 130 creation operations, 40-45
DataSet object, 126 delete, 49-52
DataStream class, Fink, 146 merge, 53-55
DataStream object, enabling checkpointing on, metadata and history, 57
69 Parquet conversion, 55-56
data_df DataFrame, 51 read, 46-48
DATE data type, 81 update, 49
DCL (data control language), 276, 277 data sharing, 319-338
DDL (data definition language), 276, 277 data providers, 321
DECIMAL(p,s) data type, 81 data recipients, 322-323
decorators, 310 Delta Sharing Clients, 332-338
Dehghani, Zhamak, 273 Delta Sharing Protocol, 323-332
DELETE action, DML, 278 overview of, 320-323
delete from {table} command, 112 defined, 6
delete function, 123 Delta Kernel, 18-19
delete operations, 49-52 Delta UniForm, 19
deleting data from tables, 50 design patterns, 243-262
complex system coordination, 257-262 lakehouses, 4
compute cost reduction, 243-251 name change, 5
streaming ingestion efficiency, 252-257 separation between logical action and physi‐
documentation, 136 cal reaction, 199
early use cases, 5 streaming, 139-171
installing and setting up, 21-37 Apache Spark, 156-162
Apache Spark, 29-32 batch processing versus, 140-141
Databricks Community Edition, 33-37 Change Data Feed, 164-170
Docker image, 21-28 Databricks, 162-164
native libraries, 28 Delta as sink, 147-148
PySpark declarative API, 33 Delta as source, 146
key features of, 8-9 options for, 149-155
lakehouse architecture, 187-212 overview of, 139
dual-tier architecture, 190 terminology, 142-145
medallion architecture, 201-211 transaction log protocol, 11-18
open standards and open ecosystem, use cases, 7
193-195 workloads addressed by, 6
overview of, 192 Delta Lake Rust, 18
schema enforcement and governance, Delta Lake tables (Delta tables)
197-201 anatomy of, 10-11
transaction support, 195-197 capturing change data feed as, 166
lakehouse governance and security, 263-296 comments, 175-178
data asset model, 275-278 constraints, 175, 178-179
emergence of, 270-275 creating, 23, 32, 41-42
overview of, 264-270 curating downstream tables, 165
unifying, between data warehouses and deleting data from, 50, 109
lakes, 278-296 deletion vectors, 179-185
maintenance, 89-113 dropping, 112
optimization, 99-103 features of, 16-18
partitioning, 104-107 generated columns, 173-175
repairing, restoring, and replacing data, life cycle of, 110
108-113 life cycle of data in, 270
utility functions, 89-97 optimizing, 99-103
metadata management, 297-303 overwriting data in, 51-52
data catalogs, 298 partitioning, 104-107
data reliability, 299 choosing the right partition column, 105
data stewards, 299 defining partitions at table creation, 105
defined, 298 migrating from nonpartitioned to parti‐
Hive Metastore, 300-302 tioned tables, 106-107
permissions management, 299 rules for, 104
Unity Catalog, 302-303 properties, 89-97
motive for creating, 5 adding, 96
native-application building, 115-138 creating empty tables, 92
Lambdas, 131-137 evolving schema, 94
Python, 116-126, 137 modifying, 96
Rust, 127-131, 138 populating tables, 92-93
origins of, 1-5 reference for, 90
data lakes, 2-4 removing, 97
data warehouses, 2 querying data from, 46-47
recovering, 108-109 Delta Standalone library, 61
removing all traces of, 113 Delta transaction log protocol, 45
replacing, 108-109 file level, 12
restoring, 110 metadata–data interactions, 14-16
restoring older versions of, 47 metadata–data relationship, 13
schema discovery, 65 multiversion concurrency control, 13
statistics, 226-236 as single source of truth, 12
data skipping, 229-230 table features, 16-18
file statistics, 228 Delta UniForm (Universal Format), 19
partition pruning, 229-230 Delta writer protocol versions, 16
Z-ordering, 231-236 delta-flink connector, 61-71
time travel feature, 47-48 DeltaSink API, 66-69
Trino connector and, 81-87 DeltaSource API, 62-66
Change Data Feed, 86 end-to-end example, 69-71
CREATE TABLE operation options, 81 installing, 62
creating tables, 82 overview of, 61
creating tables with selection, 84 delta-rs implementation, 145
data types, 81 delta-rs project, 256
deleting tables, 87 delta-sharing Python library, 332, 334
history of transactions, 85 delta-sharing-spark JAR, 334, 336
INSERT command, 83 delta-sharing-spark Scala library, 334
inspecting tables, 83 delta-spark library, 302
listing tables, 83 delta.autoCompact option, 223
metadata tables, 85 delta.autoCompact.enabled true setting, 224
modifying table properties, 87 delta.autoOptimize.autoCompact property, 90
optimizing tables, 85 delta.autoOptimize.optimizeWrite table prop‐
querying tables, 84 erty, 90
updating rows, 84 delta.checkpoint.writeStatsAsJson property, 90
vacuum operation, 84 delta.checkpoint.writeStatsAsStruct table prop‐
viewing table properties, 87 erty, 90, 103
utilities for, 220-223 delta.constraints.<name> attribute, 178
OPTIMIZE operation, 220 delta.dataSkippingNumIndexedCols table
Z-Ordering, 221-223 property, 90, 103, 229
vacuum operation, 111-112 delta.deletedFileRetentionDuration table prop‐
writing to, 42-45 erty, 90, 111
Delta Live Tables (DLT), 163-164 delta.enable-non-concurrent-writes property,
Delta log (see transaction log) 79
Delta Rust API, 26 delta.enableChangeDataFeed property, 87, 166
Delta Sharing Clients, 332-338 delta.logRetentionDuration table property, 90,
Apache Spark, 332-336 92, 110, 111
community connectors, 338 delta.optimizeWrites option, 225
stream processing, 336-337 delta.randomizeFilePrefixes table property, 90
Delta Sharing Protocol, 200, 268, 320, 322, delta.setTransactionRetentionDuration table
323-332 property, 90
(see also data sharing) delta.targetFileSize property, 90, 102, 226
get share, 327 delta.tuneFileSizesForRewrites property, 90,
list shares, 325-327 226
REST APIs, 324 delta.vacuum.min-retention property, 84
REST URI, 324 delta.`<TABLE>` path accessor, 42
DeltaInvariantViolationException, 208 Python, 23
deltalake library, Rust, 129 ROAPI, 27-28
deltalake package, installing, 116 running containers, 23
DeltaLog object, 61 Rust API, 26
DeltaMergeBuilder class, 54 Scala shell, 25
DeltaOps struct, Rust, 131 Trino connector, 76-77
deltars_table, 27 documentation, Delta Lake, 136
DeltaSink API, 66-69 DoorDash, 258-262
builder options, 68 double counting, 15
exactly-once guarantees, 68 DOUBLE data type, 81
DeltaSource API, 62-66 downstream tables, curating, 165
bounded mode, 63-64 DROP action, DDL, 277
builder options, 63 DROP TABLE operation, 87
generating bounded source, 64 DSLA (data service level agreement), 316
continuous mode, 64-65 dt.files() function, 117
builder options, 65 dual-tier architecture, 190
generating continuous source, 65 dynamic masking, 295
DeltaSource object, 66 DynamoDB lock, 136
table schema discovery, 65 dynamodb_lock tool, 135
DeltaTable object, merging or updating tasks
using, 123-125
DeltaTable utility function, 313
E
eager clustering, 237
DeltaTable.create method, 41 ecomm.v1.clickstream Kafka topic, 69
_delta_log directory, 10, 12, 45, 228 elastic data management, 267
delta_table.detail(), 57 ElasticSearch index, 317
describe command, 83 ELT (extract, load, transform) operations, 165
DESCRIBE command, 95 embedding vectors, 247
DESCRIBE DETAIL, 57 enableDeletionVectors table property, 181
DESCRIBE HISTORY command, 153 end-to-end latency, reducing within lakehouse,
design patterns, 243-262 210
complex system coordination, 257-262 end-to-end streaming, 199
compute cost reductions, 243-251 endingTimestamp option, 336
high-speed solutions, 244 endingVersion option, 336
smart device integration, 245-251 endpoint, 323
streaming ingestion efficiency, 252-257 engineering role, 285
detail method, 107 engineering-specific data discovery, 269
detail() function, 97 Enterprise Data Catalog, The (Olesen-
directional lineage graph (DLAG), 269 Bagneux), 270
discovery, data, 269 environment variables, 73
DISTINCT query, 48 error mode, 123
DLAG (directional lineage graph), 269 ETL (extract, transform, load) operations, 190
DLIs (data-level indicators), 299 --example read_delta_datafusion command, 26
DLOs (data-level objectives), 299 examples in this book, GitHub repository for,
DLT (Delta Live Tables), 163-164 41
DML (data manipulation language), 9, 276, 278 --examples read_delta_table.rs command, 26
Docker execute command, 41
Delta Lake Docker image, 21-28 expirationTime, 323
JupyterLab notebooks, 25 explanatory comments, 176
PySpark shell, 24 extract, load, transform (ELT) operations, 165
extract, transform, load (ETL) operations, 190 has_tag_value function, 296
HDFS (Hadoop Distributed File System), 189
F headless users, 282
highly sensitive access classification, 288
failed permissions, 267
false positive probability (fpp), 241 Hilbert curve, 222
file format flexibility, of data lakes, 190 history function, DeltaTable class, 198
file size, effect of Z-ordering on, 223 history method, 102
file statistics, 120 Hive Metastore, 40, 77-79, 300-302
filesystem permissions, 280 hive.metastore.warehouse.dir, 78
filter() function, 126 Housley, Matt, 164
filters keyword parameter, to_pyarrow_table() Hudi, Apache, 19, 194
function, 126 Hueske, Fabian, 61
fine-grained access controls, 295
FIRST parameter, 229 I
Flink (see Apache Flink) IAM (identity and access management), 264,
FLOAT data type, Delta, 81 282-283, 299, 315
forBoundedRowData method, DeltaSource access management, 283
class, 63 authentication, 282
forContinuousRowData method, DeltaSource authorization, 283
class, 64 identity, 282
foreachBatch method, 156-162 iam-policy.json file, 293
forName method, 42 Iceberg, Apache, 19, 56, 194
fpp (false positive probability), 241 IcebergCompatV2 table feature, 195
frameworks, 6 idempotent stream writes, 156-161
From Tahoe to Delta Lake online meetup, 5 merge operation, 159-161
from_json native function, 204 txnAppId option, 157
Fundamentals of Data Engineering (Reis and txnVersion option, 157
Housley), 164 identities, 282, 283
identity and access management (see IAM)
G IDENTITY keyword, 174
IF NOT EXISTS qualifier, 41
general access classification, 287
general compliance monitoring, 315 ignore mode, 123
generated columns, 17, 173-175 ignoreChanges (boolean) option, 65
GetDataAccess action, 292 ignoreChanges option, 151, 336
getRowType method, 67 ignoreDeletes option, 65, 150, 151, 336
GitHub repository for this book, 41 in-sync replicas (ISRs), Kafka, 71
Go connector, 338 incremental processing, 196
gold layer, medallion architecture, 208-209 ingest-with-rust, 135
Google Protobuf, 280 ingestion streaming, 252-257
Google Pub/Sub Lite library, 145 inputFiles command, 92
governance (see data governance) inputRowsPerSecond performance metric, 161
GRANT operation, 277, 278 INSERT command, 83, 278
grants, 189 INSERT INTO command, 42-44
groups, user, 281 INSERT OVERWRITE command, 52
INSERT statement, 42
insertInto method, 42
H installing and setting up Delta Lake, 21-37
Hadoop Distributed File System (HDFS), 189 Apache Spark, 29-32
hashmap indexes, 240 Databricks Community Edition, 33-37
Docker image, 21-28 architecture, 187-212
instructive comments, 176 dual-tier architecture, 190
INTEGER data type, 81 medallion architecture, 201-211
InternalTypeInfo, 67 open standards and open ecosystem,
IntervalType, 312 193-195
INTO parameter, 52 overview of, 192
io.delta:delta-sharing-spark_2.12:3.1.0 package, schema enforcement and governance,
335 197-201
IoT (Internet of Things), streaming ingestion transaction support, 195-197
for, 253 data lakes versus, 189
is_account_group_member function, 296 data warehouses versus, 188
defined, 188
J governance and security, 263-296
data asset model, 275-278
Java connector, 338
Java, setting up for Apache Spark, 30 emergence of, 270-275
JAVA_HOME environmental variable, 30 overview of, 264-270
join function, 134 unifying, between data warehouses and
JSON files, 10, 12 lakes, 278-296
jsonPredicateHints, 335 portability of data in, 199
JupyterLab reducing end-to-end latency within, 210
attaching notebooks, 36 scalability of data in, 199
Delta Lake Docker image, 25 Lambdas, 131-137
importing notebooks, 35 concurrent writes on AWS S3, 135-137
writing in Python, 132-134
writing in Rust, 134
K lambda_handler function, 133
Kafka lastRecordId variable, 196
DataFrame structure, 202 Learning Spark (Damji, Wenig, Das, and Lee),
reading from and writing to Delta, 69-71 139
streaming ingestion for IoT devices specific Lee, Denny, 5
to, 253 libraries, native Delta Lake, 28-29
kafka argument, 74 life cycle of data, 270
kafka-delta-ingest connector, 71-75 limit clause, 335
building, 73 limitHints, 335
building projects, 72 Linux Foundation, 9
installing Rust, 72 liquid clustering, 236-240
running ingestion flow, 73-75 localstack, 73
setting up environment, 72 location property, CREATE TABLE, 82
kafka-delta-ingest library, 129, 202, 255, 256 logging, audit, 267
KafkaSource, writing to DeltaSink, 69 logRetentionDuration, 150
KAFKA_BROKERS environment variable, 73 LONG data type, Delta, 81
Kalavri, Vasiliki, 61 ls command, 24, 280
Kernel (see Delta Kernel)
key partitioning, 249
KPIs (key performance indicators), 268 M
machine learning, 7
MAP data type, 81
L MapReduce operations, 29
lakehouse namespace pattern, 289 masking, dynamic, 295
lakehouses (data lakehouses), 4 maxBytesPerTrigger option, 149, 167, 336
maxExpectedFpp option, 241 resolving all columns and their types using,
maxFilesPerTrigger option, 149, 167, 336 65
maxResults option, 325, 328-331 scalable, 8
maxVersionsPerRpc option, 336 self-describing, 194
max_messages_per_batch argument, 74 table properties and, 98
mean time to resolution (MTTR), reducing, Trino connector and, 85
197 trust and, 274
medallion architecture, 147, 201-211 Unity Catalog, 302-303
bronze layer, 202-205 viewing, 57
gold layer, 208-209 metastore (see data catalogs; Hive Metastore)
role in life cycle of data, 271 microbatch processing, 316
silver layer, 205-208 MinIO, 76
augmenting data, 207 min_bytes_per_file argument, 74
cleaning and filtering data, 205-207 MLflow, 250
data quality checks and balances, 208 mode parameter, 123
streaming, 210-211 .mode(append) option, 43
merge function, 134, 159, 230 mold linker, Rust, 127
Merge-on-Read (MoR), 180 monitoring, 268
mergeBuilder API, 159 data quality and pipeline degradations, 316
mergeSchema option, 198 general compliance monitoring, 315
merging data, 53-55 MoR (Merge-on-Read), 180
metadata, 10, 11, 266, 297-303 MTTR (mean time to resolution), reducing,
active lineage and, 269 197
added to data lineage graph, 306 mutual trust-based relationship, 282
centralized metadata layer, 298 MVCC (multiversion concurrency control), 13
checking values of, 57 MySQL, 77
comments, 175-178
concurrent generation of lakehouse formats
metadata with Delta format, 19
N
native Delta Lake libraries
constraints, 175 bindings, 29
coordinated and executed through Kernel installing Python package, 29
library, 18 native-application building, 115-138
data catalogs, 298 Lambdas, 131-137
data reliability, 299 Python, 116-126, 137
data stewards, 299 Rust, 127-131, 138
decoupling logic around metadata from nextPageToken, 325, 328-331
data, 18 Node.js connector, 338
file skipping and, 233 nohup command, 27
generated by UniForm, 194 notebooks (see JupyterLab)
history of changes, 57 numBytesOutstanding backpressure metric,
Hive Metastore, 300-302 161
metadata management defined, 298 numFilesOutstanding backpressure metric, 161
metadata–data interactions, 14-16 numInputRows performance metric, 161
migrating from nonpartitioned to parti‐
tioned tables, 107
as output of ls command, 280 O
permissions management, 299 Olesen-Bagneux, Ole, 270
relationships between data and, 13 open source code, 9
OpenLineage, 307-311
operational metadata layer (see data catalogs)
operationMetrics option, 185
optimization automation, in Spark, 223-226
    autocompaction, 223
    optimized writes, 224
OPTIMIZE command, 101-104, 147, 180, 216, 220, 222, 223, 233, 237, 239
optimized writes, 224
.option("mode", "append"), 43
overwrite mode, 51, 108, 123
OVERWRITE parameter, 52
overwriteSchema option, 198
owners, 281
oxbow, 129

P
pageToken, 325, 328-331
Pandas DataFrame, 23
Parquet conversion, 55-56
    Iceberg conversion, 56
    regular, 55
Parquet file format, 10
Parquet files, 11, 12, 193, 280
    creation of, 40
    partial files, 15
parquetBatchSize (int) option, 64
partial files, 14
partitionBy("date") syntax, 106
PARTITIONED BY parameter, 56
partitioned_by property, CREATE TABLE, 82
partitioning data
    large data sets, 118-120
    partition pruning, 229-230
    performance tuning, 218-220
partitioning tables, 104-107
    choosing the right partition column, 105
    defining partitions at table creation, 105
    migrating from nonpartitioned to partitioned tables, 106-107
        managing metadata, 107
        viewing metadata, 107
    number of partitions, controlling, 224
    removing partitions, 109
    Z-ordering and, 233
partitions keyword parameter, to_pyarrow_table() function, 126
passive lineage, 269
performance tuning, 213-242
    Bloom filter indexes, 240-242
    CLUSTER BY parameter, 236-240
    considerations for, 217-242
    objectives of, 214-217
        maximizing read performance, 214-216
        maximizing write performance, 216
    partitioning, 218-220
    table statistics, 226-236
    table utilities, 220-223
        OPTIMIZE operation, 220
        Z-Ordering, 221-223
permissions, 267, 280, 281
permissive passthrough, in Spark, 204
personally identifiable information (PII), 288
personas, establishing roles around, 284
Petrella, Andy, 298, 305
pip install delta lake command, 29
pip install delta-spark command, 33
pip package manager, installing deltalake package using, 116
plain old Java objects (POJOs), 67
point queries, 214
policy-as-code, 291-293
portability of data, 199
PostgreSQL, 77
Power BI connector, 338
PrestoSQL (see Trino entries)
privileges, governance and, 275
processedRowsPerSecond performance metric, 161
Project Tahoe, 5
properties, of Delta Lake tables, 89-97
    creating empty tables, 92
    default (Spark only), 98
    evolving schema, 94
    modifying, 96
    populating tables, 92-93
    reference for, 90
    removing, 97
Protobuf, 280
Pulsar library, 145
pyarrow library, 125-126
    DataSet objects, 126
    online API documentation, 125
    RecordBatch objects, 126
    Table objects, 126
pyarrow.Table, 126
PySpark
    declarative API, 33
    Delta Lake Docker image, 24
    Delta Sharing with Spark, 332-335
    release compatibility matrix, 30
    setting up shell for Apache Spark, 31
pyspark command, 30
Python, 23, 116-126
    bindings, 29
    Lambdas, 132-134
    merging data, 123-125
    projects list, 137
    pyarrow library, 125-126
    reading large datasets, 117-121
        file statistics, 120-121
        partitioning, 118-120
    updating data, 123-125
    writing data, 121-123
python3 command, 23

Q
QP Hou, 5
Quality of Service (QoS) monitoring, 244

R
random function, 159
range queries, 215
RBAC (role-based access controls), 283-294
    applying policies at role level, 293
    data assets and policy-as-code, 291-293
        creating S3 Access Grants instances, 292
        creating S3 buckets, 291
        creating S3 data access policies, 293
        creating trust, 292
    data classification patterns, 286-288
    limitations of, 294
    overview of, 284
    personas and role establishment, 284
read operations, 46-48
    querying data from tables, 46-47
    time travel feature, 47-48
read performance, maximizing, 214-216
Read policy, 293
read-only APIs, 22, 27-28
readChangeFeed option, 167, 336
readStream operation, 146, 150, 168
ReadWrite policy, 294
REAL data type, Trino, 81
RecordBatch objects, 126, 130
RecordBatchWriter, Rust, 130
recordsPerBatch variable, 196
references, 281
Reis, Joe, 164
remapping data, 222
remote control, 245-251
remove function, 125
RENAME action, DDL, 277
REORG operation, 199
replace method, 51
replaceWhere option, 108
responseFormat option, 336
restricted access classification, 287
retention policies, 312
REVOKE operation, 277, 278
ROAPI, 22, 27-28
role-based access controls (see RBAC)
ROW(...) data type, 81
RowData representation, 70
RowData<T>, 67
RowDataContinuousDeltaSourceBuilder, 64
RowType reference, 67
run method, decorating, 310
Rust, 22, 127-131
    Delta Lake Docker image, 26
    delta-rs implementation, 145
    installing, 72
    Lambdas, 134
    merging data, 131
    projects list, 138
    reading large datasets, 129
    updating data, 131
    writing data, 129
Rust connector, 338

S
S3, Amazon, 76, 79
S3DynamoDBLogStore protocol, 135
Scala
    Delta Lake Docker image, 25
    Delta Sharing with Spark, 335
    setting up Spark Scala shell, 32
Scala/JVM, 18
scalability of data in lakehouses, 199
scalable metadata, 8
schema
    enforcement and governance, 94, 197-201
        Delta Sharing Protocol, 200
        schema-on-read approach, 198
        schema-on-write approach, 197
        separation between storage and compute, 198
        transactional streaming, 199
        unified access for analytical and ML workloads, 200
    evolution of, 8, 94-95
schema-on-write technique, 94, 203
scientist role, 285
Scribd, 254-257
SELECT action, DML, 278
select operator, 84
SELECT statement, 43, 46
SELECT TABLE <table name>, 43
self-describing table metadata, 194
sensitive access classification, 288
serializable writes, 195
SessionContext::sql function, Rust, 129
sessionization, 248
SHALLOW CLONE command, 194
shareCredentialsVersion, 323
shares, 320
    data providers and, 321
    recipients of, 322
    using Delta Sharing Protocol to list configured shares, 326
sharing data (see data sharing)
SharingClient, 332
SHORT data type, Delta, 81
show catalogs command, 79
SHOW commands, 300
SHOW TBLPROPERTIES query, 17, 97
shuffle operations, 225
sidecar files, 180
silver layer, medallion architecture, 205-208
    augmenting data, 207
    cleaning and filtering data, 205-207
    data quality checks and balances, 208
single source of truth, 12
sinks
    defined, 143
    Delta as sink, 147-148
skipChangeCommits option, 336
skipping (see Z-ordering)
small file problem, 99-103
    creating, 100
    OPTIMIZE command, 101-103
    Z-ordering, 103
SMALLINT data type, Trino, 81
smart device integration, 245-251
snapshot isolation for reads, 12, 196
sources
    defined, 142
    Delta as source, 146
space-filling curve, 222, 231, 233
Spark (see Apache Spark)
Spark Structured Streaming, Apache, 8
spark-shell command, Scala, 30
spark-sql command, 31
spark.databricks.delta.autoCompact.minNumFiles option, 224
spark.databricks.delta.optimize.maxFileSize property, 102
spark.databricks.delta.optimize.minFileSize property, 102
spark.databricks.delta.optimize.repartition.enabled property, 102
spark.databricks.delta.optimizeWrites.enabled option, 225
spark.databricks.delta.properties.defaults.enableChangeDataFeed setting, SparkSession object, 167
spark.databricks.delta.withEventTimeOrder.enabled true option, 155
spark.sql(...) method, 336
spark.sql.catalog.spark_catalog, 302
spark.sql.warehouse.dir, 302
Spark: The Definitive Guide (O’Reilly), 30
SparkSession, configuring, 33
sprawl, 294
SQL grants, 277, 278, 300
SQL projection, 68
SQL queries, 7
standalone library (see Delta Standalone library)
startingTimestamp (string) option, 64
startingTimestamp option, 64, 153, 336
startingVersion (long) option, 63
startingVersion option, 64, 196, 336
startTime variable, 196
storage, separation between compute and, 198
Stream Processing with Apache Flink (Hueske and Kalavri), 61, 142
streaming, 7, 139-171
    Apache Spark, 156-162
        idempotent stream writes, 156-161
        performance metrics, 161-162
    batch processing versus, 140-141
    Change Data Feed, 164-170
        enabling, 166
        schema, 169
        use cases, 165
    Databricks, 162-164
        Auto Loader, 162
        Delta Live Tables, 163-164
    Delta as sink, 147-148
    Delta as source, 146
    ingestion
        efficiency, 252-257
        evolution of, 255-257
        overview of, 252
    medallion architecture, 210-211
    options for, 149-155
        ignoring updates or deletes, 150-152
        initial processing position, 152-154
        initial snapshot, 154-155
        limiting input rate, 149
    overview of, 139
    reading, 167-169
    stream processing with Delta Shares, 336-337
    terminology, 142-145
    transactional, 199
StreamingExecutionEnvironment, 66
StreamingQuery object, 310
STRING data type, Delta, 81
STRUCT(...) data type, Delta, 81
StructType.fromJson method, 204
Structured Streaming, 8

T
table lag metric, 300
Table object, 333
table protocol versions, 17
<table> syntax, 82
TableBuilder object, 41, 51
table_url, 333, 334
tag-based policies, 294
tags, 176
TBLPROPERTIES, 87
tblproperties, 95
thrift:hostname:port, 79
time to live (TTL), for tokens, 283
time travel feature, 8, 13, 14, 47-48, 197
TIMESTAMP AS OF, 48
TIMESTAMP data type, Delta, 81
TIMESTAMP(3) WITH TIME ZONE data type, 81
TIMESTAMP(6) data type, 81
timestampGreaterThanLatestCommit error, 169
TIMESTAMPNTZ (TIMESTAMP_NTZ) data type, 81
TINYINT data type, 81
tokens, 247, 283
toLogicalType, 67
topics, Kafka, 71
to_batches() function, 126
to_pandas() function, 117
    partitioning data using, 118
    reading large datasets using, 117
to_pyarrow_dataset() function, 126
to_pyarrow_table() function, 126
to_table() function, 126
transaction log, 10, 11
transaction support, 195-197
    incremental processing, 196
    serializable writes, 195
    snapshot isolation for reads, 196
    time travel, 197
transactional streaming, 199
triggers, in Structured Streaming, 149
Trino, 39
Trino connector, 75-87
    configuring, 79
    connecting to OSS or Databricks, 75
    creating schema, 80
    requirements for, 75
    running Hive Metastore, 77-79
    show catalogs command, 79
    table operations, 81-87
    viewing schema, 80
    working locally with Docker, 76-77
Trino: The Definitive Guide (O’Reilly), 75
trust, 274, 292
TTL (time to live), for tokens, 283
txnAppId option, 157
txnVersion option, 157
type-safe schema, 203
typeInfo object, 67
TypeInformation, 67

U
UC volumes, 303
unified batch/streaming, 8
UniForm (see Delta UniForm)
Unity Catalog, 40, 175, 177, 295, 302-303
UNSET TBLPROPERTIES command, 97
UPDATE action, DML, 278
update operations, 49, 84
UPDATE statement, 49, 180
updateCheckIntervalMillis (long) option, 65
upserting (see merging data)
use cases, 7
use delta.<schema> command, 82
userMetadata option, 177, 178
users, headless, 282
USING DELTA parameter, 41

V
vacuum operation, 14, 84, 111-112, 199, 225
VALUES argument, INSERT INTO operation, 43
VARBINARY data type, 81
VARCHAR data type, 81
VERSION AS OF, 48
virtualenv, 116
voice remote, 245-251

W
watermarking, 66, 144
whenMatchedUpdate, 54
whenNotMatchedInsert, 54
WHERE clause, 49, 50
where clause, 335
WITH clause, CREATE TABLE operation, 81
withColumn("date", to_date("date", "yyyy-MM-dd")), 93
withEventTimeOrder option, 154-155
withMergeSchema (boolean) option, 68
withPartitionColumns (string ...) option, 68
workloads, 6
write performance, maximizing, 216
write serialization, 195
writes, optimized, 224
writeStream method, 148, 156
write_deltalake() function, 122

X
Xfinity Voice Remote, 245-251

Y
Yavuz, Burak, 5

Z
Z-ordering, 103, 221-223, 231-236
Zhu, Ryan, 5
ZORDER BY clause, 222, 229, 231
ZORDER clause, 224, 233
About the Authors
Denny Lee is a Unity Catalog, Apache Spark, and MLflow contributor, a Delta Lake
maintainer, and a Principal Developer Advocate at Databricks. He is a hands-on
distributed systems and data sciences engineer with extensive experience developing
internet-scale data platforms and predictive analytics and AI systems. He has previ‐
ously built enterprise DW/BI and big data systems at Microsoft, including Azure
Cosmos DB, Project Isotope (HDInsight), and SQL Server. He was also the Senior
Director of Data Sciences Engineering at SAP Concur. His current technical focuses
include AI, distributed systems, Unity Catalog, Delta Lake, Apache Spark, deep learn‐
ing, machine learning, and genomics.
Tristen Wentling is a Solutions Architect at Databricks, where he works with cus‐
tomers in the retail industry. Formerly a data scientist, he also has authored several
blog posts covering topics such as best practices for production stream applications
and building generative AI applications for ecommerce. Outside of technical work,
Tristen spends a great deal of free time reading or heading to the beach. Tristen holds
an MS in mathematics and a BS in applied mathematics.
Scott Haines is a Databricks Beacon and has been working with data, distributed
systems, and real-time applications for over 15 years. His data journey began at
Yahoo! and then took him to Twilio and more recently to Nike. He owns a consulting
company named DataCircus and wrote a book encapsulating his
journey called Modern Data Engineering with Apache Spark: A Hands-On Guide for
Building Mission-Critical Streaming Applications (Apress). He enjoys teaching people
how to simplify data systems and data-intensive services and takes to the snow in the
winter to pursue his love of snowboarding.
Prashanth Babu is a Databricks Certified Developer who helps guide the design and
implementation of customer use cases by building out reference architectures, best
practices, frameworks, MVPs, and prototypes, enabling customers to succeed in
turning their data into value.
R. Tyler Croy helped create and still maintains the delta-rs project, which now
helps thousands of organizations use Delta Lake from Rust, Python, and beyond.
He also acts as a Databricks Beacon, helping teach others about Delta Lake and the
Databricks platform. R. Tyler Croy contributed Chapter 6 of this book.
Colophon
The animal on the cover of Delta Lake: The Definitive Guide is an American pika
(Ochotona princeps). Related to rabbits and hares, American pikas are small mam‐
mals that live in the mountains of western North America, from central British
Columbia and Alberta in Canada to Oregon, Washington, Idaho, Montana, Wyom‐
ing, Colorado, Utah, Nevada, California, and New Mexico in the United States. They
are typically found at or above the tree line.
American pikas often live in talus fields or among piles of broken rock or boulders,
where they forage for the vegetation that makes up their diet. They rely on existing
spaces in the talus for their homes and do not dig burrows.
The American pika has been classified by the IUCN as being of least concern from a
conservation standpoint. However, the population is reportedly decreasing, especially
at lower elevations in the southwestern United States. American pikas are highly
sensitive to high temperatures, have limited dispersal ability and low fecundity, and
are vulnerable to decreases in snowpack. Many of the animals on O’Reilly covers are
endangered; all of them are important to the world.
The cover illustration is by Karen Montgomery, based on an antique line engraving
from the Museum of Natural History. The series design is by Edie Freedman, Ellie
Volckhausen, and Karen Montgomery. The cover fonts are Gilroy Semibold and
Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad
Condensed; and the code font is Dalton Maag’s Ubuntu Mono.