
Additional Real-World Use Cases of Apache Iceberg

Real-Time Workloads with AWS Kinesis and Apache Iceberg
Real-time data has become increasingly important for catering to various analytical use
cases. With the capability to access, analyze, and act on data in real time, organizations
gain a competitive edge by enabling fast decision making, optimizing operational efficiency, elevating the customer experience, and detecting issues early. As we mature and
expand on our analytics journey, the ability to incorporate real-time workloads becomes
necessary.
AWS Kinesis steps in here, serving as a robust bridge for streaming data. It facilitates
the ingestion, processing, and analysis of real-time data, allowing for instantaneous
insights and decision-making capabilities while integrating seamlessly with various
analytical tools.
Unlike batch processing, where data is collected over a period and then processed,
real-time or streaming analytics tools process data almost instantly as it’s generated,
accelerating the time-to-insights process. For instance, consider an over-the-top
(OTT) streaming platform that offers media services directly to viewers via the
internet. With real-time analytics, the service can instantly recommend shows or
movies to viewers based on what’s currently trending or on the viewing habits of
similar users, enhancing binge-watching sessions. Applications such as these bring
out the importance of real-time data and analytics.

While streaming data presents a lot of interesting scope for downstream analytics use
cases, it also introduces hurdles in terms of scalability, reliability, and cost factors.
With the constant influx of data, you need a data architecture that can scale and
seamlessly handle increasing data volume as well as deal with small file problems,
schema changes, and more. Additionally, the platform should be able to support
reliable transactions and provide a cost-effective way to handle them.
Let’s look at a use case and see how we can address the problem using AWS Kinesis
and Apache Iceberg.
Consider AeroWorld. For this aviation company, flight safety and timeliness are supercritical, so it relies heavily on weather predictions. Traditional weather forecasts, updated hourly or daily, are not always enough.
The aviation industry requires more frequent updates due to the rapid changes
in atmospheric conditions that can impact flight trajectories, passenger safety, and
schedules.
AeroWorld’s engineering team aims to integrate data from OpenWeather and other
sources to provide streaming insights into atmospheric changes. However, this presents a
massive data challenge: how can they efficiently ingest, process, and store large volumes
of global weather data, and, more importantly, how can the downstream analytics team
consume this data for reporting or predictive analytics? Here’s how we can approach the
problem.
First we will use AWS Kinesis Data Streams to ingest and store large volumes of data
from our data source, OpenWeather’s API. Once the weather streams are stored in
Kinesis, we will use AWS Glue to create and run an ETL job to ingest the data into an
Apache Iceberg table sitting atop an Amazon S3 data lake. Iceberg as the table format will
serve as an efficient storage layer providing capabilities such as consistency guarantees,
schema evolution, hidden partitioning, and time travel, which are critical for analytical
workloads. Once the data is stored in Iceberg’s open table format, different compute
engines can consume the data for various use cases. For instance, AeroWorld might build dashboards to monitor weather events or train ML models.
Figure 1 shows a high-level architecture diagram of our solution.



Figure 1. High-level diagram of AeroWorld’s architecture

Now let’s see how to implement this.

Ingestion of Streams with AWS Kinesis


The first step in this process is to ingest streaming data from the source. We will
ingest real-time weather data from OpenWeather with AWS Kinesis in this case.
Kinesis Data Streams makes capturing, processing, and storing data streams at any
scale easy.
Let’s go ahead and create a Kinesis data stream. From the AWS Kinesis main page on
AWS, create a new stream and then do the following:

1. Name the stream (e.g., example_stream_book).
2. Set the capacity mode to provisioned.
3. Provision one shard.

Now that you’ve created your stream, let’s dive into a couple of important aspects.

Sharding strategy and capacity mode


A shard in AWS Kinesis represents a sequence of data entries within a stream and acts
as the foundational throughput component in a Kinesis data stream. Each shard in
a Kinesis stream can handle writes at 1 MB per second or 1,000 entries per second
and reads at a rate of 2 MB per second. You should allocate the appropriate shards
depending on your use case and requirements.
Kinesis provides two capacity modes for sharding:



On-demand mode
With on-demand mode, you don’t need to specify the number of shards. Instead,
Kinesis manages the sharding behind the scenes as data comes in. It automatically allows the stream to scale its capacity based on incoming data so that you
never have too few shards when a lot of data comes in or too many when there
isn’t much data coming in.
Provisioned mode
In provisioned mode, you specify the number of shards you want to allocate
for your stream. Each shard you provision provides a certain capacity for reads
and writes. You need to plan and adjust the number of shards as your data
throughput changes. With this approach, you have tighter control over your expenditures.
For AeroWorld, it is crucial to gauge the expected data velocity, which can spike on days of severe weather disturbances or during certain seasons, and size the stream accordingly. Planning shards judiciously is therefore vital to avoid overprovisioning, which escalates costs, and underprovisioning, which may cause data ingestion hiccups.
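As a rough back-of-the-envelope sketch, the shard count for provisioned mode can be estimated from the per-shard write limits; the record size and ingest rate below are illustrative assumptions, not AeroWorld's actual numbers:
# Rough shard estimate for provisioned mode (illustrative numbers only)
# Each shard accepts up to 1 MB/s or 1,000 records/s on the write side
import math

avg_record_size_kb = 2      # assumed size of one weather payload
records_per_second = 1500   # assumed peak ingest rate

shards_for_throughput = math.ceil(avg_record_size_kb * records_per_second / 1024)  # 3
shards_for_record_count = math.ceil(records_per_second / 1000)                     # 2

required_shards = max(shards_for_throughput, shards_for_record_count)
print(required_shards)  # 3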
Click “Create data stream,” and the stream example_stream_book should be running
and ready to be used by the data producer.
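The same stream can also be created programmatically. Here is a minimal boto3 sketch (it assumes AWS credentials and the us-east-1 region are configured in your environment):
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Provisioned mode with a single shard, matching the console steps above
kinesis.create_stream(
    StreamName="example_stream_book",
    ShardCount=1,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# Block until the stream is ACTIVE and ready for producers
kinesis.get_waiter("stream_exists").wait(StreamName="example_stream_book")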

Data producer
In this example, OpenWeather’s API is the primary data producer, pushing real-time
weather metrics into the Kinesis stream. You can integrate these data sources with
Kinesis using SDKs or connectors, ensuring seamless data push without loss. We
will use the boto3 SDK to interact with our Kinesis data stream. The following code
acts as a producer to fetch real-time weather data for various cities and ingest this
data into our Kinesis stream, example_stream_book. To see the full code snippet with
client setup and function definitions, visit the book’s GitHub repository.
In the following code block, first we iterate through a list of cities and attempt to
fetch weather data from the API; if the fetch is successful, we pass the data to another
function to send the data into the Kinesis stream. After each iteration, we put the
process to sleep to avoid throttling from the API:
## Setup of client details done before this
## Full Code on Book Repository

if __name__ == "__main__":
    for city in cities:
        data = get_weather_data(city)
        if data:
            send_data_to_kinesis(data)
        time.sleep(5)  # Short delay to avoid API rate limits and pace the inserts
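For reference, here is a minimal sketch of what the two helper functions might look like; the OpenWeather endpoint, API key handling, and the choice of partition key are assumptions, and the book's repository contains the actual implementation:
import json

import boto3
import requests

kinesis_client = boto3.client("kinesis", region_name="us-east-1")
STREAM_NAME = "example_stream_book"
API_KEY = "<your-openweather-api-key>"

def get_weather_data(city):
    """Fetch the current weather for a city; return None if the call fails."""
    url = "https://api.openweathermap.org/data/2.5/weather"
    response = requests.get(url, params={"q": city, "appid": API_KEY}, timeout=10)
    if response.status_code != 200:
        return None
    return response.json()

def send_data_to_kinesis(data):
    """Write one weather record to the Kinesis stream, keyed by city name."""
    kinesis_client.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(data),
        PartitionKey=data.get("name", "unknown"),
    )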



Data Transformation and ETL with AWS Glue
The next component of our architecture is configuring AWS Glue with Apache
Iceberg. The idea is to consume the weather stream data currently stored in Kinesis as
a source in Glue and create an Iceberg table using this data. Before proceeding with
the integration, there are some essential preparatory steps.

Create a Glue database


AWS Glue's Data Catalog is a hub that keeps track of our tables' metadata. Before we
can utilize this service for our ETL operations, we need to create a database within this
catalog. If you navigate to the data catalog, you should see a Databases section in the left
pane. Select Add Database to create a new database for this use case and label it “test.”

Set up the Glue ETL job


AWS Glue offers a visual ETL interface, simplifying the setup of an ETL job. From
the Jobs page of AWS, choose a template to create a new job and use the Visual with
a Blank Canvas option to create the job. This will give you a canvas without any
unnecessary nodes you’d need to remove or alter to get your desired result. Click
Create, which will take you to a canvas for an ETL job.
Remember that our goal is to create Iceberg tables using data from the Kinesis Data
Stream. To make this happen, we have to add a node with Kinesis as the source in
Glue Studio.
In the settings for your Kinesis node, make sure to select the right stream under
Stream Name, which in this case is example_stream_book, set Data Format to JSON,
and set Starting Position to Latest. By setting Starting Position to Latest, we capture
new data each time the stream is checked.
With the Kinesis node active, add an SQL node. The flow should be from the Kinesis
node to the SQL node.

AWS Glue Node Relationships


In AWS Glue, each node represents a data source, data destination, or transformation.
The nodes flow from one to the other in a parent–child relationship; a node will not start its task until all of its parent nodes have completed theirs. The end result is a DAG that captures the order in which tasks should be completed for your job, starting from data sources and typically ending at a destination, with each node receiving data from its parent along the way.

In the settings for our SQL node, set the name of the node to Create Table. Here, you can also set a name for the incoming data from the prior Kinesis node; keep the default name, myDataSource. Then input the following SQL code to create the table based on the stream's schema if the table doesn't exist:
CREATE TABLE IF NOT EXISTS glue.test.my_streaming_data AS
(SELECT * FROM myDataSource LIMIT 0);

Add another SQL node, named INSERT INTO and flowing from the Kinesis node.
This node’s function is to populate the table post-creation and during subsequent
stream checks. It must have the Kinesis node as its parent so that it can receive the
data from the stream.
Here is the query used for this SQL node:
INSERT INTO glue.test.my_streaming_data SELECT * FROM myDataSource;
Figure 2 shows what your AWS Glue flow should look like.

Figure 2. The flow of the AWS Glue ETL job so far

Now move to the “Job details” tab to finalize the connection settings between Glue
and Apache Iceberg.

Configure job properties


Now it’s time to configure the job properties. First, assign a name to the job, such as
Iceberg_ETL, and choose an AWS Identity and Access Management (IAM) role equipped
with the requisite Glue permissions. By default, the job type is configured as Spark
Streaming because the data source originates from Kinesis. Next, select Glue 4.0 as
the Glue version and Python 3 for the scripting language. Finally, allocate the number
of workers for the ETL job; we will set this to 2 for demonstration purposes.
The “Job parameters” section within the Advanced properties is where we’ll input
parameters for the job. Here are the required parameters:

• --datalake-formats: We will use iceberg to let Glue know which table format to
use.
• --conf: This input comprises key-value pairings that configure Iceberg, such as creating the catalog, providing the warehouse directory, and implementing Spark SQL-specific settings.



spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Note that the initial --conf is missing in the code block. This is because the parameter name is the initial --conf. Be careful not to have any unexpected line breaks in
your settings passed as the --conf parameter. We discussed the configuration of these
parameters using Spark in detail in Chapters 5 and 6.
After defining the required configurations, we can save the job and move to the Script
tab. Following is the code at this point (minus the initial setup code; please refer to the
book’s GitHub repository for this chapter for the full code):
#######
# Script Generated for Kinesis Node
#######

dataframe_AmazonKinesis_node1693319486385 = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "typeOfData": "kinesis",
        "streamARN": "arn:aws:kinesis:us-east-1:xxxxxxxxxx:stream/example_stream_book",
        "classification": "json",
        "startingPosition": "latest",
        "inferSchema": "true",
    },
    transformation_ctx="dataframe_AmazonKinesis_node1693319486385",
)

def processBatch(data_frame, batchId):
    if data_frame.count() > 0:
        AmazonKinesis_node1693319486385 = DynamicFrame.fromDF(
            data_frame, glueContext, "from_data_frame"
        )

        #######
        # Script Generated for Create Table Node
        #######

        SqlQuery0 = """
        CREATE TABLE IF NOT EXISTS glue.test.my_streaming_data AS
        (SELECT * FROM myDataSource LIMIT 0);
        """
        CreateTable_node1693319510128 = sparkSqlQuery(
            glueContext,
            query=SqlQuery0,
            mapping={"myDataSource": AmazonKinesis_node1693319486385},
            transformation_ctx="CreateTable_node1693319510128",
        )

        #######
        # Script Generated for Insert Into Node
        #######

        SqlQuery1 = """
        INSERT INTO glue.test.my_streaming_data SELECT * FROM myDataSource;
        """
        InsertInto_node1693319537299 = sparkSqlQuery(
            glueContext,
            query=SqlQuery1,
            mapping={"myDataSource": AmazonKinesis_node1693319486385},
            transformation_ctx="InsertInto_node1693319537299",
        )

glueContext.forEachBatch(
    frame=dataframe_AmazonKinesis_node1693319486385,
    batch_function=processBatch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args["TempDir"] + "/" + args["JOB_NAME"] + "/checkpoint/",
    },
)
job.commit()
Now let’s run the job. We should see the records in our Apache Iceberg table,
my_streaming_data. Once the job is completed, the Glue dashboard will display the
table, indicating the present metadata file’s location. The S3 output directory will have
a database-specific folder that houses individual folders for each table, each including
data and metadata. Note that this streaming job will continue to run and write
data into the Iceberg table. The new updates will then be available for downstream
applications.

Downstream Analytics
Now that we have the streaming data in an open table format (Iceberg), it is ready
to be consumed by downstream analytical workloads such as BI and ML. Using Apache Iceberg as the table format for streaming workloads offers far better data management capabilities than relying on raw datafile formats such as Parquet alone. One of Iceberg's core strengths is its ability to manage schema evolution efficiently, allowing real-time data schemas to adapt and evolve without interrupting ongoing processes. This is crucial for streaming workloads, ensuring continuous, uninterrupted data for downstream analytics. Partition evolution is another significant benefit. In a dynamic streaming environment, data patterns and volumes fluctuate rapidly, and Iceberg's ability to evolve partitions without side effects on the existing architecture makes it an ideal fit for these workloads.
Additionally, Iceberg's rollback and time-travel features are invaluable for streaming workloads, allowing users to revert to previous data states, correct errors in the streams, and query historical data.
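For example, if a bad batch ever lands in the streaming table, a consumer could inspect the table's history, query an earlier state, and roll back using Iceberg's Spark SQL extensions; the timestamp and snapshot ID below are placeholders, but the history metadata table, time travel, and the rollback_to_snapshot procedure are standard Iceberg features:
# Inspect the commit history of the streaming table
spark.sql("SELECT * FROM glue.test.my_streaming_data.history").show()

# Query the table as of an earlier point in time (placeholder timestamp)
spark.sql(
    "SELECT * FROM glue.test.my_streaming_data TIMESTAMP AS OF '2023-09-01 00:00:00'"
).show()

# Revert the table to a known-good snapshot (placeholder snapshot ID)
spark.sql(
    "CALL glue.system.rollback_to_snapshot('test.my_streaming_data', 1234567890123456789)"
)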

Running Machine Learning Workloads with Apache Iceberg
In the ML ecosystem, while algorithms and models are crucial, the underlying data
architecture plays an equally if not more important role. A robust and adaptive data
foundation is key to ensuring that models train effectively and predict with precision.
Imagine a highly accurate model in the healthcare world put into production without
being trained effectively on diverse datasets. The predictions of such models can
hardly be used for decision making. Similarly, data schemas and structures are bound
to change with the rapid evolution of user behaviors, technology landscapes, and
business models. However, amid these constant changes, the need for data integrity
and reliability becomes critical, especially for ML workflows that thrive on consistent
data patterns.
Apache Iceberg as the table format on top of data lakes addresses some of the key
challenges of tabular datasets within the ML workflow. Its features, such as schema
enforcement, ensure that the data remains clean and untainted. Schema evolution
allows the incorporation of new business requirements without disrupting existing
operations. Additionally, with capabilities such as time travel and isolated branching,
data scientists and analysts can revisit past data states, reproduce experiments, and
create isolated environments for new explorations without impacting the production
data sources.
In this section, we will cover a few scenarios from an ML perspective to show the
importance of a well-architected data platform backed by Apache Iceberg. We will use
the architecture illustrated in Figure 3 as a potential solution for the three use cases
discussed in the next section.
In this architecture, we will have batch and streaming sources landed as raw Apache
Iceberg tables in our data lake. We will then use that raw data to create feature tables
in the Apache Iceberg format, all with Apache Spark, which can then also use that
data to feed our prediction models.



Figure 3. Our example architecture

Use Case 1: Reliable ML Pipelines


The telecom company TelCan has a churn prediction model that helps its teams
identify and retain customers who are likely to terminate their services. The teams’
existing model uses features such as call duration, customer service interactions, and
international calling details. Their workflow is mostly Spark based. TelCan regularly
updates its ML training pipelines by bringing in new data from diverse data sources.
During this ingestion process, there’s a risk that incorrect data types might creep into
the tables, causing disruptions to existing training pipelines. Additionally, as TelCan
introduces new services such as VoLTE and 5G, it begins capturing relevant data,
such as 5G_Usage_minutes and VoLTE_calls_made. This new data can improve the
accuracy of its churn prediction model. However, integrating these features into its
existing training pipeline is tricky. The company faces two major challenges:

• How does it ensure that the schema is respected during writes and notify
stakeholders of any failures?
• How can it incorporate new features into the training pipeline without impacting
the current datasets used for training?

A consistent schema is required to ensure that an ML training pipeline can confidently handle new incoming data. By guaranteeing that the newly ingested data adheres to a predefined schema, you can make sure the data is consistent and the predictions are reliable. Apache Iceberg tables have defined schemas in their metadata,
and engines can use this to enforce schemas when writing new data and coerce data
from the old schema to the current schema on reads. When schemas are enforced
on write, any discrepancies in the data structure are immediately flagged, ensuring
that downstream processes, such as ML pipelines, only work with consistent and
trustworthy data.
TelCan’s daily ingestion job brings new training data to its ML pipelines. Let’s say
the ingestion job doesn’t do any schema checks prior to writing data into the Iceberg
table. We may see the following error (to see the full code with function definitions
and error handling logic, refer to the book’s GitHub repository):
# Ingest new data
new_data_path = "churn_etl_new.csv"
iceberg_table_name = "glue.test.churn"
ingest_new_data(new_data_path, iceberg_table_name)

------------------------------------------------------------
# Error ingesting data into glue.test.churn: Cannot write incompatible data to
# table 'Iceberg glue.test.churn':
# - Cannot safely cast 'Account_length': string to int
As you can see, Apache Spark raises an error if the schema of the incoming data
doesn’t match the existing Iceberg table’s schema.
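For context, a rough sketch of what such an unchecked ingestion helper might look like is shown below; the CSV options and error handling here are assumptions, and the full version lives in the book's repository:
def ingest_new_data(data_path, table_name):
    """Read a CSV extract and append it to an Iceberg table without any schema checks."""
    try:
        # Columns arrive as strings, so a mismatched target type triggers a cast error
        new_df = spark.read.csv(data_path, header=True)
        new_df.writeTo(table_name).append()
        print(f"Ingested {new_df.count()} rows into {table_name}")
    except Exception as e:
        print(f"Error ingesting data into {table_name}: {e}")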
On top of the native schema enforcement capabilities of Iceberg, TelCan can also
explicitly validate its incoming data against an existing Iceberg schema to proactively
catch any errors and notify the responsible stakeholders to avoid writing bad data
to the tables used by the training pipeline. Let’s say that currently one of its ML
training pipelines uses the churn Iceberg dataset. Here we use a custom function,
validate_and_ingest, to first validate the schema and ingest if successful (refer to
the book’s GitHub repository for full implementation of validate_and_ingest):
# Validate and ingest new data
new_data_path = "churn_etl.csv"
iceberg_table_name = "glue.test.churn"
validate_and_ingest(new_data_path, iceberg_table_name)
The preceding code first extracts the schema of the existing Iceberg table. It then
reads the new data and validates its schema against the Iceberg table’s schema. If the
schemas match, it appends the new data to the table. If not, it raises an error.
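A sketch of how such a validation step could be implemented follows; the exact comparison and the way stakeholders are notified are assumptions, and the real implementation is in the book's repository:
def validate_and_ingest(data_path, table_name):
    """Append new data only if its schema matches the target Iceberg table's schema."""
    target_schema = spark.table(table_name).schema
    new_df = spark.read.csv(data_path, header=True, inferSchema=True)

    expected = [(f.name, f.dataType) for f in target_schema.fields]
    incoming = [(f.name, f.dataType) for f in new_df.schema.fields]

    if incoming != expected:
        # This is where stakeholders would be notified (email, Slack, etc.)
        raise ValueError(
            f"Schema mismatch for {table_name}: expected {expected}, got {incoming}"
        )

    new_df.writeTo(table_name).append()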
Let’s explore the second part of the problem: adding new features without interrupting
training on current data. TelCan wants to introduce two new features, 5G_Usage_minutes
and VoLTE_calls_made, into its existing Iceberg table, churn. This new information
could be crucial in improving its churn prediction model, as patterns in adopting newer
technologies might be linked to customer retention. Changes in business requirements like this are fairly common in data workflows. It is therefore important to have a data
architecture that supports quick and easy changes to the schema without the need to
rewrite existing tables (which can be extremely costly at scale).
To do so, let’s first change the schema of the churn table using Spark SQL:
# Add the new columns to the Iceberg table
spark.sql("""
ALTER TABLE glue.test.churn
ADD COLUMNS (5G_Usage_minutes FLOAT, VoLTE_calls_made INT)
""")
Once the new columns are added, we can read the data from the new source, which
includes these columns, and append to the table:
# Reading new data
df_evolved = spark.read.csv("newdatasource.csv")

# Append the new data to the Iceberg table
df_evolved.writeTo("glue.test.churn").append()
These two steps ensure that TelCan’s underlying table’s schema matches the new data
being ingested.

Use Case 2: ML Reproducibility


Reproducibility in ML, which means achieving consistent outcomes with a given
version of the model on specific datasets, is important in both academic research and
industry. However, it is often extremely difficult for others to reproduce the empirical results reported for a specific model. Imagine you are a data scientist
at TelCan. Your churn prediction model, built on top of historical customer data,
has achieved impressive performance. Now the business has introduced a new set
of features (5G_Usage_minutes and VoLTE_calls_made) from the recent data, which
is then used to build a new model. However, you notice a dip in the new model’s
performance after a week. You suspect that the new data, or perhaps the changed
features, might be causing the dip. To validate, you would need to reproduce the
previous week’s model, which means reverting to the older version of your dataset.
With traditional data architectures, this could be quite a challenging task. How does
Apache Iceberg facilitate seamless ML reproducibility?
By default, Apache Iceberg has two capabilities to address reproducibility challenges
in ML: time travel and tagging. Let’s see how to use these features to solve TelCan’s
reproducibility problem.

Time travel
Instead of creating multiple copies of data, which is usually a common pattern when
dealing with ML reproducibility tasks, Iceberg's time-travel feature lets users access older data snapshots using either the timestamp at which the snapshot was committed or the snapshot ID. This leads to faster reproducibility, better data versioning, and
lower storage costs. Iceberg allows you to find out all the snapshots associated with a
table using the history metadata table.
Let’s see the history of TelCan’s churn table:
spark.sql("SELECT * FROM glue.test.churn.history").toPandas()
A result set for this query looks like this:

made_current_at          snapshot_id          parent_id     is_current_ancestor
2023-04-20 17:06:12.515  5889239598709613914  NaN           True
2023-05-03 19:24:24.418  7869769243560997710  5.889240e+18  True

There are two snapshots present in this table. The second snapshot (snapshot_id
7869769243560997710) is the one in which two features were added to the table,
5G_Usage_minutes and VoLTE_calls_made, which we can determine based on the
timestamp of the commit. A time-travel query could be run in Spark SQL to validate
whether that’s true:
spark.sql("SELECT * FROM glue.test.churn TIMESTAMP AS OF '2023-05-03 19:24:24.418'").toPandas()
We can see that a result set for this query includes the two new features as two
columns in our table:

State  …  Churn  5G_Usage_minutes  VoLTE_calls_made
LA     …  FALSE  5                 1
IN     …  TRUE   3                 0
NY     …  TRUE   3                 1
SC     …  FALSE  10                0
HI     …  FALSE  12                0
...    ...  ...  ...               ...
WI     …  FALSE  6                 2
AL     …  FALSE  5                 3
VT     …  FALSE  15                0
WV     …  FALSE  13                1



To compare, we could query the first snapshot ID, 5889239598709613914, and see
whether these features were present in the earlier version of the table:
spark.sql("SELECT * FROM glue.test.churn VERSION AS OF 5889239598709613914").toPandas()
If we receive the following result set, we have confirmed that earlier snapshots do not
include the two new features among their columns:

State  …  Total_intl_charge  Customer_service_calls  Churn
LA     …  2.35               1                       FALSE
IN     …  3.43               4                       TRUE
NY     …  1.46               4                       TRUE
SC     …  2.08               2                       FALSE

This confirms that the new features were introduced in the later snapshot.
Now that we have a better understanding of the history of the table, we can train
our models on the data with the schema we expect. If we want to train on the data as it was before the two features were added, we can simply time travel to that earlier snapshot, as shown in the following code snippet (on the book's
GitHub repository, you can find a longer version of this snippet that goes through the
whole model training process):
# Retrieve a specific snapshot of the Iceberg table from a week ago
snapshot_id_of_interest = 5889239598709613914

df = spark.read \
    .option("snapshot-id", snapshot_id_of_interest) \
    .format("iceberg") \
    .load("glue.test.churn")

# Convert to Pandas for ML operations
pdf = df.toPandas()
Training different versions of the model against consistent data allows us to measure
the actual benefit from changes in the model versus changes in the data. If the new model performs better on the new data with the new features but no better than the old version on the older data, the improvement comes from the data, not the model. Using Iceberg's time-travel feature, TelCan's data scientists can better
test their models against consistent, reproducible data.



Tags
Tags are labels for individual snapshots. So technically, they are just the version name
given to a snapshot ID. You can read more about tags in Chapter 11.
TelCan’s daily ingestion job now inserts some new records into the Iceberg table
churn, as shown in the following code:
## Create temporary view from new data
spark.sql(
    """CREATE OR REPLACE TEMPORARY VIEW new_train_data USING csv
    OPTIONS (path "newtrainingdata.csv", header true)"""
)

## Insert new data into table
spark.sql(
    "INSERT INTO glue.test.churn SELECT * FROM new_train_data"
)
As a data scientist, you want to use this specific version of the table to train the existing production model and to compare results easily against future versions of the data. To avoid having to regularly look up the snapshot ID, you can simply create a tag with a memorable label to refer to later. Using some SQL, we can easily create the tag:
## Run SQL to create a tag for this snapshot for easy reference
spark.sql("ALTER TABLE glue.test.churn CREATE TAG June_23")
Now, if we query the references metadata table, we can see all our tags:
spark.sql("SELECT * FROM glue.test.churn.refs").toPandas()
This query gives us the following results:

name     type    snapshot_id          max_ref...  min_snapshots_to_keep  max_snap...
main     BRANCH  2496871934354606665  NaN         NaN                    NaN
June_23  TAG     2496871934354606665  NaN         NaN                    NaN

The June_23 tag is the one we just created. We can now refer to this dataset version
using this tag and reproduce the model. The following code shows how we can
specify the tag we’d like to query (to see the full code snippet that goes through the
model training process, refer to the book’s GitHub repository):
df = spark.read \
    .option("tag", 'June_23') \
    .format("iceberg") \
    .load("glue.test.churn")



By using Iceberg features such as time travel and tagging, some of the crucial problems with ML reproducibility can be efficiently solved. This provides better flexibility
and an easier way for data scientists and ML engineers to reproduce models. This can
be achieved across multiple tables using catalog-level tagging and time travel using a
Nessie catalog, as we covered in Chapter 11.

Use Case 3: Support ML Experimentation


TelCan has observed an increased churn rate among its premium customers, leading
to substantial revenue losses. The data science team hypothesizes that integrating
customer feedback and support ticket data might reveal new patterns for predicting
churn. Implementing such a change directly in the existing churn dataset poses significant risks. This dataset drives real-time business decisions and customer outreach programs and feeds into other data products. Disrupting its integrity, even momentarily, could lead to misguided strategies and further revenue losses. Currently, the
data science team’s only option is to make multiple copies of the data when they want
to run experiments, which is expensive and hard to manage, potentially leading to
issues such as data drift. So the challenge is: how can the team safely and flexibly
experiment with new data without impacting the production dataset or making
multiple copies while controlling cost and effort?
This is where Iceberg’s table-level branching capability comes into play. Branches in
Iceberg are mutable named references to snapshots that can be updated by associating them with a new snapshot. So, if you create a branch from an existing Iceberg
table, you can write data specific to that branch in isolation and decide to make that
branch available or drop it. We discuss branching in detail in Chapter 11.
Branches provide an isolated way to interact with datasets, which benefits situations such as ML experimentation. To start, TelCan's data engineering team creates an
isolated branch, Churn_PremiumExp, off the main branch:
## Create a Branch
spark.sql("ALTER TABLE glue.test.churn CREATE BRANCH Churn_PremiumExp")
The data science team then integrates new records for their experiment, which are
extracted from customer surveys, feedback forms, and other external sources.
The new data, while potentially valuable, is experimental. So, it’s written in isolation
to the Churn_PremiumExp branch:
# Load the new data
df_exp = spark.read.csv("churndata.csv", header=True, inferSchema=True)

## Write the Data to the Branch
df_exp.write.format("iceberg").mode("append").save("glue.test.churn.branch_Churn_PremiumExp")



By writing the newly available data in an isolated branch, Iceberg ensures that there
is no impact whatsoever on the main table. So the downstream applications or users
using the main version of the table remain unaffected by these new changes. This
gives the team scope to play around with the data and conveniently build models.
Let’s quickly validate that the primary churn dataset remains untouched with the new
data inserted in the Churn_PremiumExp branch:
# Query the main dataset
spark.table("glue.test.churn").toPandas()
# Output: 1240 rows × 20 columns

# Query the experimental branch
spark.read.format("iceberg").load("glue.test.churn.branch_Churn_PremiumExp").toPandas()
# Output: 2175 rows × 20 columns


This confirms that the newly inserted records are only available in the isolated
branch, not the main table.
Using this branch-specific data, the team can build and test new predictive models as part
of their experiments. The following code reflects how we specify that we are reading data
from the branch. To see the full code example with model training, please refer to the
book’s GitHub repository:
df_prem = spark.read \
    .option("branch", 'Churn_PremiumExp') \
    .format("iceberg") \
    .load("glue.test.churn")
Post-experimentation, if the team finds the new dataset invaluable, they might consider making the data available to the main table by merging the changes
from the branch. Alternatively, if the experiment doesn’t provide substantial gains,
they can safely discard the branch, ensuring that the main table remains streamlined:
## Drop the Branch
spark.sql("ALTER TABLE glue.test.churn DROP BRANCH Churn_PremiumExp")
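If the experiment pays off instead, one way to publish the branch is Iceberg's fast_forward procedure, which advances main to the branch's latest snapshot; this sketch assumes main has not received conflicting commits in the meantime:
## Fast-forward main to the experimental branch's snapshot
spark.sql(
    "CALL glue.system.fast_forward('test.churn', 'main', 'Churn_PremiumExp')"
)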
Through Iceberg’s branching, TelCan can experiment, iterate, and innovate without
impacting its core data assets, ensuring that its decision making remains rooted in
data integrity and quality. This can be done across multiple tables using catalog-level
branching using a Nessie catalog, as discussed in Chapter 11.
In ML, ensuring data consistency, enabling iterative experiments, and reproducing
results are critical. Apache Iceberg’s features can be used to safeguard the integrity
and quality of your data and empower ML practitioners to experiment and roll back
changes seamlessly, thereby elevating the efficiency and reliability of ML workflows.



Dealing with Historical Data and Slowly Changing Dimensions

SCDs are essential for tracking data evolution in OLAP databases, particularly for
descriptive attributes tied to factual data such as product features. SCD ensures historical data accuracy, preventing disruptions to downstream analytical metrics. Among
the six SCD types, SCD Type 2 (SCD2) retains complete value history by adding
new records for dimension attribute changes while marking old ones as expired.
It employs columns such as Effective Start Date, Effective End Date, and
Is_Current Flag. SCD is a technology-independent practice applicable to various
data architectures.
Imagine that Nova Trends, a home furnishings retailer operating both offline stores
across North America and an ecommerce platform, utilizes a lakehouse architecture
with Apache Iceberg on Amazon S3 for analytics, benefiting from an open data
structure, schema evolution, time travel, and transactional guarantees.
Nova Trends faced challenges as it expanded, particularly concerning historical data
management. Updating a customer’s address directly in the Customer table when the
customer relocated from Chicago to Paris could disrupt downstream BI reports that
are crucial for sales analysis, inventory, and marketing efforts. This direct update
would also erase any historical context, such as tracking a customer’s transition from
online shopping in Chicago to in-store visits in Paris, leading to the loss of valuable
insights into customer behavior.
To address these historical data issues, the company adopted an SCD2 approach
for the Customer dimensions table, incorporating fields such as eff_start_date,
eff_end_date, and is_current to track changes effectively. The team created two
Iceberg tables, Sales (fact) and Customers (dimension, managed as SCD2); introduced a surrogate key (customer_dim_key) in the Sales table for preserving historical data accuracy; handled customer address changes using SCD2 logic with
appropriate effective dates; and added new sales data to verify the correct handling of
these changes.
Figure 4 depicts a visual representation of this approach.



Figure 4. Handling historical data using SCD2 patterns

Let’s see this in action.

Create Fact and Dimension Tables to Prepare for SCD2


To replicate the Nova Trends scenario, let's first create the dimension Iceberg table called Customers. Following is the code constructing the schema (to see the full code that
creates the table and injects sample data, refer to the book’s GitHub repository):
dim_customer_schema = StructType([
    StructField('customer_id', StringType(), False),
    StructField('first_name', StringType(), True),
    StructField('last_name', StringType(), True),
    StructField('city', StringType(), True),
    StructField('country', StringType(), True),
    StructField('eff_start_date', DateType(), True),
    StructField('eff_end_date', DateType(), True),
    StructField('timestamp', TimestampType(), True),
    StructField('is_current', BooleanType(), True),
])
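As a condensed sketch of how the table might be created and seeded, the snippet below builds two sample customer rows and a simplified surrogate-key UDF; the first customer's details and the random_udf implementation are illustrative stand-ins, and the book's repository shows the full version:
import random
from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Simplified surrogate-key generator for customer_dim_key (stand-in for the repo's version)
random_udf = udf(
    lambda: str(random.randint(10**15, 10**16 - 1)), StringType()
).asNondeterministic()

customer_dim_df = spark.createDataFrame([
    ('1', 'Jane', 'Doe', 'Toronto', 'Canada',
     datetime.strptime('2020-09-27', '%Y-%m-%d'),
     datetime.strptime('2999-12-31', '%Y-%m-%d'),
     datetime.strptime('2020-12-08 09:15:32', '%Y-%m-%d %H:%M:%S'), True),
    ('2', 'Angie', 'Keller', 'Chicago', 'US',
     datetime.strptime('2020-10-14', '%Y-%m-%d'),
     datetime.strptime('2999-12-31', '%Y-%m-%d'),
     datetime.strptime('2020-12-08 09:15:32', '%Y-%m-%d %H:%M:%S'), True)],
    dim_customer_schema)

customer_dim_df = customer_dim_df.withColumn("customer_dim_key", random_udf())
customer_dim_df.writeTo("glue.dip.customers").create()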
Let’s query the table to see what content we have inside:
spark.sql("SELECT * FROM glue.dip.customers").toPandas()



Your results should look like this:

cmr_id  ...  eff_start_date  eff_end_date  timestamp            is_crnt  customer_dim_key
1       ...  2020-09-27      2999-12-31    2020-12-08 09:15:32  True     1694797722535090
2       ...  2020-10-14      2999-12-31    2020-12-08 09:15:32  True     1694797722510645

The customer_dim_key serves as a bridge to the Customer dimension table. By using this field in the Sales table, you can link each sale to a specific version of a customer
record. This is particularly useful for scenarios such as customer changes (moving
from one city to another, in this case) and when you want to retain the context of
sales against the customer’s old and new profiles. It also helps differentiate between
historical and current records for the same customer.
Now let’s work on the fact table, Sales. In the following code, we show the schema; to
see the full code creating a DataFrame of sales data called sales_fact_df, please refer
to the book’s GitHub repository:
from pyspark.sql.functions import to_timestamp

fact_sales_schema = StructType([
    StructField('item_id', StringType(), True),
    StructField('quantity', IntegerType(), True),
    StructField('price', DoubleType(), True),
    StructField('timestamp', TimestampType(), True),
    StructField('customer_id', StringType(), True)
])
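For reference, a sketch of the sample sales data and the customer lookup used in the join that follows; the rows mirror the joined output shown later, and loading customer_ice_df from the Customers table matches how it is loaded again later in this section:
from datetime import datetime

sales_fact_df = spark.createDataFrame([
    ('111', 40, 90.5, datetime.strptime('2020-11-17 09:15:32', '%Y-%m-%d %H:%M:%S'), '1'),
    ('112', 250, 80.65, datetime.strptime('2020-10-28 09:15:32', '%Y-%m-%d %H:%M:%S'), '1'),
    ('113', 10, 600.5, datetime.strptime('2020-12-08 09:15:32', '%Y-%m-%d %H:%M:%S'), '2')],
    fact_sales_schema)

# Current customer dimension records, used on the right side of the join
customer_ice_df = spark.sql("SELECT * FROM glue.dip.customers")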
To implement SCD2, it’s crucial to determine the correct customer version (e.g., Paris
or Chicago) at the time of the sale, ensuring that each sale is associated with the
appropriate customer and therefore preserving historical context. This is achieved
through a LEFT OUTER JOIN with specific condition logic.
First we match the sale with the correct customer based on the unique customer ID (sales_fact_df.customer_id == customer_ice_df.customer_id). Then we validate that the sale's timestamp falls on or after the effective start date of the customer's record (sales_fact_df.timestamp >= customer_ice_df.eff_start_date). Finally, we verify that the sale occurred before the effective end date of the customer's record (sales_fact_df.timestamp < customer_ice_df.eff_end_date). Together, these conditions ensure that the sale is linked to the customer record that was active and valid at the sale's timestamp:



from pyspark.sql.functions import when

join_cond = [sales_fact_df.customer_id == customer_ice_df.customer_id,
             sales_fact_df.timestamp >= customer_ice_df.eff_start_date,
             sales_fact_df.timestamp < customer_ice_df.eff_end_date]

customers_dim_key_df = (sales_fact_df
    .join(customer_ice_df, join_cond, 'leftouter')
    .select(sales_fact_df['*'],
            when(customer_ice_df.customer_dim_key.isNull(), '-1')
            .otherwise(customer_ice_df.customer_dim_key)
            .alias("customer_dim_key"))
)

# Creating the Sales Iceberg table
customers_dim_key_df.writeTo("glue.dip.sales").create()
Now that we have joined the DataFrames and added the customer dimension key to each sales record, the resulting table should look something like this:
+-------+--------+-----+-------------------+-----------+----------------+
|item_id|quantity|price| timestamp|customer_id|customer_dim_key|
+-------+--------+-----+-------------------+-----------+----------------+
| 111| 40| 90.5|2020-11-17 09:15:32| 1|1694797722535090|
| 112| 250|80.65|2020-10-28 09:15:32| 1|1694797722535090|
| 113| 10|600.5|2020-12-08 09:15:32| 2|1694797722510645|
+-------+--------+-----+-------------------+-----------+----------------+

Note that the LEFT OUTER JOIN ensures that all sales records are retained in the
resulting table, even if a matching customer dimension record isn’t found. We assign
a default value (-1) to customer_dim_key for unmatched sales entries to easily identify and address them.
Now let’s run an aggregated query for some quick insights on the number of sales by
country:
spark.sql(
    'SELECT ct.country, '
    'SUM(st.quantity) as sales_quantity, '
    'COUNT(*) as count_sales '
    'FROM glue.dip.sales st '
    'INNER JOIN glue.dip.customers ct ON st.customer_dim_key = ct.customer_dim_key '
    'GROUP BY ct.country').show()

+-------+--------------+-----------+
|country|sales_quantity|count_sales|
+-------+--------------+-----------+
| Canada| 290| 2|
| US| 10| 1|
+-------+--------------+-----------+



Handle Changes to Customer Data and Implement SCD2
In this section, we’ll take a look at another hypothetical scenario. Imagine that a
customer, Angie Keller (customer_id=2), living in Chicago, has moved to Paris.
Additionally, say a new customer record for Sebastian White has arrived from the operational databases and needs to be added to the customer dimension table. In the following code, we'll create a DataFrame with the customer information:
new_customer_dim_df = spark.createDataFrame([
    ('3', 'Sebastian', 'White', 'Rome', 'IT',
     datetime.strptime(datetime.today().strftime('%Y-%m-%d'), '%Y-%m-%d'),
     datetime.strptime('2999-12-31', '%Y-%m-%d'),
     datetime.strptime('2020-12-09 09:15:32', '%Y-%m-%d %H:%M:%S'), True),
    ('2', 'Angie', 'Keller', 'Paris', 'FR',
     datetime.strptime(datetime.today().strftime('%Y-%m-%d'), '%Y-%m-%d'),
     datetime.strptime('2999-12-31', '%Y-%m-%d'),
     datetime.strptime('2020-12-09 10:15:32', '%Y-%m-%d %H:%M:%S'), True)],
    dim_customer_schema)

new_customer_dim_df = new_customer_dim_df.withColumn("customer_dim_key", random_udf())
In line with the SCD2 approach, when a fresh customer record is created, the preceding record must be marked as outdated. To implement this, we do the following.
First, a join condition, join_cond, is set to match records from the existing customer
dataset with the new incoming records based on the customer_id. Using the defined
join condition, we identify records in the existing dataset that have corresponding
entries in the new dataset.
For these matched records, we retain most of their original attributes but adjust
the eff_end_date to match the eff_start_date of the new record and set the
is_current value to false (indicating that it is no longer the current version of that
customer’s data). We then combine the updated customer records with the entirely
new records, resulting in a consolidated dataset (merged_customers_df):
from pyspark.sql.functions import lit

# Match records from existing dataset with new records based on customer_id
join_cond = [customer_ice_df.customer_id == new_customer_dim_df.customer_id,
             customer_ice_df.is_current == True]

## Find customer records to update
customers_to_update_df = (customer_ice_df
    .join(new_customer_dim_df, join_cond)
    .select(customer_ice_df.customer_id,
            customer_ice_df.first_name,
            customer_ice_df.last_name,
            customer_ice_df.city,
            customer_ice_df.country,
            customer_ice_df.eff_start_date,
            new_customer_dim_df.eff_start_date.alias("eff_end_date"),
            customer_ice_df.customer_dim_key,
            customer_ice_df.timestamp)
    .withColumn('is_current', lit(False))
)

## Union with new customer records
merged_customers_df = new_customer_dim_df.unionByName(customers_to_update_df)

Finally, we utilize Apache Iceberg’s upsert command (MERGE INTO) to update records
if they exist or insert them if they don’t. Iceberg allows updating customer dimension
records with transactional guarantees in a data lake:
# Convert the merged_customers_df to a SQL view
merged_customers_df.createOrReplaceTempView("merged_customers_view")

# MERGE INTO statement
merge_sql = """
MERGE INTO glue.dip.customers AS target
USING merged_customers_view AS source
ON target.customer_dim_key = source.customer_dim_key
WHEN MATCHED THEN
  UPDATE SET
    target.first_name = source.first_name,
    target.last_name = source.last_name,
    target.city = source.city,
    target.country = source.country,
    target.eff_start_date = source.eff_start_date,
    target.eff_end_date = source.eff_end_date,
    target.timestamp = source.timestamp,
    target.is_current = source.is_current
WHEN NOT MATCHED THEN
  INSERT *
"""

# Execute the SQL statement
spark.sql(merge_sql)

Let’s query the customers table to see the results of our updates:
spark.sql("SELECT * FROM glue.dip.customers").toPandas()
You can see the customer data has been updated:



cmr_id  ...  eff_start_date  eff_end_date  timestamp            is_crnt  customer_dim_key
1       ...  2020-09-27      2999-12-31    2020-12-08 09:15:32  True     1694797722535090
2       ...  2020-10-14      2023-09-15    2020-12-08 09:15:32  False    1694797722510645
2       ...  2023-09-15      2999-12-31    2020-12-09 10:15:32  True     1694798226033799
3       ...  2023-09-15      2999-12-31    2020-12-09 09:15:32  True     169479822594

This table has been truncated to fit the page. You can see the full table at this
supplemental repository. As expected, we can see two records for Angie now; one is
historical with the old city, Chicago, and the other is the current record with the new
city, Paris. The record for Chicago now has eff_end_date set to 2023-09-15, which
is when the Paris record was activated (notice the eff_start_date). Additionally, the
brand-new customer data (Sebastian White) has been appended to the dimension
table using the upsert query in Iceberg. By implementing the SCD2 technique here,
Nova Trends can now keep track of the historical changes for its customers.

Add New Sales Data from a New Location


Let’s go a step further and add some new sales data for Angie from their new location
in Paris:
sales_fact_df = spark.createDataFrame([
    ('103', 300, 15.8,
     datetime.strptime(datetime.today().strftime('%Y-%m-%d') + ' 12:15:42', '%Y-%m-%d %H:%M:%S'), '2'),
    ('104', 10, 800.5,
     datetime.strptime(datetime.today().strftime('%Y-%m-%d') + ' 06:35:32', '%Y-%m-%d %H:%M:%S'), '2')],
    fact_sales_schema)

sales_fact_df.show()

+-------+--------+-----+-------------------+-----------+
|item_id|quantity|price|          timestamp|customer_id|
+-------+--------+-----+-------------------+-----------+
|    103|     300| 15.8|2023-09-15 12:15:42|          2|
|    104|      10|800.5|2023-09-15 06:35:32|          2|
+-------+--------+-----+-------------------+-----------+

# Load the current customer dimension records
customer_ice_df = spark.sql("SELECT * FROM glue.dip.customers")

join_cond = [sales_fact_df.customer_id == customer_ice_df.customer_id,
             sales_fact_df.timestamp >= customer_ice_df.eff_start_date,
             sales_fact_df.timestamp <= customer_ice_df.eff_end_date]

customers_dim_key_df = (sales_fact_df
    .join(customer_ice_df, join_cond, 'leftouter')
    .select(sales_fact_df['*'],
            when(customer_ice_df.customer_dim_key.isNull(), '-1')
            .otherwise(customer_ice_df.customer_dim_key)
            .alias("customer_dim_key"))
)

customers_dim_key_df.writeTo("glue.dip.sales").append()

If we now read the Sales table, we should see all the sales records along with updated
data:
spark.sql("select * from glue.dip.sales").toPandas()
Your results should look like this:

item_id  quantity  price   timestamp            customer_id  customer_dim_key
103      300       15.80   2023-09-15 12:15:42  2            1694798226033799
104      10        800.50  2023-09-15 06:35:32  2            1694798226033799
111      40        90.50   2020-11-17 09:15:32  1            1694797722535090
112      250       80.65   2020-10-28 09:15:32  1            1694797722535090
113      10        600.50  2020-12-08 09:15:32  2            1694797722510645

Finally, let’s run the aggregated query to see if the changes were recorded as expected:
spark.sql("""
    SELECT ct.country, SUM(st.quantity) as sales_quantity, COUNT(*) as count_sales
    FROM glue.dip.sales st
    INNER JOIN glue.dip.customers ct ON st.customer_dim_key = ct.customer_dim_key
    GROUP BY ct.country
""").show()

+-------+--------------+-----------+
|country|sales_quantity|count_sales|
+-------+--------------+-----------+
| US| 10| 1|
| FR| 310| 2|
| Canada| 290| 2|
+-------+--------------+-----------+
Now we have two sales for Angie, as expected after implementing the SCD2 logic
with the up-to-date address (country = FR).
