Apache Iceberg - Additional Real World Use Cases
While streaming data presents a lot of interesting scope for downstream analytics use
cases, it also introduces hurdles in terms of scalability, reliability, and cost factors.
With the constant influx of data, you need a data architecture that can scale and
seamlessly handle increasing data volume as well as deal with small file problems,
schema changes, and more. Additionally, the platform should be able to support
reliable transactions and provide a cost-effective way to handle them.
Let’s look at a use case and see how we can address the problem using AWS Kinesis
and Apache Iceberg.
Consider AeroWorld. For this aviation company, flight safety and timeliness are
supercritical. To ensure this, the company relies heavily on weather predictions.
Traditional weather forecasts, updated hourly or daily, are not always enough.
The aviation industry requires more frequent updates due to the rapid changes
in atmospheric conditions that can impact flight trajectories, passenger safety, and
schedules.
AeroWorld’s engineering team aims to integrate data from OpenWeather and other
sources to provide streaming insights into atmospheric changes. However, this presents a
massive data challenge: how can they efficiently ingest, process, and store large volumes
of global weather data, and, more importantly, how can the downstream analytics team
consume this data for reporting or predictive analytics? Here’s how we can approach the
problem.
First we will use AWS Kinesis Data Streams to ingest and store large volumes of data
from our data source, OpenWeather’s API. Once the weather streams are stored in
Kinesis, we will use AWS Glue to create and run an ETL job to ingest the data into an
Apache Iceberg table sitting atop an Amazon S3 data lake. Iceberg as the table format will
serve as an efficient storage layer providing capabilities such as consistency guarantees,
schema evolution, hidden partitioning, and time travel, which are critical for analytical
workloads. Once the data is stored in Iceberg’s open table format, different compute
engines can consume the data for various use cases. For instance, AeroWorld might be
either building some dashboards to monitor weather events or building ML models.
Figure 1 shows a high-level architecture diagram of our solution.
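The stream itself has to exist before the producer can write to it. As a minimal sketch (the stream name comes from this example; the region and shard count are assumptions), it could be created with boto3 like this:
import boto3

# Kinesis client in the same region used by the Glue job later in this chapter
kinesis_client = boto3.client("kinesis", region_name="us-east-1")

# Create the stream used throughout this example; a single shard is enough for a demo
kinesis_client.create_stream(StreamName="example_stream_book", ShardCount=1)

# Wait until the stream is ACTIVE before producing data
kinesis_client.get_waiter("stream_exists").wait(StreamName="example_stream_book")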
Now that you’ve created your stream, let’s dive into a couple of important aspects.
Data producer
In this example, OpenWeather’s API is the primary data producer, pushing real-time
weather metrics into the Kinesis stream. You can integrate these data sources with
Kinesis using SDKs or connectors, ensuring seamless data push without loss. We
will use the boto3 SDK to interact with our Kinesis data stream. The following code
acts as a producer to fetch real-time weather data for various cities and ingest this
data into our Kinesis stream, example_stream_book. To see the full code snippet with
client setup and function definitions, visit the book’s GitHub repository.
In the following code block, first we iterate through a list of cities and attempt to
fetch weather data from the API; if the fetch is successful, we pass the data to another
function to send the data into the Kinesis stream. After each iteration, we put the
process to sleep to avoid throttling from the API:
## Setup of client details done before this
## Full Code on Book Repository
if __name__ == "__main__":
    for city in cities:
        data = get_weather_data(city)
        if data:
            send_data_to_kinesis(data)
        time.sleep(5)  # Short delay to avoid API rate limits and pace the inserts
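For reference, here is one possible sketch of the send_data_to_kinesis helper (the book's actual implementation is in its GitHub repository; partitioning by the OpenWeather city name field is an assumption):
import json

def send_data_to_kinesis(data):
    # Write one weather reading to the stream as JSON, partitioned by city name
    kinesis_client.put_record(
        StreamName="example_stream_book",
        Data=json.dumps(data),
        PartitionKey=data.get("name", "unknown"),
    )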
In the settings for our SQL node, set the name of the node to Create Table. Here, you can also set a name for the incoming data from the prior Kinesis node; keep the default name, myDataSource, since the SQL queries reference it.
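The query for this node mirrors the CREATE TABLE statement that appears in the generated script later in this section:
CREATE TABLE IF NOT EXISTS glue.dip.my_streaming_data AS
(SELECT * FROM myDataSource LIMIT 0);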
Add another SQL node, named INSERT INTO, also flowing from the Kinesis node.
This node’s function is to populate the table post-creation and during subsequent
stream checks. It must have the Kinesis node as its parent so that it can receive the
data from the stream.
Here is the query used for this SQL node:
INSERT INTO glue.dip.my_streaming_data SELECT * FROM myDataSource;
Figure 2 shows what your AWS Glue flow should look like.
Now move to the “Job details” tab to finalize the connection settings between Glue
and Apache Iceberg.
• --datalake-formats: We will use iceberg to let Glue know which table format to
use.
• --conf: This input comprises key-value pairings to configure Iceberg with tasks
such as creating the catalog, providing the warehouse directory, and implement‐
ing Spark SQL-specific settings.
Note that the first setting in the value does not start with --conf; the job parameter's key is already --conf, so the leading --conf is omitted there. Be careful not to have any unexpected line breaks in the settings passed as the --conf parameter. We discussed the configuration of these parameters using Spark in detail in Chapters 5 and 6.
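As a sketch, the value passed for the --conf parameter could look something like the following (the catalog name glue matches the queries in this chapter; the S3 warehouse path is a placeholder you would replace with your own bucket):
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue.warehouse=s3://<your-bucket>/warehouse/
--conf spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO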
After defining the required configurations, we can save the job and move to the Script
tab. Following is the code at this point (minus the initial setup code; please refer to the
book’s GitHub repository for this chapter for the full code):
#######
# Script Generated for Kinesis Node
#######
dataframe_AmazonKinesis_node1693319486385 = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "typeOfData": "kinesis",
        "streamARN": "arn:aws:kinesis:us-east-1:xxxxxxxxxx:stream/example_stream_book",
        "classification": "json",
        "startingPosition": "latest",
        "inferSchema": "true",
    },
    transformation_ctx="dataframe_AmazonKinesis_node1693319486385",
)
#######
# Script Generated for Create Table Node
#######
SqlQuery0 = """
CREATE TABLE IF NOT EXISTS glue.dip.my_streaming_data AS
(SELECT * FROM myDataSource LIMIT 0);
"""
CreateTable_node1693319510128 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": AmazonKinesis_node1693319486385},
    transformation_ctx="CreateTable_node1693319510128",
)
#######
# Script Generated for Insert Into
#######
SqlQuery1 = """
INSERT INTO glue.dip.my_streaming_data SELECT * FROM myDataSource;
"""
InsertInto_node1693319537299 = sparkSqlQuery(
    glueContext,
    query=SqlQuery1,
    mapping={"myDataSource": AmazonKinesis_node1693319486385},
    transformation_ctx="InsertInto_node1693319537299",
)
glueContext.forEachBatch(
    frame=dataframe_AmazonKinesis_node1693319486385,
    batch_function=processBatch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args["TempDir"] + "/" + args["JOB_NAME"] + "/checkpoint/",
    },
)
job.commit()
Now let’s run the job. We should see the records in our Apache Iceberg table, my_streaming_data. Once the job starts processing records, the Glue dashboard will display the table, indicating the location of the current metadata file. The S3 output directory will have
a database-specific folder that houses individual folders for each table, each including
data and metadata. Note that this streaming job will continue to run and write
data into the Iceberg table. The new updates will then be available for downstream
applications.
Downstream Analytics
Now that we have the streaming data in an open table format (Iceberg), it is ready
to be consumed by downstream analytical workloads such as BI and ML. Having
an architecture with Apache Iceberg as the table format for streaming workloads
offers strong data management capabilities compared to relying on datafile formats such as Parquet alone. One of Iceberg’s core strengths is its ability to manage schema evolution efficiently, allowing real-time data schemas to adapt and evolve without interrupting ongoing processes, which is crucial for streaming workloads. Consider a second scenario: TelCan, a telecom company, trains its ML models on data stored in Iceberg tables, such as its churn dataset. Its team has two key questions:
• How does it ensure that the schema is respected during writes and notify
stakeholders of any failures?
• How can it incorporate new features into the training pipeline without impacting
the current datasets used for training?
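To illustrate the first question, here is a minimal, hypothetical sketch of an append that violates the table's schema (churn_new_batch.csv is an assumed file whose columns share the table's names but, read without an explicit schema, arrive as strings):
# Hypothetical new batch read without an explicit schema, so Account_length is a string
incoming_df = spark.read.option("header", "true").csv("churn_new_batch.csv")

try:
    # Appending enforces the target Iceberg table's schema
    incoming_df.writeTo("glue.test.churn").append()
except Exception as e:
    print(f"# Error ingesting data into glue.test.churn: {e}")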
# Error ingesting data into glue.test.churn: Cannot write incompatible data to
table 'Iceberg glue.test.churn':
- Cannot safely cast 'Account_length': string to int
As you can see, Apache Spark raises an error if the schema of the incoming data
doesn’t match the existing Iceberg table’s schema.
On top of the native schema enforcement capabilities of Iceberg, TelCan can also
explicitly validate its incoming data against an existing Iceberg schema to proactively
catch any errors and notify the responsible stakeholders to avoid writing bad data
to the tables used by the training pipeline. Let’s say that currently one of its ML
training pipelines uses the churn Iceberg dataset. Here we use a custom function,
validate_and_ingest, to first validate the schema and ingest if successful (refer to
the book’s GitHub repository for full implementation of validate_and_ingest):
# Validate and ingest new data
new_data_path = "churn_etl.csv"
iceberg_table_name = "glue.test.churn"
validate_and_ingest(new_data_path, iceberg_table_name)
The preceding code first extracts the schema of the existing Iceberg table. It then
reads the new data and validates its schema against the Iceberg table’s schema. If the
schemas match, it appends the new data to the table. If not, it raises an error.
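One possible sketch of such a helper follows (this is not the book's exact implementation; the notification step is only indicated by a comment):
def validate_and_ingest(new_data_path, iceberg_table_name):
    # Schema of the existing Iceberg table
    table_schema = spark.table(iceberg_table_name).schema

    # Read the new data, letting Spark infer its schema from the CSV
    new_df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv(new_data_path))

    # Compare column names and data types field by field
    expected = [(f.name, f.dataType) for f in table_schema.fields]
    incoming = [(f.name, f.dataType) for f in new_df.schema.fields]
    if expected != incoming:
        # In practice, this is where the responsible stakeholders would be notified
        raise ValueError(
            f"Schema mismatch for {iceberg_table_name}: expected {expected}, got {incoming}")

    # Schemas match: append the new data to the Iceberg table
    new_df.writeTo(iceberg_table_name).append()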
Let’s explore the second part of the problem: adding new features without interrupting
training on current data. TelCan wants to introduce two new features, 5G_Usage_minutes
and VoLTE_calls_made, into its existing Iceberg table, churn. This new information
could be crucial in improving its churn prediction model, as patterns in adopting newer
technologies might be linked to customer retention. Changes in business requirements like this are common, and Iceberg’s schema evolution lets the team add the new columns without rewriting existing datafiles or interrupting the pipelines that read the table.
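As a sketch, the two columns could be added with a metadata-only schema change through Spark SQL (the column types here are assumptions):
spark.sql("""
    ALTER TABLE glue.test.churn
    ADD COLUMNS (`5G_Usage_minutes` DOUBLE, `VoLTE_calls_made` INT)
""")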
Time travel
Instead of creating multiple copies of data, which is a common pattern when dealing with ML reproducibility tasks, Iceberg’s time-travel feature lets users access earlier versions (snapshots) of a table on demand.
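One way to inspect a table's history is to query Iceberg's snapshots metadata table, sketched here:
# List the table's snapshots with their commit timestamps and IDs
spark.sql("SELECT committed_at, snapshot_id, operation FROM glue.test.churn.snapshots").show(truncate=False)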
There are two snapshots present in this table. The second snapshot (snapshot_id
7869769243560997710) is the one in which two features were added to the table,
5G_Usage_minutes and VoLTE_calls_made, which we can determine based on the
timestamp of the commit. A time-travel query could be run in Spark SQL to validate
whether that’s true:
spark.sql("SELECT * FROM glue.test.churn TIMESTAMP AS OF '2023-05-03
19:24:24.418'").toPandas()
The result set for this query (truncated here) includes the two new features, 5G_Usage_minutes and VoLTE_calls_made, as columns in our table, alongside the original churn columns.
This confirms that the new features were introduced in the later snapshot.
Now that we have a better understanding of the history of the table, we can train
our models on the data with the schema we expected. If we want to train on the data as it existed before the two features were added, we can simply time travel to the earlier snapshot, as shown in the following code snippet (on the book’s
GitHub repository, you can find a longer version of this snippet that goes through the
whole model training process):
# Retrieve the snapshot of the Iceberg table from before the new features were added
snapshot_id_of_interest = 5889239598709613914
df = spark.read \
    .option("snapshot-id", snapshot_id_of_interest) \
    .format("iceberg") \
    .load("glue.test.churn")
The June_23 tag is the one we just created. We can now refer to this dataset version
using this tag and reproduce the model. The following code shows how we can
specify the tag we’d like to query (to see the full code snippet that goes through the
model training process, refer to the book’s GitHub repository):
df = spark.read \
    .option("tag", "June_23") \
    .format("iceberg") \
    .load("glue.test.churn")
For this use case, Nova Trends’ customers dimension table (truncated to fit the page) includes columns such as customer_id, eff_start_date, eff_end_date, is_current, and customer_dim_key.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Schema for the incoming sales fact records
fact_sales_schema = StructType([
    StructField('item_id', StringType(), True),
    StructField('quantity', IntegerType(), True),
    StructField('price', DoubleType(), True),
    StructField('timestamp', TimestampType(), True),
    StructField('customer_id', StringType(), True)
])
To implement SCD2, it’s crucial to determine the correct customer version (e.g., Paris
or Chicago) at the time of the sale, ensuring that each sale is associated with the
appropriate customer and therefore preserving historical context. This is achieved
through a LEFT OUTER JOIN with specific condition logic.
First we match the sale with the correct customer based on their unique customer
ID (sales_fact_df.customer_id == customer_ice_df.customer_id). Then we
validate that the sale’s timestamp falls on or after the effective start date of the cus‐
tomer’s record (sales_fact_df.timestamp >= customer_ice_df.eff_start_date).
Additionally, we verify that the sale occurred before the effective end date of the cus‐
tomer’s record. These conditions together ensure that the sale is linked to the active
and valid customer record during the sale’s timestamp (sales_fact_df.timestamp <
customer_ice_df.eff_end_date):
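Expressed as a PySpark join condition, these checks could look like the following sketch (the book defines join_cond in its full code):
join_cond = [
    sales_fact_df.customer_id == customer_ice_df.customer_id,
    sales_fact_df.timestamp >= customer_ice_df.eff_start_date,
    sales_fact_df.timestamp < customer_ice_df.eff_end_date,
]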
from pyspark.sql.functions import when

customers_dim_key_df = (sales_fact_df
    .join(customer_ice_df, join_cond, 'leftouter')
    .select(sales_fact_df['*'],
            when(customer_ice_df.customer_dim_key.isNull(), '-1')
            .otherwise(customer_ice_df.customer_dim_key)
            .alias("customer_dim_key"))
)
Note that the LEFT OUTER JOIN ensures that all sales records are retained in the
resulting table, even if a matching customer dimension record isn’t found. We assign
a default value (-1) to customer_dim_key for unmatched sales entries to easily iden‐
tify and address them.
Now let’s run an aggregated query for some quick insights on the number of sales by
country:
spark.sql(
    'SELECT ct.country, '
    'SUM(st.quantity) as sales_quantity, '
    'COUNT(*) as count_sales '
    'FROM glue.dip.sales2 st '
    'INNER JOIN glue.dip.customers ct on st.customer_dim_key = ct.customer_dim_key '
    'group by ct.country').show()
+-------+--------------+-----------+
|country|sales_quantity|count_sales|
+-------+--------------+-----------+
| Canada| 290| 2|
| US| 10| 1|
+-------+--------------+-----------+
# Assign a surrogate dimension key to each incoming customer record
# (random_udf, which generates these keys, is defined in the book's full code)
new_customer_dim_df = new_customer_dim_df.withColumn("customer_dim_key", random_udf())
In line with the SCD2 approach, when a fresh customer record is created, the preced‐
ing record must be marked as outdated. To implement this, we do the following.
First, a join condition, join_cond, is set to match records from the existing customer
dataset with the new incoming records based on the customer_id. Using the defined
join condition, we identify records in the existing dataset that have corresponding
entries in the new dataset.
For these matched records, we retain most of their original attributes but adjust
the eff_end_date to match the eff_start_date of the new record and set the
is_current value to false (indicating that it is no longer the current version of that
customer’s data). We then combine the updated customer records with the entirely
new records, resulting in a consolidated dataset (merged_customers_df):
from pyspark.sql.functions import lit
# Match records from existing dataset with new records based on customer_id
join_cond = [customer_ice_df.customer_id == new_customer_dim_df.customer_id,
customer_ice_df.is_current == True]
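The steps described above could be sketched as follows (only a few dimension columns are kept for brevity; the book's full snippet is in its repository):
# Expire the matched current records: keep their attributes but close out the
# validity window and flag them as no longer current
expired_customers_df = (customer_ice_df
    .join(new_customer_dim_df, join_cond)
    .select(customer_ice_df.customer_id,
            customer_ice_df.customer_dim_key,
            customer_ice_df.eff_start_date,
            new_customer_dim_df.eff_start_date.alias("eff_end_date"),
            lit(False).alias("is_current"))
)

# Combine the expired versions with the brand-new records
merged_customers_df = expired_customers_df.unionByName(
    new_customer_dim_df, allowMissingColumns=True)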
Finally, we utilize Apache Iceberg’s upsert command (MERGE INTO) to update records
if they exist or insert them if they don’t. Iceberg allows updating customer dimension
records with transactional guarantees in a data lake:
# Convert the merged_customers_df to SQL view
merged_customers_df.createOrReplaceTempView("merged_customers_view")
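A sketch of what that MERGE INTO statement could look like, matching on the surrogate key (the book's exact statement is in its repository):
spark.sql("""
    MERGE INTO glue.dip.customers t
    USING merged_customers_view s
    ON t.customer_dim_key = s.customer_dim_key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")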
Let’s query the customers table to see the results of our updates:
spark.sql("SELECT * FROM glue.dip.customers").toPandas()
You can see the customer data has been updated:
This table has been truncated to fit the page; you can see the full table in the book’s supplemental repository. As expected, we can see two records for Angie now; one is
historical with the old city, Chicago, and the other is the current record with the new
city, Paris. The record for Chicago now has eff_end_date set to 2023-09-15, which
is when the Paris record was activated (notice the eff_start_date). Additionally, the
brand-new customer data (Sebastian White) has been appended to the dimension
table using the upsert query in Iceberg. By implementing the SCD2 technique here,
Nova Trends can now keep track of the historical changes for its customers.
Next, new sales records arrive. Let’s look at the incoming sales fact data:
sales_fact_df.show()
+-------+--------+-----+-------------------+-----------+
|item_id|quantity|price| timestamp|customer_id|
+-------+--------+-----+-------------------+-----------+
| 103| 300| 15.8|2023-09-15 12:15:42| 2|
| 104| 10|800.5|2023-09-15 06:35:32| 2|
+-------+--------+-----+-------------------+-----------+
customers_dim_key_df = (sales_fact_df
    .join(customer_ice_df, join_cond, 'leftouter')
    .select(sales_fact_df['*'],
            when(customer_ice_df.customer_dim_key.isNull(), '-1')
            .otherwise(customer_ice_df.customer_dim_key)
            .alias("customer_dim_key"))
)
customers_dim_key_df.writeTo("glue.dip.sales").append()
If we now read the sales table, we should see all the sales records along with the updated data:
spark.sql("select * from glue.dip.sales").toPandas()
The results should show each sale with its resolved customer_dim_key (output truncated here).
Finally, let’s run the aggregated query to see if the changes were recorded as expected:
spark.sql("""
SELECT ct.country, SUM(st.quantity) as sales_quantity, COUNT(*) as
count_sales
FROM glue.dip.sales st
INNER JOIN glue.dip.customers ct on st.customer_
dim_key = ct.customer_dim_key group by ct.country
""").show()
+-------+--------------+-----------+
|country|sales_quantity|count_sales|
+-------+--------------+-----------+
| US| 10| 1|
| FR| 310| 2|
| Canada| 290| 2|
+-------+--------------+-----------+
Now we have two sales for Angie, as expected after implementing the SCD2 logic
with the up-to-date address (country = FR).