SlideShare a Scribd company logo
25127
Migrating to Apache
Iceberg
1
Presented by Alex Merced - Dremio Developer Advocate - Follow @amdatalakehouse
25127
What will we
learn?
- Iceberg’s Structure
- What Catalog to use?
- The migrate procedure
- The add_files procedure
- Migrate with CTAS Statements
- Summary
25127
Apache Iceberg Apache Iceberg’s approach is to define the table through three layers of
metadata. These categories are:
- metadata files that define the table
- manifest lists that define a snapshot of the table, with a list of
manifests that make up the snapshot and metadata about their data
- manifests is a list of data files along with metadata on those data
files for file pruning
25127
Iceberg table format
● Overview of the components
● Summary of the read path (SELECT) #
metadata layer
data layer
25127
Iceberg Catalog
● A store that houses the current
metadata pointer for Iceberg tables
● Must support atomic operations for
updating the current metadata pointer
(e.g. HDFS, HMS, Nessie)
table1’s current metadata pointer
● Mapping of table name to the location
of current metadata file
Iceberg components: Catalog
1
2
metadata layer
data layer
25127
Metadata file - stores metadata about a
table at a certain point in time
{
"table-uuid" : "<uuid>",
"location" : "/path/to/table/dir",
"schema": {...},
"partition-spec": [ {<partition-details>}, ...],
"current-snapshot-id": <snapshot-id>,
"snapshots": [ {
"snapshot-id": <snapshot-id>
"manifest-list": "/path/to/manifest/list.avro"
}, ...],
...
}
Iceberg components: Metadata File
3
5
4
metadata layer
data layer
25127
Iceberg components: Manifest List
Manifest list file - a list of manifest files
{
"manifest-path" : "/path/to/manifest/file.avro",
"added-snapshot-id": <snapshot-id>,
"partition-spec-id": <partition-spec-id>,
"partitions": [ {partition-info}, ...],
...
}
{
"manifest-path" : "/path/to/manifest/file2.avro",
"added-snapshot-id": <snapshot-id>,
"partition-spec-id": <partition-spec-id>,
"partitions": [ {partition-info}, ...],
...
}
6
metadata layer
data layer
25127
Manifest file - a list of data files, along with
details and stats about each data file
{
"data-file": {
"file-path": "/path/to/data/file.parquet",
"file-format": "PARQUET",
"partition":
{"<part-field>":{"<data-type>":<value>}},
"record-count": <num-records>,
"null-value-counts": [{
"column-index": "1", "value": 4
}, ...],
"lower-bounds": [{
"column-index": "1", "value": "aaa"
}, ...],
"upper-bounds": [{
"column-index": "1", "value": "eee"
}, ...],
}
...
}
{
...
}
Iceberg components: Manifest file
7
8
metadata layer
data layer
25127
Iceberg Catalogs
9
25127
10
Type of Catalog Pros Cons
Project Nessie - Git Like Functionality
- Cloud Managed Service (Arctic)
- Support from engines beyond
Spark & Dremio
- If not using arctic must deploy
and maintain Nessie server
Hive Metastore - Can use existing Hive Metastore - You have to deploy and maintain
a hive metastore
AWS Glue - Interop with AWS Services - Support outside of AWS, Spark
and Dremio
What can be used an Iceberg Catalog
Catalogs help tracks Iceberg tables and provide locking mechanisms for ACID Guarantees.
Thing to keep in mind is that while many engines may support Iceberg tables they may not
support connections to all catalogs.
- HDFS, JDBC and REST catalogs also available
- With 0.14 there will be a registerTable method to migrate tables between catalogs
without losing snapshot history
25127
Migrating to Iceberg
11
25127
Existing Parquet files that are part of a Hive Table
Apache Iceberg has a Call procedure called migrate which can be used to create a new Iceberg table from an
the existing Hive table. This uses the existing data files in the table (must be parquet). This REPLACES the
Hive table with an Iceberg table, so the Hive table will no longer exist after this operation:
CALL catalog_name.system.migrate('db.sample')
To test the results of migrate before running migrate you can use the snapshot call procedure to create
temporary iceberg table based on a Hive table without replacing the Hive table. (temporary tables have
limitations).
CALL catalog_name.system.snapshot('db.sample', 'db.snap')
12
25127
Parquet files not part of a hive table
If you have paquet files representing a dataset that are not yet part of an Iceberg table, you can use the
add_files procedure to add those datafiles to that of an existing table with a matching schema.
CALL spark_catalog.system.add_files(
table => 'db.tbl',
source_table => '`parquet`.`path/to/table`'
)
13
25127
Migration by Restating the Data
You can use CREATE TABLE…AS (CTAS) statements to create new Iceberg tables from pre-existing tables.
This will create new data files but allows you to modify partitioning, schema and do validations at the time of
migration if choose to. Since the data is being rewritten/restated, this will be more time consuming that
migrate or add_files. (This can be done with Spark or Dremio)
CREATE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), truncate(last_name, 2))
AS SELECT ...
14
25127
When to use which approach?
15
Approach
Are Data Files
Rewritten
Can I update Schema
and Partition while
migrating?
When to use?
migrate No No
Moving existing hive table to a
NEW iceberg table
add_files No No
Moving files from existing table
in parquet to an existing
Iceberg table
CTAS Yes Yes
Moving existing table
anywhere to NEW iceberg
table with schema or
partitioning changes
registerTable
(to be released with 0.14)
No No
Moving Iceberg table from one
catalog to another.
25127
Best Practices
1. As the process begins, the new Iceberg table is not yet created or in sync with the source. User-facing
read and write operations remain operating on the source table.
2. The table is created but not fully in sync. Read operations are applied on the source table and write
operations are applied to the source and new table.
3. Once the new table is in sync, you can switch to read operations on the new table. Until you are certain
the migration is successful, continue to apply write operations to the source and new table.
4. When everything is tested, synced, and working properly, you can apply all read and write operations to
the new Iceberg table and retire the source table
16
25127
Q&A
17
Ad

More Related Content

What's hot (20)

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
Jan Graßegger
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
Dori Waldman
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
Diego Pacheco
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Speed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with AlluxioSpeed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with Alluxio
Alluxio, Inc.
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
kafka
kafkakafka
kafka
Amikam Snir
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
Jan Graßegger
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
Dori Waldman
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Speed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with AlluxioSpeed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with Alluxio
Alluxio, Inc.
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 

Similar to Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg (20)

OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd
 
Redefining tables online without surprises
Redefining tables online without surprisesRedefining tables online without surprises
Redefining tables online without surprises
Nelson Calero
 
Apache Iceberg Presentation 101:Lakehouse
Apache Iceberg Presentation 101:LakehouseApache Iceberg Presentation 101:Lakehouse
Apache Iceberg Presentation 101:Lakehouse
tripathisachinwork
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
Subhas Kumar Ghosh
 
External & Managed Tables In Fabric Lakehouse.pptx
External & Managed Tables In Fabric Lakehouse.pptxExternal & Managed Tables In Fabric Lakehouse.pptx
External & Managed Tables In Fabric Lakehouse.pptx
Puneet Vijwani
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
The Best of Both Worlds: Hybrid Clustering with Delta Lake
The Best of Both Worlds: Hybrid Clustering with Delta LakeThe Best of Both Worlds: Hybrid Clustering with Delta Lake
The Best of Both Worlds: Hybrid Clustering with Delta Lake
carlyakerly1
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
metsarin
 
Ecosistema MLOps: Gestión y Despliegue de Modelos
Ecosistema MLOps: Gestión y Despliegue de ModelosEcosistema MLOps: Gestión y Despliegue de Modelos
Ecosistema MLOps: Gestión y Despliegue de Modelos
ManuelGarcia99558
 
Metail at Cambridge AWS User Group Main Meetup #3
Metail at Cambridge AWS User Group Main Meetup #3Metail at Cambridge AWS User Group Main Meetup #3
Metail at Cambridge AWS User Group Main Meetup #3
Gareth Rogers
 
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdfZesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Eran Levy
 
Db2 Important questions to read
Db2 Important questions to readDb2 Important questions to read
Db2 Important questions to read
Prasanth Dusi
 
July 2017 Meeting of the Denver AWS Users' Group
July 2017 Meeting of the Denver AWS Users' GroupJuly 2017 Meeting of the Denver AWS Users' Group
July 2017 Meeting of the Denver AWS Users' Group
David McDaniel
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
SQL Server 2016 - Stretch DB
SQL Server 2016 - Stretch DB SQL Server 2016 - Stretch DB
SQL Server 2016 - Stretch DB
Shy Engelberg
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflake
Sivakumar Ramar
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
Trieu Nguyen
 
AWS RDS Migration Tool
AWS RDS Migration Tool AWS RDS Migration Tool
AWS RDS Migration Tool
Blazeclan Technologies Private Limited
 
Big Data Analytics Module-4 as per vtu .pptx
Big Data Analytics Module-4 as per vtu .pptxBig Data Analytics Module-4 as per vtu .pptx
Big Data Analytics Module-4 as per vtu .pptx
shilpabl1803
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd
 
Redefining tables online without surprises
Redefining tables online without surprisesRedefining tables online without surprises
Redefining tables online without surprises
Nelson Calero
 
Apache Iceberg Presentation 101:Lakehouse
Apache Iceberg Presentation 101:LakehouseApache Iceberg Presentation 101:Lakehouse
Apache Iceberg Presentation 101:Lakehouse
tripathisachinwork
 
External & Managed Tables In Fabric Lakehouse.pptx
External & Managed Tables In Fabric Lakehouse.pptxExternal & Managed Tables In Fabric Lakehouse.pptx
External & Managed Tables In Fabric Lakehouse.pptx
Puneet Vijwani
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
The Best of Both Worlds: Hybrid Clustering with Delta Lake
The Best of Both Worlds: Hybrid Clustering with Delta LakeThe Best of Both Worlds: Hybrid Clustering with Delta Lake
The Best of Both Worlds: Hybrid Clustering with Delta Lake
carlyakerly1
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
metsarin
 
Ecosistema MLOps: Gestión y Despliegue de Modelos
Ecosistema MLOps: Gestión y Despliegue de ModelosEcosistema MLOps: Gestión y Despliegue de Modelos
Ecosistema MLOps: Gestión y Despliegue de Modelos
ManuelGarcia99558
 
Metail at Cambridge AWS User Group Main Meetup #3
Metail at Cambridge AWS User Group Main Meetup #3Metail at Cambridge AWS User Group Main Meetup #3
Metail at Cambridge AWS User Group Main Meetup #3
Gareth Rogers
 
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdfZesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Eran Levy
 
Db2 Important questions to read
Db2 Important questions to readDb2 Important questions to read
Db2 Important questions to read
Prasanth Dusi
 
July 2017 Meeting of the Denver AWS Users' Group
July 2017 Meeting of the Denver AWS Users' GroupJuly 2017 Meeting of the Denver AWS Users' Group
July 2017 Meeting of the Denver AWS Users' Group
David McDaniel
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
SQL Server 2016 - Stretch DB
SQL Server 2016 - Stretch DB SQL Server 2016 - Stretch DB
SQL Server 2016 - Stretch DB
Shy Engelberg
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflake
Sivakumar Ramar
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
Trieu Nguyen
 
Big Data Analytics Module-4 as per vtu .pptx
Big Data Analytics Module-4 as per vtu .pptxBig Data Analytics Module-4 as per vtu .pptx
Big Data Analytics Module-4 as per vtu .pptx
shilpabl1803
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
Ad

More from Anant Corporation (20)

LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
Anant Corporation
 
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfKono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Anant Corporation
 
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotData Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Anant Corporation
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
Anant Corporation
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Anant Corporation
 
YugabyteDB Developer Tools
YugabyteDB Developer ToolsYugabyteDB Developer Tools
YugabyteDB Developer Tools
Anant Corporation
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Anant Corporation
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
Anant Corporation
 
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksCassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Anant Corporation
 
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionData Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Anant Corporation
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Anant Corporation
 
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & FutureCassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Anant Corporation
 
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Anant Corporation
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
Anant Corporation
 
CL 121
CL 121CL 121
CL 121
Anant Corporation
 
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsApache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Anant Corporation
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Anant Corporation
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Anant Corporation
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
Anant Corporation
 
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfKono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Anant Corporation
 
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotData Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Anant Corporation
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
Anant Corporation
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Anant Corporation
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Anant Corporation
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
Anant Corporation
 
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksCassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Anant Corporation
 
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionData Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Anant Corporation
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Anant Corporation
 
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & FutureCassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Anant Corporation
 
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Anant Corporation
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
Anant Corporation
 
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsApache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Anant Corporation
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Anant Corporation
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Anant Corporation
 
Ad

Recently uploaded (20)

The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg

  • 1. 25127 Migrating to Apache Iceberg 1 Presented by Alex Merced - Dremio Developer Advocate - Follow @amdatalakehouse
  • 2. 25127 What will we learn? - Iceberg’s Structure - What Catalog to use? - The migrate procedure - The add_files procedure - Migrate with CTAS Statements - Summary
  • 3. 25127 Apache Iceberg Apache Iceberg’s approach is to define the table through three layers of metadata. These categories are: - metadata files that define the table - manifest lists that define a snapshot of the table, with a list of manifests that make up the snapshot and metadata about their data - manifests is a list of data files along with metadata on those data files for file pruning
  • 4. 25127 Iceberg table format ● Overview of the components ● Summary of the read path (SELECT) # metadata layer data layer
  • 5. 25127 Iceberg Catalog ● A store that houses the current metadata pointer for Iceberg tables ● Must support atomic operations for updating the current metadata pointer (e.g. HDFS, HMS, Nessie) table1’s current metadata pointer ● Mapping of table name to the location of current metadata file Iceberg components: Catalog 1 2 metadata layer data layer
  • 6. 25127 Metadata file - stores metadata about a table at a certain point in time { "table-uuid" : "<uuid>", "location" : "/path/to/table/dir", "schema": {...}, "partition-spec": [ {<partition-details>}, ...], "current-snapshot-id": <snapshot-id>, "snapshots": [ { "snapshot-id": <snapshot-id> "manifest-list": "/path/to/manifest/list.avro" }, ...], ... } Iceberg components: Metadata File 3 5 4 metadata layer data layer
  • 7. 25127 Iceberg components: Manifest List Manifest list file - a list of manifest files { "manifest-path" : "/path/to/manifest/file.avro", "added-snapshot-id": <snapshot-id>, "partition-spec-id": <partition-spec-id>, "partitions": [ {partition-info}, ...], ... } { "manifest-path" : "/path/to/manifest/file2.avro", "added-snapshot-id": <snapshot-id>, "partition-spec-id": <partition-spec-id>, "partitions": [ {partition-info}, ...], ... } 6 metadata layer data layer
  • 8. 25127 Manifest file - a list of data files, along with details and stats about each data file { "data-file": { "file-path": "/path/to/data/file.parquet", "file-format": "PARQUET", "partition": {"<part-field>":{"<data-type>":<value>}}, "record-count": <num-records>, "null-value-counts": [{ "column-index": "1", "value": 4 }, ...], "lower-bounds": [{ "column-index": "1", "value": "aaa" }, ...], "upper-bounds": [{ "column-index": "1", "value": "eee" }, ...], } ... } { ... } Iceberg components: Manifest file 7 8 metadata layer data layer
  • 10. 25127 10 Type of Catalog Pros Cons Project Nessie - Git Like Functionality - Cloud Managed Service (Arctic) - Support from engines beyond Spark & Dremio - If not using arctic must deploy and maintain Nessie server Hive Metastore - Can use existing Hive Metastore - You have to deploy and maintain a hive metastore AWS Glue - Interop with AWS Services - Support outside of AWS, Spark and Dremio What can be used an Iceberg Catalog Catalogs help tracks Iceberg tables and provide locking mechanisms for ACID Guarantees. Thing to keep in mind is that while many engines may support Iceberg tables they may not support connections to all catalogs. - HDFS, JDBC and REST catalogs also available - With 0.14 there will be a registerTable method to migrate tables between catalogs without losing snapshot history
  • 12. 25127 Existing Parquet files that are part of a Hive Table Apache Iceberg has a Call procedure called migrate which can be used to create a new Iceberg table from an the existing Hive table. This uses the existing data files in the table (must be parquet). This REPLACES the Hive table with an Iceberg table, so the Hive table will no longer exist after this operation: CALL catalog_name.system.migrate('db.sample') To test the results of migrate before running migrate you can use the snapshot call procedure to create temporary iceberg table based on a Hive table without replacing the Hive table. (temporary tables have limitations). CALL catalog_name.system.snapshot('db.sample', 'db.snap') 12
  • 13. 25127 Parquet files not part of a hive table If you have paquet files representing a dataset that are not yet part of an Iceberg table, you can use the add_files procedure to add those datafiles to that of an existing table with a matching schema. CALL spark_catalog.system.add_files( table => 'db.tbl', source_table => '`parquet`.`path/to/table`' ) 13
  • 14. 25127 Migration by Restating the Data You can use CREATE TABLE…AS (CTAS) statements to create new Iceberg tables from pre-existing tables. This will create new data files but allows you to modify partitioning, schema and do validations at the time of migration if choose to. Since the data is being rewritten/restated, this will be more time consuming that migrate or add_files. (This can be done with Spark or Dremio) CREATE TABLE prod.db.sample USING iceberg PARTITIONED BY (bucket(16, id), days(ts), truncate(last_name, 2)) AS SELECT ... 14
  • 15. 25127 When to use which approach? 15 Approach Are Data Files Rewritten Can I update Schema and Partition while migrating? When to use? migrate No No Moving existing hive table to a NEW iceberg table add_files No No Moving files from existing table in parquet to an existing Iceberg table CTAS Yes Yes Moving existing table anywhere to NEW iceberg table with schema or partitioning changes registerTable (to be released with 0.14) No No Moving Iceberg table from one catalog to another.
  • 16. 25127 Best Practices 1. As the process begins, the new Iceberg table is not yet created or in sync with the source. User-facing read and write operations remain operating on the source table. 2. The table is created but not fully in sync. Read operations are applied on the source table and write operations are applied to the source and new table. 3. Once the new table is in sync, you can switch to read operations on the new table. Until you are certain the migration is successful, continue to apply write operations to the source and new table. 4. When everything is tested, synced, and working properly, you can apply all read and write operations to the new Iceberg table and retire the source table 16