[COURSE SUPPORT] Getting Started - Apache Iceberg

The document provides an overview of Apache Iceberg, highlighting its role in modern data management by addressing challenges faced by traditional data warehouses and data lakes. It discusses key concepts such as metadata management, schema evolution, partitioning strategies, and integration with various data processing tools like Apache Spark and Flink. Additionally, practical exercises are included to guide users in implementing Iceberg in their data environments.

Uploaded by

mfuenzalida


Getting Started - Apache Iceberg

Dr. Firas
Author & Conference speaker
Getting Started - Apache Iceberg
■ Combining Strengths
■ Key Capabilities
■ Benefits and Popularity
■ Real-World Applications
Understanding Data Warehouses
■ Introduction to Data Warehouses
Definition and role as a centralized repository optimized for analytics and business intelligence.
■ Centralization and Organization
Goal of having a well-maintained, organized, and centralized data warehouse that stores most of
an organization’s data.
■ Challenges with Structuring Data
The complex, messy task of structuring data to fit within a warehouse.
Issues arising from the ETL process: data duplication, delays in data availability, and reduced
operational flexibility.
■ Maintenance Costs and Challenges
Ongoing, expensive, and labor-intensive efforts required to maintain a data warehouse.
Consequences of inadequate maintenance: reduced data accessibility or a completely ineffective
system.
■ Evolving Needs and Limitations
Persistent challenges with cost, scalability, and maintenance that prompt the need for innovative
solutions like Iceberg.
Understanding Data Lakes
■ The Concept of a Data Lake
Explanation of data lakes storing data in its native format, avoiding rigorous structuring and massive
ETL workloads.
Highlight the cost reduction and simplification of the data management stack.
■ Advantages and Simplification
Discussion of the operational streamlining promised by data lakes.
Transition: While appealing, this simplicity introduces significant challenges.
■ Challenges of Data Lakes
Detailed look at the complexities of extracting information from unstructured data.
Impact on data scientists and analysts due to advanced requirements for data querying and
management.
The evolution of data management challenges over time, leading to potential inefficiencies and data
swamps.
■ A Thoughtful Consideration
Introduction to the idea of hybrid solutions like data lakehouses.
A proposed solution that blends the flexibility of data lakes with the structured benefits of data
warehouses.
Understanding Apache Iceberg Core Concepts
■ Introduction to Metadata Management
Overview of Iceberg’s metadata layer handling schemas, partitions, and file locations.
Explanation of table metadata files (stored as JSON) and of manifest lists and manifest files (stored as Avro).
■ Schema Evolution
Definition and significance of schema evolution in adapting to changing data needs.
Example of adding a new column to employee data and how Iceberg updates metadata without
affecting existing data.
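The employee example can be sketched as a toy model. This is only an illustration of the idea, not Iceberg's implementation: the class names, the `read_row` helper, and the sample row are hypothetical. The one detail taken from Iceberg itself is that columns are tracked by stable field IDs, so adding a column is a metadata change and rows written earlier simply read as null for the new column.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    field_id: int  # stable ID, never reused for another column
    name: str
    type: str


@dataclass
class TableMetadata:
    fields: list = field(default_factory=list)
    last_column_id: int = 0

    def add_column(self, name, type_):
        """Metadata-only change: assign a fresh ID, touch no data files."""
        self.last_column_id += 1
        self.fields.append(Field(self.last_column_id, name, type_))


def read_row(metadata, stored_row):
    """Project a stored row (keyed by field ID) onto the current schema."""
    return {f.name: stored_row.get(f.field_id) for f in metadata.fields}


employees = TableMetadata()
employees.add_column("id", "long")
employees.add_column("name", "string")

# A row written before the schema change, keyed by field ID (hypothetical data).
old_row = {1: 7, 2: "Ada"}

# Schema evolution: only the metadata object changes.
employees.add_column("department", "string")

print(read_row(employees, old_row))
# {'id': 7, 'name': 'Ada', 'department': None}
```

The stored row is never rewritten; the missing value is filled in at read time from the current schema.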
■ Partitioning Strategies
Introduction to partitioning as a method for dividing data into manageable subsets for faster querying.
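Description of different partitioning strategies, with the closest Iceberg partition transform noted for each:
Range partitioning (e.g., dates, numeric values); for dates and timestamps, Iceberg offers the year, month, day, and hour transforms
Hash partitioning (applying a hash function); Iceberg's bucket transform
Truncate partitioning (e.g., truncating zip codes); Iceberg's truncate transform
List partitioning (e.g., categorizing by company names); Iceberg's identity transform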
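To make the strategies concrete, here is a minimal sketch of how each one maps a value to a partition key. The function names and the sample row are hypothetical, and the hash is a stand-in: Iceberg's own bucket transform uses a 32-bit Murmur3 hash, whereas this sketch uses `zlib.crc32` only because it ships with Python.

```python
import zlib
from datetime import date


def identity(value):
    """List-style partitioning: the value itself is the partition key."""
    return value


def truncate(value: str, width: int) -> str:
    """Truncate partitioning: keep only the first `width` characters."""
    return value[:width]


def bucket(value: str, num_buckets: int) -> int:
    """Hash partitioning. crc32 is a stand-in, NOT the hash Iceberg uses."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def day(value: date) -> int:
    """Date-range partitioning: whole days since 1970-01-01."""
    return (value - date(1970, 1, 1)).days


# Hypothetical employee row.
row = {"zip": "94107", "company": "Acme", "user": "u-42", "hired": date(2024, 3, 15)}

partition_key = (
    truncate(row["zip"], 3),   # all zip codes starting with "941" share a partition
    identity(row["company"]),  # one partition per company name
    bucket(row["user"], 16),   # spread users evenly across 16 buckets
    day(row["hired"]),         # one partition per hire date
)
print(partition_key)
```

Rows that produce the same key land in the same partition, which is what lets a query skip every partition whose key cannot match its filter.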
■ Snapshots and Their Importance
Explanation of how each data change creates a new snapshot with updated manifest files.
The role of snapshots in enabling historical data access and rollback capabilities.
Benefits of snapshot-based querying for maintaining data integrity and performing audits.
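The snapshot mechanism can be pictured with a small toy model. It is a deliberate simplification under stated assumptions: the `Table` class and file names are hypothetical, snapshots are identified by list index rather than by Iceberg's snapshot IDs, and each snapshot holds a plain tuple of data files instead of manifest files.

```python
class Table:
    """Toy table: every commit appends an immutable snapshot."""

    def __init__(self):
        self.snapshots = []  # append-only history of file lists
        self.current = None  # index of the current snapshot

    def commit(self, add=(), remove=()):
        """Create a new snapshot from the current one; old snapshots stay intact."""
        base = self.snapshots[self.current] if self.current is not None else ()
        files = tuple(f for f in base if f not in remove) + tuple(add)
        self.snapshots.append(files)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or an older one for historical access."""
        sid = self.current if snapshot_id is None else snapshot_id
        return [] if sid is None else list(self.snapshots[sid])

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot; no data is copied."""
        self.current = snapshot_id


t = Table()
s0 = t.commit(add=["a.parquet"])
s1 = t.commit(add=["b.parquet"])
s2 = t.commit(remove=["a.parquet"])

print(t.scan())    # ['b.parquet']
print(t.scan(s0))  # ['a.parquet']

t.rollback(s1)
print(t.scan())    # ['a.parquet', 'b.parquet']
```

Because earlier snapshots are never modified, reading an old state and rolling back are both just a matter of choosing which snapshot to read.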
Iceberg Architecture
Apache Iceberg Integration and Compatibility
■ Integration with Apache Spark
Capability to use Spark APIs for reading and writing data to Iceberg tables.
Two key catalogs in Spark:
org.apache.iceberg.spark.SparkCatalog: Backed by an external catalog service such as a Hive Metastore, or by a Hadoop warehouse directory
org.apache.iceberg.spark.SparkSessionCatalog: Wraps Spark's built-in session catalog so that it manages both Iceberg and non-Iceberg tables
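As a rough sketch of how the two catalog classes are selected, the snippet below builds the Spark configuration properties and prints them as `--conf` flags. The helper `iceberg_catalog_conf`, the catalog name `local`, and the path `/tmp/warehouse` are assumptions made for illustration. The `spark.sql.catalog.<name>` property family follows the Iceberg Spark documentation.

```python
def iceberg_catalog_conf(name, impl, catalog_type, warehouse=None):
    """Return the Spark properties that register one Iceberg catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {prefix: impl, f"{prefix}.type": catalog_type}
    if warehouse is not None:
        conf[f"{prefix}.warehouse"] = warehouse
    return conf


conf = {}

# A separate Iceberg catalog named "local" (hypothetical name and path).
conf.update(iceberg_catalog_conf(
    "local", "org.apache.iceberg.spark.SparkCatalog", "hadoop", "/tmp/warehouse"))

# Replace Spark's built-in catalog so Iceberg and non-Iceberg tables coexist.
conf.update(iceberg_catalog_conf(
    "spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog", "hive"))

for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The same key/value pairs could be passed to `spark-submit`, `spark-sql`, or a session builder. The Iceberg Spark runtime jar also has to be on the classpath, which this sketch does not cover.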
■ Apache Flink Integration
Ideal for streaming data processing
Enables direct data streaming from various sources into Iceberg tables
Simplifies real-time data analytics
■ Integration with Presto and Trino
Known for fast data processing capabilities
Suitable for massive data querying and analysis
Dependency on external catalogs like Hive Metastore or AWS Glue for table management
Data Lake Compatibility
■ Apache Iceberg and Amazon S3 Integration
Description of Amazon S3 as a cloud storage service
Role of S3 in data lake architectures
Integration process using AWS Glue as the catalog service
Benefits: Enhanced querying capability and data consistency

■ Google Cloud Storage Compatibility

Advantages of Google Cloud for data lakes: Scalability and flexibility
Integration details: Using Iceberg with Google Cloud Storage
Querying options: Google’s BigQuery and standard SQL languages

■ Azure Blob Storage and Iceberg Integration

Overview of Azure Blob Storage: Designed for massive unstructured data
Benefits of integrating Iceberg with Azure
Outcome: Improved data access speed and reliability
Practical Exercise
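■ https://www.docker.com/
Terminal:
docker version            # show client and server (daemon) versions
docker info               # show system-wide details about the Docker installation
clear                     # clear the terminal screen
docker pull hello-world   # download the hello-world test image
docker images             # list images stored locally
docker <Tab>              # press Tab after "docker " to list subcommands (requires shell completion)
docker run hello-world    # start a container from the image
docker ps                 # list running containers
docker ps -a              # list all containers, including stopped ones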
Practical Exercise
■ https://iceberg.apache.org/docs/nightly/
docker-compose up notebook
docker-compose up dremio
docker-compose up minio
docker-compose up nessie

http://127.0.0.1:8888/tree (notebook)
http://127.0.0.1:9001/ (MinIO console)
http://127.0.0.1:9047/ (Dremio)
Practical Exercise
■ localhost:9047
Set the name of the source to “nessie”
Set the endpoint URL to “http://nessie:19120/api/v2”
Set the authentication to “none”

Navigate to the storage tab by clicking “storage” on the left

For your access key, set “admin”
For your secret key, set “password”
Set root path to “/warehouse”
Set the following connection properties:
“fs.s3a.path.style.access” to “true”
“fs.s3a.endpoint” to “minio:9000”
“dremio.s3.compat” to “true”
Uncheck “encrypt connection” (since our local Nessie instance is running on http)
Thank You
Dr. Firas
Author & Conference speaker
