The document provides an overview of Apache Iceberg, highlighting its role in modern data management by addressing challenges faced by traditional data warehouses and data lakes. It discusses key concepts such as metadata management, schema evolution, partitioning strategies, and integration with various data processing tools like Apache Spark and Flink. Additionally, practical exercises are included to guide users in implementing Iceberg in their data environments.


Getting Started - Apache Iceberg

Dr. Firas
Author & Conference speaker
Getting Started - Apache Iceberg
■ Combining Strengths
■ Benefits and Popularity
■ Key Capabilities
■ Real-World Applications
Understanding Data Warehouses
■ Introduction to Data Warehouses
Definition and role as a centralized repository optimized for analytics and business intelligence.
■ Centralization and Organization
Goal of having a well-maintained, organized, and centralized data warehouse that stores most of
an organization’s data.
■ Challenges with Structuring Data
The complex, messy task of structuring data to fit within a warehouse.
Issues arising from the ETL process: data duplication, delays in data availability, and reduced
operational flexibility.
■ Maintenance Costs and Challenges
Ongoing, expensive, and labor-intensive efforts required to maintain a data warehouse.
Consequences of inadequate maintenance: reduced data accessibility or a completely ineffective
system.
■ Evolving Needs and Limitations
Persistent challenges with cost, scalability, and maintenance that prompt the need for innovative
solutions like Iceberg.
Understanding Data Lakes
■ The Concept of a Data Lake
Explanation of data lakes storing data in its native format, avoiding rigorous structuring and massive
ETL workloads.
Highlight the cost reduction and simplification of the data management stack.
■ Advantages and Simplification
Discussion of the operational streamlining promised by data lakes.
Transition: While appealing, this simplicity introduces significant challenges.
■ Challenges of Data Lakes
Detailed look at the complexities of extracting information from unstructured data.
Impact on data scientists and analysts due to advanced requirements for data querying and
management.
The evolution of data management challenges over time, leading to potential inefficiencies and data
swamps.
■ A Thoughtful Consideration
Introduction to the idea of hybrid solutions like data lakehouses.
A proposed solution that blends the flexibility of data lakes with the structured benefits of data
warehouses.
Understanding Apache Iceberg Core Concepts
■ Introduction to Metadata Management
Overview of Iceberg’s metadata layer handling schemas, partitions, and file locations.
Explanation of metadata files stored in JSON format and manifest files stored in Avro format.
■ Schema Evolution
Definition and significance of schema evolution in adapting to changing data needs.
Example of adding a new column to employee data and how Iceberg updates metadata without
affecting existing data.
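The employee-data example above can be sketched in plain Python. This is a toy model, not Iceberg's actual implementation: table metadata holds the schema as a list of fields with stable ids, and adding a column rewrites only that metadata, so files written before the change stay untouched and simply read the new column as null.

```python
# Toy model of metadata-only schema evolution (not Iceberg's real code).

def add_column(metadata, name, col_type):
    """Return new metadata with the column appended under a fresh field id."""
    schema = list(metadata["schema"])
    next_id = max(f["id"] for f in schema) + 1 if schema else 1
    schema.append({"id": next_id, "name": name, "type": col_type})
    return {**metadata, "schema": schema}

def read_row(raw_row, metadata):
    """Project a stored row onto the current schema; missing columns read as None."""
    return {f["name"]: raw_row.get(f["name"]) for f in metadata["schema"]}

metadata = {"schema": [{"id": 1, "name": "emp_id", "type": "long"},
                       {"id": 2, "name": "name", "type": "string"}]}
old_file_row = {"emp_id": 7, "name": "Ada"}   # row written before the change

metadata = add_column(metadata, "department", "string")
print(read_row(old_file_row, metadata))
# {'emp_id': 7, 'name': 'Ada', 'department': None}
```

The data file holding `old_file_row` was never rewritten; only the schema in the metadata changed.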
■ Partitioning Strategies
Introduction to partitioning as a method for dividing data into manageable subsets for faster querying.
Description of different partitioning strategies:
Range partitioning (e.g., dates, numeric values)
Hash partitioning (applying a hash function)
Truncate partitioning (e.g., truncating zip codes)
List partitioning (e.g., categorizing by company names)
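The four strategies can be illustrated with toy Python transforms. These are stand-ins for intuition only: real Iceberg bucketing uses a 32-bit Murmur3 hash (crc32 is substituted here), and range partitioning on dates is done with built-in year/month/day transforms.

```python
import zlib

def bucket(value, n):        # hash partitioning: hash the value into n buckets
    return zlib.crc32(str(value).encode()) % n

def truncate(value, width):  # truncate partitioning, e.g. first digits of a zip code
    return str(value)[:width]

def month(date_str):         # range-style partitioning on dates: "YYYY-MM-DD" -> "YYYY-MM"
    return date_str[:7]

def identity(value):         # list/identity partitioning, e.g. group by company name
    return value

rows = [{"company": "Acme", "zip": "94105", "date": "2024-03-14"},
        {"company": "Initech", "zip": "94110", "date": "2024-04-02"}]
for r in rows:
    print(month(r["date"]), truncate(r["zip"], 3), bucket(r["company"], 8))
```

A query filtered on a partitioned column (say, one month of dates) can then skip every file belonging to other partitions.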
■ Snapshots and Their Importance
Explanation of how each data change creates a new snapshot with updated manifest files.
The role of snapshots in enabling historical data access and rollback capabilities.
Benefits of snapshot-based querying for maintaining data integrity and performing audits.
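A minimal sketch of this snapshot mechanism, under the simplifying assumption that a snapshot is just an immutable list of data files: every commit appends a snapshot to the history, and rollback only moves the current pointer, so no data is deleted.

```python
# Toy model of snapshot-based versioning (not Iceberg's real metadata layout).

class Table:
    def __init__(self):
        self.snapshots = []   # append-only history
        self.current = None   # index of the live snapshot

    def commit(self, files):
        snap = {"snapshot_id": len(self.snapshots) + 1, "files": tuple(files)}
        self.snapshots.append(snap)
        self.current = len(self.snapshots) - 1

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or time-travel to an older one by id."""
        if snapshot_id is None:
            return self.snapshots[self.current]["files"]
        return next(s for s in self.snapshots
                    if s["snapshot_id"] == snapshot_id)["files"]

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot; nothing is deleted."""
        self.current = next(i for i, s in enumerate(self.snapshots)
                            if s["snapshot_id"] == snapshot_id)

t = Table()
t.commit(["a.parquet"])
t.commit(["a.parquet", "b.parquet"])
t.rollback(1)        # undo the second commit
print(t.scan())      # ('a.parquet',)
```

Because old snapshots remain queryable by id, audits can reproduce exactly what a query would have returned at any past commit.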
Iceberg Architecture
Apache Iceberg Integration and Compatibility
■ Integration with Apache Spark
Capability to use Spark APIs for reading and writing data to Iceberg tables.
Two key catalogs in Spark:
org.apache.iceberg.spark.SparkCatalog: For external catalog services like Hive or Hadoop
org.apache.iceberg.spark.SparkSessionCatalog: Manages both Iceberg and non-Iceberg tables
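As an illustrative sketch, the two catalog classes are typically wired up through Spark configuration properties like the following (the catalog name `hive_cat` and the metastore host are placeholders, not values from this course):

```properties
# External catalog (e.g., a Hive Metastore) managed entirely by Iceberg
spark.sql.catalog.hive_cat           = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_cat.type      = hive
spark.sql.catalog.hive_cat.uri       = thrift://metastore-host:9083

# Session catalog serving Iceberg and non-Iceberg tables side by side
spark.sql.catalog.spark_catalog      = org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type = hive
```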
■ Apache Flink Integration
Ideal for streaming data processing
Enables direct data streaming from various sources into Iceberg tables
Simplifies real-time data analytics
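A hedged sketch of how such a pipeline is commonly wired up in the Flink SQL client (the catalog name, metastore URI, warehouse path, and table names below are placeholders):

```sql
-- Register an Iceberg catalog backed by a Hive Metastore
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083',
  'warehouse' = 's3a://warehouse'
);

-- Stream rows from any Flink source table straight into an Iceberg table
INSERT INTO iceberg_catalog.db.events SELECT * FROM kafka_source;
```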
■ Integration with Presto and Trino
Known for fast data processing capabilities
Suitable for massive data querying and analysis
Dependency on external catalogs like Hive Metastore or AWS Glue for table management
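For illustration, a minimal Trino catalog file (e.g. `etc/catalog/iceberg.properties`; the metastore URI is a placeholder) typically expresses exactly this dependency:

```properties
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://metastore-host:9083
```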
Data Lake Compatibility
■ Apache Iceberg and Amazon S3 Integration
Description of Amazon S3 as a cloud storage service
Role of S3 in data lake architectures
Integration process using AWS Glue as the catalog service
Benefits: Enhanced querying capability and data consistency
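A sketch of the Spark-side wiring for this setup (catalog name `glue_cat` and the bucket path are placeholders): Glue serves as the catalog while S3 holds the data and metadata files.

```properties
spark.sql.catalog.glue_cat              = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_cat.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue_cat.io-impl      = org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.glue_cat.warehouse    = s3://my-bucket/warehouse
```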

■ Google Cloud Storage Compatibility
Advantages of Google Cloud for data lakes: Scalability and flexibility
Integration details: Using Iceberg with Google Cloud Storage
Querying options: Google’s BigQuery and standard SQL languages

■ Azure Blob Storage and Iceberg Integration
Overview of Azure Blob Storage: Designed for massive unstructured data
Benefits of integrating Iceberg with Azure
Outcome: Improved data access speed and reliability
Practical Exercise
■ https://www.docker.com/
Terminal : docker version
docker info
clear
docker pull hello-world
docker images
docker + Tab (shell completion lists the available subcommands)
docker run hello-world
docker ps
docker ps -a
Practical Exercise
■ https://iceberg.apache.org/docs/nightly/
docker-compose up notebook
docker-compose up dremio
docker-compose up minio
docker-compose up nessie

http://127.0.0.1:8888/tree
http://127.0.0.1:9001/
http://127.0.0.1:9047/
Practical Exercise
■ localhost:9047
Set the name of the source to “nessie”
Set the endpoint URL to “http://nessie:19120/api/v2”
Set the authentication to “none”

Navigate to the Storage tab by clicking “Storage” on the left


For your access key, set “admin”
For your secret key, set “password”
Set root path to “/warehouse”
Set the following connection properties:
“fs.s3a.path.style.access” to true
“fs.s3a.endpoint” to “minio:9000”
“dremio.s3.compat” to “true”
Uncheck “encrypt connection” (since our local Nessie instance is running on http)
Thank You
Dr. Firas
Author & Conference speaker
