Confidential - Do Not Share or Distribute
From Warehouse to Lakehouse
February 15, 2023 (Paris)
Dremio
The Easy and Open Data Lakehouse
Self-service analytics with data warehouse functionality and data lake flexibility across all of your data.
■ The Only Data Lakehouse with Self-Service SQL Analytics
■ Your Data Forever, No Lock-In
■ Sub-Second Performance, 1/10th the Cost of Data Warehouses
Open Source & Community: Apache Arrow (70M+ downloads/month), Apache Iceberg, Nessie; creator and host of the Subsurface LIVE conference.
Enterprise Adoption: 1000s of companies across all industries, including 5 of the Fortune 10.
Data Analytics - The History

1980 – 2010: Enterprise Data Warehouse (transactional and analytical workloads)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2010 – 2015: Big Data + Open Source (introduced big data processing)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2015 – Present: Cloud Data Warehouse (analytics on the cloud data warehouse)
● Scale storage and compute independently
● Must load data into a proprietary system
● Limited to one processing engine
● Cost prohibitive

The Movement: Open Data Lakehouse (self-service analytics on the data lake)
● Data in open file and table formats
● No need to copy & move data
● Multiple best-of-breed processing engines
Competing Data Priorities
(Diagram: line-of-business priorities (access, self-service / agility) pull against centralized priorities (governance, security).)
Companies Want to Democratize Data… But How?
▪ Everyone wants access
▪ Data volumes are exploding
▪ Security risks
▪ Compliance requirements
▪ Limited resources
(Diagram: consumers (SQL, data science, dashboards, apps) on top; continuous new data from application databases, IoT, web and logs; storage across cloud object storage (ADLS, S3, GCS) and on-prem RDBMS.)
Data Warehouses: Expensive, Proprietary, Complex
✗ Skyrocketing costs
✗ Vendor lock-in
✗ Exploding backlog
✗ Can’t explore data
✗ No self-service
(Diagram: SQL, data science, dashboards and apps on top; continuous new data from application databases, IoT, web and logs; cloud object storage (ADLS, S3, GCS) and on-prem RDBMS below, with the warehouse in between.)
Data Lakehouse: Easy, Open, 1/10th the Cost
✓ Sub-second performance
✓ Eliminate data silos
✓ Improve data discovery and access
✓ No data movement required
✓ No copies
✓ Inexpensive
✓ No lock-in
(Diagram: clients connect over ODBC | JDBC | REST | Arrow Flight; the lakehouse engine provides parallelism, caching and optimized push-downs over cloud object storage (ADLS, S3, GCS) and on-prem RDBMS, fed by continuous new data from application databases, IoT, web and logs.)
Table Formats Enable Data Warehouse Workloads on the Lake
■ Created by Netflix, Apple and other big tech companies
■ INSERT/UPDATE/DELETE with any engine
■ Strong momentum in the OSS community

Record-level data mutations with SQL DML:
INSERT INTO t1 ...
UPDATE t1 SET col1 = ...
DELETE FROM t1 WHERE state = "CA"

Automatic partitioning:
CREATE TABLE t1 PARTITIONED BY (month(date), bucket[1000](user))

Instant schema and partition evolution:
ALTER TABLE t1 ADD/DROP/MODIFY/RENAME COLUMN c1 ...
ALTER TABLE t1 ADD/DROP PARTITION FIELD ...

Time travel:
SELECT * FROM t1 AT/BEFORE <timestamp>
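To make the time-travel idea concrete, here is a minimal stdlib-only Python sketch (a toy model, not Dremio or Iceberg code; the class and timestamps are invented for illustration) of a table that keeps immutable snapshots so reads can target any point in time:

```python
from bisect import bisect_right

class SnapshotTable:
    """Toy model of an Iceberg-style table: every write produces a new
    immutable snapshot, so reads can target any historical timestamp."""

    def __init__(self):
        self._snapshots = []  # list of (timestamp, tuple_of_rows), append-only

    def write(self, ts, rows):
        # Each commit records the full table state at a timestamp.
        self._snapshots.append((ts, tuple(rows)))

    def select_at(self, ts):
        # Like "SELECT * FROM t AT <timestamp>": the latest snapshot
        # whose commit time is not after ts.
        times = [t for t, _ in self._snapshots]
        i = bisect_right(times, ts)
        if i == 0:
            raise LookupError("no snapshot at or before %r" % ts)
        return list(self._snapshots[i - 1][1])

t = SnapshotTable()
t.write(100, [("CA", 1)])
t.write(200, [("CA", 1), ("NY", 2)])
print(t.select_at(150))  # only the first snapshot existed at ts=150
```

The key property mirrored here is that writes never mutate old snapshots, which is what makes AT/BEFORE queries cheap and deterministic.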
Iceberg is a Community-Built, Vendor-Agnostic Table Format
Iceberg is the Primary Format of Cloud and Platform Vendors
AWS, Dremio, GCP (BigQuery, etc.), Snowflake, Tabular, Databricks, …
Data Analytics - The History Continued

1980 – 2010: Enterprise Data Warehouse (transactional and analytical workloads)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2010 – 2015: Big Data + Open Source (introduced big data processing)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2015 – Present: Cloud Data Warehouse (analytics on the cloud data warehouse)
● Scale storage and compute independently
● Must load data into a proprietary system
● Limited to one processing engine
● Cost prohibitive

The Movement: Open Data Lakehouse (self-service analytics on the data lake)
● Data in open file and table formats
● No need to copy & move data
● Multiple best-of-breed processing engines

2023 – …: Data Mesh with Data-as-Code (data mesh with data managed as code)
● Isolated data exploration and engineering with branches
● Version control with commits and tags
● Governance and domain management
Dremio Arctic (in preview)
Dremio Arctic is a Data Lakehouse Management Service
Data Optimization: automated data optimization; automated garbage collection; automated data ingestion (roadmap)
Governance & Security: fine-grained access control; commit/audit logs
Data as Code: commits, tags, branches
Lakehouse Catalog: tables and views in hierarchical namespaces; native Iceberg catalog (Delta Lake on the roadmap)
(Diagram: Arctic sits between query engines and Apache Iceberg tables on cloud storage (ADLS, S3, GCS).)
Dremio Arctic is a Modern Lakehouse Catalog

ICEBERG-NATIVE
▪ Nessie (the Arctic catalog) is built into the open source Apache Iceberg project
▪ Use a variety of Iceberg-compatible engines, including Dremio Sonar, Spark and Flink

MULTIPLE DOMAINS
▪ Multiple isolated domains/catalogs in an organization, each containing a folder hierarchy of tables and views
▪ Designed to enable data mesh (including federated ownership and data sharing)

ACCESS CONTROL
▪ Table-, column- and row-based access control
▪ Custom roles and integration with existing user/group directories (AAD, Okta, etc.)
Manage Data as Code with Git-like Capabilities

ISOLATION
▪ Experiment with data without impacting other users
▪ Ingest, transform and test data before exposing it to other users in an atomic merge

VERSION CONTROL
▪ Reproduce models and dashboards from historical data based on time or tags
▪ Recover from any mistake by instantly undoing accidental data or metadata changes

GOVERNANCE
▪ All changes to the data and metadata are tracked: who accessed what data and when
▪ Fine-grained privileges to control access to the data at the table, column and row level
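The Git-like model can be sketched in a few lines of plain Python (an illustration of the concepts, not the Nessie API; the class, branch names and table keys are invented): commits are immutable states, branches and tags are just named pointers, and a merge atomically moves a pointer.

```python
import itertools

class Catalog:
    """Toy Nessie-style catalog: commits are immutable table states,
    branches/tags are named pointers to commit ids."""
    _ids = itertools.count(1)

    def __init__(self):
        self.commits = {0: {}}   # commit id -> {table name: rows}
        self.refs = {"main": 0}  # branch/tag name -> commit id

    def create_branch(self, name, from_ref="main"):
        # A branch is just a new pointer; nothing is copied.
        self.refs[name] = self.refs[from_ref]

    def commit(self, ref, table, rows):
        # Writing on one branch never disturbs readers of other refs.
        state = dict(self.commits[self.refs[ref]])
        state[table] = rows
        cid = next(self._ids)
        self.commits[cid] = state
        self.refs[ref] = cid

    def merge(self, src, dst="main"):
        # Atomic merge: the dst pointer jumps to src's commit in one step.
        self.refs[dst] = self.refs[src]

    def read(self, ref, table):
        return self.commits[self.refs[ref]].get(table)

c = Catalog()
c.create_branch("etl")
c.commit("etl", "web.events", [("1.2.3.4",)])
print(c.read("main", "web.events"))   # None: main is isolated from etl
c.merge("etl")
print(c.read("main", "web.events"))   # the merged rows
```

Note how isolation and atomicity both fall out of one design choice: state is immutable and only pointers move.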
Automatic Data Optimization Enables Faster Queries

TABLE OPTIMIZATION
▪ Dremio Arctic automatically rewrites smaller files into larger files and groups similar rows in a table together
▪ Table optimization significantly accelerates query performance

TABLE VACUUM
▪ Dremio Arctic automatically removes unused manifest files, manifest lists, and data files
▪ Cleanup runs in the background and ensures efficient use of data lake storage
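The "rewrite smaller files into larger files" step can be pictured as simple bin packing. A hedged sketch (the greedy strategy, target size and file sizes are illustrative, not Arctic's actual algorithm):

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedy bin packing: accumulate small files until the group
    reaches the target size, then start a new rewrite group."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb):
        current.append(size)
        total += size
        if total >= target_mb:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)  # leftover files form the last group
    return groups

# Eight small files collapse into two rewrite groups.
plan = plan_compaction([4, 8, 8, 16, 30, 60, 100, 120])
print(plan)  # -> [[4, 8, 8, 16, 30, 60, 100], [120]]
```

Fewer, larger files mean fewer open/seek operations per scan, which is why compaction accelerates queries.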
5 Use Cases for Data as Code
1: Ensure data quality with ETL branches

Create an ETL branch and ingest the data with COPY INTO, CTAS or Spark:
CREATE BRANCH events_etl_9_28_22
USE BRANCH events_etl_9_28_22
COPY INTO web.events …

Run queries to test data quality:
SELECT COUNT(*) FROM web.events WHERE length(ip_address) >= 7

Test the dashboard to see that it looks OK.

Fix the problems and merge into main:
DELETE FROM web.events WHERE length(ip_address) >= 7
USE BRANCH main
MERGE BRANCH events_etl_9_28_22

(Diagram: data quality checks run on the events_etl_9_28_22 branch before it is merged into the production main branch.)
2: Experiment with data in transient branches

Create a transient branch and perform data explorations and transformations in it:
CREATE BRANCH dave_9_28_22
USE BRANCH dave_9_28_22
CREATE TABLE t AS SELECT …
UPDATE t … SET …

Create ad-hoc visualizations on the branch via a notebook.

Delete the branch or merge it when experimentation is complete:
DROP BRANCH dave_9_28_22

(Diagram: experimentation happens on the dave_9_28_22 branch while main remains in production.)
3: Reproduce models or analysis

Change context to a named tag:
spark.sql("USE REFERENCE modelA in arctic;")

Create an ML model based on historical data:
val trainingData = spark.read.table("arctic.t")
val lr = new LogisticRegression()
// configure logistic regression...
val paramMap = ParamMap(...)
val model = lr.fit(trainingData, paramMap)

Select a tag, commit or branch to query in SQL Runner.
4: Recover from mistakes

Move the branch head to a historical commit:
ALTER BRANCH main ASSIGN COMMIT …f724

(Diagram: the main head is reassigned to historical commit …f724; the later commits …a233, …b84c, …9bc8, …2563 and …4231 remain in history but are no longer on main.)
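Undo here is just pointer reassignment. A toy Python sketch (the commit ids echo the diagram; the function is invented for illustration and is not a Dremio API) of what ALTER BRANCH … ASSIGN COMMIT does conceptually:

```python
# Oldest to newest commit ids; every commit stays stored.
history = ["f724", "a233", "b84c", "9bc8", "2563", "4231"]
refs = {"main": "4231"}  # branch head currently at the newest commit

def assign_commit(branch, commit_id):
    # Nothing is rewritten or deleted: the branch head simply
    # points at an earlier, still-stored commit.
    if commit_id not in history:
        raise ValueError("unknown commit %s" % commit_id)
    refs[branch] = commit_id

assign_commit("main", "f724")
print(refs["main"])  # -> f724
```

Because the later commits are still in history, the "undo" is itself reversible.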
5: Troubleshooting (see who changed the data)

Get the commit history for a branch:
SHOW LOGS AT REFERENCE etl;

Get the commit history for a specific table:
curl -X GET -H 'Authorization: Bearer <PAT>' <Catalog API Endpoint>/trees/tree/<reference name>/log?filter="operations.exists(op,op.key=='<table name>')"
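Conceptually, the filter in that API call keeps only commits whose operations touch the given table key. A plain-Python equivalent of that filtering step (the log entries, ids and authors below are made up for illustration; this is not the Nessie client library):

```python
# A fake commit log; real entries come from the catalog's log endpoint.
log = [
    {"id": "4231", "author": "dave", "operations": [{"key": "web.events"}]},
    {"id": "2563", "author": "ana",  "operations": [{"key": "web.users"}]},
    {"id": "9bc8", "author": "dave", "operations": [{"key": "web.events"}]},
]

def table_history(log, table):
    # Keep commits that touch the given table key, mirroring
    # filter="operations.exists(op, op.key=='<table name>')".
    return [c for c in log if any(op["key"] == table for op in c["operations"])]

print([c["id"] for c in table_history(log, "web.events")])  # -> ['4231', '9bc8']
```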