Confidential - Do Not Share or Distribute
From Warehouse to Lakehouse
February 15, 2023 (Paris)
Dremio
The Easy and Open Data Lakehouse
Self-service analytics with data warehouse functionality and data lake flexibility across all of your data.
■ The Only Data Lakehouse with Self-Service SQL Analytics
■ Your Data Forever, No Lock-In
■ Sub-Second Performance, 1/10th the Cost of Data Warehouses
Open Source & Community: Apache Arrow (70M+ downloads/month), Apache Iceberg, Nessie; creator and host of the Subsurface LIVE conference.
Enterprise Adoption: 1000s of companies across all industries, including 5 of the Fortune 10.
Data Analytics - The History

1980 – 2010: Enterprise Data Warehouse (transactional and analytical workloads)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2010 – 2015: Big Data + Open Source (introduced big data processing)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2015 – Present: Cloud Data Warehouse (analytics on the cloud data warehouse)
● Scale storage and compute independently
● Must load data into a proprietary system
● Limited to one processing engine
● Cost prohibitive

The Movement: Open Data Lakehouse (self-service analytics on the data lake)
● Data in open file and table formats
● No need to copy & move data
● Multiple best-of-breed processing engines
Competing Data Priorities
(Diagram: line-of-business priorities (access, self-service / agility) pull against centralized priorities (governance, security).)
Companies Want to Democratize Data… But How?
▪ Everyone wants access
▪ Data volumes are exploding
▪ Security risks
▪ Compliance requirements
▪ Limited resources
(Diagram: consumers (SQL, data science, dashboards, apps) on top; continuous new data from application databases, IoT, web and logs; storage across cloud object storage (ADLS, S3, GCS) and on-prem RDBMS.)
Data Warehouses: Expensive, Proprietary, Complex
✗ Skyrocketing costs
✗ Vendor lock-in
✗ Exploding backlog
✗ Can’t explore data
✗ No self-service
(Diagram: SQL, data science, dashboards and apps on top; continuous new data from application databases, IoT, web and logs; cloud object storage (ADLS, S3, GCS) and on-prem RDBMS below, with the warehouse in between.)
Data Lakehouse: Easy, Open, 1/10th the Cost
✓ Sub-second performance
✓ Eliminate data silos
✓ Improve data discovery and access
✓ No data movement required
✓ No copies
✓ Inexpensive
✓ No lock-in
(Diagram: clients connect over ODBC | JDBC | REST | Arrow Flight; the lakehouse engine provides parallelism, caching and optimized push-downs over cloud object storage (ADLS, S3, GCS) and on-prem RDBMS, fed by continuous new data from application databases, IoT, web and logs.)
Table Formats Enable Data Warehouse Workloads on the Lake
■ Created by Netflix, Apple and other big tech companies
■ INSERT/UPDATE/DELETE with any engine
■ Strong momentum in the OSS community

Record-level data mutations with SQL DML:
INSERT INTO t1 ...
UPDATE t1 SET col1 = ...
DELETE FROM t1 WHERE state = "CA"

Automatic partitioning:
CREATE TABLE t1 PARTITIONED BY (month(date), bucket[1000](user))

Instant schema and partition evolution:
ALTER TABLE t1 ADD/DROP/MODIFY/RENAME COLUMN c1 ...
ALTER TABLE t1 ADD/DROP PARTITION FIELD ...

Time travel:
SELECT * FROM t1 AT/BEFORE <timestamp>
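To make the time-travel idea concrete, here is a minimal stdlib-only Python sketch (a toy model, not Dremio or Iceberg code; the class and timestamps are invented for illustration) of a table that keeps immutable snapshots so reads can target any point in time:

```python
from bisect import bisect_right

class SnapshotTable:
    """Toy model of an Iceberg-style table: every write produces a new
    immutable snapshot, so reads can target any historical timestamp."""

    def __init__(self):
        self._snapshots = []  # list of (timestamp, tuple_of_rows), append-only

    def write(self, ts, rows):
        # Each commit records the full table state at a timestamp.
        self._snapshots.append((ts, tuple(rows)))

    def select_at(self, ts):
        # Like "SELECT * FROM t AT <timestamp>": the latest snapshot
        # whose commit time is not after ts.
        times = [t for t, _ in self._snapshots]
        i = bisect_right(times, ts)
        if i == 0:
            raise LookupError("no snapshot at or before %r" % ts)
        return list(self._snapshots[i - 1][1])

t = SnapshotTable()
t.write(100, [("CA", 1)])
t.write(200, [("CA", 1), ("NY", 2)])
print(t.select_at(150))  # only the first snapshot existed at ts=150
```

The key property mirrored here is that writes never mutate old snapshots, which is what makes AT/BEFORE queries cheap and deterministic.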
Iceberg is a Community-Built, Vendor-Agnostic Table Format
Iceberg is the Primary Format of Cloud and Platform Vendors
AWS, Dremio, GCP (BigQuery, etc.), Snowflake, Tabular, Databricks, …
Data Analytics - The History Continued

1980 – 2010: Enterprise Data Warehouse (transactional and analytical workloads)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2010 – 2015: Big Data + Open Source (introduced big data processing)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2015 – Present: Cloud Data Warehouse (analytics on the cloud data warehouse)
● Scale storage and compute independently
● Must load data into a proprietary system
● Limited to one processing engine
● Cost prohibitive

The Movement: Open Data Lakehouse (self-service analytics on the data lake)
● Data in open file and table formats
● No need to copy & move data
● Multiple best-of-breed processing engines

2023 – …: Data Mesh with Data-as-Code (data mesh with data managed as code)
● Isolated data exploration and engineering with branches
● Version control with commits and tags
● Governance and domain management
Dremio Arctic (in preview)
Dremio Arctic is a Data Lakehouse Management Service
Data Optimization: automated data optimization; automated garbage collection; automated data ingestion (roadmap)
Governance & Security: fine-grained access control; commit/audit logs
Data as Code: commits, tags, branches
Lakehouse Catalog: tables and views in hierarchical namespaces; native Iceberg catalog (Delta Lake on the roadmap)
(Diagram: Arctic sits between query engines and Apache Iceberg tables on cloud storage (ADLS, S3, GCS).)
Dremio Arctic is a Modern Lakehouse Catalog

ICEBERG-NATIVE
▪ Nessie (the Arctic catalog) is built into the open source Apache Iceberg project
▪ Use a variety of Iceberg-compatible engines, including Dremio Sonar, Spark and Flink

MULTIPLE DOMAINS
▪ Multiple isolated domains/catalogs in an organization, each containing a folder hierarchy of tables and views
▪ Designed to enable data mesh (including federated ownership and data sharing)

ACCESS CONTROL
▪ Table-, column- and row-based access control
▪ Custom roles and integration with existing user/group directories (AAD, Okta, etc.)
Manage Data as Code with Git-like Capabilities

ISOLATION
▪ Experiment with data without impacting other users
▪ Ingest, transform and test data before exposing it to other users in an atomic merge

VERSION CONTROL
▪ Reproduce models and dashboards from historical data based on time or tags
▪ Recover from any mistake by instantly undoing accidental data or metadata changes

GOVERNANCE
▪ All changes to the data and metadata are tracked: who accessed what data and when
▪ Fine-grained privileges to control access to the data at the table, column and row level
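The Git-like model can be sketched in a few lines of plain Python (an illustration of the concepts, not the Nessie API; the class, branch names and table keys are invented): commits are immutable states, branches and tags are just named pointers, and a merge atomically moves a pointer.

```python
import itertools

class Catalog:
    """Toy Nessie-style catalog: commits are immutable table states,
    branches/tags are named pointers to commit ids."""
    _ids = itertools.count(1)

    def __init__(self):
        self.commits = {0: {}}   # commit id -> {table name: rows}
        self.refs = {"main": 0}  # branch/tag name -> commit id

    def create_branch(self, name, from_ref="main"):
        # A branch is just a new pointer; nothing is copied.
        self.refs[name] = self.refs[from_ref]

    def commit(self, ref, table, rows):
        # Writing on one branch never disturbs readers of other refs.
        state = dict(self.commits[self.refs[ref]])
        state[table] = rows
        cid = next(self._ids)
        self.commits[cid] = state
        self.refs[ref] = cid

    def merge(self, src, dst="main"):
        # Atomic merge: the dst pointer jumps to src's commit in one step.
        self.refs[dst] = self.refs[src]

    def read(self, ref, table):
        return self.commits[self.refs[ref]].get(table)

c = Catalog()
c.create_branch("etl")
c.commit("etl", "web.events", [("1.2.3.4",)])
print(c.read("main", "web.events"))   # None: main is isolated from etl
c.merge("etl")
print(c.read("main", "web.events"))   # the merged rows
```

Note how isolation and atomicity both fall out of one design choice: state is immutable and only pointers move.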
Automatic Data Optimization Enables Faster Queries

TABLE OPTIMIZATION
▪ Dremio Arctic automatically rewrites smaller files into larger files and groups similar rows in a table together
▪ Table optimization significantly accelerates query performance

TABLE VACUUM
▪ Dremio Arctic automatically removes unused manifest files, manifest lists, and data files
▪ Cleanup runs in the background and ensures efficient use of data lake storage
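The "rewrite smaller files into larger files" step can be pictured as simple bin packing. A hedged sketch (the greedy strategy, target size and file sizes are illustrative, not Arctic's actual algorithm):

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedy bin packing: accumulate small files until the group
    reaches the target size, then start a new rewrite group."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb):
        current.append(size)
        total += size
        if total >= target_mb:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)  # leftover files form the last group
    return groups

# Eight small files collapse into two rewrite groups.
plan = plan_compaction([4, 8, 8, 16, 30, 60, 100, 120])
print(plan)  # -> [[4, 8, 8, 16, 30, 60, 100], [120]]
```

Fewer, larger files mean fewer open/seek operations per scan, which is why compaction accelerates queries.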
5 Use Cases for Data as Code
1: Ensure data quality with ETL branches

Create an ETL branch and ingest the data with COPY INTO, CTAS or Spark:
CREATE BRANCH events_etl_9_28_22
USE BRANCH events_etl_9_28_22
COPY INTO web.events …

Run queries to test data quality:
SELECT COUNT(*) FROM web.events WHERE length(ip_address) >= 7

Test the dashboard to see that it looks OK.

Fix the problems and merge into main:
DELETE FROM web.events WHERE length(ip_address) >= 7
USE BRANCH main
MERGE BRANCH events_etl_9_28_22

(Diagram: data quality checks run on the events_etl_9_28_22 branch before it is merged into the production main branch.)
2: Experiment with data in transient branches

Create a transient branch and perform data explorations and transformations in it:
CREATE BRANCH dave_9_28_22
USE BRANCH dave_9_28_22
CREATE TABLE t AS SELECT …
UPDATE t … SET …

Create ad-hoc visualizations on the branch via a notebook.

Delete the branch or merge it when experimentation is complete:
DROP BRANCH dave_9_28_22

(Diagram: experimentation happens on the dave_9_28_22 branch while main remains in production.)
3: Reproduce models or analysis

Change context to a named tag:
spark.sql("USE REFERENCE modelA in arctic;")

Create an ML model based on historical data:
val trainingData = spark.read.table("arctic.t")
val lr = new LogisticRegression()
// configure logistic regression...
val paramMap = ParamMap(...)
val model = lr.fit(trainingData, paramMap)

Select a tag, commit or branch to query in SQL Runner.
4: Recover from mistakes

Move the branch head to a historical commit:
ALTER BRANCH main ASSIGN COMMIT …f724

(Diagram: the main head is reassigned to historical commit …f724; the later commits …a233, …b84c, …9bc8, …2563 and …4231 remain in history but are no longer on main.)
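Undo here is just pointer reassignment. A toy Python sketch (the commit ids echo the diagram; the function is invented for illustration and is not a Dremio API) of what ALTER BRANCH … ASSIGN COMMIT does conceptually:

```python
# Oldest to newest commit ids; every commit stays stored.
history = ["f724", "a233", "b84c", "9bc8", "2563", "4231"]
refs = {"main": "4231"}  # branch head currently at the newest commit

def assign_commit(branch, commit_id):
    # Nothing is rewritten or deleted: the branch head simply
    # points at an earlier, still-stored commit.
    if commit_id not in history:
        raise ValueError("unknown commit %s" % commit_id)
    refs[branch] = commit_id

assign_commit("main", "f724")
print(refs["main"])  # -> f724
```

Because the later commits are still in history, the "undo" is itself reversible.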
5: Troubleshooting (see who changed the data)

Get the commit history for a branch:
SHOW LOGS AT REFERENCE etl;

Get the commit history for a specific table:
curl -X GET -H 'Authorization: Bearer <PAT>' <Catalog API Endpoint>/trees/tree/<reference name>/log?filter="operations.exists(op,op.key=='<table name>')"
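Conceptually, the filter in that API call keeps only commits whose operations touch the given table key. A plain-Python equivalent of that filtering step (the log entries, ids and authors below are made up for illustration; this is not the Nessie client library):

```python
# A fake commit log; real entries come from the catalog's log endpoint.
log = [
    {"id": "4231", "author": "dave", "operations": [{"key": "web.events"}]},
    {"id": "2563", "author": "ana",  "operations": [{"key": "web.users"}]},
    {"id": "9bc8", "author": "dave", "operations": [{"key": "web.events"}]},
]

def table_history(log, table):
    # Keep commits that touch the given table key, mirroring
    # filter="operations.exists(op, op.key=='<table name>')".
    return [c for c in log if any(op["key"] == table for op in c["operations"])]

print([c["id"] for c in table_history(log, "web.events")])  # -> ['4231', '9bc8']
```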