IBM Cloud Native Day April 2021: Serverless Data Lake

Big Data, the cloud-native way:
Serverless Data Lake with IBM Cloud
Torsten Steinbach
Cloud Data Lake Lead Architect | IBM

Cloud Data Lake Evolutionary Context
The 90s: Enterprise Data Warehouses
Tightly integrated and optimized systems
The 2000s: Hadoop
Introduced open data formats & easy scaling on commodity HW
Today: Cloud-Native: Serverless Analytics-aaS
• Elasticity
• Pay-per-query
• Data in object store
• Disaggregated architecture
• Increasingly real-time first

IBM Cloud Data Lake – Big Picture
Cloud Data Lake
✓ Seamless Elasticity
✓ Seamless Scalability
✓ Highly Cost Effective
✓ Long Term Retention
✓ Any data formats
Databases
✓ Response Time SLAs
✓ Warm High-quality Data only
[Diagram labels: Telemetry Data, Streaming, ETL or CDC Replication, Explore, Prep, Enrich, Optimize, Batch Query, ETL, Analytics, Interactive Query, Transactional Consistency, DWH, Cloud Data Lakehouse]

IBM Serverless Stack for Analytics
Serverless Storage (Object Storage): Only pay for volume of data that you really store
Serverless Runtimes (Cloud Functions): Only pay for CPU that you really consume
Serverless Analytics (SQL Query): Only pay for amount of data that you really scan
• Properties of Serverless:
  – No management of resources, hosts and processes
  – Auto-scaling and auto-provisioning based on actual load
  – Precise billing based on really consumed system resources (memory, storage, CPU, network, I/O)
  – High-Availability is always implicit

IBM SQL Query – The Central Cloud Data Lake Service
Serverless SQL Query Service
Cloud Data: Object Storage + RDBMS
Use cases: Data Ingestion, Data Transformation, Data Management, Analytics
Users: Developers, Data Engineers, Data Analysts, Data Scientists
✓ Supports ad-hoc and unknown data structures
✓ Ingestion & ELT Support
✓ 100% Pay-as-you-go ($5/TB)
✓ 100% API enabled
✓ Automatic Big Data Scale-Out with Spark
✓ 100% Self service, No Setup
✓ Built-In Database Catalog & Data Skipping
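To make the pay-as-you-go model concrete, the following is a minimal sketch of how a per-query charge could be estimated from the volume of data scanned. Only the $5/TB rate comes from the slide. The function name, the use of decimal terabytes (10^12 bytes) and the rounding to cents are assumptions for illustration, and real billing rules such as minimum charges are not modeled.

```python
from decimal import Decimal

PRICE_PER_TB_USD = Decimal("5")    # from the slide: $5/TB
BYTES_PER_TB = Decimal(10) ** 12   # assumption: decimal terabytes


def estimate_query_cost(bytes_scanned: int) -> Decimal:
    """Estimate the charge in USD for one query, rounded to cents (hypothetical helper)."""
    cost = PRICE_PER_TB_USD * Decimal(bytes_scanned) / BYTES_PER_TB
    return cost.quantize(Decimal("0.01"))


# A query that scans 250 GB, and one that scans a full 10 TB data set.
print(estimate_query_cost(250 * 10**9))  # 1.25
print(estimate_query_cost(10 * 10**12))  # 50.00
```

Because the charge is driven by bytes scanned, anything that reduces the scan (partitioning, columnar formats, data skipping) reduces cost as well as runtime.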

IBM SQL Query Architecture
Application flow: 1. Submit SQL, 2. Read data, 3. Write data, 4. Read results
Cloud Data Services: Event Streams, Db2 on Cloud, Cloud Object Storage
SQL Query features: Geospatial SQL, Timeseries SQL, Data Skipping, Hive Metastore
• Using IBM Analytics Engine service (Spark aaS)
• Large farm of Spark clusters auto-provisioned & auto-managed in background
• Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)

IBM SQL Query – Access Patterns
Stages: Create, Query, Explore, Integrate, Deploy
Access points: Object Store Console, Event Streams Console, SQL Console, Watson Studio Notebooks, Python SDK, REST API, JDBC, Cloud Functions

IBM Cloud Data Lake – Separating Out Responsibilities
Processing
• Serverless SQL (IBM SQL Query)
• Serverless Spark (IBM Analytics Engine)
• Serverless Containers (IBM Cloud Code Engine)
Meta Data
• Schema, Partitioning, Statistics: Hive Metastore, Kafka Schema Registry
• Data Skipping Indexes: Xskipper
• ACID: Iceberg, Delta Lake
• Governance Policies & Lineage: Watson Knowledge Catalog
Cloud Data (State)
• IBM Cloud Object Storage
• RDBMS (IBM Cloud Databases)
• IBM Event Streams

Data Lakehouse Architecture in IBM Cloud
Meta Data: Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg)
Components: Event Streams, SQL Query, COS, IBM Cloud Databases, BigSQL, Dremio, …
Functions:
• Stream data landing
• Schema management & enforcement
• Stream Xform & Joins
• Real-Time Queries
• Batch Queries
• ETL & Data Preparation
• CDC
• Interactive & DWH Queries

Streaming Data Lakes – Event Streams–COS Integration with SQL Query
New: Stream Landing
• Event Streams: Real-time event feeds in Kafka topics
• SQL Query: Serverless stream landing ingests Kafka topics into tables in COS
• COS: Cost-effective permanent storage and analytics for real-time data
Real-Time Serverless Data Lakes
• Turn topics into tables with a few clicks
• Fully managed ingestion of message feeds into Parquet at $0.10/hour for 1 MB/s capacity
• Infinite storage of all your message data in COS
• Run DWH-style SQL on your message data in a serverless manner
Common Ingest Fabric to Data Lakes
• Publish to Kafka to create your specialised domain COS lakehouse
  – Log records
  – Click stream data
  – IoT data
• Combine with Change Data Capture for real-time replication of all your systems into the data lake for analytics
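As a rough illustration of the stream landing price point, the sketch below works out what a given landing capacity would cost and how much data it could land at most. Only the $0.10 per hour per 1 MB/s figure comes from the slide. Linear scaling with capacity and time, decimal gigabytes, and both function names are assumptions for illustration.

```python
from decimal import Decimal

# From the slide: $0.10/hour for 1 MB/s capacity.
PRICE_PER_HOUR_PER_MBPS = Decimal("0.10")


def landing_cost_usd(capacity_mb_per_s: int, hours: int) -> Decimal:
    """Cost of running stream landing, assuming the price scales linearly."""
    return PRICE_PER_HOUR_PER_MBPS * capacity_mb_per_s * hours


def max_volume_gb(capacity_mb_per_s: int, hours: int) -> Decimal:
    """Upper bound on landed data (decimal GB) if the capacity is fully used."""
    return Decimal(capacity_mb_per_s * hours * 3600) / Decimal(1000)


# One day of landing at 1 MB/s.
print(landing_cost_usd(1, 24))  # 2.40
print(max_volume_gb(1, 24))     # 86.4
```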

IBM Cloud Data Lake: Real-Time Data Lake Solutions
Data feeds: Audit Trails, Cloud Platform Logs, Application Logs, Network Logs, User Behavior, IoT Feeds
Solutions: Log Lakes, AIOps Lakes, Compliance Lakes, IoT Lakes

Integrated IBM Solution for Cloud Data Lakes
Cloud Pak for Data as a Service
• Built on: IBM Cloud
• Uses: IBM Cloud Data Lake
IBM Cloud Data Lake
• Storage: COS
• Analytics: SQL Query
• Streaming: Event Streams
• Transformation: Spark
• Databases: Cloud Databases

Integrated IBM Solution for Cloud Data Lakes
Cloud Pak for Data aaS on IBM Cloud Data Lake
Ladder to AI: Collect, Organize, Analyze, Infuse
• Ingest: CDC, ETL, Kafka, Ad-hoc (Application Logs, IoT Streams, User Behavior)
• Manage: Explore & Prepare
• Govern: Data Catalogs, Projects & Connections
• Automate: DataStage & Kubeflow Pipelines
• Consume: Watson Studio, BigSQL, JDBC, Python, ML, Dremio, Presto, Data Virtualization, Tableau, Power BI, Cognos

IBM’s Serverless 2.0 Initiative
Workloads: AI & ML, DataOps & BI
Stateless Compute
• Runtimes: Apache Spark, Others
• Containers: IBM Cloud Code Engine
State
• Data: COS, Event Streams (Kafka)
• Meta Data: Common Hive Metastore
• Temp Data (Shuffle): NVMe, RAM
100% elastic: from hyperscale (petabytes) down to zero

Backup
I/O Optimization for Analytics

Analytic-Friendly Data Formats
Blog Article:
Data Layout

Data Skipping in IBM SQL Query
• Avoid reading irrelevant objects using indexes
• Complements partition pruning -> object level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
  – Currently MinMax, ValueList, BloomFilter, Geospatial
• Underlying data skipping library is extensible
  – New index types can easily be supported
• Enables data skipping on SQL UDFs
  – e.g. ST_Contains, ST_Distance etc.
  – UDFs are mapped to indexes
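The idea of storing aggregate metadata per object can be sketched as follows. This is a simplified illustration in plain Python, not the data skipping library itself: the `ObjectIndex` structure, the `build_index` and `can_skip_for_city` helpers, and the `city` column are all hypothetical. It shows a MinMax summary and a ValueList summary for one object, and a skipping decision based on the value list.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Mapping


@dataclass(frozen=True)
class ObjectIndex:
    """Aggregate metadata kept for one object (hypothetical structure)."""
    name: str
    min_temp: float         # MinMax index: lower bound
    max_temp: float         # MinMax index: upper bound
    cities: FrozenSet[str]  # ValueList index: distinct values in the object


def build_index(name: str, rows: Iterable[Mapping]) -> ObjectIndex:
    """Summarize the rows of one object into its index entry."""
    rows = list(rows)
    temps = [row["temp"] for row in rows]
    cities = frozenset(row["city"] for row in rows)
    return ObjectIndex(name, min(temps), max(temps), cities)


def can_skip_for_city(index: ObjectIndex, city: str) -> bool:
    """An object is safe to skip only if the value is definitely absent."""
    return city not in index.cities


# Illustrative rows; the city values are made up for this sketch.
rows = [{"temp": 7.97, "city": "Raleigh"}, {"temp": 26.77, "city": "Durham"}]
index = build_index("dt=2020-08-17/part-00085", rows)
print(can_skip_for_city(index, "Boston"))  # True
print(can_skip_for_city(index, "Durham"))  # False
```

The index entry is far smaller than the object it describes, which is why consulting it first is cheap compared with reading the data.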

How Data Skipping Works
Spark SQL Query Execution Flow
Uses Catalyst optimizer and session extensions API
Standard flow: Query → Prune partitions → Read data
With data skipping: Query → Prune partitions → Optional file filter (Metadata Filter) → Read data
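The two flows can be mimicked with a small planning function. This is only a sketch of the control flow in plain Python, not of Catalyst or the session extensions API. The `plan_scan` function, the `dt=2020-08-16` object name and the stand-in predicates are assumptions for illustration.

```python
from typing import Callable, List, Optional


def plan_scan(
    objects: List[str],
    partition_predicate: Callable[[str], bool],
    file_filter: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return the objects to read: prune partitions, then apply the optional file filter."""
    selected = [name for name in objects if partition_predicate(name)]
    if file_filter is not None:
        selected = [name for name in selected if file_filter(name)]
    return selected


objects = [
    "dt=2020-08-16/part-00001",  # hypothetical object in another partition
    "dt=2020-08-17/part-00085",
    "dt=2020-08-17/part-00088",
]


def in_partition(name: str) -> bool:
    return name.startswith("dt=2020-08-17/")


def passes_metadata(name: str) -> bool:
    # Stand-in for a real metadata lookup.
    return name.endswith("part-00088")


without_skipping = plan_scan(objects, in_partition)
with_skipping = plan_scan(objects, in_partition, passes_metadata)
print(without_skipping)  # ['dt=2020-08-17/part-00085', 'dt=2020-08-17/part-00088']
print(with_skipping)     # ['dt=2020-08-17/part-00088']
```

Keeping the file filter optional means queries without usable indexes fall back to plain partition pruning.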

Data Skipping Example
Data (object listing): Weather/dt=2020-08-17/part-00085.parquet
Example Query
SELECT *
FROM cos://us-geo/twc/Weather STORED AS parquet
WHERE temp > 40
Metadata
Object Name                 Temp Min   Temp Max
...
dt=2020-08-17/part-00085        7.97      26.77
dt=2020-08-17/part-00086        2.45      23.71
dt=2020-08-17/part-00087        6.46      18.62
dt=2020-08-17/part-00088       23.67      41.02
...
Objects marked red on the slide are not relevant to this query. In the rows shown, these are part-00085 to part-00087, whose Temp Max is below 40.
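The skipping decision for this example can be reproduced with a few lines of Python. This is a minimal sketch using the four metadata rows shown on the slide. The function name is hypothetical and the real service evaluates this inside Spark.

```python
# (Temp Min, Temp Max) per object, copied from the slide's metadata table.
METADATA = {
    "dt=2020-08-17/part-00085": (7.97, 26.77),
    "dt=2020-08-17/part-00086": (2.45, 23.71),
    "dt=2020-08-17/part-00087": (6.46, 18.62),
    "dt=2020-08-17/part-00088": (23.67, 41.02),
}


def objects_to_read(metadata, lower_bound):
    """Keep only objects that may contain rows with temp > lower_bound."""
    return [
        name
        for name, (_temp_min, temp_max) in metadata.items()
        if temp_max > lower_bound
    ]


# WHERE temp > 40: only an object whose maximum exceeds 40 can match.
print(objects_to_read(METADATA, 40))  # ['dt=2020-08-17/part-00088']
```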

Geospatial Data Skipping Example
Example Query
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
Metadata
Object Name                 lat Min   lat Max
...
dt=2020-08-17/part-00085      35.02     36.17
dt=2020-08-17/part-00086      43.59     44.95
dt=2020-08-17/part-00087      34.86     40.62
dt=2020-08-17/part-00088      23.67     25.92
...
Objects marked red on the slide are not relevant to this query. By latitude alone, part-00086 and part-00088 fall outside the polygon's range of 35.78 to 36.00.
The polygon covers the Raleigh Research Triangle (US).
The ST_Contains UDF is mapped to necessary conditions on lat, long.
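Mapping the UDF to necessary conditions can be illustrated with the polygon's bounding range. This is a minimal sketch, not the library's implementation: a point inside the polygon must have a latitude within the polygon's minimum and maximum latitude, so any object whose latitude range does not overlap that interval can be skipped. Only the latitude metadata shown on the slide is used. The same test would apply to longitude, whose metadata is not shown. The function names are hypothetical.

```python
# Polygon vertices as (long, lat), copied from the example query.
POLYGON = [(-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00)]

# (lat Min, lat Max) per object, copied from the slide's metadata table.
LAT_METADATA = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def lat_range(polygon):
    """Latitude interval that any point inside the polygon must fall into."""
    lats = [lat for _long, lat in polygon]
    return min(lats), max(lats)


def may_overlap(obj_min, obj_max, query_min, query_max):
    """True if the object's interval and the query interval intersect."""
    return obj_min <= query_max and obj_max >= query_min


query_min, query_max = lat_range(POLYGON)
relevant = [
    name
    for name, (obj_min, obj_max) in LAT_METADATA.items()
    if may_overlap(obj_min, obj_max, query_min, query_max)
]
print(relevant)  # ['dt=2020-08-17/part-00085', 'dt=2020-08-17/part-00087']
```

The condition is necessary but not sufficient: a surviving object may still contain no matching point, so the exact ST_Contains check still runs on the data that is read.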

10x Acceleration with Data Skipping and Catalog
• Query rewrite approach (yellow in the chart) is the baseline
  – Using already optimized data format: Parquet/ORC
• For other formats the acceleration is much larger
  – e.g. CSV/JSON/Avro
• Experiment uses the Raleigh Research Triangle query
• 10x speedup on average
• 10 TB of Weather Data on COS

The COVID-19 Data Lake
Making trusted COVID-19 data available to a broad set of analytics, e.g.:
  – https://accelerator.weather.com/bi
  – Watson Health Return to Work Advisor
• Easily extensible with new data sources
• Maximized velocity and elasticity
• Full automation of all pipelines
• New pipeline prototype in hours & productized in 2-3 days
• Radically minimizing resource and operational costs by using IBM Cloud serverless and full ops automation
Cloud Object Storage: Persist, Trigger
SQL Query: Transformation, Transport, Table Catalog (Mart), Queries, Export
Watson Studio: Static Content Creation, Schema Management, Pipeline PoCs, Usage Tutorials
Cloud Functions: Pipeline Productization, Automation, Monitoring & Alerting, Pull External Data

COVID-19 Data Lake Topology – High Level
[Diagram]
Landing Zone (E): Landing Buckets, Landing Namespace, Collectors Sequences
Preparation Zone (T): Preparation Buckets, Preparation Namespace, Preparation Sequences
Integration Zone (L): Integration Buckets, Integration Namespace, Mart Sequences, Delivery Sequences
Other diagram labels: External Data Sources (Pull, Push), TWC Scrapers & Pipeline, Pipeline Instance, Schema Management, Static Content Management, Pipeline PoC Project, Preliminary Pipeline Notebooks, Usage Notebooks, Data Mart Instance, Table Catalog, Mart Management Project, Data Mart Access Project, Location Statistics, Upload, Update Reference Data, Add Partitions, Query & Extract, Transform, Dashboarding, DWH, COGNOS, Users
