Big Data, the cloud-native way:
Serverless Data Lake with IBM Cloud
Torsten Steinbach
Cloud Data Lake Lead Architect | IBM
Cloud Data Lake Evolutionary Context
The 1990s: Enterprise Data Warehouses
Tightly integrated and optimized systems
The 2000s: Hadoop
Introduced open data formats & easy scaling on commodity HW
Today: Cloud-Native: Serverless Analytics-aaS
• Elasticity
• Pay-per-query
• Data in object store
• Disaggregated architecture
• Increasingly real-time first
Telemetry Data
Explore
ETL or CDC
Replication
Prep Enrich
Streaming
Optimize Batch Query
✓ Seamless Elasticity
✓ Seamless Scalability
✓ Highly Cost Effective
✓ Long Term Retention
✓ Any data format
ETL
IBM Cloud Data Lake – Big Picture
Databases
✓ Response Time SLAs
✓ Warm High-quality Data only
Cloud Data Lake
Analytics
Interactive
Query
Transactional
Consistency
DWH
Cloud Data Lakehouse
IBM Serverless Stack for Analytics
Serverless Storage: Object Storage
Only pay for volume of data that you really store
Serverless Runtimes: Cloud Functions
Only pay for CPU that you really consume
Serverless Analytics: Query
Only pay for amount of data that you really scan
Blog Article
• Properties of Serverless:
– No management of resources, hosts and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on really consumed system resources (memory, storage, CPU, network, I/O)
– High availability is always implicit
IBM SQL Query – The Central Cloud Data Lake Service
Cloud Data
Data
Transformation
Serverless SQL Query Service
Analytics
Object
Storage RDBMS
+
Developers
Data
Engineers
Data Analysts
✓ Supports ad-hoc and unknown data structures
✓ Ingestion & ELT Support
✓ 100% Pay-as-you-go ($5/TB)
✓ 100% API enabled
✓ Automatic Big Data Scale-Out with Spark
✓ 100% Self service, No Setup
Data
Management
+
Data Scientists
✓ Built-In Database Catalog & Data Skipping
Data Ingestion
+
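To make the pay-as-you-go line concrete, here is a small arithmetic sketch. It assumes the $5/TB price from the slide applies to the bytes a query scans, uses decimal terabytes (10^12 bytes), and ignores any minimum charge or free tier. The function name is illustrative and not part of any IBM API.

```python
PRICE_PER_TB_USD = 5.0   # from the slide: 100% pay-as-you-go ($5/TB)
BYTES_PER_TB = 10**12    # assumption: decimal terabytes


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimated charge in USD for a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = estimate_query_cost(10 * BYTES_PER_TB)    # query reads all 10 TB
pruned_scan = estimate_query_cost(1 * BYTES_PER_TB)   # query reads 1 TB after pruning
print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

Under these assumptions, anything that reduces the scanned volume (partitioning, columnar formats, data skipping) reduces the bill in the same proportion.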
IBM SQL Query Architecture
Request flow (Application, SQL Query, Cloud Data Services):
1. Submit SQL
2. Read data
3. Write data
4. Read results
Data services: Cloud Object Storage, Event Streams, Db2 on Cloud
SQL Query features: Geospatial SQL, Data Skipping, Timeseries SQL, Hive Metastore
Video
• Using the IBM Analytics Engine service (Spark aaS)
• Large farm of Spark clusters auto-provisioned & auto-managed in the background
• Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)
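The dynamic resource allocation mentioned above is a standard Apache Spark feature controlled through configuration properties. The sketch below only illustrates what such a configuration can look like: the property keys are real Spark settings, but the values are made up and the slides do not say which settings the service actually uses. The helper `to_submit_args` is hypothetical.

```python
# Illustrative values only; the keys are standard Spark properties.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Allows executors to be released without an external shuffle service (Spark 3.0+).
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}


def to_submit_args(conf: dict) -> list:
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

With settings like these, a job starts small and Spark requests more executors while tasks are queued, then releases idle ones, which is the behavior the bullet describes for each SQL job.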
IBM SQL Query – Access Patterns
Create
Query
SQL
Console
Watson
Studio
Notebooks
Cloud Functions
Integrate Explore
Deploy
Python SDK
REST API
JDBC
Object
Store
Console
Event
Streams
Console
Meta Data
IBM Cloud Data Lake – Separating Out Responsibilities
Cloud Data
ACID
Serverless Spark (IBM Analytics Engine)
Data Skipping Indexes Governance Policies
& Lineage
Schema, Partitioning,
Statistics
Serverless SQL (IBM SQL Query)
IBM Cloud
Object
Storage
RDBMS
Hive
Metastore
Kafka Schema
Registry
Xskipper Iceberg
Watson Knowledge
Catalog
Delta Lake
Serverless Containers (IBM Cloud Code Engine)
IBM Event Streams IBM Cloud Databases
Processing
State
Data Lakehouse Architecture in IBM Cloud
…
BigSQL
Dremio
IBM Cloud
Databases
Event Streams SQL Query
Meta Data
Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg)
Real-Time
Queries
COS
Batch
Queries
Stream Xform
& Joins
Stream data landing
Schema management & enforcement
ETL & Data
Preparation
CDC
Interactive &
DWH Queries
Streaming Data Lakes – EventStreams–COS Integration with SQL Query
New
Stream Landing
Event Streams: Real-time event feeds in Kafka topics
SQL Query: Serverless stream landing ingests Kafka topics into tables in COS
COS: Cost-effective permanent storage and analytics for real-time data
Real-Time Serverless Data Lakes
Turn topics into tables with a few clicks
Fully managed ingestion of message feeds into Parquet at $0.10/hour for 1 MB/s capacity
Infinite storage of all your message data in COS
Run DWH-style SQL on your message data in a serverless manner
Publish to Kafka to create your specialised domain COS lakehouse
• Log records
• Click stream data
• IoT data
Combine with Change Data Capture for real-time replication of all your systems into the data lake for analytics
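The stream landing price quoted on this slide can be turned into a quick estimate. This is a back-of-the-envelope sketch that assumes the price is $0.10 per hour per 1 MB/s of capacity, that the stream runs at full capacity, and that 1 GB is 1000 MB. COS storage charges are not included, and the function name is illustrative.

```python
PRICE_PER_HOUR_USD = 0.10   # from the slide, per 1 MB/s of capacity
CAPACITY_MB_PER_S = 1       # one capacity unit


def landing_estimate(hours: float) -> tuple:
    """Return (cost in USD, maximum data landed in GB) for `hours` of operation."""
    cost = PRICE_PER_HOUR_USD * hours
    max_gb = CAPACITY_MB_PER_S * 3600 * hours / 1000  # decimal GB
    return cost, max_gb


cost, gb = landing_estimate(24)
print(f"24 h: ${cost:.2f} for up to {gb:.1f} GB landed")
```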
Common Ingest Fabric
to Data Lakes
IBM Cloud Data Lake
Real-Time Data Lake Solutions
Audit Trails
Cloud Platform Logs
Application Logs
Network Logs
User Behavior
IoT Feeds
IoT Lakes
Log Lakes AIOps Lakes Compliance Lakes
IBM Solution for Data & AI
Cloud Pak for Data as a Service
Built On
IBM Cloud
Uses
IBM Cloud Data Lake
COS
Storage Analytics
SQL Query
Event Streams
Streaming Transformation
Spark Cloud Databases
Databases
Integrated IBM Solution for Cloud Data Lakes
IBM Cloud Data Lake
Manage
Explore &
Prepare
Govern
Data Catalogs, Projects & Connections
Automate
Data Stage &
Kubeflow Pipelines
Consume
Watson Studio,
BigSQL
Cloud Pak for Data aaS
Ingest
CDC
Ad-hoc
Application Logs
IoT Streams
User Behavior
ETL
JDBC
Python
Dremio
Presto
ML
Tableau
Data Virtualization
Kafka
Power BI
Cognos
Infuse
Analyze
Organize
Collect
Ladder to AI
Outlook
IBM’s Serverless 2.0 Initiative
Data COS
EventStreams
(Kafka)
State Meta Data Common Hive Metastore
Temp Data NVMe
RAM
Containers IBM Cloud Code Engine
Runtimes Others Apache Spark
Stateless
Compute
Shuffle
100% Elastic with
Hyperscale &
Scale down to Zero
AI & ML DataOps & BI
Petabytes
Backup
I/O Optimization for Analytics
Analytic-Friendly Data Formats
Blog Article:
Data Layout
Data Skipping in IBM SQL Query
• Avoid reading irrelevant objects using indexes
• Complements partition pruning -> object level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
• Currently MinMax, ValueList, BloomFilter, Geospatial
• Underlying data skipping library is extensible
• New index types can easily be supported
• Enables data skipping on SQL UDFs
• e.g. ST_Contains, ST_Distance etc.
• UDFs are mapped to indexes
How Data Skipping Works
Spark SQL Query Execution Flow
Uses Catalyst optimizer and session extensions API
Without data skipping: Query -> Prune partitions -> Read data
With data skipping: Query -> Prune partitions -> Optional file filter (metadata filter) -> Read data
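The execution flow can be mimicked in a few lines of plain Python. This is a conceptual sketch, not the Catalyst integration itself: the function names, the Hive-style `dt=` object names, and the stand-in metadata filter are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional


def partition_value(object_name: str, key: str) -> Optional[str]:
    """Extract a Hive-style partition value (key=value) from an object name."""
    for segment in object_name.split("/"):
        if segment.startswith(key + "="):
            return segment[len(key) + 1:]
    return None


def plan_reads(objects: Iterable[str], key: str, wanted: str,
               file_filter: Optional[Callable[[str], bool]] = None) -> List[str]:
    """Step 1: prune partitions. Step 2: apply the optional file filter."""
    survivors = [o for o in objects if partition_value(o, key) == wanted]
    if file_filter is not None:
        survivors = [o for o in survivors if file_filter(o)]
    return survivors  # only these objects would be read


objects = [
    "Weather/dt=2020-08-17/part-00085.parquet",
    "Weather/dt=2020-08-17/part-00088.parquet",
    "Weather/dt=2020-08-18/part-00001.parquet",
]


def may_match(name: str) -> bool:
    """Hypothetical metadata filter: pretend only part-00088 can contain matches."""
    return name.endswith("part-00088.parquet")


print(plan_reads(objects, "dt", "2020-08-17"))
print(plan_reads(objects, "dt", "2020-08-17", file_filter=may_match))
```

The file filter is optional by design: when no index metadata exists, the plan falls back to plain partition pruning and the query result is unchanged, only more objects are read.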
Data Skipping Example
Weather/dt=2020-08-17/part-00085.parquet
Weather/dt=2020-08-17/part-00086.parquet
Weather/dt=2020-08-17/part-00087.parquet
Weather/dt=2020-08-17/part-00088.parquet
Weather/dt=2020-08-18/part-00001.parquet
Weather/dt=2020-08-18/part-00002.parquet
Data
Object Listing
Example Query
SELECT *
FROM cos://us-geo/twc/Weather STORED AS parquet
WHERE temp > 40
Object Name                 Temp Min   Temp Max
...
dt=2020-08-17/part-00085        7.97      26.77
dt=2020-08-17/part-00086        2.45      23.71
dt=2020-08-17/part-00087        6.46      18.62
dt=2020-08-17/part-00088       23.67      41.02
...
Metadata
Objects whose Temp Max does not exceed 40 (marked red on the slide) are not relevant to this query
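The skipping decision for the example query `temp > 40` can be sketched with the metadata values shown on this slide. This is a minimal illustration of a MinMax index check, not the service's implementation; the dictionary layout and function names are assumptions.

```python
# Per-object min/max metadata for column `temp`, copied from the slide.
TEMP_MINMAX = {
    "dt=2020-08-17/part-00085": (7.97, 26.77),
    "dt=2020-08-17/part-00086": (2.45, 23.71),
    "dt=2020-08-17/part-00087": (6.46, 18.62),
    "dt=2020-08-17/part-00088": (23.67, 41.02),
}


def may_contain_greater_than(minmax, threshold):
    """An object can hold rows with value > threshold only if its max exceeds it."""
    _min, _max = minmax
    return _max > threshold


def objects_to_read(index, threshold):
    """Keep only the objects that cannot be ruled out by their min/max metadata."""
    return [name for name, mm in index.items()
            if may_contain_greater_than(mm, threshold)]


print(objects_to_read(TEMP_MINMAX, 40))  # ['dt=2020-08-17/part-00088']
```

The check is conservative: a surviving object may still contain no matching rows, but a skipped object is guaranteed to contain none.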
Geospatial Data Skipping Example
Example Query
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(
  ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
  ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
Object Name                 lat Min    lat Max
...
dt=2020-08-17/part-00085      35.02      36.17
dt=2020-08-17/part-00086      43.59      44.95
dt=2020-08-17/part-00087      34.86      40.62
dt=2020-08-17/part-00088      23.67      25.92
...
Metadata
Objects whose lat range does not overlap the latitude range of the query polygon (marked red on the slide) are not relevant to this query
Raleigh Research Triangle (US)
Map the ST_Contains UDF to necessary conditions on lat, long
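Mapping ST_Contains to necessary conditions can be illustrated with a bounding-box check. The sketch assumes the polygon vertices in the example query are (long, lat) pairs, matching the `ST_Point(long, lat)` argument order, and it only checks latitude because the slide shows only lat metadata. Function names are hypothetical.

```python
# Polygon vertices from the example query, read as (long, lat) pairs (assumption).
POLYGON = [(-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00)]

# Per-object min/max metadata for column `lat`, copied from the slide.
LAT_MINMAX = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def lat_bounds(polygon):
    """Latitude extent of the polygon's bounding box."""
    lats = [lat for _long, lat in polygon]
    return min(lats), max(lats)


def may_intersect(obj_range, query_range):
    """Necessary (not sufficient) condition: the two intervals overlap."""
    return obj_range[0] <= query_range[1] and obj_range[1] >= query_range[0]


bounds = lat_bounds(POLYGON)
candidates = [name for name, r in LAT_MINMAX.items() if may_intersect(r, bounds)]
print(bounds)
print(candidates)
```

A point inside the polygon must lie inside its bounding box, so any object whose lat interval misses the box can be skipped safely. The same test on long would tighten the result further.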
10x Acceleration with Data Skipping and Catalog
Query rewrite approach (yellow) is the baseline
• Using already optimized data format: Parquet/ORC
For other formats the acceleration is much larger
• e.g. CSV/JSON/Avro
Experiment uses Raleigh Research Triangle query
10x speedup on average
10 TB of Weather Data on COS
A Real-Life Cloud Data Lake
Making trusted COVID-19 data available to a broad set of analytics, e.g.:
• https://accelerator.weather.com/bi
• Watson Health Return to Work Advisor
The COVID-19 Data Lake
• Easily extensible with new data sources
• Maximized velocity and elasticity
• Full automation of all pipelines
• New pipeline prototype in hours & productized in 2-3 days
• Radically minimizing resource and operational costs by using IBM Cloud serverless and full ops automation
Cloud Functions
Cloud
Object Storage
- Persist
- Trigger
- Static Content Creation
- Schema Management
- Pipeline PoCs
- Usage Tutorials
Watson Studio
SQL Query
- Transformation
- Transport
- Table Catalog (Mart)
- Queries
- Export
- Pipeline Productization
- Automation
- Monitoring & Alerting
- Pull External Data
COVID-19 Data Lake Topology – High Level
Landing Zone (E)
Landing Buckets
Preparation Zone (T)
Landing Namespace
Preparation
Namespace
Preparation Buckets
Integration Zone (L)
Dashboarding
DWH
Integration Buckets
Data Mart Instance
Integration
Namespace
Mart Management
Project
Data Mart Access
Project
TWC Scrapers & Pipeline
Collectors Sequences
Preparation Sequences
Mart Sequences
Delivery Sequences
Pipeline Instance
Schema
Management
Static Content
Management
Pipeline Instance
Usage Notebooks
Table Catalog
Preparation Sequences
External
Data
Sources
Pull
Push
Collectors Sequences
Preparation Sequences
Usage Notebooks
Usage Notebooks
Users
Pipeline PoC Project
Preliminary Pipeline
Notebooks
Location
Statistics
Upload
Update
Reference
Data
Add
Partitions
Query &
Extract
Transform
COGNOS
IBM Cloud Native Day April 2021: Serverless Data Lake
Ad

More Related Content

What's hot (20)

Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Alberto Diaz Martin
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Michael Rys
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Rakesh Jayaram
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Informatica
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Michael Rys
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
BizTalk360
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
Lam Le
 
Tarun poladi resume
Tarun poladi resumeTarun poladi resume
Tarun poladi resume
Tarun P
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
Eduardo Castro
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
Databricks
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Data Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Lakes with Azure Databricks
Data Lakes with Azure Databricks
Data Con LA
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
Dave Nielsen
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
Erwin de Kreuk
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Carole Gunst
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
Phar Data Platform: From the Lakehouse Paradigm to the Reality
Phar Data Platform: From the Lakehouse Paradigm to the RealityPhar Data Platform: From the Lakehouse Paradigm to the Reality
Phar Data Platform: From the Lakehouse Paradigm to the Reality
Databricks
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Alberto Diaz Martin
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Michael Rys
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Informatica
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Michael Rys
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
BizTalk360
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
Lam Le
 
Tarun poladi resume
Tarun poladi resumeTarun poladi resume
Tarun poladi resume
Tarun P
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
Eduardo Castro
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
Databricks
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Data Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Lakes with Azure Databricks
Data Lakes with Azure Databricks
Data Con LA
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
Erwin de Kreuk
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Carole Gunst
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
Phar Data Platform: From the Lakehouse Paradigm to the Reality
Phar Data Platform: From the Lakehouse Paradigm to the RealityPhar Data Platform: From the Lakehouse Paradigm to the Reality
Phar Data Platform: From the Lakehouse Paradigm to the Reality
Databricks
 

Similar to IBM Cloud Native Day April 2021: Serverless Data Lake (16)

Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
Torsten Steinbach
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
Lam Le
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
Amazon Web Services Korea
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
Shu-Jeng Hsieh
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AI
Torsten Steinbach
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
Sungmin Kim
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
huguk
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
Marek Kuczynski
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Streamsets Inc.
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
Databricks
 
Data analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueData analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenue
Kris Peeters
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
Lam Le
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
Amazon Web Services Korea
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
Shu-Jeng Hsieh
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AI
Torsten Steinbach
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
Sungmin Kim
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
huguk
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
Marek Kuczynski
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Streamsets Inc.
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
Databricks
 
Data analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueData analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenue
Kris Peeters
 
Ad

More from Torsten Steinbach (11)

IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
Torsten Steinbach
 
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
Torsten Steinbach
 
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM CloudIBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
Torsten Steinbach
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
Torsten Steinbach
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
Torsten Steinbach
 
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
IBM Insight 2014 - Advanced Warehouse Analytics in the CloudIBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
Torsten Steinbach
 
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloudIBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
Torsten Steinbach
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
Torsten Steinbach
 
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
Torsten Steinbach
 
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
Torsten Steinbach
 
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
Torsten Steinbach
 
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
Torsten Steinbach
 
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
Torsten Steinbach
 
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM CloudIBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
Torsten Steinbach
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
Torsten Steinbach
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
Torsten Steinbach
 
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
IBM Insight 2014 - Advanced Warehouse Analytics in the CloudIBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
Torsten Steinbach
 
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloudIBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
Torsten Steinbach
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
Torsten Steinbach
 
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
Torsten Steinbach
 
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
Torsten Steinbach
 
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
Torsten Steinbach
 
Ad


IBM Cloud Native Day April 2021: Serverless Data Lake

  • 1. Big Data, the cloud-native way: Serverless Data Lake with IBM Cloud Torsten Steinbach Cloud Data Lake Lead Architect | IBM
  • 2. Cloud Data Lake Evolutionary Context. The 1990s: Enterprise Data Warehouses (tightly integrated and optimized systems). 2000: Hadoop (introduced open data formats & easy scaling on commodity HW). Today: Cloud-Native Serverless Analytics-aaS • Elasticity • Pay-per-query • Data in object store • Disaggregated architecture • Increasingly real-time first
  • 3. Telemetry Data Explore ETL or CDC Replication Prep Enrich Streaming Optimize Batch Query ✓ Seamless Elasticity ✓ Seamless Scalability ✓ Highly Cost Effective ✓ Long Term Retention ✓ Any data formats ETL IBM Cloud Data Lake – Big Picture Databases ✓ Response Time SLAs ✓ Warm High-quality Data only Cloud Data Lake Analytics Interactive Query Transactional Consistency DWH Cloud Data Lakehouse
  • 4. IBM Serverless Stack for Analytics Serverless Storage Serverless Runtimes Serverless Analytics Object Storage Cloud Functions Query Only pay for volume of data that you really store Only pay for amount of data that you really scan Only pay for CPU that you really consume Blog Article • Properties of Serverless: – No management of resources, hosts and processes – Auto-scaling and auto-provisioning based on actual load – Precise billing based on actually consumed system resources (memory, storage, CPU, network, I/O) – High availability is always implicit
  • 5. IBM SQL Query – The Central Cloud Data Lake Service Cloud Data Data Transformation Serverless SQL Query Service Analytics Object Storage RDBMS + Developers Data Engineers Data Analysts ✓ Supports ad-hoc and unknown data structures ✓ Ingestion & ELT Support ✓ 100% Pay-as-you-go ($5/TB) ✓ 100% API enabled ✓ Automatic Big Data Scale-Out with Spark ✓ 100% Self service, No Setup Data Management + Data Scientists ✓ Built-In Database Catalog & Data Skipping Data Ingestion +
  • 6. IBM SQL Query Architecture 1. Submit SQL 2. Read data 3. Write data 4. Read results Application Cloud Data Services SQL Event Streams Query Db2 on Cloud Geospatial SQL Data Skipping Timeseries SQL Hive Metastore Video Cloud Object Storage • Using IBM Analytics Engine service (Spark aaS) • Large farm of Spark clusters auto-provisioned & auto-managed in background • Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway) • SQL grammar sandbox • Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation • Intrinsically HA (dispatching across Spark environments in each availability zone)
  • 7. IBM SQL Query – Access Patterns Create Query SQL Console Watson Studio Notebooks Cloud Functions Integrate Explore Deploy Python SDK REST API JDBC Object Store Console Event Streams Console
  • 8. Meta Data IBM Cloud Data Lake – Separating Out Responsibilities Cloud Data ACID Serverless Spark (IBM Analytics Engine) Data Skipping Indexes Governance Policies & Lineage Schema, Partitioning, Statistics Serverless SQL (IBM SQL Query) IBM Cloud Object Storage RDBMS Hive Metastore Kafka Schema Registry Xskipper Iceberg Watson Knowledge Catalog Delta Lake Serverless Containers (IBM Cloud Code Engine) IBM Event Streams IBM Cloud Databases Processing State
  • 9. Data Lakehouse Architecture in IBM Cloud … BigSQL Dremio IBM Cloud Databases Event Streams SQL Query Meta Data Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg) Real-Time Queries COS Batch Queries Stream Xform & Joins Stream data landing Schema management & enforcement ETL & Data Preparation CDC Interactive & DWH Queries
  • 10. Streaming Data Lakes – Event Streams–COS Integration with SQL Query. New: Stream Landing. Event Streams: real-time event feeds in Kafka topics. SQL Query: serverless stream landing ingests Kafka topics into tables in COS. COS: cost-effective permanent storage and analytics for real-time data. Real-Time Serverless Data Lakes • Turn topics into tables with a few clicks • Fully managed ingestion of message feeds into Parquet at $0.10/hour for 1 MB/s capacity • Infinite storage of all your message data in COS • Run DWH-style SQL on your message data in a serverless manner • Publish to Kafka to create your specialized domain COS lakehouse (log records, click stream data, IoT data) • Combine with Change Data Capture for real-time replication of all your systems into the data lake for analytics. Common Ingest Fabric to Data Lakes
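The stream landing price quoted on this slide can be turned into a rough daily estimate. The following is a minimal sketch, assuming the slide's figures ($0.10 per hour for 1 MB/s of capacity), decimal units (1 GB = 1000 MB), and a feed that uses the full capacity all the time; the function names are illustrative and not part of any IBM API.

```python
from decimal import Decimal

# Figures quoted on the slide; actual pricing may differ.
RATE_USD_PER_HOUR = Decimal("0.10")   # price for one unit of landing capacity
CAPACITY_MB_PER_S = Decimal("1")      # capacity covered by that price


def landing_cost_usd(hours: int) -> Decimal:
    """Cost of keeping one capacity unit running for the given hours."""
    return RATE_USD_PER_HOUR * hours


def max_volume_gb(hours: int) -> Decimal:
    """Upper bound on data landed in that time, assuming full utilization."""
    seconds = Decimal(3600) * hours
    return CAPACITY_MB_PER_S * seconds / Decimal(1000)


if __name__ == "__main__":
    print("cost per day (USD):", landing_cost_usd(24))
    print("max volume per day (GB):", max_volume_gb(24))
```

Under these assumptions, one day of landing costs 2.40 USD and can ingest at most 86.4 GB; a feed that stays below the capacity lands less data for the same hourly price.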
  • 11. IBM Cloud Data Lake Real-Time Data Lake Solutions Audit Trails Cloud Platform Logs Application Logs Network Logs User Behavior IoT Feeds IoT Lakes Log Lakes AIOps Lakes Compliance Lakes
  • 12. IBM Solution for Data & AI
  • 13. Cloud Pak for Data as a Service Built On IBM Cloud Uses IBM Cloud Data Lake COS Storage Analytics SQL Query Event Streams Streaming Transformation Spark Cloud Databases Databases Integrated IBM Solution for Cloud Data Lakes
  • 14. Integrated IBM Solution for Cloud Data Lakes IBM Cloud Data Lake Manage Explore & Prepare Govern Data Catalogs, Projects & Connections Automate Data Stage & Kubeflow Pipelines Consume Watson Studio, BigSQL Cloud Pak for Data aaS Ingest CDC Ad-hoc Application Logs IoT Streams User Behavior ETL JDBC Python Dremio Presto ML Tableau Data Virtualization Kafka Power BI Cognos Infuse Analyze Organize Collect Ladder to AI
  • 16. IBM’s Serverless 2.0 Initiative Data COS EventStreams (Kafka) State Meta Data Common Hive Metastore Temp Data NVMe RAM Containers IBM Cloud Code Engine Runtimes Others Apache Spark Stateless Compute Shuffle 100% Elastic with Hyperscale & Scale down to Zero AI & ML DataOps & BI Petabytes
  • 18. Analytic-Friendly Data Formats Blog Article: Data Layout
  • 19. Data Skipping in IBM SQL Query • Avoid reading irrelevant objects using indexes • Complements partition pruning -> object level pruning • Stores aggregate metadata per object to enable skipping decisions • Indexes are stored in COS • Supports multiple index types • Currently MinMax, ValueList, BloomFilter, Geospatial • Underlying data skipping library is extensible • New index types can easily be supported • Enables data skipping on SQL UDFs • e.g. ST_Contains, ST_Distance etc. • UDFs are mapped to indexes
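One of the index types listed on this slide, ValueList, can be illustrated with a small sketch. It assumes that a ValueList index stores the set of distinct column values per object; the object names, the `city` column, and the function name are hypothetical and do not reflect the actual index layout in COS.

```python
from typing import Dict, List, Set

# Hypothetical per-object ValueList metadata: the distinct values of a
# "city" column found in each object. Illustrative data only.
VALUE_LIST_INDEX: Dict[str, Set[str]] = {
    "part-00001": {"Raleigh", "Durham"},
    "part-00002": {"Austin"},
    "part-00003": {"Raleigh", "Boston"},
}


def objects_for_equality(index: Dict[str, Set[str]], wanted: str) -> List[str]:
    """Return objects that may contain rows with city == wanted.

    An object whose value list lacks the wanted value cannot match,
    so it can be skipped without being read.
    """
    return sorted(name for name, values in index.items() if wanted in values)


if __name__ == "__main__":
    print(objects_for_equality(VALUE_LIST_INDEX, "Raleigh"))
```

The same pattern applies to the other index types: each one keeps a compact per-object summary that can prove an object irrelevant to a predicate.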
  • 20. How Data Skipping Works Spark SQL Query Execution Flow Uses Catalyst optimizer and session extensions API Query Prune partitions Read data Query Prune partitions Optional file filter Read data Metadata Filter
  • 21. Data Skipping Example
    Data object listing:
      Weather/dt=2020-08-17/part-00085.parquet
      Weather/dt=2020-08-17/part-00086.parquet
      Weather/dt=2020-08-17/part-00087.parquet
      Weather/dt=2020-08-17/part-00088.parquet
      Weather/dt=2020-08-18/part-00001.parquet
      Weather/dt=2020-08-18/part-00002.parquet
    Example query:
      SELECT * FROM cos://us-geo/twc/Weather STORED AS parquet WHERE temp > 40
    Metadata:
      Object Name                 Temp Min   Temp Max
      ...
      dt=2020-08-17/part-00085    7.97       26.77
      dt=2020-08-17/part-00086    2.45       23.71
      dt=2020-08-17/part-00087    6.46       18.62
      dt=2020-08-17/part-00088    23.67      41.02
      ...
    Red objects are not relevant to this query
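The skipping decision on this slide can be reproduced in a few lines. This is a minimal sketch of a MinMax check, using the metadata values shown on the slide; the class and function names are illustrative and are not the API of the data skipping library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MinMaxEntry:
    object_name: str
    temp_min: float
    temp_max: float


# Metadata rows copied from the slide.
METADATA = [
    MinMaxEntry("dt=2020-08-17/part-00085", 7.97, 26.77),
    MinMaxEntry("dt=2020-08-17/part-00086", 2.45, 23.71),
    MinMaxEntry("dt=2020-08-17/part-00087", 6.46, 18.62),
    MinMaxEntry("dt=2020-08-17/part-00088", 23.67, 41.02),
]


def objects_to_read(metadata: List[MinMaxEntry], threshold: float) -> List[str]:
    """Objects that may hold a row with temp > threshold.

    If an object's maximum is not above the threshold, no row in it can
    satisfy the predicate, so the object is skipped.
    """
    return [e.object_name for e in metadata if e.temp_max > threshold]


if __name__ == "__main__":
    print(objects_to_read(METADATA, 40))
```

For `WHERE temp > 40`, only `part-00088` (maximum 41.02) survives; the other three listed objects are skipped without being read.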
  • 22. Geospatial Data Skipping Example
    Example query (Raleigh Research Triangle, US):
      SELECT * FROM Weather STORED AS parquet WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat)) INTO cos://us-south/results STORED AS parquet
    Metadata:
      Object Name                 lat Min    lat Max
      ...
      dt=2020-08-17/part-00085    35.02      36.17
      dt=2020-08-17/part-00086    43.59      44.95
      dt=2020-08-17/part-00087    34.86      40.62
      dt=2020-08-17/part-00088    23.67      25.92
      ...
    Red objects are not relevant to this query
    Map the ST_Contains UDF to necessary conditions on lat, long
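Mapping `ST_Contains` to necessary conditions on `lat` can be sketched as a bounding-range overlap test. The sketch below uses the polygon and the lat metadata from the slide; the long metadata is elided on the slide, so only latitude is checked here, and all names are illustrative rather than the library's actual implementation.

```python
from typing import Dict, List, Tuple

# Polygon vertices from the slide's query, as (long, lat) pairs.
POLYGON: List[Tuple[float, float]] = [
    (-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00),
]

# Per-object (lat min, lat max) copied from the slide.
LAT_METADATA: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def lat_bounds(polygon: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Latitude range covered by the polygon's bounding box."""
    lats = [lat for _, lat in polygon]
    return min(lats), max(lats)


def candidate_objects(metadata: Dict[str, Tuple[float, float]],
                      polygon: List[Tuple[float, float]]) -> List[str]:
    """Objects whose lat range overlaps the polygon's lat range.

    Overlap is a necessary condition for any point of the object to lie
    inside the polygon, so non-overlapping objects can be skipped.
    """
    q_min, q_max = lat_bounds(polygon)
    return [name for name, (o_min, o_max) in metadata.items()
            if o_min <= q_max and o_max >= q_min]


if __name__ == "__main__":
    print(candidate_objects(LAT_METADATA, POLYGON))
```

With the slide's values, `part-00085` and `part-00087` remain candidates while `part-00086` and `part-00088` are skipped. The condition is necessary but not sufficient, so the remaining objects still have to be read and filtered by the exact `ST_Contains` predicate.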
  • 23. 10x Acceleration with Data Skipping and Catalog • Query rewrite approach (yellow) is the baseline • Using already optimized data formats: Parquet/ORC • For other formats (e.g. CSV/JSON/Avro) the acceleration is much larger • Experiment uses the Raleigh Research Triangle query • 10x speedup on average • 10 TB of weather data on COS
  • 24. A Real-Life Cloud Data Lake
  • 25. The COVID-19 Data Lake. Making trusted COVID-19 data available to a broad set of analytics, e.g.: • https://accelerator.weather.com/bi • Watson Health Return to Work Advisor. Goals: • Easily extensible with new data sources • Maximized velocity and elasticity • Full automation of all pipelines • New pipeline prototype in hours & productize in 2-3 days • Radically minimizing resource and operational costs by using IBM Cloud serverless and full ops automation. Cloud Functions Cloud Object Storage - Persist - Trigger - Static Content Creation - Schema Management - Pipeline PoCs - Usage Tutorials Watson Studio SQL Query - Transformation - Transport - Table Catalog (Mart) - Queries - Export - Pipeline -Productization - Automation - Monitoring & Alerting - Pull External Data
  • 26. COVID-19 Data Lake Topology – High Level Landing Zone (E) Landing Buckets Preparation Zone (T) Landing Namespace Preparation Namespace Preparation Buckets Integration Zone (L) Dashboarding DWH Integration Buckets Data Mart Instance Integration Namespace Mart Management Project Data Mart Access Project TWC Scrapers & Pipeline Collectors Sequences Preparation Sequences Mart Sequences Delivery Sequences Pipeline Instance Schema Management Static Content Management Pipeline Instance Usage Notebooks Table Catalog Preparation Sequences External Data Sources Pull Push Collectors Sequences Preparation Sequences Usage Notebooks Usage Notebooks Users Pipeline PoC Project Preliminary Pipeline Notebooks Location Statistics Upload Update Reference Data Add Partitions Query & Extract Transform COGNOS