COST EFFICIENT
ALTERNATIVE TO
DATABRICKS
Georg Heiler
Exploring Alternatives for Cost-Effective and Flexible Data Pipelines: bit.ly/efficient-spark
Data expert
Academia & Industry (telco)
Specialties
data architecture, multimodal and
complex data challenges
Thought leader
Meetup organizer & speaker
• Rising importance of
understanding and shaping
supply chains (covid, Ukraine war)
• No fine-grained clean data
accessible
• Abundant un- and semistructured
data → sophisticated cleaning &
parsing required
• Extract and classify links based on
semantic context
Results at a
glance
• 43% Cost Reduction
• Software Engineering
practices
• Future proof flexibility
• Single pane of glass for
pipelines
History
• Mainframe
• Data warehouse
• Big Data (Hadoop)
• SQL on large data (Hive, Spark)
• Cloud DWH (Snowflake,
BigQuery)
PaaS offering
PaaS Solution Comparison
Databricks (DBR)
• Easy to use
• Can be expensive
• Lock-in features
(permissions, catalog)
• Proprietary Photon
engine
AWS Elastic MapReduce
(EMR)
• Price efficient
• Many tuning knobs
available (& required)
• OSS Spark managed
(scaled)
Challenges
• Runaway expenses (usage-based pricing)
• Missing software engineering best practices (notebooks)
• Developer productivity reduced
• Vendor lock-in
Vision
• 0-cost switch
• Software
engineering
practices
• Cost & lock-in
reduction
Orchestrator
(dagster)
Runtime
local
Runtime
remote DBR
Runtime
remote EMR
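The layering in this diagram (one orchestrator, interchangeable runtimes) can be sketched as a small interface. This is a minimal, stdlib-only illustration and not the talk's implementation: the names `Runtime`, `LocalRuntime`, `RemoteRuntimeStub`, and `run_step` are hypothetical, and the remote class only records what it would submit instead of calling any Databricks or EMR API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Runtime(Protocol):
    """Anything that can execute a named pipeline step with parameters."""

    def run(self, step: str, params: dict[str, str]) -> str: ...


class LocalRuntime:
    """Runs the step in-process; stands in for local Spark during development."""

    def run(self, step: str, params: dict[str, str]) -> str:
        return f"local:{step}:{sorted(params.items())}"


@dataclass
class RemoteRuntimeStub:
    """Placeholder for a remote platform (DBR or EMR).

    Hypothetical: a real version would call the platform's job-submission API.
    """

    platform: str
    submitted: list[tuple[str, dict[str, str]]] = field(default_factory=list)

    def run(self, step: str, params: dict[str, str]) -> str:
        self.submitted.append((step, params))
        return f"{self.platform}:{step}:submitted"


RUNTIMES: dict[str, Runtime] = {
    "local": LocalRuntime(),
    "dbr": RemoteRuntimeStub("dbr"),
    "emr": RemoteRuntimeStub("emr"),
}


def run_step(step: str, params: dict[str, str], target: str = "local") -> str:
    """The orchestrator picks a runtime by name; the step definition stays unchanged."""
    return RUNTIMES[target].run(step, params)
```

Switching a step from local to EMR is then a one-argument change, for example `run_step("clean", {"date": "2024-01-01"}, target="emr")`, which is the kind of low-friction switch the vision describes.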
Spark at a glance
Dagster introduction
✗ No distributed monolith of CRON strings
✓ Asset-aware, event-based orchestration
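To illustrate what "asset-aware" means in contrast to independent CRON schedules, here is a minimal sketch that uses only Python's standard library and not Dagster itself. The asset names are hypothetical. The point is that execution order and staleness are derived from declared data dependencies instead of hand-tuned start times.

```python
from graphlib import TopologicalSorter

# Hypothetical assets mapped to the upstream assets they depend on.
ASSET_DEPS: dict[str, set[str]] = {
    "raw_pages": set(),
    "parsed_links": {"raw_pages"},
    "classified_links": {"parsed_links"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every asset runs after its upstream assets."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(changed: str, deps: dict[str, set[str]]) -> set[str]:
    """Assets that become stale (transitively) when `changed` is updated."""
    stale: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for asset, upstream in deps.items():
            if current in upstream and asset not in stale:
                stale.add(asset)
                frontier.append(asset)
    return stale
```

An event such as "`raw_pages` was refreshed" can then trigger exactly the assets returned by `downstream_of("raw_pages", ASSET_DEPS)`, with no schedule strings to keep in sync.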
Observed challenges
• Remote execution
• Parameter injection
• Logging
• Opaque SaaS tools
• Single pane of glass
• Dependency bootstrap
• Missing testability in
notebooks
• Large-scale compute &
orchestrator native
development
Dagster-pipes
Dagster-pipes - Architecture
Dagster-pipes - Sample
External code (with metadata) Internal asset shim orchestrating the execution
of external script
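The split between external code and an orchestrating shim can be sketched with the standard library alone. This is an illustration of the general pattern (inject parameters into an external process, read structured messages back), not Dagster's actual API; the environment variable names, the message format, and `run_external` are hypothetical. Dagster's own library wraps the same idea in helpers such as `open_dagster_pipes` on the external side and Pipes client classes on the orchestrator side.

```python
import json
import os
import subprocess
import sys
import tempfile

# External script: knows nothing about the orchestrator except two env vars.
EXTERNAL_SCRIPT = """
import json, os
params = json.loads(os.environ["SKETCH_PIPES_CONTEXT"])
rows = list(range(params["n_rows"]))
with open(os.environ["SKETCH_PIPES_MESSAGES"], "a") as out:
    out.write(json.dumps({"type": "log", "message": "processing"}) + "\\n")
    out.write(json.dumps({"type": "materialization",
                          "metadata": {"row_count": len(rows)}}) + "\\n")
"""


def run_external(params: dict) -> list[dict]:
    """Shim: launch the external script and collect the messages it reports."""
    with tempfile.TemporaryDirectory() as tmp:
        messages_path = os.path.join(tmp, "messages.jsonl")
        env = {
            **os.environ,
            "SKETCH_PIPES_CONTEXT": json.dumps(params),  # parameter injection
            "SKETCH_PIPES_MESSAGES": messages_path,  # reporting channel
        }
        subprocess.run([sys.executable, "-c", EXTERNAL_SCRIPT], env=env, check=True)
        with open(messages_path) as handle:
            return [json.loads(line) for line in handle]
```

Calling `run_external({"n_rows": 3})` returns the log and materialization messages, so the orchestrator can surface logs and metadata even though the script ran in a separate process.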
Results & Demo
Demo: youtube.com/watch?v=W27C5LpdEkE
Partitioned UI
• Implementation time of DBR is lower
• Implementation complexity of DBR is lower
• More, and more frequent, commits for EMR integration
• Median cost of DBR is higher than EMR
• Variability of execution time of DBR is lower
Implementation lessons
• Complexity of AWS EMR: requires many low-level details about AWS,
spot instances, and networking (master node on a spot instance
=> 💥💥)
• Abstracting the PaaS requires deep understanding of their APIs
Tips
• maximizeResourceAllocation
• LZO
• Delta zorder on partition
• spark.databricks.delta.vacuum.parallelDelete.enabled=true
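Two of these tips are plain configuration, so a small sketch may help show where each one lives. The assumption here is that `maximizeResourceAllocation` is set through EMR's `spark` configuration classification at cluster creation, and that the Delta setting is passed as an ordinary Spark conf; the helper `spark_submit_conf_args` and the constant names are hypothetical, and the values are illustrative, not tuned recommendations.

```python
import json

# EMR cluster-level setting: let EMR size executors to the chosen instance type.
EMR_CONFIGURATIONS = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }
]

# Spark conf named on the slide (parallel file deletion during Delta VACUUM).
SPARK_CONF = {
    "spark.databricks.delta.vacuum.parallelDelete.enabled": "true",
}


def spark_submit_conf_args(conf: dict[str, str]) -> list[str]:
    """Render a conf dict as `--conf key=value` arguments for spark-submit."""
    args: list[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # JSON in the shape EMR accepts for its cluster configurations.
    print(json.dumps(EMR_CONFIGURATIONS, indent=2))
    print(" ".join(spark_submit_conf_args(SPARK_CONF)))
```

Keeping both settings as data in the pipeline repository means they are versioned and reviewed like any other code, which fits the software engineering theme of the talk.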
Summary
• Money saved – 43%
• Bring back software engineering
best practices for data
• Flexibility
• Data PaaS as a commodity
• Take back control
• Best in breed
• Single pane of glass for pipelines
Takeaway – if
you have a
small data
problem
• Pipes lets you quickly bring in existing
scripts while retaining observability
• High-code engineering practices scale
well
• Full control
• Compute technology can easily be
changed (e.g. DuckDB, Daft, …)
data-engineering.expert/2023/12/11/dagster-dbt-duckdb-as-new-local-mds
COST EFFICIENCY
FOR DATA
Georg Heiler
bit.ly/efficient-spark
(data-engineering.expert/2024/06/21/cost-efficient-alternative-to-databricks-lock-in
arxiv.org/abs/2408.11635 github.com/ascii-supply-networks/ascii-hydra/tree/main/src/pipelines/ascii_library_demo )

[DSC DACH 24] Cost efficient alternative to databricks lock-in - Georg Heiler

  • 1. COST EFFICIENT ALTERNATIVE TO DATABRICKS Georg Heiler Exploring Alternatives for Cost-Effective and Flexible Data Pipelines  bit.ly/efficient-spark
  • 2. Data expert Academia & Industry (telco) Specialties data architecture, multimodal and complex data challenges Thought leader Meetup organizer & speaker
  • 3. • Rising importance of understanding and shaping supply chains (covid, Ukraine war) • No fine-grained clean data accessible • Abundant un- and semistructured data → sophisticated cleaning & parsing required • Extract and classify links based on semantic context
  • 4. Results at a glance • 43% Cost Reduction • Software Engineering practices • Future proof flexibility • Single pane of glass for pipelines
  • 5. History • Mainframe • Data warehouse • Big Data (Hadoop) • SQL on large data (Hive, Spark) • Cloud DWH (Snowflake, bigquery)
  • 7. PaaS Solution Comparison Databricks (DBR) • Easy to use • Can be expensive • Lock-in features (permissions, catalog) • Proprietary Photon engine AWS Elastic Map Reduce (EMR) • Price efficient • Many tuning knobs available (& required) • OSS Spark managed (scaled)
  • 8. Challenges • Runaway expenses (usage-based pricing) • Missing software engineering best practices (notebooks) • Developer productivity reduced • Vendor lock-in
  • 9. Vision • 0-cost switch • Software engineering practices • Cost & lock-in reduction Orchestrator (dagster) Runtime local Runtime remote DBR Runtime remote EMR
  • 10. Spark at a glance
  • 11. Dagster introduction ✗ No distributed monolith of CRON strings ✓ Asset-aware, event-based orchestration
  • 12. Observed challenges • Remote execution • Parameter injection • Logging • Opaque SaaS tools • Single pane of glass • Dependency bootstrap • Missing testability in notebooks • Large-scale compute & orchestrator native development Orchestrator (dagster) Runtime local Runtime remote DBR Runtime remote EMR
  • 15. Dagster-pipes - Sample External code (with metadata) Internal asset shim orchestrating the execution of external script
  • 20. Implementation complexity of DBR is lower: more, and more frequent, commits were needed for the EMR integration
  • 21. Median cost of DBR is higher than EMR
  • 23. Implementation lessons • Complexity of AWS EMR: many low-level details about AWS, spot instances, and networking are required (master on spot instance => 💥💥) • Abstracting the PaaS requires deep understanding of their APIs Tips • maximizeResourceAllocation • LZO • Delta zorder on partition • spark.databricks.delta.vacuum.parallelDelete.enabled=true
  • 24. Summary • Money saved – 43% • Bring back software engineering best practices for data • Flexibility • Data PaaS as a commodity • Take back control • Best in breed • Single pane of glass for pipelines
  • 25. Takeaway – if you have a small data problem • Pipes lets you quickly bring in existing scripts whilst retaining observability • High-code engineering practices scale well • Full control • Compute technology can easily be changed (e.g. duckdb, daft, …) data-engineering.expert/2023/12/11/dagster-dbt-duckdb-as-new-local-mds
  • 26. COST EFFICIENCY FOR DATA Georg Heiler bit.ly/efficient-spark (data-engineering.expert/2024/06/21/cost-efficient-alternative-to-databricks-lock-in arxiv.org/abs/2408.11635 github.com/ascii-supply-networks/ascii-hydra/tree/main/src/pipelines/ascii_library_demo )
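
The vision on slide 9 (one orchestrator, interchangeable local, DBR, and EMR runtimes) and the tips on slide 23 can be sketched together in plain Python. This is a minimal illustration under stated assumptions, not the talk's implementation: `Runtime`, `RUNTIMES`, `select_runtime`, `plan_submission`, the `PIPELINE_RUNTIME` environment variable, and the `extract_links` step name are all hypothetical, and nothing is actually submitted to a cluster. The only values taken from the slides are the two settings, `maximizeResourceAllocation` (placed here in EMR's `spark` configuration classification) and `spark.databricks.delta.vacuum.parallelDelete.enabled`; applying the latter to both remote runtimes is an assumption.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Runtime:
    """Hypothetical description of where a pipeline step runs."""
    name: str
    spark_conf: Dict[str, str] = field(default_factory=dict)
    # Cluster-level settings in the shape EMR uses for "Configurations".
    emr_configurations: List[dict] = field(default_factory=list)


# Tip from slide 23. Assumption: set on every remote runtime that vacuums Delta tables.
DELTA_CONF = {"spark.databricks.delta.vacuum.parallelDelete.enabled": "true"}

RUNTIMES = {
    "local": Runtime(name="local"),
    "dbr": Runtime(name="dbr", spark_conf=dict(DELTA_CONF)),
    "emr": Runtime(
        name="emr",
        spark_conf=dict(DELTA_CONF),
        emr_configurations=[
            {
                "Classification": "spark",
                "Properties": {"maximizeResourceAllocation": "true"},
            }
        ],
    ),
}


def select_runtime(name: Optional[str] = None) -> Runtime:
    """Pick a runtime by name, falling back to PIPELINE_RUNTIME, then to local."""
    key = name or os.environ.get("PIPELINE_RUNTIME", "local")
    if key not in RUNTIMES:
        raise ValueError(f"unknown runtime {key!r}; expected one of {sorted(RUNTIMES)}")
    return RUNTIMES[key]


def plan_submission(step: str, runtime: Runtime, params: Dict[str, str]) -> dict:
    """Build the payload an orchestrator would hand to the chosen runtime."""
    return {
        "step": step,
        "runtime": runtime.name,
        "params": dict(params),
        "spark_conf": dict(runtime.spark_conf),
        "emr_configurations": list(runtime.emr_configurations),
    }


if __name__ == "__main__":
    # The step definition stays the same; only the runtime name changes.
    for target in ("local", "dbr", "emr"):
        print(plan_submission("extract_links", select_runtime(target), {"date": "2024-06-21"}))
```

Because the step and its parameters never mention a platform, switching from DBR to EMR changes one string, which is the idea behind the "0-cost switch".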

Editor's Notes

  • #1: Mention talk modality interactive ask questions during the talk
  • #2: Supply Chain, Text analytics & data architecture & pipelines, graphs, spatial time series
  • #4: Physical goods, software, 400 TiB Commoncrawl, AIS, satellite, OSINT, …
  • #5: Easy GPU/accelerator prototyping. AWS is migrating to daft: https://www.getdaft.io/ https://aws.amazon.com/de/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/ As we are in control of how the individual steps in the pipeline relate to each other, we can relatively easily switch out the compute framework (as AWS did).
  • #9: PaaS solutions offer big benefits: easy scalability. They are the single centerpiece of the data engineering strategy, not just an implementation detail. Runaway expenses due to usage-based pricing. High CI costs for spinning up resources. Prioritizes simplicity over best practices. All-notebook environments: limited code reuse, limited testability, limited VCS integration. Developer productivity hampered by VM spin-up times. Single central platform dependence.
  • #10: PaaS as implementation detail. Containerization, CI/CD, testability, flexibility
  • #11: Img: https://intellipaat.com/blog/tutorial/spark-tutorial/spark-architecture/
  • #13: Single pane of glass for operative monitoring of opaque SaaS tools. Allows us to get standard software best practices like testing, modularity, DRY, ... and maintain these in the SaaS tools. It even allows us to abstract SaaS vendors (Databricks vs. EMR) and substitute one for the other to save money for large-scale pipelines where we just need compute without wanting to pay for all the enterprisey extra features. 1) bootstrap of the remote execution environments 2) centralized logging 3) single/simple start/stop of all pipelines in dagster 4) integration of upstream/downstream pipeline steps in one place
  • #20: The volume of trial runs required to achieve stability on EMR is high. It shows the complex setup and optimization demands of these platforms. Yet, now that it is set up, it proves hugely beneficial for us.
  • #21: EMR was labor-intensive. This is shown by more failed and successful trials. They happened before the product was ready. In fact we required almost twice as many trial runs for EMR as for Databricks. The increase was mainly due to the complexity of setting up EMR systems. They had to handle large datasets well and safely. This setup needs lots of customization and tuning. Databricks provided these features out of the box. EMR demanded more frequent code changes. They were sometimes extensive. This reflected a steeper learning curve and higher complexity. But, it was in exchange for lower costs.
  • #24: Teams need to learn the specifics of each platform. This includes API differences, setup quirks, and best practices. This learning curve can delay initial deployment and require additional training.
  • #25: For small-ish to medium workloads, EMR works fine out of the box => we can save money (cheaper compute). For very large workloads, we use EMR with fine-tuning. It lets us save money with cheaper compute. For special workloads, Photon is very profitable. We use Databricks to save money and time again.
  • #26: https://aws.amazon.com/de/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/
  • #27: Mention talk modality interactive ask questions during the talk
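
The routing rule in note #25 can be written down as a small lookup. This is a sketch only: the `Workload` categories, `Choice`, and `choose_runtime` are hypothetical names, and the note gives no size thresholds, so classifying a workload is left to the caller.

```python
from enum import Enum
from typing import NamedTuple


class Workload(Enum):
    SMALL_TO_MEDIUM = "small_to_medium"
    VERY_LARGE = "very_large"
    SPECIAL = "special"  # workloads where Photon pays off


class Choice(NamedTuple):
    platform: str
    needs_tuning: bool


def choose_runtime(workload: Workload) -> Choice:
    """Map a workload class to the platform named in the note."""
    if workload is Workload.SMALL_TO_MEDIUM:
        return Choice("emr", needs_tuning=False)  # works out of the box
    if workload is Workload.VERY_LARGE:
        return Choice("emr", needs_tuning=True)  # EMR with fine-tuning
    return Choice("databricks", needs_tuning=False)  # Photon saves money and time


if __name__ == "__main__":
    for kind in Workload:
        print(kind.name, choose_runtime(kind))
```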