Kim Hammar, Logical Clocks AB (@KimHammar1)
Jim Dowling, Logical Clocks AB (@jim_dowling)
End-to-End ML Pipelines with Databricks Delta and Hopsworks Feature Store
#UnifiedDataAnalytics #SparkAISummit
Machine Learning in the Abstract
Where does the Data come from?
“Data is the hardest part of ML and the most important piece to get
right. Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.” [Uber on Michelangelo]
Data comes from the Feature Store
How do we feed the Feature Store?
Outline
1. Hopsworks
2. Databricks Delta
3. Hopsworks Feature Store
4. Demo
5. Summary
[Diagram: Hopsworks platform architecture]
Inputs and outputs: Datasources; Applications, API, Dashboards
Pipeline stages (orchestration in Airflow): Data Preparation & Ingestion; Experimentation & Model Training; Deploy & Productionalize
Components shown: Apache Beam, Apache Spark, Apache Flink, Kafka + Spark Streaming, Pip, Conda, TensorFlow, scikit-learn, Keras, Jupyter Notebooks, TensorBoard, Kubernetes, Model Serving, Model Monitoring, Hopsworks Feature Store
Labels: Batch, Streaming, Distributed ML & DL
Filesystem and Metadata storage: HopsFS
Next-Gen Data Lakes
Data Lakes are starting to resemble databases:
– Apache Hudi, Delta, and Apache Iceberg add:
• ACID transactional layers on top of the data lake
• Indexes to speed up queries (data skipping)
• Incremental Ingestion (late data, delete existing records)
• Time-travel queries
Problems: No Incremental Updates, No rollback on failure, No Time-Travel, No Isolation.
Solution: Incremental ETL with ACID Transactions
Upsert & Time Travel Example
Upsert == Insert or Update
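The upsert semantics named on this slide can be sketched in plain Python. This is only an illustration, not the Delta or Hudi API: the `upsert` helper and the `id` key column are hypothetical names. Rows whose key already exists are updated; all other rows are inserted.

```python
def upsert(table, rows, key="id"):
    """Return a new table (dict keyed by `key`) with `rows` merged in."""
    merged = dict(table)  # copy so the input table is left untouched
    for row in rows:
        # Existing key: update the stored row; new key: insert it.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return merged


# Hypothetical data: id 2 already exists (update), id 3 is new (insert).
table = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
updates = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
result = upsert(table, updates)
```

Returning a new dict rather than mutating the old one mirrors the idea that each write produces a new table version while the previous one stays readable.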
Version Data By Commits
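A toy model of versioning data by commits, assuming an append-only list of commits that is replayed to answer a time-travel read. The `VersionedTable` class and its methods are hypothetical and much simpler than the log formats used by Delta, Hudi, or Iceberg.

```python
class VersionedTable:
    """Toy table that stores every commit and can be read as of any version."""

    def __init__(self):
        self._commits = []  # commit i holds the rows upserted by version i

    def commit(self, rows, key="id"):
        """Append a batch of upserts as a new commit; return its version."""
        self._commits.append({row[key]: row for row in rows})
        return len(self._commits) - 1

    def read(self, version=None):
        """Replay commits 0..version (latest if None) into a snapshot."""
        if version is None:
            version = len(self._commits) - 1
        snapshot = {}
        for batch in self._commits[: version + 1]:
            snapshot.update(batch)
        return snapshot


t = VersionedTable()
v0 = t.commit([{"id": 1, "amount": 10}])
v1 = t.commit([{"id": 1, "amount": 15}, {"id": 2, "amount": 20}])
old_snapshot = t.read(v0)  # time travel: the table as of the first commit
latest_snapshot = t.read()  # the table after all commits
```

Because commits are never rewritten, reading an old version only requires replaying a shorter prefix of the log.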
Delta Lake by Databricks
• Delta Lake is a Transactional Layer that sits on top of your Data Lake:
– ACID Transactions with Optimistic Concurrency Control
– Log-Structured Storage
– Open Format (Parquet-based storage)
– Time-travel
Delta Datasets
Optimistic Concurrency Control
Mutual Exclusion for Writers
Optimistic Retry
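The three ideas above (optimistic concurrency control, mutual exclusion for writers, optimistic retry) can be sketched with a toy commit log. This is a sketch under stated assumptions, not Delta Lake's implementation: `CommitLog` and `commit_with_retry` are hypothetical names, and the `threading.Lock` stands in for whatever atomic put-if-absent primitive the storage layer provides.

```python
import threading


class CommitLog:
    """Toy transaction log: version N may be written exactly once."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()  # stands in for an atomic put-if-absent

    def latest_version(self):
        with self._lock:
            return max(self._entries, default=-1)

    def try_commit(self, version, actions):
        """Write `actions` as `version` only if that version is still free."""
        with self._lock:
            if version in self._entries:
                return False  # another writer won the race
            self._entries[version] = actions
            return True


def commit_with_retry(log, actions, max_attempts=10):
    """Optimistically commit on top of the latest version, retrying on conflict."""
    for _ in range(max_attempts):
        target = log.latest_version() + 1
        if log.try_commit(target, actions):
            return target
        # Conflict: a real system would first check whether the winning
        # commit invalidates what this writer read before retrying.
    raise RuntimeError("too many conflicting writers")


log = CommitLog()
threads = [
    threading.Thread(target=commit_with_retry, args=(log, [f"writer-{i}"]))
    for i in range(4)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Writers never hold a lock while preparing data; they only contend at the moment of claiming the next version number, which is what makes the scheme optimistic.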
Scalable Metadata Management
Other Frameworks: Apache Hudi, Apache Iceberg
• Hudi was developed by Uber for their Hadoop Data Lake (HDFS first, then S3 support)
• Iceberg was developed by Netflix with S3 as target storage layer
• All three frameworks (Delta, Hudi, Iceberg) have common goals of adding ACID updates, incremental ingestion, efficient queries.
Next-Gen Data Lakes Compared
| | Delta | Hudi | Iceberg |
|---|---|---|---|
| Incremental Ingestion | Spark | Spark | Spark |
| ACID updates | HDFS, S3* | HDFS | S3, HDFS |
| File Formats | Parquet | Avro, Parquet | Parquet, ORC |
| Data Skipping (File-Level Indexes) | Min-Max Stats + Z-Order Clustering* | File-Level Max-Min stats + Bloom Filter | File-Level Max-Min Filtering |
| Concurrency Control | Optimistic | Optimistic | Optimistic |
| Data Validation | Expectations (coming soon) | In Hopsworks | N/A |
| Merge-on-Read | No | Yes (coming soon) | No |
| Schema Evolution | Yes | Yes | Yes |
| File I/O Cache | Yes* | No | No |
| Cleanup | Manual | Automatic, Manual | No |
| Compaction | Manual | Automatic | No |

*Databricks version only (not open-source)
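The "Data Skipping" row refers to file-level min-max statistics. A minimal sketch of the idea, assuming one numeric filter column and hypothetical names (`FileStats`, `files_to_scan`): a file is read only if its value range can overlap the query range.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int  # smallest value of the filter column in this file
    max_value: int  # largest value of the filter column in this file


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Hypothetical per-file statistics for three data files.
files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]
selected = files_to_scan(files, 150, 210)
```

Here a query for values between 150 and 210 skips `part-0.parquet` entirely. Skipping works best when data is clustered so that file ranges overlap little.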
How can a Feature Store leverage Log-Structured Storage (e.g., Delta or Hudi or Iceberg)?
Hopsworks Feature Store
[Diagram: Hopsworks Feature Store architecture]
Feature Mgmt: Statistics, Discovery, Access Control, Time Travel, Data Validation
Storage: MySQL Cluster (Metadata, Online Features); Apache Hive Columnar DB (Offline Features); Training Data (S3, HDFS); Models (HopsFS)
Access: Online Features, Offline Features; Online Apps, Batch Apps; JDBC (SAS, R, etc.); Pandas or PySpark DataFrame
Data Engineer: Feature Data Ingestion; Feature CRUD (add/remove features, access control, feature data validation)
Data Scientist: discover features, create training data, save models, read online/offline/on-demand features, historical feature values
External DB: Feature Defn "select .."
AWS SageMaker and Databricks Integration
• Computation engine (Spark)
• Incremental ACID Ingestion
• Time-Travel
• Data Validation
• On-Demand or Cached Features
• Online or Offline Features
Incremental Feature Engineering with Hudi
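One way to picture incremental feature engineering is an incremental pull: a job remembers the last commit it processed and reads only rows written by later commits. The sketch below is plain Python with hypothetical names (`incremental_pull`, integer commit ids); it is not the Hudi API.

```python
def incremental_pull(commits, last_seen):
    """Return rows from commits with id > last_seen, plus the new checkpoint."""
    new_commits = [(cid, rows) for cid, rows in commits if cid > last_seen]
    rows = [row for _, batch in new_commits for row in batch]
    checkpoint = max((cid for cid, _ in new_commits), default=last_seen)
    return rows, checkpoint


# Hypothetical commit timeline: (commit id, rows written by that commit).
commits = [(1, [{"id": 1}]), (2, [{"id": 2}]), (3, [{"id": 1}, {"id": 3}])]
rows, checkpoint = incremental_pull(commits, last_seen=1)
```

A feature pipeline would recompute features only for the returned rows and store `checkpoint` for its next run, instead of rescanning the whole table.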
Point-in-Time Correct Feature Data
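Point-in-time correctness means a training row may only use feature values that were already known at the time of its label event. A minimal sketch, assuming a per-entity feature history sorted by timestamp; `point_in_time_lookup` and the sample values are hypothetical.

```python
from bisect import bisect_right


def point_in_time_lookup(history, event_time):
    """Return the latest feature value with timestamp <= event_time, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one entity: (timestamp, feature value).
history = [(1, 10.0), (5, 12.5), (9, 20.0)]
# Label events at these times; each must only see feature values from its past.
training_rows = [(t, point_in_time_lookup(history, t)) for t in (0, 5, 7, 10)]
```

The event at time 7 gets the value written at time 5, not the later value from time 9, which avoids leaking future information into training data.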
Feature Time Travel with Hudi and Hopsworks Feature Store
Demo: Hopsworks Feature Store + Databricks Platform
Summary
• Delta, Hudi, Iceberg bring Reliability, Upserts & Time-Travel to Data Lakes
– Functionalities that are well suited for Feature Stores
• Hopsworks Feature Store builds on Hudi/Hive and is the world’s first open-source Feature Store (released 2018)
• The Hopsworks Platform also supports End-to-End ML pipelines using the Feature Store and Spark/Beam/Flink, TensorFlow/PyTorch, and Airflow
Thank you!
470 Ramona St, Palo Alto
Kista, Stockholm
https://www.logicalclocks.com
Register for a free account at www.hops.site
Twitter: @logicalclocks, @hopsworks
GitHub:
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops
References
• Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/
• Python-First ML Pipelines with Hopsworks. https://hops.readthedocs.io/en/latest/hopsml/hopsML.html
• Hopsworks white paper. https://www.logicalclocks.com/whitepapers/hopsworks
• HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
• Open Source: https://github.com/logicalclocks/hopsworks and https://github.com/hopshadoop/hops
• Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, Alex Ormenisan, Rasmus Toivonen, Steffen Grohsschmiedt, and Moritz Meister