A Practical Feature Store on Delta Lake
Nathan Buesgens
ML Operations
Bryan Christian
Data Science
Agenda
§ What is a Feature Store?
▪ MLOps for Acceleration and Governance in the Enterprise
▪ Feature Store: Use Cases
▪ Edge Cases: 80/20
▪ Relation to the Data Warehouse
§ Design Reference
▪ Logical Data Model & Access Patterns
▪ Physical Representation in the Delta Lake
What is a Feature Store?
75% Reduction in Feature Engineering “Data Wrangling” Time
15X Accelerated Model Delivery with MLOps Automation and Governance
END-TO-END VALUE DELIVERY
TIME TO VALUE & CONCURRENCY
SCALABLE INFRASTRUCTURE
I.E. AVOID: “PROOF OF CONCEPT FACTORY”
MLOps: Data Science at Scale
The feature store serves as the consumption layer for ML applications. It provides:
• Acceleration: pre-“hardened” features reduce data wrangling time for the Data Scientist.
• Governance: a common consumption pattern ensures nothing is lost in the translation to production.
[Diagram labels: Curated Data; Bottleneck (Feature Engineering, Modelling); Feature Store; Modelling; Predictions]
Example: Feature Store
Infrastructure to support DS + MLE
The Feature Store is built on the following data science requirements that are relevant to predictive analytics in Financial Services use cases.
• Correct and consistently applied joins across multiple Delta files without loss of processing speed.
• Aggregations, window functions, and transformations of data.
• Granularity of point in time and level of the prediction (e.g. individual, account, etc.).
customer_id  as_of       feature_name_last_0-30_days_prior  feature_name_last_31-60_days_prior  feature_name_next_1-30_days
12345        2021-05-01  0.43                               0.32                                0.21
23456        2021-05-01  0.99                               0.94                                0.98
34567        2021-05-01  0.03                               0.92                                0.13
45678        2021-05-01  0.42                               0.59                                0.50
The Feature Store uses an “as_of” date as the point-in-time granularity for both backward- and forward-facing windows. Code-embedded metadata makes it easy to exclude forward-facing windows from the “independent” variables to prevent feature leakage.
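The metadata-driven exclusion of forward-facing windows can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: the `window` tag and the helper name `split_by_window` are assumptions made for the example, and only the column names come from the table above.

```python
# Hypothetical metadata: each column is tagged with the direction of its window.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior": {"window": "backward"},
    "feature_name_last_31-60_days_prior": {"window": "backward"},
    "feature_name_next_1-30_days": {"window": "forward"},
}


def split_by_window(columns, metadata):
    """Separate backward-facing columns (usable as inputs) from forward-facing ones."""
    independent = [c for c in columns if metadata[c]["window"] == "backward"]
    forward = [c for c in columns if metadata[c]["window"] == "forward"]
    return independent, forward


independent, forward = split_by_window(list(FEATURE_METADATA), FEATURE_METADATA)
```

Only the `independent` list would be passed to a model as inputs; the forward-facing columns know about the future relative to “as_of” and would leak it if used as features.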
Data Science Use Cases
§ Many ML use cases don’t have an online requirement, esp. “Human + AI”.
§ Extending the MVP:
▪ Some online use cases can be reframed as streaming use cases.
▪ Online use cases can be met with an extension to the Delta Lake design.
▪ See: feast.dev
§ Low-code & citizen science expands the user base; it doesn’t necessarily accelerate existing users.
§ 80/20 value from: Optimizing Access vs. Optimizing ETL Development
“Online” Features: Ultra-Low-Latency, Ultra-Timely Point Reads
Low-Code ETL: Configuration Based, AutoML, FeatureFlow, etc.
Edge Cases
Opportunities to Simplify for an 80/20 Feature Store MVP
• Similarities
▪ “Golden” aggregates of curated data.
▪ Highly structured, well-defined granularities (esp. as 80/20 solution).
▪ Similar non-functional requirements for strong governance standards, metadata management, discovery, etc.
• Differences
▪ Different Use Case: BI vs. Modelling
▪ Different Access Patterns, therefore:
▪ Different Data Model
▪ Different Technology Stack
▪ Supervised learning creates complex requirements for “point in time accurate data”.
Comparison with Data Warehouse
i.e. Dimensional Model
Design
1 WINDOW FUNCTIONS
2 WATERMARK
3 FEATURE LEAKAGE
Point in Time Accurate Data
Three Ways Inconsistency Sneaks In
Structured Streaming Programming Guide
Granularity
§ The “Entity”: the thing being modelled. Term borrowed from Feast.
§ “As of”: every feature for an entity “as of” a date.
▪ Discrete granularity (daily, hourly, etc.), not an “event time”.
▪ 80/20 solution.
▪ For “continuous” granularity see: Feast.
Columns
§ Features: un-vectorized (80/20).
§ Targets: necessarily at the same granularity as features.
§ Predictions: one model’s prediction is often another’s feature.
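One way to picture this logical model is as a row keyed by entity and “as of” date, with separate groups of columns for features, targets and predictions. The sketch below is only an illustration: the class names `FeatureKey` and `FeatureRow`, the helper `to_as_of`, and the choice of a daily grid are all assumptions, and treating the forward-facing column as a target is an interpretation rather than something the slides state.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass(frozen=True)
class FeatureKey:
    """Granularity of a row: one entity at one discrete 'as of' date."""
    entity_id: str
    as_of: date


def to_as_of(event_time: datetime) -> date:
    """Snap a continuous event time onto the discrete (here: daily) grid."""
    return event_time.date()


@dataclass
class FeatureRow:
    key: FeatureKey
    features: dict = field(default_factory=dict)     # un-vectorized scalar columns
    targets: dict = field(default_factory=dict)      # same granularity as features
    predictions: dict = field(default_factory=dict)  # may feed other models as features


row = FeatureRow(
    key=FeatureKey("12345", to_as_of(datetime(2021, 5, 1, 14, 30))),
    features={"feature_name_last_0-30_days_prior": 0.43},
    targets={"feature_name_next_1-30_days": 0.21},
)
```

Because the key is a frozen dataclass, two rows for the same entity and date compare equal on their keys, which is what makes joins across feature tables straightforward.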
Feature Store Logical Model
Data Model for Feature Store Access
• No need to rebuild the whole feature store when new features are added. (Certain sets of features might be rebuilt at times, though with much shorter downtime.)
• Keyword searching is enabled for features, so you can find any feature you're looking for using "human" logic.
• Tuning can be specific to each set of features, allowing more optimal feature creation.
The SDK indexes the available features and, upon request, builds the joins to combine all desired features into one cohesive data frame, providing a production-grade feature selection tool.
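The idea of indexing features and building joins on request can be sketched in plain Python. This is a toy model under stated assumptions: the tables are in-memory dictionaries rather than Delta tables, and the names `build_index` and `select_features` are hypothetical, not part of the SDK described here.

```python
# Two hypothetical feature tables, each keyed by (customer_id, as_of).
TABLE_A = {
    ("12345", "2021-05-01"): {"feature_name_1": 0.43},
    ("23456", "2021-05-01"): {"feature_name_1": 0.99},
}
TABLE_B = {
    ("12345", "2021-05-01"): {"feature_name_qwerty_1": 0.32},
    ("23456", "2021-05-01"): {"feature_name_qwerty_1": 0.94},
}
TABLES = {"table_a": TABLE_A, "table_b": TABLE_B}


def build_index(tables):
    """Map each feature name to the table that holds it."""
    index = {}
    for table_name, rows in tables.items():
        for row in rows.values():
            for feature in row:
                index[feature] = table_name
    return index


def select_features(tables, index, features):
    """Inner-join only the tables needed for the requested features."""
    if not features:
        return []
    needed = {index[f] for f in features}
    keys = set.intersection(*(set(tables[t]) for t in needed))
    result = []
    for key in sorted(keys):
        row = {"customer_id": key[0], "as_of": key[1]}
        for f in features:
            row[f] = tables[index[f]][key][f]
        result.append(row)
    return result


INDEX = build_index(TABLES)
frame = select_features(TABLES, INDEX, ["feature_name_1", "feature_name_qwerty_1"])
```

Since every table shares the same key, adding a new feature table only extends the index; existing tables do not need to be rebuilt.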
find(): Search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
select(): Use when you know exactly the features you want.
select_by(): Select columns and return a dataframe by giving a date, keys, keywords or regex.
Core Functionality
SDK for Feature Store
find(): Search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
Arguments
  regexp           A regular expression.
  kwrds            A list of key words to look for.
  keys             A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True,}.
  kwrds_exclude    A list of words to exclude from search.
  partial          If kwrds is used, this decides if it should find all or any of them when searching.
  partial_exclude  If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
  verbose          If True, prints out results; otherwise just returns them.
  case_sensitive   If True, an exact match is required to return results.
Example
Calling the feature store with “fs”, a command could be:
    fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+")
With a returned result of…
    Your search returned 20 results…
    feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'}
    feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'}
    ...
The find method searches through all features given a set of criteria and returns any matches within the name or metadata of columns. It is a great tool for exploring the data without pulling in massive datasets.
Value to Data Scientist
Explore what features are in the feature store via metadata, and leverage metadata to enforce governance (e.g., no PI, 3rd party data, etc. as needed).
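A metadata search of this kind can be approximated with the standard `re` module. The sketch below is a guess at the behaviour, not the SDK's code: the catalog is an in-memory dictionary, only a subset of the arguments is modelled, `partial=True` is assumed to mean "match any keyword", and `case_sensitive` is assumed to toggle case-sensitive matching.

```python
import re

# Hypothetical catalog: feature name -> metadata, mirroring the example output.
CATALOG = {
    "feature_name_1": {
        "comment": "Flag if asdf > 0.3 at any point within the last 3 months."
    },
    "feature_name_qwerty_1": {
        "comment": "Average number of widgets customer purchased in the last 0-1 months."
    },
}


def find(catalog, regexp=None, kwrds=None, partial=False, case_sensitive=False):
    """Return the features whose name or metadata match the given criteria."""
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = {}
    for name, meta in catalog.items():
        # Search the feature name together with all of its metadata values.
        text = name + " " + " ".join(str(v) for v in meta.values())
        if regexp is not None and not re.search(regexp, text, flags):
            continue
        if kwrds:
            hits = [re.search(re.escape(k), text, flags) is not None for k in kwrds]
            if not (any(hits) if partial else all(hits)):
                continue
        matches[name] = meta
    return matches


found_any = find(CATALOG, kwrds=["asdf", "qwerty"], partial=True)
found_all = find(CATALOG, kwrds=["asdf", "qwerty"], partial=False)
```

Only names and metadata are scanned, so exploration never touches the feature values themselves.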
SDK for Feature Store
select(): Use when you know exactly the features you want.
The select method will return a dataframe of all selected features with the given date.
Arguments
  date       Return features given a specific date, or use "latest" to return the last updated feature date. For specific dates, pass a dictionary with an operator and a date, i.e. {">": "2021-05-01"}.
  *features  Feature names as strings.
Example
Calling the feature store with “fs”, a command could be:
    dataframe_name = fs.select(
        "latest",  # Give a date {"=": "2021-05-01"} or "latest" for the newest available features
        "feature_name_last_0-30_days_prior", "feature_name_last_31-60_days_prior",
        "feature_name_next_1-30_days",  # List the features you want
    )
    display(dataframe_name)
With a returned result of…
    customer_id  as_of       feature_name_last_0-30_days_prior  feature_name_last_31-60_days_prior  feature_name_next_1-30_days
    12345        2021-05-01  0.43                               0.32                                0.21
    23456        2021-05-01  0.99                               0.94                                0.98
Value to Data Scientist
Consistent way of selecting the same feature set from the feature store when creating a dataframe, both in dev and when deployed in production.
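The date argument, either "latest" or a one-entry dictionary such as {">": "2021-05-01"}, could be resolved against the available “as of” dates along these lines. This is a sketch under assumptions: the function name `resolve_dates` is hypothetical, dates are ISO strings (which compare correctly as text), and only "=" and ">" appear in the slides; the other operators are added for illustration.

```python
import operator

# Operator symbols mapped to comparison functions ("=" and ">" come from the slides).
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def resolve_dates(available, date_arg):
    """Return the as_of dates selected by "latest" or an {operator: date} dict."""
    if date_arg == "latest":
        return [max(available)]
    ((symbol, value),) = date_arg.items()  # exactly one operator/date pair expected
    compare = OPERATORS[symbol]
    return sorted(d for d in available if compare(d, value))


AVAILABLE = ["2021-04-01", "2021-05-01", "2021-06-01"]
latest = resolve_dates(AVAILABLE, "latest")
after_may = resolve_dates(AVAILABLE, {">": "2021-05-01"})
```

Resolving the date in one place is part of what keeps a selection identical between development and production runs.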
SDK for Feature Store
select_by(): Select columns and return a dataframe by giving a date, keys, keywords or regex.
The select_by method searches through all features given a set of criteria and returns a dataframe including all the features that match the criteria within the name or metadata.
Arguments
  date             Return features given a specific date, or use "latest" to return the last updated feature date. For specific dates, pass a dictionary with an operator and a date, i.e. {">": "2021-05-01"}.
  regexp           A regular expression.
  kwrds            A list of key words to look for.
  keys             A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True,}.
  kwrds_exclude    A list of words to exclude from search.
  partial          If kwrds is used, this decides if it should find all or any of them when searching.
  partial_exclude  If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
  case_sensitive   If True, an exact match is required to return results.
Example
Calling the feature store with “fs”, a command could be:
    dataframe_name = fs.select_by({"=": "2021-05-01"},
                                  regexp="^(?=.*asdf)(?=.*qwerty).+")
    display(dataframe_name)
With a returned result of…
    customer_id  as_of       feature_name_1  feature_name_qwerty_1  …
    12345        2021-05-01  0.43            0.32                   …
    23456        2021-05-01  0.99            0.94                   …
Value to Data Scientist
Consistent way of exploring the feature store and leveraging metadata for selection while simultaneously creating a dataframe with the selected features.
SDK for Feature Store
[Architecture diagram labels: The Delta Lake (Bronze, Silver, Gold, linked by ETL); Gold holds BI Consumption: Dimensional Model and ML Consumption: Feature Store; Optional: Consumption Optimized Databases (High Concurrency Data Warehouse, Low Latency Memory Cache), each marked Mirror]
Implementation on the Data Lake
[Architecture diagram labels: The Delta Lake (Bronze, Silver, ML Consumption: Feature Store, linked by ETL); Optional: Consumption Optimized Databases (Low Latency Memory Cache, marked Mirror); Historic Feature Queries; Online Point Reads]
SDK (Data Access Layer)
• Consistent view of “online” and “historic” features.
• Separation of logical and physical models.
• Metadata-focused query interface for data science exploration.
Implementation on the Data Lake
Physical Feature Tables
Two Choices
“As Of” Granularity
§ Pre-defined time aggregations.
§ Simplifies “point in time joins”.
§ Not as flexible or timely.
§ Alternative: “Dynamic Point in Time Joins”, demonstrated by Feast. More flexible, improved timeliness.
Multiple feature tables
§ Technically possible to use a single wide table.
§ Simplifies:
▪ Schema Migration
▪ Query Planning & Optimization
▪ Scheduling
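Pre-defined time aggregations at an “as of” granularity can be illustrated with a small, self-contained sketch. Everything here is an assumption made for the example: the raw events, the use of a mean as the aggregate, the inclusive day boundaries, and the helper names `window_mean` and `build_row`. A real implementation on Delta Lake would use Spark window or group-by operations instead.

```python
from datetime import date

# Hypothetical raw events: (customer_id, event_date, value).
EVENTS = [
    ("12345", date(2021, 4, 20), 10.0),
    ("12345", date(2021, 4, 25), 20.0),
    ("12345", date(2021, 3, 15), 40.0),
    ("12345", date(2021, 5, 3), 99.0),  # after the as_of date: must not be used
]


def window_mean(events, customer_id, as_of, min_days, max_days):
    """Mean of values whose age in days at as_of lies in [min_days, max_days]."""
    values = [
        value
        for cid, event_date, value in events
        if cid == customer_id and min_days <= (as_of - event_date).days <= max_days
    ]
    return sum(values) / len(values) if values else None


def build_row(events, customer_id, as_of):
    """One feature row per (customer_id, as_of) with fixed backward-facing windows."""
    return {
        "customer_id": customer_id,
        "as_of": as_of.isoformat(),
        "feature_name_last_0-30_days_prior": window_mean(events, customer_id, as_of, 0, 30),
        "feature_name_last_31-60_days_prior": window_mean(events, customer_id, as_of, 31, 60),
    }


row = build_row(EVENTS, "12345", date(2021, 5, 1))
```

Because every row is computed relative to its own “as of” date, consumers can join feature tables on (customer_id, as_of) without reasoning about event times, at the cost of the flexibility that dynamic point-in-time joins offer.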
Summary
1. Feature stores accelerate data science & enable better governance.
2. Most design complexity stems from machine learning requirements for point in time accurate data.
3. 80/20 solutions possible by carefully considering “online” requirements.
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 

A Practical Enterprise Feature Store on Delta Lake

  • 1. A Practical Feature Store on Delta Lake. Nathan Buesgens (ML Operations), Bryan Christian (Data Science)
  • 2. Agenda § What is a Feature Store? ▪ MLOps for Acceleration and Governance in the Enterprise ▪ Feature Store: Use Cases ▪ Edge Cases: 80/20 ▪ Relation to the Data Warehouse § Design Reference ▪ Logical Data Model & Access Patterns ▪ Physical Representation in the Delta Lake
  • 3. What is a Feature Store?
  • 4. 75% Reduction in Feature Engineering “Data Wrangling” Time 15X Accelerated Model Delivery with MLOps Automation and Governance END-TO-END VALUE DELIVERY TIME TO VALUE & CONCURRENCY SCALABLE INFRASTRUCTURE I.E. AVOID: “PROOF OF CONCEPT FACTORY” MLOps: Data Science at Scale
  • 5. BOTTLENECK Feature Engineering Modelling The feature store serves as the consumption layer for ML applications. It provides: • Acceleration: pre-“hardened” features reduce data wrangling time for the Data Scientist. • Governance: a common consumption pattern ensures nothing is lost in the translation to production. Predictions Curated Data Feature Engineering Modelling Feature Engineering Modelling Modelling Modelling Modelling Feature Store Example: Feature Store Infrastructure to support DS + MLE
  • 6. The Feature Store is built on the following data science requirements that are relevant to predictive analytics in Financial Services use cases. Correct and consistently applied joins across multiple Delta files without loss of processing speed. Aggregations, window functions, and transformations of data. Granularity of point in time and level of the prediction (e.g. individual, account, etc.). customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days 12345 2021-05-01 0.43 0.32 0.21 23456 2021-05-01 0.99 0.94 0.98 34567 2021-05-01 0.03 0.92 0.13 45678 2021-05-01 0.42 0.59 0.50 The Feature Store uses the “as_of” date for the point in time granularity for both backward- and forward-facing windows. Code-embedded metadata allows easy removal of future-facing windows as “independent” variables to prevent feature leakage. Data Science Use Cases
  • 7. § Many ML use cases don’t have an online requirement, esp. “Human + AI”. § Extending the MVP: ▪ Some online use cases can be reframed as streaming use cases. ▪ Online use cases can be met with extension to the Delta Lake design. ▪ See: feast.dev § Low-code & citizen science expands user base, doesn’t necessarily accelerate existing users. § 80/20 value from: Optimizing Access vs. Optimizing ETL Development “Online” Features Ultra-Low-Latency, Ultra-Timely Point Reads Low-Code ETL Configuration Based, AutoML, FeatureFlow, etc. Edge Cases Opportunities to Simplify for an 80/20 Feature Store MVP
  • 8. ▪ “Golden” aggregates of curated data. ▪ Highly structured, well-defined granularities (esp. as 80/20 solution). ▪ Similar non-functional requirements for strong governance standards, metadata management, discovery, etc. ▪ Different Use Case: BI vs. Modelling ▪ Different Access Patterns, therefore: ▪ Different Data Model ▪ Different Technology Stack ▪ Supervised learning creates complex requirements for: “point in time accurate data” • Differences • Similarities Comparison with Data Warehouse i.e. Dimensional Model
  • 10. WINDOW FUNCTIONS WATERMARK 1 2 3 FEATURE LEAKAGE Point in Time Accurate Data Three Ways Inconsistency Sneaks In
  • 11. Structured Streaming Programming Guide WINDOW FUNCTIONS WATERMARK 1 2 3 FEATURE LEAKAGE Point in Time Accurate Data Three Ways Inconsistency Sneaks In
  • 12. § The thing being modelled. The “Entity” Term borrowed from Feast Granularity “As of” Every feature for an entity “as of” a date. Columns § Discrete granularity (daily, hourly, etc.), not an “event time”. § 80/20 solution. § For “continuous” granularity see: Feast. Features Un-vectorized (80/20) Targets Necessarily at same granularity as features. Predictions One model’s prediction is often another’s feature. Feature Store Logical Model Data Model for Feature Store Access
  • 13. No need to rebuild the whole feature store when new features are added. (Certain sets of features might be rebuilt at times, though they will have much shorter downtime.) The SDK indexes the available features and, upon request, builds the joins to combine all desired features into one cohesive data frame, providing a production-grade feature selection tool. Keyword searching is enabled for features so you can find any feature you're looking for using "human" logic. Tuning can be specific to each set of features, allowing more optimal feature creation. find(): search through all columns and metadata for the features you want to use by giving keys, keywords or regex. select(): when you know exactly the features you want. select_by(): select columns and return a dataframe you want to use by giving a date, keys, keywords or regex. Core Functionality SDK for Feature Store
  • 14. find() To search through all columns and metadata for the features you want to use by giving keys, keywords or regex. Arguments: regexp: A regular expression; kwrds: A list of key words to look for; keys: A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True,}; kwrds_exclude: A list of words to exclude from search; partial: If kwrds is used, this decides if it should find all or any of them when searching; partial_exclude: If kwrds_exclude is used, this decides if it will exclude all or any of them when searching; verbose: If True, prints out results, otherwise just returns them; case_sensitive: If True, an exact match is required to return results. Example Calling the feature store with “fs”, a command could be: fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+") With a returned result of… Your search returned 20 results… feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'} feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'} ... The find method searches through all features given a set of criteria and returns any matches within the name or metadata of columns. It is a great tool to explore the data without pulling in massive datasets. Value to Data Scientist Explore what features are in the feature store via metadata and leverage metadata to enforce governance (e.g., no PI, 3rd party data, etc. as needed) SDK for Feature Store
  • 15. select() When you know exactly the features you want. Arguments: date: Return features given a specific date or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}; *features: Feature names as strings. Example Calling the feature store with “fs”, a command could be: dataframe_name = fs.select("latest", "feature_name_last_0-30_days_prior", "feature_name_last_31-60_days_prior", "feature_name_next_1-30_days") display(dataframe_name) (the first argument is a date, {"=": "2021-05-01"} or "latest" for the newest available features; the remaining arguments list the features you want) With a returned result of… customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days 12345 2021-05-01 0.43 0.32 0.21 23456 2021-05-01 0.99 0.94 0.98 The select method will return a dataframe of all selected features with the given date. Consistent way of selecting the same feature set from the feature store – consistent in dev and when deployed in production Value to Data Scientist Consistent way of selecting (in dev and prod) the same feature set from the feature store when creating a dataframe SDK for Feature Store
  • 16. select_by() Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex. Arguments: date: Return features given a specific date or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}; regexp: A regular expression; kwrds: A list of key words to look for; keys: A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True,}; kwrds_exclude: A list of words to exclude from search; partial: If kwrds is used, this decides if it should find all or any of them when searching; partial_exclude: If kwrds_exclude is used, this decides if it will exclude all or any of them when searching; case_sensitive: If True, an exact match is required to return results. Example Calling the feature store with “fs”, a command could be: dataframe_name = fs.select_by({"=": "2021-05-01"}, regexp="^(?=.*asdf)(?=.*qwerty).+") display(dataframe_name) With a returned result of… customer_id as_of feature_name_1 feature_name_qwerty_1 … 12345 2021-05-01 0.43 0.32 … 23456 2021-05-01 0.99 0.94 … The select_by method searches through all features given a set of criteria and returns a dataframe including all the features that match the criteria within the name or metadata. Value to Data Scientist Consistent way of exploring the feature store and leveraging metadata for selection while simultaneously creating a dataframe with the selected features SDK for Feature Store
  • 17. Gold BI Consumption: Dimensional Model Bronze Silver ML Consumption: Feature Store The Delta Lake Optional: Consumption Optimized Databases ETL ETL Low Latency Memory Cache High Concurrency Data Warehouse Mirror Mirror Implementation on the Data Lake
  • 18. Bronze Silver ML Consumption: Feature Store The Delta Lake Optional: Consumption Optimized Databases ETL ETL Low Latency Memory Cache Mirror SDK (Data Access Layer) • Consistent view of “online” and “historic” features. • Separation of logical and physical models. • Metadata focused query interface for data science exploration. Historic Feature Queries Online Point Reads Implementation on the Data Lake
  • 19. § Simplifies “point in time joins”. § Not as flexible or timely. Pre-defined time aggregations “As Of” Granularity “Dynamic Point in Time Joins” Demonstrated by Feast More flexible, improved timeliness. Multiple feature tables Technically possible to use a single wide table. § Simplifies: ▪ Schema Migration ▪ Query Planning & Optimization ▪ Scheduling Physical Feature Tables Two Choices
  • 20. Summary 1 Feature stores accelerate data science & enable better governance. 2 Most design complexity stems from machine learning requirements for point in time accurate data. 3 80/20 solutions possible by carefully considering “online” requirements.
  • 21. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
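
The metadata-driven search described on slides 6 and 14 can be illustrated with a small, self-contained sketch. This is not the authors' SDK: the catalog contents, the "forward_facing" tag, and the reading of "partial" as "any keyword is enough" are all assumptions made for illustration. It shows how a find()-style search over feature names and comments might work, and how a metadata tag could be used to keep forward-facing windows out of the independent variables.

```python
import re

# Hypothetical catalog: feature name -> metadata. The "forward_facing" tag is an
# assumed convention for this sketch, not the schema used by the authors.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior": {
        "comment": "Average over the prior 0-30 days.",
        "forward_facing": False,
    },
    "feature_name_last_31-60_days_prior": {
        "comment": "Average over the prior 31-60 days.",
        "forward_facing": False,
    },
    "feature_name_next_1-30_days": {
        "comment": "Average over the next 1-30 days (target window).",
        "forward_facing": True,
    },
}


def _contains(text, word, case_sensitive):
    """Substring test, lower-casing both sides unless case_sensitive is set."""
    if not case_sensitive:
        text, word = text.lower(), word.lower()
    return word in text


def find(metadata, regexp=None, kwrds=None, keys=None, kwrds_exclude=None,
         partial=False, case_sensitive=False):
    """Return the features whose name or comment matches every given criterion.

    Assumption: partial=True means any keyword is enough; False requires all.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    combine = any if partial else all
    results = {}
    for name, meta in metadata.items():
        text = f"{name} {meta.get('comment', '')}"
        if regexp is not None and not re.search(regexp, text, flags):
            continue
        if kwrds and not combine(_contains(text, w, case_sensitive) for w in kwrds):
            continue
        if kwrds_exclude and any(_contains(text, w, case_sensitive) for w in kwrds_exclude):
            continue
        # Every requested tag must be present with exactly the requested value.
        if keys and any(meta.get(k) != v for k, v in keys.items()):
            continue
        results[name] = meta
    return results


def independent_features(metadata):
    """Names of features that are safe to use as model inputs (no future data)."""
    return list(find(metadata, keys={"forward_facing": False}))
```

Calling independent_features(FEATURE_METADATA) returns only the two backward-looking columns, so the "next 1-30 days" window stays available as a target but cannot be selected as an input by accident.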