The Feature Store and The Semantic Layer

The document discusses opportunities and challenges around feature stores and the semantic layer. It notes that feature stores can fill the gap between operational ML systems using real-time data and analytical ML systems using historical data by providing reusable features to data scientists and ML engineers. However, challenges remain around defining and computing features both in SQL and Python environments. The feature store aims to support both online and offline use cases by enabling features to be reused across different models.

Uploaded by

Jim Dowling

Feature Store and the Semantic Layer

Opportunities and Challenges

Jim Dowling - CEO, Co-Founder


Analytical and Operational ML Systems Generate Business Value

[Figure: business value rises from BI (descriptive and diagnostic analytics: data warehouse, BI dashboards, semantic layer and metrics store) to AI (predictive and prescriptive analytics: analytical ML, operational ML with historical data, operational ML with real-time data).]
Gap between the MDS (SQL) and Data Science (Python) Worlds

[Figure: apps and services generate operational data (OLTP DB, event-based data on Kafka and Kinesis). Data engineers use SQL to extract, transform and load it (Fivetran, DBT, Matillion) into the data warehouse (Databricks, Snowflake, BQ) and the metrics and semantic layer (AtScale). Data scientists and ML engineers write prototype code and build analytical ML systems and operational ML systems that need real-time features. Enterprise AI is Python-centric, which leaves a "SQL to Python Gap" between the two sides.]

Feature Store fills the Gap between MDS and Data Science Worlds

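[Figure: the same pipeline with a feature store in the gap. Data engineers feed it from event-based data (Kafka, Kinesis) with Python code on Spark and Flink, and from the data warehouse (Databricks, Snowflake, BigQuery) and the metrics and semantic layer (AtScale, DBT) with SQL code, after extract, transform and load (Fivetran, Matillion, …) from OLTP sources. Data scientists and ML engineers use the feature store API to supply real-time features to operational ML systems and data to analytical ML systems. Enterprise Data (operational data) sits on one side, Enterprise AI on the other.]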
MDS not suitable today for Online Use Cases (Real-time ML)

[Figure: a web app sends a predict request to model serving, which reads features from the online feature store. Real-time logs feed streaming feature pipelines (Flink), which write features to the online feature store.]
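
The online path on this slide (a streaming pipeline writes features, model serving reads them by key) can be illustrated with a minimal, pure-Python sketch. Everything here is a hypothetical stand-in: `OnlineFeatureStore` is an in-memory dictionary rather than a real low-latency store, the one-hour window and the feature names `trans_freq_1h` and `trans_volume_1h` are invented for illustration, and a production pipeline would run in a stream processor such as Flink.

```python
from collections import defaultdict, deque


class OnlineFeatureStore:
    """Hypothetical in-memory stand-in for a low-latency online feature store."""

    def __init__(self):
        self._rows = {}

    def write(self, key, features):
        # Keep only the latest feature values per key.
        self._rows[key] = dict(features)

    def read(self, key):
        return self._rows.get(key, {})


def streaming_feature_pipeline(events, store, window_seconds=3600):
    """Maintain per-card sliding-window aggregates and write them to the store.

    Events are (cc_num, timestamp_in_seconds, amount) tuples, assumed to
    arrive ordered by timestamp.
    """
    windows = defaultdict(deque)
    for cc_num, ts, amount in events:
        window = windows[cc_num]
        window.append((ts, amount))
        # Evict events that have fallen out of the window.
        while window and window[0][0] <= ts - window_seconds:
            window.popleft()
        store.write(cc_num, {
            "trans_freq_1h": len(window),
            "trans_volume_1h": sum(a for _, a in window),
        })


store = OnlineFeatureStore()
events = [("card-1", 0, 10.0), ("card-1", 1800, 25.0), ("card-1", 4000, 5.0)]
streaming_feature_pipeline(events, store)

# Model serving would read the latest features for a key at prediction time.
features = store.read("card-1")
print(features)
```

The point of the sketch is the split of responsibilities: the pipeline computes features as events arrive, and the serving side only performs a key lookup.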
Do I have to define Features in the Feature Store?
Can Features be Metrics in the Semantic Layer?
Feature Engineering in the Semantic Layer / SQL

[Figure: the Hopsworks Feature Store reads from external sources through connectors: Snowflake, Redshift (Amazon), ADLS (Azure), JDBC (MySQL, Postgres, MongoDB), Amazon S3, and GCS. A semantic query defines an External Feature Group.]

Semantic Query:

SELECT location, SUM(price) AS total_revenue
FROM sales
GROUP BY location;
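
To make the semantic query on this slide concrete, here is a small sketch that runs it against an in-memory SQLite database and keys the result by `location`, the way a metric could be looked up as a feature. The `sales` rows are invented sample data, SQLite stands in for the warehouse, and an `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

# In-memory database standing in for the data warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (location TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Stockholm", 10.0), ("Stockholm", 5.5), ("Dublin", 7.0)],
)

# The semantic query from the slide, with ORDER BY added for stable output.
rows = conn.execute(
    "SELECT location, SUM(price) AS total_revenue "
    "FROM sales GROUP BY location ORDER BY location"
).fetchall()

# Key each metric by its entity so it can be looked up like a feature.
features_by_location = {
    location: {"total_revenue": total} for location, total in rows
}
print(features_by_location)
conn.close()
```

An aggregate like this fits SQL well. The slides that follow cover the cases where it does not.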
Is SQL enough?
Feature Engineering in Python/Pandas is not perfect either…

Pandas is popular for creating features. However, Pandas has challenges when working with data in the MDS:
1. Pandas doesn't scale to process volumes of data larger than 10s of GBs.
2. Writing Pandas DataFrames back to the MDS raises issues around mapping Pandas dtypes to schemas (and back again).
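
The dtype-mapping problem in point 2 can be shown with a short sketch, assuming pandas is installed. The `DTYPE_TO_SQL` table and the SQL type names are hypothetical and not taken from any particular warehouse. A single missing value turns an integer column into `float64`, so a naive mapping would write a card number as a floating-point column unless the nullable `Int64` dtype is requested explicitly.

```python
import pandas as pd

# Hypothetical mapping from pandas dtype names to warehouse column types.
DTYPE_TO_SQL = {
    "int64": "BIGINT",
    "Int64": "BIGINT",   # pandas nullable integer dtype
    "float64": "DOUBLE",
    "bool": "BOOLEAN",
}


def infer_schema(df):
    """Map each column's dtype to a SQL type, defaulting to STRING."""
    return {
        col: DTYPE_TO_SQL.get(str(dtype), "STRING")
        for col, dtype in df.dtypes.items()
    }


df = pd.DataFrame({"cc_num": [1, 2, None], "amount": [9.5, 3.0, 1.25]})

# The missing value silently promotes cc_num to float64.
naive_schema = infer_schema(df)

# Casting to the nullable integer dtype preserves the intended type.
fixed_schema = infer_schema(df.astype({"cc_num": "Int64"}))

print(naive_schema)
print(fixed_schema)
```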
Some Features are not easy to compute in SQL

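[Figure: SQL aggregations and SQL data validation run in a feature pipeline that writes to the feature store. Dimensionality reductions are marked "??". Normalization, one-hot encoding, and dimensionality reduction are listed as the transformations in question.]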
Feature Reuse
Big wins by reusing Features in many models
Reuse Features from Different Feature Groups with Feature Views

[Figure: data is written to Feature Groups in the Hopsworks Feature Store. One FeatureView serves feature vectors through the Online API (online inference). Another FeatureView serves training data and batch inference data through the Offline API.]
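
The idea of reusing features from different feature groups can be sketched in a few lines of plain Python. The `FeatureGroup` and `FeatureView` classes below are hypothetical illustrations and are not the Hopsworks API. Each view selects features from one or more groups, and two views for two different models share the same `trans_freq` feature without recomputing it.

```python
class FeatureGroup:
    """Hypothetical feature group: rows of features keyed by a primary key."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # {primary_key: {feature_name: value}}


class FeatureView:
    """Hypothetical feature view: a selection of features across groups."""

    def __init__(self, selections):
        self.selections = selections  # list of (FeatureGroup, [feature names])

    def get_feature_vector(self, key):
        # Online-style read: one entity, features joined on the primary key.
        vector = []
        for group, names in self.selections:
            row = group.rows[key]
            vector.extend(row[name] for name in names)
        return vector

    def get_batch_data(self, keys):
        # Offline-style read: many entities at once.
        return [self.get_feature_vector(key) for key in keys]


transactions = FeatureGroup("transactions", {
    "card-1": {"amount": 20.0, "loc_delta": 0.1},
    "card-2": {"amount": 500.0, "loc_delta": 4.2},
})
aggregates = FeatureGroup("transactions_4h_aggs", {
    "card-1": {"trans_freq": 2},
    "card-2": {"trans_freq": 9},
})

# Two models reuse trans_freq through two different views.
fraud_view = FeatureView([(transactions, ["amount", "loc_delta"]),
                          (aggregates, ["trans_freq"])])
churn_view = FeatureView([(aggregates, ["trans_freq"])])

fraud_vector = fraud_view.get_feature_vector("card-2")
churn_batch = churn_view.get_batch_data(["card-1", "card-2"])
print(fraud_vector, churn_batch)
```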
Transformations after the Feature Store to maximise Feature Reuse

[Figure: the feature pipeline writes to the feature store. Transformations (normalization, one-hot encoding) are applied after the feature store.]

The feature store should help ensure there is no training/inference skew when applying transformations
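
One way to read the skew point: transformation parameters are fitted once on training data and then applied unchanged at inference time. The sketch below assumes min-max scaling for normalization and a fixed category list for one-hot encoding. The function names are hypothetical and the values are invented.

```python
def fit_min_max(values):
    """Compute normalization statistics from training data only."""
    return {"min": min(values), "max": max(values)}


def min_max_scale(value, stats):
    """Scale a value using previously fitted statistics."""
    span = stats["max"] - stats["min"]
    return 0.0 if span == 0 else (value - stats["min"]) / span


def fit_one_hot(values):
    """Fix the category order from training data only."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode a value against the fitted categories; unseen values are all zeros."""
    return [1 if value == category else 0 for category in categories]


# Untransformed features, as they would be stored for reuse.
train_amounts = [10.0, 20.0, 50.0]
train_categories = ["grocery", "travel", "grocery"]

# Fit once on training data.
amount_stats = fit_min_max(train_amounts)
categories = fit_one_hot(train_categories)

# Apply the same functions and the same parameters at inference time.
online_amount = min_max_scale(30.0, amount_stats)
online_category = one_hot("travel", categories)
unseen_category = one_hot("fuel", categories)
print(online_amount, online_category, unseen_category)
```

Keeping the stored features untransformed means another model can fit its own statistics on the same feature, which is what makes the reuse possible.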
Point-in-Time Correct SQL hard to write/debug/grok

-- For each transaction (fg1), join the aggregate rows (fg0) for the same card
-- with a timestamp at or before the transaction, and keep only the most recent one.
WITH right_fg0 AS (
    SELECT *
    FROM (
        SELECT
            `fg1`.`fraud_label` `fraud_label`,
            `fg1`.`category` `category`,
            `fg1`.`amount` `amount`,
            `fg1`.`age_at_transaction` `age_at_transaction`,
            `fg1`.`days_until_card_expires` `days_until_card_expires`,
            `fg1`.`loc_delta` `loc_delta`,
            `fg1`.`cc_num` `join_pk_cc_num`,
            `fg1`.`datetime` `join_evt_datetime`,
            `fg0`.`trans_volume_mstd` `trans_volume_mstd`,
            `fg0`.`trans_volume_mavg` `trans_volume_mavg`,
            `fg0`.`trans_freq` `trans_freq`,
            `fg0`.`loc_delta_mavg` `loc_delta_mavg`,
            RANK() OVER (
                PARTITION BY `fg0`.`cc_num`, `fg1`.`datetime`
                ORDER BY `fg0`.`datetime` DESC
            ) pit_rank_hopsworks
        FROM `fabio_featurestore`.`transactions_1` `fg1`
        INNER JOIN `fabio_featurestore`.`transactions_4h_aggs_1` `fg0`
            ON `fg1`.`cc_num` = `fg0`.`cc_num`
            AND `fg1`.`datetime` >= `fg0`.`datetime`
    ) NA
    WHERE `pit_rank_hopsworks` = 1
)
(SELECT
    `right_fg0`.`fraud_label` `fraud_label`,
    `right_fg0`.`category` `category`,
    `right_fg0`.`amount` `amount`,
    `right_fg0`.`age_at_transaction` `age_at_transaction`,
    `right_fg0`.`days_until_card_expires` `days_until_card_expires`,
    `right_fg0`.`loc_delta` `loc_delta`,
    `right_fg0`.`trans_volume_mstd` `trans_volume_mstd`,
    `right_fg0`.`trans_volume_mavg` `trans_volume_mavg`,
    `right_fg0`.`trans_freq` `trans_freq`,
    `right_fg0`.`loc_delta_mavg` `loc_delta_mavg`
FROM right_fg0)
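
The ranking pattern in the query above can be tried on a toy dataset. This sketch uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or later), with simplified, hypothetical table and column names (`transactions`, `aggs`, `ts`) and invented rows. For each transaction it keeps the latest aggregate row at or before the event time, so the aggregate computed at 16:00 never leaks into either training row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (cc_num INTEGER, ts TEXT, amount REAL)")
conn.execute("CREATE TABLE aggs (cc_num INTEGER, ts TEXT, trans_freq INTEGER)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (1, "2022-01-01 10:00", 20.0),
    (1, "2022-01-01 15:00", 500.0),
])
conn.executemany("INSERT INTO aggs VALUES (?, ?, ?)", [
    (1, "2022-01-01 08:00", 2),
    (1, "2022-01-01 12:00", 3),
    (1, "2022-01-01 16:00", 9),  # later than both transactions: must not be used
])

# Rank candidate aggregate rows per transaction, newest first, and keep rank 1.
query = """
WITH ranked AS (
    SELECT t.cc_num, t.ts AS event_time, t.amount, a.trans_freq,
           RANK() OVER (
               PARTITION BY t.cc_num, t.ts
               ORDER BY a.ts DESC
           ) AS pit_rank
    FROM transactions t
    INNER JOIN aggs a
        ON t.cc_num = a.cc_num AND t.ts >= a.ts
)
SELECT cc_num, event_time, amount, trans_freq
FROM ranked
WHERE pit_rank = 1
ORDER BY cc_num, event_time
"""
training_rows = conn.execute(query).fetchall()
print(training_rows)
conn.close()
```

ISO-formatted timestamp strings compare correctly as text, which is why the `>=` condition works here without a date type.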
Feature Reuse with Feature Views

[Figure: features are selected from Feature Group 1 and Feature Group 2 into a Feature View, which also holds transformation functions and feature statistics. The Feature View is read as batch inference data, as feature vectors, and as training data.]

From Hopsworks tutorial: https://github.com/logicalclocks/hopsworks-tutorials/blob/master/fraud_batch


Feature Stores and Semantic Layer: Challenges and Opportunities

● Big opportunity to unify the definition of business logic between the Semantic Layer and the Feature Store.
● Challenges remain in how to define real-time features in the MDS.
● Need to clearly define the border between Feature Stores and the MDS: dimensionality reductions, transformations, feature reuse.
[Figure: Enterprise Data (data sources, data warehouse, reporting, compliance, governance) and Enterprise AI (applications and services, model development, online and batch services, AI-enabled apps) are connected by feature stores, with "Semantic Layer?" posed as an open question. Qualities listed: efficiency at scale, open and modular.]
www.hopsworks.ai