Building Machine Learning Systems with a Feature Store
Batch, Real-Time, and LLM Systems
Early Release
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
Jim Dowling
Building Machine Learning Systems
with a Feature Store
by Jim Dowling
See http://oreilly.com/catalog/errata.csp?isbn=9781098165239
for release details.
The views expressed in this work are those of the author and do
not represent the publisher’s views. While the publisher and
the author have used good faith efforts to ensure that the
information and instructions contained in this work are
accurate, the publisher and the author disclaim all
responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any
code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
[TO COME]
Brief Table of Contents (Not Yet Final)
Preface
Introduction
This will be the introduction of the final book. Please note that
the GitHub repo will be made active later on.
Figure I-2. When we plot all of our example apples and oranges using the observed
values for the red and green color channels, we can see that most apples are on the left
of the decision boundary, and most oranges are on the right. Some apples and oranges
are, however, difficult to differentiate based only on their red and green channel colors.
The model can then be used to classify a new piece of fruit as
either an apple or an orange using its red and green channel
values. If the fruit’s red and green channel values place it on the
left of the line, it is classified as an apple; otherwise, it is classified as an orange.
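To make this concrete, here is a minimal sketch of how such a linear decision boundary could be applied in code. The weights, bias, and channel values are invented for illustration; in practice they would be learned from the training examples.

```python
import numpy as np

# Hypothetical parameters of a learned linear decision boundary:
# w[0] * red + w[1] * green + b = 0 separates apples from oranges.
w = np.array([-1.2, 1.5])  # made-up weights for the red and green channels
b = 0.1                    # made-up bias term

def classify_fruit(red: float, green: float) -> str:
    """Classify a fruit by which side of the decision boundary it falls on."""
    score = w @ np.array([red, green]) + b
    return "apple" if score > 0 else "orange"

# With these made-up weights, a greener fruit lands on the apple side.
print(classify_fruit(red=0.3, green=0.8))  # -> apple
print(classify_fruit(red=0.9, green=0.4))  # -> orange
```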
Figure I-3. The red apple complicates our prediction problem because there is no longer
a linear decision boundary between the apples and oranges using only color as a
feature.
We can see that red apples also have zero for the blue channel,
so we can ignore that feature. However, in Figure I-4, we can see
that the red apples are located in the bottom right-hand
corner of our chart, and our model (a linear decision boundary)
is broken: it would predict that red apples are oranges. Our
model’s precision and recall are now much worse.
Figure I-4. When we add red apples to our training examples, we can see that we can no
longer use a straight line to classify fruit as orange or apple. We now need a non-linear
decision boundary to separate apples from oranges, and in order to learn the decision
boundary, we need a more complex model (with more parameters), more training
examples, and m.
Figure I-5. This regression problem of predicting the weight of an apple can be solved
using a linear model that minimizes the mean-squared error
In this regression example, we don’t technically need the full
power of supervised learning yet—a simple linear model will
work well. We can fit a straight line (that predicts an apple’s
weight using its green channel value and diameter) to the data
points by drawing the line on the chart such that it minimizes
the distance between the line and the data points (X1, X2, X3, X4,
X5). For example, a common technique is to average the
absolute distances between the data points and the line; this is
known as the mean absolute error (MAE). We take the absolute
value of the distances because, if we didn’t, the distance for X1
would be negative and the distance for X2 would be positive,
and they would cancel each other out.
Sometimes, we have data points that are very far from the line,
and we want our error metric to penalize those outliers more
heavily, so that the model fits them better. For this, we can
square the distances, take their mean, and then take the square
root of the result. This is called the root mean-squared error (RMSE).
The MAE and RMSE are both metrics used to help fit our linear
regression model and also to evaluate its performance. Similar
to our earlier classification example, if we introduce more
features to improve the performance of our regression model,
we will have to upgrade from our linear regression model to a
supervised learning regression model that can perform better
by learning non-linear relationships between the features and
the target.
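To make the two metrics concrete, the short sketch below computes MAE and RMSE for a handful of made-up data points; the true and predicted apple weights are invented purely to show the arithmetic.

```python
import numpy as np

# Made-up true apple weights (grams) and the weights predicted by a fitted line.
y_true = np.array([140.0, 152.0, 160.0, 171.0, 183.0])
y_pred = np.array([138.0, 155.0, 158.0, 175.0, 180.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean-squared error

print(f"MAE:  {mae:.2f} g")   # average absolute distance to the line
print(f"RMSE: {rmse:.2f} g")  # penalizes large errors (outliers) more heavily
```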
Figure 1-2. A monolithic batch ML system that can run in either (1) training mode or (2)
inference mode.
Figure 1-4. Many (real-time) interactive ML systems also require history and context to
make personalized predictions. The feature store enables personalized history and
context to be retrieved at low latency as precomputed features for online models.
Inspired by the DevOps movement in software engineering,1
MLOps is a set of practices and processes for building reliable
and scalable ML systems that can be quickly and incrementally
developed, tested, and rolled out to production using
automation where possible. Some of the problems considered
part of MLOps were addressed already in this section, such as
how to ensure consistent feature data between training and
inference. The O’Reilly book “Machine Learning Design
Patterns”, published in 2020, described 30 patterns for building
ML systems, and many problems related to testing, versioning, and
monitoring features, models, and data have been identified by
the MLOps community.
Supervised Learning
In supervised learning, you train a model with data
containing features and labels. Each row in a training
dataset contains a set of input feature values and a label
(the outcome, given the input feature values). Supervised
ML algorithms learn relationships between the labels
(also called the target variable) and the input feature
values. Supervised ML is used to solve classification
problems, where the ML system will answer yes-or-no
questions (is there a hotdog in this photo?) or make a
multiclass classification (what type of hotdog is this?).
Supervised ML is also used to solve regression problems,
where the model predicts a numeric value using the input
feature values (estimate the price of this apartment, given
input features such as its area, condition, and location).
Finally, supervised ML is also used to fine-tune chatbots
using open-source large language models (LLMs). For
example, if you train a chatbot with questions (features)
and answers (labels) from the legal profession, your
chatbot can be fine-tuned so that it talks like a lawyer.
Unsupervised Learning
Semi-supervised Learning
Self-supervised Learning
In-context Learning
Tabular data
Columnar data stores are the most common data source for
historical data for ML systems in Enterprises. Many data
transformations for creating features, such as aggregations and
feature extraction, can be efficiently and scalably implemented
in dbt/SQL or Spark on data stored in data warehouses. Python
frameworks for data transformations, such as Pandas 2+ and
Polars, are also popular for feature engineering with data of
more modest scale (gigabytes rather than terabytes or more).
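As a small illustration of this kind of feature engineering, the sketch below computes a weekly-spend aggregation with Pandas; the transaction DataFrame and the weekly_spend feature are hypothetical.

```python
import pandas as pd

# Hypothetical raw transactions (in a real system, read from the data warehouse).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02", "2024-01-05", "2024-01-09"]
    ),
    "amount": [20.0, 35.0, 12.5, 80.0, 15.0],
})

# Aggregation feature: each customer's total spend per calendar week.
weekly_spend = (
    transactions
    .set_index("ts")
    .groupby("customer_id")
    .resample("W")["amount"]
    .sum()
    .rename("weekly_spend")
    .reset_index()
)

print(weekly_spend)
```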
A Lakehouse is a combination of (1) tables stored as columnar
files in a data lake (an object store or distributed file system) and
(2) a data processing layer that ensures ACID operations when
reading and writing those tables. The formats that make this
possible are collectively known as table file formats. There are
three popular open-source table formats: Apache Iceberg, Apache
Hudi, and Delta Lake. All three provide similar functionality,
enabling you to update tabular data, delete rows from tables, and
incrementally add data to tables. You no longer need to read in
the old data, update it, and write back a new version of the entire
table. Instead, you can simply append or upsert (insert or update)
data into your tables.
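As an illustration of an upsert, here is a minimal sketch assuming Delta Lake’s Python API on Spark (Iceberg and Hudi offer equivalent merge/upsert operations); the table path, schema, and data are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# New or changed rows produced by today's pipeline run (hypothetical schema).
updates_df = spark.createDataFrame(
    [(1, "alice", 42.0), (7, "grace", 13.5)],
    ["user_id", "username", "score"],
)

# Upsert into an existing Delta table: update matching rows, insert new ones.
target = DeltaTable.forPath(spark, "s3://my-bucket/lakehouse/users")
(
    target.alias("t")
    .merge(updates_df.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```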
Unstructured Data
Event Data
API-Provided Data
Incremental Datasets
For example, the well-known Titanic passenger dataset4
consists of the following files:
train.csv
test.csv
NOTE
Immutable files are not suitable as the data layer of record in an enterprise
environment where GDPR (the EU’s General Data Protection Regulation) and CCPA
(California Consumer Privacy Act) require that users are allowed to have their data
deleted, updated, and its usage and provenance tracked. In recent years, open-source
table formats for data lakes have appeared, such as Apache Iceberg, Apache Hudi,
and Delta Lake, that support mutable datasets (compatible with GDPR and CCPA) and
are designed to work at massive scale (petabytes in size) on low-cost storage (object
stores and distributed file systems).
In introductory ML courses, you do not typically learn about
incremental datasets. An incremental dataset is a dataset that
supports efficient appends, updates, and deletions. ML systems
continually produce new data - whether once per year, day,
hour, minute, or even second. ML systems need to support
incremental datasets. In ML systems built with time-series data
(for example, online consumer data), that data may also have
freshness constraints: models degrade over time if they are not
periodically retrained using recent (fresh) data. So, we need to
accumulate historical data in incremental datasets so that, over
time, more training data becomes available for retraining
models and keeping our ML systems performing well.
What Is an ML Pipeline?
A pipeline is a program that has well-defined inputs and
outputs and is run either on a schedule or 24x7. ML pipeline is
a widely used term in ML engineering that loosely refers to the
pipelines used to build and operate ML systems.
However, a problem with the term ML pipeline is that it is not
clear what the inputs and outputs of an ML pipeline are. Is the
input raw data or training data? Is the model part of the input or
the output? In this book, we will use the term ML pipeline to refer
collectively to any pipeline in an ML system. We will not use the
term ML pipeline to refer to a specific stage in an ML system,
such as feature engineering, model training, or inference.
Figure 1-5. An ML pipeline has well-defined inputs and outputs. The outputs of ML
pipelines can be inputs to other ML pipelines or to external ML systems that use the
predictions and prediction logs to make them “AI-enabled”.
The three different pipelines have clear inputs and outputs and
can be developed and operated independently:
Figure 1-6. The testing pyramid for ML systems is higher than for traditional software
systems, as both code and data need to be tested, not just code.
It is often said that the main difference between testing
traditional software systems and ML systems is that in ML
systems we need to test both the source code and the data, not
just the source code. The features created by feature pipelines
can have their logic tested with unit tests and their input data
checked with data validation tests (see Chapter 5). The models
need to be tested for performance, but also for a lack of bias
against known groups of vulnerable users (see Chapter 6).
Finally, at the top of the pyramid, ML systems need to test their
performance with A/B tests before they can switch to a new
model (see Chapter 7).
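For example, the feature logic in a feature pipeline can be unit tested in isolation, as in the minimal sketch below; add_is_weekend is a hypothetical feature function, and the test can be run with pytest.

```python
import pandas as pd

def add_is_weekend(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature logic: flag transactions that happen on a weekend."""
    out = df.copy()
    out["is_weekend"] = out["transaction_ts"].dt.dayofweek >= 5
    return out

def test_add_is_weekend():
    # A Friday and a Saturday transaction: only the second is a weekend.
    df = pd.DataFrame({
        "transaction_ts": pd.to_datetime(["2024-01-05", "2024-01-06"]),
    })
    result = add_is_weekend(df)
    assert result["is_weekend"].tolist() == [False, True]
```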
MLOps folks love that feeling when you push changes in your
source code, and your ML artifact or system is automatically
deployed. Deployments are often associated with the concept of
development (dev), pre-production (preprod), and production
(prod) environments. ML assets are developed in the dev
environment, tested in preprod, and tested again before being
deployed to the prod environment. Although a human may
ultimately have to sign off on deploying an ML artifact to
production, the steps should be automated in a process known
as continuous deployment (CD). In this book, we work with the
philosophy that you can build, test, and run your whole ML
system in dev, preprod, or prod environments. The data your
ML system can access will depend on which environment
you deploy in (only prod has access to production data). We will
start by learning to build and operate an ML system, then
look at CD in Chapter 12.
Figure 1-7. Versioning of features and models is needed to easily upgrade ML
systems and roll back upgrades in case of failure.
Figure 1-8. An ML system with a feature store supports three different types of ML
pipelines: a feature pipeline, a training pipeline, and an inference pipeline. Logging
pipelines help implement observability for ML systems.
While the feature store stores feature data for ML pipelines, the
model registry is the storage layer for trained models. The ML
pipelines in an ML system can be run on potentially any compute
platform. Many different compute engines are used for feature
pipelines - including SQL, Spark, Flink, and Python - and
whether they are batch or streaming pipelines, they typically
are operational services that need to either run on a schedule
(batch) or 24x7 (streaming). Training pipelines are most
commonly implemented in Python, as are online inference
pipelines. Batch inference pipelines can be Python, PySpark, or
even a streaming compute engine or SQL database.
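As an illustration, a Python feature pipeline might write its output to the feature store roughly as in the sketch below. It assumes a Hopsworks-style Python API (hopsworks.login, get_feature_store, get_or_create_feature_group, insert); the feature group name, keys, and DataFrame are hypothetical, and other feature stores expose similar write APIs.

```python
import hopsworks
import pandas as pd

# Hypothetical output of the feature pipeline's transformations.
features_df = pd.DataFrame({
    "customer_id": [1, 2],
    "event_time": pd.to_datetime(["2024-01-07", "2024-01-07"]),
    "weekly_spend": [55.0, 107.5],
})

project = hopsworks.login()       # authenticate against the feature store
fs = project.get_feature_store()

# Create the feature group on the first run, then insert (upsert) new rows.
fg = fs.get_or_create_feature_group(
    name="customer_spend",        # hypothetical feature group name
    version=1,
    primary_key=["customer_id"],
    event_time="event_time",
)
fg.insert(features_df)
```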
The following are some examples of the three different types
of ML systems that use a feature store:
Real-Time ML Systems
Batch ML Systems
Summary
In this chapter, we introduced ML systems with a feature store.
We introduced the main properties of ML systems, their
architecture, and the ML pipelines that power them. We
introduced MLOps and its historical evolution as a set of best
practices for developing and evolving ML systems, and we
presented a new architecture for ML systems as feature,
training, and inference (FTI) pipelines connected with a feature
store. In the next chapter, we will look closer at this new FTI
architecture for building ML systems, and how you can build
ML systems faster and more reliably as connected FTI pipelines.
1. Wikipedia states that “DevOps integrates and automates the work of software
development (Dev) and IT operations (Ops) as a means for improving and shortening
the systems development life cycle.”
2. Enterprise computing refers to the information storage and processing platforms
that businesses use for operations, analytics, and data science.
3. Parquet files store tabular data in a columnar format - the values for each column
are stored together, enabling faster aggregate operations at the column level (such as
the average value for a numerical column) and better compression, with both
dictionary and run-length encoding.
4. The Titanic dataset is a well-known example of a binary classification problem in
machine learning, where you have to train a model to predict whether a given
passenger will survive or not.
About the Author
Jim Dowling is CEO of Hopsworks and an Associate Professor at
KTH Royal Institute of Technology. He has led the development of
Hopsworks, which includes the first open-source feature store for
machine learning. He has a unique background at the
intersection of data and AI. For data, he worked at MySQL and
later led the development of HopsFS, a distributed file system
that won the IEEE Scale Prize in 2017. For AI, his PhD
introduced Collaborative Reinforcement Learning, and he
developed and taught the first course on Deep Learning in
Sweden in 2016. He also released a popular online course on
serverless machine learning using Python at serverless-ml.org.
This combined background in data and AI helped him realize
the vision of a feature store for machine learning based on
general-purpose programming languages, rather than the
DSL-based approach of the earlier feature store work at Uber.
He was the first evangelist for feature stores, helping to create
the feature store product category through talks at industry
conferences such as Data/AI Summit, PyData, and OSDC, and
through educational articles on feature stores. He is the
organizer of the annual feature store
summit conference and the featurestore.org community, as well
as co-organizer of PyData Stockholm.