PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun

Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
UC Berkeley, 05/23/2018
ML-as-a-Service
• ML models are deployed on cloud platforms as black boxes
• Users deploy multiple models per machine (10-100s)
• Deployed models are often similar
  - Similar structure
  - Similar state
• Two key requirements:
  1. Performance: latency or throughput
  2. Model density: number of models per machine
[Slide graphic: an input sentence such as "This is a good product" is sent to several deployed models, each producing scores such as good:0, bad:1, …]
Limitations of black-box
300 models in ML.NET [1], C#, from the Amazon Reviews dataset [2]:
250 Sentiment Analysis + 50 Regression Task

Limited performance:
• Overheads: JIT, GC, virtual calls, …
• Operators are separate calls: no code fusion, multiple data accesses
• Operators have different buffers and break locality

Limited density:
• No state sharing: state is duplicated, so memory is wasted

[Figure 3: Probability for an operator to appear within the 250 different pipelines; if an operator x appears with probability y (in %), 250×y/100 pipelines have operator x, implying that pipelines can save memory by sharing state. Operators are identified by their parameters; the first two groups represent N-gram operators, which have multiple versions with different lengths (e.g., unigram, trigram).]
[Figure 4: CDF of latency of prediction requests over the 250 DAGs. The first prediction is denoted as cold; the hot line is the average over 100 predictions after a warm-up period of 10, and the plot is normalized over the 99th-percentile latency of the hot case. Hot predictions, with memory already allocated and JIT-compiled code, are more than an order of magnitude faster than the cold version for the same pipelines.]

[1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16.
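The missing code fusion is easiest to see in code. The following is a minimal C# sketch (hypothetical code, not ML.NET or PRETZEL internals) contrasting the black-box style, where each operator is a separate virtual call writing to its own buffer, with a fused loop that keeps intermediate values in registers:

using System;

// Black-box style: each operator is a virtual call with its own
// output buffer, so data is written and re-read between operators.
interface IOperator { float[] Apply(float[] input); }

sealed class Scale : IOperator
{
    public float[] Apply(float[] input)
    {
        var output = new float[input.Length];  // separate buffer
        for (int i = 0; i < input.Length; i++) output[i] = input[i] * 0.5f;
        return output;
    }
}

sealed class AddBias : IOperator
{
    public float[] Apply(float[] input)
    {
        var output = new float[input.Length];  // yet another buffer
        for (int i = 0; i < input.Length; i++) output[i] = input[i] + 1.0f;
        return output;
    }
}

static class FusionDemo
{
    static void Main()
    {
        var input = new float[] { 1f, 2f, 3f };

        // Black-box chain: two virtual calls, two allocations, two passes.
        IOperator[] ops = { new Scale(), new AddBias() };
        float[] chained = input;
        foreach (var op in ops) chained = op.Apply(chained);

        // Fused version: one pass, no intermediate buffer; values stay in
        // registers, which is what merging operators into one unit buys.
        var fused = new float[input.Length];
        for (int i = 0; i < input.Length; i++)
            fused[i] = input[i] * 0.5f + 1.0f;

        Console.WriteLine(string.Join(", ", fused)); // 1.5, 2, 2.5
    }
}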
White-Box Design principles
White-box: open the black box of models so that they can co-exist better and be scheduled better.
1. End-to-end Optimizations: merge operators into computational units (logical stages) to decrease overheads
2. Multi-model Optimizations: create once, use everywhere, for both data and stages
Off-line phase [1] - Flour

[Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program. Additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.]
The Scheduler makes the dynamic decisions on how to schedule plans based on machine workload. Finally, a FrontEnd is used to submit prediction requests to the system.

Model pipeline deployment and serving in PRETZEL is a two-phase process. During the off-line phase (Section 4.1), ML2's pre-trained pipelines are translated into Flour, a transformation-based API. The Oven optimizer rearranges and fuses transformations into model plans composed of parameterized logical units called stages. Each logical stage is then AOT-compiled into physical computation units where memory resources and threads are pooled at runtime. Model plans are registered for prediction serving in the Runtime, where physical stages and parameters are shared between pipelines with similar model plans. In the on-line phase (Section 4.2), when an inference request for a registered model plan is received, physical stages are parameterized dynamically with the proper values maintained in the Object Store. The Scheduler is in charge of binding physical stages to shared execution units. Figures 6 and 7 pictorially summarize the above descriptions; note that only the on-line phase is executed at inference time, whereas the model plans are generated completely off-line.

Flour: the goal of Flour is to provide an intermediate representation between ML frameworks (currently only ML2) and PRETZEL that is both easy to target and amenable to optimizations. Once a pipeline is ported into Flour, it can then be optimized and compiled into a model plan.
[Figure 1: A sentiment analysis pipeline consisting of operators for featurization (ellipses), followed by an ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string (e.g., "This is a nice product"). Char and Word Ngrams featurize input tokens by extracting n-grams. Concat generates a single feature vector, which the Linear Regression model scores as Positive vs. Negative.]

Listing 1: Flour program for the sentiment analysis pipeline. Transformations' parameters are extracted from the original ML2 pipeline.

1  var fContext = new FlourContext(objectStore, ...);
2  var tTokenizer = fContext.CSV.
3                     FromText(fields, fieldsType, sep).
4                     Tokenize();
5
6  var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
7  var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
8  var fPrgrm = tCNgram.
9                 Concat(tWNgram).
10                ClassifierBinaryLinear(cParams);
11
12 return fPrgrm.Plan();

The fields and fieldsType arrays indicate the number and type of input fields. The successive call to Tokenize in line 4 splits the input fields into tokens. Lines 6 and 7 contain the two branches defining the char-level and word-level n-gram transformations, which are then merged with the Concat transform in line 9 before the linear binary classifier of line 10. Both char and word n-gram transformations are parametrized by the number of n-grams and by maps translating n-grams into numerical format (not shown in the figure). Additionally, each Flour transformation accepts as input an optional set of statistics gathered from training. These statistics are used by the compiler to generate physical plans more efficiently tailored to the model characteristics. Example statistics are max vector size (to define the minimum size of vectors to fetch from the pool at prediction time, Section 4.2), dense or sparse representations, etc. We have instrumented the ML2 library to collect such statistics from training.
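To picture how statistics travel with a transformation, here is a hypothetical C# sketch in the spirit of the Flour API above (the TransformStats type and the extra CharNgram argument are assumptions, not the actual API):

// Hypothetical container for training-time statistics that a Flour
// transformation could accept, letting the compiler size buffers and
// choose dense vs. sparse physical implementations.
sealed class TransformStats
{
    public int MaxVectorSize { get; set; }  // minimum pool vector size at prediction time
    public bool IsSparse { get; set; }      // dense vs. sparse representation
}

// Hypothetical usage, mirroring line 6 of Listing 1:
// var cStats = new TransformStats { MaxVectorSize = 4096, IsSparse = true };
// var tCNgram = tTokenizer.CharNgram(numCNgrms, ngramMap, cStats);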
Oven Optimizer: transformations that can be pipelined (such as most featurizers) are pipelined together in a single pass over the data. This strategy achieves the best data locality because records are likely to reside in CPU registers [33, 38]. In the sentiment analysis pipeline, the optimizer can recognize that the Linear Regression can be pushed into Char and WordNgram, therefore bypassing the execution of Concat. Additionally, Tokenizer can be reused between Char and WordNgram; therefore it will be pipelined with CharNgram (in one stage), and a dependency between CharNgram and WordNgram (in another stage) will be created. The final plan will therefore be composed of two stages, versus the initial 4 operators (and vectors) of ML.NET.

Model Plan Compiler: model plans are composed of two DAGs: a DAG of logical stages and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations; the physical implementation is selected based on the parameters characterizing the logical stage. A sketch of this mapping follows.
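The 1-to-n mapping can be pictured as follows; all types here are hypothetical stand-ins, not PRETZEL's classes. One logical stage carries parameters and statistics, and the compiler selects among several physical implementations based on them (e.g., sparse vs. dense n-gram featurization):

using System;

// Hypothetical stand-ins: a logical stage holds parameters/statistics;
// a physical stage is the executable code selected for it.
sealed record LogicalStage(string Op, bool SparseInput, int MaxVectorSize);

delegate float[] PhysicalStage(float[] input);

static class PlanCompilerDemo
{
    // 1-to-n mapping: one logical stage, several candidate implementations;
    // the physical one is picked from the parameters characterizing the stage.
    static PhysicalStage Compile(LogicalStage stage) => stage switch
    {
        { Op: "ngram", SparseInput: true }  => new PhysicalStage(SparseNgram),
        { Op: "ngram", SparseInput: false } => new PhysicalStage(DenseNgram),
        _ => throw new NotSupportedException(stage.Op),
    };

    static float[] DenseNgram(float[] v)  { /* would touch every slot */ return v; }
    static float[] SparseNgram(float[] v) { /* would skip zero entries */ return v; }

    static void Main()
    {
        var logical = new LogicalStage("ngram", SparseInput: true, MaxVectorSize: 4096);
        PhysicalStage physical = Compile(logical);
        Console.WriteLine(physical(new float[4]).Length); // 4
    }
}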
On-line phase [1] - Runtime
• Two main components:
  – Runtime, with an Object Store
  – Scheduler
• Runtime handles physical resources: threads and buffers
• Object Store caches objects of all models
  – models register and retrieve state objects via a key
• Scheduler is event-based, each stage being an event
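A minimal sketch of the key-based sharing the Object Store enables (hypothetical types, not PRETZEL's API): two pipelines that need the same state object register it under the same key and end up referencing a single copy.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical Object Store: models register state objects under a key
// and retrieve them later; identical state is materialized only once.
sealed class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects = new();

    // Returns the already-registered object if the key exists, so
    // pipelines with similar model plans share one copy of the state.
    public T GetOrRegister<T>(string key, Func<T> factory) where T : class =>
        (T)_objects.GetOrAdd(key, _ => factory());
}

static class ObjectStoreDemo
{
    static void Main()
    {
        var store = new ObjectStore();

        // Two pipelines ask for the same n-gram dictionary:
        // only the first call materializes it.
        var dictA = store.GetOrRegister("ngrams/amazon-v1",
                                        () => new Dictionary<string, int>());
        var dictB = store.GetOrRegister("ngrams/amazon-v1",
                                        () => new Dictionary<string, int>());

        Console.WriteLine(ReferenceEquals(dictA, dictB)); // True: shared state
    }
}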
Memory
• 7x less memory for SA models, 40% less for RT models
• Higher model density, hence higher efficiency and profitability
[Figure 8: Cumulative memory usage of the model pipelines with and without the Object Store (normalized).]
Latency
• Hot vs. cold scenarios
• Without and with caching of partial results
• PRETZEL vs. ML.NET: cold is 15x faster, hot is 2.6x faster
[Figure 9: Latency comparison between ML2 and PRETZEL, with normalized values; cliffs indicate pipelines …]
Throughput
• Batch size is 1000 inputs, with multiple runs of the same batch
• PRETZEL scales linearly with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled; latency, instead, gracefully increases linearly with the increase of the load
• ML.NET suffers from missed data sharing, i.e., higher memory traffic: model objects are not shared, thus increasing the pressure on the memory subsystem; indeed, even if the data values are the same, the model objects are mapped to different memory areas
• Delay batching: as in Clipper, the PRETZEL FrontEnd allows users to specify a maximum delay they can wait in order to maximize throughput; requests are batched while latency stays below the tolerated delay. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases; interestingly, even a small delay helps reach the optimal batch size (see the sketch after this slide)
[Figure 11: Average throughput, with the number of cores on the x-axis.]
[Figure 12: Throughput and latency of SA and RT models as the tolerated delay increases.]
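A minimal sketch of the delay-batching policy (hypothetical code, not the PRETZEL FrontEnd): requests accumulate until either the batch is full or the oldest request has waited the maximum tolerated delay.

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical delay batcher: trade a bounded extra wait for a larger
// batch, as with Clipper's and PRETZEL's maximum-delay knob.
sealed class DelayBatcher<T>
{
    private readonly Queue<T> _pending = new();
    private readonly Stopwatch _oldest = new();
    private readonly int _maxBatch;
    private readonly TimeSpan _maxDelay;

    public DelayBatcher(int maxBatch, TimeSpan maxDelay)
    {
        _maxBatch = maxBatch;
        _maxDelay = maxDelay;
    }

    public void Enqueue(T request)
    {
        if (_pending.Count == 0) _oldest.Restart();  // first request starts the clock
        _pending.Enqueue(request);
    }

    // Flush when the batch is full or the oldest request has waited
    // as long as the caller is willing to tolerate.
    public bool TryTakeBatch(out List<T> batch)
    {
        if (_pending.Count >= _maxBatch ||
            (_pending.Count > 0 && _oldest.Elapsed >= _maxDelay))
        {
            batch = new List<T>(_pending);
            _pending.Clear();
            return true;
        }
        batch = null;
        return false;
    }
}

// Usage: var batcher = new DelayBatcher<string>(maxBatch: 1000,
//                                               maxDelay: TimeSpan.FromMilliseconds(5));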
Conclusions and current work
• We addressed the performance and density bottlenecks of ML inference in MaaS by advocating a white-box approach
• Future work:
  - Code generation
  - Support for more model formats beyond ML.NET
  - A distributed version
  - NUMA awareness
Future work with FPGA
• Physical stages can be offloaded to FPGA
• How to deploy it?
  – A fixed subset of common operators
  – Multiple operators fused in kernels
  – The whole model, deployed via partial reconfiguration
QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it

More Related Content

What's hot (20)

PDF
Chapter 4: Parallel Programming Languages
Heman Pathak
 
PDF
Parallel External Memory Algorithms Applied to Generalized Linear Models
Revolution Analytics
 
PPS
PRAM algorithms from deepika
guest1f4fb3
 
DOCX
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
PPT
Parallel algorithms
guest084d20
 
DOCX
Graph based transistor network generation method for supergate design
Ieee Xpert
 
DOCX
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
DOCX
High performance nb-ldpc decoder with reduction of message exchange
Ieee Xpert
 
PPTX
Matrix multiplication
International Islamic University
 
PDF
Elementary Parallel Algorithms
Heman Pathak
 
PPTX
ECE 565 FInal Project
Lakshmi Yasaswi Kamireddy
 
PDF
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Pooyan Jamshidi
 
PDF
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET Journal
 
PPT
Chapter 4 pc
Hanif Durad
 
PPT
3DD 1e SyCers
Marco Santambrogio
 
PDF
Design and Estimation of delay, power and area for Parallel prefix adders
IJERA Editor
 
PDF
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
IJET - International Journal of Engineering and Techniques
 
PPT
32-bit unsigned multiplier by using CSLA & CLAA
Ganesh Sambasivarao
 
PDF
Parallel Algorithms
Heman Pathak
 
PDF
D0212326
inventionjournals
 
Chapter 4: Parallel Programming Languages
Heman Pathak
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Revolution Analytics
 
PRAM algorithms from deepika
guest1f4fb3
 
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
Parallel algorithms
guest084d20
 
Graph based transistor network generation method for supergate design
Ieee Xpert
 
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
High performance nb-ldpc decoder with reduction of message exchange
Ieee Xpert
 
Matrix multiplication
International Islamic University
 
Elementary Parallel Algorithms
Heman Pathak
 
ECE 565 FInal Project
Lakshmi Yasaswi Kamireddy
 
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Pooyan Jamshidi
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET Journal
 
Chapter 4 pc
Hanif Durad
 
3DD 1e SyCers
Marco Santambrogio
 
Design and Estimation of delay, power and area for Parallel prefix adders
IJERA Editor
 
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
IJET - International Journal of Engineering and Techniques
 
32-bit unsigned multiplier by using CSLA & CLAA
Ganesh Sambasivarao
 
Parallel Algorithms
Heman Pathak
 

Similar to Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads (20)

PDF
Pretzel: optimized Machine Learning framework for low-latency and high throug...
NECST Lab @ Politecnico di Milano
 
PDF
Pretzel: optimized Machine Learning framework for low-latency and high throu...
NECST Lab @ Politecnico di Milano
 
PDF
Machine Learning @NECST
NECST Lab @ Politecnico di Milano
 
PDF
MLSEV Virtual. From my First BigML Project to Production
BigML, Inc
 
PDF
[ESWC2017 - PhD Symposium] Enhancing white-box machine learning processes by ...
Gilles Vandewiele
 
PPTX
Accelerating Deep Learning Inference 
on Mobile Systems
Darian Frajberg
 
PPTX
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
PPTX
Machine Learning With ML.NET
Dev Raj Gautam
 
PDF
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
Pooyan Jamshidi
 
PDF
Continuous Architecting of Stream-Based Systems
CHOOSE
 
PDF
DutchMLSchool 2022 - Automation
BigML, Inc
 
PPTX
Compositional AI: Fusion of AI/ML Services
Debmalya Biswas
 
PDF
CGI trainees workshop Distributed Deep Learning, 24/5 2019, Kim Hammar
Kim Hammar
 
PDF
Machine_Learning_Blocks___Bryan_Thesis
Bryan Collazo Santiago
 
PDF
Scolari's ICCD17 Talk
NECST Lab @ Politecnico di Milano
 
PDF
Machine Learning meets DevOps
Pooyan Jamshidi
 
PDF
AI Library - An Open Source Machine Learning Framework
MLconf
 
PPTX
Generative AI in CSharp with Semantic Kernel.pptx
Alon Fliess
 
PDF
Hybrid use of machine learning and ontology
Anthony (Tony) Sarris
 
ODP
Fast Approximate A-box Consistency Checking using Machine Learning
Heiko Paulheim
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
NECST Lab @ Politecnico di Milano
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
NECST Lab @ Politecnico di Milano
 
Machine Learning @NECST
NECST Lab @ Politecnico di Milano
 
MLSEV Virtual. From my First BigML Project to Production
BigML, Inc
 
[ESWC2017 - PhD Symposium] Enhancing white-box machine learning processes by ...
Gilles Vandewiele
 
Accelerating Deep Learning Inference 
on Mobile Systems
Darian Frajberg
 
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
Machine Learning With ML.NET
Dev Raj Gautam
 
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
Pooyan Jamshidi
 
Continuous Architecting of Stream-Based Systems
CHOOSE
 
DutchMLSchool 2022 - Automation
BigML, Inc
 
Compositional AI: Fusion of AI/ML Services
Debmalya Biswas
 
CGI trainees workshop Distributed Deep Learning, 24/5 2019, Kim Hammar
Kim Hammar
 
Machine_Learning_Blocks___Bryan_Thesis
Bryan Collazo Santiago
 
Scolari's ICCD17 Talk
NECST Lab @ Politecnico di Milano
 
Machine Learning meets DevOps
Pooyan Jamshidi
 
AI Library - An Open Source Machine Learning Framework
MLconf
 
Generative AI in CSharp with Semantic Kernel.pptx
Alon Fliess
 
Hybrid use of machine learning and ontology
Anthony (Tony) Sarris
 
Fast Approximate A-box Consistency Checking using Machine Learning
Heiko Paulheim
 
Ad

More from NECST Lab @ Politecnico di Milano (20)

PDF
Mesticheria Team - WiiReflex
NECST Lab @ Politecnico di Milano
 
PPTX
Punto e virgola Team - Stressometro
NECST Lab @ Politecnico di Milano
 
PDF
BitIt Team - Stay.straight
NECST Lab @ Politecnico di Milano
 
PDF
BabYodini Team - Talking Gloves
NECST Lab @ Politecnico di Milano
 
PDF
printf("Nome Squadra"); Team - NeoTon
NECST Lab @ Politecnico di Milano
 
PPTX
BlackBoard Team - Motion Tracking Platform
NECST Lab @ Politecnico di Milano
 
PDF
#include<brain.h> Team - HomeBeatHome
NECST Lab @ Politecnico di Milano
 
PDF
Flipflops Team - Wave U
NECST Lab @ Politecnico di Milano
 
PDF
Bug(atta) Team - Little Brother
NECST Lab @ Politecnico di Milano
 
PDF
#NECSTCamp: come partecipare
NECST Lab @ Politecnico di Milano
 
PDF
NECSTLab101 2020.2021
NECST Lab @ Politecnico di Milano
 
PDF
TreeHouse, nourish your community
NECST Lab @ Politecnico di Milano
 
PDF
TiReX: Tiled Regular eXpressionsmatching architecture
NECST Lab @ Politecnico di Milano
 
PDF
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 
PDF
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
NECST Lab @ Politecnico di Milano
 
PDF
EMPhASIS - An EMbedded Public Attention Stress Identification System
NECST Lab @ Politecnico di Milano
 
PDF
Luns - Automatic lungs segmentation through neural network
NECST Lab @ Politecnico di Milano
 
PDF
BlastFunction: How to combine Serverless and FPGAs
NECST Lab @ Politecnico di Milano
 
PDF
Maeve - Fast genome analysis leveraging exact string matching
NECST Lab @ Politecnico di Milano
 
Mesticheria Team - WiiReflex
NECST Lab @ Politecnico di Milano
 
Punto e virgola Team - Stressometro
NECST Lab @ Politecnico di Milano
 
BitIt Team - Stay.straight
NECST Lab @ Politecnico di Milano
 
BabYodini Team - Talking Gloves
NECST Lab @ Politecnico di Milano
 
printf("Nome Squadra"); Team - NeoTon
NECST Lab @ Politecnico di Milano
 
BlackBoard Team - Motion Tracking Platform
NECST Lab @ Politecnico di Milano
 
#include<brain.h> Team - HomeBeatHome
NECST Lab @ Politecnico di Milano
 
Flipflops Team - Wave U
NECST Lab @ Politecnico di Milano
 
Bug(atta) Team - Little Brother
NECST Lab @ Politecnico di Milano
 
#NECSTCamp: come partecipare
NECST Lab @ Politecnico di Milano
 
NECSTLab101 2020.2021
NECST Lab @ Politecnico di Milano
 
TreeHouse, nourish your community
NECST Lab @ Politecnico di Milano
 
TiReX: Tiled Regular eXpressionsmatching architecture
NECST Lab @ Politecnico di Milano
 
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
NECST Lab @ Politecnico di Milano
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
NECST Lab @ Politecnico di Milano
 
Luns - Automatic lungs segmentation through neural network
NECST Lab @ Politecnico di Milano
 
BlastFunction: How to combine Serverless and FPGAs
NECST Lab @ Politecnico di Milano
 
Maeve - Fast genome analysis leveraging exact string matching
NECST Lab @ Politecnico di Milano
 
Ad

Recently uploaded (20)

PPTX
Bharatiya Antariksh Hackathon 2025 Idea Submission PPT.pptx
AsadShad4
 
PPT
FINAL plumbing code for board exam passer
MattKristopherDiaz
 
PDF
PROGRAMMING REQUESTS/RESPONSES WITH GREATFREE IN THE CLOUD ENVIRONMENT
samueljackson3773
 
PPTX
Artificial Intelligence jejeiejj3iriejrjifirirjdjeie
VikingsGaming2
 
PDF
Authentication Devices in Fog-mobile Edge Computing Environments through a Wi...
ijujournal
 
PDF
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
DOCX
Engineering Geology Field Report to Malekhu .docx
justprashant567
 
PPTX
Unit_I Functional Units, Instruction Sets.pptx
logaprakash9
 
PPT
دراسة حاله لقرية تقع في جنوب غرب السودان
محمد قصص فتوتة
 
PPTX
Work at Height training for workers .pptx
cecos12
 
PDF
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
PDF
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
PDF
Module - 4 Machine Learning -22ISE62.pdf
Dr. Shivashankar
 
PDF
Generative AI & Scientific Research : Catalyst for Innovation, Ethics & Impact
AlqualsaDIResearchGr
 
PDF
輪読会資料_Miipher and Miipher2 .
NABLAS株式会社
 
PPTX
Comparison of Flexible and Rigid Pavements in Bangladesh
Arifur Rahman
 
PDF
PRIZ Academy - Process functional modelling
PRIZ Guru
 
PPTX
FSE_LLM4SE1_A Tool for In-depth Analysis of Code Execution Reasoning of Large...
cl144
 
PDF
FSE-Journal-First-Automated code editing with search-generate-modify.pdf
cl144
 
PPTX
Computer network Computer network Computer network Computer network
Shrikant317689
 
Bharatiya Antariksh Hackathon 2025 Idea Submission PPT.pptx
AsadShad4
 
FINAL plumbing code for board exam passer
MattKristopherDiaz
 
PROGRAMMING REQUESTS/RESPONSES WITH GREATFREE IN THE CLOUD ENVIRONMENT
samueljackson3773
 
Artificial Intelligence jejeiejj3iriejrjifirirjdjeie
VikingsGaming2
 
Authentication Devices in Fog-mobile Edge Computing Environments through a Wi...
ijujournal
 
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
Engineering Geology Field Report to Malekhu .docx
justprashant567
 
Unit_I Functional Units, Instruction Sets.pptx
logaprakash9
 
دراسة حاله لقرية تقع في جنوب غرب السودان
محمد قصص فتوتة
 
Work at Height training for workers .pptx
cecos12
 
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
Module - 4 Machine Learning -22ISE62.pdf
Dr. Shivashankar
 
Generative AI & Scientific Research : Catalyst for Innovation, Ethics & Impact
AlqualsaDIResearchGr
 
輪読会資料_Miipher and Miipher2 .
NABLAS株式会社
 
Comparison of Flexible and Rigid Pavements in Bangladesh
Arifur Rahman
 
PRIZ Academy - Process functional modelling
PRIZ Guru
 
FSE_LLM4SE1_A Tool for In-depth Analysis of Code Execution Reasoning of Large...
cl144
 
FSE-Journal-First-Automated code editing with search-generate-modify.pdf
cl144
 
Computer network Computer network Computer network Computer network
Shrikant317689
 

Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads

  • 1. Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - [email protected] UC Berkeley, 05/23/2018 Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
  • 2. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2
  • 3. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 good:0 bad:1 … This is a good product
  • 4. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state good:0 bad:1 … This is a good product
  • 5. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state good:0 bad:1 … This is a good product good:0 bad:1 … This is a good product
  • 6. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state • Two key requirements: 1. Performance: latency or throughput 2. Model density: number of models per machine good:0 bad:1 … This is a good product good:0 bad:1 … This is a good product
  • 7. 3 Limitations of black-box 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task [1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet [2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16
  • 8. • Overheads: JIT, GC, virtual calls, … • Operators are separate calls: no code fusion, multiple data accesses • Operators have different buffers and break locality 3 Limitations of black-box (a) We identify the operators by their parameters. The first operators, which have multiple versions with different length Figure 3: Probability for an operator within the 250 d 250⇥ y 100 pipelines have operator x, implying that pipe Figure 4: CDF of latency of prediction requests of 25 DAGs. We denote the first prediction as cold; the h line is reported as average over 100 predictions after warm-up period of 10. The plot is normalized over t 99th percentile latency of the hot case. describes this situation, where the performance of h predictions over the 250 sentiment analysis pipelines wi memory already allocated and JIT-compiled code is mo than an order-of-magnitude faster then the cold versio for the same pipelines. To drill down more into the pro Limited performance 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task [1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet [2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16
  • 9. • Overheads: JIT, GC, virtual calls, … • Operators are separate calls: no code fusion, multiple data accesses • Operators have different buffers and break locality 3 Limitations of black-box (a) We identify the operators by their parameters. The first operators, which have multiple versions with different length Figure 3: Probability for an operator within the 250 d 250⇥ y 100 pipelines have operator x, implying that pipe Figure 4: CDF of latency of prediction requests of 25 DAGs. We denote the first prediction as cold; the h line is reported as average over 100 predictions after warm-up period of 10. The plot is normalized over t 99th percentile latency of the hot case. describes this situation, where the performance of h predictions over the 250 sentiment analysis pipelines wi memory already allocated and JIT-compiled code is mo than an order-of-magnitude faster then the cold versio for the same pipelines. To drill down more into the pro (a) We identify the operators by their parameters. The first two groups represent N-gram operators, which have multiple versions with different length (e.g., unigram, trigram) (b) eac Figure 3: Probability for an operator within the 250 different pipelines. If an operator x 250⇥ y 100 pipelines have operator x, implying that pipelines can save memory by sharing • No state sharing: state is duplicated-> memory is wasted Limited performance Limited density 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task [1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet [2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16
  • 10. White-box: open the black box of models for them to co- exist better, be scheduled better 1. End-to-end Optimizations: merge operators in computational units (logical stages), to decrease overheads 2. Multi-model Optimizations: create once, use everywhere, for both data and stages 4 White-Box Design principles
  • 11. 5 Off-line phase [1] - Flour var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan recognize that the Linear Regression c Char and WordNgram, therefore bypas of Concat. Additionally, Tokenizer can Char and WordNgram, therefore it will CharNgram (in one stage) and a dep CharNgram and WordNGram (in anot created. The final plan will therefore b stages, versus the initial 4 operators (an Model Plan Compiler: Model plans two DAGs: a DAG composed of log DAG of physical stages. Logical stag tion of the stages output of the Oven O lated parameters; physical stages conta that will be executed by the PRETZEL given DAG, there is a 1-to-n mapping physical stages so that a logical stage
  • 12. 5 Off-line phase [1] - Flour var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan recognize that the Linear Regression c Char and WordNgram, therefore bypas of Concat. Additionally, Tokenizer can Char and WordNgram, therefore it will CharNgram (in one stage) and a dep CharNgram and WordNGram (in anot created. The final plan will therefore b stages, versus the initial 4 operators (an Model Plan Compiler: Model plans two DAGs: a DAG composed of log DAG of physical stages. Logical stag tion of the stages output of the Oven O lated parameters; physical stages conta that will be executed by the PRETZEL given DAG, there is a 1-to-n mapping physical stages so that a logical stage var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan Figure 6: Model optimization and compilation in PRET- ZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from recognize that the Linear Regression can be pushed in Char and WordNgram, therefore bypassing the executi of Concat. Additionally, Tokenizer can be reused betwe Char and WordNgram, therefore it will be pipelined w CharNgram (in one stage) and a dependency betwe CharNgram and WordNGram (in another stage) will created. The final plan will therefore be composed by stages, versus the initial 4 operators (and vectors) of ML Model Plan Compiler: Model plans are composed two DAGs: a DAG composed of logical stages, and DAG of physical stages. Logical stages are an abstra tion of the stages output of the Oven Optimizer, with lated parameters; physical stages contains the actual co that will be executed by the PRETZEL runtime. For.ea given DAG, there is a 1-to-n mapping between logical physical stages so that a logical stage can represent t execution code of different physical implementations. physical implementation is selected based on the param Flour API
  • 13. 5 Off-line phase [1] - Flourhe dynamic decisions on how to schedule plans on machine workload. Finally, a FrontEnd is used mit prediction requests to the system. del pipeline deployment and serving in PRETZEL a two-phase process. During the off-line phase n 4.1), ML2’s pre-trained pipelines are translated our transformation-based API. Oven optimizer re- es and fuses transformations into model plans com- of parameterized logical units called stages. Each stage is then AOT compiled into physical computa- its where memory resources and threads are pooled me. Model plans are registered for prediction serv- he Runtime where physical stages and parameters red between pipelines with similar model plans. In line phase (Section 4.2), when an inference request gistered model plan is received, physical stages are eterized dynamically with the proper values main- in the Object Store. The Scheduler is in charge of g physical stages to shared execution units. ures 6 and 7 pictorially summarize the above de- ons; note that only the on-line phase is executed rence time, whereas the model plans are generated etely off-line. Next, we will describe each layer sing the PRETZEL prediction system. Off-line Phase Flour al of Flour is to provide an intermediate represen- between ML frameworks (currently only ML2) and EL that is both easy to target and amenable to op- ions. Once a pipeline is ported into Flour, it can arrays indicate the number and type of input fields. The successive call to Tokenize in line 4 splits the input fields into tokens. Lines 6 and 7 contain the two branches defining the char-level and word-level n-gram transforma- tions, which are then merged with the Concat transform in line 9 before the linear binary classifier of line 10. Both char and word n-gram transformations are parametrized by the number of n-grams and maps translating n-grams into numerical format (not shown in the Figure). Addi- tionally, each Flour transformation accepts as input an optional set of statistics gathered from training. These statistics are used by the compiler to generate physical plans more efficiently tailored to the model characteristics. Example statistics are max vector size (to define the mini- mum size of vectors to fetch from the pool at prediction time 4.2), dense or sparse representations, etc. Listing 1: Flour program for the sentiment analysis pipeline. Transformations’ parameters are extracted from the original ML2 pipeline. 1 var fContext = new FlourContext(objectStore, ...) 2 var tTokenizer = fContext.CSV. 3 FromText(fields, fieldsType, sep). 4 Tokenize(); 5 6 var tCNgram = tTokenizer.CharNgram(numCNgrms, ...); 7 var tWNgram = tTokenizer.WordNgram(numWNgrms, ...); 8 var fPrgrm = tCNgram. 9 Concat(tWNgram). 10 ClassifierBinaryLinear(cParams); 11 12 return fPrgrm.Plan(); We have instrumented the ML2 library to collect statis- tics from training and with the related bindings to the Box of Machine Learning ving Systems # 355 “This is a nice product” Positive vs. Negative Tokenizer Char Ngram Word Ngram Concat Linear Regression Figure 1: A sentimental analysis pipeline consisting of operators for featurization (ellipses), followed by a ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string. Char and Word Ngrams featurize input tokens by extracting n-grams. 
Concat generates a var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan recognize that the Linear Regression c Char and WordNgram, therefore bypas of Concat. Additionally, Tokenizer can Char and WordNgram, therefore it will CharNgram (in one stage) and a dep CharNgram and WordNGram (in anot created. The final plan will therefore b stages, versus the initial 4 operators (an Model Plan Compiler: Model plans two DAGs: a DAG composed of log DAG of physical stages. Logical stag tion of the stages output of the Oven O lated parameters; physical stages conta that will be executed by the PRETZEL given DAG, there is a 1-to-n mapping physical stages so that a logical stage var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan Figure 6: Model optimization and compilation in PRET- ZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from recognize that the Linear Regression can be pushed in Char and WordNgram, therefore bypassing the executi of Concat. Additionally, Tokenizer can be reused betwe Char and WordNgram, therefore it will be pipelined w CharNgram (in one stage) and a dependency betwe CharNgram and WordNGram (in another stage) will created. The final plan will therefore be composed by stages, versus the initial 4 operators (and vectors) of ML Model Plan Compiler: Model plans are composed two DAGs: a DAG composed of logical stages, and DAG of physical stages. Logical stages are an abstra tion of the stages output of the Oven Optimizer, with lated parameters; physical stages contains the actual co that will be executed by the PRETZEL runtime. For.ea given DAG, there is a 1-to-n mapping between logical physical stages so that a logical stage can represent t execution code of different physical implementations. physical implementation is selected based on the param Flour API
  • 14. 5 Off-line phase [1] - Flour
[Slide shows Figure 6 of the paper, highlighting the Flour API step.]
Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program. Additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.
Flour API
  • 15. 5 Off-line phase [1] - Flour
[Slide highlights the Oven Optimiser step of Figure 6; caption as in slide 14.]
For the running example, the Oven Optimizer can recognize that the Linear Regression can be pushed into Char and WordNgram, therefore bypassing the execution of Concat. Additionally, Tokenizer can be reused between Char and WordNgram, therefore it will be pipelined with CharNgram (in one stage) and a dependency between CharNgram and WordNGram (in another stage) will be created. The final plan will therefore be composed of 2 stages, versus the initial 4 operators (and vectors) of ML2.
One-to-one transformations (such as most featurizers) are pipelined together in a single pass over the data. This strategy achieves best data locality because records are likely to reside in CPU registers [33, 38].
Oven Optimiser
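To make the single-pass pipelining concrete, here is a minimal C# sketch with entirely hypothetical names (this is not PRETZEL's actual code) of how a Tokenizer and a char-level n-gram featurizer could be fused into one stage, so each token is featurized while it is still hot in cache instead of being materialized in an intermediate buffer between two operator calls:

    using System;
    using System.Collections.Generic;

    // Fused Tokenizer + CharNgram stage (illustrative only).
    class FusedTokenCharNgramStage
    {
        private readonly int n;                       // n-gram length (model parameter)
        private readonly Dictionary<string, int> map; // n-gram -> feature index (model state)

        public FusedTokenCharNgramStage(int n, Dictionary<string, int> map)
        {
            this.n = n;
            this.map = map;
        }

        // One pass over the input: tokenize and featurize together.
        public void Run(string input, float[] output)
        {
            foreach (var token in input.Split(' '))          // "Tokenizer"
                for (int i = 0; i + n <= token.Length; i++)  // "CharNgram"
                    if (map.TryGetValue(token.Substring(i, n), out var idx))
                        output[idx] += 1f;                   // bag-of-ngrams count
        }
    }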
  • 16. 5 Off-line phase [1] - Flour
[Slide highlights the Model Plan Compiler step of Figure 6; caption as in slide 14.]
Model Plan Compiler: Model plans are composed of two DAGs: a DAG of logical stages, and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations. A physical implementation is selected based on the parameters characterizing a logical stage.
Flour API
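As an illustration of this 1-to-n mapping, the following hedged C# sketch (all names are hypothetical, not PRETZEL's API) shows one logical linear-scoring stage with two physical candidates, picked from training-time statistics:

    using System;

    // Training-time statistics attached to a logical stage.
    record ModelStats(double Sparsity);

    // One logical "linear scorer" stage; two physical implementations.
    interface IPhysicalStage { float Score(float[] weights, float[] features); }

    class DenseLinear : IPhysicalStage
    {
        public float Score(float[] w, float[] x)
        {
            float s = 0f;
            for (int i = 0; i < x.Length; i++) s += w[i] * x[i]; // visit every slot
            return s;
        }
    }

    class SparseLinear : IPhysicalStage
    {
        public float Score(float[] w, float[] x)
        {
            float s = 0f;
            for (int i = 0; i < x.Length; i++)
                if (x[i] != 0f) s += w[i] * x[i];                // skip zero entries
            return s;
        }
    }

    static class OvenCompilerSketch
    {
        // Pick the physical variant from the statistics, e.g. a sparse
        // kernel when feature vectors are known to be mostly zeros.
        public static IPhysicalStage SelectPhysical(ModelStats stats) =>
            stats.Sparsity > 0.9 ? new SparseLinear() : new DenseLinear();
    }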
  • 17. • Two main components: – Runtime, with an Object Store – Scheduler • Runtime handles physical resources: threads and buffers • Object Store caches objects of all models – models register and retrieve state objects via a key • Scheduler is event-based, each stage being an event 6 On-line phase [1] - Runtime
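A minimal sketch of the Object Store idea, assuming a simple key-value interface (the names below are hypothetical, not the actual PRETZEL API): models register state objects under a key, so pipelines with identical parameters share one in-memory object instead of each holding a copy.

    using System;
    using System.Collections.Concurrent;

    class ObjectStore
    {
        private readonly ConcurrentDictionary<string, object> store = new();

        // Returns the already-registered object when the key is known,
        // so N identical models pay the memory cost once.
        public T GetOrRegister<T>(string key, Func<T> create) where T : class
            => (T)store.GetOrAdd(key, _ => create());
    }

    // Usage: two models trained on the same data share one n-gram map.
    //   var ngrams = objectStore.GetOrRegister("charNgram:amazon-v1",
    //                                          LoadNgramMap);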
  • 18. • 7x less memory for SA models, 40% less for RT models • Higher model density, hence higher efficiency and profitability 7 Memory
Figure 8: Cumulative memory usage of the model pipelines with and without Object Store.
  • 19. • Hot vs cold scenarios • Without and with caching of partial results • PRETZEL vs ML.NET: cold predictions are 15x faster, hot predictions 2.6x faster 8 Latency
Figure 9: Latency comparison between ML2 and PRETZEL.
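The caching of partial results could look roughly like the following sketch (hypothetical names; PRETZEL's actual mechanism may differ): the expensive featurization prefix is memoized per input, so a hot, repeated request only pays for the final model score.

    using System;
    using System.Collections.Concurrent;

    class PartialResultCache
    {
        private readonly ConcurrentDictionary<string, float[]> cache = new();

        public float Predict(string input,
                             Func<string, float[]> featurize, // stages 1..k, expensive
                             Func<float[], float> score)      // final model, cheap
        {
            var features = cache.GetOrAdd(input, featurize);  // hit: skip featurization
            return score(features);
        }
    }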
  • 20. • Batch size is 1000 inputs, multiple runs of the same batch • Delay batching: as in Clipper, let requests tolerate a given delay • Batch while latency is smaller than the tolerated delay • ML.NET suffers from missed data sharing, i.e., higher memory traffic 9 Throughput
Figure 11: The average throughput computed among the models.
  • 21. • Batch size is 1000 inputs, multiple runs of the same batch • Delay batching: as in Clipper, let requests tolerate a given delay • Batch while latency is smaller than the tolerated delay • ML.NET suffers from missed data sharing, i.e., higher memory traffic 9 Throughput
[Excerpt from the paper:] PRETZEL scales linearly to the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled. The latency, instead, gracefully increases linearly with the increase of the load. In ML.NET, model state is not shared, thus increasing the pressure on the memory subsystem: indeed, even if the data values are the same, the model objects are mapped to different memory areas.
Delay Batching: As in Clipper, the PRETZEL FrontEnd allows users to specify a maximum delay they can wait to maximize throughput. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases. Interestingly, for such models even a small delay helps reaching the optimal batch size.
Figure 12: Throughput and latency of SA and RT models.
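A hedged sketch of the delay-batching loop described in the bullets above (hypothetical names; the real Clipper/PRETZEL frontends are more elaborate): requests accumulate until the batch is full or the tolerated delay is spent, then the whole batch is scored at once.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Threading;

    class DelayBatcher
    {
        private readonly TimeSpan maxDelay; // tolerated delay, set by the user
        private readonly int maxBatch;      // e.g. 1000 inputs

        public DelayBatcher(TimeSpan maxDelay, int maxBatch)
        { this.maxDelay = maxDelay; this.maxBatch = maxBatch; }

        // Collect requests until the batch is full or the delay budget is spent.
        public List<string> NextBatch(ConcurrentQueue<string> pending)
        {
            var batch = new List<string>();
            var sw = Stopwatch.StartNew();
            while (batch.Count < maxBatch && sw.Elapsed < maxDelay)
            {
                if (pending.TryDequeue(out var request)) batch.Add(request);
                else Thread.Sleep(0); // yield while waiting for more requests
            }
            return batch;
        }
    }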
  • 22. • We addressed performance and density bottlenecks in ML inference for MaaS by advocating a white-box approach • Future work - Code generation - Support for more model formats than ML.NET - Distributed version - NUMA awareness 10 Conclusions and current work
  • 23. • Physical stages can be offloaded to FPGA • How to deploy it? – Fixed subset of common operators – Multiple operators in kernels – Whole model, deployed via partial reconfiguration 11 Future work with FPGA
  • 24. • Physical stages can be offloaded to FPGA • How to deploy it? – Fixed subset of common operators – Multiple operators in kernels – Whole model, deployed via partial reconfiguration 11 Future work with FPGA QUESTIONS? Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it