PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun

Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
UC Berkeley, 05/23/2018
ML-as-a-Service
• ML models are deployed on cloud platforms as black boxes
• Users deploy multiple models per machine (10-100s)
• Deployed models are often similar
  - Similar structure
  - Similar state
• Two key requirements:
  1. Performance: latency or throughput
  2. Model density: number of models per machine
[Slide graphic: an input sentence such as "This is a good product" is sent to several deployed models, each producing scores such as good:0, bad:1, …]
Limitations of black-box
300 models in ML.NET [1], C#, from the Amazon Reviews dataset [2]:
250 Sentiment Analysis + 50 Regression Task

Limited performance:
• Overheads: JIT, GC, virtual calls, …
• Operators are separate calls: no code fusion, multiple data accesses
• Operators have different buffers and break locality

Limited density:
• No state sharing: state is duplicated, so memory is wasted

[Figure 3: Probability for an operator to appear within the 250 different pipelines; if an operator x appears with probability y (in %), 250×y/100 pipelines have operator x, implying that pipelines can save memory by sharing state. Operators are identified by their parameters; the first two groups represent N-gram operators, which have multiple versions with different lengths (e.g., unigram, trigram).]
[Figure 4: CDF of latency of prediction requests over the 250 DAGs. The first prediction is denoted as cold; the hot line is the average over 100 predictions after a warm-up period of 10, and the plot is normalized over the 99th-percentile latency of the hot case. Hot predictions, with memory already allocated and JIT-compiled code, are more than an order of magnitude faster than the cold version for the same pipelines.]

[1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16.
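The missing code fusion is easiest to see in code. The following is a minimal C# sketch (hypothetical code, not ML.NET or PRETZEL internals) contrasting the black-box style, where each operator is a separate virtual call writing to its own buffer, with a fused loop that keeps intermediate values in registers:

using System;

// Black-box style: each operator is a virtual call with its own
// output buffer, so data is written and re-read between operators.
interface IOperator { float[] Apply(float[] input); }

sealed class Scale : IOperator
{
    public float[] Apply(float[] input)
    {
        var output = new float[input.Length];  // separate buffer
        for (int i = 0; i < input.Length; i++) output[i] = input[i] * 0.5f;
        return output;
    }
}

sealed class AddBias : IOperator
{
    public float[] Apply(float[] input)
    {
        var output = new float[input.Length];  // yet another buffer
        for (int i = 0; i < input.Length; i++) output[i] = input[i] + 1.0f;
        return output;
    }
}

static class FusionDemo
{
    static void Main()
    {
        var input = new float[] { 1f, 2f, 3f };

        // Black-box chain: two virtual calls, two allocations, two passes.
        IOperator[] ops = { new Scale(), new AddBias() };
        float[] chained = input;
        foreach (var op in ops) chained = op.Apply(chained);

        // Fused version: one pass, no intermediate buffer; values stay in
        // registers, which is what merging operators into one unit buys.
        var fused = new float[input.Length];
        for (int i = 0; i < input.Length; i++)
            fused[i] = input[i] * 0.5f + 1.0f;

        Console.WriteLine(string.Join(", ", fused)); // 1.5, 2, 2.5
    }
}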
White-Box Design principles
White-box: open the black box of models so that they can co-exist better and be scheduled better.
1. End-to-end Optimizations: merge operators into computational units (logical stages) to decrease overheads
2. Multi-model Optimizations: create once, use everywhere, for both data and stages
Off-line phase [1] - Flour

[Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program. Additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.]
The Scheduler makes the dynamic decisions on how to schedule plans based on machine workload. Finally, a FrontEnd is used to submit prediction requests to the system.

Model pipeline deployment and serving in PRETZEL is a two-phase process. During the off-line phase (Section 4.1), ML2's pre-trained pipelines are translated into Flour, a transformation-based API. The Oven optimizer rearranges and fuses transformations into model plans composed of parameterized logical units called stages. Each logical stage is then AOT-compiled into physical computation units where memory resources and threads are pooled at runtime. Model plans are registered for prediction serving in the Runtime, where physical stages and parameters are shared between pipelines with similar model plans. In the on-line phase (Section 4.2), when an inference request for a registered model plan is received, physical stages are parameterized dynamically with the proper values maintained in the Object Store. The Scheduler is in charge of binding physical stages to shared execution units. Figures 6 and 7 pictorially summarize the above descriptions; note that only the on-line phase is executed at inference time, whereas the model plans are generated completely off-line.

Flour: the goal of Flour is to provide an intermediate representation between ML frameworks (currently only ML2) and PRETZEL that is both easy to target and amenable to optimizations. Once a pipeline is ported into Flour, it can then be optimized and compiled into a model plan.
[Figure 1: A sentiment analysis pipeline consisting of operators for featurization (ellipses), followed by an ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string (e.g., "This is a nice product"). Char and Word Ngrams featurize input tokens by extracting n-grams. Concat generates a single feature vector, which the Linear Regression model scores as Positive vs. Negative.]

Listing 1: Flour program for the sentiment analysis pipeline. Transformations' parameters are extracted from the original ML2 pipeline.

1  var fContext = new FlourContext(objectStore, ...);
2  var tTokenizer = fContext.CSV.
3                     FromText(fields, fieldsType, sep).
4                     Tokenize();
5
6  var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
7  var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
8  var fPrgrm = tCNgram.
9                 Concat(tWNgram).
10                ClassifierBinaryLinear(cParams);
11
12 return fPrgrm.Plan();

The fields and fieldsType arrays indicate the number and type of input fields. The successive call to Tokenize in line 4 splits the input fields into tokens. Lines 6 and 7 contain the two branches defining the char-level and word-level n-gram transformations, which are then merged with the Concat transform in line 9 before the linear binary classifier of line 10. Both char and word n-gram transformations are parametrized by the number of n-grams and by maps translating n-grams into numerical format (not shown in the figure). Additionally, each Flour transformation accepts as input an optional set of statistics gathered from training. These statistics are used by the compiler to generate physical plans more efficiently tailored to the model characteristics. Example statistics are max vector size (to define the minimum size of vectors to fetch from the pool at prediction time, Section 4.2), dense or sparse representations, etc. We have instrumented the ML2 library to collect such statistics from training.
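To picture how statistics travel with a transformation, here is a hypothetical C# sketch in the spirit of the Flour API above (the TransformStats type and the extra CharNgram argument are assumptions, not the actual API):

// Hypothetical container for training-time statistics that a Flour
// transformation could accept, letting the compiler size buffers and
// choose dense vs. sparse physical implementations.
sealed class TransformStats
{
    public int MaxVectorSize { get; set; }  // minimum pool vector size at prediction time
    public bool IsSparse { get; set; }      // dense vs. sparse representation
}

// Hypothetical usage, mirroring line 6 of Listing 1:
// var cStats = new TransformStats { MaxVectorSize = 4096, IsSparse = true };
// var tCNgram = tTokenizer.CharNgram(numCNgrms, ngramMap, cStats);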
Oven Optimizer: transformations that can be pipelined (such as most featurizers) are pipelined together in a single pass over the data. This strategy achieves the best data locality because records are likely to reside in CPU registers [33, 38]. In the sentiment analysis pipeline, the optimizer can recognize that the Linear Regression can be pushed into Char and WordNgram, therefore bypassing the execution of Concat. Additionally, Tokenizer can be reused between Char and WordNgram; therefore it will be pipelined with CharNgram (in one stage), and a dependency between CharNgram and WordNgram (in another stage) will be created. The final plan will therefore be composed of two stages, versus the initial 4 operators (and vectors) of ML.NET.

Model Plan Compiler: model plans are composed of two DAGs: a DAG of logical stages and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations; the physical implementation is selected based on the parameters characterizing the logical stage. A sketch of this mapping follows.
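The 1-to-n mapping can be pictured as follows; all types here are hypothetical stand-ins, not PRETZEL's classes. One logical stage carries parameters and statistics, and the compiler selects among several physical implementations based on them (e.g., sparse vs. dense n-gram featurization):

using System;

// Hypothetical stand-ins: a logical stage holds parameters/statistics;
// a physical stage is the executable code selected for it.
sealed record LogicalStage(string Op, bool SparseInput, int MaxVectorSize);

delegate float[] PhysicalStage(float[] input);

static class PlanCompilerDemo
{
    // 1-to-n mapping: one logical stage, several candidate implementations;
    // the physical one is picked from the parameters characterizing the stage.
    static PhysicalStage Compile(LogicalStage stage) => stage switch
    {
        { Op: "ngram", SparseInput: true }  => new PhysicalStage(SparseNgram),
        { Op: "ngram", SparseInput: false } => new PhysicalStage(DenseNgram),
        _ => throw new NotSupportedException(stage.Op),
    };

    static float[] DenseNgram(float[] v)  { /* would touch every slot */ return v; }
    static float[] SparseNgram(float[] v) { /* would skip zero entries */ return v; }

    static void Main()
    {
        var logical = new LogicalStage("ngram", SparseInput: true, MaxVectorSize: 4096);
        PhysicalStage physical = Compile(logical);
        Console.WriteLine(physical(new float[4]).Length); // 4
    }
}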
On-line phase [1] - Runtime
• Two main components:
  – Runtime, with an Object Store
  – Scheduler
• Runtime handles physical resources: threads and buffers
• Object Store caches objects of all models
  – models register and retrieve state objects via a key
• Scheduler is event-based, each stage being an event
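A minimal sketch of the key-based sharing the Object Store enables (hypothetical types, not PRETZEL's API): two pipelines that need the same state object register it under the same key and end up referencing a single copy.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical Object Store: models register state objects under a key
// and retrieve them later; identical state is materialized only once.
sealed class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects = new();

    // Returns the already-registered object if the key exists, so
    // pipelines with similar model plans share one copy of the state.
    public T GetOrRegister<T>(string key, Func<T> factory) where T : class =>
        (T)_objects.GetOrAdd(key, _ => factory());
}

static class ObjectStoreDemo
{
    static void Main()
    {
        var store = new ObjectStore();

        // Two pipelines ask for the same n-gram dictionary:
        // only the first call materializes it.
        var dictA = store.GetOrRegister("ngrams/amazon-v1",
                                        () => new Dictionary<string, int>());
        var dictB = store.GetOrRegister("ngrams/amazon-v1",
                                        () => new Dictionary<string, int>());

        Console.WriteLine(ReferenceEquals(dictA, dictB)); // True: shared state
    }
}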
Memory
• 7x less memory for SA models, 40% less for RT models
• Higher model density, hence higher efficiency and profitability
[Figure 8: Cumulative memory usage of the model pipelines with and without the Object Store (normalized).]
Latency
• Hot vs. cold scenarios
• Without and with caching of partial results
• PRETZEL vs. ML.NET: cold is 15x faster, hot is 2.6x faster
[Figure 9: Latency comparison between ML2 and PRETZEL, with normalized values; cliffs indicate pipelines …]
Throughput
• Batch size is 1000 inputs, with multiple runs of the same batch
• PRETZEL scales linearly with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled; latency, instead, gracefully increases linearly with the increase of the load
• ML.NET suffers from missed data sharing, i.e., higher memory traffic: model objects are not shared, thus increasing the pressure on the memory subsystem; indeed, even if the data values are the same, the model objects are mapped to different memory areas
• Delay batching: as in Clipper, the PRETZEL FrontEnd allows users to specify a maximum delay they can wait in order to maximize throughput; requests are batched while latency stays below the tolerated delay. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases; interestingly, even a small delay helps reach the optimal batch size (see the sketch after this slide)
[Figure 11: Average throughput, with the number of cores on the x-axis.]
[Figure 12: Throughput and latency of SA and RT models as the tolerated delay increases.]
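A minimal sketch of the delay-batching policy (hypothetical code, not the PRETZEL FrontEnd): requests accumulate until either the batch is full or the oldest request has waited the maximum tolerated delay.

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical delay batcher: trade a bounded extra wait for a larger
// batch, as with Clipper's and PRETZEL's maximum-delay knob.
sealed class DelayBatcher<T>
{
    private readonly Queue<T> _pending = new();
    private readonly Stopwatch _oldest = new();
    private readonly int _maxBatch;
    private readonly TimeSpan _maxDelay;

    public DelayBatcher(int maxBatch, TimeSpan maxDelay)
    {
        _maxBatch = maxBatch;
        _maxDelay = maxDelay;
    }

    public void Enqueue(T request)
    {
        if (_pending.Count == 0) _oldest.Restart();  // first request starts the clock
        _pending.Enqueue(request);
    }

    // Flush when the batch is full or the oldest request has waited
    // as long as the caller is willing to tolerate.
    public bool TryTakeBatch(out List<T> batch)
    {
        if (_pending.Count >= _maxBatch ||
            (_pending.Count > 0 && _oldest.Elapsed >= _maxDelay))
        {
            batch = new List<T>(_pending);
            _pending.Clear();
            return true;
        }
        batch = null;
        return false;
    }
}

// Usage: var batcher = new DelayBatcher<string>(maxBatch: 1000,
//                                               maxDelay: TimeSpan.FromMilliseconds(5));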
Conclusions and current work
• We addressed the performance and density bottlenecks of ML inference in MaaS by advocating a white-box approach
• Future work:
  - Code generation
  - Support for more model formats beyond ML.NET
  - A distributed version
  - NUMA awareness
Future work with FPGA
• Physical stages can be offloaded to FPGA
• How to deploy it?
  – A fixed subset of common operators
  – Multiple operators fused in kernels
  – The whole model, deployed via partial reconfiguration
QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it

More Related Content

What's hot (20)

PDF
Chapter 4: Parallel Programming Languages
Heman Pathak
 
PDF
Parallel External Memory Algorithms Applied to Generalized Linear Models
Revolution Analytics
 
PPS
PRAM algorithms from deepika
guest1f4fb3
 
DOCX
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
PPT
Parallel algorithms
guest084d20
 
DOCX
Graph based transistor network generation method for supergate design
Ieee Xpert
 
DOCX
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
DOCX
High performance nb-ldpc decoder with reduction of message exchange
Ieee Xpert
 
PPTX
Matrix multiplication
International Islamic University
 
PDF
Elementary Parallel Algorithms
Heman Pathak
 
PPTX
ECE 565 FInal Project
Lakshmi Yasaswi Kamireddy
 
PDF
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Pooyan Jamshidi
 
PDF
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET Journal
 
PPT
Chapter 4 pc
Hanif Durad
 
PPT
3DD 1e SyCers
Marco Santambrogio
 
PDF
Design and Estimation of delay, power and area for Parallel prefix adders
IJERA Editor
 
PDF
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
IJET - International Journal of Engineering and Techniques
 
PPT
32-bit unsigned multiplier by using CSLA & CLAA
Ganesh Sambasivarao
 
PDF
Parallel Algorithms
Heman Pathak
 
PDF
D0212326
inventionjournals
 
Chapter 4: Parallel Programming Languages
Heman Pathak
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Revolution Analytics
 
PRAM algorithms from deepika
guest1f4fb3
 
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
Parallel algorithms
guest084d20
 
Graph based transistor network generation method for supergate design
Ieee Xpert
 
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
High performance nb-ldpc decoder with reduction of message exchange
Ieee Xpert
 
Matrix multiplication
International Islamic University
 
Elementary Parallel Algorithms
Heman Pathak
 
ECE 565 FInal Project
Lakshmi Yasaswi Kamireddy
 
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Pooyan Jamshidi
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET Journal
 
Chapter 4 pc
Hanif Durad
 
3DD 1e SyCers
Marco Santambrogio
 
Design and Estimation of delay, power and area for Parallel prefix adders
IJERA Editor
 
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
IJET - International Journal of Engineering and Techniques
 
32-bit unsigned multiplier by using CSLA & CLAA
Ganesh Sambasivarao
 
Parallel Algorithms
Heman Pathak
 

Similar to Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads (20)

PDF
Pretzel: optimized Machine Learning framework for low-latency and high throug...
NECST Lab @ Politecnico di Milano
 
PDF
Pretzel: optimized Machine Learning framework for low-latency and high throu...
NECST Lab @ Politecnico di Milano
 
PDF
Machine Learning @NECST
NECST Lab @ Politecnico di Milano
 
PDF
MLSEV Virtual. From my First BigML Project to Production
BigML, Inc
 
PDF
[ESWC2017 - PhD Symposium] Enhancing white-box machine learning processes by ...
Gilles Vandewiele
 
PPTX
Accelerating Deep Learning Inference 
on Mobile Systems
Darian Frajberg
 
PPTX
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
PPTX
Machine Learning With ML.NET
Dev Raj Gautam
 
PDF
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
Pooyan Jamshidi
 
PDF
Continuous Architecting of Stream-Based Systems
CHOOSE
 
PDF
DutchMLSchool 2022 - Automation
BigML, Inc
 
PPTX
Compositional AI: Fusion of AI/ML Services
Debmalya Biswas
 
PDF
CGI trainees workshop Distributed Deep Learning, 24/5 2019, Kim Hammar
Kim Hammar
 
PDF
Machine_Learning_Blocks___Bryan_Thesis
Bryan Collazo Santiago
 
PDF
Scolari's ICCD17 Talk
NECST Lab @ Politecnico di Milano
 
PDF
Machine Learning meets DevOps
Pooyan Jamshidi
 
PDF
AI Library - An Open Source Machine Learning Framework
MLconf
 
PPTX
Generative AI in CSharp with Semantic Kernel.pptx
Alon Fliess
 
PDF
Hybrid use of machine learning and ontology
Anthony (Tony) Sarris
 
ODP
Fast Approximate A-box Consistency Checking using Machine Learning
Heiko Paulheim
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
NECST Lab @ Politecnico di Milano
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
NECST Lab @ Politecnico di Milano
 
Machine Learning @NECST
NECST Lab @ Politecnico di Milano
 
MLSEV Virtual. From my First BigML Project to Production
BigML, Inc
 
[ESWC2017 - PhD Symposium] Enhancing white-box machine learning processes by ...
Gilles Vandewiele
 
Accelerating Deep Learning Inference 
on Mobile Systems
Darian Frajberg
 
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
Machine Learning With ML.NET
Dev Raj Gautam
 
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
Pooyan Jamshidi
 
Continuous Architecting of Stream-Based Systems
CHOOSE
 
DutchMLSchool 2022 - Automation
BigML, Inc
 
Compositional AI: Fusion of AI/ML Services
Debmalya Biswas
 
CGI trainees workshop Distributed Deep Learning, 24/5 2019, Kim Hammar
Kim Hammar
 
Machine_Learning_Blocks___Bryan_Thesis
Bryan Collazo Santiago
 
Scolari's ICCD17 Talk
NECST Lab @ Politecnico di Milano
 
Machine Learning meets DevOps
Pooyan Jamshidi
 
AI Library - An Open Source Machine Learning Framework
MLconf
 
Generative AI in CSharp with Semantic Kernel.pptx
Alon Fliess
 
Hybrid use of machine learning and ontology
Anthony (Tony) Sarris
 
Fast Approximate A-box Consistency Checking using Machine Learning
Heiko Paulheim
 
Ad

More from NECST Lab @ Politecnico di Milano (20)

PDF
Mesticheria Team - WiiReflex
NECST Lab @ Politecnico di Milano
 
PPTX
Punto e virgola Team - Stressometro
NECST Lab @ Politecnico di Milano
 
PDF
BitIt Team - Stay.straight
NECST Lab @ Politecnico di Milano
 
PDF
BabYodini Team - Talking Gloves
NECST Lab @ Politecnico di Milano
 
PDF
printf("Nome Squadra"); Team - NeoTon
NECST Lab @ Politecnico di Milano
 
PPTX
BlackBoard Team - Motion Tracking Platform
NECST Lab @ Politecnico di Milano
 
PDF
#include<brain.h> Team - HomeBeatHome
NECST Lab @ Politecnico di Milano
 
PDF
Flipflops Team - Wave U
NECST Lab @ Politecnico di Milano
 
PDF
Bug(atta) Team - Little Brother
NECST Lab @ Politecnico di Milano
 
PDF
#NECSTCamp: come partecipare
NECST Lab @ Politecnico di Milano
 
PDF
NECSTLab101 2020.2021
NECST Lab @ Politecnico di Milano
 
PDF
TreeHouse, nourish your community
NECST Lab @ Politecnico di Milano
 
PDF
TiReX: Tiled Regular eXpressionsmatching architecture
NECST Lab @ Politecnico di Milano
 
PDF
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 
PDF
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
NECST Lab @ Politecnico di Milano
 
PDF
EMPhASIS - An EMbedded Public Attention Stress Identification System
NECST Lab @ Politecnico di Milano
 
PDF
Luns - Automatic lungs segmentation through neural network
NECST Lab @ Politecnico di Milano
 
PDF
BlastFunction: How to combine Serverless and FPGAs
NECST Lab @ Politecnico di Milano
 
PDF
Maeve - Fast genome analysis leveraging exact string matching
NECST Lab @ Politecnico di Milano
 
Mesticheria Team - WiiReflex
NECST Lab @ Politecnico di Milano
 
Punto e virgola Team - Stressometro
NECST Lab @ Politecnico di Milano
 
BitIt Team - Stay.straight
NECST Lab @ Politecnico di Milano
 
BabYodini Team - Talking Gloves
NECST Lab @ Politecnico di Milano
 
printf("Nome Squadra"); Team - NeoTon
NECST Lab @ Politecnico di Milano
 
BlackBoard Team - Motion Tracking Platform
NECST Lab @ Politecnico di Milano
 
#include<brain.h> Team - HomeBeatHome
NECST Lab @ Politecnico di Milano
 
Flipflops Team - Wave U
NECST Lab @ Politecnico di Milano
 
Bug(atta) Team - Little Brother
NECST Lab @ Politecnico di Milano
 
#NECSTCamp: come partecipare
NECST Lab @ Politecnico di Milano
 
NECSTLab101 2020.2021
NECST Lab @ Politecnico di Milano
 
TreeHouse, nourish your community
NECST Lab @ Politecnico di Milano
 
TiReX: Tiled Regular eXpressionsmatching architecture
NECST Lab @ Politecnico di Milano
 
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
NECST Lab @ Politecnico di Milano
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
NECST Lab @ Politecnico di Milano
 
Luns - Automatic lungs segmentation through neural network
NECST Lab @ Politecnico di Milano
 
BlastFunction: How to combine Serverless and FPGAs
NECST Lab @ Politecnico di Milano
 
Maeve - Fast genome analysis leveraging exact string matching
NECST Lab @ Politecnico di Milano
 
Ad

Recently uploaded (20)

PPTX
Bharatiya Antariksh Hackathon 2025 Idea Submission PPT.pptx
AsadShad4
 
PPT
FINAL plumbing code for board exam passer
MattKristopherDiaz
 
PDF
PROGRAMMING REQUESTS/RESPONSES WITH GREATFREE IN THE CLOUD ENVIRONMENT
samueljackson3773
 
PPTX
Artificial Intelligence jejeiejj3iriejrjifirirjdjeie
VikingsGaming2
 
PDF
Authentication Devices in Fog-mobile Edge Computing Environments through a Wi...
ijujournal
 
PDF
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
DOCX
Engineering Geology Field Report to Malekhu .docx
justprashant567
 
PPTX
Unit_I Functional Units, Instruction Sets.pptx
logaprakash9
 
PPT
دراسة حاله لقرية تقع في جنوب غرب السودان
محمد قصص فتوتة
 
PPTX
Work at Height training for workers .pptx
cecos12
 
PDF
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
PDF
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
PDF
Module - 4 Machine Learning -22ISE62.pdf
Dr. Shivashankar
 
PDF
Generative AI & Scientific Research : Catalyst for Innovation, Ethics & Impact
AlqualsaDIResearchGr
 
PDF
輪読会資料_Miipher and Miipher2 .
NABLAS株式会社
 
PPTX
Comparison of Flexible and Rigid Pavements in Bangladesh
Arifur Rahman
 
PDF
PRIZ Academy - Process functional modelling
PRIZ Guru
 
PPTX
FSE_LLM4SE1_A Tool for In-depth Analysis of Code Execution Reasoning of Large...
cl144
 
PDF
FSE-Journal-First-Automated code editing with search-generate-modify.pdf
cl144
 
PPTX
Computer network Computer network Computer network Computer network
Shrikant317689
 
Bharatiya Antariksh Hackathon 2025 Idea Submission PPT.pptx
AsadShad4
 
FINAL plumbing code for board exam passer
MattKristopherDiaz
 
PROGRAMMING REQUESTS/RESPONSES WITH GREATFREE IN THE CLOUD ENVIRONMENT
samueljackson3773
 
Artificial Intelligence jejeiejj3iriejrjifirirjdjeie
VikingsGaming2
 
Authentication Devices in Fog-mobile Edge Computing Environments through a Wi...
ijujournal
 
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
Engineering Geology Field Report to Malekhu .docx
justprashant567
 
Unit_I Functional Units, Instruction Sets.pptx
logaprakash9
 
دراسة حاله لقرية تقع في جنوب غرب السودان
محمد قصص فتوتة
 
Work at Height training for workers .pptx
cecos12
 
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
Module - 4 Machine Learning -22ISE62.pdf
Dr. Shivashankar
 
Generative AI & Scientific Research : Catalyst for Innovation, Ethics & Impact
AlqualsaDIResearchGr
 
輪読会資料_Miipher and Miipher2 .
NABLAS株式会社
 
Comparison of Flexible and Rigid Pavements in Bangladesh
Arifur Rahman
 
PRIZ Academy - Process functional modelling
PRIZ Guru
 
FSE_LLM4SE1_A Tool for In-depth Analysis of Code Execution Reasoning of Large...
cl144
 
FSE-Journal-First-Automated code editing with search-generate-modify.pdf
cl144
 
Computer network Computer network Computer network Computer network
Shrikant317689
 

Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads

  • 1. Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - [email protected] UC Berkeley, 05/23/2018 Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
  • 2. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2
  • 3. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 good:0 bad:1 … This is a good product
  • 4. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state good:0 bad:1 … This is a good product
  • 5. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state good:0 bad:1 … This is a good product good:0 bad:1 … This is a good product
  • 6. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state • Two key requirements: 1. Performance: latency or throughput 2. Model density: number of models per machine good:0 bad:1 … This is a good product good:0 bad:1 … This is a good product
  • 7. 3 Limitations of black-box 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task [1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet [2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16
  • 8. • Overheads: JIT, GC, virtual calls, … • Operators are separate calls: no code fusion, multiple data accesses • Operators have different buffers and break locality 3 Limitations of black-box (a) We identify the operators by their parameters. The first operators, which have multiple versions with different length Figure 3: Probability for an operator within the 250 d 250⇥ y 100 pipelines have operator x, implying that pipe Figure 4: CDF of latency of prediction requests of 25 DAGs. We denote the first prediction as cold; the h line is reported as average over 100 predictions after warm-up period of 10. The plot is normalized over t 99th percentile latency of the hot case. describes this situation, where the performance of h predictions over the 250 sentiment analysis pipelines wi memory already allocated and JIT-compiled code is mo than an order-of-magnitude faster then the cold versio for the same pipelines. To drill down more into the pro Limited performance 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task [1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet [2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16
  • 9. • Overheads: JIT, GC, virtual calls, … • Operators are separate calls: no code fusion, multiple data accesses • Operators have different buffers and break locality 3 Limitations of black-box (a) We identify the operators by their parameters. The first operators, which have multiple versions with different length Figure 3: Probability for an operator within the 250 d 250⇥ y 100 pipelines have operator x, implying that pipe Figure 4: CDF of latency of prediction requests of 25 DAGs. We denote the first prediction as cold; the h line is reported as average over 100 predictions after warm-up period of 10. The plot is normalized over t 99th percentile latency of the hot case. describes this situation, where the performance of h predictions over the 250 sentiment analysis pipelines wi memory already allocated and JIT-compiled code is mo than an order-of-magnitude faster then the cold versio for the same pipelines. To drill down more into the pro (a) We identify the operators by their parameters. The first two groups represent N-gram operators, which have multiple versions with different length (e.g., unigram, trigram) (b) eac Figure 3: Probability for an operator within the 250 different pipelines. If an operator x 250⇥ y 100 pipelines have operator x, implying that pipelines can save memory by sharing • No state sharing: state is duplicated-> memory is wasted Limited performance Limited density 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task [1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet [2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16
  • 10. White-box: open the black box of models for them to co- exist better, be scheduled better 1. End-to-end Optimizations: merge operators in computational units (logical stages), to decrease overheads 2. Multi-model Optimizations: create once, use everywhere, for both data and stages 4 White-Box Design principles
  • 11. 5 Off-line phase [1] - Flour var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan recognize that the Linear Regression c Char and WordNgram, therefore bypas of Concat. Additionally, Tokenizer can Char and WordNgram, therefore it will CharNgram (in one stage) and a dep CharNgram and WordNGram (in anot created. The final plan will therefore b stages, versus the initial 4 operators (an Model Plan Compiler: Model plans two DAGs: a DAG composed of log DAG of physical stages. Logical stag tion of the stages output of the Oven O lated parameters; physical stages conta that will be executed by the PRETZEL given DAG, there is a 1-to-n mapping physical stages so that a logical stage
  • 12. 5 Off-line phase [1] - Flour var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan recognize that the Linear Regression c Char and WordNgram, therefore bypas of Concat. Additionally, Tokenizer can Char and WordNgram, therefore it will CharNgram (in one stage) and a dep CharNgram and WordNGram (in anot created. The final plan will therefore b stages, versus the initial 4 operators (an Model Plan Compiler: Model plans two DAGs: a DAG composed of log DAG of physical stages. Logical stag tion of the stages output of the Oven O lated parameters; physical stages conta that will be executed by the PRETZEL given DAG, there is a 1-to-n mapping physical stages so that a logical stage var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan Figure 6: Model optimization and compilation in PRET- ZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from recognize that the Linear Regression can be pushed in Char and WordNgram, therefore bypassing the executi of Concat. Additionally, Tokenizer can be reused betwe Char and WordNgram, therefore it will be pipelined w CharNgram (in one stage) and a dependency betwe CharNgram and WordNGram (in another stage) will created. The final plan will therefore be composed by stages, versus the initial 4 operators (and vectors) of ML Model Plan Compiler: Model plans are composed two DAGs: a DAG composed of logical stages, and DAG of physical stages. Logical stages are an abstra tion of the stages output of the Oven Optimizer, with lated parameters; physical stages contains the actual co that will be executed by the PRETZEL runtime. For.ea given DAG, there is a 1-to-n mapping between logical physical stages so that a logical stage can represent t execution code of different physical implementations. physical implementation is selected based on the param Flour API
  • 13. 5 Off-line phase [1] - Flourhe dynamic decisions on how to schedule plans on machine workload. Finally, a FrontEnd is used mit prediction requests to the system. del pipeline deployment and serving in PRETZEL a two-phase process. During the off-line phase n 4.1), ML2’s pre-trained pipelines are translated our transformation-based API. Oven optimizer re- es and fuses transformations into model plans com- of parameterized logical units called stages. Each stage is then AOT compiled into physical computa- its where memory resources and threads are pooled me. Model plans are registered for prediction serv- he Runtime where physical stages and parameters red between pipelines with similar model plans. In line phase (Section 4.2), when an inference request gistered model plan is received, physical stages are eterized dynamically with the proper values main- in the Object Store. The Scheduler is in charge of g physical stages to shared execution units. ures 6 and 7 pictorially summarize the above de- ons; note that only the on-line phase is executed rence time, whereas the model plans are generated etely off-line. Next, we will describe each layer sing the PRETZEL prediction system. Off-line Phase Flour al of Flour is to provide an intermediate represen- between ML frameworks (currently only ML2) and EL that is both easy to target and amenable to op- ions. Once a pipeline is ported into Flour, it can arrays indicate the number and type of input fields. The successive call to Tokenize in line 4 splits the input fields into tokens. Lines 6 and 7 contain the two branches defining the char-level and word-level n-gram transforma- tions, which are then merged with the Concat transform in line 9 before the linear binary classifier of line 10. Both char and word n-gram transformations are parametrized by the number of n-grams and maps translating n-grams into numerical format (not shown in the Figure). Addi- tionally, each Flour transformation accepts as input an optional set of statistics gathered from training. These statistics are used by the compiler to generate physical plans more efficiently tailored to the model characteristics. Example statistics are max vector size (to define the mini- mum size of vectors to fetch from the pool at prediction time 4.2), dense or sparse representations, etc. Listing 1: Flour program for the sentiment analysis pipeline. Transformations’ parameters are extracted from the original ML2 pipeline. 1 var fContext = new FlourContext(objectStore, ...) 2 var tTokenizer = fContext.CSV. 3 FromText(fields, fieldsType, sep). 4 Tokenize(); 5 6 var tCNgram = tTokenizer.CharNgram(numCNgrms, ...); 7 var tWNgram = tTokenizer.WordNgram(numWNgrms, ...); 8 var fPrgrm = tCNgram. 9 Concat(tWNgram). 10 ClassifierBinaryLinear(cParams); 11 12 return fPrgrm.Plan(); We have instrumented the ML2 library to collect statis- tics from training and with the related bindings to the Box of Machine Learning ving Systems # 355 “This is a nice product” Positive vs. Negative Tokenizer Char Ngram Word Ngram Concat Linear Regression Figure 1: A sentimental analysis pipeline consisting of operators for featurization (ellipses), followed by a ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string. Char and Word Ngrams featurize input tokens by extracting n-grams. 
Concat generates a var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan recognize that the Linear Regression c Char and WordNgram, therefore bypas of Concat. Additionally, Tokenizer can Char and WordNgram, therefore it will CharNgram (in one stage) and a dep CharNgram and WordNGram (in anot created. The final plan will therefore b stages, versus the initial 4 operators (an Model Plan Compiler: Model plans two DAGs: a DAG composed of log DAG of physical stages. Logical stag tion of the stages output of the Oven O lated parameters; physical stages conta that will be executed by the PRETZEL given DAG, there is a 1-to-n mapping physical stages so that a logical stage var fContext = ...; var Tokenizer = ...; return fPrgm.Plan(); (1) Flour Transforms Logical Stages S1 S2 S3 1: [x] 2: [y,z] 3: … int[100] float[200] … Params Stats Physical Stages S1 S2 S3 (3) Compilation (2) Optimization Model Stats Params Logical Stages Physical Stages Model Plan Figure 6: Model optimization and compilation in PRET- ZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from recognize that the Linear Regression can be pushed in Char and WordNgram, therefore bypassing the executi of Concat. Additionally, Tokenizer can be reused betwe Char and WordNgram, therefore it will be pipelined w CharNgram (in one stage) and a dependency betwe CharNgram and WordNGram (in another stage) will created. The final plan will therefore be composed by stages, versus the initial 4 operators (and vectors) of ML Model Plan Compiler: Model plans are composed two DAGs: a DAG composed of logical stages, and DAG of physical stages. Logical stages are an abstra tion of the stages output of the Oven Optimizer, with lated parameters; physical stages contains the actual co that will be executed by the PRETZEL runtime. For.ea given DAG, there is a 1-to-n mapping between logical physical stages so that a logical stage can represent t execution code of different physical implementations. physical implementation is selected based on the param Flour API
  • 14. 5 Off-line phase [1] - Flour
[Slide shows Figure 6 of the paper, highlighting the Flour API step.]
Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program. Additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.
Flour API
  • 15. 5 Off-line phase [1] - Flour
[Slide highlights the Oven Optimiser step of Figure 6; caption as in slide 14.]
For the running example, the Oven Optimizer can recognize that the Linear Regression can be pushed into Char and WordNgram, therefore bypassing the execution of Concat. Additionally, Tokenizer can be reused between Char and WordNgram, therefore it will be pipelined with CharNgram (in one stage) and a dependency between CharNgram and WordNGram (in another stage) will be created. The final plan will therefore be composed of 2 stages, versus the initial 4 operators (and vectors) of ML2.
One-to-one transformations (such as most featurizers) are pipelined together in a single pass over the data. This strategy achieves best data locality because records are likely to reside in CPU registers [33, 38].
Oven Optimiser
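To make the single-pass pipelining concrete, here is a minimal C# sketch with entirely hypothetical names (this is not PRETZEL's actual code) of how a Tokenizer and a char-level n-gram featurizer could be fused into one stage, so each token is featurized while it is still hot in cache instead of being materialized in an intermediate buffer between two operator calls:

    using System;
    using System.Collections.Generic;

    // Fused Tokenizer + CharNgram stage (illustrative only).
    class FusedTokenCharNgramStage
    {
        private readonly int n;                       // n-gram length (model parameter)
        private readonly Dictionary<string, int> map; // n-gram -> feature index (model state)

        public FusedTokenCharNgramStage(int n, Dictionary<string, int> map)
        {
            this.n = n;
            this.map = map;
        }

        // One pass over the input: tokenize and featurize together.
        public void Run(string input, float[] output)
        {
            foreach (var token in input.Split(' '))          // "Tokenizer"
                for (int i = 0; i + n <= token.Length; i++)  // "CharNgram"
                    if (map.TryGetValue(token.Substring(i, n), out var idx))
                        output[idx] += 1f;                   // bag-of-ngrams count
        }
    }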
  • 16. 5 Off-line phase [1] - Flour
[Slide highlights the Model Plan Compiler step of Figure 6; caption as in slide 14.]
Model Plan Compiler: Model plans are composed of two DAGs: a DAG of logical stages, and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations. A physical implementation is selected based on the parameters characterizing a logical stage.
Flour API
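As an illustration of this 1-to-n mapping, the following hedged C# sketch (all names are hypothetical, not PRETZEL's API) shows one logical linear-scoring stage with two physical candidates, picked from training-time statistics:

    using System;

    // Training-time statistics attached to a logical stage.
    record ModelStats(double Sparsity);

    // One logical "linear scorer" stage; two physical implementations.
    interface IPhysicalStage { float Score(float[] weights, float[] features); }

    class DenseLinear : IPhysicalStage
    {
        public float Score(float[] w, float[] x)
        {
            float s = 0f;
            for (int i = 0; i < x.Length; i++) s += w[i] * x[i]; // visit every slot
            return s;
        }
    }

    class SparseLinear : IPhysicalStage
    {
        public float Score(float[] w, float[] x)
        {
            float s = 0f;
            for (int i = 0; i < x.Length; i++)
                if (x[i] != 0f) s += w[i] * x[i];                // skip zero entries
            return s;
        }
    }

    static class OvenCompilerSketch
    {
        // Pick the physical variant from the statistics, e.g. a sparse
        // kernel when feature vectors are known to be mostly zeros.
        public static IPhysicalStage SelectPhysical(ModelStats stats) =>
            stats.Sparsity > 0.9 ? new SparseLinear() : new DenseLinear();
    }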
  • 17. • Two main components: – Runtime, with an Object Store – Scheduler • Runtime handles physical resources: threads and buffers • Object Store caches objects of all models – models register and retrieve state objects via a key • Scheduler is event-based, each stage being an event 6 On-line phase [1] - Runtime
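A minimal sketch of the Object Store idea, assuming a simple key-value interface (the names below are hypothetical, not the actual PRETZEL API): models register state objects under a key, so pipelines with identical parameters share one in-memory object instead of each holding a copy.

    using System;
    using System.Collections.Concurrent;

    class ObjectStore
    {
        private readonly ConcurrentDictionary<string, object> store = new();

        // Returns the already-registered object when the key is known,
        // so N identical models pay the memory cost once.
        public T GetOrRegister<T>(string key, Func<T> create) where T : class
            => (T)store.GetOrAdd(key, _ => create());
    }

    // Usage: two models trained on the same data share one n-gram map.
    //   var ngrams = objectStore.GetOrRegister("charNgram:amazon-v1",
    //                                          LoadNgramMap);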
  • 18. • 7x less memory for SA models, 40% less for RT models • Higher model density, hence higher efficiency and profitability 7 Memory
Figure 8: Cumulative memory usage of the model pipelines with and without Object Store.
  • 19. • Hot vs cold scenarios • Without and with caching of partial results • PRETZEL vs ML.NET: cold predictions are 15x faster, hot predictions 2.6x faster 8 Latency
Figure 9: Latency comparison between ML2 and PRETZEL.
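The caching of partial results could look roughly like the following sketch (hypothetical names; PRETZEL's actual mechanism may differ): the expensive featurization prefix is memoized per input, so a hot, repeated request only pays for the final model score.

    using System;
    using System.Collections.Concurrent;

    class PartialResultCache
    {
        private readonly ConcurrentDictionary<string, float[]> cache = new();

        public float Predict(string input,
                             Func<string, float[]> featurize, // stages 1..k, expensive
                             Func<float[], float> score)      // final model, cheap
        {
            var features = cache.GetOrAdd(input, featurize);  // hit: skip featurization
            return score(features);
        }
    }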
  • 20. • Batch size is 1000 inputs, multiple runs of the same batch • Delay batching: as in Clipper, let requests tolerate a given delay • Batch while latency is smaller than the tolerated delay • ML.NET suffers from missed data sharing, i.e., higher memory traffic 9 Throughput
Figure 11: The average throughput computed among the models.
  • 21. • Batch size is 1000 inputs, multiple runs of the same batch • Delay batching: as in Clipper, let requests tolerate a given delay • Batch while latency is smaller than the tolerated delay • ML.NET suffers from missed data sharing, i.e., higher memory traffic 9 Throughput
[Excerpt from the paper:] PRETZEL scales linearly to the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled. The latency, instead, gracefully increases linearly with the increase of the load. In ML.NET, model state is not shared, thus increasing the pressure on the memory subsystem: indeed, even if the data values are the same, the model objects are mapped to different memory areas.
Delay Batching: As in Clipper, the PRETZEL FrontEnd allows users to specify a maximum delay they can wait to maximize throughput. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases. Interestingly, for such models even a small delay helps reaching the optimal batch size.
Figure 12: Throughput and latency of SA and RT models.
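A hedged sketch of the delay-batching loop described in the bullets above (hypothetical names; the real Clipper/PRETZEL frontends are more elaborate): requests accumulate until the batch is full or the tolerated delay is spent, then the whole batch is scored at once.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Threading;

    class DelayBatcher
    {
        private readonly TimeSpan maxDelay; // tolerated delay, set by the user
        private readonly int maxBatch;      // e.g. 1000 inputs

        public DelayBatcher(TimeSpan maxDelay, int maxBatch)
        { this.maxDelay = maxDelay; this.maxBatch = maxBatch; }

        // Collect requests until the batch is full or the delay budget is spent.
        public List<string> NextBatch(ConcurrentQueue<string> pending)
        {
            var batch = new List<string>();
            var sw = Stopwatch.StartNew();
            while (batch.Count < maxBatch && sw.Elapsed < maxDelay)
            {
                if (pending.TryDequeue(out var request)) batch.Add(request);
                else Thread.Sleep(0); // yield while waiting for more requests
            }
            return batch;
        }
    }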
  • 22. • We addressed performance and density bottlenecks in ML inference for MaaS by advocating a white-box approach • Future work - Code generation - Support for more model formats than ML.NET - Distributed version - NUMA awareness 10 Conclusions and current work
  • 23. • Physical stages can be offloaded to FPGA • How to deploy it? – Fixed subset of common operators – Multiple operators in kernels – Whole model, deployed via partial reconfiguration 11 Future work with FPGA
  • 24. • Physical stages can be offloaded to FPGA • How to deploy it? – Fixed subset of common operators – Multiple operators in kernels – Whole model, deployed via partial reconfiguration 11 Future work with FPGA QUESTIONS? Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it