
Appears in the proceedings of the 2020 On-Device Intelligence Workshop, co-located with the MLSys Conference

KMLib: Towards Machine Learning for Operating Systems

Ibrahim Umit Akgun¹, Ali Selman Aydin¹, Erez Zadok¹

Abstract
Despite the ever-changing software and hardware profiles of modern computing systems, many operating system (OS) components adhere to designs developed decades ago. Considering the variety of dynamic workloads that modern operating systems are expected to manage, it would be quite beneficial to develop adaptive systems that learn from data patterns and OS events. However, developing such adaptive systems in kernel space involves the bottom-up implementation of math and machine learning (ML) primitives that are readily available in user space via widely used ML libraries; conversely, user-level ML engines are often too costly (in terms of CPU and memory footprint) to be used inside a tightly controlled, resource-constrained OS. To this end, we started developing KMLib, a lightweight yet efficient ML engine targeting kernel-space components. We detail our proposed design in this paper, demonstrated through a first prototype targeting the OS I/O scheduler. Our prototype's memory footprint is 804KB for the kernel module and 96KB for the library; experiments show we can reduce I/O latency by 8% on our benchmark workload and testbed, which is significant for typically slow I/O devices.

¹Department of Computer Science, Stony Brook University, USA. Correspondence to: Ibrahim Umit Akgun <[email protected]>, Ali Selman Aydin <[email protected]>, Erez Zadok <[email protected]>. Proceedings of the On-Device Intelligence Workshop, co-located with the MLSys Conference, Austin, Texas, USA, 2020. Copyright 2020 by the author(s).

1 Introduction

Rapid changes in the hardware that interacts heavily with operating systems raise questions about OS design. OS development is a difficult and tedious task, and it cannot keep up with these hardware changes or new algorithmic techniques quickly. In addition, recent years have witnessed major changes in workloads. Despite these changes, the designs of most OS components have changed little over the years.

One example of the divergence between hardware and software can be seen in storage technologies. Storage devices are getting faster and more diverse every day. Keeping up with the changes to storage devices requires either a complete redesign of some of the components in the storage stack, or tuning parameters and developing more workload-aware data structures and algorithms. In the past few years, we have witnessed such a paradigm shift in data management systems and computer architectures. Both OS research and these fields tackle similar tasks such as caching, indexing, and scheduling. For example, in data management research, researchers have developed learned structures to improve performance and adaptability (Kraska et al., 2018). This is followed by work on data management systems that are optimized for workloads and underlying system specifications (Kraska et al., 2019). In computer-architecture research, researchers realized that predicting memory-access patterns can be formulated as an ML problem, and they developed cache-replacement models to improve system performance (Hashemi et al., 2018; Shi et al., 2019). OS page-cache management is a similar problem to cache replacement in CPUs. In addition, operating systems use hash tables in numerous places, which might be enhanced with learned structures (Kraska et al., 2018).

Although it is possible to utilize well-known ML libraries to build ML approaches for data management systems, using ML in operating systems poses three unique challenges. (1) Developing ML solutions that work in kernel space requires extensive kernel programming skills. (2) Debugging and fine-tuning ML models, an essential component of most ML development pipelines, can be quite challenging for ML models working only in kernel space, because the OS is naturally hard to debug and notoriously sensitive to bugs and performance overheads. (3) Certain OS quality-of-service requirements could require ML models to be deployed in kernel space to avoid the extra costs incurred by user-kernel switches: some kernel tasks cannot tolerate that overhead because they may be running under hard time limits, and any extra latency can cause timeouts. These challenges motivated us to design and develop an ML library targeted for adoption within the kernel, called KMLib. KMLib is an attempt to enable ML applications in a relatively unexplored yet challenging environment: the OS kernel.

Researchers have proposed interesting ideas related to ML for task scheduling (Negi & Kumar, 2005; Smith et al., 1998), I/O scheduling (Hao et al., 2017), and storage parameter tuning (Cao et al., 2018). However, to the best of our knowledge, there is no previous work that attempts to develop an ML ecosystem for operating systems.

KMLib aims to (i) enable easy-to-develop ML applications with low computational cost and memory footprint, and (ii) make it easier to debug and fine-tune ML applications by providing primitives that behave identically in user space and in kernel space. We believe that a library like KMLib could enable numerous ML-based applications targeting operating systems and help us rethink how to design adaptive and self-configuring operating systems.

2 Background and Related Work

While mainstream machine learning libraries like TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) have gained widespread use in research and production, there have also been several attempts to build machine learning libraries that address specific needs. The Embedded Learning Library by Microsoft (ELL) is one example, targeting embedded devices. TensorFlow Lite by Google (TensorFlow Lite) is a library for running machine learning applications on resource-constrained devices. For using ML to improve operating systems, there have been several proposals (Zhang & Huang, 2019).

Researchers have investigated tuning file-system parameters (Cao et al., 2018); because that work performs the optimization in an offline manner, it is not designed to adapt to workload changes. Another work attempted to improve I/O schedulers by predicting whether an I/O request meets its deadline (Hao et al., 2017), but the predictions were based on a linear regression model trained on synthetically generated data in an offline manner. These examples suggest that a machine learning library that works in the kernel can help build adaptive operating-system components.

3 Machine Learning Library for Operating Systems

3.1 Machine Learning Library Design

Overview. There are several points and design choices worth mentioning regarding our machine learning library that will power ML applications in kernel space. First, the lack of access to standard math floating-point functions in the kernel means we have to implement nearly all math functions (including common functions such as pow and log) ourselves. Second, following the design choice seen in numerous mainstream deep-learning libraries (Abadi et al., 2016; Paszke et al., 2019), we decided on a common tensor-like representation for matrices and model parameters. Functionality for manipulating matrices, such as matrix addition, matrix multiplication, and the L2 norm, has also been implemented as part of the library. Third, neural networks are represented as a collection of layers, each of which implements forward() for forward propagation and backward() for backward propagation. Whenever a new layer is added to the library, its forward() and backward() functions need to be implemented. In addition, our plan is to use lock-free data structures when implementing the layers to allow for parallel processing by breaking down the computation DAG when possible. Finally, neural networks implemented with this library use an API similar to the individual layers, where forward() facilitates forward propagation of input through the computation DAG and backward() applies backward propagation via the chain rule, using each layer's backward() method to compute that layer's derivatives. In our design, the loss functions are treated like the other layers in terms of implementation. Our library implements reverse-mode automatic differentiation to compute the gradients, which are then used to update the model weights using gradient-based learning algorithms such as gradient descent.
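For illustration only, a minimal user-space sketch of the kind of layer interface and gradient-descent update described above might look as follows; the names (kml_layer, kml_sgd_step) are ours and do not represent KMLib's actual API.

    #include <stddef.h>

    /* A layer owns its parameters and exposes forward/backward hooks. */
    struct kml_layer {
        float *weights;      /* flattened parameter buffer            */
        float *grads;        /* gradient buffer, same size as weights */
        size_t n_params;
        /* y = f(x); caches whatever backward() will need             */
        void (*forward)(struct kml_layer *l, const float *x, float *y);
        /* given dL/dy, fill l->grads and emit dL/dx for the caller   */
        void (*backward)(struct kml_layer *l, const float *dy, float *dx);
    };

    /* One plain gradient-descent step over a stack of layers. */
    static void kml_sgd_step(struct kml_layer **layers, size_t n, float lr)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < layers[i]->n_params; j++)
                layers[i]->weights[j] -= lr * layers[i]->grads[j];
    }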
Our initial goal is to provide users with implementations of the most widely used linear layers, such as fully-connected and convolutional (LeCun et al., 1998) layers, and widely used non-linearities such as ReLU (Nair & Hinton, 2010) and Sigmoid, in addition to sequential models like LSTMs (Hochreiter & Schmidhuber, 1997). We also provide users with widely used losses such as cross entropy and mean squared error. Users can extend the library with their own layers and loss functions by providing their own implementations.
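As an example of what such a user-provided layer might look like, here is a hypothetical ReLU layer written against the illustrative interface sketched earlier (again, not KMLib's actual code):

    #include <stddef.h>

    struct relu_layer {
        struct kml_layer base;   /* embeds the generic layer interface */
        const float *last_x;     /* cached input, needed by backward() */
        size_t n;                /* number of elements per sample      */
    };

    static void relu_forward(struct kml_layer *l, const float *x, float *y)
    {
        struct relu_layer *r = (struct relu_layer *)l;
        r->last_x = x;
        for (size_t i = 0; i < r->n; i++)
            y[i] = x[i] > 0.0f ? x[i] : 0.0f;
    }

    static void relu_backward(struct kml_layer *l, const float *dy, float *dx)
    {
        struct relu_layer *r = (struct relu_layer *)l;
        /* dL/dx = dL/dy where x > 0, else 0; ReLU has no parameters. */
        for (size_t i = 0; i < r->n; i++)
            dx[i] = r->last_x[i] > 0.0f ? dy[i] : 0.0f;
    }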
Adapting to new workloads. The ever-changing workloads of modern computing systems mean that machine learning models developed to exploit patterns in any workload must be adaptive. This could be achieved by constantly training the model, which incurs extra computational cost and memory footprint. Hence, there is a trade-off between the power of adaptation and computational efficiency. For low-dimensional and less challenging machine learning problems, where convergence can be achieved after a small number of steps, one could employ a simple feedback mechanism to control the training schedule. The goal of this mechanism is to perform inference only when the model performs better than random guessing by a pre-defined threshold. More formally, for a classification task we perform inference only when the classification accuracy over the last k batches is at least p_margin higher than the frequency of the most frequent label in these k batches (i.e., the accuracy of a majority-class guesser).

Both k and p_margin are adjustable, with implications on memory footprint, computational cost, and stability. While the simple mechanism described above could be effective in a low-dimensional problem, where converging to reasonable performance is not likely to take significant amounts of computational power, high-dimensional and more challenging machine learning problems in kernel space could require taking into account other aspects of the problem. More specifically, learning the ever-changing workloads on an edge device is a multi-objective problem, where some of the objectives are obvious, i.e., computation time, memory footprint, and energy consumption. However, optimizing for these objectives while there are non-zero computation costs and memory footprints incurred by training and inference makes it necessary to consider multiple other factors. Ideally, one would like to deploy a machine learning system that spends the least amount of time training, using the smallest number of samples. Reducing the training time and the number of samples could be achieved by borrowing ideas from few-shot learning (Wang & Yao, 2019) when applicable. The relatively high cost of training also makes it necessary to avoid using samples that are not likely to improve model performance; this could be approached using ideas from active learning (Settles, 2009), where learning is performed on a promising subset of the labeled data. Effective utilization of methods for both of these problems could result in models that spend the least amount of time in training and are used more for inference.
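A minimal sketch of the feedback gate described above, assuming a hypothetical per-model statistics structure (the field and function names are ours, not KMLib's):

    #include <stdbool.h>
    #include <stddef.h>

    /* Rolling statistics over the last k batches of a classifier. */
    struct kml_gate_stats {
        size_t k;              /* window size in batches                 */
        size_t samples;        /* total samples seen in the window       */
        size_t correct;        /* correctly classified samples           */
        size_t majority_count; /* occurrences of the most frequent label */
        float  p_margin;       /* required margin over majority guessing */
    };

    /* Perform inference only when accuracy beats the majority-class
     * baseline by at least p_margin over the last k batches. */
    static bool kml_should_infer(const struct kml_gate_stats *s)
    {
        if (s->samples == 0)
            return false;
        float accuracy = (float)s->correct / (float)s->samples;
        float baseline = (float)s->majority_count / (float)s->samples;
        return accuracy >= baseline + s->p_margin;
    }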

Low-precision training. Computation overhead is one of our biggest concerns in designing KMLib. There are operating-system tasks that must be completed in sub-microsecond time, and any extra latency for these tasks may cause timeouts and serious performance degradation. One of the ways to reduce KMLib's computation overhead is to use low-precision training techniques (Choi et al., 2019; De Sa et al., 2018; Gupta et al., 2015; Sa et al., 2017). KMLib can support different data types for its tensor structures and is also flexible enough to adopt custom data types; one of these data types is float. We are also working to support 16-bit and 8-bit wide fixed-point numbers (Wang et al., 2018) in KMLib. Low-precision training not only helps to reduce computational overhead but also lowers memory consumption, which was another critical point when we started designing KMLib.
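To illustrate what fixed-point arithmetic of this kind could look like, here is a small Q8.8 (16-bit) sketch; the format choice and helper names are assumptions for illustration, not KMLib's actual implementation.

    #include <stdint.h>

    /* Q8.8 fixed point: 8 integer bits, 8 fractional bits. */
    typedef int16_t q8_8_t;
    #define Q8_8_SHIFT 8

    static inline q8_8_t q8_8_from_float(float f)
    {
        return (q8_8_t)(f * (1 << Q8_8_SHIFT));
    }

    static inline float q8_8_to_float(q8_8_t v)
    {
        return (float)v / (1 << Q8_8_SHIFT);
    }

    /* Multiply in a wider type, then shift back down to Q8.8. */
    static inline q8_8_t q8_8_mul(q8_8_t a, q8_8_t b)
    {
        return (q8_8_t)(((int32_t)a * (int32_t)b) >> Q8_8_SHIFT);
    }

    /* Fixed-point dot product, e.g., for a small linear model. */
    static int32_t q8_8_dot(const q8_8_t *x, const q8_8_t *w, int n)
    {
        int32_t acc = 0;   /* each product is scaled by 2^16 */
        for (int i = 0; i < n; i++)
            acc += (int32_t)x[i] * (int32_t)w[i];
        return acc >> Q8_8_SHIFT;   /* rescale the sum to Q8.8 */
    }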


3.2 Operating System Integration

KMLib can work in both user and kernel spaces. But, as we mentioned above, there are some challenges in using floating-point operations in kernel space. It is well known that floating-point operations are normally not allowed in the Linux kernel. One way to perform floating-point operations in the kernel is to enable the x86 architecture's floating-point unit by calling kernel_fpu_begin; once the floating-point operations are finished, use of floating point can be disabled again by calling kernel_fpu_end. (For KMLib's ARM v8 integration, the corresponding enable/disable functions are kernel_neon_begin and kernel_neon_end.) We tried to minimize the size of the floating-point-enabled code blocks, because the more time KMLib spends in a floating-point-enabled region, the higher the chance of being context-switched to another task; when the floating-point unit is enabled, the kernel must save the floating-point registers on a context switch, which adds additional overhead. We now explain how KMLib caps its memory and computation overheads.
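The usage pattern looks roughly like the following kernel-style sketch. kernel_fpu_begin/kernel_fpu_end are the real x86 Linux kernel APIs; kml_infer_float is a hypothetical KMLib call used only for illustration.

    #include <linux/kernel.h>
    #include <asm/fpu/api.h>   /* kernel_fpu_begin/kernel_fpu_end (x86) */

    /* Hypothetical KMLib inference call; returns an integer decision. */
    extern long kml_infer_float(const float *features, int n);

    static long kmlib_predict(const float *features, int n)
    {
        long prediction;

        /* Keep this region as short as possible and do not sleep:
         * while the FPU is enabled, a context switch forces the
         * kernel to save/restore the floating-point registers. */
        kernel_fpu_begin();
        prediction = kml_infer_float(features, n);
        kernel_fpu_end();

        return prediction;
    }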

Computation and memory capping. KMLib is designed to create as little interference as possible in the running system: it is capable of performing training and inference operations while the underlying operating system is running. KMLib offloads the training computation to library threads to reduce interference; the only interference KMLib adds is saving the input data and the predictions for training. We use lock-free circular buffers to store training data, and users can configure the size of these buffers. The circular buffers have two running modes: blocking and dropping. The blocking mode lets the user process every single piece of input data, but if the frequency of computation requests is high, this mode might add extra overhead by blocking additional inputs from being processed. The dropping mode overruns unprocessed input data: it does not add extra overhead, but KMLib then loses data, which may hurt training quality. Using these features, the user can cap the memory overhead based on their ML application's needs.
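A simplified single-producer/single-consumer ring buffer illustrating the two modes is sketched below. KMLib's actual buffers are lock-free and can be shared across the user-kernel boundary (Desnoyers & Dagenais, 2012); this user-space sketch does not attempt to reproduce that.

    #include <stdbool.h>
    #include <stddef.h>

    #define RING_CAPACITY 1024

    struct sample { float x[8]; float y; };   /* one training example */

    struct ring {
        struct sample buf[RING_CAPACITY];
        size_t head;   /* producer writes here */
        size_t tail;   /* consumer reads here  */
    };

    static bool ring_full(const struct ring *r)
    {
        return (r->head + 1) % RING_CAPACITY == r->tail;
    }

    /* Producer side.  Returns false when the buffer is full: in
     * dropping mode the caller simply discards the sample; in
     * blocking mode the caller waits and retries until space frees. */
    static bool ring_push(struct ring *r, const struct sample *s)
    {
        if (ring_full(r))
            return false;
        r->buf[r->head] = *s;
        r->head = (r->head + 1) % RING_CAPACITY;
        return true;
    }

    /* Consumer side (the training thread). */
    static bool ring_pop(struct ring *r, struct sample *out)
    {
        if (r->head == r->tail)
            return false;   /* empty */
        *out = r->buf[r->tail];
        r->tail = (r->tail + 1) % RING_CAPACITY;
        return true;
    }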
The computational overhead of training varies based on the complexity of the learning model. We designed KMLib to offload training computation to library threads, but there are other challenges in partitioning the computation DAG. Even though KMLib uses lock-free data structures to reduce multi-threaded communication and synchronization overhead, there might be dependencies in the computation DAG, which can cause latencies. That is why we also allow the user to choose how many threads can be used for (i) training and (ii) inference. All of these features related to offloading training/inference computation can be disabled, in which case the work is done in the original thread context.
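Assuming the illustrative ring buffer above, a polling training worker could look like the following user-space sketch (pthread-based purely for illustration; KMLib's kernel-side threads would use kernel threading primitives instead, and the training hook is a stand-in):

    #include <pthread.h>
    #include <unistd.h>

    static volatile int training_enabled = 1;

    /* Hypothetical training hook; stands in for KMLib's update step. */
    static void kml_train_on_sample(const struct sample *s) { (void)s; }

    /* Polling trainer: drains whatever the producer queued, then backs
     * off; it only makes progress when the scheduler runs this thread. */
    static void *kml_training_worker(void *arg)
    {
        struct ring *r = arg;
        struct sample s;

        while (training_enabled) {
            while (ring_pop(r, &s))
                kml_train_on_sample(&s);
            usleep(100);   /* buffer empty: yield instead of spinning hard */
        }
        return NULL;
    }

    /* Launch with: pthread_t t; pthread_create(&t, NULL, kml_training_worker, &ring); */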
Figure 1. KMLib architectures: (a) kernel library and (b) user-kernel shared library. mq-kmlib.ko is a reference ML implementation built on the KMLib library, and kmlib.ko refers to the KMLib kernel-space library.

User space vs. kernel space. The first question that comes to mind is why we started implementing a machine learning library from scratch for optimizing operating-system tasks, rather than using a well-known user-space library with data collected from the operating system. It is possible to collect data from the operating system and feed it into user-space ML implementations, but there are challenges with that approach: for example, offloaded training and inference should run in sub-microsecond time because of the nature of operating-system tasks. KMLib can be deployed in two different modes: (i) kernel mode (Figure 1(a)) and (ii) kernel-user memory-mapped shared mode (Figure 1(b)). In kernel mode, both training and inference happen in kernel space. In kernel-user memory-mapped shared mode, KMLib collects data from kernel space and trains using user-space threads; for inference, KMLib still runs the operations in kernel space to reduce latency. We use user-kernel shared lock-free circular buffers (Desnoyers & Dagenais, 2012) for collecting training data, but KMLib threads can drain training requests only when they get scheduled, because KMLib threads work in a polling manner. We continue improving the user-space approach because we believe it improves developer productivity: developing, debugging, and testing learning models is much easier in user space than in the kernel.

4 Evaluation

We developed a sample application of KMLib to fine-tune the mq-deadline I/O scheduler. To predict whether an I/O request will meet its deadline, we train a linear regression model. The regression model predicts the issue time for a given I/O request using the normalized block number and the ordinalized operation type as features. The predicted issue time is then thresholded to predict whether the I/O request should be early-rejected or not. We hypothesize that this should reduce the overall latency.

We conducted the experiments on QEMU with I/O throttling, running on an Intel(R) Core(TM) i7-7500U with 8GB RAM and a 256GB Intel SSD. We used our modified version of Linux kernel v4.19.51+ for all experiments.

For workload generation, we ran the FIO micro-benchmark (FIO), configured to perform random read and write operations with 4 threads on a 1GB dataset. Each experiment is executed on a fresh QEMU instance. We cloned the mq-deadline I/O scheduler as mq-kmlib and integrated it with KMLib. We made three key changes in the mq-kmlib I/O scheduler compared to mq-deadline: (i) in the dd_init_queue function, we inserted initialization code to set the learning rate, batch size, momentum, and number of features to learn; initial weights are also set randomly here. (ii) In the dd_dispatch_request function, we call the functions that collect X_t and Y_t and perform the training steps. (iii) In the dd_insert_request function, we invoke an inference function and, based on the prediction, decide whether to early-reject the I/O request or not.
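To make the early-rejection path concrete, here is a rough user-space sketch of the thresholded linear-regression decision described above; the weights, feature encoding, and threshold value are placeholders, and the function names are not the actual mq-kmlib hooks.

    #include <stdbool.h>

    #define N_FEATURES 2   /* normalized block number, ordinalized op type */

    struct io_request_features {
        float norm_block;  /* block number scaled to [0, 1]       */
        float op_type;     /* e.g., 0.0 = read, 1.0 = write       */
    };

    struct issue_time_model {
        float w[N_FEATURES];
        float bias;
        float deadline_us; /* reject if predicted issue time exceeds it */
    };

    /* Linear regression: predicted issue time in microseconds. */
    static float predict_issue_time(const struct issue_time_model *m,
                                    const struct io_request_features *f)
    {
        return m->w[0] * f->norm_block + m->w[1] * f->op_type + m->bias;
    }

    /* Thresholded decision used on the insert path. */
    static bool should_early_reject(const struct issue_time_model *m,
                                    const struct io_request_features *f)
    {
        return predict_issue_time(m, f) > m->deadline_us;
    }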
We observed that the thresholded regression output could predict with an accuracy of 74.62% whether an I/O request would miss its deadline, and that early rejection reduced the overall I/O latency by 8%, a promising result given that I/O is so much slower than memory or the CPU (and hence I/O should be the first to optimize). Our test involved a single synthetic workload that does not cover a large number of use cases, so our performance may not generalize to other workloads. Further, the emulated environment provided by QEMU may not represent a realistic use case, due to QEMU's artificial throttling. This is why our next step is to investigate whether these results generalize to other workloads under more realistic conditions (e.g., physical machines). We are also planning to apply machine learning models to other storage-stack components like the page cache.

We wrote nearly 3,000 lines of C/C++ code (LoC). Because the current set of machine learning tools we have implemented is small, the memory footprint of the KMLib user-space library is just 96KB, and the size of the KMLib kernel module is only 804KB. We expect these numbers to increase as additional functionality is implemented.

5 Conclusion

Adapting operating-system components to running workloads and hardware has traditionally been done by tuning parameters or changing critical data-structure properties empirically. We have proposed that lightweight machine learning approaches may help to solve these problems. Our preliminary evaluation shows some promising results. Our plan is to expand on this work, apply it to other OS components, and evaluate and optimize the ML library for a wide range of workloads.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P. A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), Savannah, GA, USA, November 2-4, 2016, pp. 265–283, 2016.

Cao, Z., Tarasov, V., Tiwari, S., and Zadok, E. Towards better understanding of black-box auto-tuning: A comparative analysis for storage systems. In 2018 USENIX Annual Technical Conference (USENIX ATC 2018), Boston, MA, USA, July 11-13, 2018, pp. 893–907, 2018.

Choi, J., Venkataramani, S., Srinivasan, V., Gopalakrishnan, K., Wang, Z., and Chuang, P. Accurate and efficient 2-bit quantized neural networks. In Proceedings of the 2nd SysML Conference, 2019.

De Sa, C., Leszczynski, M., Zhang, J., Marzoev, A., Aberger, C. R., Olukotun, K., and Ré, C. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.

Desnoyers, M. and Dagenais, M. R. Lockless multi-core high-throughput buffering scheme for kernel tracing. Operating Systems Review, 46(3):65–81, 2012.

ELL. Embedded Learning Library (ELL), January 2020. https://microsoft.github.io/ELL/.

FIO. Flexible I/O Tester, January 2020. https://fio.readthedocs.io/en/latest/.

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, July 6-11, 2015, pp. 1737–1746, 2015.

Hao, M., Li, H., Tong, M. H., Pakha, C., Suminto, R. O., Stuardo, C. A., Chien, A. A., and Gunawi, H. S. MittOS: Supporting millisecond tail tolerance with fast rejecting SLO-aware OS interface. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017), Shanghai, China, October 28-31, 2017, pp. 168–183, 2017.

Hashemi, M., Swersky, K., Smith, J. A., Ayers, G., Litz, H., Chang, J., Kozyrakis, C., and Ranganathan, P. Learning memory access patterns. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1924–1933, 2018.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504. ACM, 2018.

Kraska, T., Alizadeh, M., Beutel, A., Chi, E. H., Kristo, A., Leclerc, G., Madden, S., Mao, H., and Nathan, V. SageDB: A learned database system. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, June 21-24, 2010, pp. 807–814, 2010.

Negi, A. and Kumar, P. K. Applying machine learning techniques to improve Linux process scheduling. In TENCON 2005 - 2005 IEEE Region 10 Conference, pp. 1–6. IEEE, 2005.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, December 8-14, 2019, pp. 8024–8035, 2019.

Sa, C. D., Feldman, M., Ré, C., and Olukotun, K. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA 2017), Toronto, ON, Canada, June 24-28, 2017, pp. 561–574, 2017.

Settles, B. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.

Shi, Z., Huang, X., Jain, A., and Lin, C. Applying deep learning to the cache replacement problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2019), Columbus, OH, USA, October 12-16, 2019, pp. 413–425, 2019.

Smith, W., Foster, I. T., and Taylor, V. E. Predicting application run times using historical information. In Job Scheduling Strategies for Parallel Processing, IPPS/SPDP'98 Workshop, Orlando, Florida, USA, March 30, 1998, Proceedings, pp. 122–142, 1998.

TensorFlow Lite. TensorFlow Lite, January 2020. https://www.tensorflow.org/lite.

Wang, N., Choi, J., Brand, D., Chen, C., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, December 3-8, 2018, pp. 7686–7695, 2018.

Wang, Y. and Yao, Q. Few-shot learning: A survey. arXiv preprint arXiv:1904.05046, 2019.

Zhang, Y. and Huang, Y. "Learned": Operating systems. SIGOPS Oper. Syst. Rev., 53(1):40–45, July 2019.
