autobot-mlsys2020
ABSTRACT
Despite the ever-changing software and hardware profiles of modern computing systems, many operating system (OS) components adhere to designs developed decades ago. Given the variety of dynamic workloads that modern operating systems are expected to manage, there is much to gain from adaptive components that learn from data patterns and OS events. However, developing such adaptive systems in kernel space requires implementing from scratch the math and machine learning (ML) primitives that are readily available in user space via widely-used ML libraries, while user-level ML engines are often too costly (in CPU and memory footprint) to be used inside a tightly controlled, resource-constrained OS. To this end, we started developing KMLib, a lightweight yet efficient ML engine targeting kernel-space components. We detail our proposed design in this paper and demonstrate it through a first prototype targeting the OS I/O scheduler. Our prototype's memory footprint is 804KB for the kernel module and 96KB for the user-space library; experiments show that it reduces I/O latency by 8% on our benchmark workload and testbed, a meaningful gain for typically slow I/O devices.
1 INTRODUCTION
Researchers have proposed interesting ideas related to ML for task scheduling (Negi & Kumar, 2005; Smith et al., 1998), I/O scheduling (Hao et al., 2017), and storage parameter tuning (Cao et al., 2018). However, to the best of our knowledge, there is no previous work that attempts to develop an ML ecosystem for operating systems.

KMLib aims to (i) enable easy-to-develop ML applications with low computational cost and memory footprint, and (ii) make it easier to debug and fine-tune ML applications by providing primitives that behave identically in user space and in kernel space. We believe that a library like KMLib could enable numerous ML-based applications targeting operating systems and help us rethink how to design adaptive and self-configuring operating systems.

2 BACKGROUND AND RELATED WORK

While mainstream machine learning libraries like TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) have gained widespread use in research and production, there have also been several attempts to build machine learning libraries that address specific needs. The Embedded Learning Library (ELL) by Microsoft is one example, targeting embedded devices. TensorFlow Lite (TensorFlow Lite) by Google is a library for running machine learning applications on resource-constrained devices. There have also been several proposals for using ML to improve operating systems (Zhang & Huang, 2019).

Researchers have investigated automatically tuning file system parameters (Cao et al., 2018). Because this work performs the optimization in an offline manner, it is not designed to adapt to workload changes. Another work attempts to improve I/O schedulers by predicting whether an I/O request will meet its deadline (Hao et al., 2017), but the predictions are based on a linear regression model trained offline on synthetically generated data. These examples suggest that a machine learning library that works in the kernel can help to build adaptive operating system components.

3 MACHINE LEARNING LIBRARY FOR OPERATING SYSTEMS

3.1 Machine Learning Library Design

Overview. There are several points and design choices worth mentioning regarding our machine learning library that will power ML applications in kernel space. First, the lack of access to standard floating-point math functions in the kernel means we have to implement nearly all math functions (including common functions such as pow and log) ourselves. Second, following the design choice seen in numerous mainstream deep-learning libraries (Abadi et al., 2016; Paszke et al., 2019), we decided on a common tensor-like representation for matrices and model parameters. Functionality for manipulating matrices, such as matrix addition, multiplication, and the l2 norm, has also been implemented as part of the library. Third, neural networks are represented as a collection of layers, each of which implements forward() for forward propagation and backward() for backward propagation. Whenever a new layer is added to the library, its forward() and backward() functions need to be implemented. In addition, our plan is to use lock-free data structures when implementing the layers, to allow for parallel processing by breaking down the computation DAG when possible. Finally, neural networks implemented with this library will use an API similar to that of the individual layers: forward() propagates input through the computation DAG, and backward() applies backward propagation via the chain rule, using each layer's backward() method to compute that layer's derivatives. In our design, loss functions are treated like any other layer in terms of implementation. Our library will implement reverse-mode automatic differentiation to compute the gradients, which are then used to update the model weights using gradient-based learning algorithms such as gradient descent.

Our initial goal is to provide users with implementations of the most widely-used linear layers, such as fully-connected and convolutional (LeCun et al., 1998) layers, and widely-used non-linearities such as ReLU (Nair & Hinton, 2010) and Sigmoid, in addition to sequential models like LSTMs (Hochreiter & Schmidhuber, 1997). We also provide widely-used losses such as cross entropy and mean squared error. Users are able to extend the library with their own layers and loss functions by providing their own implementations.

Adapting to new workloads. The ever-changing workloads of modern computing systems mean that machine learning models developed to exploit patterns in any workload must be adaptive. This could be achieved by constantly training the model, which incurs extra computational cost and memory footprint. Hence, there is a trade-off between the power of adaptation and computational efficiency. For low-dimensional and less challenging machine learning problems, where convergence can be achieved after a small number of steps, one could employ a simple feedback mechanism to control the training schedule. The goal of this mechanism is to perform inference only when the model performs better than random guessing by a pre-defined threshold. More formally, for a classification task we perform inference only when the classification accuracy over the last k batches is at least pmargin higher than the frequency of the most frequent label in these k batches. k and pmargin are adjustable, allowing users to trade faster adaptation for higher stability.
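To make this gating rule concrete, the check below is a minimal user-space C sketch of the mechanism just described. The struct and function names are hypothetical rather than KMLib's actual API, the window is capped at 64 batches for simplicity, and plain double arithmetic is used for clarity even though kernel-space code would have to rely on KMLib's own floating-point support (Section 3.1).

#include <stdbool.h>
#include <stddef.h>

struct kml_gate {
    size_t k;              /* window size in batches (k <= 64 assumed) */
    double p_margin;       /* required margin over the majority-label baseline */
    double acc[64];        /* per-batch model accuracy over the last k batches */
    double maj[64];        /* per-batch frequency of the most frequent label */
    size_t next, filled;
};

/* Record the statistics of one training batch. */
static void kml_gate_record(struct kml_gate *g, double batch_acc, double majority_freq)
{
    g->acc[g->next] = batch_acc;
    g->maj[g->next] = majority_freq;
    g->next = (g->next + 1) % g->k;
    if (g->filled < g->k)
        g->filled++;
}

/* Allow inference only once accuracy over the last k batches beats the
 * most-frequent-label baseline by at least p_margin. */
static bool kml_gate_allows_inference(const struct kml_gate *g)
{
    double acc_sum = 0.0, maj_sum = 0.0;
    size_t i;

    if (g->filled < g->k)
        return false;      /* not enough history yet */
    for (i = 0; i < g->k; i++) {
        acc_sum += g->acc[i];
        maj_sum += g->maj[i];
    }
    return acc_sum / (double)g->k >= maj_sum / (double)g->k + g->p_margin;
}

A caller would invoke kml_gate_record() after each training batch and consult kml_gate_allows_inference() before serving predictions.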
Figure 1. KMLib deployment modes: (a) kernel library mode and (b) kernel-user memory-mapped shared mode. The figure depicts the mq-kmlib.ko scheduler module, the kmlib.ko module and its OS-ML API, the training data (Xt, Yt), the gradients dL/dw, and the inference path across user space and kernel space.
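Returning to the layer abstraction described in Section 3.1, a plain-C interface along the following lines would be one natural way to express it. The kml_tensor and kml_layer names and fields here are illustrative assumptions, not KMLib's actual definitions.

#include <stddef.h>

/* Illustrative tensor and layer types; not KMLib's actual definitions. */
struct kml_tensor {
    float *data;      /* contiguous values */
    size_t *shape;    /* size of each dimension */
    size_t ndim;
};

struct kml_layer {
    /* Forward propagation: consume `in`, produce `out`. */
    void (*forward)(struct kml_layer *self,
                    const struct kml_tensor *in, struct kml_tensor *out);
    /* Backward propagation: given dL/d(out), produce dL/d(in) and
     * accumulate the layer's parameter gradients. */
    void (*backward)(struct kml_layer *self,
                     const struct kml_tensor *grad_out,
                     struct kml_tensor *grad_in);
    void *state;      /* layer-specific parameters, e.g. the weights of a
                       * fully-connected layer */
};

A network then chains such layers: forward() walks them in order over the computation DAG, and backward() walks them in reverse, applying the chain rule layer by layer, with loss functions implemented as just another layer, as described above.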
In the blocking mode, KMLib blocks the producer until its threads have consumed each input piece of data, but if the frequency of computation requests is high, this blocking mode might add extra overhead by blocking additional inputs from being processed. The dropping mode overwrites unprocessed input data: it does not add extra overhead, but KMLib then loses data, which may hurt training quality. Using these features, the user can cap memory overhead based on the needs of their ML application.

The computational overhead of training varies with the complexity of the learning model. We designed KMLib to offload training computation to KMLib library threads, but partitioning the computation DAG brings its own challenges. Even though KMLib uses lock-free data structures to reduce multi-threaded communication and synchronization overhead, dependencies in the computation DAG might still cause latencies. That is why we also allow the user to choose how many threads can be used for (i) training and (ii) inference. All of these features related to offloading training/inference computation can be disabled, in which case the computation is performed in the original thread context.

User space vs. kernel space. The first question that comes to mind is why we started implementing a machine learning library from scratch for optimizing operating system tasks, rather than using a well-known user-space library with data collected from the operating system. It is possible to collect data from the operating system and feed it into user-space ML implementations, but there are challenges with that approach. For example, offloaded training and inference should run in sub-microsecond time because of the nature of operating system tasks. KMLib can be deployed in two different modes: (i) kernel mode (Figure 1(a)) and (ii) kernel-user memory-mapped shared mode (Figure 1(b)). In kernel mode, both training and inference happen in kernel space. In kernel-user memory-mapped shared mode, KMLib collects data from kernel space and trains using user-space threads; for inference, KMLib still runs the operations in kernel space to reduce latency. We use user-kernel shared lock-free circular buffers (Desnoyers & Dagenais, 2012) for collecting training data, but KMLib threads can drain training requests only when they get scheduled, because KMLib threads work in a polling manner. We continue improving the user-space approach because we believe it improves developer productivity: developing, debugging, and testing learning models is much easier in user space than in the kernel.

4 EVALUATION

We developed a sample application of KMLib to fine-tune the mq-deadline I/O scheduler. To predict whether an I/O request will meet its deadline, we train a linear regression model. The regression model predicts the issue time for a given I/O request using the normalized block number and the ordinalized operation type as features. The predicted issue time is then thresholded to predict whether the I/O request should be early-rejected or not. We hypothesize that this should reduce the overall latency.

We conducted the experiments on QEMU with I/O throttling, running on an Intel(R) Core(TM) i7-7500U with 8GB of RAM and a 256GB Intel SSD. We use our modified version of Linux kernel v4.19.51+ for all experiments.

For workload generation, we ran the FIO (FIO) micro-benchmark, configured to perform random read and write operations with 4 threads on a 1GB dataset. Each experiment is executed on a fresh QEMU instance. We cloned the mq-deadline I/O scheduler as mq-kmlib and integrated it with KMLib. We made three key changes in the mq-kmlib I/O scheduler compared to mq-deadline: (i) in the dd_init_queue function, we inserted initialization code to set the learning rate, batch size, momentum, and number of features to learn; initial weights are also set randomly here. (ii) In the dd_dispatch_request function, we call the functions that collect Xt and Yt and perform the training steps. (iii) In the dd_insert_request function, we invoke an inference function and, based on the prediction, decide whether to early-reject the I/O request or not.

We observed that the thresholded regression output could predict with an accuracy of 74.62% whether an I/O request would miss its deadline, and this reduced the overall I/O latency by 8%, a promising result given that I/O is so much slower than memory or CPU (and hence I/O should be the first place to optimize). Our test involved a single synthetic workload that does not cover a large number of use cases, and our performance may not generalize to other workloads. Further, the emulated environment provided by QEMU may not represent a realistic use case, due to the artificial throttling in QEMU. This is why our next step is to investigate whether these results generalize to other workloads under more realistic conditions (e.g., physical machines). We are also planning to apply machine learning models to other storage stack components, such as the page cache.

We wrote nearly 3,000 lines of C/C++ code (LoC). Because the current set of machine learning tools we have implemented is small, the memory footprint of the KMLib user-space library is just 96KB, and the size of the KMLib kernel module is only 804KB. However, we expect these numbers to increase as additional functionality is implemented.
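As a concrete illustration of the learning task wired into these three hooks, the following self-contained C sketch trains an online linear-regression model on the two features above and thresholds its prediction for early rejection. The kml_-prefixed names, the SGD-with-momentum update, and the per-sample (rather than per-batch) training step are our own simplifications for illustration, not KMLib's actual code.

#include <stdbool.h>

#define N_FEATURES 2   /* normalized block number, ordinalized operation type */

struct kml_linreg {
    double w[N_FEATURES];
    double b;
    double lr;                    /* learning rate */
    double momentum;
    double vw[N_FEATURES], vb;    /* momentum buffers */
};

static double kml_linreg_predict(const struct kml_linreg *m, const double x[N_FEATURES])
{
    double y = m->b;
    for (int i = 0; i < N_FEATURES; i++)
        y += m->w[i] * x[i];
    return y;
}

/* One SGD-with-momentum step on a single (x, y) sample, roughly what
 * change (ii) in dd_dispatch_request triggers per collected request. */
static void kml_linreg_train(struct kml_linreg *m, const double x[N_FEATURES], double y)
{
    double err = kml_linreg_predict(m, x) - y;   /* gradient of squared loss w.r.t. prediction */
    for (int i = 0; i < N_FEATURES; i++) {
        m->vw[i] = m->momentum * m->vw[i] + m->lr * err * x[i];
        m->w[i] -= m->vw[i];
    }
    m->vb = m->momentum * m->vb + m->lr * err;
    m->b -= m->vb;
}

/* Change (iii) in dd_insert_request: early-reject when the predicted issue
 * time exceeds a deadline-derived threshold. */
static bool kml_should_early_reject(const struct kml_linreg *m,
                                    const double x[N_FEATURES], double threshold)
{
    return kml_linreg_predict(m, x) > threshold;
}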
5 CONCLUSION

Adapting operating system components to running workloads and hardware has been done by tuning parameters or changing critical data structure properties empirically. We have proposed that lightweight machine learning approaches may help to solve these problems. Our preliminary evaluation shows some promising results. Our plan is to expand on this work, apply it to other OS components, and evaluate and optimize the ML library for a wide range of workloads.

REFERENCES

Hashemi, M., Swersky, K., Smith, J. A., Ayers, G., Litz, H., Chang, J., Kozyrakis, C., and Ranganathan, P. Learning memory access patterns. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1924–1933, 2018.
Shi, Z., Huang, X., Jain, A., and Lin, C. Applying deep learning to the cache replacement problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019, pp. 413–425, 2019.

Smith, W., Foster, I. T., and Taylor, V. E. Predicting application run times using historical information. In Job Scheduling Strategies for Parallel Processing, IPPS/SPDP'98 Workshop, Orlando, Florida, USA, March 30, 1998, Proceedings, pp. 122–142, 1998.

TensorFlow Lite. TensorFlow Lite, January 2020. https://www.tensorflow.org/lite.

Wang, N., Choi, J., Brand, D., Chen, C., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 7686–7695, 2018.

Wang, Y. and Yao, Q. Few-shot learning: A survey. arXiv preprint arXiv:1904.05046, 2019.

Zhang, Y. and Huang, Y. "Learned": Operating systems. SIGOPS Oper. Syst. Rev., 53(1):40–45, July 2019.