
2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
DOI: 10.1109/CCGrid51090.2021.00098

AI-based Resource Allocation: Reinforcement Learning for Adaptive Auto-scaling in Serverless Environments

Lucia Schuler (Karlsruhe Institute of Technology, [email protected])
Somaya Jamil (IBM Research & Development GmbH, [email protected])
Niklas Kühl (IBM Deutschland GmbH; Karlsruhe Institute of Technology, [email protected])

Abstract—Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On the cloud provider side, flexible resource management is required to meet fluctuating demand. It can be enabled through automated provisioning and deprovisioning of resources. A common approach among both commercial and open source serverless computing platforms is workload-based auto-scaling, where a designated algorithm scales instances according to the number of incoming requests. In the recently evolving serverless framework Knative a request-based policy is proposed, where the algorithm scales resources by a configured maximum number of requests that can be processed in parallel per instance, the so-called concurrency. As we show in a baseline experiment, this predefined concurrency level can strongly influence the performance of a serverless application. However, identifying the concurrency configuration that yields the highest possible quality of service is a challenging task due to various factors, e.g. varying workload and complex infrastructure characteristics, influencing throughput and latency. While there has been considerable research into intelligent techniques for optimizing auto-scaling for virtual machine provisioning, this topic has not yet been discussed in the area of serverless computing. For this reason, we investigate the applicability of a reinforcement learning approach to request-based auto-scaling in a serverless framework. Our results show that within a limited number of iterations our proposed model learns an effective scaling policy per workload, improving the performance compared to the default auto-scaling configuration.

Index Terms—serverless, auto-scaling, reinforcement learning, Knative

I. INTRODUCTION

Driven by the advancements and proliferation of virtual machines (VMs) and container technologies, the adoption of serverless computing models has increased in recent years [1]. According to the Cloud Native Computing Foundation, serverless computing offers two main advantages to the user [2]. First, with a true and fine-grained pay-as-you-go pricing model, costs only occur when resources are actually used and not for idle VMs or containers. Second, there is no overhead for the user associated with infrastructure maintenance, such as provisioning, updating, and managing the server resources, as this is delegated to the cloud provider. This also includes flexible on-the-fly scalability which enables resources to be added or removed automatically depending on the incoming load. For providers, the auto-scaling capability provides the ability to optimize resource utilization and reduce the effort required to manage cloud-scale applications [1].

In the implementation, the scaling mechanisms differ within the serverless offerings. Some open source serverless frameworks use the resource-based Kubernetes Horizontal Pod Autoscaler (HPA) to drive scaling via per-instance CPU or memory utilization thresholds (e.g. Fission [3]). This, of course, makes the auto-scaling feature dependent on the fast and correct calculations of the respective system components [4]. Commercially provided serverless platforms often feature workload-based scaling by providing additional resources when incoming traffic increases, e.g. AWS Lambda initializes an instance for each new request coming in until a limit is reached [5]. However, the creation of a new instance implies a certain time lag, known as cold start. To bypass this issue to a certain extent, the recently emerging open-source framework Knative supports parallel processing of up to a predefined number of concurrent requests per instance [6]. When the so-called concurrency is reached, the Knative Pod Autoscaler (KPA) deploys additional pods to handle the load. Moreover, the concurrency parameter can be adjusted manually to use resources more efficiently and to adapt the auto-scaling system to individual workloads.

In the work at hand we show that, depending on the workload, different concurrency levels can influence the performance and can lead to a latency difference of up to multiple seconds. Since this can have a critical impact on the user experience in serverless computing, we propose a reinforcement learning (RL) based model to dynamically determine the optimal concurrency for an individual workload. In general, RL formalizes the idea of an agent learning effective decision-making policies through a sequence of trial-and-error interactions with its environment. Thereby, the agent evaluates the current state of the system dynamics in each iteration, and then decides on a particular action. After the action has been performed, the agent receives either a positive or negative reward and consequently learns about the goodness of the respective action-state combination. As this approach does not require any prior knowledge about the incoming workload and can adapt to changes at runtime, RL algorithms have been proven as valid methods in the field of VM auto-scaling techniques in research [7].

However, it has not been studied in a serverless environment. Therefore, we evaluate the applicability of the established RL algorithm Q-learning to determine the concurrency level with optimized performance. Specifically, we implement a cloud-based framework upon which two consecutive experiments are conducted. First, we perform an analysis to examine performance variations of different workload profiles under different auto-scaling configurations. We demonstrate the dependence of throughput and latency on the concurrency level and indicate the potential for improvement through adaptive scaling settings. Using these results, we enhance the framework with an intelligent RL-based logic to evaluate the ability of a self-learning algorithm for effective decision making in a serverless framework. As we show in a second experiment, our proposed model is able to learn in limited time an appropriate scaling policy without prior knowledge of the incoming workload, resulting in an increased performance compared to the framework's default auto-scaling settings.

The remainder of the work is organized as follows. Section II introduces the serverless platform Knative and the theory of Q-learning. Section III reviews related work in both serverless frameworks and cloud-based auto-scaling techniques. Section IV gives an overview of the underlying experimental setup of the work, based on which Section V presents the tests on the impact of different concurrency limits. Using these findings, Section VI proposes a Q-learning model to adapt the concurrency limit on-the-fly. Section VII concludes the paper with remarks on limitations and possible future work.

II. BACKGROUND

To allow for a common understanding of the application domain and the used techniques, we first provide an overview of the functionality of Knative and its auto-scaling feature. Further, we introduce the theoretical foundations of the Q-learning algorithm which is applied in the second experiment.

A. Knative Serverless Platform

As an open-source serverless platform, Knative provides a set of Kubernetes-based middleware components to support deploying and serving of serverless applications, including the capability to automatically scale resources on demand [6].

The auto-scaling function is implemented by different serving components, described by the request flow in Fig. 1 based on Knative v0.12. If a service revision is scaled to zero, i.e. the service deployment is reduced to zero running pods, the ingress gateway forwards incoming requests first to the activator [6]. The activator then reports the information to the autoscaler, which instructs the revision's deployment to scale up appropriately. Further, it buffers the requests until the user pods of the revision become available, which can cause cold-start costs in terms of latency, as the requests are blocked for the corresponding time. In comparison, if a minimum of one replica is maintained active, the activator is bypassed and the traffic can flow directly to the user pod.

Fig. 1: Illustration of the request flow in Knative v0.12 (ingress gateway, activator, autoscaler, deployment, and user pods consisting of queue-proxy and user-container).

When the requests reach the pod, they are channeled by the queue-proxy container and, subsequently, processed in the user-container. The queue-proxy only allows a certain number of requests to enter the user-container simultaneously, and queues the requests if necessary. The amount of parallel processed requests is specified by the concurrency parameter configured for a particular revision: depending on which concurrency is set in the revision, the queue-proxy will only allow a corresponding number of requests to be processed by the user-container at the same time. By default, the value is set to a concurrency target of 100, defining how many parallel requests are preferred per user-container at a given time. However, the user can restrict the number of concurrent requests by specifying a value between 0 and 1000 for the concurrency limit (a value of 0 allows unlimited concurrent requests, i.e. no concurrency-based scaling [8]). Further, each queue-proxy measures the incoming load, reporting the average concurrency and requests per second on a separate port. The metrics of all queue-proxy containers are scraped by the autoscaler component, which then decides how many new pods will be added or removed to keep the desired concurrency level.
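As an illustration, a minimal sketch of how such a concurrency limit could be updated programmatically through the Kubernetes API is shown below. This is not part of the original system; the API group/version and field names follow the current Knative Serving v1 API (spec.template.spec.containerConcurrency), which may differ in older releases such as the v0.12 version used here, and the service name is a placeholder.

    # Illustrative sketch: updating the concurrency limit of a Knative Service.
    # Patching spec.template creates a new revision, mirroring the per-iteration
    # updates described in Section IV.
    from kubernetes import client, config

    def set_concurrency_limit(service_name: str, namespace: str, limit: int) -> None:
        config.load_kube_config()  # or config.load_incluster_config() inside a cluster
        api = client.CustomObjectsApi()
        patch = {"spec": {"template": {"spec": {"containerConcurrency": limit}}}}
        api.patch_namespaced_custom_object(
            group="serving.knative.dev",
            version="v1",
            namespace=namespace,
            plural="services",
            name=service_name,
            body=patch,
        )

    # Example: restrict a sample service to 70 parallel requests per instance.
    # set_concurrency_limit("autoscale-go", "default", 70)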
B. Q-learning

RL refers to a collection of trial-and-error methods in which an agent is trained to make good decisions by interacting with its environment and receiving positive or negative feedback in the form of rewards for a respective action. A popular RL algorithm is the model-free Q-learning.

Q-learning stepwise trains an approximator Qθ(s, a) of the optimal action-value function Q∗. Qθ(s, a) specifies the cumulated reward the agent can expect when starting in a state s, taking an action a, and then acting according to the optimal policy forever after. By observing the actual reward in each iteration, the optimization of the Q-function is performed incrementally per step t:

    Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · [r_t + γ · max_a Q(s_{t+1}, a)]        (1)

α describes the learning rate, i.e. to what extent newly observed information overrides old information, and γ a discount factor that serves to balance between the current and future reward. As RL is a trial-and-error method, during training the agent has to choose between the exploration of a new action and the exploitation of the current best option [9].
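For intuition, consider a single update of Eq. (1) with illustrative values (not taken from the experiments): with α = 0.5 and γ = 0.9, if the current estimate is Q(s_t, a_t) = 0.4, the observed reward is r_t = 0.8 and max_a Q(s_{t+1}, a) = 0.6, the new estimate becomes 0.5 · 0.4 + 0.5 · (0.8 + 0.9 · 0.6) = 0.87.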

In research, this is often implemented with an ε-greedy strategy, where ε defines the probability of exploration that usually decreases as the learning process advances [10], [11]. With a probability of 1 − ε, the agent selects based on the optimal policy and chooses the action that maximizes the expected return from starting in s, i.e. the action with the highest Q-value:

    a∗(s) = argmax_a Q∗(s, a)        (2)

In the basic algorithm, the Q-values for each state-action combination are stored in a lookup table, the so-called Q-table, indexed by states and actions. The tabular representation of the agent's knowledge serves as a basis for decision-making during the entire learning episode.
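To make the tabular algorithm concrete, the following minimal sketch implements the update rule of Eq. (1) together with an ε-greedy action selection and a decaying exploration probability. It is an illustration only, with generic states and actions rather than the concurrency-specific model introduced later in Section VI.

    # Minimal tabular Q-learning with an epsilon-greedy policy (illustrative only).
    import random
    from collections import defaultdict

    class QLearner:
        def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=1.0,
                     epsilon_decay=0.995, epsilon_min=0.1):
            self.q = defaultdict(float)   # Q-table: (state, action) -> value
            self.actions = actions
            self.alpha, self.gamma = alpha, gamma
            self.epsilon, self.epsilon_decay, self.epsilon_min = epsilon, epsilon_decay, epsilon_min

        def choose_action(self, state):
            # Explore with probability epsilon, otherwise exploit the best known action (Eq. 2).
            if random.random() < self.epsilon:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            # Q-learning update according to Eq. (1).
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            old = self.q[(state, action)]
            self.q[(state, action)] = (1 - self.alpha) * old + self.alpha * (reward + self.gamma * best_next)
            # Decay the exploration probability down to its minimum.
            self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)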
III. RELATED WORK

To the best of our knowledge, the applicability of RL-based technology to optimize auto-scaling capabilities in serverless environments has not been investigated. However, considering the areas of serverless and intelligent auto-scaling separately, a large body of knowledge is available, summarized in the following subsections.

A. Serverless computing

With the growing number of serverless computing offerings, there has been an increasing interest of the academic community in comparing different solutions, with scalability being one of the key elements of evaluation [4]. In multiple works, different proprietary serverless platforms were benchmarked, including their ability to scale, focusing on Amazon Lambda, Microsoft Azure Functions [12], along with Google Cloud Functions [13] and IBM Cloud Functions [14]. Similar studies have been carried out in the area of open-source serverless frameworks, with greater attention paid to the auto-scaling capabilities. Mohanty et al. [15] evaluated Fission, Kubeless, and OpenFaaS and concluded that Kubeless provides the most consistent performance in terms of response time. Another comparison of both qualitative and quantitative features of Kubeless, OpenFaaS, Apache OpenWhisk, and Knative comes to the same conclusion, albeit generally indicating the limited user control over custom Quality of Service requirements [16]. These studies solely consider the default auto-scaler, the Kubernetes HPA; possible adjustments to the auto-scaling mechanism itself are not further examined. Li et al. [4] propose a more concrete distinction between resource-based and workload-based scaling policies. The authors compare the performance of different workload scenarios using the tuning capability of concurrency levels in Knative and clearly suggest further investigation of the applicability of this auto-scaling capability, which further motivates this research.

B. Auto-scaling

As elasticity is one of the main characteristics driving the increasing adoption of cloud computing, the automatic, on-demand provisioning of cloud resources has been the subject of intensive research in recent years [17]. We discuss related work under two aspects: first, the underlying theories on which auto-scaling is built, with a focus on RL, and second, the entities being scaled.

To classify the numerous techniques at the algorithmic level, different taxonomies were proposed, where the predominant categories are threshold-based rules, queuing theory and RL [7], [17]. In the former, scaling decisions are made on predefined thresholds; this approach is most popular among public cloud providers, e.g. Amazon ECS [18]. Despite the simplistic implementation, identifying suitable thresholds requires expert knowledge [17] or explicit application understanding [19]. Queuing theory has been used to mathematically model applications [7]. As such models usually assume a stationary system, they are less reactive towards changes [7].

In contrast, RL offers an interesting approach through online learning of the most suitable scaling action and without the need for any a-priori knowledge [7]. Many authors have therefore investigated the applicability of model-free RL algorithms, such as Q-learning, in recent years [10]. Dutreilh et al. [19] show that although Q-learning-based VM controlling requires an extensive learning phase and adequate system integration, it can lead to significant performance improvements compared to threshold-based auto-scaling, since thresholds are often set too tightly while seeking the optimal resource allocation. To combine the advantages of both, Q-learning itself can be used to automatically adapt thresholds to a specific application [20].

In terms of the entity being scaled, RL has been mostly applied to policies for VM allocation, e.g. in [21]. With the emergence of container-based applications, this field has become a greater focus of research [10]. In both areas, the scope of action is concentrated mainly on horizontal scaling (scale-out/-in) [20], vertical scaling (scale-up/-down) [22], or the combination of both [10]. However, little research has been done in areas that extend the classic auto-scaling problem of VM or container configuration.

As a novel approach, we investigate the applicability of Q-learning to request-based auto-scaling in a serverless environment. Differently from the existing work on direct vertical or horizontal scaling with RL, we propose a model that learns an effective scaling policy by adapting the level of concurrent requests per container instance to a specific workload.

IV. APPROACH

To investigate different concurrency configurations, a flexible Kubernetes-based framework is designed which can be extended by an intelligent RL-based logic. In this section, we present the experimental setup including the cloud architecture, the specification of the utilized workload and the standard process flow. This section provides the foundation for both the first experiment assessing the impact of concurrency changes and the second experiment evaluating RL-based auto-scaling.

A. Cloud Architecture

The overall architecture of our experiment is illustrated in Fig. 2. To test the auto-scaling capabilities in an isolated environment, we set up two separate Kubernetes clusters using IBM Cloud Kubernetes Service (IKS).

Fig. 2: Architectural setup including the information flow between the three components agent, client, and service cluster (the agent's decision controller and metrics monitor exchange concurrency configurations, start signals, metrics and requests/responses with the client cluster and the service cluster's service deployment and Knative components).

Fig. 3: Process flow of one test iteration (update concurrency limit, create new revision, send start signal, send requests, process requests, return responses, scrape metrics, return metrics, update test results).
On the service cluster, the sample service used for the experiments is deployed. The cluster is designed to provide sufficient capacity to host all Knative components and avoid performance limitations (9 nodes, 16 vCPU, 64 GB memory). The client cluster is responsible for sending requests to the service cluster to generate load (one node, 16 vCPU, 64 GB memory). The agent manages the activities on both clusters, including the configuration updates of the sample service based on collected metrics, and coordinates the process flow of the experiment, taking the role of an IKS user.

The Knative resources are installed on the service cluster (version v0.12), including the serving components explained in Section II-A, which control the state of the deployed sample service and enable auto-scaling of additional pods on demand. Using the trial-and-error method of RL in the second experiment, we update the concurrency configuration of the service in each iteration, creating a new revision each time. Incoming requests are routed by default to the most recent revision with the newest concurrency update.

To comprehensively test the auto-scaling capability, we activated the scale-to-zero functionality in the autoscaler's configmap, which requires a cold start in each iteration. We further increased the replica number of the ingress gateways, which handle load balancing, to bypass performance issues and to focus exclusively on the auto-scaling functionalities.

B. Workload

Serverless computing is used for a variety of applications, accompanied by different resource requirements. For example, the processing of video material or highly-parallel analytical workloads demand considerable memory and computing power. Other applications, such as chained API compositions or chatbots, tend to be less compute-intensive but may require longer execution or response times.

To investigate the concurrency impact of many different workloads, we generate a synthetic, stable workload profile simulating serverless applications. We use Knative's example Autoscale-go application for this purpose, which allows different parameters to be passed with the request to test incremental variations of the workload characteristics and thus emulate varying CPU- and memory-intensive workloads [23]. The three application parameters are bloat, prime and sleep: the first is used to specify the number of megabytes to be allocated and the second to calculate the prime numbers up to the given number, to create either memory- or compute-intensive loads. The sleep parameter pauses the request for the corresponding number of milliseconds, as in applications with certain waiting times.
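To illustrate how such a workload is parameterized, the sketch below issues a single request to the Autoscale-go sample. The query parameter names bloat, prime and sleep are taken from the sample application [23]; the service URL is a placeholder for the route exposed by the deployed Knative service, and the snippet is not part of the original framework.

    # Illustrative sketch: one parameterized request against the Autoscale-go sample.
    import requests

    def invoke_workload(base_url: str, bloat_mb: int = 0, prime: int = 0, sleep_ms: int = 0) -> float:
        params = {"bloat": bloat_mb, "prime": prime, "sleep": sleep_ms}
        resp = requests.get(base_url, params=params, timeout=60)
        resp.raise_for_status()
        return resp.elapsed.total_seconds()  # coarse per-request latency in seconds

    # Example: workload profile #VII from Table I (bloat=128 MB, prime=10,000, sleep=1000 ms).
    # invoke_workload("http://autoscale-go.default.example.com", 128, 10_000, 1_000)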
C. Process Flow

The basic process flow of one iteration is illustrated in Fig. 3. In each iteration the agent sends a concurrency update to the service cluster, which accordingly creates a new revision with the respective concurrency limit. When the service update is complete, the agent sends the start signal to the client cluster, which begins issuing parallel requests against the service cluster. To simulate a large number of user requests at the same time, we use the HTTP load testing tool Vegeta, which features sending HTTP requests at a constant rate. In the experiment, 500 requests are sent simultaneously over a period of 30s to ensure sufficient demand for scaling and sufficient time to provide additional instances. After the last response is received, Vegeta outputs a report of the test results, including information on the latency distribution of requests, the average throughput and the success ratio of responses. The performance measures are then stored by the agent. Additionally, the agent crawls metrics from the Knative monitoring components, exposed via a Prometheus-based HTTP API within the cluster, to get further information about resource usage at cluster, node, pod and container level. Using this data, the concurrency update is chosen to proceed to the next iteration.

V. BASELINE EXPERIMENT

To determine the implications of varying concurrency limits, we first conduct a baseline experiment comparing different workloads on their relative performance.

A. Design

As outlined in the previous section, we use the application parameters bloat, prime and sleep to simulate varying workload characteristics. Starting with a no-operation workload where no parameters are passed, the memory allocation and CPU load were gradually increased for each new experiment.

The step size of the memory-allocating parameter was aligned with the memory buckets commonly used for the standard pricing model of serverless platforms. To simulate compute-intensive and longer-lasting requests, different prime and sleep parameters were chosen correspondingly. The detailed values are specified in Table I.

Per profile, we run performance tests for varying concurrency levels according to the process flow described in Fig. 3. Theoretically, the concurrency limit can take any value between 0 and 1000. To keep the experiments computationally feasible, we proceed in steps of 20, starting at a concurrency limit of 10 and ending at 310. As stated in related literature, we focus on latency and throughput as key performance measures of serverless applications [4]. Average throughput is defined by requests per second (RPS); mean latency refers to the average time in seconds taken to return the request response. To cover tail latency, we include the 95th percentile latency as an additional metric. Furthermore, each test is repeated ten times to compensate for outliers or other fluctuations, before the concurrency is updated to the next limit.

TABLE I: Concurrency performance tests. The workload profile is given by bloat (in MB), prime and sleep (in milliseconds); the three right-hand columns list the concurrency limit yielding the best performance in terms of throughput, mean latency and 95th percentile latency, respectively.

    Test   bloat   prime     sleep   thrghpt   mean lat.   95th lat.
    I      -       -         -       50        50          70
    II     128     -         -       30        30          30
    III    128     1,000     -       10        10          10
    IV     128     10,000    -       30        30          10
    V      128     100,000   -       10        10          10
    VI     128     1,000     1000    110       110         150
    VII    128     10,000    1000    70        70          70
    VIII   128     100,000   1000    110       110         110
    IX     256     -         -       10        10          10
    X      256     1,000     -       10        10          10
    XI     256     10,000    -       10        10          10
    XII    256     100,000   -       10        10          10
    XIII   256     1,000     1000    50        50          50
    XIV    256     10,000    1000    110       110         110
    XV     256     100,000   1000    10        10          30
    XVI    512     -         -       10        10          10
    XVII   1024    -         -       30        30          30

B. Results

We structure the analysis of the baseline experiment results in three parts. First, we examine the behavior of the individual workload profiles under different concurrency configurations. Second, we focus on the relation of the target variables throughput and latency. Finally, we analyze further metrics about resource utilization on container and pod level.

As described above, we conducted the experiment for different combinations of the three parameters to simulate possible use cases. Table I gives an overview of the outcomes with the concurrency limit that led to the optimal test result in terms of one of the performance measures. Due to the numerous uncontrollable factors that influence the performance of the cluster, each result forms a snapshot in time. The respective workload configuration is described by the three columns on the left. Taking all tests into account, the smallest tested concurrency of 10 is the most common configuration that resulted in the best performance across all three indicators. Interestingly, this does not correspond to the default setting of the KPA, where a target concurrency value of 100 is preferred [24]. In particular, workloads that consume memory exclusively perform better with fewer parallel requests per pod instance, e.g. tests #II, #IX, #XVI and #XVII. Similar observations are made for workloads with additional low CPU usage, i.e. a lower prime parameter, as in tests #III and #X. Deviations can be observed when the requests pause for a certain time: these workloads result in higher throughput and lower mean and tail latency when a higher concurrency is chosen, e.g. tests #VII, #VIII and #XIV.

Depending on the workload, the distance between the optimal configuration and the second-best concurrency can be very small, which becomes more evident when analyzing a single test in detail. Fig. 4 shows the result of test #VII, which is examined representatively.

Fig. 4: Performance in the baseline experiment (workload #VII): throughput in RPS, mean latency and 95th percentile latency in seconds, plotted over concurrency limits from 10 to 310.

Although the individual measurement points fluctuate, clear trends are identified in the average values. A significant increase in throughput can be observed when the concurrency limit is raised to 70. This setting also yields the lowest value for mean latency, differing from the second-best value at concurrency 50 by only 80 milliseconds. The distance becomes more critical when considering the tail latency of the 95th percentile, where a request takes on average more than 740 milliseconds longer to receive a response when compared to the most effective configuration. At a concurrency of 10, the difference amounts to almost 3 seconds, further underlining the performance variations caused by the different settings. The greatest slowdown in tail latency in this test occurs at a concurrency of 310 with more than 3.7 seconds.

Besides, the overall performance decreases strongly when the concurrency limit exceeds a level of 210. This tendency can be found across the majority of tests, indicating that due to the high simultaneous processing of many requests, only a limited amount of resources is available for a single request. Further observations show that with increasing memory utilization, i.e. the bloat parameter, performance tends to drop at lower concurrency limits. In some cases, additionally, the success ratio strongly declines. For example, in test #XVI, from a concurrency of 170 onwards, more than 10% of the requests received non-successful responses. In test #XVII accordingly, this behavior can be observed from a concurrency of 90 onwards.

Focusing on the target metrics, the tests show that adjusting the concurrency limit to an appropriate setting can yield significant improvements in throughput and latency. Furthermore, an inverse behavior of the measures can be observed within the tests.
For the previously considered test #VII, the results indicate a significant negative correlation of throughput and mean latency of −0.989, and a similar correlation for throughput and 95th-percentile latency of −0.916.² This strong negative relationship between the metrics is found across all tests, with significant correlation coefficients ranging from −0.995 to −0.748.³ Consequently, an improvement in throughput usually results in a lower and more favorable latency. This finding implies that there is no need to make trade-offs between different target metrics when adjusting the concurrency. Instead, the problem can be reduced to one objective metric, representing the others.

² For all statistical tests, the Pearson correlation coefficient is used with a two-sided p-value for testing non-correlation and an alpha level of .001.
³ Except for test #I and #XIII with significant correlations of throughput and tail latency of −0.629, and throughput and mean latency of −0.677.

VI. REINFORCEMENT LEARNING EXPERIMENT

The experiment described in the previous section demonstrates the impact the concurrency configuration can have on performance. Therefore, we evaluate the applicability of the model-free RL algorithm Q-learning in a second experiment to learn effective scaling policies by adjusting the concurrency limit during runtime.

A. Design

The process flow is based on the procedure from Section IV-C, extended with a more sophisticated logic of the agent. Instead of incrementally increasing the concurrency, the agent uses knowledge of the system environment (states) to test different concurrency updates (actions) and evaluates them by receiving scores (rewards).

In each iteration, the environment is defined by the current state, which should provide a complete description of the system dynamics including all relevant information for optimal decision making. Due to the large number of factors influencing performance, e.g. hidden cluster activities or network utilization, this is neither traceable nor computationally feasible in the used Q-learning algorithm. Therefore, we break down our state space S into three key features. We define S at time step i as the combination of the state variables s_i = (conc_i, cpu_i, mem_i), where conc_i depicts the concurrency limit, cpu_i is the average CPU utilization per user-container and mem_i is the average memory utilization per user-container. The selection of the features is aligned with related research, with conc_i as the equivalent of the number of VMs in VM auto-scaling approaches [21], [25]. Further, cpu_i and mem_i serve as a direct source of information on the resource utilization of a respective workload and are therefore used to describe the current system state.

Since both CPU and memory utilization are continuous numbers, we discretize them into bins of equal size. In each state s_i ∈ S, we define A(s_i) as the set of valid actions, where A is the set of all actions. The agent can choose between decreasing, maintaining or increasing the concurrency limit by 20, i.e. A = {−20, 0, +20}. If the agent reaches the endpoints of the concurrency scale, i.e. the minimum or maximum concurrency, the action space in this state is reduced accordingly by the non-executable action.

After each iteration, the agent receives an immediate reward according to the performance achieved through the action. In related literature, the reward is often based on the distance or ratio between the performance measure and a certain Service Level Agreement, such as a throughput or response time target value [20], [26]. Since there is no target level to be achieved nor prior information about the performance given in our problem definition, we define an artificial reference value ref_value as the best value obtained to date. Due to the permanent, albeit minor, fluctuations in the measures, we propose a tolerance band around the reference value to avoid weighting minor, non-relevant deviations. Furthermore, the results from the preliminary study have shown a highly negative correlation between throughput and latency, i.e. higher throughput usually leads to lower and therefore better latency. This relation in turn allows us to focus exclusively on throughput (thrghpt) as one single objective. The calculation of the reward r in time step i is as follows:

    r_i = thrghpt_i / ref_value    if thrghpt_i ≤ ref_value · 0.95  or  thrghpt_i ≥ ref_value · 1.05
    r_i = 1                        otherwise

Q-learning is initiated with the following parameters. A learning rate α = 0.5 is chosen to balance newly acquired and existing information, and a discount factor γ = 0.9 to ensure that the agent strives for a long-term high return. To encourage the exploration of actions at the beginning of training, we implement a decaying ε-greedy policy starting at iteration 50 with ε = 1, which then slowly decreases over time by a decay factor of 0.995 per iteration. The minimum exploration probability is set to ε_min = 0.1 to allow for the detection of possible changes in the system. The knowledge the agent acquires is stored in a Q-table and updated each iteration.

To examine whether the model can effectively learn the concurrency values identified in Section V as high-throughput configurations, the results are analyzed representatively based on workload tests #VII and #X. The former test showed high performance at a concurrency limit of 70, while the second reached the best test results at an edge concurrency of 10.
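To tie the pieces of the design together, the following sketch shows one possible implementation of the state discretization, the restricted action set and the tolerance-band reward described above. It is an illustration under stated assumptions (bin width, and concurrency bounds of 10 and 310 as in the baseline experiment) rather than the original code.

    # Illustrative sketch of the agent-environment interface from Section VI-A.
    # Bin width, bounds and helper names are assumptions for illustration.
    CONC_MIN, CONC_MAX, STEP = 10, 310, 20

    def discretize(value: float, bin_width: float = 0.1) -> int:
        # Map a continuous utilization value (0.0-1.0) to an equally sized bin index.
        return int(value // bin_width)

    def make_state(concurrency: int, cpu_util: float, mem_util: float) -> tuple:
        # s_i = (conc_i, cpu_i, mem_i) with discretized CPU and memory utilization.
        return (concurrency, discretize(cpu_util), discretize(mem_util))

    def valid_actions(concurrency: int) -> list:
        # A = {-20, 0, +20}, reduced at the endpoints of the concurrency scale.
        actions = [-STEP, 0, STEP]
        if concurrency - STEP < CONC_MIN:
            actions.remove(-STEP)
        if concurrency + STEP > CONC_MAX:
            actions.remove(STEP)
        return actions

    def reward(throughput: float, ref_value: float) -> float:
        # Tolerance band of +/-5% around the best throughput observed so far.
        if throughput <= ref_value * 0.95 or throughput >= ref_value * 1.05:
            return throughput / ref_value
        return 1.0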
B. Results

First, we analyze the results to examine the suitability of the proposed Q-learning-based model to fine-tune the auto-scaling. Second, we evaluate the performance of the approach in terms of throughput improvements compared to Knative's default auto-scaling configuration.

Based on workload profile #X, Fig. 5 shows how the agent applies the RL logic to incrementally change the concurrency and to adjust it as the training progresses. Beginning at concurrency 170, random exploration leads to a moderate decline of the concurrency limit in the first 30 iterations. This results in an improvement in throughput, captured by the rewards and corresponding Q-values for each state-action combination.

Fig. 5: Performance of the Q-learning model (workload #X): concurrency limit, throughput (RPS) and ε-probability plotted over 600 iterations.

The most effective scaling policy of 10 parallel requests per container is first reached in iteration 121. Nevertheless, due to the ε-greedy strategy, exploratory actions are chosen, which might differ from the down-scaling decision and cause the agent to deviate from a good strategy. As training progresses, a trend towards performance-enhancing concurrency configurations can be observed, indicating the agent is more likely to exploit the optimal decision rather than to explore. After 330 iterations, the concurrency stabilizes at a limit of 10 parallel requests per container, implying the agent has learned the correct scaling policy, in accordance with the results from Section V. Due to the minimum ε = 0.1, exploration still rarely occurs to ensure the agent can respond to changes in the environment.

A different learning process of the proposed Q-learning approach can be observed for workload #VII, depicted in Fig. 6. The varying concurrency curve shows the initial strategy of the agent, exploring first the higher state space before proceeding with lower concurrency limits. After 250 iterations the exploitation phase outweighs and the concurrency gradually levels off. In comparison to workload #X, where the algorithm's scaling policy converges to a single concurrency limit, the configuration here fluctuates, mainly between 50 and 70, and retains this pattern. Further differences between the test results arise from the throughput metric, which shows strong fluctuations between 230 and 415 RPS across all iterations. The deviations, which also appear within one concurrency setting, considerably impair the agent's ability to evaluate suitable state-action pairs via the reward function. Nevertheless, the agent is able to narrow down the scaling range to a limited number of values at which it identified the best outcomes in terms of throughput, which agrees with the result of the baseline experiment in Section V.

Fig. 6: Performance of the Q-learning model (workload #VII): concurrency limit, throughput (RPS) and ε-probability plotted over 600 iterations.

To evaluate the proposed scaling policies, we benchmark the average performance of the Q-learning-based approach against the static default setting. For this purpose, the same experimental setup is used as in the Q-learning test, except for the auto-scaling configuration, where the original setting of a concurrency target of 100 is applied [6].⁴ Fig. 7 depicts the average throughput up to the respective iteration of the Q-learning model and the default configuration for the two considered workloads. Both workloads result in the Q-learning model outperforming the test based on Knative's standard settings. Considering workload #VII first, the model requires approximately 150 iterations until the average performance reaches the default level. Subsequently, the throughput increases to an average of 400 RPS, providing a minor advantage of 20 RPS compared to the standard system. A more significant enhancement is shown by workload #X. While in the first 10 iterations the default settings alternate between 350 and 440 RPS, the performance of our model is initially lower. However, with ongoing learning the average throughput improves and surpasses the default already from iteration 10 onwards. After 600 iterations, the presented Q-learning-based model reaches an average throughput of 740 RPS, hence exceeding the performance of the default setting, which stabilizes at 390 RPS on average, by more than 80%.

Fig. 7: Comparison of the average throughput of the Q-learning model and the Knative default auto-scaling setting for workloads #VII and #X over 600 iterations.

⁴ Additionally, the container target percentage is set to 0.7 as in the default configmap.

To summarize the results, the proposed model learned within finite time a scaling policy that outperforms the default Knative configuration in terms of throughput, proving that the Q-learning-based approach is well feasible to refine the auto-scaling mechanism.

VII. CONCLUSION

With the emergence of serverless frameworks, the ability of dynamic, real-time resource provisioning to meet varying demand has become a key area of interest and has led to the development of numerous scaling mechanisms. Focusing on request-based scaling, we first investigated the impact that modifying the main scaling parameter, i.e. the number of concurrent requests per instance, may have on performance. The experiments showed deviations of up to multiple seconds in the average latency as well as significant differences in throughput,
thus indicating that the concurrency configuration can affect the performance depending on the workload. To flexibly adjust the auto-scaling settings to specific requirements, we designed an RL model based on Q-learning and evaluated its applicability to learn effective scaling policies during runtime. Based on different workloads, we showed that the proposed model can adapt the concurrency appropriately without prior knowledge within limited time and outperforms the default setting of Knative in terms of average throughput.

Given these results, the presented work offers valuable contributions to both the existing work in the field of serverless frameworks and the application of RL-based auto-scaling.

• In addition to previous studies on scaling capabilities in serverless platforms, we provided a detailed analysis to reveal the performance implications of changes in the concurrency configuration.
• Furthermore, we demonstrated with our proposed model the applicability of Q-learning-based auto-scaling in the field of serverless applications.
• Additionally, the findings contribute to the ongoing development of the auto-scaling system of the Knative community project.

Nevertheless, we identified some limitations in the approach during the experiments. First, the results from Section V are based on synthetic workloads simulated by one application with varying parameters, and thus cannot be interpreted as a universally valid conclusion on the effects of real-world applications. Second, due to the focus on the general applicability of Q-learning, the approach uses a rather simplistic reward function measuring exclusively the proximity to the reference value. Further refinement of the reward function may improve the efficiency of the proposed model. Similarly, the description of the system state of the RL environment could be extended by additional parameters such as memory allocation and time constraints to improve the model's accuracy.

While in this work an RL approach has been developed which learns a certain scaling policy per workload mainly through testing different concurrency states, it remains to be analyzed to what extent the ratio of resource usage of individual components might impact the performance. Thus, a comprehensive study could be conducted to determine the combination of utilization levels that might achieve the best possible performance across all workloads. Consequently, the concurrency configuration could merely serve as a tool to bring the system into this particular state.

REFERENCES

[1] P. Castro, V. Ishakian, V. Muthusamy, and A. Slominski, "The server is dead, long live the server: Rise of serverless computing, overview of current state and future trends in research and industry," arXiv preprint arXiv:1906.02888, 2019.
[2] S. Allen, C. Aniszczyk, C. Arimura et al., "CNCF serverless whitepaper v1.0," 2018. [Online]. Available: https://github.com/cncf/wg-serverless/blob/master/whitepapers/serverless-overview/cncf serverless whitepaper v1.0.pdf
[3] "A high-level view of the internals of Fission," https://github.com/fission/fission/blob/master/Documentation/Architecture.md, accessed: 2020-03-26.
[4] J. Li, S. G. Kulkarni, K. Ramakrishnan, and D. Li, "Understanding open source serverless platforms: Design considerations and performance," in Proceedings of the 5th International Workshop on Serverless Computing, 2019, pp. 37–42.
[5] "AWS Lambda function scaling," https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html, accessed: 2020-03-26.
[6] "Knative serving autoscaling system," https://github.com/knative/serving/blob/master/docs/scaling/SYSTEM.md, accessed: 2020-03-26.
[7] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano, "A review of auto-scaling techniques for elastic applications in cloud environments," Journal of Grid Computing, vol. 12, no. 4, pp. 559–592, 2014.
[8] "Configuring Knative Serving autoscaling," https://docs.openshift.com/container-platform/4.2/serverless/configuring-knative-serving-autoscaling.html, accessed: 2020-03-28.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[10] F. Rossi, M. Nardelli, and V. Cardellini, "Horizontal and vertical scaling of container-based applications using reinforcement learning," in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD). IEEE, 2019, pp. 329–338.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[12] W. Lloyd, S. Ramesh, S. Chinthalapati, L. Ly, and S. Pallickara, "Serverless computing: An investigation of factors influencing microservice performance," in 2018 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 2018, pp. 159–169.
[13] L. Wang, M. Li, Y. Zhang, T. Ristenpart, and M. Swift, "Peeking behind the curtains of serverless platforms," in 2018 USENIX Annual Technical Conference (USENIX ATC 18), 2018, pp. 133–146.
[14] H. Lee, K. Satyam, and G. Fox, "Evaluation of production serverless computing environments," in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE, 2018, pp. 442–450.
[15] S. K. Mohanty, G. Premsankar, M. Di Francesco et al., "An evaluation of open source serverless computing frameworks," in CloudCom, 2018, pp. 115–120.
[16] A. Palade, A. Kazmi, and S. Clarke, "An evaluation of open source serverless computing frameworks support at the edge," in 2019 IEEE World Congress on Services (SERVICES), vol. 2642. IEEE, 2019, pp. 206–211.
[17] P. Singh, P. Gupta, K. Jyoti, and A. Nayyar, "Research on auto-scaling of web applications in cloud: survey, trends and future directions," Scalable Computing: Practice and Experience, vol. 20, no. 2, pp. 399–432, 2019.
[18] "AWS service auto scaling," https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html, accessed: 2020-03-27.
[19] X. Dutreilh, A. Moreau, J. Malenfant, N. Rivierre, and I. Truck, "From data center resource allocation to control theory and back," in 2010 IEEE 3rd International Conference on Cloud Computing. IEEE, 2010, pp. 410–417.
[20] S. Horovitz and Y. Arian, "Efficient cloud auto-scaling with SLA objective using Q-learning," in 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud). IEEE, 2018, pp. 85–92.
[21] C. Bitsakos, I. Konstantinou, and N. Koziris, "DERP: A deep reinforcement learning cloud system for elastic resource provisioning," in 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 2018, pp. 21–29.
[22] J. Rao, X. Bu, C.-Z. Xu, and K. Wang, "A distributed self-learning approach for elastic provisioning of virtualized cloud resources," in 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. IEEE, 2011, pp. 45–54.
[23] "Knative Autoscale-go sample app - Go," https://github.com/knative/docs/tree/master/docs/serving/samples/autoscale-go, accessed: 2020-04-04.
[24] "Configuring autoscaling," https://knative.dev/v0.12-docs/serving/configuring-autoscaling//, accessed: 2020-04-28.
[25] E. Barrett, E. Howley, and J. Duggan, "Applying reinforcement learning towards automating resource allocation and application scalability in the cloud," Concurrency and Computation: Practice and Experience, vol. 25, no. 12, pp. 1656–1674, 2013.
[26] X. Dutreilh, S. Kirgizov, O. Melekhova, J. Malenfant, N. Rivierre, and I. Truck, "Using reinforcement learning for autonomic resource allocation in clouds: towards a fully automated workflow," in ICAS 2011, The Seventh International Conference on Autonomic and Autonomous Systems, 2011, pp. 67–74.
