Demystifying the Real-Time Linux Scheduling Latency
Daniel Bristot de Oliveira
Red Hat, Italy
[email protected]
Daniel Casini
Scuola Superiore Sant'Anna, Italy
[email protected]
Rômulo Silva de Oliveira
Universidade Federal de Santa Catarina, Brazil
[email protected]
Tommaso Cucinotta
Scuola Superiore Sant’Anna, Italy
[email protected]
Abstract
Linux has become a viable operating system for many real-time workloads. However, the black-box
approach adopted by cyclictest, the tool used to evaluate the main real-time metric of the kernel,
the scheduling latency, together with the absence of a theoretically-sound description of the in-kernel
behavior, casts doubt on Linux meriting the real-time adjective. Aiming at clarifying the
PREEMPT_RT Linux scheduling latency, this paper leverages the Thread Synchronization Model
of Linux to derive a set of properties and rules defining the Linux kernel behavior from a scheduling
perspective. These rules are then leveraged to derive a sound bound on the scheduling latency,
considering all the sources of delays occurring in all possible sequences of synchronization events
in the kernel. This paper also presents a tracing method, efficient in time and memory overheads,
to observe the kernel events needed to define the variables used in the analysis. This results in
an easy-to-use tool for deriving reliable scheduling latency bounds that can be used in practice.
Finally, an experimental analysis compares cyclictest and the proposed tool, showing that the
proposed method can find sound bounds faster with acceptable overheads.
2012 ACM Subject Classification Computer systems organization → Real-time operating systems
Keywords and phrases Real-time operating systems, Linux kernel, PREEMPT_RT, Scheduling
latency
Funding This work has been partially supported by CAPES, The Brazilian Agency for Higher
Education, project PrInt CAPES-UFSC “Automation 4.0.”
Acknowledgements The authors would like to thank Thomas Gleixner, Peter Zijlstra, Steven
Rostedt, Arnaldo Carvalho De Melo and Clark Williams for the fruitful discussions about the model,
analysis, and tool.
© Daniel Bristot de Oliveira, Daniel Casini, Rômulo Silva de Oliveira, and Tommaso Cucinotta
Editor: Marcus Völp; Article No. 9; pp. 9:1–9:23
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
1 Introduction
Real-time Linux has been a recurring topic in both research [5, 6, 30] and industry [10, 11,
12, 21, 39] for more than a decade. Nowadays, Linux has an extensive set of real-time related
features, from theoretically-supported schedulers such as SCHED_DEADLINE [27] to the
priority inversion control in locking algorithms and a fully-preemptive mode. Regarding
the fully-preemptive mode, Linux developers have extensively reworked the Linux kernel
to reduce the code sections that could delay the scheduling of the highest-priority thread,
leading to the well-known PREEMPT_RT variant. cyclictest is the primary tool adopted
in the evaluation of the fully-preemptive mode of PREEMPT_RT Linux [8], and it is used
to compute the time difference between the expected activation time and the actual start
of execution of a high-priority thread running on a CPU. By configuring the measurement
thread with the highest priority and running a background taskset to generate disturbance,
cyclictest is used in practice to measure the scheduling latency of each CPU of the
system. Maximum observed latency values generally range from a few microseconds on
single-CPU systems to 250 microseconds on non-uniform memory access systems [35], which
are acceptable values for a vast range of applications with sub-millisecond timing precision
requirements. This way, PREEMPT_RT Linux closely fulfills theoretical fully-preemptive
system assumptions that consider atomic scheduling operations with negligible overheads.
Despite its practical approach and its contributions to the current state of the art of real-time
Linux, cyclictest has some known limitations. The main one arises from the opaque nature
of the latency value provided by cyclictest [4]. Indeed, it only informs about the latency
value, without providing insights on its root causes. The tracing features of the kernel are
often applied by developers to help in the investigation. However, the usage of tracing is
not enough to resolve the problem: the tracing overhead can easily mask the real sources of
latency, and the excessive amount of data often drives the developer to conjectures that
are not the actual cause of the problem. For these reasons, debugging a latency spike on
Linux generally takes a considerable number of hours of highly specialized engineering work.
A common approach in real-time systems theory is the characterization of a system as
a set of independent variables and equations that describe its integrated timing behavior.
However, the complexity of the execution contexts and the fine-grained synchronization of
PREEMPT_RT make the application of classical real-time analysis to Linux difficult. The Linux
kernel's complexity is undoubtedly a barrier for both expert operating system developers and
real-time systems researchers. The absence of a theoretically-sound definition of the Linux
behavior is widely known, and it inhibits the application of the rich arsenal of already existing
techniques from real-time theory. It also inhibits the development of theoretically-sound
analyses that fit all the peculiarities of the Linux task model [23].
Aware of this situation, researchers and developers have been working together on the
creation of models that explain the Linux behavior using a formal notation, abstracting the
code complexity [2]. The Thread Synchronization Model for the fully-preemptive PREEMPT_RT
Linux kernel [14] proposes an automata-based model to explain the synchronization
dynamics of the de facto standard for real-time Linux. Among other things, the model can
be used as an abstraction layer to translate the kernel dynamics as analyzed by real-time
Linux kernel developers to the abstractions used in real-time scheduling theory.
Paper approach and contributions: This paper leverages the Thread Synchronization
Model [14] of Linux to derive a set of properties and rules defining the Linux kernel behavior
from a scheduling perspective. These properties are then leveraged in an analysis that derives
a theoretically-sound bound on the scheduling latency.
2 Background
This section provides background information on the main concepts used in this paper, and
discusses related research works.
Figure 1 The NMI generator (O1).
Linux has an advanced set of tracing methods [28]. An essential characteristic of the
Linux tracing features is their efficiency. Currently, the majority of Linux distributions have the
tracing features enabled and ready to use. When disabled, the tracing methods have nearly
zero overhead, thanks to the extensive usage of runtime code modifications. Currently, there
are two main interfaces by which these features can be accessed from user-space: perf and
ftrace. The most common action is to record the occurrence of events into a trace buffer
for post-processing or human interpretation of the events. Furthermore, it is possible to take
actions based on events, such as recording a stacktrace. Moreover, tools can also hook into the
tracing methods, processing the events in many different ways, and can be leveraged for other
purposes. For example, the Live Patching feature of Linux uses the function tracer to
hook and redirect the execution of a problematic function to a revised version of the function
that fixes a problem [32]. A similar approach was used for runtime verification of the Linux
kernel, proving to be an efficient approach [18].
3 System Model
The task set is composed of a single NMI τ^{NMI}, a set Γ^{IRQ} = {τ^{IRQ}_1, τ^{IRQ}_2, . . .} of maskable
interrupts (IRQs for simplicity), and a set of threads Γ^{THD} = {τ^{THD}_1, τ^{THD}_2, . . .}. The NMI,
IRQs, and threads are subject to the scheduling hierarchy discussed in Section 2.1, i.e.,
the NMI always has a higher priority than IRQs, and IRQs always have a higher priority
than threads. Given a thread τ^{THD}_i, at a given point in time, the set of threads with a
higher priority than τ^{THD}_i is denoted by Γ^{THD}_{HP_i}. Similarly, the set of threads with a priority lower
than τ^{THD}_i is denoted by Γ^{THD}_{LP_i}. Although the schedulers might have threads with the same
priority in their queues, only one among them will be selected to have its context loaded and,
consequently, start to run. Hence, when scheduling, the schedulers elect a single thread
as the highest-priority one.
Figure 2 IRQ disabled by software (O2). Figure 3 IRQs disabled by hardware (O3).

Figure 4 Context switch generator (O4). Figure 5 Context switch generator (O5).

Figure 6 Preemption disable and enable (O6). Figure 7 Scheduler call (O7).

Figure 8 Thread runnable/sleepable (O8). Figure 9 Need re-schedule operation (O9).
The system model is formalized using the modular approach, where the generators model
the independent action of tasks and synchronization primitives, and the specifications model
the synchronized behavior of the system. The next sections explain the generators as the
basic operations of the system, and the specifications as a set of rules that describe the
system behavior.
The reference model considers two threads: the thread under analysis and an arbitrary
other thread (including the idle thread). The corresponding operations are discussed next.
O4: The thread is not running until its context is loaded in the processor (sched_switch_in).
The context of a thread can be unloaded by a suspension (sched_switch_suspend),
blocking (sched_switch_blocking), or preemption (sched_switch_preempt), as in Figure 4.
O5: The model considers that there is always another thread ready to run. The reason is
that, on Linux, the idle state is implemented as a thread, so at least the idle thread is
ready to run. The other thread can have its context unloaded (sched_switch_out_o)
and loaded (sched_switch_in_o) in the processor, as modeled in Figure 5.
O6: The preemption is enabled by default. Although the same function is used to
disable preemption, the model distinguishes the different reasons to disable it,
as modeled in Figure 6. The preemption can be disabled either to postpone the scheduler
execution (preempt_disable), or to protect the scheduler execution from a recursive call
(preempt_disable_sched). Hereafter, the latter mode is referred to as preemption
disabled to call the scheduler or preemption disabled to schedule.
O7: The scheduler starts to run selecting the highest-priority thread (schedule_entry,
in Figure 7), and returns after scheduling (schedule_exit).
O8: Before being able to run, a thread needs to be awakened (sched_waking). A
thread can set its state to sleepable (sched_set_state_sleepable) when in need of
resources. This operation can be undone if the thread sets its state to runnable again
(sched_set_state_runnable). The automaton that illustrates the interaction among
these events is shown in Figure 8.
O9: The set need re-schedule operation (sched_need_resched) notifies that the currently running
thread is not the highest-priority one anymore, and so the current CPU needs to re-schedule,
in such a way as to select the new highest-priority thread (Figure 9).
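For illustration, a generator can be encoded directly as a transition table, and kernel event traces can be replayed against it; the runtime verification approach mentioned in Section 2 follows a similar principle. The following minimal sketch is ours and not part of the Thread Synchronization Model toolchain; it encodes the software IRQ-disable generator of Figure 2 (O2):

    # Sketch: the O2 generator (Figure 2) as a transition table.
    # States and events come from the model; the encoding is ours.
    O2_IRQ_SOFT = {
        ("enabled", "local_irq_disable"): "disabled",
        ("disabled", "local_irq_enable"): "enabled",
    }

    def step(automaton, state, event):
        # Events outside this generator's alphabet leave its state
        # unchanged: they belong to other generators/specifications
        # in the modular composition.
        return automaton.get((state, event), state)

    state = "enabled"
    for event in ("local_irq_disable", "sched_waking", "local_irq_enable"):
        state = step(O2_IRQ_SOFT, state, event)
    print(state)  # -> enabled

A specification automaton can be encoded in the same way and synchronized with the generators on the events they share.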
3.2 Rules
The Thread Synchronization Model [14] includes a set of specifications defining the synchro-
nization rules among generators (i.e., the basic operations discussed in Section 3.1). Next,
we summarize a subset of rules extracted from the automaton, which are relevant to analyze
the scheduling latency. Each rule points to a related specification, graphically illustrated
with a corresponding figure.
IRQ and NMI rules. First, we start discussing rules related to IRQs and the NMI.
R1: There is no specification that blocks the execution of an NMI (O1) in the automaton.
R2: There is a set of events that are not allowed in the NMI context (Figure 10),
including:
R2a: set the need resched (O9).
R2b: call the scheduler (O7).
R2c: switch the thread context (O4 and O5).
R2d: enable the preemption to schedule (O6).
R3: There is a set of events that are not allowed in the IRQ context (Figure 11), including:
R3a: call the scheduler (O7).
R3b: switch the thread context (O4 and O5).
R3c: enable the preemption to schedule (O6).
R4: IRQs are disabled either by threads (O2) or IRQs (O3), as in the model in Figure 12.
Thus, it is possible to conclude that:
R4a: by disabling IRQs, a thread postpones the beginning of the IRQ handlers.
R4b: when IRQs are not disabled by a thread, IRQs can run.
Thread context. Next, synchronization rules related to the thread context are discussed.
We start presenting the necessary conditions to call the scheduler (O7).
Figure 10 Operations blocked in the NMI context (R2). Figure 11 Operations blocked in the IRQ context (R3).
Figure 12 IRQs disabled by thread or IRQ (R4).

Figure 13 The scheduler is called with interrupts enabled (R5). Figure 14 The scheduler is called with preemption disabled to call the scheduler (R6).
Figure 15 The scheduler context does not enable the preemption (R7). Figure 16 The context switch occurs with interrupts and preemption disabled (R8).
Figure 17 The context switch occurs in the scheduling context (R9). Figure 18 Wakeup and need resched require IRQs and preemption disabled (R10 and R11).
Figure 21 The possible sequences of events from the set_need_resched event (caused by RHP_i) until the context switch, with the five mutually-exclusive cases of Section 4.2 (i-a, i-b, i-c, ii-a, ii-b) highlighted.
Figure 22 Reference timeline of the scheduling latency over the interval [A, F], showing the thread, scheduling, hard IRQ and NMI contexts, the preemption- and IRQ-disabled windows, the interference I^{NMI}(L) and I^{IRQ}(L), and the variables D_{POID}, D_{PSD}, and D_{ST}. The reference events are: EV1 preempt disable to call the scheduler; EV2 schedule call; EV3 IRQ disable; EV4 context switch; EV5 IRQ enable; EV6 schedule return; EV7 preempt enable from the scheduler.
For brevity, we refer next to the event that causes any job of τ^{THD}_i becoming ready and
with the maximum priority as the RHP_i event¹. With Definition 1 in place, this paper aims at
computing a theoretically-sound upper bound on the latency experienced by an arbitrary
τ^{THD}_i ∈ Γ^{THD} under analysis. To this end, we next extract some formal properties and lemmas
from the operations and rules presented in Section 3. We begin by determining which types of
entities may prolong the latency of τ^{THD}_i.
I Property 1. The scheduling latency of an arbitrary thread τ^{THD}_i ∈ Γ^{THD} cannot be prolonged
due to high-priority interference from other threads τ^{THD}_j ∈ Γ^{THD}_{HP_i}.
Proof. By contradiction, assume the property does not hold. Then, due to the priority
ordering, it means that either: (i) τ^{THD}_i was not the highest-priority thread at the beginning
of the interval [A, F] (as defined in Definition 1), or (ii) τ^{THD}_i has been preempted in [A, F].
Both cases contradict Definition 1, hence the property follows. J
Differently, Property 2 shows that the latency of a thread may be prolonged due to
priority-inversion blocking caused by other threads τ^{THD}_j ∈ Γ^{THD}_{LP_i} with a lower priority.
I Property 2. The latency of an arbitrary thread τ^{THD}_i ∈ Γ^{THD} can be prolonged due to
low-priority blocking from other threads τ^{THD}_j ∈ Γ^{THD}_{LP_i}.
Proof. The property follows by noting that, for example, a low-priority thread may disable
the preemption to postpone the scheduler, potentially prolonging the latency of τ^{THD}_i. J
With Property 1 and Property 2 in place, we bound the Linux latency as follows, referring
to an arbitrary thread τ^{THD}_i under analysis. First, as a consequence of Property 1, only
the NMI and IRQs may prolong the latency due to high-priority interference, and such an
interference is equal for all threads τ^{THD}_i ∈ Γ^{THD} since the NMI and IRQs have higher priorities
than threads. We model the interference due to the NMI and IRQs in a time window of
length t with the functions I^{NMI}(t) and I^{IRQ}(t), respectively. We show in Section 5
how to derive such functions. Besides interference, the latency is caused by constant kernel
overheads (e.g., due to the execution of the kernel code for performing the context switch) and
priority-inversion blocking (see Property 2), which we bound with a term L^{IF}. In principle,
the delays originating L^{IF} may be different for each thread τ^{THD}_i ∈ Γ^{THD}. However, for
simplicity, we derive a single bound on L^{IF} that is valid for all threads.
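Anticipating how these terms compose (the derivation of L^{IF} occupies the remainder of this section, and the interference functions are derived in Section 5), the latency bound takes the form of a recursive equation, since the window in which interference must be accounted for is the latency itself:

L = L^{IF} + I^{NMI}(L) + I^{IRQ}(L)

Assuming non-decreasing interference functions, the least positive value satisfying this equation can be computed iteratively; a sketch of such an iteration is given after Lemma 7.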
¹ Note that RHP_i is an event external to the model; for instance, it can be a hardware event that dispatches an IRQ, or the event that causes a thread to activate another thread.
Note that, depending on what the processor is executing when the RHP_i event occurs, not all
the events may be involved in (and hence prolong) the scheduling latency. Figure 21 illustrates
all the allowed sequences of events from the occurrence of the set_need_resched event
(caused by RHP_i) until the context switch (EV4), allowing the occurrence of the other events
(EV5-EV7). According to the automaton model, there are five possible and mutually-exclusive
cases, highlighted with different colors in Figure 21. Our strategy for bounding L^{IF} consists in
deriving an individual bound for each of the five cases, taking the maximum as a safe bound.
To derive the five cases, we first distinguish between: (i) the case in which RHP_i occurs when the current
thread τ^{THD}_j ∈ Γ^{THD}_{LP_i} is in the scheduler execution flow, either voluntarily or involuntarily (i.e.,
as a consequence of a previous set_need_resched), and (ii) the case in which RHP_i occurs when the
current thread is not in the scheduler execution flow.
Variables Selection. One of the most important design choices for the analysis consists
in determining the most suitable variables to be used for deriving the analytical bound.
Since the very early stages of its development, PREEMPT_RT Linux has targeted
minimizing the code portions executed in interrupt context and the code sections in which the
preemption is disabled. One of the advantages of this design choice consists indeed in the
reduction of scheduling delays. Nevertheless, disabling the preemption or IRQs is sometimes
necessary in the kernel code. As pointed out in Property 2, threads may also disable
the preemption or IRQs, e.g., to enforce synchronization, thus impacting on the scheduling
latency. Building upon the design principles of the fully-preemptive PREEMPT_RT kernel,
Table 1 presents and discusses the set of variables selected to bound the latency, which are
more extensively discussed in Section 5 and graphically illustrated in Figure 22. Such
variables consider the longest intervals of time in which the preemption and/or IRQs are
disabled, taking into consideration the different disabling modes discussed in Section 3.
Deriving the bound. Before discussing the details of the five cases, we present a bound on
the interference-free duration of the scheduler code in Lemma 2.
I Lemma 2. The interference-free duration of the scheduler code is bounded by D_{PSD}.
Proof. It follows by noting that, by rule R6, the scheduler is called and returns with the
preemption disabled to call the scheduler and, by rules R2d, R3c, and R7, the preemption is
not enabled again until the scheduler returns. J
Next, we provide a bound on L^{IF} in each of the five possible chains of events.
Case (i). In case (i), the preemption is already disabled to call the scheduler, hence
either set_need_resched has already been triggered by another thread τ^{THD}_j ≠ τ^{THD}_i or the
current thread voluntarily called the scheduler. Then, due to rules R13 and R14, a context
switch will occur. Consequently, the processor continues executing the scheduler code. Due
to rule R5, the scheduler is called with interrupts enabled and preemption disabled, hence
RHP_i (and consequently set_need_resched) must occur because of an event triggered by
an interrupt. By rule R2, the NMI cannot cause set_need_resched; consequently, it must be
caused by an IRQ or the scheduler code itself. Due to EV3, IRQs are masked in the scheduler
code before performing the context switch. We recall that case (i) divides into three possible
sub-cases, depending on whether RHP_i occurs between EV1 and EV2 (case i-a), EV2 and
EV3 (case i-b), or EV3 and EV7 (case i-c). Lemma 3 bounds L^{IF} for cases (i-a) and (i-b).
I Lemma 3. In cases (i-a) and (i-b), it holds:
L^{IF}_{(i-a)} ≤ D_{PSD}, L^{IF}_{(i-b)} ≤ D_{PSD}. (2)
Proof. In both cases it holds that preemption is disabled to call the scheduler and IRQs
have not been disabled yet (to perform the context switch) when RHP_i occurs. Due to rules
R2 and R5, RHP_i may only be triggered by an IRQ or the scheduler code itself. Hence, when
RHP_i occurs, set_need_resched is triggered and the scheduler performs the context switch
for τ^{THD}_i. Furthermore, in case (i-b) the processor already started executing the scheduler
code when RHP_i occurs. It follows that L^{IF} is bounded by the interference-free duration of
the scheduler code. By Lemma 2, such a duration is bounded by D_{PSD}. In case (i-a), the
scheduler has not been called yet, but preemption has already been disabled to schedule.
By rule R12, this will immediately cause a call to the scheduler, and the preemption is not
enabled again between EV1 and EV2 (rules R2d, R3c, and R7). Therefore, also for case (i-a),
L^{IF} is bounded by D_{PSD}, thus proving the lemma. J
Differently, case (i-c), in which RHP_i occurs between EV3 and EV7, i.e., after interrupts
are disabled to perform the context switch, is discussed in Lemma 4.
I Lemma 4. In case (i-c), it holds:
L^{IF}_{(i-c)} ≤ D_{ST} + D_{PAIE} + D_{PSD}. (3)
Proof. In case (i), the scheduler is already executing to perform the context switch of a
thread τ^{THD}_j ≠ τ^{THD}_i. Due to rules R2 and R5, RHP_i may only be triggered by an IRQ or the
scheduler code itself. If the scheduler code itself caused RHP_i before the context switch (i.e.,
between EV3 and EV4), the same scenario discussed for case (i-b) occurs, and the bound
of Equation 2 holds. Then, case (i-c) occurs for RHP_i arriving between EV4 and EV7 for
the scheduler code, or EV3 and EV7 for IRQs. IRQs may be either disabled to perform the
context switch (if RHP_i occurs between EV3 and EV5), or already re-enabled because the
context switch already took place (if RHP_i occurs between EV5 and EV7). In both cases,
thread τ^{THD}_i needs to wait for the scheduler code to complete the context switch for τ^{THD}_j.
If RHP_i occurred while IRQs were disabled (i.e., between EV3 and EV5), the IRQ causing
RHP_i is executed, triggering set_need_resched, when IRQs are enabled again just before
the scheduler returns (see rule R5).
Hence, due to rule R14, the scheduler needs to execute again to perform a second context
switch to let τ^{THD}_i execute. As shown in the automaton of Figure 21, there may exist a possible
system state in case (i-c) (the brown one in Figure 21) in which, after RHP_i occurred and
before the scheduler code is called again, both the preemption and IRQs are enabled before
calling the scheduler (state pe_ie in Figure 21). This system state is visited when the kernel
is executing the non-atomic function to enable preemption, because the previous scheduler
call (i.e., the one that caused the context switch for τ^{THD}_j) enabled IRQs before returning
(EV5). Consequently, we can bound L^{IF} in case (i-c) by bounding the interference-free
durations of three intervals: I_{ST}, which lasts from EV3 to EV7; I_{PAIE}, which accounts
for the kernel being in the state pe_ie of Figure 21 while executing EV7; and I_S, where
preemption is disabled to call the scheduler and the scheduler is called again to schedule τ^{THD}_i
(from EV1 to EV7). By definition and due to Lemma 2 and rules R2d, R3c, R7, and R12,
I_{ST}, I_{PAIE}, and I_S cannot be longer than D_{ST}, D_{PAIE}, and D_{PSD}, respectively. The lemma
follows by noting that the overall duration of L^{IF} is bounded by the sum of the individual
bounds on I_{ST}, I_{PAIE}, and I_S. J
Case (ii). In case (ii), RHP_i occurs when the current thread τ^{THD}_j ∈ Γ^{THD}_{LP_i} is not in the
scheduler execution flow. As a consequence of the RHP_i event, set_need_resched is triggered.
By rule R14, triggering set_need_resched always results in a context switch and, since RHP_i
occurred outside the scheduler code, the scheduler needs to be called to perform the context
switch (rule R9). Hence, we can bound L^{IF} in case (ii) by individually bounding the two time
intervals I_S and I_{SO} in which the processor is executing or not executing the scheduler
execution flow (from EV1 to EV7), respectively. As already discussed, the duration of I_S is
bounded by D_{PSD} (Lemma 2). To bound I_{SO}, we need to consider cases (ii-a)
and (ii-b) individually. Lemma 5 and Lemma 6 bound L^{IF} for cases (ii-a) and (ii-b), respectively.
I Lemma 5. In case (ii-a), it holds:
L^{IF}_{(ii-a)} ≤ D_{POID} + D_{PSD}. (4)
Proof. In case (ii-a), RHP_i occurs due to an IRQ. Recall from Operation O3 that, when an
IRQ is executing, it masks interrupts. Hence, the IRQ causing RHP_i can be delayed by
the current thread or by a lower-priority IRQ that disabled IRQs. When RHP_i occurs, the
IRQ triggering the event disables the preemption (IRQs are already masked) to fulfill R10
and R11, and triggers set_need_resched. If preemption was enabled before executing
the IRQ handler and set_need_resched was triggered, when the IRQ returns it first
disables preemption (to call the scheduler, i.e., preempt_disable_sched). It then unmasks
interrupts (this is a safety measure to avoid stack overflows due to multiple scheduler calls in
the IRQ stack). This is done to fulfill the necessary conditions to call the scheduler discussed
in rules R5 and R6. Due to rules R3a and R12, the scheduler is called once the IRQ returns.
Hence, it follows that in the whole interval I_{SO}, either the preemption or interrupts are
disabled. It then follows that I_{SO} is bounded by D_{POID}, i.e., by the length of the longest
interval in which either the preemption or IRQs are disabled. The lemma follows recalling
that the duration of I_S is bounded by D_{PSD}. J
I Lemma 6. In case (ii-b), it holds:
L^{IF}_{(ii-b)} ≤ D_{POID} + D_{PAIE} + D_{PSD}. (5)
Proof. In case (ii-b), the currently executing thread delayed the scheduler call by disabling
the preemption or IRQs. The two cases in which the RHP_i event is triggered either by a
thread or by an IRQ are discussed below.
(1) RHP_i is triggered by an IRQ. Consider first that RHP_i is triggered by an IRQ. Then,
the IRQ may be postponed by a thread or by a low-priority IRQ that disabled interrupts.
When the IRQ is executed, it triggers set_need_resched. When returning, the IRQ returns
to the previous preemption state², i.e., if preemption was disabled before the execution of the IRQ
handler, it is still disabled, otherwise it is enabled. If the preemption was enabled
before executing the IRQ, the same scenario discussed for case (ii-a) occurs, and the bound
of Equation 4 holds. Otherwise, if the preemption was disabled to postpone the scheduler
execution, the scheduler is delayed due to priority-inversion blocking. It then follows that,
while delaying the scheduler execution, either the preemption or IRQs are disabled. When
preemption is re-enabled by threads and interrupts are enabled, the preemption needs to be
disabled again (this time not to postpone the scheduler execution, but to call the scheduler)
to fulfill the necessary conditions listed in rules R5 and R6, hence necessarily traversing
the pe_ie state (shown in Figure 21), where both preemption and interrupts are enabled.
Hence, it follows that I_{SO} is bounded by D_{POID} + D_{PAIE} if RHP_i is triggered by an IRQ.
(2) RHP_i is triggered by a thread. In this case, the thread triggers set_need_resched.
Since the set_need_resched event requires IRQs and preemption disabled, the scheduler
execution is postponed until IRQs and preemption are enabled (pe_ie state). Once both are
enabled, the preemption is disabled to call the scheduler. It then follows that I_{SO} is bounded
by D_{POID} + D_{PAIE} also if RHP_i is triggered by a thread, and hence that I_{SO} is bounded by
D_{POID} + D_{PAIE} in case (ii-b). The lemma follows recalling that I_S is bounded by D_{PSD}. J

² Note that, internally to the IRQ handler, the preemption state may be changed, e.g., to trigger set_need_resched.
By leveraging the individual bounds on L^{IF} in the five cases discussed above, Lemma 7
provides an overall bound that is valid for all possible event sequences.
I Lemma 7. The interference-free latency is bounded by:
L^{IF} ≤ max(D_{ST}, D_{POID}) + D_{PAIE} + D_{PSD}. (6)
Proof. The lemma follows by noting that cases (i-a), (i-b), (i-c), (ii-a), and (ii-b) are mutually-exclusive
and cover all the possible sequences of events from the occurrence of RHP_i and
set_need_resched to the time instant in which τ^{THD}_i is allowed to execute (as required
by Definition 1), and that the right-hand side of Equation 6 simultaneously upper bounds the
right-hand sides of Equations 2, 3, 4, and 5. J
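For illustration, once L^{IF} (Lemma 7) and the interference functions I^{NMI}(t) and I^{IRQ}(t) (Section 5) are available, the recursive bound sketched in Section 4 can be solved by a standard fixed-point iteration, since the interference functions are non-decreasing. The following sketch, with hypothetical input values, is ours and not the perf rtsl implementation:

    from math import ceil

    # Sketch: solving L = L_IF + I_NMI(L) + I_IRQ(L) by fixed-point
    # iteration, assuming non-decreasing interference functions.
    def latency_bound(l_if, i_nmi, i_irq, horizon=10**9):
        """l_if: interference-free latency bound in ns (Lemma 7);
        i_nmi/i_irq: interference in any window of length t ns."""
        l = l_if
        while True:
            nxt = l_if + i_nmi(l) + i_irq(l)
            if nxt == l:
                return l     # least fixed point: sound latency bound
            if nxt > horizon:
                return None  # no convergence within the horizon
            l = nxt

    # Hypothetical example: one sporadic IRQ with 100us observed
    # minimum inter-arrival time and 5us observed WCET, and no NMI.
    print(latency_bound(
        l_if=50_000,
        i_nmi=lambda t: 0,
        i_irq=lambda t: ceil(t / 100_000) * 5_000,
    ))  # -> 55000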
Figure 23 Overview of the perf rtsl workflow: the latency parser tracepoints are recorded by perf into a perf.data file, which the analysis processes into textual output and charts.
When an NMI occurs, the latency parser computes the execution time of the NMI, and prints the arrival time and the
execution time of the NMI. A similar behavior is implemented for the other metrics, for instance
for the IRQ occurrence. The difference is that the interference must be removed from the other
metrics: for example, if an NMI and an IRQ occur while measuring a candidate D_{POID}, the
IRQ and the NMI execution times are discounted from the measured value.
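For illustration, the discounting amounts to subtracting the execution time of the interrupts nested inside a measured window from the raw window length. The following sketch is ours; the actual latency parser performs this bookkeeping in kernel context at interrupt entry and exit:

    # Sketch: a candidate D_POID is the raw preemption/IRQ-disabled
    # window minus the NMI/IRQ execution time nested inside it.
    def candidate_poid(window_start, window_end, nested_interrupts):
        """nested_interrupts: (start, end) pairs, in ns, of NMIs and
        IRQs that executed inside [window_start, window_end]."""
        raw = window_end - window_start
        interference = sum(end - start
                           for start, end in nested_interrupts)
        return raw - interference

    # An 80us window containing a 5us NMI and a 12us IRQ:
    print(candidate_poid(0, 80_000, [(10_000, 15_000), (30_000, 42_000)]))
    # -> 63000 ns of genuine thread-level disabling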
The latency parser communicates with perf using a new set of tracepoints, and
these are printed to the trace buffer. The following events are generated by the latency
parser:
irq_execution: prints the IRQ identifier, starting time, and execution time;
nmi_execution: prints the starting time and execution time;
max_poid: prints the new maximum observed D_{POID} duration;
max_psd: prints the new maximum observed D_{PSD} duration;
max_dst: prints the new maximum observed D_{ST} duration;
max_paie: prints the new maximum observed D_{PAIE} duration.
By only tracing the return of interrupts and the new maximum values for the thread
metrics, the amount of data generated is reduced to the order of 200 KB of data per second
per CPU. This reduces the overhead of saving data to the trace buffer, while enabling
the measurements to run for hours by saving the results to disk. The data collection
is done by the perf rtsl script. It initiates the latency parser and starts recording its
events, saving the results to the perf.data file. The command also accepts a workload as
an argument. For example, the following command line will start the data collection while
running cyclictest concurrently:
perf script record rtsl cyclictest --smp -p95 -m -q
Indeed, this is how the data collection is done for Section 6. The trace analysis is done
with the following command line: perf script report rtsl. The perf script will read
the perf.data file and perform the analysis. A cyclictest.txt file with the cyclictest output
is also read by the script, adding its results to the analysis as well. The script that runs the
analysis is implemented in python, which facilitates the handling of data, needed mainly for
the IRQ and NMI analysis.
IRQ and NMI analysis. While the variables used in the analysis are clearly defined (Table 1),
the characterization of IRQ and NMI interference is delegated to functions (i.e., I^{NMI}(L)
and I^{IRQ}(L)), for which different characterizations are proposed next. The reason is
that there is no consensus on what could be the single best characterization of interrupt
interference. For example, in a discussion among the Linux kernel developers, it is a common
opinion that the classical sporadic model would be too pessimistic [17]. Therefore, this
work assumes that there is no single way to characterize IRQs and NMIs, opting to explore
different IRQ and NMI characterizations in the analysis. Also, the choice to analyze the
data in user-space using python scripts was made to facilitate the extension of the analysis
by other users or researchers.

Figure 24 perf rtsl output: excerpt from the textual output (time in nanoseconds).

Figure 25 Using perf and the latency parser to find the cause of a large D_{POID} value.

The tool presents the latency analysis assuming the following interrupt characterizations:
No Interrupts: the interference-free latency (LIF );
Worst single interrupt: a single IRQ (the worst over all) and a single NMI occurrence;
Single (worst) of each interrupt: a single (the worst) occurrence of each interrupt;
Sporadic: sporadic model, using the observed minimum inter-arrival time and WCET;
Sliding window: using the worst-observed arrival pattern of each interrupt and the
observed execution time of individual instances;
Sliding window with oWCET: using the worst-observed arrival pattern of each
interrupt and the observed worst-case execution time among all the instances (oWCET).
These different characterizations lead to different implementations of I^{NMI}(L) and I^{IRQ}(L).
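As an illustration of how such characterizations translate into code, the following sketch implements two of them from observed per-IRQ statistics; the data layout and function names are ours, not the tool's internals:

    from math import ceil

    # Observed per-IRQ statistics (in ns); values are hypothetical.
    irq_stats = {
        "timer": {"min_interarrival": 250_000, "owcet": 15_000},
        "eth0":  {"min_interarrival": 120_000, "owcet": 30_000},
    }

    def i_irq_sporadic(t):
        # Sporadic model: each IRQ arrives at most once per observed
        # minimum inter-arrival time and runs for at most its oWCET.
        return sum(ceil(t / s["min_interarrival"]) * s["owcet"]
                   for s in irq_stats.values())

    def i_irq_worst_single(t):
        # "Worst single interrupt": one occurrence of the worst IRQ.
        return max(s["owcet"] for s in irq_stats.values())

    print(i_irq_sporadic(500_000), i_irq_worst_single(500_000))
    # -> 180000 30000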
perf rtsl output. The perf rtsl tool has two outputs: the textual and the graphical one.
The textual output prints a detailed description of the latency analysis, including the values
for the variables defined in Section 4. By doing so, it makes clear the contribution
of each variable to the resulting scheduling latency. An excerpt from the output is shown
in Figure 24. The tool also creates charts displaying the latency results for each interrupt
characterization, as shown in the experiments in Section 6.
When the dominant factor of latency is an IRQ or NMI, the textual output already
serves to isolate the context in which the problem happens. However, when the dominant
factor arises from a thread, the textual output points only to the variable that dominates
the latency. Then, to assist in the search for the culprit code section, the tracepoints that print
each occurrence of the variables from the latency parser can be used. These events are not
used during the measurements because they occur too frequently, but they can be used in
the debug stage. For example, Figure 25 shows the poid tracepoint traced
using perf, capturing the stack trace of the occurrence of a D_{POID} value higher than 60
microseconds³. In this example, it is possible to see that the spike occurs in the php thread
while waking up a process during a fork operation. This trace is precious evidence, mainly
because it is already isolated from other variables, such as the IRQs, that could point in the
wrong direction.

³ The latency parser tracepoints are also available via ftrace.
6 Experimental Analysis
This section presents latency measurements, comparing the results found by cyclictest
and perf rtsl while running concurrently in the same system. The main objective of
this experimental study is to corroborate the practical applicability of the analysis tool.
To this end, we show that the proposed approach provides latency bounds respecting the
under millisecond requirement in scheduling precision (which is typical of applications using
PREEMPT_RT) for most of the proposed interrupt characterizations. The proposed perf
rtsl tool individually characterizes the various sources of latency and composes them
leveraging a theory-based approach allowing to find highly latency-intensive schedules in
a much shorter time than cyclictest. The experiment was made in a workstation with
one Intel i7-6700K CPU @ 4.00GHz processor, with eight cores, and in a server with two
Non-Uniform Memory Access (NUMA) Intel Xeon L5640 CPU @ 2.27GHz processors with
six cores each. Both systems run the Fedora 31 Linux distribution, using the kernel-rt
5.2.21-rt14. The systems were tuned according to the best practices of real-time Linux
systems [34].
The first experiment runs three different workloads on the workstation for 30 minutes.
In the first case, the system is mostly idle. Then, workloads were generated using two
phoronix-test-suite (pts) tests: the openssl stress test, which is a CPU-intensive
workload, and the fio, stress-ng and build-linux-kernel tests together, causing a
mixed range of I/O-intensive workloads [31]. Different columns are reported in each graph,
corresponding to the different characterizations of interrupts discussed in Section 5. The
results of this experiment are shown in Figure 26: 1.a, 1.b and 1.c, respectively. In the second
experiment, the I/O-intensive workload was executed again, with different test durations, as
described in 2.a, 2.b, and 2.c. The results from cyclictest did not change substantially as
the time and workload change. On the other hand, the results of the proposed approach change,
increasing the hypothetical bounds as the kernel load and experiment duration increase.
Consistently with the cyclictest results, the No Interrupts column also does not vary substantially.
The difference comes from the interrupt workload: the more overloaded the system is, and
the longer the tests run, the more interrupts are generated and observed, influencing the
results. In all the cases, the sporadic task model appears to be overly pessimistic for IRQs:
regularly, the oWCET of IRQs was longer than their minimal observed inter-arrival time.
The Sliding Window with oWCET also stands out from the other results. Its results are
truncated in the charts 2.b and 2.c: their values are 467 and 801 microseconds, respectively.

Figure 26 Results of the first (1.a, 1.b, 1.c) and second (2.a, 2.b, 2.c) experiments (latency in microseconds).
Although the reference automata model was developed considering single-core systems, the
same synchronization rules are replicated in the multiple-core (mc) configuration, considering
the local scheduling latency of each CPU. The difference between the single and multiple-core
cases resides in the inter-core synchronization using, for example, spinlocks. However, such
synchronization requires preemption and IRQs to be disabled, hence taking place inside the
already defined variables. Moreover, when cyclictest runs in the --smp mode, it creates a
thread per core, aiming to measure the local scheduling latency. In the mc setup, the workload
experiment was replicated on the workstation. Furthermore, the I/O-intensive experiment
was replicated on the server. The results of these experiments are shown in Figure 27. In
these cases, the effect of the high kernel activation on I/O operations becomes evident in the
workstation experiment (3.c) and in the server experiment (4.a). Again, the Sliding Window
with oWCET stands out from the other results, crossing the millisecond barrier. The source
of the higher values in the thread variables (Table 1) is the cross-core synchronization
using spinlocks. Indeed, the trace in Figure 25 was observed on the server running the
I/O workload. The php process in that case was part of the phoronix-test-suite used to
generate the workload.

Figure 27 Results of the multicore experiments (latency in microseconds): 3.a) Workstation Idle, 3.b) Workstation CPU Intensive, 3.c) Workstation I/O Intensive, and the server I/O-intensive experiment (4.a).

Finally, by running cyclictest with and without the perf rtsl tool, it was
possible to observe that the tracing impact on the minimum, average and maximum values is
in the range from one to four microseconds, which is an acceptable range, given the frequency
at which events occur and the advantages of the approach.
7 Conclusions
The usage of the Thread Synchronization Model [14] was a useful logical step between the real-time
theory and Linux, facilitating the information exchange among these related, but intricate,
domains. The analysis, built upon a set of practically-relevant variables, ends up confirming
what is informally known: the preemption and IRQ disabled sections, along with interrupts,
are the main culprits of scheduling latency. The tangible benefits of the proposed technique come
from the decomposition of the variables and the efficient method for observing their values.
Users and developers now have precise information regarding the sources of latency on
their systems, facilitating system tuning and the identification of where to improve the Linux code,
respectively. The improvement of the tool and its integration with the Linux kernel and
perf code base is the practical continuation of this work.
References
1 L. Abeni, A. Goel, C. Krasic, J. Snow, and J. Walpole. A measurement-based analysis of
the real-time performance of linux. In Proceedings. Eighth IEEE Real-Time and Embedded
Technology and Applications Symposium, September 2002.
2 Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan Stern. Frightening
Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel. In Proceedings
of the Twenty-Third International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’18, pages 405–418, New York, NY, USA, 2018.
ACM. doi:10.1145/3173162.3177156.
3 Neil Audsley, Alan Burns, Mike Richardson, Ken Tindell, and Andy J Wellings. Applying
new scheduling theory to static priority pre-emptive scheduling. Software engineering journal,
8(5):284–292, 1993.
4 Björn Brandenburg and James Anderson. Joint Opportunities for Real-Time Linux and Real-
Time System Research. In Proceedings of the 11th Real-Time Linux Workshop (RTLWS 2009),
pages 19–30, September 2009.
5 B. B. Brandenburg and M. Gül. Global scheduling not required: Simple, near-optimal
multiprocessor real-time scheduling with semi-partitioned reservations. In 2016 IEEE Real-
Time Systems Symposium (RTSS), pages 99–110, November 2016. doi:10.1109/RTSS.2016.
019.
6 John M. Calandrino, Hennadiy Leontyev, Aaron Block, UmaMaheswari C. Devi, and James H.
Anderson. LitmusRT : A testbed for empirically comparing real-time multiprocessor schedulers.
In Proceedings of the 27th IEEE International Real-Time Systems Symposium, RTSS ’06, pages
111–126, Washington, DC, USA, 2006. IEEE Computer Society. doi:10.1109/RTSS.2006.27.
7 Christos G. Cassandras and Stephane Lafortune. Introduction to Discrete Event Systems.
Springer Publishing Company, Incorporated, 2nd edition, 2010.
8 F. Cerqueira and B. Brandenburg. A Comparison of Scheduling Latency in Linux, PREEMPT-
RT, and LITMUS-RT. In Proceedings of the 9th Annual Workshop on Operating Systems
Platforms for Embedded Real-Time applications, pages 19–29, 2013.
26 J. Lehoczky, L. Sha, and Y. Ding. The rate monotonic scheduling algorithm: exact charac-
terization and average case behavior. In [1989] Proceedings. Real-Time Systems Symposium,
pages 166–171, December 1989.
27 Juri Lelli, Claudio Scordino, Luca Abeni, and Dario Faggioli. Deadline scheduling in the Linux
kernel. Software: Practice and Experience, 46(6):821–839, 2016. doi:10.1002/spe.2335.
28 Linux Kernel Documentation. Linux tracing technologies. https://www.kernel.org/doc/
html/latest/trace/index.html, February 2020.
29 G. Matni and M. Dagenais. Automata-based approach for kernel trace analysis. In 2009
Canadian Conference on Electrical and Computer Engineering, pages 970–973, May 2009.
doi:10.1109/CCECE.2009.5090273.
30 L. Palopoli, T. Cucinotta, L. Marzario, and G. Lipari. AQuoSA – Adaptive Quality of Service
Architecture. Softw. Pract. Exper., 39(1):1–31, January 2009. doi:10.1002/spe.v39:1.
31 Phoronix Test Suite. Open-source, automated benchmarking. www.phoronix-test-suite.com,
February 2020.
32 Josh Poimboeuf. Introducing kpatch: Dynamic kernel patching. https://www.redhat.com/
en/blog/introducing-kpatch-dynamic-kernel-patching, February 2014.
33 P. J. Ramadge and W. M. Wonham. Supervisory control of a class of discrete event processes.
SIAM J. Control Optim., 25(1):206–230, January 1987. doi:10.1137/0325013.
34 Red Hat. Inc,. Advanced tuning procedures to optimize latency in RHEL for Real
Time. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_
for_real_time/8/html/tuning_guide/index, February 2020.
35 Red Hat. Inc,. Red Hat Enterprise Linux Hardware Certification. https:
//access.redhat.com/documentation/en-us/red_hat_enterprise_linux_hardware_
certification/1.0/html/test_suite_user_guide/sect-layered-product-certs#
cert-for-rhel-for-real-time, February 2020.
36 F. Reghenzani, G. Massari, and W. Fornaciari. Mixed time-criticality process interferences
characterization on a multicore linux system. In 2017 Euromicro Conference on Digital System
Design (DSD), August 2017.
37 Paul Regnier, George Lima, and Luciano Barreto. Evaluation of interrupt handling timeliness
in real-time linux operating systems. ACM SIGOPS Operating Systems Review, 42(6):52–63,
2008.
38 Steven Rostedt. Finding origins of latencies using ftrace, 2009.
39 Carlos San Vicente Gutiérrez, Lander Usategui San Juan, Irati Zamalloa Ugarte, and Víctor
Mayoral Vilches. Real-time linux communications: an evaluation of the linux communication
stack for real-time robotic applications, August 2018.
URL: https://arxiv.org/pdf/1808.10821.pdf.