Scheduling in Linux
COMS W4118
Spring 2008
Scheduling Goals
O(1) scheduling; the 2.4 scheduler iterated through
the run queue on each invocation
the task queue at each epoch
Scale well on multiple processors
per-CPU run queues
SMP affinity
Interactivity boost
Fairness
Optimize for one or two runnable processes
Basic Philosophies
Priority is the primary scheduling mechanism
Priority is dynamically adjusted at run time
Processes denied access to the CPU get their priority increased
Processes running a long time get their priority decreased
Try to distinguish interactive processes from non-interactive ones
Bonus or penalty reflecting whether I/O bound or compute bound
Use large quanta for important processes
Modify quanta based on CPU use
Quantum != clock tick
Associate processes to CPUs
Do everything in O(1) time
The Run Queue
140 separate queues, one for each priority
level
Actually, two sets, active and expired
Priorities 0-99 for real-time processes
Priorities 100-139 for normal processes;
value set via nice() system call
Runqueue for O(1) Scheduler
[Figure: the runqueue holds two priority arrays, active and expired, each made up of one priority queue per level. Queues at the top are higher priority (more I/O, 800 ms quanta); queues at the bottom are lower priority (more CPU, 10 ms quanta).]
Scheduler Runqueue
A scheduler runqueue is a list of tasks that are
runnable on a particular CPU.
A rq structure maintains a linked list of those
tasks.
The runqueues are maintained as an array
runqueues, indexed by the CPU number.
The rq keeps a reference to its idle task
The idle task for a CPU is never on the scheduler
runqueue for that CPU (it's always the last choice)
Access to a runqueue is serialized by
acquiring and releasing rq->lock
Basic Scheduling Algorithm
Find the highest-priority queue with a
runnable process
Find the first process on that queue
Calculate its quantum size
Let it run
When its time is up, put it on the expired list
Repeat
The Highest Priority Process
There is a bit map indicating which queues
have processes that are ready to run
Find the first bit that’s set:
140 queues fit in 5 integers (5 × 32 = 160 bits)
Only a few compares to find the first that is non-zero
Hardware instruction to find the first 1-bit
bsfl on Intel
Time depends on the number of priority levels,
not the number of processes
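The bitmap lookup can be sketched as follows. This is a minimal illustration, not the kernel's code: it assumes 32-bit words (`uint32_t`), and the helper names `prio_set`, `first_bit`, and `find_first_prio` are hypothetical. The portable loop in `first_bit` stands in for a hardware find-first-set instruction such as bsfl.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PRIO     140   /* priority levels, as on the slide */
#define BITMAP_WORDS 5     /* 5 x 32 bits = 160 >= 140 (assumes 32-bit words) */

/* Mark priority level prio as having at least one runnable task. */
void prio_set(uint32_t bitmap[BITMAP_WORDS], int prio)
{
    assert(prio >= 0 && prio < NUM_PRIO);
    bitmap[prio / 32] |= (uint32_t)1 << (prio % 32);
}

/* Portable stand-in for a find-first-set instruction:
 * index of the lowest set bit of a non-zero word. */
int first_bit(uint32_t word)
{
    int i = 0;

    assert(word != 0);
    while ((word & 1u) == 0) {
        word >>= 1;
        i++;
    }
    return i;
}

/* Return the lowest-numbered (highest-priority) non-empty level,
 * or NUM_PRIO if no bit is set. At most BITMAP_WORDS compares,
 * regardless of how many tasks are runnable. */
int find_first_prio(const uint32_t bitmap[BITMAP_WORDS])
{
    for (int w = 0; w < BITMAP_WORDS; w++)
        if (bitmap[w] != 0)
            return w * 32 + first_bit(bitmap[w]);
    return NUM_PRIO;
}
```

The cost is bounded by the number of words in the bitmap, which is why the search does not depend on the number of processes.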
Scheduling Components
Static Priority
Sleep Average
Bonus
Interactivity Status
Dynamic Priority
Static Priority
Each task has a static priority that is set based
upon the nice value specified by the task.
static_prio in task_struct
The nice value is in a range of 0 to 39 (the user-visible
range of -20 to 19, shifted up by 20), with the default
value being 20. Only privileged tasks can set the nice
value below 20.
For normal tasks, the static priority is 100 + the
nice value.
Each task has a dynamic priority that is set
based upon a number of factors
Sleep Average
Interactivity heuristic: sleep ratio
Mostly sleeping: I/O bound
Mostly running: CPU bound
Sleep ratio approximation
sleep_avg in the task_struct
Range: 0 .. MAX_SLEEP_AVG (1 second)
When process wakes up (is made runnable),
recalc_task_prio adds in how many ticks it was sleeping
(blocked), up to some maximum value
(MAX_SLEEP_AVG)
When process is switched out, schedule subtracts the
number of ticks that a task actually ran (without
blocking)
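The two sleep_avg updates can be sketched like this. It is a simplified model, not the kernel's recalc_task_prio or schedule: the function names are hypothetical, and the cap `MAX_SLEEP_AVG_TICKS` is an assumed value chosen only for illustration.

```c
#include <assert.h>

#define MAX_SLEEP_AVG_TICKS 1000u   /* assumed cap, in ticks (illustrative) */

/* On wakeup: credit the ticks spent sleeping, saturating at the cap. */
unsigned int sleep_avg_on_wakeup(unsigned int sleep_avg,
                                 unsigned int slept_ticks)
{
    assert(sleep_avg <= MAX_SLEEP_AVG_TICKS);
    if (slept_ticks >= MAX_SLEEP_AVG_TICKS - sleep_avg)
        return MAX_SLEEP_AVG_TICKS;
    return sleep_avg + slept_ticks;
}

/* On switch-out: debit the ticks actually run, never going below zero. */
unsigned int sleep_avg_on_switch_out(unsigned int sleep_avg,
                                     unsigned int ran_ticks)
{
    return ran_ticks >= sleep_avg ? 0 : sleep_avg - ran_ticks;
}
```

A task that mostly sleeps drifts toward the cap, while a task that mostly runs drifts toward zero, which is what makes sleep_avg usable as a sleep-ratio approximation.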
Bonus and Dynamic Priority
/* We scale the actual sleep average
 * [0 .... MAX_SLEEP_AVG] into the
 * -5 ... 0 ... +5 bonus/penalty range.
 */
DP = SP − bonus + 5, where bonus ranges from 0 to 10
(DP: dynamic priority, SP: static priority)
DP = min(139, max(100, DP))
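The dynamic priority formula can be written out as a small function. This is a sketch under the assumption that the bonus is an integer from 0 to 10; the function name `dynamic_prio` is hypothetical.

```c
#include <assert.h>

/* DP = SP - bonus + 5, clamped to the normal range [100, 139].
 * A bonus of 5 is neutral; 10 is the best bonus, 0 the worst penalty. */
int dynamic_prio(int static_prio, int bonus)
{
    assert(static_prio >= 100 && static_prio <= 139);
    assert(bonus >= 0 && bonus <= 10);

    int dp = static_prio - bonus + 5;
    if (dp < 100)
        dp = 100;
    if (dp > 139)
        dp = 139;
    return dp;
}
```

For a default task (SP = 120), a full bonus of 10 gives DP = 115 and a bonus of 0 gives DP = 125. The clamp keeps a normal task from ever entering the real-time range.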
Calculating Time Slices
time_slice in the task_struct
Calculate Quantum (in ms) where
if (SP < 120): Quantum = (140 − SP) × 20
if (SP >= 120): Quantum = (140 − SP) × 5
where SP is the static priority
Higher priority processes get longer quanta
Basic idea: important processes should run longer
As we will see, other mechanisms are used for quick
interactive response
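The quantum rule above translates directly into code. This is an illustrative sketch; the function name `time_slice_ms` is hypothetical, and the result is taken to be in milliseconds.

```c
#include <assert.h>

/* Quantum in milliseconds as a function of static priority (100..139). */
int time_slice_ms(int static_prio)
{
    assert(static_prio >= 100 && static_prio <= 139);

    if (static_prio < 120)
        return (140 - static_prio) * 20;
    return (140 - static_prio) * 5;
}
```

Note the discontinuity at SP = 120: a task at 119 gets 420 ms, while a task at 120 gets 100 ms.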
Typical Quanta
Priority   Static Pri   Niceness   Quantum
Low        130          10         50 ms
Lowest     139          19         5 ms
Interactive Processes
A process is considered interactive if
bonus − 5 >= (Static Priority / 4) − 28
Low-priority processes have a hard time becoming
interactive:
A high static priority process (100) becomes interactive when its
average sleep time is greater than 200 ms
A default static priority process (120) becomes interactive when
its average sleep time is greater than 700 ms
Lowest priority (139) can never become interactive
The higher the bonus the task is getting and the
higher its static priority, the more likely it is to be
considered interactive.
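The interactivity test can be checked with a few lines of code. This sketch assumes integer division for Static Priority / 4 and a bonus from 0 to 10; the function name `is_interactive` is hypothetical.

```c
#include <assert.h>

/* Returns 1 if bonus - 5 >= static_prio / 4 - 28, else 0.
 * The division is integer division, so 139 / 4 evaluates to 34. */
int is_interactive(int static_prio, int bonus)
{
    assert(static_prio >= 100 && static_prio <= 139);
    assert(bonus >= 0 && bonus <= 10);

    return bonus - 5 >= static_prio / 4 - 28;
}
```

With SP = 100 the threshold is a bonus of 2, with SP = 120 it is a bonus of 7, and with SP = 139 it would be 11, which is above the maximum bonus, so the test can never succeed.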
Using Quanta
At every time tick (in scheduler_tick) , decrement the quantum of
the current running process (time_slice)
If the time goes to zero, the process is done
Check interactive status:
If non-interactive, put it aside on the expired list
If interactive, put it at the end of the active list
Exceptions: don’t put on active list if:
A higher-priority process is on the expired list
An expired task has been waiting more than STARVATION_LIMIT
If there’s nothing else at that priority, it will run again immediately
Of course, by running so much, its bonus will go down, and so
will its priority and its interactive status
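The decision made when a quantum runs out can be sketched as a pure function. This is a simplified model of the rules above, not the kernel's scheduler_tick: the struct, its fields, and the function name are hypothetical, and the three conditions are passed in as precomputed flags.

```c
#include <assert.h>
#include <stddef.h>

enum requeue { REQUEUE_ACTIVE, REQUEUE_EXPIRED };

/* Hypothetical summary of the state inspected at quantum expiry. */
struct expiry_info {
    int interactive;             /* task passed the interactivity test */
    int higher_prio_on_expired;  /* a higher-priority task waits on expired */
    int expired_starving;        /* an expired task waited > STARVATION_LIMIT */
};

/* Decide which array a task goes to once its time_slice reaches zero. */
enum requeue on_quantum_expired(const struct expiry_info *info)
{
    assert(info != NULL);

    if (!info->interactive)
        return REQUEUE_EXPIRED;
    /* Exceptions: even an interactive task must step aside. */
    if (info->higher_prio_on_expired || info->expired_starving)
        return REQUEUE_EXPIRED;
    return REQUEUE_ACTIVE;
}
```

Only an interactive task with neither exception in force goes back to the end of the active list.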
Avoiding Starvation
The system only runs processes from active
queues, and puts them on expired queues when
they use up their quanta
When a priority level of the active queue is empty,
the scheduler looks for the next-highest priority
queue
After running all of the active queues, the active and
expired queues are swapped
There are pointers to the current arrays; at the end
of a cycle, the pointers are switched
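The active/expired cycle can be simulated with a toy runqueue. This is a sketch under simplifying assumptions: each priority level keeps only a task count instead of a linked list, levels are scanned linearly rather than through the bitmap, and every task is treated as non-interactive. All `toy_` names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NUM_PRIO 140

struct toy_array {
    unsigned int nr_active;          /* tasks in this array */
    unsigned int count[NUM_PRIO];    /* tasks per priority level */
};

struct toy_rq {
    struct toy_array *active, *expired;
    struct toy_array arrays[2];
};

void toy_rq_init(struct toy_rq *rq)
{
    memset(rq, 0, sizeof(*rq));
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

void toy_enqueue(struct toy_array *a, int prio)
{
    assert(prio >= 0 && prio < NUM_PRIO);
    a->count[prio]++;
    a->nr_active++;
}

/* Run one quantum: pick the best non-empty level in the active array and
 * move that task to the expired array. Swaps the arrays first if the
 * active one is drained. Returns the priority that ran, or -1 if idle. */
int toy_run_one(struct toy_rq *rq)
{
    if (rq->active->nr_active == 0) {
        struct toy_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    for (int prio = 0; prio < NUM_PRIO; prio++) {
        if (rq->active->count[prio] > 0) {
            rq->active->count[prio]--;
            rq->active->nr_active--;
            toy_enqueue(rq->expired, prio);
            return prio;
        }
    }
    return -1;   /* nothing runnable in either array */
}
```

With one task at priority 110 and one at 130, successive calls return 110, 130, 110, 130: the lower-priority task gets its turn in every cycle, even though the higher-priority task is always runnable.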
The Priority Arrays
struct prio_array {
    unsigned int nr_active;        /* number of tasks in this array */
    unsigned long bitmap[5];       /* one bit per priority level */
    struct list_head queue[140];   /* one list per priority level */
};
struct rq {
    spinlock_t lock;
    unsigned long nr_running;
    struct prio_array *active, *expired;
    struct prio_array arrays[2];
    struct task_struct *curr, *idle;
    …
};
Swapping Arrays
struct prio_array *array = rq->active;
if (array->nr_active == 0) {
    /* active array is drained: swap the two pointers */
    rq->active = rq->expired;
    rq->expired = array;
}
Why Two Arrays?
Why is it done this way?
It avoids the need for traditional aging
Why is aging bad?
It’s O(n) at each clock tick
The Traditional Algorithm
for (pp = proc; pp < proc+NPROC; pp++) {
    if (pp->prio != MAX)
        pp->prio++;
    if (pp->prio > curproc->prio)
        reschedule();
}
Every process is examined, quite frequently (This code
is taken almost verbatim from 6th Edition Unix, circa
1976.)
Linux is More Efficient
Processes are touched only when they start
or stop running
That’s when we recalculate priorities,
bonuses, quanta, and interactive status
There are no loops over all processes or
even over all runnable processes
Real-Time Scheduling
Linux has soft real-time scheduling
No hard real-time guarantees
All real-time processes are higher priority than any
conventional processes
Processes with priorities [0, 99] are real-time
saved in rt_priority in the task_struct
scheduling priority of a real time task is: 99 - rt_priority
Process can be converted to real-time via
sched_setscheduler system call
Real-Time Policies
First-in, first-out: SCHED_FIFO
Static priority
Process is only preempted for a higher-priority process
No time quanta; it runs until it blocks or yields voluntarily
RR within same priority level
Round-robin: SCHED_RR
As above, but with a time quantum (800 ms)
Normal processes have SCHED_OTHER
scheduling policy
Multiprocessor Scheduling
Each processor has a separate run queue
Each processor only selects processes from its own
queue to run
Yes, it’s possible for one processor to be idle while
others have jobs waiting in their run queues
Periodically, the queues are rebalanced: if one
processor’s run queue is too long, some processes
are moved from it to another processor’s queue
Locking Runqueues
To rebalance, the kernel sometimes needs to move
processes from one runqueue to another
This is actually done by special kernel threads
Naturally, the runqueue must be locked before this
happens
The kernel always locks runqueues in order of
increasing indexes
Why? Deadlock prevention!
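Ordered locking of two runqueues can be sketched as follows. This is a user-space illustration, not kernel code: a C11 `atomic_flag` spinlock stands in for rq->lock, and `struct locked_rq`, `toy_lock`, `toy_unlock`, `double_lock`, and `double_unlock` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct locked_rq {
    int cpu;            /* index into the runqueues array */
    atomic_flag lock;   /* toy spinlock standing in for rq->lock */
};

static void toy_lock(struct locked_rq *rq)
{
    while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
        ;   /* spin until the holder releases the lock */
}

static void toy_unlock(struct locked_rq *rq)
{
    atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Lock two runqueues, always taking the lower CPU index first. Because
 * every caller uses the same order, two CPUs can never each hold one
 * lock while waiting for the other. */
void double_lock(struct locked_rq *a, struct locked_rq *b)
{
    assert(a != NULL && b != NULL);
    if (a == b) {
        toy_lock(a);
    } else if (a->cpu < b->cpu) {
        toy_lock(a);
        toy_lock(b);
    } else {
        toy_lock(b);
        toy_lock(a);
    }
}

void double_unlock(struct locked_rq *a, struct locked_rq *b)
{
    toy_unlock(a);
    if (a != b)
        toy_unlock(b);
}
```

Calling double_lock(&rq1, &rq0) and double_lock(&rq0, &rq1) both acquire rq0 first, so the argument order chosen by the caller does not matter.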
Processor Affinity
Each process has a bitmask saying what
CPUs it can run on
Normally, of course, all CPUs are listed
Processes can change the mask
The mask is inherited by child processes
(and threads), thus tending to keep them on
the same CPU
Rebalancing does not override affinity
Load Balancing
To keep all CPUs busy, load balancing
pulls tasks from busy runqueues to idle
runqueues.
If schedule finds that a runqueue has no
runnable tasks (other than the idle task), it
calls load_balance
load_balance also called via timer
scheduler_tick calls rebalance_tick
Every tick when system is idle
Every 100 ms otherwise
Load Balancing
load_balance looks for the busiest runqueue
(most runnable tasks) and takes a task that is
(in order of preference):
inactive (likely to be cache cold)
high priority
load_balance skips tasks that are:
likely to be cache warm (ran within the last
cache_decay_ticks)
currently running on a CPU
not allowed to run on the current CPU (as
indicated by the cpus_allowed bitmask in the
task_struct)
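The three skip rules can be combined into one migration check. This is a simplified sketch, not the kernel's implementation: `struct toy_task`, its fields, and `can_migrate` are hypothetical, and "cache warm" is modeled as having run within the last cache_decay_ticks.

```c
#include <assert.h>

/* Hypothetical, trimmed-down view of the fields the check needs. */
struct toy_task {
    unsigned long cpus_allowed;   /* bit n set => may run on CPU n */
    int running;                  /* currently executing on some CPU */
    unsigned long last_ran;       /* tick at which the task last ran */
};

/* Returns 1 if the task may be pulled to dest_cpu, 0 if it is skipped. */
int can_migrate(const struct toy_task *t, int dest_cpu,
                unsigned long now, unsigned long cache_decay_ticks)
{
    assert(dest_cpu >= 0 && dest_cpu < (int)(8 * sizeof(unsigned long)));

    if (t->running)
        return 0;   /* currently running on a CPU */
    if (!(t->cpus_allowed & (1UL << dest_cpu)))
        return 0;   /* affinity mask excludes the destination CPU */
    if (now - t->last_ran < cache_decay_ticks)
        return 0;   /* ran recently, so likely cache warm */
    return 1;
}
```

A task that fails any one of the three tests stays on its current runqueue.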
Optimizations
If next is a kernel thread, borrow the MM
mappings from prev
User-level MMs are unused.
Kernel-level MMs are the same for all kernel
threads
If prev == next
Don’t context switch
Sleep Time and Bonus
Average Sleep Time (ms)   Bonus   Time Slice Granularity
0 to 100                  0       5120
100 to 200                1       2560
200 to 300                2       1280
300 to 400                3       640
400 to 500                4       320
500 to 600                5       160
600 to 700                6       80
700 to 800                7       40
800 to 900                8       20
900 to 999                9       10
1 second                  10      10