L04 Parallel Systems Synchronization Communication Scheduling

The document discusses the shared memory machine model and describes three architectures: 1) Dance hall architecture with CPUs on one side and memory on the other connected by an interconnection network. Each CPU has a cache. 2) Symmetric multiprocessor (SMP) architecture with CPUs and memory connected by a simple system bus. Each CPU has a cache. 3) Distributed shared memory (DSM) architecture where each CPU has a local memory and they can access all memories through the interconnection network, though nonlocal accesses are slower.


149 - Shared Memory Machine Model

In this day and age, parallelism has become fundamental to computer systems. Any general purpose CPU chip has multiple cores in it, and every general purpose operating system is designed to take advantage of such hardware parallelism.
In this unit, we will study the basic algorithms for synchronization, communication, and scheduling in a shared memory multiprocessor, leading up to the structure of the operating system for parallel machines.
We'll start today's lecture with a discussion of the model of a parallel machine. For a shared memory multiprocessor, or shared memory machine, we can think of three different structures.
• The first structure is the dance hall architecture, in which you have CPUs on one side and the memory on the other side of an interconnection network. Let me first say something that is common to all three structures that I'm going to describe. Every one of these structures has CPUs, memory, and an interconnection network, and the key thing is that each is a shared memory machine: the entire address space defined by the memories is accessible from any of the CPUs. In addition, in the dance hall architecture, you'll see that there is a cache associated with each of the CPUs.
• The next style is what is called an SMP architecture, i.e. a symmetric multiprocessor. Here, the interconnection network from the dance hall architecture is simplified considerably to a single system bus that connects all the CPUs to the main memory. It is symmetric because the access time from any of the CPUs to memory is the same. As before, every CPU comes equipped with a cache.

• The third style of architecture is what is called a distributed shared memory architecture, or DSM. In DSM, a piece of memory is associated with each CPU. Still, each CPU is able to access all of the memories through the interconnection network. It is just that access to the memory that is close to a CPU is going to be faster than access to memory that is farther away, which has to go through the interconnection network.
150 - Shared Memory and Caches

Now let's start discussing shared memory and private caches. To simplify the discussion, I will use the simplest form of the shared memory machine, an SMP, where a single system bus connects all the processors to the main memory.
• The cache that is associated with a CPU serves exactly the same purpose in a multiprocessor as it does in a uniprocessor. When the CPU wants to access some memory location, it will first go to the cache and see if it is there. If it is there, life is good: it can get it from the cache. If it is not in the cache, then it has to go to main memory, fetch it, and put it into the cache so it can reuse it later. That is the purpose the cache serves in a uniprocessor.
• In a multiprocessor, the cache performs exactly the same function, caching data or instructions that are pulled in from memory so that the CPU can reuse them later.
In a multiprocessor, what happens when there are private caches associated with each of the CPUs while the memory itself is shared across all of the processors? Let me explain.
• Let's say that there's a memory location y that is currently in the private caches of all the processors. Maybe y is a hot memory location, so all the processors happened to fetch it, and therefore it is sitting in every private cache.
• Now suppose processor P1 decides to write to this memory location, changing y to y'.
• What should happen to the value of y in all the other CPUs' caches?
Clearly, in a multi-threaded program running on a multiprocessor, there could be a data structure that is shared by all the processors. Therefore if one processor writes to a particular memory location, it is possible that that same memory location is sitting in the private caches of its peers. This is referred to as the cache coherence problem.
We need to ensure that if another processor (e.g. P2 or P3) wants to access y at a later point in time, it gets y', not y.

Now who should ensure this consistency? Here again there's a partnership between hardware and software. In other words, the hardware and software have to agree on what is called the memory consistency model. The memory consistency model is a contract between hardware and software as to what behavior a programmer can expect when writing a multi-threaded application running on this multiprocessor.
An analogy that you may be familiar with is the contract between hardware and software on a uniprocessor. A compiler writer knows about the instruction set provided by the CPU. The architect goes and builds the CPU with no idea how the instruction set is going to be used, but there is an expectation that the semantics of that instruction set will be adhered to in the implementation of the processor, so that the compiler writer can use the instruction set to compile high-level language programs.
Similarly, when you're writing a multithreaded application, you need a contract between the software and the
hardware, as to the behavior of the hardware when processors are accessing the same memory location. That is
what is called the memory consistency model.
151 - Processes Question

152 - Processes Solution

Let's talk through these different choices to see which outcomes are possible, given this set of instructions and the fact that processes P1 and P2 are executing independently on two different processors, with no way of knowing the order in which their shared memory accesses happen.

The first option is possible if P2 completes execution before P1 starts. In that case, d and c get the old values of b and a, namely zero.
The second option is also possible, if P1 completes execution before P2.
The third option is also possible, if the execution order is: a = a + 1; d = b; c = a; b = b + 1.
Could the fourth option happen?
• If we end up with d = 1, that means P1 executed b = b + 1 before P2 executed d = b.
• But then how do we get c = 0, i.e. P2 seeing only the old value of a when it executes c = a?
• How can this happen? It can happen if messages go out of order.
• Remember that there is an interconnection network connecting all the CPUs in the SMP. A write operation that happens on one processor has to go through the interconnection network to reach another processor.
• If messages go out of order, it is possible that when P2 executes d = b, the new value of b from P1 has arrived, but when P2 executes c = a, the new value of a from P1 hasn't arrived yet.
• Do you want this to happen? Intuitively, this is not something you expect. As a programmer, you don't want surprises, and if a result is non-intuitive, perhaps it should not be allowed by the model.
So when we talk about the memory consistency model, we're specifying the contract between the programmer and the system. What we are seeing through this example is that this particular outcome is counter-intuitive, and therefore the model should not allow it to occur in any execution. This is the reason why you have a memory consistency model.
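To make the scenario concrete, here is a minimal C sketch of the two processes, assuming (as in the quiz) that a and b start at 0, P1 executes a = a + 1; b = b + 1, and P2 executes d = b; c = a. The variable names and initial values are taken from the example. Note that plain C variables shared this way are technically a data race, so this is only an illustration of the interleavings, not portable code.

    /* Sketch of the quiz scenario (assumed: a = b = 0 initially,
     * P1 runs "a = a + 1; b = b + 1", P2 runs "d = b; c = a").
     * The printed (d, c) pair depends on the memory model. */
    #include <pthread.h>
    #include <stdio.h>

    int a = 0, b = 0;   /* shared, written by P1 */
    int d, c;           /* written by P2 */

    void *p1(void *arg) { a = a + 1; b = b + 1; return NULL; }
    void *p2(void *arg) { d = b;     c = a;     return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Under sequential consistency (d, c) can be (0,0), (1,1), or (0,1),
         * but never (1,0). */
        printf("d = %d, c = %d\n", d, c);
        return 0;
    }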
153 - Memory Consistency Model

So here I'm showing you a set of memory accesses on P1: r(a), w(b), w(c) and r(b). On P2 we have w(a), r(b) and w(c).
• P1 and P2 access memory completely independently of each other.
• Therefore it is possible that in one execution, P2's w(a) happens after P1's r(a).
• If you run the same program again, it's possible that P2's w(a) happens before P1's r(a).
• It's perfectly feasible for this to happen because there is no guarantee on the ordering of these memory accesses across processors.

Both execution orders above are reasonable and correct, and the programmer can live with them. What the programmer needs to know is what to expect from the system in terms of the behavior of shared memory reads and writes emanating from several different processors. That is the memory consistency model. As a programmer, you don't want any surprises, and the purpose of the memory consistency model is to satisfy the programmer's expectations.
So I'm going to talk to you about one particular memory consistency model, which is called a sequential
consistency memory model.
• Consider the memory accesses from P1. One expectation that you have as the programmer is that the accesses on a particular processor are performed in the exact order in which you wrote the code. If you look at this sequence of accesses, w(b) comes before r(b), so you can expect the result of r(b) to be whatever was put into memory by w(b). That's what's called program order. You expect program order to be maintained, namely that the order in which you generated the memory accesses is preserved by the execution on that processor.
• In addition to that, there is the interleaving of memory accesses between P1 and P2. This is where we said we have no way of controlling the order in which these accesses are going to be satisfied by the memory. The interleaving of accesses from P1 and P2 can be arbitrary.
So this is the sequential consistency memory model, which has two parts to it.
• One: within each process, program order is respected; memory accesses are satisfied in the textual order in which they are generated.
• Two: the interleaving of memory accesses across all of the processors is arbitrary.
An analogy for sequential consistency is what you might see in a casino. If you watch a card shark shuffle cards, he might take a deck, split it into two halves, and then do a merge shuffle of the two splits to create a complete deck. That is exactly what's going on with sequential consistency: you have streams of memory accesses on several different processors, and they get interleaved in some fashion, just like the shuffler interleaves cards from the two halves into one deck.
By the way, this particular memory consistency model, sequential consistency, was proposed by Leslie Lamport, and he is a popular guy; you're going to see him again later on when we talk about distributed systems. He came up with the idea of the sequentially consistent memory model back in 1977, and since then a lot of different consistency models have been proposed.
In future lessons on distributed systems, we will see other forms of memory consistency models such as release
consistency and lazy release consistency and eventual consistency. But hold on. We will come back to that later
on.
154 - Memory Consistency and Cache Coherence

So now having seen the sequential memory consistency model, what we can do is go back to our original example,
and ask the question, what are all the possible outcomes for this particular set of memory accesses performed on
P1 and P2? Now what possible values can d and c get?

• Well obviously, you can get the first choice, no problem with that.
• We can get the second choice.
• And we can also get the third choice; as we illustrated earlier, all of these are just interleavings of the memory accesses on P1 and P2.
• But the fourth one is not possible with sequential consistency, because there is no interleaving of these memory accesses that will result in that particular outcome without breaking program order.
That's comforting: it's exactly what we thought would be useful in a memory consistency model, one that gives only intuitive results and makes sure that non-intuitive results don't happen. The memory consistency model is what the application programmer needs to be aware of in order to develop code and know that it will execute correctly on the shared memory machine.

As operating system designers, however, we need to help make sure that this code runs quickly. To do that,
we need to understand how to implement the model efficiently and also the relationship between hardware and
software that makes it possible to achieve this goal.
Cache coherence is how the system implements the model in the presence of private caches. So this is a handshake, a partnership between hardware and software, between the application programmer and the system, to make sure that the consistency model is actually implemented correctly by the cache coherence mechanism built into the system.
The system implementation of cache coherence is itself a hardware-software trade-off. One possibility is that the hardware only provides a shared address space and gives you no way of keeping the caches coherent, leaving it to the system software to ensure that this contract is satisfied. That is the case where cache coherence is implemented in software by the system; this is called a non-cache-coherent shared address multiprocessor (NCC SMP).
The other possibility of course is that the hardware does everything. It provides the shared address space, but
it also maintains cache coherence in hardware. And that's what is called a cache coherent multi-processor, or a
CC multi-processor.
155 - Hardware Cache Coherence

Now let's focus on the hardware implementing cache coherence entirely in addition to giving the shared address
space. There are two possibilities if the hardware is going to maintain the cache coherence.

One possibility is what is called the write-invalidate (CC-WI) scheme.


• The idea is this: a particular memory location y is contained in all the caches; all these processors have fetched it, and it's sitting in the private caches of all the CPUs.

• Now suppose CPU P1 decides to write to this memory location and change it from y to y'.

• When that happens, the hardware is going to ensure that the copies in all of the other CPUs' caches are invalidated.

• The way it's done is that, as soon as this change happens, the hardware broadcasts a signal on the bus, called inv(y), and that signal propagates on the system bus.

• All these caches are observing the bus, which is why they are sometimes referred to as snooping caches. In a lighter vein, the caches are snooping on the bus to see if there's any change to memory locations that are cached locally.

• In this case, if an invalidation signal inv(y) goes out, then each of these caches asks: do I have that particular memory location? If I do, let me invalidate it; if I don't, ignore the signal.

• That's what is called the write-invalidate cache coherence scheme.


You may already be one step ahead of me and thinking: what would be an alternative to this invalidation scheme? There is a write-update scheme.

• The idea here is that one processor is again going to write to this particular memory location and modify y to y'.

• What we do, instead of sending an invalidation on the bus, is send out an update(y) signal on the bus, saying: I modified this particular memory location, and here is the new value.

• If the other caches happen to have the same memory location, they all modify it from y to y'.

• Now all of these caches have the new value y' and the old values disappear from the system.

• So in this case, what we are doing is: if you have it, update it. Once again, the caches are snooping on the bus, and if they see an update for a particular memory location, they say "well, let me modify it so that future accesses by my CPU will get the most recent value that was written into this particular cache line." That's the idea behind the write-update scheme.
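To make the two snooping schemes concrete, here is a small, purely conceptual C sketch that simulates a set of caches reacting to a write on the bus. This is not how real coherence hardware is built (real protocols track per-line states); the structure and names are invented only to illustrate the difference between invalidating and updating.

    #include <stdio.h>
    #include <stdbool.h>

    #define NCPU 4

    /* One cached copy of location y per CPU (conceptual, not real hardware). */
    struct cache_line { bool valid; int value; };
    static struct cache_line cache[NCPU];

    /* Write-invalidate: the writer updates its own copy; every other snooping
     * cache that holds y simply marks its copy invalid. */
    static void write_invalidate(int writer, int new_value) {
        cache[writer] = (struct cache_line){ .valid = true, .value = new_value };
        for (int cpu = 0; cpu < NCPU; cpu++)
            if (cpu != writer && cache[cpu].valid)
                cache[cpu].valid = false;          /* snooped inv(y) */
    }

    /* Write-update: the writer broadcasts the new value; every other snooping
     * cache that holds y overwrites its copy with the new value. */
    static void write_update(int writer, int new_value) {
        cache[writer] = (struct cache_line){ .valid = true, .value = new_value };
        for (int cpu = 0; cpu < NCPU; cpu++)
            if (cpu != writer && cache[cpu].valid)
                cache[cpu].value = new_value;      /* snooped update(y) */
    }

    int main(void) {
        for (int cpu = 0; cpu < NCPU; cpu++)       /* everyone starts with y */
            cache[cpu] = (struct cache_line){ .valid = true, .value = 10 };
        write_invalidate(0, 11);                   /* or: write_update(0, 11); */
        for (int cpu = 0; cpu < NCPU; cpu++)
            printf("cpu %d: valid=%d value=%d\n", cpu, cache[cpu].valid, cache[cpu].value);
        return 0;
    }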
Now whether we're talking about the write-update scheme or the earlier write-invalidate scheme, one thing should become very clear in your mind: there is work to be done whenever you change a memory location that could conceivably be cached in the private caches of the other CPUs.
• The invalidate scheme sends out an invalidate message.
• The update scheme sends out an update message.
These bus transactions are overhead, and as system designers, one of the things we've been emphasizing all along is that we want to keep overhead to a minimum.
But you can also see immediately that this overhead grows as you increase the number of processors and as you change the interconnection network from a simple bus to a more exotic network. It also depends on the amount of sharing that is happening for a particular memory location.
156 - Scalability

Now as a programmer, you have a certain expectation as you add more processors to the system: naturally, if you add more processors, your performance should go up. Such an expectation is called scalability, the idea that the performance of a parallel machine will scale up as you increase the number of processors.

• It is reasonable to expect that. However, I mentioned just now that there is overhead associated with increasing the number of processors, in terms of maintaining cache coherence for shared data. The pro of adding more processors is that you can exploit parallelism; that's the reason you can expect increased performance with more processors.
• But unfortunately, as you increase the number of processors, there is increased overhead. In an 8-processor SMP, the overhead for cache coherence is less than in a 16-processor, 32-processor, or 64-processor SMP.
As a result, you have the pro of exploiting parallelism and the con of increased overhead, and you end up with actual performance that is somewhere between your expectation and the overhead. In some sense, the difference between your expectation and the overhead you're paying becomes the actual delivered performance of the parallel machine. This is very important to remember: your delivered performance may not be linear in the number of processors added to the system.
So what should we do to get good performance?
Share memory across threads as little as possible. If you want good performance from a parallel machine, a quote attributed to the famous computer scientist Chuck Thacker comes to mind: "shared memory machines scale well when you don't share memory."
Of course, as operating system designers, we have no control over what the application programmer does. All we can do is ensure that the use of shared data structures is kept to a minimum in the implementation of the operating system core itself. You will see how relevant Chuck Thacker's quote is as we visit operating system synchronization, communication, and scheduling algorithms, and more generally the structure of the operating system itself, in this lesson. Remind yourself of this quote and notice how often it permeates our discussion.
157 - Lesson Summary

In this lesson, we're going to start by talking about the synchronization algorithms that go into the guts of any parallel operating system supporting multi-threaded applications. As we discuss the synchronization algorithms, watch out for Thacker's quote from the previous lesson about sharing in shared memory multiprocessors; it is going to be the key to understanding the scalability of the synchronization algorithms.
Synchronization primitives are key for parallel programming. In the real-life metaphor you know, a lock is something that protects something precious. In the context of parallel programming, if you have multiple threads executing and they share some data structure, it is important that the threads don't mess up each other's work. A lock is what allows a thread to make sure that when it is accessing a particular piece of shared data, it is not interfering with, or being interfered with by, some other thread. So the idea is that a thread acquires a lock, and once it has the lock, it knows that it can safely access the data it shares with other threads.
I'm showing only two threads here, but in a multi-threaded program you can potentially have many more threads sharing a data structure. Once T1 knows that it has access to the data structure, it can do whatever it wants with it. When it is done, it releases the lock. That's the idea behind a lock.

Locks come in two flavors. One is what we'll call an exclusive lock, or a mutual exclusion lock, which can only be held by one thread at a time. Here's an example of two children playing: they have to take turns hitting the ball, and obviously you don't want both of them hitting the ball at the same time. That's not good for the game and not good for the safety of the children either. The same thing applies to the mutual exclusion lock that we use in parallel programming. When a thread wants to modify data, it has to make sure that nobody else is accessing that particular data structure.
You can also have a shared lock, which allows multiple threads to access the data at the same time. Under what conditions would that be meaningful? Here is an analogy again: if there is a newspaper, multiple people can perfectly well read it at the same time. The same sort of thing happens often in parallel programming. You have a database, and there are records in the database that multiple threads want to inspect, but they want to make sure that while they're inspecting them the data itself is not going to be changed. So a shared lock allows multiple readers to access some data with the assurance that nobody else is going to modify it.
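As a concrete illustration of the two flavors, here is a small sketch using the POSIX threads API (just one possible realization; the lecture does not prescribe a particular API). A pthread_mutex_t is an exclusive lock; a pthread_rwlock_t lets many readers in at once while giving writers exclusive access.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;    /* exclusive lock */
    static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER; /* shared/exclusive lock */
    static int shared_counter;   /* protected by m */
    static int shared_record;    /* protected by rw */

    void update_counter(void) {
        pthread_mutex_lock(&m);          /* only one thread at a time in here */
        shared_counter++;
        pthread_mutex_unlock(&m);
    }

    int read_record(void) {
        pthread_rwlock_rdlock(&rw);      /* many readers may hold this together */
        int v = shared_record;
        pthread_rwlock_unlock(&rw);
        return v;
    }

    void write_record(int v) {
        pthread_rwlock_wrlock(&rw);      /* a writer gets exclusive access */
        shared_record = v;
        pthread_rwlock_unlock(&rw);
    }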
158 - Synchronization Primitives

Another kind of synchronization primitive that is very popular in multithreaded parallel programs, and extremely useful in developing scientific applications, is barrier synchronization. The idea is that multiple threads are each doing some computation, and they want to know when everybody else has reached a certain point. They want assurance that everybody has reached a particular point in their respective computations, so that they can all go forward from this barrier to the next phase of the computation.

I'm sure you've gone to dinner with friends and had this experience: you have a party of four or five people, two or three of you show up at the restaurant, and the usher says, "Wait, are all the members of your party here? If not, wait until the others show up so that I can seat you all at the same time."
That's essentially what happens with barrier synchronization. It is possible that threads T1 and T2 arrive at the barrier, meaning they have completed their portion of the work, but the other threads are lagging behind and haven't arrived yet. Until everybody shows up, nobody can advance to the next phase of the computation.
So that's the idea behind barrier synchronization.
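For a concrete feel, here is a minimal sketch using the POSIX barrier API (again, one possible realization, not something the lecture prescribes): each worker computes its phase, waits at the barrier, and proceeds only once all N threads have arrived.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        long id = (long)arg;
        /* phase 1 of the computation would go here */
        printf("thread %ld reached the barrier\n", id);
        pthread_barrier_wait(&barrier);      /* block until all NTHREADS arrive */
        /* phase 2 begins only after everyone has finished phase 1 */
        printf("thread %ld past the barrier\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }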
We looked at two types of synchronization primitives: one is the lock, the other is barrier synchronization. Now that we understand the basic synchronization primitives needed for developing multithreaded applications on a shared memory machine, it's time to look at how to implement them. But before we do that, let's do a quiz to get you in the right frame of mind.
159 - Programmers Intent Question

Let's first discuss a little bit about the instruction set architecture of a processor. In the ISA of a processor, instructions are atomic by definition. In other words, reads and writes to memory, which are usually implemented as load and store instructions in the architecture, are atomic instructions. What that means is that during the execution of a load or a store, the processor cannot be interrupted. That's the definition of an atomic instruction: the processor is not going to be interrupted during its execution.
• Now the question that I'm going to ask you to think about involves a multi-threaded program.
• There is a process P1, which is modifying a data structure A.
• There is a process P2, which is waiting for P1's modification to be done; P2 wants to use the data structure after the modification is complete.
• It is very natural to encounter this kind of producer-consumer relationship: P1 is the producer of the data and P2 is the consumer.
• The consumer wants to make sure that the producer is done producing the data before it starts using it.
Now, given that the ISA provides only atomic read and write instructions, is it possible to achieve the programmer's intent embodied in this code snippet?
160 - Programmers Intent Solution

If you answered yes, then you and I are on the same wavelength. In the next few panels, I'm going to show you how this producer-consumer construct in a multithreaded program can be accomplished with the simple atomic read/write operations available in the instruction set of a processor.

• The solution, it turns out, is surprisingly simple. The idea is that between P1 and P2, I'm going to introduce a new shared variable called flag.
• I'll initialize flag = 0.
• The agreement between the producer and the consumer is that the producer first modifies the data structure it wants to modify.
• Once the producer P1 is done with the modification, it sets flag = 1. This is the signal to P2 that P1 is done with the modification.
• Now, what has P2 been doing? P2 is basically waiting: initially flag = 0, and P2 spins waiting for the flag to change from 0 to 1.
• As soon as flag = 1, P2 breaks out of this spin loop and is ready to use the data structure.
• Once P2 is done using the data structure, it can flip the flag back to 0 to indicate that it is done, so that next time the producer wants to modify the data structure, P1 can do so.
So that's sort of the solution. Now, let's see why it works by only using atomic reads and writes.
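A minimal C sketch of this handshake is shown below, with the same roles as above (P1 produces into the data structure, P2 spins on flag). On a real compiler and machine you would declare flag with C11 atomics, as done here, to keep the compiler and hardware from reordering the accesses; with plain variables this works only on the idealized sequentially consistent machine assumed in the lecture.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int data_A;                      /* the shared data structure */
    static atomic_int flag = 0;             /* 0: not ready, 1: ready */

    static void *producer(void *arg) {      /* P1 */
        data_A = 42;                        /* modify the data structure */
        atomic_store(&flag, 1);             /* signal: modification done */
        return NULL;
    }

    static void *consumer(void *arg) {      /* P2 */
        while (atomic_load(&flag) == 0)     /* spin until P1 signals */
            ;                               /* busy-wait */
        printf("consumer sees %d\n", data_A);
        atomic_store(&flag, 0);             /* done using it; P1 may produce again */
        return NULL;
    }

    int main(void) {
        pthread_t p1, p2;
        pthread_create(&p2, NULL, consumer, NULL);
        pthread_create(&p1, NULL, producer, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        return 0;
    }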
161 - Programmers Intent Explanation

Now the first thing to notice is that all of these are just read and write accesses; there's nothing special about them.

But there is a difference between the way the program uses the flag variable versus the data structure A. The flag variable is being used as a synchronization variable, and that's a secret that only P1 and P2 know about. Even though it looks like a simple integer variable, P1 and P2 know that this is how they signal each other: it is a synchronization variable. The data structure A, on the other hand, is normal data.
Both the synchronization variable and the normal data are accessed using the simple read and write operations available in the processor. That's how we get the solution for this particular question. It's comforting to know that atomic read and write operations are good enough for simple coordination among processes, as we illustrated here. In fact, when we look at certain implementations of barrier algorithms later on, you'll find that this is all that is needed from the architecture to implement some of them.
But now, how about a mutual-exclusion lock? Are atomic reads and writes sufficient to implement a
synchronization primitive like a mutual-exclusion lock? Let's investigate that.
162 - Atomic Operations

Let's look at a very simple implementation of a mutual exclusion lock.


• In terms of the instructions that the processor executes to get this lock, the first step is to check whether the lock is currently available: if (L == 0).
• If it is available, then we set it to one to indicate that I've got the lock and nobody else can get it. That's the idea behind the check followed by setting L to one.
• On the other hand, if somebody already has the lock, i.e. L == 1, then I'm going to wait here (while (L == 1)) until the lock is released.
• Once the lock is released, I can go back, double-check that the lock is available, and set it to one.
So this is the basic idea, a very simple implementation of the lock. The "double check" in the last step is to make sure nobody else could have gotten in in the middle.
Is it possible to implement this simple-minded lock using atomic reads and writes alone? Let's talk through the implementation.

• Now, look at the set of instructions that the processor has to execute to acquire the lock: it has to first read L from memory, then check if it is 0, and lastly store the new value of 1 to the memory location.
• That's a group of three instructions, and the key thing is that these three instructions have to be executed atomically to make sure that I get the lock and nobody else interferes with my getting it.
• Read and write instructions by themselves are atomic, but a group of reads and writes is not atomic.
• Therefore, what we have here is a group of three instructions that we need to be atomic.
• This means we cannot have just reads and writes as the only atomic operations if we want to implement this lock algorithm. We need a new semantic for an atomic instruction: the read_modify_write operation, meaning that I'm going to read from memory, modify the value, and write it back to memory, atomically.
• That's the kind of instruction that is needed to implement a lock algorithm.
Now several flavors of read-modify-write instructions have been proposed or implemented in processor architectures.
The first one is what is called a test-and-set instruction. The test-and-set instruction takes a memory location as an argument. It returns the current value in that memory location and also sets the memory location to 1. Getting the current value from memory and setting it to one are done atomically; that is the key. It tests the old value and sets the new value atomically.
Another atomic read_modify_write instruction that has been proposed and/or implemented is the fetch-and-increment instruction. It also takes a memory location as an argument. It fetches the old value that was in memory, and then increases the value in memory by 1 (or by whatever amount; it may take an extra argument to indicate how much to change it by).
As I said before, several flavors of read_modify_write instructions have been proposed in the literature. Generically, these read_modify_write instructions are called fetch_and_phi instructions, meaning they fetch an old value from memory, perform some operation on the fetched value, and write it back to memory, atomically.
• For instance, fetch-and-increment is one flavor of that.

• There are other flavors, like fetch-and-store, fetch-and-decrement, compare-and-swap and so on.
You can read about that in the papers that we've identified for you.
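If you want to experiment with these primitives without writing assembly, C11 exposes equivalents of them as atomic operations. The sketch below only illustrates the semantics described above (mapping test-and-set onto atomic_exchange and fetch-and-increment onto atomic_fetch_add); it is not the hardware instruction itself.

    #include <stdatomic.h>

    /* test-and-set: atomically return the old value of *L and set *L to 1. */
    static inline int test_and_set(atomic_int *L) {
        return atomic_exchange(L, 1);
    }

    /* fetch-and-increment: atomically return the old value of *L and add 1. */
    static inline int fetch_and_increment(atomic_int *L) {
        return atomic_fetch_add(L, 1);
    }

    /* fetch-and-phi generalization: here "phi" is addition by an arbitrary amount. */
    static inline int fetch_and_add(atomic_int *L, int amount) {
        return atomic_fetch_add(L, amount);
    }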
Okay, now that we have an atomic read-modify-write instruction available from the architecture, we can start looking at how to implement mutual exclusion lock algorithms. I gave you a very simple version, and we'll talk more about that in a minute.
I'm sure that in the first project, when you implemented a mutual exclusion lock, you did not care too much about the scalability of your locking implementation. But if you are implementing your mutual exclusion algorithm on a large-scale shared memory multiprocessor, say with thousands of processors, you'll be very worried about making sure that your synchronization algorithms scale. Scalability is fundamental to the implementation of synchronization algorithms.
163 - Scalability Issues With Synchronization

Now let's discuss some of the issues with scalability of the synchronization primitives in a shared memory
multiprocessor.
• We already saw that the lock (mutual exclusion or shared) is one type of synchronization operation.
• We also saw that barrier algorithms are another type of synchronization operation.

When you look at both of these primitives, the first source of inefficiency is latency. What do we mean by that? If a thread wants to acquire the lock, it has to go to memory, get the lock, and make sure that nobody else is competing with it. Latency is essentially the time spent by a thread in acquiring the lock. To be more precise: say the lock is currently not being used, how long does it take for me to go and get it? That's the key question latency captures.
The second source of inefficiency in synchronization is the waiting time: if I want to get the lock, how long do I have to wait before I get it? Clearly this is not something that you and I as OS designers have complete control over, because it really depends on what the threads are doing with the lock. For instance, if one thread acquires the lock and modifies the data for a long time before releasing it, another thread is going to wait a long time for the lock to be released. So the waiting time is really in the purview of the application, and there's not much you can do as an OS designer to reduce it.
The third source of un-scalability of locks is contention. When the current holder releases the lock, making it available to grab, there may be any number of threads waiting to acquire it, all contending for the lock. How long does it take, in the presence of contention, for one of them to become the winner and the others to go back to waiting? That's the contention part of implementing a synchronization primitive.
All of these things, latency, waiting time, and contention, even though I mentioned them in the context of a mutual exclusion lock, also appear when you're talking about barrier synchronization algorithms or shared locks. Latency and contention are the two things OS designers have to worry about in order to implement scalable versions of synchronization primitives.
164 - Naive Spinlock

Let's start our discussion with the world's simplest and most naive implementation of the lock. We call it a spinlock because a processor that is waiting for the lock has to spin, i.e. burn CPU cycles doing no useful work while waiting for the lock to be released.
The first one that we're going to look at is what is called spin on test-and-set (T+S).

• The idea is very simple and straightforward. There's a shared memory location L that can have one of two possible values: unlocked (0) or locked (1).
• At the beginning, we initialize L = 0 (unlocked), so nobody has the lock.
• The way to implement this naive spinlock algorithm is the following: you go in and check the memory location L using the test-and-set primitive.
• Test-and-set(L) returns the old value of L and sets it to the new value L = 1, atomically, as guaranteed by the ISA.
• Now, if we find that test-and-set(L) == 1, it means that somebody else is holding the lock. Therefore I cannot use it, and I'm going to spin in the while-loop. That's why it's called spin on test-and-set: you're spinning, waiting until test-and-set(L) == 0.
So let's put up some threads that are trying to get this lock.
• Say T1 is the first one to make a test-and-set() call on the lock and finds it unlocked, i.e. test-and-set() == 0.
• T1 jumps out of the while loop and goes off to work on the data structure.
• Test-and-set() also set L = 1, so when T2 and T3 come along and try test-and-set() on the lock, they will find that test-and-set() == 1. As a result, T2 and T3 are stuck in the while-loop.
• How long are they going to be stuck? Until T1 releases the lock. The way to do that is very simple: T1 calls an unlock function, which simply sets L = 0.
• Once T1 does that, the lock becomes available. When T2 and T3 try test-and-set() again, one of them will find test-and-set() == 0, i.e. the lock is unlocked, and will acquire the lock.
• Note that exactly one of T2 or T3 will win, because that is the semantics of test-and-set().
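A minimal sketch of this spin-on-test-and-set lock in C, using atomic_exchange as the test-and-set primitive as in the earlier helpers, might look as follows.

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = unlocked, 1 = locked */

    static void spin_lock(spinlock_t *L) {
        /* Keep doing test-and-set until we observe the old value 0. */
        while (atomic_exchange(L, 1) == 1)
            ;                               /* spin: somebody else holds the lock */
    }

    static void spin_unlock(spinlock_t *L) {
        atomic_store(L, 0);                 /* release the lock */
    }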
165 - Problems With Naive Spinlock Question

What are the problems with this naive spinlock?


• Is there going to be too much contention?
• Or is it not good enough because it does not exploit caches?
• Or will it disrupt useful work?

166 - Problems With Naive Spinlock Solution

If you checked all three of them, you're exactly on the right track.
First of all, with the naive implementation there is going to be too much contention for the lock when the lock is released, because everybody, both T2 and T3 in the previous example, will jump in and execute test-and-set. If there are thousands of processors, all of them are going to be executing test-and-set, so there is going to be plenty of contention on the network.
The second answer is also right. From the previous lesson, a shared memory multiprocessor has private caches associated with every one of the processors, and often the caches are kept coherent by the hardware. Even with private caches associated with every processor, there is an issue with the test-and-set instruction: test-and-set CANNOT use the cached value, because it has to make sure that the memory value is inspected and modified atomically. Therefore, by definition, test-and-set does not exploit the caches. It bypasses the cache and accesses memory directly to ensure atomicity.
The third problem is also a good answer. When a processor releases the lock, that processor wants to go on and do some useful work. Say there are four processors trying to acquire the lock: only one of them is going to get it, and the others have to back off and wait in the while-loop. The one that does get the lock tries to do useful work, but because of all the contention from the other processors repeatedly trying to get the lock when it is not available, it is impeded from doing that useful work.
So this is really the problem: because test-and-set bypasses the caches, it causes a lot of contention on the network, and this contention also impedes the processor doing useful work from carrying on with its work.
167 - Caching Spinlock

Now let's look at how we can exploit the caches.


Now, we still have to execute the test-and-set instruction on memory so that we can atomically make sure that exactly one processor gets the lock. On the other hand, the processors that don't have the lock could exploit their private caches while waiting for it. That's why the algorithm I'm going to describe to you is called spin-on-read.
The assumption here is that you have a shared memory machine in which the architecture provides cache coherence. In other words, through the system bus or interconnection network, the hardware ensures that the caches are kept coherent. The waiting threads, instead of executing a test-and-set on memory, can spin locally on the copy of the lock value in their private cache. If that lock value in memory changes, the spinners will notice, thanks to the cache coherence ensured by the hardware.

• The idea of spin-on-read is that the lock algorithm will first check the memory location to see if it is locked. This is a normal atomic read operation on memory, not a test-and-set operation.
• If the location is not in the cache, you go to memory and bring it in. Once you bring it in, as long as this value doesn't change, you look at the value in the cache while spinning. Therefore, I'm not producing any contention on the network, because I am spinning on my local cache.
• There could be any number of processors waiting on the lock simultaneously. No problem with that, because all of them are spinning on the value of L in their private caches.
• If there is one processor that's actually doing useful work and needs to go to memory, it is no longer impeded by contention on the network.
• When the lock holder eventually releases the lock, all the waiting processors/threads will notice that L has become 0, because of the cache coherence mechanism. At that point I immediately try to acquire the lock by doing a test-and-set. So multiple processors may execute this test-and-set simultaneously.
• It's possible somebody else is going to beat me to the punch. If that happens, I simply go back to looping on my private copy of L and wait for the new lock holder to release the lock.
• So that's the idea of spin-on-read: you spin locally; when you notice that the lock has been released you try a test-and-set; if you get lucky, you get the lock; if you lose, you go back and spin again locally.
The unlock operation of course is straightforward. The releaser simply changes the memory location to set L = 0. The other processors observe it through the cache coherence mechanism and try to acquire the lock.
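A sketch of this spin-on-read lock (also known as test-and-test-and-set) in C atomics might look like the following; the read in the inner loop can be satisfied from the local cache, and test-and-set is attempted only when the lock appears free.

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = unlocked, 1 = locked */

    static void spin_lock(spinlock_t *L) {
        for (;;) {
            while (atomic_load(L) == 1)     /* spin on a cached read; no bus traffic */
                ;
            if (atomic_exchange(L, 1) == 0) /* lock looks free: try test-and-set */
                return;                     /* won the race */
            /* lost the race: go back to spinning on the cached value */
        }
    }

    static void spin_unlock(spinlock_t *L) {
        atomic_store(L, 0);
    }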
But note that when the lock is released, all the processors that are stuck in the spin loop will try test-and-set at the same time. We know that test-and-set bypasses the cache, so everyone hits the bus and goes to memory during test-and-set. With a write-invalidate cache coherence (CC-WI) mechanism, this results in O(N^2) bus transactions before all of these processors stop chattering on the bus: every test-and-set instruction invalidates the other caches, and as a result you have O(N^2) contention on the bus, where N is the number of processors simultaneously trying to get the lock. Obviously, this impedes the one processor that was able to acquire the lock, and it is clearly disruptive.
168 - Spinlocks With Delay

Now, in order to limit the amount of contention on the network when a lock is released, we're going to do something that we often do in real life: procrastinate.
Basically, each processor is going to delay before asking for the lock, even after it observes that the lock has been released. It's sort of like what happens at rush hour: if you find that the traffic is too heavy, you might decide not to get on the highway right now. Let's discuss two different delay alternatives.
• In the first one, when the lock is released and we break out of the spin loop, instead of immediately trying to acquire the lock by test-and-set, I delay myself by a certain amount of time. This delay is conditioned on the processor ID, so every processor waits a different amount of time.
• Since the delay is different for each processor, even if all of them notice that the lock has been released simultaneously, only one of them will try to acquire it at any given time. So we are, in effect, serializing the order in which the waiting processors try to acquire the lock.
That is one possible scheme for delaying, a static delay. Every processor has been preassigned a certain amount of delay, which means that even if the lock is available, if my assigned delay is very high, I will always have to wait longer than others. That's always an issue with static decision making.

• What we can do instead is make the decision dynamically. In this example, I'm keeping it very simple: if you don't get the lock, just delay a little bit before you try for it again. The delay is some small number to start with. If after this small delay I find that the lock is still locked, I increase the delay before the next attempt by doubling it. This is called exponential backoff.
• This essentially says that when the lock is not highly contended, I can get it in the first attempt (or first few attempts), so overall I do not wait too long before being able to acquire the lock.
• On the other hand, if the lock is highly contended, I will keep finding that the lock is locked. As a result, I keep increasing the delay to avoid the contention.
• One nice thing about this simple algorithm is that I'm not relying on the caches at all. Even if the machine happens to be a non-cache-coherent multiprocessor, this algorithm will still work, because we're always using test-and-set, which operates on memory.
Generally speaking, if there is a lot of contention, then static assignment of delay may work better than dynamic exponential backoff. But any kind of delay makes the lock algorithm perform better than the naive spinlock.
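A sketch of the test-and-set lock with exponential backoff follows (a minimal illustration; the initial and maximum delay values here are arbitrary).

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = unlocked, 1 = locked */

    static void cpu_pause(int iterations) {
        for (volatile int i = 0; i < iterations; i++)
            ;                               /* crude busy-wait delay */
    }

    static void backoff_lock(spinlock_t *L) {
        int delay = 16;                     /* arbitrary starting delay */
        while (atomic_exchange(L, 1) == 1) {
            cpu_pause(delay);               /* back off before retrying */
            if (delay < 65536)
                delay *= 2;                 /* exponential backoff, with a cap */
        }
    }

    static void backoff_unlock(spinlock_t *L) {
        atomic_store(L, 0);
    }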
169 - Ticket Lock

Up to now, we've talked about how to reduce contention when the lock is released; we've not yet talked about fairness. What do we mean by fairness? Well, if multiple threads are waiting for the lock, should we not give the lock to the one that made its request first? Unfortunately, in a spinlock, there is no way to distinguish who came first: as soon as the lock is released, they all try to grab it.
How can we ensure fairness in lock acquisition? Many shops and restaurants use a ticketing system to ensure fairness for those who started waiting early. For instance, I walk into the deli and my ticket is 25. I notice that they're currently serving 16, so I know I have to wait a little while. Once my number comes up, I get served.

That's the basic idea that we're going to use in the ticket lock algorithm.
• The lock data structure has two fields, a next_ticket field and a now_serving field.
• In order to acquire the lock, I first mark my position.
• I get a unique ticket by doing a fetch-and-increment() on the next_ticket field of the lock data structure.
• The result is that I get a unique number and the ticket counter is also advanced.
• When can I get the lock? I can wait by procrastination, i.e. pausing for an amount of time that is proportional to the difference between my_ticket and now_serving.
• After this delay, I check whether now_serving equals my_ticket. If yes, I can go ahead and acquire the lock; otherwise I go back to the loop.
• When the current lock holder releases the lock, it also increments the value of now_serving in the lock data structure.
• Eventually, now_serving will advance to my_ticket, and I can exit the loop and acquire the lock.
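A sketch of the ticket lock in C atomics is shown below; the proportional-delay constant is only illustrative, and the important parts are the fetch-and-increment on next_ticket and the spin on now_serving.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;            /* next ticket number to hand out */
        atomic_uint now_serving;            /* ticket currently being served */
    } ticket_lock_t;

    static void cpu_pause(unsigned iterations) {
        for (volatile unsigned i = 0; i < iterations; i++)
            ;
    }

    static void ticket_acquire(ticket_lock_t *lk) {
        unsigned my_ticket = atomic_fetch_add(&lk->next_ticket, 1); /* mark my place */
        for (;;) {
            unsigned serving = atomic_load(&lk->now_serving);
            if (serving == my_ticket)
                return;                     /* my turn */
            cpu_pause(16 * (my_ticket - serving));  /* wait roughly in proportion */
        }
    }

    static void ticket_release(ticket_lock_t *lk) {
        atomic_fetch_add(&lk->now_serving, 1);      /* pass the lock to the next ticket */
    }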
Now this algorithm is good in that it preserves fairness, but notice that every time the lock is released, the now_serving value in every waiter's local cache is going to be updated by the cache coherence mechanism, and that causes contention on the network.
So on one hand fairness is achieved, but on the other hand we have not completely gotten rid of the contention that happens on the network when the lock is released.
170 - Spinlock Summary

To summarize the spinlock algorithms we've seen so far: spin on test-and-set, spin-on-read, and spin on test-and-set with delay. None of these spin algorithms provides any fairness.
The ticket lock algorithm, on the other hand, is fair, but it is noisy.
So all of them fall short of our objectives of reducing latency and reducing contention.
Think about it: say T1 currently has the lock and all the other threads are waiting for it to be released. When T1 releases the lock, exactly one of them is going to get it. Why should all of them be attempting to see if they can get the lock?
Ideally, what we would want is that when T1 releases the lock, exactly one waiter receives the signal that it has got the lock. Because exactly one of them can get the lock anyway, T1 should signal exactly one next thread and not all of them. This is the idea behind the queueing locks that you're going to see next.
171 - Array Based Queueing Lock

We will discuss two different variants of the queueing lock.


The first one is the array-based queueing lock by Anderson, and I'll refer to it as Anderson's lock later on. Associated with each lock L is an array of flags.
• The size of this array is equal to the number of processors in the SMP. So if you have an N-way multiprocessor, you have N elements in the circular array of flags.
• This array of flags serves as a circular queue for enqueuing the requesters that are requesting the lock.
• This is quite intuitive: since we have at most N processors in the SMP, we can have at most N requests simultaneously waiting for this particular lock.
• Each element in the flags array can be in one of two states: one is the has-lock state and the other is the must-wait state.
• Has-lock says that whoever is waiting on that particular slot has the lock; whichever processor happens to be waiting on that slot is the winner of the lock.
• Must-wait, on the other hand, indicates that a processor waiting on that slot has to wait.
• There can be exactly one slot in the has-lock state, because it's a mutual exclusion lock.
• Therefore, only one processor can hold the lock at a time and all others must wait.

To initialize the lock, we mark one slot as has-lock and all the others as must-wait.
• An important point to notice is that the slots are not statically associated with any particular processor.
• As requesters come in, they line up in this array of flags by occupying the next available slot.
• The key point is that there is a unique spot for every waiting processor, but it is not statically assigned.
172 - Array Based Queueing Lock (cont)

We've initialized this array with has-lock (HL) in the first spot and must-wait (MW) in all the rest.
• To enable the queueing, we associate each lock with another variable called queuelast.
• The queuelast variable is initialized to zero; it indicates where the next available slot is in the array for you to queue yourself in.
At initialization there are no lock requests yet, so:
• The first requester that comes along asking for the lock will queue itself at the first slot and get the lock.
• The queuelast variable will advance to the next spot to indicate that future requesters have to start queuing up from there.
• The current lock holder can go off into the critical section and do whatever it wants in terms of messing with the shared data structure.

Let's say that at some point in time, I come along and request the same lock.
• I will be queued wherever queuelast is pointing (my place). I mark my place in the flags array by calling fetch-and-increment on the queuelast variable.
• Fetch-and-increment is an atomic operation; remember that it is one of the read-modify-write operations. Therefore, even though it's a multiprocessor and there could be multiple requesters trying to get the same lock at the same time, they will all be sequenced through this fetch-and-increment atomic operation, so there is no race condition.
• Of course, if the architecture does not support this fancy fetch-and-increment atomic read-modify-write operation, then you have to simulate it using the test-and-set instruction.
• The important point is that the array size is N and the number of processors is N, so nobody will be denied; everybody can come and queue up waiting for this lock.
• Once I've marked my position in the flags array, I basically wait for my turn, because the flag value at my place is must-wait. Given the timing of my lock request and the position of the current lock holder, you can see that I have some waiting to do, because there are quite a few requests ahead of me.
Once all the threads ahead of me are done with the lock and my place is changed to has-lock, I know that it is my turn to acquire the lock.
173 - Array Based Queueing Lock (cont)

Now let's see how unlocking works in the queueing lock, i.e. the unlock algorithm.

• First, the lock holder sets its current position in the array from HL back to MW.
• The reason is that this is a circular queue: queuelast will eventually point back to this position, so some other processor may use this slot and wait here later.
• Next, it marks the next slot ((current + 1) % N) as HL, so the next waiter gets the signal and knows that it has the lock now. That waiter can then enter the critical section and do whatever it wants with the shared data structure.
• This process goes on, and eventually my predecessor becomes the current lock holder. When my predecessor is done using the lock, it unlocks, and that's going to be the signal to me: my predecessor marks my spot as HL, and I know that I have the lock now.
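A sketch of Anderson's array-based queueing lock in C atomics follows (a minimal illustration; real implementations also pad each flag to its own cache line so that each waiter truly spins on a private location).

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROC 8                         /* N: maximum number of processors */

    typedef struct {
        atomic_bool has_lock[NPROC];        /* true = has-lock, false = must-wait */
        atomic_uint queuelast;              /* next free slot in the circular queue */
    } anderson_lock_t;

    static void anderson_init(anderson_lock_t *lk) {
        atomic_store(&lk->has_lock[0], true);           /* first slot starts as HL */
        for (int i = 1; i < NPROC; i++)
            atomic_store(&lk->has_lock[i], false);      /* rest are MW */
        atomic_store(&lk->queuelast, 0);
    }

    /* Returns my slot so I can pass it to release later. */
    static unsigned anderson_acquire(anderson_lock_t *lk) {
        unsigned my_place = atomic_fetch_add(&lk->queuelast, 1) % NPROC;
        while (!atomic_load(&lk->has_lock[my_place]))   /* spin on my own slot */
            ;
        return my_place;
    }

    static void anderson_release(anderson_lock_t *lk, unsigned my_place) {
        atomic_store(&lk->has_lock[my_place], false);               /* reset my slot to MW */
        atomic_store(&lk->has_lock[(my_place + 1) % NPROC], true);  /* signal successor */
    }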
Now let's talk about some of the virtues of this algorithm.
• The first thing is that there is exactly one atomic operation that you have to carry out, the fetch-and-
increment to get the lock. So there is only one atomic operation that needs to be done per critical section.
• The other thing is that the processes are all sequenced. In other words there is fairness. So whoever comes
into the queue first will get the lock first. This is also good news.
• Each waiter's spin variable, the slot it marked in this array, is distinct. That's another good
thing. In other words, I'm completely unaffected by all the signaling that is intended for other processes,
because I'm spinning on my own private variable while waiting for the lock. And, correlated with
what I just said, whenever the lock is released, exactly one waiter is signaled to indicate that he has got
the lock. That's another important virtue of this particular algorithm.
We saw that the ticket lock algorithm was fair, but it is noisy when the lock is released. That problem goes
away with this queueing lock, because only one process is signaled each time the lock becomes free.
Now you might be wondering, is there any downside to this array-based queueing lock? I assure you there is.
The first thing is that the size of the data structure is as big as the number of processors in the SMP system. The
space complexity for this algorithm is O(N) for every lock that you have in the program. So if you have a large-
scale multiprocessor with dozens of processors, that can start eating into the memory space. That's something
you have to watch out for; the space can be a big overhead.
The reason I'm emphasizing that is because in any well-structured multi-threaded program, even though we
may have lots of threads executing on all the processors, at any point of time a particular lock may not be
contended by all the processors; only a subset of them may be requesting the lock. But still, this
particular algorithm has to prepare for the worst case and create a data structure that is as big as the number of
processors.
That's the only downside to this, but all the other things are good stuff about this algorithm. This Anderson
queue is a static data structure, an array. So you have to allocate space for the worst case, i.e. as big as the number
of processors.
This is the catch in this algorithm.
174 - Link Based Queueing Lock

So to avoid the space complexity in the Anderson's array based queueing lock, we're going to use a linked list
representation for the queue. The size of the queue is going to be exactly equal to the dynamic sharing of the lock.
This particular linked list based queueing lock algorithm is due to the authors of the paper that I've prescribed for
you in the reading list, namely Mellor-Crummey and Scott. So sometimes this particular queueing lock is also
referred to as the MCS lock.

• In the lock data structure, the head of the queue is a dummy node. It is associated with every lock, so every
lock is going to have this dummy node, and we initialize it to indicate there are no requesters for the
lock. So this pointer points to nil since nobody is holding the lock.
• Every new requester will get a queue node. This queue node has two fields.
• One field is the got-it field, which is a Boolean. If it is true, I have the lock; if not, I don't have it.
• The next field is a next pointer to my successor in the queue.
• If I try to acquire the lock, I will get into the queue.
• If a successor also comes and requests the lock, he gets queued up behind me.
So that's this basic data structure.
• Every queue node is associated with a requester. The dummy node that we start with represents the lock
itself.
• Since we are implementing a queue, fairness is automatically assured. The requesters get queued up in the
order in which they make the request, and so we have fairness built into this algorithm, just like the
Anderson's array-based queue lock.
• The lock pointing to nil indicates there are no requesters yet. And let's say that I come along and request the
lock. I don't have to wait because there's nobody in the queue. I can go off into the critical section that is
associated with this particular lock.
So what I would have done, when I came in to make this lock request is to
• Get this queue node
• Make the lock data structure point to me
• Also set the next pointer to null, to indicate there's nobody after me.
Once I've done that, I know that I've got the lock.
175 - Link Based Queueing Lock (cont)

I was lucky this time that there was nobody in the queue when I, when I first came and requested the lock. But
another time, I may not be that lucky. There may be somebody else using the lock already.
• If that is the case, I would have to queue myself in this data structure by setting the last pointer of
the dummy node to point to me. This pointer always points to the last requester.
• I'm also going to fix up the linked list so that the current lock holder's next pointer also points to me.
• Why am i doing this? Well, the reason is that when my predecessor is done using the lock, he needs to
reach out and signal me.
• What am I doing? I'm spinning. What am I spinning on? I'm spinning on the got-it field, waiting for my
predecessor to set it to TRUE; it was initialized to FALSE when I came in.
• My next field, of course, is null because there is no requester after me.

176 - Link Based Queueing Lock (cont)

So now we can describe to you the lock algorithm. Basically the lock algorithm takes 2 arguments. One is the
name L that is associated with this particular lock. It also takes my queue node, ME, as an argument: “please put
my queue node on this lock request queue”.
When I make this call, it could be that I'm in the happy state, i.e. there are no requesters ahead of me.
But if it turns out that somebody is already using this lock, then I'm going to join the queue, and that has to be done
atomically. When I'm joining this queue, I'm doing two things simultaneously. One is I'm taking the pointer that
was pointing to my predecessor and making it point to me. I also need the coordinates of the previous guy so
that I can set his next pointer to point to me.
I have to do this double act and this has to be done atomically. So essentially, joining the queue is a double act
of breaking a link that used to exist here and make it point to me, as well as get the coordinates of my predecessor
so that I can let him point to me.
And remember that this is happening simultaneously. Perhaps there are other guys trying to join this queue.
Therefore, this operation of breaking the queue and getting the coordinate of my predecessor has to be done
atomically.
In order to facilitate that, we will propose having a primitive operation called fetch-and-store as an atomic
operation. The semantics of this fetch-and-store operation is that when you make this call and give it two
arguments, L and ME. What this fetch-and-store does is return to me what used to be contained in L (which
is the address of my predecessor) and, at the same time, store into L a pointer to the new node, which is
me.
So that is what is being accomplished by this fetch-and-store. The double act that I mentioned, getting my
predecessor's coordinates and setting L to point to me, is accomplished using this fetch-and-store operation.
It's an atomic operation; if the architecture does not have this fetch-and-store instruction, you have to
simulate it with the test-and-set instruction.
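Here is a minimal sketch of the lock (acquire) side using C11 atomics, where atomic_exchange plays the role of
fetch-and-store. The struct layout and names follow the lecture's description (got-it, next) rather than the paper
verbatim, so treat it as an illustration:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One qnode per requester; the lock variable itself is just a pointer
     * to the last requester (nil when the lock is free). */
    struct qnode {
        struct qnode *_Atomic next;     /* my successor in the queue     */
        atomic_bool           got_it;   /* set to TRUE by my predecessor */
    };

    typedef struct qnode *_Atomic mcs_lock_t;

    static void mcs_acquire(mcs_lock_t *L, struct qnode *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->got_it, false);

        /* fetch-and-store: swing L to point to me and get my predecessor,
         * both in one atomic step (atomic_exchange). */
        struct qnode *pred = atomic_exchange(L, me);
        if (pred != NULL) {
            /* Somebody is ahead of me: link myself behind him, then spin
             * on my own private got_it flag. */
            atomic_store(&pred->next, me);
            while (!atomic_load(&me->got_it))
                ;
        }
        /* pred == NULL means the lock was free; I hold it now. */
    }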
177 - Link Based Queueing Lock (cont)

Now that I have joined the queue and am waiting on the lock, how do I know that I've got the lock?
• Well, my predecessor who is currently using the lock will eventually come around and call this unlock()
function. The unlock() function again takes two arguments, one being the name of the lock,
and the other being the node (ME) of whoever is making the unlock call.
• In this case, the CURR node (lock holder) makes the unlock call. The unlock function removes CURR
from the list and signals CURR’s successor. The CURR node has a next pointer which points to the next
guy waiting for this particular lock and spinning on his got-it variable.
• So unlock function will set the got-it = TRUE for the successor. This will get me out of my spin loop and
now I have the lock. Then I can proceed with my critical section.
• Eventually I'll be done with my critical section and have to unlock as well.
Normally the unlock function involves removing myself from this linked list and then signaling the successor. The
special case occurs if there is no successor to me. When the special case occurs, I have to set the head/dummy
node to link to null, indicating that there is no more requester waiting.
But wait, there could be a new request that is forming. This new request would do a fetch-and-store(), get my
coordinates and set the list L to point to himself. So the new request is forming, but has not completed yet. In
other words, the next pointer in me is not pointing to this new request yet.
This is the classic race condition that can occur in parallel programs. In this particular case, the race condition
is between the unlocker (myself), and the new requester that is coming. Such race conditions are the bane of
parallel programs. One has to be very watchful for such race conditions. Being an operating system designer, you
have to be ultra-careful to ensure that your synchronization algorithm is implemented correctly. You don't want
to give the user the experience of the blue screen of death. You have to think through any corner cases that can
happen in this kind of scenario and design the software, the operating system in particular, in such a way that
all such race conditions are completely avoided. Now, let's return to this particular case and see how
we can take care of this situation.
178 - Link Based Queueing Lock (cont)

To solve the race condition for “forming request”, I need to double check if there is a request that is in “formation”.
• In other words, I want to have an atomic way of setting L = nil, if L is pointing to me.
• If L is not pointing to me, I cannot set it to nil because it's pointing to somebody else, i.e. there is a
requester that has just joined the queue.
• That's the invariant that I should be looking for, so I need an atomic way of checking for that invariant.
The invariant is in the form of a conditional store operation: store only if some condition is satisfied.
Now in this particular case, we will use a primitive called compare-and-swap(L, ME, nil). It takes three
arguments.
• The first two arguments is saying, here is L and this is me, check if these two are the same.
• If L == ME, then set L to the third argument, L = nil.
• The third argument is what L has to be set to if these two are the same.
That's compare-and-swap. You are comparing the first two arguments, and if the first two arguments happen to
be equal, then we set the first argument to be equal to the third argument.

179 - Link Based Queueing Lock (cont)

This compare-and-swap instruction is going to return true if it finds that L and ME are the same. It also sets L to
the third argument. In that case, it's a success, and success is indicated by a TRUE returned by the operation.
On the other hand, if the comparison failed, it won't do the swap and it will return FALSE.
That's the semantic of this particular instruction. Again, this is an atomic instruction and it may be available in
the architecture or not. If it is not available, then you have to simulate it using test-and-set instruction.
• In this particular example that I am showing you, when I try to do this unlock operation, this new guy has
come in and he is halfway through executing his lock algorithm.
• He has done the fetch-and-store and he's going to set up the list so that my next pointer will point to him.
That's the process that he's in right now.
• At that point, I want to do the unlock operation, and that's when I find that my next pointer is nil. But I
need to double check and make sure there is no “forming request”, so I will do this compare-and-swap.
• The return value of compare-and-swap will be false, indicating that this particular operation failed.
• Once I know that this operation has failed, I'm going to spin.
• Now what am I spinning on? So basically I'm going to spin on my next pointer being not nil.
• Now my next pointer is still nil, which makes me think that there's nobody after me.
• But I have also just learned that there's a request in formation, because compare-and-swap() returned FALSE.
• So I will spin waiting for my next pointer to become NOT nil.
• When will my next pointer become not nil? As stated in the lock algorithm earlier, my next will be set by
this new requester.
• This new requester has already gotten my coordinates and he is in the process of setting it up. Eventually,
he'll complete that operation and set my next to himself.
• After this, I can come out of my spin loop and continue with the unlock algorithm. I'm ready to signal the
successor, which is the new requester.
That's how we handle the corner case when a new request is forming while the lock is being released. A sketch of
the whole unlock path, including this corner case, follows.
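Continuing the earlier acquire sketch, here is the matching release side; again this is an illustration under the
same assumed struct and names, not the paper's code:

    static void mcs_release(mcs_lock_t *L, struct qnode *me)
    {
        if (atomic_load(&me->next) == NULL) {
            /* Nobody appears to be waiting.  compare-and-swap(L, me, nil):
             * if L still points to me, set it to nil and we are done. */
            struct qnode *expected = me;
            if (atomic_compare_exchange_strong(L, &expected, NULL))
                return;

            /* The CAS failed: a new request is in formation.  Spin until the
             * newcomer finishes linking itself behind me. */
            while (atomic_load(&me->next) == NULL)
                ;
        }
        /* Signal my successor: it comes out of its spin loop holding the lock. */
        struct qnode *succ = atomic_load(&me->next);
        atomic_store(&succ->got_it, true);
    }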
180 - Link Based Queueing Lock (cont)

I strongly advise you to look through the paper and understand both the linked list version as well as the previous
Anderson's array-based version of the queueing locks, because there are lots of subtleties in implementing
these kinds of algorithms in a parallel operating system kernel. Therefore, it is important that
you understand the subtleties by looking at the code.
I've given you, of course, a description at a semantic level of what happens. But looking at the code will
actually make it very clear what is going on in terms of writing a synchronization algorithm on a multiprocessor.
And one of the things that I mentioned is that both the linked-list-based queueing lock as well as the array-based
queueing lock require fancier read-modify-write instructions.
• For instance, the linked list queueing lock needs a fetch-and-store and also a compare-and-swap.

• Similarly, the array-based queueing lock requires a fetch-and-increment.


Now it is possible that the architecture doesn't have that. If that is the case then you have to simulate these fancier
read-modify-write instructions using the simple test-and-set instruction.

181 - Link Based Queueing Lock (cont)

So now let's talk about the virtues of this linked-list-based queueing lock. Some of the virtues are similar to those
of Anderson's queueing lock.
• The link based queueing lock is fair. And Anderson's lock is also fair. Ticket lock was also fair.

• Every spinner has a unique spin location to wait on and so that is similar to the Anderson's queue lock as
well. That's good because you're not causing contention on the network when the lock is released. When
one guy releases the lock, others if they're waiting, they don't get bothered by the signal. Only exactly one
processor gets signaled when the lock is released.

• Usually, there's only one atomic operation per critical section. The exception is the corner
case in the linked list queueing lock: for that corner case, we need a second atomic operation. But if the linked
list has several members in it, fetch-and-store alone is enough.

• The other good thing that we already mentioned is that the space complexity of this data structure is
proportional to the number of requesters to the lock at any point of time. So it is dynamic. It's not statically
defined as in the array-based queueing lock. This is one of the biggest virtues of this particular algorithm.
The space complexity is bound by the number of dynamic requests to a particular lock, not the size of the
multi-processor itself.

• Now the downside to this linked-list-based queueing lock is the fact that there is linked list maintenance
overhead associated with making a lock or unlock request. Because Anderson's array-based queue lock is
a regular static structure, it can be slightly faster than the linked-list-based algorithm.
One thing that I should mention is that both Anderson's array-based queue lock as well as the
MCS linked-list-based queueing lock may result in poorer performance if the architecture does not support fancy
instructions like these, because they have to be simulated using test-and-set. That can be a little detrimental to
these algorithms.
We have discussed different algorithms for implementing locks in a shared memory multiprocessor. If the
processor has some form of fetch-and-Φ operation, then the two flavors of queue-based locks (Anderson and
MCS) are a good bet for scalability. If, on the other hand, the processor only has test-and-set, then an exponential
backoff algorithm would be a good bet for scalability.
182 - Algorithm Grading Question

We've talked about several different algorithms, spin on test-and-set, spin on read, spin with delay, ticket lock,
Anderson's array queue lock, MCS link based queue lock. Along the way, I mentioned some of the attributes that
we look for
• latency for getting the lock
• contention when locks are released
• fairness
• whether the spin is on a private variable or a shared variable
• how many read modify write operations are required for acquiring a lock
• what are the space overhead associated with the lock
• when the lock is released, are we signaling one guy or are we signaling everybody?

183 - Algorithm Grading Solution

• MCS link-based queue lock and Anderson's array-based queue lock are the two algorithms that do quite
well on most of the different categories of attributes. If you have fancy read-modify-write instructions,
Anderson's and MCS locks give you the best performance on all these attributes.

• On the other hand, if the architecture does not support fancy RMW operations, then some sort of a delay
or exponential delay based spin lock algorithm may turn out to be the best performer.

• In fact, when the amount of contention for the lock is fairly low, it's best to use a spin lock with exponential
delay. On the other hand, if it is a highly contended lock, then it is good to use a spin lock that has
statically assigned spin spots for every processor, i.e. one of the queueing locks.
• One thing that I also want you to notice is that the number of RMW operations that you need for the
different lock algorithms really depends on the amount of contention for the lock. In the case of Anderson's
and MCS locks, the number of atomic operations is always one, regardless of how much contention there
is (and, in MCS, there is the corner case during unlocking that might result in an extra RMW operation).
In the case of the simple spin algorithms, the number of RMW operations that you have to perform per
critical section really depends on the amount of contention there is for the lock.
184 - Barrier Synchronization

In the previous lesson we looked at efficient implementation of mutual exclusion lock algorithms. In this lesson
we're going to look at barrier synchronization and how to implement that efficiently in the operating system.
This kind of synchronization is very popular in scientific applications. With the barrier, all of the threads have to
arrive at the barrier before they can proceed on. That's the semantic of the barrier synchronization.
I'm going to describe to you a very simple implementation of this barrier. The first algorithm I'm going to describe
to you as what is called a centralized barrier or also sometimes called a counting barrier.

• The idea is very simple. You have a counter, i.e. counting barrier.
• The counter is initialized to N, where N is the number of threads that need to synchronize at the barrier.
• When a thread arrives at the barrier, it will atomically decrement the count. Then it will wait for the count
to become zero, as long as the count is not zero, it will wait.
• All the processors except the last one will spin on the count becoming zero.
• Now when the last processor, the straggler, maybe T2 arrives, he will decrement the count, resulting in
count = 0. This is an indication to everybody waiting on count that they can be released from the barrier.
• Then the last processor to arrive will reset the count to N.
So that's the idea behind that.
• Decrement the count atomically when you come to the barrier.
• If the count is greater than zero, we spin because not everybody has arrived.
• When the last guy comes around, it will decrement the counter to zero.
• Once the counter becomes zero, all the guys that are stuck at the barrier will be released.
• Then the last processor will reset this count to N, so that the barrier can be executed again when all these
guys get to the next barrier.
185 - Problems With Algorithm Question

Now, I'm going to ask you a question. Given this very simple implementation of the barrier: decrementing count,
count becoming zero, resetting it to N by the last processor, and all the other threads waiting on the count not
yet being zero, do you see any problem with this algorithm?

186 - Problems With Algorithm Solution

The answer is yes there is a problem.


And the problem is that before the last processor comes and sets the count back up to N, the other
processors may race to the next barrier, and they may go through it because the count has not
been reset to N yet.
187 - Counting Barrier

So there is a problem with the centralized barrier. That is, when the count has become 0, if these threads are
immediately allowed to go on executing before the count has been reset to N, then they can all reach the next
barrier and fall through. That is a problem.
So the key thing to do to avoid this problem is to make sure that the threads that are waiting here do not leave the
barrier before the count has been reset to N. They're all waiting here for the count to become zero, and
once the count has become zero they are ready to go. But we don't want to let them go yet. We want to let them go
only after the count has been reset to N.
So we're going to add another spin loop here. That is, after they recognize that the count has become 0, they're
going to wait till the count becomes N again. So the ordering of these two statements is very important, obviously.
We want to wait till the count has become 0. At that point we know that the wait is over, but we want to make
sure that the counter has been reset to N by the last guy. Once that has been done, we are ready to go on executing
the code till the next barrier.

So we solve the problem with the first version of the centralized/counting barrier by having a second spin loop. But
that means there are two spin loops for every barrier in the counting algorithm. Ideally we would like a single spin
loop. That's the reason we have this particular algorithm, which is called the sense reversing barrier.
• In the sense reversing barrier, we will get rid of one of those spinning episodes, the arrival one. So what
you notice is that in addition to the count, there is a sense variable in the shared variables. We have
included a new variable called sense variable that's also shared by all the processes that want to
accomplish a barrier synchronization.
• The sense variable is going to be true for one barrier episode, and it's going to be false for the next barrier.
• You are in at most one barrier at a time. Therefore, if you call this barrier the true barrier, the next barrier
is going to be the false barrier.
That's the way we can identify which barrier we are in at any particular point of time as far as a given thread is
concerned by looking at the sense variable.
188 - Sense Reversing Barrier

So the sense reversing barrier algorithm is going to work like this.

• When a thread arrives at a barrier, it will decrement the count in a way exactly like in the counting barrier.
• But after decrementing the count, it will spin on sense reversal.
• Remember that the sense flag is TRUE for this barrier. Once everybody has progressed to the next barrier,
the sense flag will become FALSE.
• Therefore, we are executing the TRUE barrier. So if T1 comes along, it decrements the count and it will
wait for the sense to reverse.
• That's what all the processors will do except the last one.
• What does the last one do? Well, the last one, in addition to resetting the count to N, which was happening
in the counting barrier, will also reverse the sense flag.
• So the last processor finds that count = 0 and resets it to N. Then it reverses the sense flag from
TRUE to FALSE.
• After the sense reversal, all the other threads will come out of the spin loop and move on.
So you can see now that we have only one spinning episode per barrier; a minimal sketch of the algorithm follows.
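The sketch below uses C11 atomics; the names (count, sense, local_sense) follow the lecture's description, and
N is a placeholder for the number of participating threads:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 4                        /* number of threads (placeholder) */

    static atomic_int  count = N;      /* shared arrival counter          */
    static atomic_bool sense = true;   /* shared: flips once per barrier  */

    /* Each thread keeps a local sense, initialized to true, and passes a
     * pointer to it on every barrier call. */
    static void sense_reversing_barrier(bool *local_sense)
    {
        bool my_sense = *local_sense;

        if (atomic_fetch_sub(&count, 1) == 1) {
            /* Last arrival: reset the count first, then reverse the shared
             * sense to release everyone in one shot. */
            atomic_store(&count, N);
            atomic_store(&sense, !my_sense);
        } else {
            /* Everyone else: a single spin episode on the sense flag. */
            while (atomic_load(&sense) == my_sense)
                ;
        }
        *local_sense = !my_sense;      /* flip my sense for the next barrier */
    }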
But the problem is that you have a single shared variable for all the processors. So if you have a large-scale
multiprocessor with lots of parallel threads, it will cause a lot of contention on the interconnection network
because this shared variable becomes a hot spot.
Remember what our good friend Chuck Thacker said, less sharing means the multi-processor is more scalable.
That is something that we want to carry forward in thinking about how to get rid of this sharing that is happening
among the large number of processes in order to build a more scalable version of a barrier synchronization
algorithm.
189 - Tree Barrier

So I'm going to first describe to you a more scalable version of the sense reversal algorithm. The basic idea is to
use divide and conquer: a hierarchical solution that limits the amount of sharing to a small number of
processors. Essentially, if you have N processors, we break them up into small groups of K processors.
• We build a hierarchical solution, and the hierarchical solution naturally leads to a tree.
• At each node of the tree, K processors compete and accomplish a barrier among themselves.
• If you have N processors, then you have logK(N) levels in the tree.
• In this case, we have K = 2 and N = 8, so the number of levels in the tree is 3.
Let's talk about what happens when we arrive at a barrier.

At micro level, the algorithm works exactly like a sense reversing algorithm.
• These K=2 processors are sharing this data structure: a count variable and a locksense variable.
• This count and locksense variable are replicated in every level of the tree.
• At each level, each counter will be set to count = K in the beginning.
• Let's say at the bottom level, P1 arrives at the barrier, decrements this counter and starts spin waiting on
the locksense to reverse.
• Sometime later, P0 comes to the barrier and decrements the count.
• Now count=0 but we are not done with the barrier yet because the barrier is for all of the processors.
• From the perspective of P0, it knows that between P0 and P1, count = 0. So P0 will go up to the next level,
but P1 stays at the bottom level.
• When P0 arrives at the next level, again it decrements the count from 2 to 1 and it starts waiting on
locksense to reverse here.
• P1 is waiting at the bottom level for locksense to reverse while P0 waits at the upper level for the locksense
to reverse. Again, locksense won’t reverse until all processors have arrived at the barrier.
190 - Tree Barrier (cont)

• Of course, multiple processors can arrive at the barrier at the same time and all of them are going to work
with their local data structure. Each of them is waiting for their partner to arrive.

• So eventually, P3 arrives, decrements the count, sees it is zero, and P3 moves up.
• Now P3 is at the next level; it decrements the count, the count becomes zero, and again P3 moves up.
• Still, the barrier is not done until we know that everybody has arrived at the barrier.
• So in the meanwhile, P5 and P6 will move up in the same manner and finally P5 reaches the top level.
• Now P5 sees that P3 has already decremented the count to one. P5 further decrements the count to 0.
• At this point, everybody has arrived at the barrier.
To summarize, when a processor arrives at a barrier, it decrements the count.
• If the count is not zero, it spins on the locksense
• If the count becomes 0 after the decrement, the processor will check whether it has a parent. If yes, it will
recurse and do the same thing at the next level, until no more parents can be found.
• So continuing with our example, P0 came up one level, found that there is another parent, and is stuck
waiting there.
• When P3 comes later on, it moves up all the way to the top level.
• Lastly when P5 finally arrives here, it finds that there is no more parent. This is the root of the tree.
• At the root, count=0 indicates that the last processor (P5) has arrived, i.e. everybody has arrived at the
barrier and it's time now to wake up everyone.
191 - Tree Barrier (cont)

The last processor to arrive at the root of the tree is P5. P5 will start the waking up process for everyone.

• The way the wake up process works is that P5 will flip this locksense flag.
• So, when P5 flips this locksense flag, what's going to happen?
• The wakeup process starts from the root.
• In this case P5 and P3 having been released from the root level because the locksense is reversed.
• P3 and P5 will go down to the lower level and they're going to wake up the processors that are waiting at
this level of the tree by flipping the locksense at each level.
• (Note that for a general K, at every lower level of the tree, there are going to be K-1
processors waiting.)
• Going down from the root (0-th level), we have the first level of the tree, the second level of tree, etc...P3
will release P0 and P5 will release P6.
• Further down, P0 will release P1, P3 will release P2, P5 will release P4 and P6 will release P7.
And basically what each of the processors does on the way down is to flip this locksense flag.
Once everybody has been released from the barrier, the spin is done for all the processors that have been waiting
and the barrier is completed.
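Here is a minimal recursive sketch of this tree barrier in C11 atomics. It is an illustration under assumptions:
each processor is statically assigned a leaf-level node that it shares with K-1 partners, each node's locksense
starts as the opposite of the threads' initial sense, and the names (count, locksense, tree_barrier_arrive) are
mine, not from a paper.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define K 2                           /* fan-in at each node (placeholder) */

    struct tree_node {
        atomic_int        count;          /* arrivals still expected here; starts at K */
        atomic_bool       locksense;      /* flipped when this subtree is released     */
        struct tree_node *parent;         /* NULL at the root                          */
    };

    /* Called by every processor with the node it is statically assigned to and
     * its thread-local sense for this barrier episode. */
    static void tree_barrier_arrive(struct tree_node *n, bool sense)
    {
        if (atomic_fetch_sub(&n->count, 1) == 1) {
            /* Last arrival at this node: move up the tree (at the root this
             * means everybody has arrived). */
            if (n->parent)
                tree_barrier_arrive(n->parent, sense);
            /* On the way back down: reset this node for the next episode,
             * then release whoever is spinning here. */
            atomic_store(&n->count, K);
            atomic_store(&n->locksense, sense);
        } else {
            /* Not the last: spin on this node's locksense until released. */
            while (atomic_load(&n->locksense) != sense)
                ;
        }
    }

Note how the recursion captures both phases: the arrival walks up the tree, and the unwinding of the recursion is
exactly the wakeup walking back down, flipping locksense at each level.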
192 - Tree Barrier (cont)

The tree barrier is a fairly intuitive algorithm that builds on the simple centralized sense reversal barrier, except
that it breaks up the N processors into K-sized smaller groups, so that they spin on different sets of
shared variables with less contention. It's a recursive algorithm that builds on the centralized sense reversal
algorithm and allows scaling up to a large number of processors. Because the amount of sharing is limited to
K, as long as K is small, the amount of contention for shared variables is limited to that number. Those
are all good things about the tree barrier.
But there are lots of problems as well.
• The first problem that I want you to notice is that the spin location is not statically determined for each
processor. In the previous example, P1 arrived first and P1 decremented count and waited on the locksense.
Later P0 arrived, reduced the count to 0 and went up to the next level. In another execution of the same
program, it is possible that P0 arrives first and P1 arrives later. As a result, P0 will spin at the bottom level
while P1 moves up. So, the locksense flag that a particular processor spins on must be dynamically
determined, depending on the arrival pattern of all these processors at a barrier. And the arrival pattern
will be different for different runs of the program.
• The second problem is that the ary-ness of the tree determines the amount of contention for
shared variables. In our example we have K=2 for N=8. But if we have N=1000, we may end up with a
larger K (i.e. ary-ness). In that case, the contention for the shared data structures will be significant.
• The other issue with tree barrier is that it depends on whether our SMP hardware is cache-coherent or not.
If it is cache-coherent, then the spin could be on the cached variable in a private cache, and the cache
coherent hardware will indicate when the spin variable changes value. But if it's a non-cache-coherent
machine, the fact that the spin variable we have to associate with a particular processor is dynamic means
that the spin may actually be happening for P0 on a remote memory.
• Remember I mentioned to you that one of the styles of architecture is a distributed shared memory
architecture? Sometimes the distributed shared memory (DSM) architecture is also called a non-
uniform memory access architecture, or NUMA. The reason is that the access to local memory for a
particular processor is going to be faster than the access to a remote memory. If you don't have cache
coherence, then the spinning has to be done on a remote memory, and that goes through the interconnect
network.
So static association of the spin location of the processor is very crucial if it's a non-cache-coherent shared
memory machine.
193 - 4-Ary Arrival (MCS Arrival Tree)

So the next algorithm that I'm going to describe to you is due to the authors of the paper that we are reviewing in
this lesson, which is Mellor-Crummey and Scott, and for this reason, this algorithm is called MCS barrier. It's
also a tree barrier, but the spin location is statically determined, as opposed to dynamically as in the hierarchical
tree barrier above.
I'm showing you an arrangement of the MCS tree barrier for 8 processors, using a 4-ary arrival tree. The arrival
tree and the wake-up tree are different in the MCS algorithm.

There are 2 data structures that are associated with every parent: HC (have children) and CN (child not ready).
• HC is a data structure that is associated with every node and it is meaningful when a node “has children
nodes”. For example in this arrangement, node P0 has 4 children (P1, P2, P3 and P4) and P1 has 3 children
(P5, P6 and P7). Meanwhile, P2 - P7 don't have any children, so their HC vectors are all false. The HC vector
for P0 has 4 elements because this is a 4-ary tree. And P0 has 4 children, so all elements in its HC vector
are TRUE. For P1, the HC vector is true for the first 3 children and false for the fourth.
• What about CN? The CN data structure is a way by which each of these processor has a unique spot in
the parent node structure to signal when a child processor arrives at a barrier.
• So the black arrows in this structure show the arrangement of the tree, in terms of the parent child
relationship for the 4-ary arrival tree. The red arrows show the specific spots where a particular child is
going to indicate to the parent that they have arrived at the barrier. As you can see, since P1 has 3
children, the fourth spot is empty, indicating it only needs to wait on 3 children before moving up the tree.
So, the algorithm for barrier arrival is going to work like this.
• When each of these processors arrive at a barrier, they will reach into the parent’s data structure, which is
statically determined specific spots in the CN vector.
• Then the parent node, for example P1, can check whether his CN vector is all set. If it is all set (i.e. all his
children have arrived at the barrier), it can move up the tree.
• Once P1 moves up, it will inform P0, again, by going to the predetermined spot in the P0’s CN vector.
• As for P0, if P0 is the first to arrive at the barrier, it will just wait until all of its children arrive at the
barrier.
• When P0’s children arrive at a barrier, they know their position in the data structure relative to other
processes arriving at the barrier.
That is how the MCS arrival tree works.
• The arrival tree is a 4-ary tree and the reason why they chose to use a 4-ary tree is because there is a
theoretical result backing the use of 4-ary tree leading to the best performance.
• The second thing that I want you to notice is that each processor is assigned to a unique spot in this 4-ary
tree and this is by construction. Because of its unique spot, a particular processor may have children or
not.
• The other nice thing about this particular arrangement is that in a cache coherent multiprocessor, it is
possible to arrange so that all the specific spots that children have to signal the parent can be packed into
one word. Therefore, a parent has to simply spin on one memory location (one word) in order to know
the arrival of everybody. The cache coherence mechanism will ensure that P0 is alerted every time any of
these children modifies this shared memory location (the data structure sketch below illustrates this layout).
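To make the HC / CN / CP data structures concrete, here is a sketch of a per-processor record in C. The field
names and layout are illustrative, loosely following the lecture's description rather than the paper's exact
declaration (the wakeup side is described in the next section):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct mcs_barrier_node {
        /* --- 4-ary arrival tree ------------------------------------------ */
        bool have_child[4];                  /* HC: which arrival slots hold real children     */
        /* CN: the four child-not-ready flags are adjacent bytes, so on a
         * cache-coherent machine the parent spins on a single word that
         * covers all of its children. */
        _Atomic unsigned char child_not_ready[4];
        _Atomic unsigned char *parent_slot;  /* my byte inside my parent's child_not_ready[]   */

        /* --- binary wakeup tree ------------------------------------------ */
        atomic_bool *child_wakeup[2];        /* CP: where my (at most two) wakeup children spin */
        atomic_bool  wakeup_sense;           /* the statically assigned location I spin on      */
    };

On arrival, a processor clears its byte behind parent_slot once its own child_not_ready entries are all clear;
on wakeup, it flips the flags its child_wakeup pointers refer to.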
194 - Binary Wakeup (MCS Wakeup Tree)

The wakeup tree for the MCS barrier is a binary wakeup tree. Once again here, there's a theoretical paper that
backs this particular choice, that the critical path from the root to the last awakened child is shortest when you
have a binary wakeup tree. Even though the arrival tree is a 4-ary tree, the wakeup tree is a binary tree.
Let me explain the construction of this binary wakeup tree.

• Every processor is assigned a unique spot: P0 at the root, P1 and P2 at the next level, then P3, P4,
etc.
• The data structure used in the wakeup tree is a child pointer data structure, CP. The CP data structure
is a way by which a parent can reach down to the children and indicate that it is time to wake up. Once
again, depending on the particular location in this wakeup tree, some nodes may have children while some
may not.
• In terms of wakeup, when P0 notices that all processors have arrived at the barrier through the arrival tree,
P0 will wake up P1 and P2 using his CP data structure.
• The important point is that when it is time to wake up, each of these processes is spinning on a statically
determined location. So when P0 signals P1, it signals exactly P1, without affecting any
other processors. Similarly, when it signals P2, it signals exactly P2.
• Once P1 and P2 are woken up, they can march down the tree and signal P3, P4, P5 and P6 using the
statically assigned spots that their children are spinning on.
• The key point, I want to stress again, is that in this construction of the tree by design, we know a position
in the tree and we know exactly the memory location that each processor spins on. These red arrows
show the specific location that is associated with each one of these processors in the wakeup tree.
• Once the parents signal the children, they march down and signal their own children until everybody is
awake.
So the key takeaway with the MCS tree barrier is that the wakeup tree is binary while the arrival tree is 4-ary.
There are static locations associated with each processor, both in the arrival and the wakeup tree.
Through the specific statically assigned spot that each processor spins on, we ensure that the amount of
contention on the network is limited. By packing the variables into a single data structure, we can make sure that
the contention for shared locations is minimized as much as possible.
195 - Tournament Barrier

Okay, the next barrier algorithm we're going to look at is the Tournament Barrier.

The barrier is organized in the form of the tournament with N players. Since it's a tournament with N players and
two players play against each other in every match, there will be log2(N) rounds.
• As a result, the tournament barrier of 8 processors will have log2(8) = 3 levels by design.
• In the first round, we have four matches. The only catch is that we're going to rig this tournament. In other
words, the winner of each match in this round is pre-determined. So P0, P2, P4 and P6 will be the winners.

• What is the rationale for match fixing? The key rationale is that if the processors are executing on an SMP,
the winner P0 can simply sit and wait for processor P1 to come over and let him know that he
has won the match, P2 can wait until P3 comes over, and so on and so forth. What that means in an SMP,
is that the spin location where P0 is waiting for P1 to come and inform him that he's lost the match is
fixed/static.

• This is the idea behind match fixing, that the spin location for each of these processes are pre-determined.
• That is very useful especially if you don't have a cache-coherent multiprocessor. On an NCC NUMA
machine, it is possible to place the spin location in memory that is very close to P0, P2, P4 and P6,
respectively.
• So once all competitors have arrived, P0, P2, P4 and P6 will advance to the next round.
• Once again in the second round, we're going to fix the matches and the winners will be P0 and P4.
• Essentially P0 and P4 can spin on a statically determined location. Once again, when P2 and P6 arrive, P0
and P4 will advance to the next round.
• The pre-determined winners will propagate up this tree in this fashion, all the way up to the last round. In
this case, P0 is pre-determined to win so P0 spins on a statically determined location waiting for P4.
196 - Tournament Barrier (cont)

So at this point, when P0 is declared the champion of the tournament, we know that everybody has arrived at the
barrier. But this knowledge is only available to P0. So clearly, the next thing that has to happen is to free up all
the processors and let them move on to the next phase of your computation.

• First P0 will tell P4 that it's time to wake up. Just like in any tournament the winner walks over to the loser
and shakes hands.
• Then P0 and P4 can backtrack to earlier rounds and do the same thing. So P0 goes down and shakes hands
with P2, P4 goes down and shakes hands with P6, and so on.
• So, all of these winners will go down to the next level and wake up the losers at that level.
• Again, from the point of view of a shared memory multiprocessor, the spin locations for all those processes
are fixed and statically determined. If P4 knows that P0 is going to come over and shake hands, he can
spin on a local variable that is close to his processor, and again this is important for NCC NUMA machines.
Same thing with P2 and P6 at the next level.
• This process of waking up the losers at every level goes on till we reach round 1. When everybody's awake,
the barrier is done.
So the two things that I want you to take away are:
• The arrival moves up the tree like this with match fixing and all the respective winners at every round
waiting on a statically determined spin location
• Similarly, when the wake up happens, the losers are all waiting on statically determined spin location in
their respective processors.
So that's how the whole thing works; a sketch of both the arrival and wakeup phases follows.
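Here is a compact sketch of a tournament barrier for a power-of-two number of processors, written with C11
atomics. The flag arrays and their indexing are my own illustration of the idea (rigged matches, static spin
locations, sense reversal); they are not code from the paper. All flags start false and each thread's local sense
starts true, flipping between successive barriers.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROC  8                  /* number of processors; assumed a power of two */
    #define ROUNDS 3                  /* log2(NPROC)                                  */

    /* Statically assigned spin locations: arrival[w][r] is where the rigged
     * winner w of round r waits for its opponent; wakeup_flag[l][r] is where
     * the loser l of round r waits to be woken. */
    static atomic_bool arrival[NPROC][ROUNDS];
    static atomic_bool wakeup_flag[NPROC][ROUNDS];

    static void tournament_barrier(int i, bool sense)
    {
        int r, lost_at = ROUNDS;

        /* Arrival phase: play the rigged matches from the leaves up. */
        for (r = 0; r < ROUNDS; r++) {
            int bit = 1 << r;
            if (i & bit) {                                    /* pre-determined loser     */
                atomic_store(&arrival[i & ~bit][r], sense);   /* tell the winner I'm here */
                lost_at = r;                                  /* I drop out at this round */
                break;
            }
            while (atomic_load(&arrival[i][r]) != sense)
                ;                                             /* winner: wait on MY flag  */
        }

        /* Losers wait to be woken; the champion (i == 0) falls through. */
        if (lost_at < ROUNDS)
            while (atomic_load(&wakeup_flag[i][lost_at]) != sense)
                ;

        /* Wakeup phase: walk back down, waking the loser of each round I won. */
        for (r = lost_at - 1; r >= 0; r--)
            atomic_store(&wakeup_flag[i | (1 << r)][r], sense);
    }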
197 - Tournament Barrier - Comparison

So now that we understand the tournament algorithm, let's talk about its virtues. There is a lot of
similarity between the tournament barrier, the sense reversing tree barrier, and the MCS barrier algorithm.
First let’s compare the tree barrier and the tournament barrier:
• The main difference between the tree barrier and tournament barrier is that in the tournament barrier, the
spin locations are statically determined, whereas in the tree barrier the spin locations are dynamically
determined based on who arrives at a particular node in the barrier first.
• Another important difference between the tournament barrier and the tree barrier is that the tournament
barrier does not need an atomic fetch-and-decrement(count) operation, because all that happens at
every round of the tournament is spinning and signaling. What is spinning? Basically just reading.
And what is signaling? Just writing. So as long as we have atomic read and write operations in
the multiprocessor, we can implement the tournament barrier. Whereas, if you recall, in the tree
barrier we need a fetch-and-decrement(count) operation in order to atomically decrement the count
variable.
• Now what about the total amount of communication that is needed? Well, it's similar: as you go up the
tree, the amount of communication decreases because the tree is getting pruned towards the root, and the
same is true in the tournament barrier. The total amount of communication needed is O(N).
• The other important thing is that at every round of the tournament, there's quite a bit of communication
happening. In the first round of the tournament, P1 is communicating with P0, P3 with P2 and so on. All
of these red arrows are parallel communications that potentially take advantage of any inherent
parallelism that exists in the interconnection network. We can exploit that and that's good news.
• Another important point is that the tournament barrier works even if the machine is not a shared-
memory machine, because all that we're showing here is message communication. So P0 is waiting for
a message from P1, and so on. All of these arrows can be thought of as messages. So even if the
multiprocessor is a cluster (by a cluster what I mean is a set of processors in which the only way
they can communicate with one another is through message passing and there is no physical shared
memory), even in that situation, the tournament barrier will work perfectly fine.
Now let's make a comparison between tournament and MCS.
• Now because this tournament is arranged as a tournament there are only two processes involved in this
communication at any point of time in parallel. So it means that it cannot exploit the spatial locality that
may be there in the caches. One virtue of the MCS algorithm is that it could exploit spatial locality, that
is multiple spin variables could be located in the same cache line and the parent can spin on a location to
which multiple children can come and indicate that they have arrived. That's not possible in the tournament
barrier because it is arranged as a tournament where there are only two players playing against each other
in every match.
• Neither MCS nor tournament barrier needs a fetch-and-decrement operation, so that's good.
• The other important way in which the tournament barrier has an edge over MCS is the fact that the
tournament barrier works even if the processors are in a cluster. That's another good thing. Modern
computation clusters can employ on the order of thousands or tens of thousands of nodes connected
together through an interconnection network, and they can operate as a parallel machine with only
message passing as the vehicle for communication among the processors.
198 - Dissemination Barrier

The last barrier algorithm I'm going to describe to you is what is called a dissemination barrier.
It works by information diffusion in an ordered manner among a set of participating processors. It is not
pairwise communication as you saw in the tree, or MCS, or the tournament barrier.
The other nice thing about the dissemination barrier is that, since it is based on ordered communication among
participating nodes, it's like a well-orchestrated gossip protocol. Therefore N need NOT be a power of 2.

• There's going to be information diffusion among these processors in several different rounds.
• In each round k, a processor will send a message to another processor. The recipient depends on the round
we're in: processor P[i] will send a message to processor P[(i+2^k) % N].
• So we have five processors here; we can then figure out what's going to happen in every round.
• At round k=0, P0 is going to be sending a message to P1.
• Similarly P1 sends a message to P2, P2 sends to P3, P3 to P4, and P4 to P0. The communication is
arranged cyclically.
• This completes round 0 of the communication.
The key thing is that in every round, a processor is sending a message to a known processor based on their
own number.
All of these communications that I'm showing you are parallel communications. They're not waiting on each
other. So whenever P1 arrives at the barrier, it's going to tell P2 that it is ready.
Now how will these guys know that round 0 is finished? If you take any particular process here, say P2, as
soon as it gets a message from P1 and it has sent a message to P3, it knows that round 0 is done, as far as P2 is
concerned. Then P2 can progress to the next round of the dissemination.
So each of these processes independently decides whether the round is over, based on two
things:
• whether it has sent a message to its peer
• whether it has received the message from the ordained neighbor that it is supposed to hear from.
At the end of that, they can move on to the next round.
199 - Dissemination Barrier (cont)

How many communication events are happening in every round? Well, the communication events per round
are on the order of O(N), where N is the number of processors that are participating in this barrier because every
processor is sending a message to another processor.
• So now you can quickly see what's going to happen in the next round.
• In the next round, k = 1, each processor is going to choose a neighbor to send the message to based on this
formula. So P0 will send to P2, and similarly P1 to P3, P2 to P4, P3 to P0, and P4 to P1.
• Just as I said about round k = 0, every processor will know that this round is completed when it receives
a message from its ordained neighbor.
• So in this case, P2 is going to expect to receive a message from P0, and it has also sent its message to P4,
to which it is supposed to send the message in this round. Once it is done, P2 knows that round one is over
and it can progress to the next round.
• Just as I mentioned in the previous round, independent decisions are being made by each process in terms
of knowing that this particular round is over.
All of these communications happen in parallel, so if the interconnection network has a redundant parallel path,
these parallel paths can be exploited by the dissemination barrier, in order to do this communication very
effectively.
• In the next round, k = 2, every processor will choose a neighbor that is at distance 4
from itself.
• So, P0 is going to send a message to P4, P1 to P0, P2 to P1, P3 to P2 and P4 to P3.
200 - Barrier Completion Question

So, given that there are N processors participating in this dissemination barrier algorithm, how many rounds do
you think the dissemination barrier algorithm needs to complete?
The choices I'm giving you:
• N*log2(N)
• log2(N)
• ceiling of log2(N)
• N
A hint for you: I told you already that N in a dissemination barrier need not be a power of 2. So that's a hint for
you to answer this question.

201 - Barrier Completion Answer

The right answer is ceiling of log2(N).


The completion condition is that every processor has heard from every other processor.
We need to take the ceiling because N might not be an exact power of 2.
202 - Dissemination Barrier (cont)

So with N = 5, at the end of round two, every processor has heard from every other processor in the entire system.
Right? So you can eyeball this figure and see that every processor has received a message from every other
processor. Or in other words, it's common knowledge to every processor that everyone else has also arrived
at the barrier.
• How many rounds does it take to know that everybody has arrived at the barrier? Well, it's the ceiling of
log2(N). You take the ceiling because N need not be a power of two.
• So at this point, everybody is awake.
• For the dissemination barrier, there is no distinction between arrival and wakeup as you saw in the case
of the tree barrier or the MCS barrier or the tournament barrier.
• In the dissemination barrier, because it is happening by information diffusion, in the end, everybody has
heard from everyone else in the entire system. So everyone knows that the barrier has been breached.
• In other words, in every round of communication in the dissemination barrier, every processor receives
exactly one message.
• Overall the entire dissemination barrier information diffusion takes ceiling(log2(N)) rounds.
• So when every processor has received this ceil(log2(N)) messages, it knows that the barrier is complete
and it can move on.
I've been using the word message in describing this dissemination barrier. It's convenient to use that word
because it is information diffusion but if you think about a shared memory machine, a message is basically a spin
location. Once again, because I know an ordained processor is going to talk to me in every round, the spin location
for this guy is statically determined.
So every round of the communication, we can statically determine the spin location that a particular processor
has to wait on, in order to receive a message, which is really a signal from its ordained peer for that particular
round of the dissemination barrier.
Again static determination of spin location becomes extremely important if the multiprocessor happens to be
an NCC NUMA machine. In that case, what you want to do is to allocate the spin location in the memory that is
closest to the particular processor, so that it becomes more efficient.
As always, in every one of these barrier algorithms, you have to do sense reversal once the barrier is complete.
When the threads go to the next phase of the computation, they have to make sure that the barrier they arrive at is
the next barrier. So you need sense reversal to happen at the end of every barrier episode.
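Here is a minimal sketch of the dissemination barrier in C11 atomics. It uses per-round, per-processor counters
instead of the parity/sense flags of the MCS paper; since the counters only grow, successive barrier episodes need
no explicit reset, which achieves the same effect as sense reversal. The names and the MAX_ROUNDS bound are
assumptions of this sketch, not from the paper.

    #include <stdatomic.h>

    #define NPROC      5              /* number of participants (need not be a power of 2) */
    #define MAX_ROUNDS 32             /* enough rounds for any 32-bit NPROC                */

    /* msg[i][r]: how many times processor i has been signaled in round r. */
    static atomic_uint msg[NPROC][MAX_ROUNDS];

    /* 'i' is my processor number; 'episode' counts completed barriers (0, 1, ...)
     * and is kept thread-local by the caller. */
    static void dissemination_barrier(int i, unsigned episode)
    {
        unsigned dist = 1;
        for (int r = 0; dist < NPROC; r++, dist *= 2) {   /* ceil(log2(NPROC)) rounds    */
            int peer = (i + dist) % NPROC;                /* P[(i + 2^r) % NPROC]        */
            atomic_fetch_add(&msg[peer][r], 1);           /* "send" my message           */
            while (atomic_load(&msg[i][r]) < episode + 1)
                ;                                         /* wait for my ordained sender */
        }
    }

Each call is one barrier episode; the caller increments its local episode counter after every call.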
Now let's talk about some of the virtues of the dissemination barrier.
• The first thing that you'll notice is in the structure, there is no hierarchy. In the tree algorithms, you have
hierarchy in terms of the organization of the tree. In the dissemination barrier, there's no such thing.
• I already mentioned that this algorithm works for both NCC NUMA machine, as well as clusters. That's
also a good thing.
• There is no waiting on anybody else. Every processor is independently making a decision to send a
message. As soon as it arrives at the barrier, it is ready to send a message to its peer for that particular
round. And of course, every processor can move to the next round only after it has received a
corresponding message from its peer for this particular round. Once that happens, it can move on to the
next round of the dissemination barrier.
• All the processes will realize that the barrier is complete when they have received ceil(log2(N)) messages,
one per round. So if you think about the total amount of communication: the communication in every
round is fixed at N messages, and since there are ceil(log2(N)) rounds, the communication complexity of
this algorithm is O(N*log2(N)).

• Compare that to the communication complexity of the tournament or the tree barrier: in both cases, the
communication complexity was only O(N). Because of the hierarchy, as you go toward the root of the
tree, the amount of communication shrinks.
203 - Performance Evaluation

We covered a lot of ground discussing different synchronization algorithms for parallel machines, both mutual
exclusion locks as well as barriers, but now it's time to talk a little about performance evaluation.
We looked at a whole lot of spin algorithms and a whole number of barrier synchronization algorithms.
We also introduced several different kinds of parallel architectures: a shared memory multiprocessor that is cache
coherent, which may be a symmetric multiprocessor (SMP) or a non-uniform memory access multiprocessor
(NUMA); a non-cache-coherent shared memory multiprocessor; and message passing
clusters.

The question is, if you implement the different types of spin algorithms, which would be the winner on these
machines? Well, the answer is not so obvious. It really depends on the kind of architecture. So as OS designers, it
is extremely important for us to take these different spin algorithms and implement them on these different flavors
of architectures.
And the same thing you should do for the barrier algorithms as well. Which would be most appropriate to
implement on these different flavors of architectures?
As always, I've been emphasizing that when you look at performance evaluation that is reported in any research
paper, you have to always look at the trends. The trends are more important than the absolute numbers because
the dated nature of the architecture on which a particular evaluation may have been done, makes the absolute
numbers not that relevant, but trends could still be relevant. These kinds of architectures that I mentioned to you,
they're still relevant to this day.
I encourage you to think about the characteristics of the different spin lock and barrier synchronization
algorithms discussed in this lesson. We also looked at two different types of architectures. One was a symmetric
multiprocessor (SMP), the other was a non-uniform memory access (NUMA) architecture.
Given the nature of the two architectures, try to form an intuition on your own on which one will win in each
of these architectures. Verify whether the results reported in the MCS paper matches your intuition. Such an
analysis will help you very much in doing the second project.
204 - RPC and Client Server Systems

The next topic we'll start discussing is efficient communication across address spaces.
The client-server paradigm is used in structuring system services in a distributed system. If we use a file
server on a local area network every day, we are using a client-server system whenever we access that remote file
server. Remote procedure call (RPC) is the mechanism that is used in building this kind of client-server
relationship in a distributed system.

What if the client and the server are on the same machine? Would it not also be a good idea to structure the
relationship between clients and servers using RPC? Even if the clients and the servers happen to be on the same
machine, it seems logical to structure system services using this RPC paradigm. But the main concern is
performance, and the relationship between performance and safety.
Now for reasons of safety, which we have talked a lot about when we talked about operating system structures,
you want to make sure that servers and clients are in different address spaces or different protection domains.
Even if they are on the same machine, you want to give a separate protection domain for each one of these
servers for safety. But what that also means, there's going to be a hit on performance because of the fact that an
RPC has to go across the outer spaces. A client on a particular outer space, server on a different outer space. So
that is going to be a performance penalty that you pay.
As operating system designers, we want to make RPC calls across protection domains as efficient as a normal procedure call within a process. If we can do that, it would encourage system designers to use RPC as a vehicle for structuring services, even within the same machine.
Why is that a good idea? We have talked about the fact that when structuring an operating system as a microkernel, you want to provide every service with its own protection domain. If making a protected procedure call, or RPC, from one protection domain to another is much more expensive than a simple procedure call, it discourages system designers from providing the services independently in separate protection domains.
So you want the protection and you also want the performance.
205 - RPC Vs Simple Procedure Call

All of you know how a simple procedure call works. You have a process in which all the functions are compiled and linked together into a single executable. The caller makes a call to the callee and passes the arguments on the stack; the callee executes the procedure and then returns to the caller. The important thing is that all of the interactions between the caller and the callee are set up at compile time.

Now let's see what happens with a remote procedure call. In principle, a remote procedure call looks exactly like this picture: you have a caller and a callee.
• When the caller makes its RPC call, it is really a trap into the kernel.
• The kernel validates the call and copies the arguments of the call from the client's address space into kernel buffers.
• The kernel then locates the server procedure that needs to be executed and copies the arguments that it has buffered in the kernel buffer into the address space of the server.
• Once it has done that, it schedules the server to run the particular procedure.
• At this point, the server procedure actually starts executing using the arguments and performs a function
that was requested by the client. When the server finishes the procedure, it needs to return the results of
this procedure execution back to the client.
• In order to do that, it traps into the kernel; this is the return trap that the server takes in order to return the results back to the client.
• What the kernel does at this point is copy the results from the address space of the server into the kernel buffers, and then copy the results out from the kernel buffer into the client's address space.
• At this point, we have completed sending the results back to the client. So the kernel can then reschedule
the client who can then receive the results and go on its merry way of executing whatever it was doing.
So that's essentially what's going on under the covers, and it is fairly complex. More importantly, all of these actions happen at run time, as opposed to a simple procedure call, where everything is set up at compile time. That is one of the fundamental sources of the performance hit that an RPC system takes, because everything is being done at the time of the call.
In particular, let's enumerate the work that has to get done at run time.
• There are two traps. The first trap is a call trap. The other trap is a return trap.
• There are also two context switches. The first context switch is when the kernel switches from the client to the server to run the server procedure. The second is when the server procedure is done with its execution and the kernel has to reschedule the client to run again.
• So, two traps, two context switches, and one procedure execution.
That's the work that is being done by the runtime system in order to execute this remote procedure call.
So what are all the sources of overhead now?
• Well, first of all, when this call trap happens, the kernel has to validate the access, whether this client is
allowed to make this procedure call or not. The validation has to happen.

• Then it has to copy the arguments from the client's address space into kernel buffers. And potentially, if you look at this picture, there could be multiple copies that happen in order to do this exchange between the client and the server.

• Then there is the scheduling of the server in order to run the server code.

• Then there is the context switch overhead we talked about, including the explicit and implicit costs of doing context switches when we go from the client to the server and back again from the server to the client. Of course, dispatching a thread on the processor also takes time, which is the explicit cost of scheduling.
206 - Kernel Copies Question

207 - Kernel Copies Solution

The right answer is four times. Basically, the kernel has to


• Copy from the client address space into the kernel buffer. That's the first copy.
• The second copy is, the kernel has to copy from the kernel buffer into the server.
• Then the third time, when the server procedure has completed, the kernel has to copy the results from the server's address space to the kernel buffer.
• Then the fourth time, it's going to be copied from the kernel buffer into the client address space.
208 - Copying Overhead

The copying overhead in this client-server interaction is a serious concern in RPC design, because the copying happens on every call-return between the client and the server. So if there is one place where we want to focus on saving overhead, it is on avoiding copying multiple times between the client and the server.
If you go back to the analogy of a simple procedure call, the nice thing is that the arguments are set up on the stack. That might also involve some data movement, but there is no kernel involvement in it. That's what we would like to accomplish in the RPC world as well.

In fact, let's analyze how many times copying happens in an RPC system.
Recall that in an RPC system the kernel has no idea of the syntax and semantics of the arguments that are passed between the client and the server. And yet, the kernel has to be the intermediary in arranging the rendezvous between the client and the server. Therefore, what happens in the RPC system is that when a client makes a call, there is a client stub which takes the arguments of the client call, which live on the stack of the client, and makes an RPC packet out of them.
1. Building this RPC packet essentially means serializing the data structures that are being passed as arguments by the client into a sequence of bytes. This is the only way the client can communicate this information to the kernel. So this is the first copy, from the client stack into the RPC message.
2. The next thing that happens is that the client traps into the kernel, and the kernel says: well, there is a message, the RPC message, sitting in the client's user address space; I had better copy it into my kernel buffer. So the RPC message is copied from the client's address space into the kernel buffer. That's the second copy.
3. Next the kernel schedules the server in the server domain, because the server has to execute this procedure. Once the server has been scheduled, the kernel copies the kernel buffer (which has all the arguments packaged in it) into the server domain. This is the third copy.
4. Unfortunately, even though we have reached the address space of the server, the server procedure cannot access this yet, because from the point of view of procedure call semantics, the server thinks it is just executing an ordinary procedure call. The server procedure is expecting all of the arguments in their original form on the stack of the server, and that is where the server stub comes in. The server stub, just like the client stub, is a piece of code that is part of the RPC infrastructure and understands the syntax and semantics of the client-server communication for this particular RPC call. It can therefore take the information that has been passed into the server's address space by the kernel and structure it into the set of actual parameters that the server procedure is expecting. This copy, from the server domain onto the stack of the server so that the server procedure can execute, is the fourth copy.
So you can see that just going from the client to the server there are four copies involved.
• Two of these copies (the stub copies at either end) are at the user level.

• The other two copies are what the kernel does in order to protect itself and the address spaces from one another, by buffering the contents of one address space into a kernel buffer.
At this point, the server can start executing and do its job. When it is done, it has to do exactly the same thing to pass the results back to the client. So it is going to go through four copies again, except in reverse: we start from the server stack and go all the way down to getting the information onto the client stack.
So, in other words, with a client-server RPC call on the same machine, with the kernel involved in this process, there are going to be four copies each way.
This is a huge overhead compared to the simple procedure call that I showed you earlier. A sketch of the first, user-level copy follows.
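To make that first, user-level copy concrete, here is a minimal sketch of what a conventional client stub might do to serialize two arguments into an RPC message before trapping into the kernel. The rpc_msg layout, the procedure id, and the function names are illustrative assumptions, not an actual RPC library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format for a call to foo(int x, double y). */
struct rpc_msg {
    uint32_t proc_id;        /* which entry point to invoke */
    uint8_t  args[64];       /* serialized arguments        */
    uint32_t args_len;
};

/* First copy: the client stub serializes (marshals) the arguments that live
 * on the client's stack into a contiguous RPC message. */
static void client_stub_foo(int x, double y, struct rpc_msg *msg)
{
    size_t off = 0;
    msg->proc_id = 42;                               /* assumed id for foo() */
    memcpy(msg->args + off, &x, sizeof x); off += sizeof x;
    memcpy(msg->args + off, &y, sizeof y); off += sizeof y;
    msg->args_len = (uint32_t)off;
    /* The second copy happens next: when the client traps, the kernel copies
     * *msg from the client's address space into a kernel buffer. */
}
```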
209 - Making RPC Cheap

If RPC is to be a viable mechanism for structuring operating system services above the kernel using the client-server paradigm, then it is important to reduce this copying overhead.
Now let's see how we can reduce the overheads. The trick is to optimize the common case. What is the common case? The common case is the actual calls that the client makes to the server; we expect those calls to be made many times during the lifetime of the client and the server. So during the actual calls, we want to minimize the copying overhead and preserve the locality of the arguments and the results, so that they stay in caches accessible to the client and the server. That's the common case, and that's what we want to make as efficient as possible.
Setting up the relationship between the client and the server, on the very first call by the client, needs to be done exactly once. This process is called binding. Since the setup for the binding is done only once, it is a one-time cost, so we can afford to make binding more time consuming. These ideas should sound very familiar to you from the exokernel discussion we had before.

So now, let's discuss in more detail how this binding works.


The server has an entry point procedure called foo() that it wants to make available for clients to call. In order to make it available to everybody, it publishes this entry point procedure in a name server and lets the kernel know about it. The name server is like the yellow pages: anyone can look it up, and it tells clients what services are offered by a particular server.
The server exporting foo() is then basically waiting for bind requests to come from the kernel.
The client looks up the name server and finds that there is an entry point called foo(). So the client issues the call S.foo(), meaning that it wants to execute the procedure foo() on server S.
That is an RPC call, so the first time the client makes this call, it results in a trap into the kernel.
The kernel doesn't know whether the server is willing to accept calls from this client or not. Therefore, the kernel checks with the server whether this is a legitimate, bona fide client that can make calls on the entry point procedure foo(). The kernel basically makes an upcall into the server saying: there is a client with this identity that wants to make a foo() call. If the server recognizes that this client is a bona fide client that can make this call, it grants permission, via the kernel, for the client to call its entry point procedure foo().
Once this validation has been done, the kernel sets up a descriptor called the procedure descriptor. The procedure descriptor is a data structure in the kernel for this particular entry point procedure foo(), and setting it up is part of granting the client access to make this call. As part of this, the server tells the kernel the characteristics of the entry point procedure. The server says: this is the entry point address where you have to call me when there is a call, that is, the address in my address space where the code for this procedure foo() lives. It also indicates the size of an argument stack, and this argument stack is going to be the communication area between the client and the server.
What determines the size of this argument stack? The communication vehicle established between the client and the server depends on the kinds of parameters being passed from the client to the server and the results being passed from the server back to the client. Based on that, the server indicates to the kernel the size of the communication area it wants. That is the size of the A-stack.
It is also going to say how many simultaneous calls S is willing to accept for this particular procedure foo().
The purpose of this is, if this is a multi-processor and there are multiple cores and multiple processors available,
then it may be possible for S to farm out multiple threads to execute simultaneous calls that are coming in from
multiple clients distributed in the system.
So this procedure descriptor is specific to this procedure foo(), and it records (see the sketch below):
• where the entry point for this particular procedure is in the server's domain;
• the size of the communication buffer that the kernel needs to establish for communication between the client and the server;
• how many simultaneous calls the server is willing to accept for this particular procedure foo().
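As a concrete illustration, here is a minimal sketch of what the kernel's procedure descriptor for foo() might look like. The field names and types are assumptions for illustration, not the actual data structure from the paper.

```c
#include <stddef.h>

/* Hypothetical kernel-side procedure descriptor, created at bind time from
 * the information the server supplies for its entry point foo(). */
struct procedure_descriptor {
    void   (*entry_point)(void *a_stack);  /* address of foo()'s stub in the
                                              server's address space         */
    size_t   a_stack_size;                 /* size of the shared argument
                                              stack (A-stack)                */
    int      max_simultaneous_calls;       /* how many concurrent calls the
                                              server will accept for foo()   */
};
```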
210 - Making RPC Cheap (Binding)

So once the kernel gets all this information from the server, the kernel gets to work.

• First of all, the kernel creates this data structure on behalf of the server, and holds it internally for itself.
Only the kernel knows all the information that is needed, in order to make this upcall into the entry point
procedure.
• The kernel also establishes a buffer called the A-stack. The size of the A-stack was just specified by the server as part of the binding communication.
• The kernel allocates this A-stack as shared memory and maps it into the address spaces of both the client and the server. So essentially what we have now is shared memory for communication directly between the client and the server, without mediation by the kernel.
• The client and the server can communicate the arguments and the results back and forth directly, using this A-stack (which stands for argument stack).
• At this point, the kernel tells the client that it is good to go: it can make calls on this procedure foo() that server S has exported through the name server.
• Every time the client makes a call to S.foo(), it has to give the kernel a descriptor, which I'm going to call the binding object, BO. (In the western world, BO has a different colloquial connotation; I won't go there.) The BO is basically a capability that the client presents to the kernel to show that it is authenticated to make this call into the server domain for this particular procedure S.foo().
All the work that I have described up until now is the kernel mediation, in terms of entry point setup, that happens on the first call from the client. The important point is that the kernel knows that this binding object and this procedure descriptor are related. In other words, when the client presents a binding object, the kernel knows which procedure descriptor corresponds to that binding object, so that it can find the entry point to call into the server.
Once again, what I want to stress is that this kernel mediation happens only one time, on the first call by the client. A user-level analogue of the shared A-stack mapping is sketched below.
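The key mechanism established at bind time is that one region of memory, the A-stack, ends up mapped into both protection domains. As a user-level analogue (not the actual kernel code), here is how two processes on the same machine could share an argument area using POSIX shared memory; the object name "/astack_foo" and the size are illustrative.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define A_STACK_SIZE 4096             /* size the "server" asked for */

/* Both the client and the server processes call this: they open the same
 * named shared-memory object and map it, giving them a common region in
 * which arguments and results can be exchanged without per-call kernel
 * copies. */
void *map_a_stack(void)
{
    int fd = shm_open("/astack_foo", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, A_STACK_SIZE) < 0) {
        close(fd);
        return NULL;
    }
    void *a_stack = mmap(NULL, A_STACK_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return a_stack == MAP_FAILED ? NULL : a_stack;
}
```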
211 - Making RPC Cheap (Actual Calls)

Now let's see what is involved in making the actual calls between the client and the server.
• When the client makes the call, the client stub takes the arguments and puts them into the A-stack.
• In the A-stack, you can only pass arguments by value, not by reference. The reason is that the A-stack is the region that the kernel has mapped into both the client's and the server's address spaces.
• Since only the A-stack is mapped into the address spaces of both the client and the server, if the arguments contained pointers to other parts of the client's address space, the server would not be able to access them. So it is important that the arguments are passed by value, not by reference.
• The work done by the stub in preparing the A-stack is much simpler than what I told you earlier about the general RPC mechanism, where creating an RPC packet requires serializing the data structures that are being passed as arguments. In this case, the client stub simply copies the arguments from the stack of the client thread into the A-stack.
• Then the client traps into the kernel to make the procedure call to S.foo(). At this point, the client stub presents the kernel with the binding object associated with S.foo(); the binding object is the capability showing that this client is authorized to make calls on S.foo().
• Once the BO is validated by the kernel, the kernel can look up the procedure descriptor associated with the BO. The procedure descriptor gives the kernel the information it needs to pass control to the server and start executing the server's implementation of the RPC procedure.
Recall that the semantics of RPC is that when the client makes the RPC call, it will be blocked. It will wait for
the call to be completed before it can resume its execution.
• Therefore the optimization is that, because all of this is happening on the same machine, the kernel can borrow the client thread and doctor it to run in the server's address space.
• What do I mean by doctoring the client thread? Basically, you want the client's thread to start executing in the address space of the server, with the PC (the program counter register in the CPU) set to the entry point procedure pointed to by the procedure descriptor. So you have to fix the PC, the address space descriptor, and the stack that the server will use to execute this entry-point procedure.
For this purpose, the kernel allocates a special stack called the execution stack, or E-stack (not shown in the picture). It is the stack that the server procedure is going to use to do its own work; the server procedure may make its own procedure calls and so on, and all of that happens on the E-stack.
So the A-stack is only for passing the arguments, and the E-stack is what the server uses to do its own work, as sketched below.
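Here is a minimal sketch of the client-side fast path, assuming the shared A-stack mapping from the binding step. The foo_astack layout is an assumed wire format, and kernel_trap_call() stands in for the actual trap instruction; both are purely hypothetical.

```c
struct binding_object;                 /* opaque capability handed out at bind time */

/* Assumed layout of the argument area for foo(int x, double y) -> int.
 * Because the A-stack is shared memory, everything here is passed by value. */
struct foo_astack {
    int    x;
    double y;
    int    result;
};

/* Hypothetical trap into the kernel: presents the binding object so the
 * kernel can find the procedure descriptor and switch protection domains. */
void kernel_trap_call(struct binding_object *bo);

int call_foo(struct binding_object *bo, void *a_stack, int x, double y)
{
    struct foo_astack *a = a_stack;

    /* Client stub: copy the arguments by value into the shared A-stack
     * (no serialization, no kernel buffer). */
    a->x = x;
    a->y = y;

    /* Trap into the kernel with the BO; the kernel validates the BO, finds
     * the procedure descriptor, and doctors this very thread to run foo()
     * in the server's domain on an E-stack. */
    kernel_trap_call(bo);

    /* When control returns, the server stub has left the result in the
     * A-stack; copy it out for the caller. */
    return a->result;
}
```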

212 - Making RPC Cheap (Actual Calls) cont

So at this point, once the kernel has doctored the client thread to start executing the server procedure, it can transfer control to the server.
• It transfers control to the server, and now we are executing the server procedure in the server's address space.
• Because the A-stack is mapped as shared memory, it is also available in the server domain. The first thing that happens in the server domain is that the server stub copies the arguments from the A-stack to the E-stack.
• After the copying, the procedure foo() is ready to start and work with the E-stack.
Once foo() is done executing, the results have to be passed back to the client.
• The server stub takes the results of the procedure execution and copies them into the A-stack.
• Once the server stub has copied the results into the A-stack, it can trap into the kernel (this is a return trap), and that is the vehicle by which the kernel transfers control back to the client.
• When this return trap happens, there is no need for the kernel to validate it, unlike the call trap, because the kernel made the upcall in the first place and is expecting this return trap.
• At this point, the kernel re-doctors the thread to start executing in the client's address space: it knows the return address where it has to go back to in order to resume executing the client code, and it knows the client's address space.
• When the client thread is rescheduled to execute, the client stub gets back into action and copies the results from the A-stack into the stack of the client. Once it has done that, the client thread can continue with its normal execution.
So that's what is going on. The important point to notice is that copying through the kernel buffer is now completely eliminated: the arguments are passed to the server through the A-stack, and similarly the result is passed back to the client through the A-stack, as sketched below.
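For symmetry, here is a sketch of the server side, assuming the same hypothetical foo_astack layout as the client-side sketch; the body of foo_impl() is an arbitrary placeholder, and the return trap itself is not shown.

```c
/* Same assumed layout as in the client-side sketch. */
struct foo_astack {
    int    x;
    double y;
    int    result;
};

static int foo_impl(int x, double y)       /* the real work of foo() */
{
    return x + (int)y;                     /* illustrative body only */
}

/* Server-side stub for foo(), running on the E-stack in the server domain.
 * The A-stack pointer is the same shared region the client filled in. */
void server_stub_foo(struct foo_astack *a)
{
    /* Copy the arguments off the shared A-stack into locals on the E-stack,
     * then run the procedure as an ordinary call. */
    int    x = a->x;
    double y = a->y;

    int result = foo_impl(x, y);

    /* Copy the result back into the A-stack for the client stub to pick up,
     * then take the return trap into the kernel (not shown), which
     * re-doctors the thread to resume in the client's address space. */
    a->result = result;
}
```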
So let's analyze what we've accomplished in terms of reducing the cost of the RPC in the actual calls that are
being made between the client and the server.
213 - Making RPC Cheap (Actual Calls) cont

Recall that in “conventional RPC”, we have four copies just to transfer the arguments from the client into the server's domain: first copy the data into an RPC packet, then copy the RPC packet into the kernel buffer, then copy the kernel buffer into the server domain, and finally copy/unmarshal the data in the server domain. That was the original cost.

Now, life is much simpler.


• The client's stub copies (only copies, no serialization) the parameters into the A-stack. It is simply copying, because the client and the server know exactly the semantics and syntax of the arguments being passed back and forth, and therefore there is no need to serialize the data structures. This A-stack is shared between the client and the server.
• The server stub then takes the arguments from the A-stack and copies them onto the E-stack of the server. The execution stack is provided by the kernel for executing the server procedure. Once the server stub is done with the copying, the procedure is ready to be executed in the server domain.
So what we have accomplished is that the entire client-server interaction requires only two copies. The first copies the arguments from the client stack into the A-stack, which is usually called marshaling the arguments. The second takes the arguments sitting in the A-stack and copies them onto the server's stack, which is unmarshaling.
Both of these copies happen above the kernel and stay in user space.
So basically we got rid of the two copies that were being done into and out of the kernel.
These copies are also much simpler than the “serialization” process for creating an RPC argument message.
Using the A-stack is a more efficient way of creating the information that needs to be passed back and forth between the client and the server.
214 - Making RPC Cheap Summary

To summarize what goes on in the new way of doing RPC between the client and the server: during the actual call, copying through the kernel is completely eliminated thanks to the A-stack, which is mapped into the address spaces of both the client and the server.
• The first actual overhead incurred in making this RPC call is the client trap and the kernel's validation that this call can be allowed to go through.
• The second is switching domains. I told you about the trick of doctoring the client thread to start executing the server procedure; that is really switching the protection domain from the client's address space to the server's address space so that you can start executing a procedure that is visible only in that address space.
• Finally, when the server procedure is done executing, there is the return trap. That is the third explicit cost.
So there are three explicit costs associated with the actual call:
• the client trap and validation of the BO;
• switching the protection domain from the client to the server so that you can start executing the server procedure;
• the return trap to go back to the client's address space.
There are also implicit overheads associated with switching protection domains.
• The implicit overhead is the loss of locality due to the domain switch from the client's address space to the server's address space. We are now touching a different part of physical memory, so much of what the server needs may not be in the CPU caches (cache misses). In other words, there is a loss of locality due to the domain switch, in the sense that the caches and the processor may not hold everything the server needs in order to do its execution.

215 - RPC on SMP

This is where the multiprocessor comes in.


If you are implementing this lightweight RPC on a shared memory multiprocessor, then you can exploit the multiple processors that are available in the SMP.

• We can preload a server domain on a particular CPU and just let it idle there (a user-level analogue is sketched after this list). For example, if this particular server domain is loaded on CPU-2 and we do not let anything else disturb that CPU, the caches associated with CPU-2 will stay warm with the contents that this particular domain needs.
• If a client comes along and wants to make an RPC call, we direct the call to the CPU on which the server domain has been preloaded. The caches there will be warm, and therefore we can avoid, or at least mitigate, the impact of the loss of locality.
• Another thing the kernel can do is monitor how popular a particular server is. For a popular server, we may want to dedicate multiple CPUs, so that several domains of the same server are preloaded on several CPUs to cater to several simultaneous requests that may be coming in for that particular service.
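The kernel-internal mechanism of dedicating a processor to a server domain is not exposed this way, but as a rough user-level analogue on Linux, a long-running server process can pin itself to a single core so that its working set stays warm in that core's caches. The choice of CPU 2 mirrors the example above and is arbitrary.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to CPU 2 so that repeated RPC work executes on
 * the same core and keeps that core's caches warm (a user-level analogue
 * of preloading a server domain on a dedicated processor). */
int pin_to_cpu2(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);
    if (sched_setaffinity(0, sizeof mask, &mask) != 0) {  /* pid 0 = self */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```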

216 - RPC on SMP Summary

In summary, we have taken a mechanism that is typically used in distributed systems, namely RPC, and asked: if we want to use RPC as a structuring mechanism on a multiprocessor, how do we make it efficient enough to be a vehicle for structuring services? The reason you want to promote that is that when you put every service in its own protection domain, you are building safety into the system, and that is very important for the integrity of an operating system. If we can make RPC cheap enough, we can use it as a structuring mechanism to build OS services in separate protection domains.
217 - Scheduling First Principles

The next part of the lesson will cover scheduling issues in parallel systems. You'll notice once again when we
discuss scheduling issues that the mantra is always the same, namely, pay attention to keeping the caches warm
to ensure good performance.
We're going to look at scheduling issues in parallel systems.

Fundamentally, the typical behavior of any thread or process running on a processor is one of the following:
• compute for a while and then make a blocking I/O system call;
• synchronize with other threads in the application it is part of;
• or, if it is a computation-intensive (CPU-bound) thread, simply run out of the time quantum that the scheduler gave it on the processor.
Fundamentally, what that means is that each of these is a point at which the operating system, in particular the OS scheduler, can schedule some other thread or process to run on the CPU.
So how should the scheduler go about picking the next thread or process to run on the processor, given that it has a choice of other threads it can run at any point in time?
218 - Scheduler Question

Now I'll turn that into a question for you. How should the scheduler choose the next thread to run on the CPU?
• The first possibility is FCFS: first come, first served.
• The second possibility is to assign a static priority to every thread and pick the thread with the highest static priority to run on the processor.
• The third is that the priority is not static but dynamic; in other words, it changes over time.
• The fourth is to pick a thread whose memory contents are likely to be in the cache of the CPU.

219 - Scheduler Solution

If you picked any one, several, or all of the choices that I gave you, you're not completely off base. Let me talk through each of these choices and why it may be a perfectly valid choice for the scheduler in picking the next thread to run on the processor.

• There is an order of arrival at the processor, and the FCFS policy addresses that fairness concern.
• The second is: somebody paid a lot of money to run this program, so I'm going to honor the static priority assigned to every process or thread and pick the one with the highest priority. That's also a valid choice.
• The third possibility is that a thread's priority is not static. It may be born with a certain priority, but over time it might change; the Linux O(1) and CFS schedulers are examples.
• The fourth choice is to pick the thread that is likely to have the most of its memory contents in the CPU cache. A thread whose working set is warm in the cache is likely to do really well when it gets scheduled on the same processor, so it makes sense to suggest this as a good choice as well.
In this particular lecture, we're going to think about the last choice: picking the thread whose memory contents are likely to be in the CPU cache. Why does that make a lot of sense, especially on a multiprocessor, where there are going to be several levels of caches and cache coherence to worry about?
220 - Memory Hierarchy Refresher

Here's a quick refresher on the memory hierarchy of a processor.

• Between the CPU and the main memory, there are several levels of caches.
• Typically these days, you may have up to three levels of caches between the CPU and the memory.
• The nature of the memory hierarchy is that you can have something that is really fast but small (the L1 cache), or something really slow but large (main memory), and all the choices in between.
• If you take the disparity in speed between the CPU and the main memory, it is more than two orders of magnitude today, and it is only growing.
So any hiccup in which the CPU does not find the data or instructions it needs for the currently running thread in the caches, and has to go all the way to memory, is bad news for performance.
So what this suggests is that in picking the next thread to run on the CPU, it'll probably be a very good idea if
the scheduler picks a thread whose memory contents are likely to be in the caches. If not in the L1 cache, at least
in the L2 cache. If not in the L2 cache, at least in the L3 cache. So that it doesn't have to go all the way to the
memory in order to get the instructions and data for the currently running thread. That's an important point to
think about.

221 - Cache Affinity Scheduling

So that brings us to this concept of cache affinity scheduling.


• The idea is very simple: on a particular CPU P1, a thread T1 ran for a while and got de-scheduled at some point because it made an I/O call (or synchronized, or its timeslice expired, or whatever).
• The scheduler then uses the CPU to run some other threads.
• Finally, at some point in time, when T1 becomes ready to be scheduled again, it makes sense for T1 to be scheduled on the same CPU.
• Why? Because T1 used to run on P1, the memory contents of T1 were in the cache of P1. Therefore, if we schedule T1 on the same processor again, it is likely that T1 will find its working set in the caches of P1. That's why it makes sense to look at the affinity of a particular thread for a processor; in this case, the cache affinity of T1 is likely to be highest for P1.

But can anything go wrong? Well, what can go wrong is the following.
• When T1 was de-scheduled, the scheduler may have scheduled other threads, such as T2 or T3, before re-scheduling T1. The cache may be polluted by the contents of the intervening threads T2 and T3.
• When T1 runs again, it is possible that it will not find much of its memory contents in the cache anymore.
• Therefore, even though we re-schedule T1 on the same CPU, these intervening threads may have polluted the cache, and T1 is not likely to have a warm cache anymore.
The moral of the story is that you want to exploit cache affinity when scheduling threads on processors. But also,
you have to be worried about any intervening threads that may have run on the same processor and polluted the
cache as a result. This is the idea of cache affinity for a processor. Now we are ready to discuss different
scheduling policies that an operating system may choose to employ.
222 - Scheduling Policies

The first scheduling policy is a very simple one: first-come-first-serve, FCFS. You look at the order of arrival of threads in the scheduling queue, and the scheduler picks the one that became runnable earliest. This says that we give importance to fairness rather than to affinity.
The second policy is called fixed processor. For every thread, when I schedule the thread the first time, I pick a particular processor and always stick to it. The way we choose the initial processor on which to schedule T-i may depend on load balance, making sure that all the processors in the multiprocessor are equally stressed in terms of resource assignment. For the life of thread T-i, the processor on which T-i runs is always the same.
The third scheduling policy is called last processor scheduling. The idea is to run a thread on the last CPU it ran on, giving preference to the fact that T-i may have affinity for that processor because it used to run there. When a processor is looking for work and it looks at the run queue, its inclination is to pick a thread that has run on it before. If no such thread is available, it will of course pick something else. The idea is to make it likely that the thread this processor picks will find its memory contents already in this processor's cache. That's what we're shooting for in last processor scheduling.
The next couple of scheduling policies require more sophistication in terms of the information that the scheduler needs to keep on behalf of every thread in order to make a scheduling decision. The next policy is called the Minimum Intervening policy, MI for short. In MI, for every thread we keep a record of its affinity with respect to each processor, and we pick the processor on which this thread has the highest affinity to run it.
223 - Minimum Intervening Policy

If you look at the timeline for a particular processor P-j, it might look like this.

• Thread T-i was running here and got de-scheduled.


• Then a couple of other threads, T-x and T-y, ran on P-j.
• Now think about the affinity of T-i with respect to this processor P-j. At this moment, the affinity index of T-i on P-j is 2, because the number of intervening threads that ran on P-j between the last time T-i ran on it and now is 2.
• So clearly, a smaller index indicates a higher affinity.
• In order to implement the minimum intervening scheduling policy, you want to keep this affinity information for T-i with respect to every processor.
• If I have a multiprocessor with 16 processors, then there is an affinity index associated with every one of those processors for T-i. When it comes time to schedule T-i, I pick the processor on which its affinity index is the minimum (a small sketch follows).
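Here is a minimal sketch of the bookkeeping a scheduler might do for MI, assuming a small fixed number of processors and threads; the table intervening[t][p] counts how many other threads have run on processor p since thread t last ran there, and all names and sizes are illustrative.

```c
#define NPROC    16
#define NTHREADS 64

/* intervening[t][p]: number of other threads that have run on processor p
 * since thread t last ran there (the affinity index; smaller is better). */
static int intervening[NTHREADS][NPROC];

/* Called every time thread t is dispatched on processor p. */
void note_dispatch(int t, int p)
{
    for (int other = 0; other < NTHREADS; other++)
        if (other != t)
            intervening[other][p]++;   /* t pollutes p's cache for the others */
    intervening[t][p] = 0;             /* t's own affinity to p is renewed    */
}

/* Minimum Intervening: pick the processor where t's index is smallest. */
int mi_pick_processor(int t)
{
    int best = 0;
    for (int p = 1; p < NPROC; p++)
        if (intervening[t][p] < intervening[t][best])
            best = p;
    return best;
}
```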

224 - Minimum Intervening Plus Queue Policy

Next we have the limited minimum intervening policy.


• In the minimum intervening policy, if I have, let's say, 1,000 processors in the multiprocessor, then the amount of information that I have to keep for every thread is huge. For every processor in the multiprocessor, I need to keep an affinity index for this thread, and the scheduler may end up maintaining too much metadata on behalf of every thread.
• Therefore, there is a variant of minimum intervening called limited minimum intervening, which says: don't keep the affinity index for all the processors, just keep it for the top few processors.
• If the affinity index is only 2 or 3, those are the processors I care about. If the affinity index is 20 or 30, I'm not going to pick that processor, so why bother keeping all of the affinity indices for a particular thread? Just keep the top candidates. That's the idea behind the limited minimum intervening scheduling policy.
The last policy I'm going to introduce is called Minimum Intervening Plus Queuing.
• In minimum intervening plus queuing, besides the affinity index, I also look at the scheduling queue of the particular processor.
• Why do we need to do that? Well, if T-i is going to be scheduled on this particular processor P-j, there may be a scheduling queue associated with P-j which already has some number of threads waiting to be run.
• Therefore, even though I pick P-j based on cache affinity, by the time T-i actually gets to run, other threads may run before it.
• At the moment T-i is scheduled, I might find that the affinity index of T-i with respect to P-j is two.
• But then T-i has to be put into the scheduling queue of P-j. If the scheduling queue of P-j already has other threads waiting (say T-m and T-n), then by the time T-i actually gets to run on P-j, T-m and T-n will have already run on the processor.
• So even though the affinity index I computed at the point of the scheduling decision told me to put T-i on P-j, the reality is that T-i will not run immediately; it will run later in time.
• By the time it gets to run, the other threads that were already sitting in the queue of P-j may have polluted the cache of P-j.
That's the reason this scheduling policy is called minimum intervening plus queuing.
• In this case, the processor on which I want T-i to run is the one with the minimum of (intervening threads + queue length), i.e., (I + Q)min, as sketched below.
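A minimal sketch of the MI+Q selection step follows. The caller is assumed to supply the thread's per-processor intervening counts and the current queue lengths; both arrays are illustrative.

```c
/* Minimum Intervening plus Queuing: to the affinity index, add the number
 * of threads already waiting in each processor's local queue, and pick the
 * processor that minimizes the sum. */
int miq_pick_processor(const int *intervening_for_t, const int *queue_len, int nproc)
{
    int best = 0;
    int best_cost = intervening_for_t[0] + queue_len[0];
    for (int p = 1; p < nproc; p++) {
        int cost = intervening_for_t[p] + queue_len[p];   /* I + Q */
        if (cost < best_cost) {
            best = p;
            best_cost = cost;
        }
    }
    return best;   /* processor with (I + Q) minimum */
}
```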
So basically we have introduced five different scheduling policies: first come first serve, fixed processor, last processor, minimum intervening, and minimum intervening plus queuing.
For a really large system, we might not be able to keep the information for a thread with respect to ALL the processors in the system to do minimum intervening or MI+Q.
So we limit the amount of information that is kept for the threads. Remember, one of the attributes of a good operating system is that it makes decisions quickly and gets out of the way. From that point of view, the less information the scheduler has to sift through to make a scheduling decision, the faster it can do its work.
225 - Summarizing Scheduling Policies

So to summarize the scheduling policies,


• First come first serve simply ignores affinity and pays attention only to fairness.
• For fixed processor and last processor, the focus is the cache affinity of a thread with respect to a particular processor.
• The next two policies focus not only on cache affinity but also on cache pollution. In particular, they ask how polluted the cache is going to be by the time T-i actually gets to run on the processor. That's the question in both minimum intervening and minimum intervening plus queuing.

It should be clear from the discussion up until now that the amount of information the scheduler has to keep grows as you go down this list.
• For first come first serve, the order of arrival is all you care about: you put the threads in the queue in that order and you're done with it.
• Going down the list, the scheduler has to do a little more work, and the corresponding amount of information it has to keep for each of these scheduling policies grows.
• But the result of the scheduling decision is likely to be better when you have more information on which to base it.
Another way to think about these scheduling policies is that:
• The fixed processor and last processor policies are thread-centric: they ask what the best decision is for a particular thread with respect to its execution.
• MI and MI+Q are processor-centric: they ask which thread a particular processor should choose in order to maximize the chance that the cache is warm for that thread.
226 - Scheduling Policy Question

• On processor P-u, the queue contains one thread, T-x. On another processor P-v, the queue contains four threads: T-m, T-q, T-r, and T-n.
• There is a particular thread T-y, and the affinity (intervening) index of T-y with respect to P-u is 2; similarly, the index of T-y with respect to P-v is 1.
• The scheduling policy we're going to use is minimum intervening plus queuing.
Given that information, when T-y becomes runnable again, which queue will I put T-y on?

227 - Scheduling Policy Solution

Now ask the question: what is (I + Q)min for thread T-y?
• I + Q for P-u = 2 + 1 = 3
• I + Q for P-v = 1 + 4 = 5
So I am going to put T-y on P-u, because P-u has the smaller I + Q value, i.e., less intervention polluting the cache of P-u with respect to thread T-y.
228 - Implementation Issues

Now that we have looked at different scheduling policies, let's discuss the implementation issues of these scheduling policies in an operating system.
• The operating system can maintain a global queue of all the threads that are runnable in the system. When a processor is ready for work, it goes to this global queue, picks the next available thread, and runs it. How the queue is organized goes hand in hand with the scheduling policy itself.
• If the policy is something like FCFS, it makes logical sense to have a global queue and let the processors simply pick from the head of the queue. But a global queue becomes infeasible as an implementation vehicle when the size of the multiprocessor is really big. So typically, what is done is to keep local queues with every processor, and these local queues are going to be based on affinity.
• The particular organization of the queues on each processor depends on the specific scheduling policy.
• The important point in implementing the scheduling policies is that you have to have a ready queue of threads from which the processor will pick the next piece of work to do. The organization of these queues will be based on the specific scheduling policy that you choose to employ for the scheduler.
It might be that processor P2 runs out of work completely and nothing is in its local queue. In that case, it might pull work from its peers' queues. That is called work stealing in the scheduling literature, and it is commonly employed in multiprocessor schedulers (a small sketch follows).
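Here is a minimal sketch of per-processor queues with work stealing. The queue type, the lack of locking, and stealing from the head are simplifications for illustration; real schedulers typically use synchronized or lock-free deques and steal from the opposite end.

```c
#define NPROC 16

struct thread_node {
    struct thread_node *next;
    /* ... thread state would live here ... */
};

/* One ready queue per processor (locking omitted for brevity). */
static struct thread_node *rq[NPROC];

struct thread_node *pick_next(int my_cpu)
{
    /* Prefer local work, which preserves cache affinity. */
    if (rq[my_cpu]) {
        struct thread_node *t = rq[my_cpu];
        rq[my_cpu] = t->next;
        return t;
    }
    /* Local queue empty: steal work from a peer's queue. */
    for (int p = 0; p < NPROC; p++) {
        if (p != my_cpu && rq[p]) {
            struct thread_node *t = rq[p];
            rq[p] = t->next;
            return t;
        }
    }
    return NULL;   /* nothing runnable anywhere; stay idle */
}
```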
I mentioned that the way these queues are organized is based on the policy the scheduler picks, which might be affinity-based, fairness-based, and so on. But in addition to the policy-specific attribute, the scheduler may also use additional information in order to organize its queue. In particular, the priority of a thread is determined by three components.
• The affinity component, assuming it is an affinity-based scheduling policy.
• Every thread may be born with a certain priority; that is the base priority. A particular thread has a base priority when it is started, and this could be determined by something like how much money was paid.
• There is also an age component. This is sort of like a senior citizen discount: if a thread T-i has been in the system for a long time, you want to get it out of the system as quickly as possible, so you boost the priority of the thread by a certain amount so that it gets scheduled on the processor earlier.
So the three attributes that go into it are the base priority you associate with a thread when it is first created, the affinity it has for a particular processor, and the senior citizen discount the scheduler might give to a particular thread depending on how long it has been in the system (a toy computation is sketched below).
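A toy effective-priority computation combining the three components mentioned above; the weights and sign conventions are assumptions purely for illustration.

```c
/* Combine base priority, cache affinity, and age into one priority value. */
int effective_priority(int base_priority, int affinity_index, int age_ticks)
{
    /* A smaller affinity_index means a warmer cache, so it should raise the
     * priority; older threads get a boost so they can finish and leave. */
    return base_priority
         - affinity_index        /* cache-affinity component            */
         + age_ticks / 10;       /* aging ("senior citizen") component  */
}
```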
229 - Performance

Having discussed several different scheduling policies, we have to talk about performance. The figures of merit associated with a scheduling policy are threefold.
The first figure of merit is what is called throughput. This measures how many threads get executed or completed in a certain amount of time, i.e., how many threads have been pushed through the system. It is a system-centric metric. It says nothing about the performance of individual threads, how soon they finish their work and get out of the system; it only asks what the throughput of the system is with respect to the threads that need to run on it.
The next two metrics are user-centric metrics.
• The response time (wait time + completion time) asks: if I start up a thread, how long does it take for that thread to complete execution?
• The variance of response time asks: does the time it takes to run my particular thread vary depending on when I run it? Why would it vary?
• For instance, think about a first-come-first-serve policy. If I have a very small job to run and it gets the processor immediately, it is going to complete its execution quickly. But suppose that when I start up my particular thread, there are other threads ahead of me in the system that are going to take a long time to execute; then I am going to see a very poor response time.
• So from run to run, the same program may experience different response times depending on the load currently on the system. That is where the variance of response time comes in.
So clearly, from a user's perspective, I want the response time to be very good and the variance to be very small as well.
Now, when you think about first-come-first-serve scheduling, it is fair, but it doesn't pay attention to affinity at all, and it doesn't give importance to small jobs versus big jobs. It just runs them first come, first served, and therefore there is going to be high variance, especially when small jobs need the processor's attention while long-running jobs are currently on the processor.

230 - Performance (cont)

Now look at the memory footprint of a thread and the amount of time it takes to load its working set into the cache: the bigger the memory footprint, the more time it takes for the processor to get the working set of a particular thread into the cache.
This suggests that it is important to pay attention to cache affinity in scheduling. The cache affinity scheduling policies that we talked about are all excellent candidates to run on a multiprocessor.
In fact, the minimum intervening policy and minimum intervening plus queuing are both very good policies to use when you have a fairly light to medium load on a multiprocessor. Under such load, when a thread is run on a processor for which it has affinity, the cache is likely to still contain that thread's memory contents.
On the other hand, if the system is very heavily loaded, then it is likely that by the time a thread runs on the processor for which it supposedly has affinity, the cache will have been heavily polluted by other threads.
Therefore, if the load is very heavy, fixed processor scheduling may work out to be much better than the variants of the minimum intervening scheduling policies.
So the moral of the story is that you really have to pay attention both to how heavily loaded your processors are and to the kind of workload you are catering to. A really agile operating system may choose to vary the scheduling policy based on the load as well as the current set of threads that need to run on the system.

Another interesting wrinkle in choosing a scheduling policy is the idea of procrastination. Normally, when the processor is looking for work, we think of the scheduler as going to the run queue and picking the next thread to run. Why would procrastination help?
• First of all, how do we implement procrastination? Basically, instead of picking the next available thread from the run queue, the processor can insert an idle loop (sketched below).
• It does this because it checked the scheduling queue and found no thread that has run on it before; if it schedules any one of the threads that are there, that thread will start with a cold cache.
• So the processor says: let me just spin my wheels for a while and not schedule anything. It is likely that a thread whose cache contents are still on this processor will become runnable again soon; then I can schedule that thread, and that might be better for performance.
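Here is a minimal sketch of how procrastination might look in the scheduler's pick routine, reusing the per-processor queues from the work-stealing sketch above. find_runnable_with_affinity() and the PATIENCE bound are hypothetical placeholders, not real scheduler APIs.

```c
#define PATIENCE 10000   /* how long to procrastinate, in spin iterations */

struct thread_node;      /* same node type as in the work-stealing sketch */

/* Hypothetical helper: returns a runnable thread whose cache footprint is
 * believed to still be on this CPU, or NULL if there is none right now. */
struct thread_node *find_runnable_with_affinity(int cpu);

/* pick_next() is the routine from the work-stealing sketch above. */
struct thread_node *pick_next(int my_cpu);

struct thread_node *pick_with_procrastination(int my_cpu)
{
    for (int spins = 0; spins < PATIENCE; spins++) {
        struct thread_node *t = find_runnable_with_affinity(my_cpu);
        if (t)
            return t;        /* got a thread with a warm cache on this CPU */
        /* idle loop: deliberately schedule nothing for a little while */
    }
    /* Give up procrastinating: take whatever is runnable, cold cache or not. */
    return pick_next(my_cpu);
}
```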
So procrastination may help boost performance. We saw this already in the synchronization algorithms, where we talked about inserting delays in order to reduce the amount of contention on the interconnection network. It is the same principle. Often in system design, procrastination actually helps boost performance: it helps in synchronization, it helps in scheduling, and later on, when we talk about file systems, you will see that it helps in the design of file systems too.
231 - Cache Affinity and Multicore

Let's talk about cache affinity and modern multicore processors. In a modern multicore processor you have multiple cores on a single chip, and in addition to having multiple cores, each core is itself hardware-multithreaded.

What hardware multithreading means is that:


• There's only one execution engine in this core, but it has four threads that it can run on this core.

• If a thread is experiencing a long-latency operation, for instance it misses in the cache and has to go out and fetch the contents from memory, the hardware may switch to one of the other hardware threads and run that instead.

• In other words, the goal is to keep the core busy. When the threads are involved in long-latency operations, they are not switched out of the processor in terms of operating system scheduling. These are simply the threads that have been scheduled to run on this core, and the hardware switches among them by itself, without any intervention by the operating system, depending on what the threads are doing.

• If one thread makes a memory access that goes outside the processor, the hardware says: this one is going to be waiting for a while, so let me switch to another thread and let it do its work, because it is possible that what it needs is available in the cache. Of course, if all of the hardware threads are waiting on memory, then the core cannot do anything useful until at least one of those memory accesses completes.
That's the idea behind hardware multithreading.
It is not very unusual for modern multicore processors to employ hardware multithreading.
• In this example, there are four cores and in each core I have four hardware threads. So it is a four-way
hardware multithreaded core.

• There are two levels of caches, L1 and L2 cache.

• The L1 cache is specific to each particular core (C1, C2, etc.) and is shared by the hardware threads within that core.

• The L2 cache is common to all the cores.

• If the processor has only two levels of caches, L1 and L2, a miss in the L2 cache is really bad news, because then you need to perform the long-latency memory operation. Modern multiprocessors may in fact employ even more levels of caching: in addition to L1 and L2, there may be an L3 cache.
So what we have to think about now is how the operating system should make its scheduling decisions when scheduling on multicore, hardware-multithreaded platforms. Once again, there is a partnership between the operating system and the hardware.
• The hardware provides these hardware threads inside each core.

• The operating system picks threads from its pool of runnable threads and maps them onto the hardware threads that are available in the hardware.

• Clearly, the scheduling decision should try to ensure that the threads scheduled on a particular core, say C1, find their working sets in that core's L1 cache if possible, and similarly for C2, C3, and so on.

• The other thing the operating system may try to do is make sure that all the threads currently mapped onto the cores are likely to find their working sets in the common L2 cache.

• Of course, you can extend this idea if there is an L3 cache.


232 - Cache Aware Scheduling

So let me briefly introduce the idea of cache-aware scheduling for these multithreaded, multicore processors.
Let's assume you have a pool of 32 ready threads and a CPU with four cores, each of which is four-way hardware-multithreaded. In other words, at any point in time the operating system can choose 16 threads from this pool and put them on the CPU.
The job of the OS scheduler is to pick those 16 threads from the pool of ready threads and schedule them on the CPU. So how does the operating system make the choice?
• The operating system should try to schedule some number of cache-frugal threads and some number of cache-hungry threads on the different cores, so that the sum of the cache hungriness of the 16 threads executing on the CPU at any point in time is less than the total size of the L2 cache (a toy selection pass is sketched after this list).

• I say L2 cache in this simple example, but of course you can generalize this to the last-level cache.

• In other words, you want to make sure the sum total of the cache requirements of the set of threads scheduled on the processor is less than the total capacity of the last-level cache in the CPU.

• So we categorize threads as either cache-frugal threads or cache-hungry threads. Cache-frugal threads are the ones that require only a small portion of the cache to keep them happy; a cache-hungry thread is one that requires a huge amount of cache space.
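A toy cache-aware selection pass follows: greedily admit up to 16 threads whose combined cache footprint stays within the last-level cache. The footprints would come from profiling, and the numbers, structure, and greedy strategy are illustrative assumptions only.

```c
#define SLOTS     16            /* 4 cores x 4 hardware threads           */
#define LLC_BYTES (8u << 20)    /* assume an 8 MB last-level cache        */

struct ready_thread {
    int      id;
    unsigned footprint;         /* profiled cache footprint in bytes      */
};

/* Fill chosen[] with at most SLOTS thread ids whose total footprint fits
 * within the last-level cache; return how many were chosen. */
int pick_cache_aware(const struct ready_thread *pool, int npool, int chosen[SLOTS])
{
    unsigned used = 0;
    int n = 0;
    for (int i = 0; i < npool && n < SLOTS; i++) {
        if (used + pool[i].footprint <= LLC_BYTES) {   /* mix frugal and hungry */
            chosen[n++] = pool[i].id;
            used += pool[i].footprint;
        }
    }
    return n;
}
```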
Now how do we know which threads are cache frugal and which threads are cache hungry?
• Well, that's something we can know only by profiling the execution of the threads over time.
• The assumption is that many of these threads get to run on the CPU over and over again, so over time
you can profile these threads and tell whether a thread is cache-frugal or cache-hungry.
The more information the scheduler has, the better decision it can take in terms of scheduling.
But we have to be careful about that. In order for the system to do this monitoring and profiling, the
operating system clearly has to take some time away from the threads doing useful work.
I always maintain that a good operating system gives you the resources that you need and gets out of the way
very quickly. So, you have to be very careful about the amount of time that the OS takes in terms of doing this
kind of monitoring and profiling.
Such information is useful in scheduling decisions, but gathering it should not disrupt the useful work that the
threads have to do in the first place.
In other words, the overhead of information gathering has to be kept minimal, so that the OS does not consume
too many cycles on this kind of accounting work in the name of making better scheduling decisions.
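
As a concrete illustration, here is a minimal sketch (in C++) of a cache-aware selection pass. It assumes the scheduler already has per-thread working-set estimates from profiling; the names (Thread, pickThreads) and the 8 MB last-level cache size are illustrative, not from any real kernel.

    // Cache-aware selection sketch: fill the hardware contexts with a mix of
    // threads whose combined estimated footprint fits in the last-level cache.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Thread {
        int      id;
        uint32_t footprint_kb;   // estimated working-set size from profiling
    };

    std::vector<Thread> pickThreads(std::vector<Thread> ready,
                                    uint32_t llc_kb      = 8 * 1024,  // assumed LLC size
                                    size_t   hw_contexts = 16) {
        // Consider the cache-hungry threads first; cache-frugal ones then fill
        // whatever budget is left, which naturally mixes the two kinds.
        std::sort(ready.begin(), ready.end(),
                  [](const Thread& a, const Thread& b) { return a.footprint_kb > b.footprint_kb; });
        std::vector<Thread> chosen;
        uint32_t used_kb = 0;
        for (const Thread& t : ready) {
            if (chosen.size() == hw_contexts) break;
            if (used_kb + t.footprint_kb <= llc_kb) {   // keep the total under LLC capacity
                chosen.push_back(t);
                used_kb += t.footprint_kb;
            }
        }
        return chosen;   // any remaining hardware contexts stay idle this quantum
    }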

233 - Conclusion

Since it is well known that process scheduling is NP-complete (NP: nondeterministic polynomial time), we have
to resort to heuristics to come up with good scheduling algorithms.
The literature is rife with such heuristics. As workloads change, and as the details of the parallel system change,
namely how many processors it has, how many cores it has, how many levels of caches it has, and how the caches
are organized, there is always a need for coming up with better heuristics.
In other words, we've not seen the last word yet on scheduling algorithms for parallel systems.
234 - Introduction

Thus far, we've seen how to design and implement scalable algorithms that go into the guts of an operating system
for a parallel machine.
Now it is time to look at a case study of an operating system that has been built for a shared memory
multiprocessor.
This operating system is called Tornado.
The purpose of this case study is to understand the principles that go into the structuring of an operating system
for a shared memory multi-processor. Thus far, we have covered a lot of ground on parallel systems. And as a
reminder, I want to tell you that you should be reading and understanding the papers by
• Mellor-Crummey and Scott on synchronization
• Anderson and others on communication issues in parallel systems
• Squillante and Fedorova on scheduling.
What we're going to do now is look at how some of the techniques that we've discussed thus far get into a
parallel operating system. So, I'm going to look at one or two examples of parallel operating system case studies,
so that we can understand these issues in somewhat more detail.
235 - OS for Parallel Machines

Modern parallel machines offer a lot of challenges in converting the algorithms and the techniques that we have
learned so far into scalable implementations.

What are these challenges?


First, there's the size bloat of the operating system. The bloat comes from the additional features that we keep
adding to the operating system, and it results in system software bottlenecks, especially for global data structures.
Then, of course, we have already been discussing this quite a bit, that the memory latency to go from the
processor to memory is huge. All the cores of the processor are on a chip, and if you go outside the chip to the
memory, that latency is huge. A ratio of 100 : 1 is what we've been talking about and that latency is only growing.
The other thing that happens in parallel machines is this: within a single node we already have the memory latency
of going from the processor to the memory, but a parallel machine is typically constructed as a non-uniform
memory access (NUMA) machine. That is, you take individual nodes, each containing a processor and memory,
put them all together, and connect them through an interconnection network. With a NUMA architecture, the
access speed to memory differs depending on whether a processor is accessing memory that is local to it, or has
to reach out across the network and access memory that is farther away from it.
In addition to the NUMA effect, there is also the deep memory hierarchy itself. We already talked about the
fact a single processor these days contains multiple levels of caches before it goes to the memory. And this deep
memory hierarchy is another thing that you have to worry about in building the operating system for a parallel
machine.
There is also the issue of false sharing. False sharing means that even though programmatically there is no
connection between a piece of memory touched by a thread executing on one core and a piece of memory touched
by a thread executing on another core, the cache hierarchy may place those individually touched memory locations
on the same cache block. So programmatically there is no sharing, but because the memory touched by threads on
different cores happens to be on the same cache line, and because of the way the cache coherence mechanism
operates, the line appears to be shared.
The false sharing is happening more and more in modern processors, because modern processors tend to
employ larger cache blocks. Why is that? Well, the analogy I'm going to give you is that of a handyman. If you're
good at doing chores around the house, then you might relate to this analogy quite well. You probably have a tool
box if you're a handyman. If you want to do some work, let's say a leaky faucet that you want to fix, what you do
is you put the tools that you need into a tool tray and bring it from the tool box to the site where you're doing the
work. Basically, what you're doing there is collecting the set of tools that you need for the project so that you
don't have to go running back to the tool box all the time. That's the same sort of thing that's happening with
caching and memory. Memory contains all this stuff that I need.
The more I bring in from the memory, the less time that I have to go out to memory in order to fetch it. That
means that I want to keep increasing the block size of the cache, in order to make sure that I take advantage of
spatial locality in the cache design. And that increases the chances that false sharing is going to happen: the larger
the cache line, the greater the chance that memory being touched by different threads happens to be on the same
cache block, and that results in false sharing.
So all of these effects, the NUMA effect, the deep memory hierarchy, and the increasing block sizes that lead to
false sharing, are things that operating system designers have to worry about, to make sure that the algorithms and
techniques we have learned remain scalable when translated to a large-scale parallel machine. That's really the
challenge the operating system designer faces.
One of the things the OS designer has to work hard at is avoiding false sharing, that is, reducing write-sharing of
the same cache line. If a cache line is write-shared, whether among different cores of the same processor or across
processors on different nodes of the parallel machine connected by the interconnection network, that cache line
will keep migrating from one cache to another.
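
To make false sharing concrete, here is a minimal C++ sketch, assuming a 64-byte cache line; the struct and function names are illustrative. Two threads update two logically unrelated counters, first laid out on the same line, then padded onto separate lines.

    // Two counters that are never logically shared between the threads.
    #include <atomic>
    #include <thread>

    struct CountersShared {
        std::atomic<long> a{0};   // a and b likely sit on the same cache line,
        std::atomic<long> b{0};   // so writers on different cores ping-pong that line
    };

    struct CountersPadded {
        alignas(64) std::atomic<long> a{0};   // each counter gets its own line,
        alignas(64) std::atomic<long> b{0};   // eliminating the false sharing
    };

    template <typename C>
    void hammer(C& c) {
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a++; });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b++; });
        t1.join();
        t2.join();
    }

    int main() {
        CountersShared s;
        CountersPadded p;
        hammer(s);   // typically much slower: coherence traffic on one shared line
        hammer(p);   // typically faster: the two counters live on independent lines
    }

The operating system faces the same issue with its own data structures, which is why padding and careful placement of per-processor data matter so much at this level.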
236 - Principles

So, we can think about some general principles that one has to keep in mind as an OS designer when designing
operating systems for parallel machines.

• The first principle is of course cache conscious decisions. What that means is, you want to pay attention
to locality, exploit affinity to caches in scheduling decisions for instance.

• You want to reduce the amount of sharing of data structures. If you reduce the amount of sharing of data
structures, you reduce contention. We've seen this when we talked about different synchronization
algorithms.

• Another thing you want to do is keep memory accesses as local to each node of the multiprocessor as
possible; basically that means reducing the distance between the accessing processor and the memory.
The distance is already pretty big when you go from inside the chip to the memory, but it is even bigger
if you have to traverse the interconnection network and reach into memory that is on a different node of
the multiprocessor.
237 - Refresher on Page Fault Service
Let's understand exactly what happens during a page fault service.

• When a thread is executing on the CPU, it generates a virtual address; the hardware takes the virtual page
number (VPN) and looks up the TLB to see if it can translate that VPN to the physical page number (PPN)
of the frame that contains the contents of that page.
• If it is a TLB miss, it'll go to the page table and look up the page table to see if the mapping between VPN
and PPN is in the page table.
• This would have been there if the OS has already put the contents of the page in physical memory.
• But if the OS has not brought in that page from the disk into physical memory, the hardware may not find
the mapping between the virtual page and the physical frame. That will cause a page table miss and you
have a page fault.
• So you have a page fault, and the operating system has a page fault handler. The page fault handler will
locate where on the disk that particular page resides and allocate a physical page frame for its contents.
• Then do the I/O to move the virtual page content from the disk into the page frame that is allocated.
• Once the I/O is completed, the OS can update the page table to indicate now it has a mapping between
that VPN and the PPN.
In this way we handled the page fault by bringing in the missing page from the disc into physical memory.
• Then we update the page table to indicate that the mapping is now established between the VPN and PPN.
• Then we can update the TLB to indicate that we now have the mapping between the VPN and the PPN, and once
the TLB has also been updated, the page fault service is complete.
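
The walkthrough above can be condensed into a short sketch. This is a minimal C++ illustration using in-memory maps in place of the real hardware structures; all names (translate, allocFrame, readPageFromDisk) are illustrative placeholders, not a real kernel interface.

    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    using VPN = uint64_t;
    using PPN = uint64_t;

    std::unordered_map<VPN, PPN> tlb;          // stands in for the per-processor TLB
    std::unordered_map<VPN, PPN> page_table;   // the per-process page table
    PPN next_free_frame = 0;

    PPN  allocFrame() { return next_free_frame++; }    // shared resource
    void readPageFromDisk(VPN, PPN) { /* long-latency I/O elided */ }

    PPN translate(VPN vpn) {
        if (auto it = tlb.find(vpn); it != tlb.end()) return it->second;      // TLB hit
        if (auto it = page_table.find(vpn); it != page_table.end()) {         // TLB miss only
            tlb[vpn] = it->second;                                            // local TLB update
            return it->second;
        }
        // Page fault: allocate a frame, do the I/O, then establish the mappings.
        PPN ppn = allocFrame();
        readPageFromDisk(vpn, ppn);
        page_table[vpn] = ppn;   // update the page table once the I/O completes
        tlb[vpn] = ppn;          // finally update the TLB; the service is complete
        return ppn;
    }

Keep this sketch in mind for the analysis that follows: the two lookups and the TLB update are processor-local, while the frame allocation and page table update touch shared state.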
Now let's analyze this picture and ask the question, where are potential points of bottlenecks?
The first part is thread-specific: a thread is executing on the CPU, looking up the virtual page and translating it
to a physical frame. That is entirely local to the particular thread and to the processor on which the thread is
executing. No problem with that; there is no serialization at that point.
Similarly, once the page fault has been serviced, updating the TLB to indicate that there is now a valid mapping
between the virtual page number and the physical page number is done on the TLB that is local to a particular
processor, so it is a processor-specific action.
Now, let's come to the middle part. This is where all the problems are. We first have to allocate a physical page
frame, which is an operating system function. Then, once the I/O has completed, we have to update the page table
to establish the mapping between the virtual page and the physical frame.
And I told you that the page table is a common data structure that might be shared by the threads, in which case
all of these steps can lead to serialization.
So this is what we want to avoid: serialization when allocating data structures and physical resources in order
to service a page fault.
• What we are seeing here is entirely lookup, and that can be done in parallel. No problem with that. Reading
is something that you can do in parallel.
• Similarly, what happens when we update the TLB is local to a processor, so no serialization happens there
either.
• But in the middle steps we can have serialization if we're not careful. So, as an OS designer designing this
particular service, the page fault service, this is what you have to focus on to make sure that you avoid serialization.
238 - Parallel OS and Page Fault Service

Let’s look at the parallel operating system and page fault service.
The easy scenario for the parallel operating system is what I call a multi-process workload.
• Here we have threads executing on all the nodes of the multiprocessor, but these threads are completely
independent of one another.

• Think of each of these as a separate, independent process. Maybe you have a web browser here,
a word processor here, and so on.

• So they are completely independent processes.

• If that is the case, then if a page fault is incurred on one node and simultaneously a page fault occurs on
another node, they can be handled completely independently, because the threads are independent and the
page tables are distinct.

• Therefore, you don't have to serialize the page fault service. The parallel operating system is going to have
a page fault handler that's available in each one of these nodes. So the work can be done in parallel, as
long as there is no data structures that are shared among these different units of work that the operating
system has to do.

• So long as page tables are distinct, which is the case in a multi-process workload, there is no serialization.
The hard scenario for a parallel operating system is a multi-threaded workload.
• Now, what I mean by a multi-threaded workload is that you have a process that has multiple threads, so
there is opportunity for exploiting the concurrency that's available in the multiprocessor by scheduling
these threads on the different nodes of the multiprocessor.

• To make it concrete, I'm going to show only two nodes, N1 and N2, and assume that there are two cores
available in each one of these nodes. The OS may have chosen to put T1 and T3 on node N1, and T2 and
T4 on node N2.

• Now you have a multithreaded workload now executing on different nodes of the multiprocessor and there
is hardware concurrency, because there are multiple cores available.

• So in principle, all of these threads can work in parallel.


If they incur page faults, it is incumbent on the operating system to see how it can ensure that there is no
serialization of the work that needs to be done to service the page faults.
• The address space is shared and therefore, the page table is shared.

• Since the threads are executing on different processors, the TLBs will have shared entries because they
are accessing the same address space.

• Now what we want is to limit the amount of sharing in the operating system data structures when they are
executing on different processors.
In particular, for the mapping I've shown you, where T1 and T3 are executing on N1 and T2 and T4 are executing
on N2, what we would want is for the operating system data structures they have to touch to be distinct:
T1 and T3 should be working with a different part of the page table than T2 and T4.
That will ensure that you can have scalability.
239 - Recipe for Scalable Structure in Parallel OS

So popping up a level, what we can learn from the example that I just gave you with page fault service is: in order
to design a scalable OS service for a parallel operating system, you have to think about what is the right recipe.
For every subsystem that you want to design,

• First determine functionally what needs to be done for that service.


• Now you've got parallel hardware and therefore the functional part of that service can be executed in
parallel in the different processors. That's the easy part but in order to ensure concurrent execution of the
service, you have to minimize the shared data structures.
• So less sharing will result in more scalable implementation of the service.
• It is easy to say "avoid sharing data structures", but it is hard to put into practice. It is not always clear how,
in designing a subsystem, we can limit the amount of sharing of data structures.
• In the page fault service example, the page table data structure is maintained by the OS on behalf of the
process and logically shared. But if you want true concurrency for updating this data structure, it is
inappropriate to have a single (page table) data structure for the process. Updating a shared page table requires
mutex protection, and that leads to a serial bottleneck. At the same time, if we say, let's take this page table
data structure and replicate it on all the nodes of the multiprocessor, that is probably not a very good idea
either, because the operating system then has to worry about the consistency of the copies among all the
processors.
So we can now quickly see what the dilemma is of the OS designer. As an OS designer, we want the freedom
to think logically about shared data structures. But depending on the usage of the data structure, we want to
replicate or partition the data structure so that we can have less locking and more concurrency. That's the real
trick.
Keep this recipe and the principles we talked about in mind, and we will talk about one particular service,
namely the memory management subsystem, and how we can avoid serial bottlenecks using the techniques that
are proposed in the Tornado System. The key property is: less sharing leads to more scalable design.
240 - Tornados Secret Sauce

The secret sauce in Tornado for achieving the scalability is the concept called clustered object.
• A clustered object presents the illusion of a single object, but is actually composed of multiple
component objects, each of which handles calls from a specified subset of the processors.
• Each component object represents the collective whole for some set of processors, and is thus termed a
clustered object representative.
• For instance, node N0 may be looking at a representative that is different from the ones N1 and N2 see,
but the object reference is the same, i.e. there is the illusion of a single object.
• That's what I meant when I said: think of a shared data structure as logically one thing. But physically,
it may be replicated under the covers.

Who decides to replicate it? That's the decision of the operating system.
• The degree of clustering (i.e. the replication of a particular object) is an implementation choice of the
service. So as a designer of the service, you make the decision whether a particular object is going to have
a singleton representation, or one per core in the machine, or one per CPU (meaning it is shared by all the
cores of one CPU), or maybe one representation of an object for a group of processors.
• So when designing the service, you can think abstractly about the components of the service as objects,
and each object gives you the illusion of a single object behind a single reference. But under the covers
you might choose to implement the object with different levels of replication.
• Of course, for replicated objects we have to worry about consistency, and the Tornado system suggests
maintaining the consistency of objects through protected procedure calls (PPC).
• The reason for preferring PPC over the hardware cache coherence mechanism is that hardware cache
coherence is indiscriminate about which memory locations are shared and can cause false sharing.
• But, of course, when in doubt, use a single representation. That way, you have the hardware cache
coherence as a security blanket when you're not sure yet about the level of clustering.
All of these may seem a little bit abstract at this point of time, but I'll make it very concrete, when we talk about
a simple example, namely the memory management subsystem.
241 - Traditional Structure
Just to put our discussion in perspective, let's look at a traditional structure of an operating system.

• In the traditional structure of the operating system there is something called a page cache, which is in
DRAM, and this page cache is supporting both the file system and the virtual memory subsystem.
• The file system has opened files explicitly from the storage and they live in the page cache that is in the
physical memory.
• Similarly, processes are executing in the virtual memory and the virtual memory of every process has to
be backed by physical memory.
• Therefore, the page cache in DRAM contains the contents of the virtual pages.
• All these virtual pages are in the storage subsystem.
For the purpose of our discussion, we will focus only on the virtual memory subsystem. In the virtual
memory subsystem, the data structures are kept per process in a traditional structure.
• There is a PCB, the process context block or process control block, that contains information specific to
that particular process in terms of memory management, the memory footprint of that process.
• There is a page table that describes the mapping between the virtual pages that is occupied by the process
and the physical memory allocated in the DRAM.
• If the OS is also managing the TLB, then there will be a global data structure that describes the current
occupancy of the TLB for that particular process.
• Of course, all the virtual pages for the process are resident on the storage subsystem as well.
If there is a page fault, the missing virtual page can be brought from the storage subsystem into the page cache
for future access by the process.
This is the traditional structure. And for scalability, we want to eliminate as much of the centralized data
structures as possible.
242 - Objectization of Memory Management

Now let's talk about objectization of the memory management function.

We first start with the address space of the process.


• The address space of the process is shared by all the threads, and there will be a representation for the
address space: the Process object, which is shared by all the threads of the process that are executing on the CPU.
• So we can think of this Process object as somewhat equivalent to the PCB in a traditional setting.
• If you think about a multi-threaded application, the different threads of the application may be accessing
different portions of the address space. Therefore, there is no reason to have a centralized data structure
in the operating system to describe the entire address space of that process.
• So we break the process’s address space into Regions, e.g. a green region here and a purple region here.
• These regions have to be backed by files on the storage subsystem. So, similar to breaking up the address
space into regions, we are going to carve up the backing storage into what we call File Cache Manager
for each Region.
• For example, FCM1 is a piece of the storage subsystem that backs this Region R1 and similarly FCM2
backs Region R2.
• Of course, for any of these threads to do their work, the Regions that they execute in have to be in physical
memory. So we need a page frame manager, which will be implemented as a DRAM object.
• This DRAM object is the one that serves physical page frames. So when the page fault service needs to
get a physical page frame, it contacts a page frame DRAM object to get a physical page frame and bring
the contents of this FCM from the storage subsystem into the DRAM, for future use by a particular thread.
• And of course, you have to do the I/O in order to move the page from the backing store into DRAM. So
we will declare that there'll be another object, the Cached Object Representation - COR, and this is the
one that is going to be responsible for knowing the location of the object that you are looking for on the
backing storage and do the actual page I/O.
243 - Objectized Structure of VM Manager

So we end up with an objectized structure of the virtual memory manager that looks like this.
• Process object: it is equivalent to a PCB in the traditional setting.

• TLB: the TLB on the processor could be maintained by hardware or software depending on the
architecture.

• Region object: a portion of the process’s address space; essentially, the page table data structure is split
across these Region objects.

• FCM: the file cache manager that knows the location of the files on the backing storage for each Region.

• The FCM interacts with the DRAM manager in order to get a physical frame in case of a page fault in a
particular region.

• The FCM also interacts with COR (Cached Object Representation) of a page to perform the I/O work
of bringing the content from the disk into the physical memory.
So, this is the structure of the objectized Virtual Memory Manager.
Depending on the region of the virtual memory space that you're accessing, the path that a particular page fault
handling routine may take will be different.
• If you're accessing a page that is in the green region, then the top path will be taken.
• If the page fault happens to be in the purple region, then the bottom path will be taken.
So, logically, given the structure, let's think about the workflow of handling a page fault with this objectized
structure of the virtual memory manager.
• The thread T1 is executing, poor guy, and it incurs a page fault, which goes to the Process object.
• The Process object is able to give the virtual page number so we know which Region that particular page
fault falls into.
• The Region object is then going to contact the File Cache Manager that corresponds to this Region object.
• The FCM will do two things. It determines the backing file for the missing VPN; the file may contain
multiple pages, so it passes the file and offset information to the COR object, so that the COR can find the
faulting page's content on the storage device.
• The FCM will also contact the DRAM object in order to get a physical frame. Once it has the physical
frame and it has the actual location of the file, the COR object can perform the I/O and pull the data from
the disc into the DRAM.
• Now the page frame that was allocated for backing this particular virtual page has been populated
by the COR once its I/O is completed.
• The missing virtual page has been fulfilled and then the FCM can indicate to the Region that your page
fault service is complete.
• Then the Region can go through the Process object in order to update the TLB.
Now the page fault is resolved and the process can resume. This is the flow of information in order to make
the process runnable again, which faulted on a virtual page that is missing in physical memory.
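
The object interplay just described can be sketched in a few lines of C++. The class and method names below mirror the roles in the lecture (Process, Region, FCM, COR, DRAM), but the interfaces are illustrative stand-ins, not Tornado's actual code.

    #include <cstdint>
    #include <map>

    using VPN = uint64_t;
    using PPN = uint64_t;

    struct DRAM {                        // page-frame manager
        PPN next = 0;
        PPN allocFrame() { return next++; }
    };

    struct COR {                         // cached object representation: does the page I/O
        void pageIn(uint64_t file_offset, PPN frame) { /* disk -> frame, elided */ }
    };

    struct FCM {                         // file cache manager backing one region
        DRAM* dram; COR* cor; uint64_t base_offset;
        PPN fill(VPN vpn) {
            PPN frame = dram->allocFrame();                 // get a physical frame
            cor->pageIn(base_offset + vpn * 4096, frame);   // bring the page contents in
            return frame;
        }
    };

    struct Region {                      // one portion of the address space
        VPN lo, hi; FCM* fcm;
        bool covers(VPN v) const { return v >= lo && v < hi; }
    };

    struct Process {                     // the per-process (or per-CPU) representative
        std::map<VPN, Region*> regions;
        std::map<VPN, PPN>     tlb;      // stand-in for the real TLB update
        void handleFault(VPN vpn) {
            for (auto& entry : regions) {
                Region* region = entry.second;
                if (region->covers(vpn)) {
                    PPN frame = region->fcm->fill(vpn);   // Region -> FCM -> DRAM/COR
                    tlb[vpn] = frame;                     // back through the Process to the TLB
                    return;
                }
            }
        }
    };

Two faults in different regions take entirely different paths through this structure, which is exactly the property the objectization is after.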
So, now we have this flow. We also mentioned that a clustered object presents a single reference: to its clients,
a Region is simply a Region, regardless of how it is represented underneath.
Now, how do we replicate a Region object? Should the Region object be a singleton, or should it be replicated?
If you're going to replicate it, should it be replicated for every core, or for a set of processors, or for a group
of processors, and so on?

These are all design decisions that the operating system designer has to make.
• The Process object is mostly read-only. You can replicate it one per CPU, like a process control block
(PCB); all the cores on a CPU can share the same Process object, because ultimately the TLB is a common
entity for the entire processor, and since the Process object is the one updating the TLB, we can have a
single Process object per CPU that manages the TLB.
• What about the Region object? A Region represents a portion of the address space, and a portion of the
address space may be traversed by more than one thread. So a set of threads running on a group of
processors may actually access the same region concurrently. We don’t know how many threads may
actually access a particular region; that may have to evolve over time. But it is definitely a candidate for
partial replication: it is in the critical path of a page fault, so let's partially replicate the Region. Not one
replica per processor, but maybe one for a group of processors, because a group of processors may be
running threads that are accessing the same portion of the address space. The granularity of replication
decides the exploitable concurrency from parallel page fault handling.
Now, the interesting thing to notice is that the degree of replication and the design decision that we take for how
we cluster, the degree of clustering that we choose for every one of these objects is independent of one another.
• Now, what about the FCM object? The FCM object is backing a Region. There may be multiple replicas
of a Region, but all of those replicas are backed by the same FCM. So, we can go for a partitioned
representation of the FCM, where each partition represents the portion of the address space that is managed
by that particular FCM.
• So, you can see that there is a degree of freedom in how we choose to replicate process object, and how
we're partitioning the region objects. But once we've partitioned the Region, how we have replications for
each of these partitioned regions is something that is up for grabs as an OS designer.
• Similarly, for the FCM, because it's backing a specific region, we can go for a partitioned representation
of the FCM.
• What about the COR, the Cached Object Representation? This object is the one that really deals with
physical entities: it actually does the I/O from the disk into physical memory. Since we are dealing with
physical entities, it may be appropriate to have a true shared object for the COR, with all the I/O managed
by this single COR. Even though I'm showing you two different boxes here, in principle it could be a
singleton object that manages all of the I/O activity corresponding to virtual memory management.
• What about the DRAM object? You can have several representations for the DRAM object depending on
how the physical memory is managed. For example, we may have at least one representation of the DRAM
object for every DSM (Distributed Shared Memory) piece that you have in a single node's portion of the
physical memory. So in other words, we can break up the entire physical memory that's available in the
entire system into the portions that are managed individually by each processor. You can go even finer
than that if it is appropriate. It is a design decision up to the designer.
So we come up with a
• replicated Process object
• a partial replication for the Region object
• a partitioned representation for the FCM object
• a true shared object for the COR
• several representations for the DRAM object.
The nice thing about this objectized structure is that when we designed it, we did not have to think about how
each object would be replicated when we actually populate these objects. That level of design decision can be
deferred, thanks to the secret sauce available in the Tornado system: the clustered object.
244 - Advantages of Clustered Object

Clustered object offers several advantages.

• First of all, the object reference is the same on all the nodes. Regardless of which node a particular service
is executing on, it uses the same object reference.
• But under the covers, depending on the usage pattern, you can have different levels of replication of the
same object. This allows for incremental optimization, and you can also have different implementations
of the same object.
• It also allows for the potential for dynamic adaptations of the representations.
Of course, the advantage of replication is that the object references can access the respective replicas
independently. This means that you have less locking of shared data structures.
• Let's think about the Process object as a concrete example. The Process object is one per CPU and is
mostly read-only. Therefore, page fault handling can start on all of these different processors, independent
of one another.
• If they touch different regions of the address space, they will just take different paths depending on the
Region, as discussed earlier. What that means, using page fault handling as a concrete service, is that it
will scale with the number of processors. This is important because page fault handling is something that
is going to happen often.
• On the other hand, if we want to get rid of a Region, the destruction of the Region may take more time,
because the Region may have multiple representations and all of the representations of that particular
Region have to be gotten rid of. But that's okay, because we don't expect region destructions to happen as
often as page fault handling.
• So the principle again is to make sure that the common case is optimized. The common case is page fault
handling, Region creation and destruction happen more rarely and it is okay if those functionalities take
a little bit more time.
245 - Implementation of Clustered Object

Let's now talk about the implementation of clustered objects.

The key to the implementation of clustered objects is the use of per-processor translation tables.
The translation table maps an object reference to a representation in memory.
Again, the reference itself is common, but the same object reference may point to a different replica on
different processors. That is the function of the translation table.
So on each CPU, this is what happens: when an object reference is presented, the OS converts it to a
representation. That is the normal operation.
Now, you present an object reference but that object reference may not be in the translation table yet, because
this object has not been referenced so far. In that case, you'll have a translation table miss.
There is another data structure in the OS called the miss handling object table. The miss handling object table
is a mapping between the object reference that you are presenting and the handler that the OS has for dealing with
this missing object reference.
The per-cluster object miss handler knows the particular representation for this reference and it is also going
to make the decision: Should this object reference point to a representation that already exists? or should it create
a new representation for it?
Once the handler finds/creates a representation for this object reference, it installs the mapping between this
object reference and this representation into the translation table. So subsequently, when you present the object
reference, it'll go to the underlying particular representation.
This is the work done by the object miss handler on every translation miss. The object reference is locally
resolved in this case, because the object miss handler is locally available, per-cluster.
To allow the global miss handler to locate the clustered object, the clustered object miss handler is installed at
creation time in a global miss handling object table. This table is not replicated per-processor, but instead is
partitioned, with each processor receiving a fraction of the total table. The global miss handler can then use the
object reference that caused the miss to locate the clustered object’s miss handler. This interaction with the
clustered object system adds approximately 300 instructions to the overhead of object creation and destruction.
Again, although significant, we feel the various benefits of clustered objects make this cost reasonable, especially
if the ratio of object calls to object creation is high, as we expect it to be.
For example, if we are referencing a Region object, under the cover each region object itself is only a portion
of the virtual address space, so the Region miss handler can also be partitioned specific to each region.
It can happen that the object miss handler is not available locally. Now how will that happen?
So in this particular example that I give you, the miss handling table happens to contain the miss handler for
this particular object reference. It is possible that when an object referenced is presented in a particular processor,
the object miss handler is not local, because the miss handling table is a partitioned data structure.
What happens in that case? Well, that's why you have a notion of a global miss handler, and the idea here is if
the miss handling table does not have the miss handler for that particular object reference, then you go to a global
miss handler. This is something that is replicated on every node. Every node has its global miss handler. And this
global miss handler knows exactly the partitioning of the miss handling table. So it knows, how this miss handler
table has been partitioned and distributed on all the nodes of the multiprocessor.
And so, if an object reference is presented on a node, the translation table will say, well, you know, this
particular object reference, we don't know how to resolve it because the object miss handler doesn't exist here.
And therefore, we're going to go to this global miss handler. And the global miss handler, because it is replicated
on every node, it says, oh, I know exactly which node has the miss handling table that corresponds to this object
reference.
And so it can go to that node. And from that node it can obtain a replica, and once it obtains a replica, it can
populate it in the translation table for this particular node. And once it populates it, then we are back in business
again, as in this case. So, the function of the global miss handler is to resolve the location of the object miss
handler for a particular object reference. So, given an object reference, if you have no way of resolving it
locally, then the global miss handler that is present on every node can tell you the location of the object miss
handler for this particular object, so that you can resolve that reference, get the replica for it, install it locally,
populate the translation table, and then you're back in business again.
So what this workflow tells you is how the Tornado system can incrementally optimize the implementation of
the objects. Depending on the usage pattern, it can decide: this object used to have a single replica, but it is now
being accessed on multiple nodes, so maybe it should really be replicated on multiple nodes. That's a decision
that can be taken while the system is running, based on the usage pattern of the different objects.
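
A minimal sketch of the per-processor translation path might look like the following. It assumes the scheme described above (a local translation table backed by a miss-handling table); the types and names are illustrative stand-ins, not Tornado's code.

    #include <functional>
    #include <unordered_map>

    using ObjRef = int;                  // the same reference value on every processor
    struct Rep { /* per-processor state of a clustered object */ };

    struct Processor {
        int cpu_id;
        std::unordered_map<ObjRef, Rep*> translation;   // ref -> local representative
        // Per-object miss handlers: decide whether to reuse an existing rep or create one.
        std::unordered_map<ObjRef, std::function<Rep*(int cpu)>> miss_handlers;

        Rep* resolve(ObjRef ref) {
            if (auto it = translation.find(ref); it != translation.end())
                return it->second;                        // fast path: already mapped locally
            auto mh = miss_handlers.find(ref);
            if (mh == miss_handlers.end())
                return resolveViaGlobalMissHandler(ref);  // handler lives on another node
            Rep* rep = mh->second(cpu_id);                // pick or create a representative
            translation[ref] = rep;                       // install; later calls take the fast path
            return rep;
        }

        Rep* resolveViaGlobalMissHandler(ObjRef) {
            // In the real system this consults the partitioned global table to find the
            // node owning the object miss handler, obtains a representative, and installs
            // it locally. Elided in this sketch.
            return nullptr;
        }
    };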
246 - Non Hierarchical Locking

Let's say that in this example there are two threads, T1 and T2, and they are mapped to the same processor, which
means they share the same Process object.
• If T1 incurs a page fault, it goes through the Process object to the Region that corresponds to this particular
page fault. What needs to happen in order to service this page fault?
• We might do hierarchical locking: say, lock the Process object, then lock the FCM object that is backing
the Region, then lock the COR object that will do the actual I/O work.
• Suppose I do this, and now a second thread T2 incurs a page fault on the same processor. It shares the
same Process object, but the page fault may be in a different region, e.g. Region 2.
• But if we have locked the Process object to service this page fault, then we cannot get past this Process
object, because the lock is held on behalf of T1, even though we may have other cores available.
• So this hierarchical locking kills concurrency and that's a bad idea.

If I want integrity for these objects, I want to be able to lock the critical data structures. But if the path taken by
T1's page fault is different from the path taken by T2's page fault, why lock the Process object in the first place?
The answer is that we don't need to lock it.
However, we do still need to make sure that the Process object is not moved to a different processor, due to
something like OS load balancing. So instead of hierarchical locking, we will associate a reference count with
the Process object to provide an existence guarantee.
• If T1 has a page fault, it first goes to this Process object. Then it will increment the reference count for
the Process object, saying that this object is in use, please don't get rid of it. Subsequently, if T2 also has
a page fault, it will access the same Process object and increment the reference count.
• The whole point of having a reference count is to prevent the Process object from being moved to a
different processor by some OS entity such as a process migration facility for load-balance reasons.
• Reference count > 0 indicates that the Process object is currently servicing some requests locally. In this
way we can ensure the existence guarantee for this Process object and integrity of this object.
• Since we are not using a lock, page fault service can happen in parallel for independent Regions of the
address space touched by threads that are running on the same processor, perhaps executing on different
cores of that processor.
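
Here is a minimal sketch of the existence-guarantee idea: instead of holding the Process object's lock across the whole fault, callers pin it with a reference count so it cannot be destroyed or migrated while in use. The class and method names are illustrative.

    #include <atomic>

    struct ProcessObject {
        std::atomic<int> refs{0};

        void pin()   { refs.fetch_add(1, std::memory_order_acquire); }  // on entry to a page fault
        void unpin() { refs.fetch_sub(1, std::memory_order_release); }  // when the fault service is done

        // A load balancer or process-migration facility would check this before
        // moving or destroying the object.
        bool canMigrateOrDestroy() const {
            return refs.load(std::memory_order_acquire) == 0;
        }
    };

    // Usage sketch: T1 and T2 each pin the same Process object and then proceed in
    // parallel down their own Region/FCM paths, with no shared lock held across objects.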
247 - Non Hierarchical Locking (cont)

The reference count ensures the existence guarantee of the root level objects (for example, Process object).
Of course if these page faults for T1 and T2 are accessing the same Region of memory, you have no choice except
to go to the same region object.
But there again, this is something that the operating system can monitor over time and see if, even though it is the
same region, maybe this region itself can be carved up into sub regions and promote concurrency.
Going even further, you could have a region for every virtual page, but that might be too much, right? That's the
reason you don't want the page table to degenerate into such a fine-grained partitioning. But you do want to think
about the right granularity of regions to promote concurrency for services like this on the multiprocessor.
So coming back again to the hierarchical locking.
(Original paper:
One of the key problems with respect to locking is its overhead. In addition to the basic instruction overhead
of locking, there is the cost of the extra cache coherence traffic due to the intrinsic write-sharing of the lock. With
Tornado’s object-oriented structure it is natural to encapsulate all locking within individual objects. This helps
reduce the scope of the locks and hence limits contention. Moreover, the use of clustered objects helps limit
contention further by splitting single objects into multiple representatives thus limiting contention for any one
lock to the number of processors sharing an individual representative. This allows us to optimize for the
uncontended case. We use highly efficient spin-then-block locks, that require only two dedicated bits from any
word (such as the lower bits of an aligned pointer), at a total cost of 20 instructions for a lock/unlock pair in the
uncontended case.)
The key to avoiding hierarchical locking in Tornado is to make the locking encapsulated in individual objects.
Locking is encapsulated in the individual object and you're reducing the scope of the lock to that particular object.
So if there's a replica of this region, then a lock for a particular replica is only limited to that replica. And not
across all the replicas of a particular region. That's important, it reduces the scope of the lock. And therefore it
limits the contention for the lock.
But of course it is incumbent on the service provider to make sure that if a particular region is replicated, the
integrity of that replication is guaranteed by the operating system through a protected procedure call mechanism
that keeps the replicas consistent with one another. Even if the hardware provides cache coherence, there is no
way for it to guarantee that these replicas will be consistent, because they live in different physical memory
locations; therefore, it is the responsibility of the operating system to make sure that the replicas are kept
consistent with one another.
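
For the lock itself, the paper mentions highly efficient spin-then-block locks. The following is a rough C++ sketch in that spirit: spin briefly for the uncontended or short-hold case, then block. It uses a full atomic word plus a condition variable for clarity rather than the paper's two-bit encoding, so it does not reflect Tornado's actual implementation or its 20-instruction cost.

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    class SpinThenBlockLock {
        std::atomic<bool> held{false};
        std::mutex m;                    // used only on the slow (blocking) path
        std::condition_variable cv;
    public:
        void lock() {
            for (int i = 0; i < 100; ++i)                    // short spin phase
                if (!held.load(std::memory_order_relaxed) &&
                    !held.exchange(true, std::memory_order_acquire))
                    return;
            std::unique_lock<std::mutex> g(m);               // then block until available
            cv.wait(g, [&] { return !held.exchange(true, std::memory_order_acquire); });
        }
        void unlock() {
            {
                std::lock_guard<std::mutex> g(m);            // publish the release under the mutex
                held.store(false, std::memory_order_release);
            }
            cv.notify_one();
        }
    };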
248 - Dynamic Memory Allocation

Dynamic memory allocation is another important service that is part of memory management. It's important, once
again, to make sure that memory allocation scales with the size of the system.

In order to do that, one possibility is to take the heap space of the process and break it up.
For example, this is the logical address space of a multi-threaded application and everything in the logical
address space is shared.
But now we are going to take this heap portion of the address space and break it up into the portion of physical
memories that are associated with the nodes on which these threads are executing.
Suppose the mapping of the threads of this particular application is such that T1 and T2 are executing on N1
and T3 and T4 are executing on N2.
It's a NUMA machine, so there's a physical memory that is local to node N1 and N2.
For dynamic memory allocation requests, we're going to break up the heap and say that this portion of the heap
fits in the physical memory that is close to N1. This portion of the heap fits in the physical memory that is close
to N2.
So dynamic memory allocation requests from T1 and T2 are satisfied from the portion of the heap near N1, while
requests from T3 and T4 are satisfied from the portion near N2.
That allows for a scalable implementation of dynamic memory allocation. The other side benefit you get by
breaking up the heap space across these distinct physical memories is that it can avoid false sharing across the
nodes of the parallel machine.
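
A minimal sketch of such per-node heap partitioning is shown below. Each node gets its own arena, so allocations from threads on that node come from local memory and different nodes never contend on a single allocator lock. The arena layout, the bump allocation, and the currentNode() helper are all illustrative assumptions.

    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct NodeArena {
        std::mutex lock;                 // contention is limited to threads on this node
        std::vector<std::byte> pool;     // stands in for node-local physical memory
        size_t next = 0;
        explicit NodeArena(size_t bytes) : pool(bytes) {}
        void* alloc(size_t n) {
            std::lock_guard<std::mutex> g(lock);
            if (next + n > pool.size()) return nullptr;   // out of local memory
            void* p = pool.data() + next;
            next += n;                                    // simple bump allocation for the sketch
            return p;
        }
    };

    // One arena per NUMA node; a thread allocates from the arena of the node it runs on.
    std::vector<NodeArena*> arenas;
    int currentNode() { return 0; }      // placeholder; a real system queries the OS/hardware

    void* numaAlloc(size_t n) { return arenas[currentNode()]->alloc(n); }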
249 - IPC

So similar to microkernel-based OS design that we have discussed before, functionalities in the Tornado OS are
contained in these clustered objects. These clustered objects have to communicate with one another in order to
implement the services.

So we need efficient inter process communication between objects.


For instance, the FCM object may need to contact the DRAM object in order to get a page frame. In this case, the
FCM object is a client and the DRAM object is a server. The request is satisfied through IPC, realized by a
protected procedure call (PPC) mechanism.
• It's very similar to the LRPC paper on how we can have efficient communication without a context switch.
So for local PPC, you don't have to have a context switch, because you can implement this by handoff
scheduling between the calling object and the called object on the same processor.
• On the other hand, if the called object is on a remote processor, then you have to have a full context switch
in order to go across to the other processor and execute the protected procedure call.
And this IPC mechanism is fundamental to the Tornado system, both for implementing any service as a collection
of clustered objects, and for managing the replicas of objects.
So for instance, I mentioned that you might decide based on usage pattern that I want to have replicas of the
region object which represents a particular portion of the address space. If I replicate it, then I have to make sure
that the replicas remain consistent.
When you modify one replica, you have to make a PPC to the other replicas to reflect the changes you made in
the first replica. All of this happens under the covers.
The key thing is that the management of replicas is done in software, without relying on hardware cache
coherence. Hardware cache coherence only works on a given physical memory location; once an object is
replicated, the copies are not in the same physical memory anymore, and the replication is known only to the
system software. So the management of the replicas has to be done by the operating system.
The implementation of our PPC facility involves on demand creation of server threads to handle calls and the
caching of the threads for future calls.
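
The dispatch decision for a PPC can be sketched as follows. This is a rough illustration, assuming an LRPC-style handoff on the local processor and a full context switch for a remote call; the function names are placeholders, not Tornado's interface.

    #include <functional>

    struct ObjectRep {
        int home_cpu;                                   // processor that owns this representative
    };

    int  thisCpu() { return 0; }                        // placeholder for a real CPU-id query
    void handoffOnSameCpu(std::function<void()> fn) { fn(); }                 // no context switch: run in caller's context
    void remoteCallWithSwitch(int /*cpu*/, std::function<void()> fn) { fn(); } // full switch to the remote CPU elided

    void protectedProcedureCall(ObjectRep& callee, std::function<void()> body) {
        if (callee.home_cpu == thisCpu())
            handoffOnSameCpu(std::move(body));          // local: handoff scheduling, like LRPC
        else
            remoteCallWithSwitch(callee.home_cpu, std::move(body));  // remote: pay for the switch
    }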
250 - Tornado Summary

So to summarize the Tornado features


It's an object-oriented design, which promotes scalability. The idea of clustered objects and protected procedure
calls is mainly with a view to preserving locality while ensuring concurrency.
We saw how reference counting is used in the implementation of the objects so that you don't have to have
hierarchical locking of objects.
The scope of a lock held by an object is confined to that object; it doesn't span across objects or its replicas.
That's very important, because that's what promotes concurrency. It also means that careful management of the
reference count mechanism is needed to provide existence guarantees and garbage collection of objects based on
their reference counts.
Multiple implementations are possible for the same operating system object. For instance, you may start with a
low-overhead version when scalability is not important. Then you might notice that a particular operating system
object is experiencing a lot of contention and decide to go for a more scalable implementation of it. This is where
incremental optimization and dynamic adaptation of object implementations come into play. The other important
principle used in Tornado is optimizing the common case.
I mentioned, when we talked about page fault handling, that it is something that happens quite often. On the
other hand, destroying a portion of the address space because the application does not need it any more, which is
called region destruction, happens fairly infrequently, so if it takes more time, that's okay. That's where the
principle of optimizing the common case comes in.
No hierarchical locking through the reference counting mechanism.
Limiting the sharing of operating system data structures, by replicating critical data structures and managing
the replicas under the covers, is a key property of Tornado that promotes scalability and concurrency.
251 - Summary of Ideas in Corey System

The main principle in structuring an operating system for a shared memory multiprocessor is to limit the sharing
of kernel data structures, since such sharing both limits concurrency and increases contention. This principle finds
additional corroboration in the Corey operating systems research that was done at MIT, which wants to involve
the applications that run on top of the operating system by having them give hints to the kernel.

The idea of the Corey System is similar to what we saw in Tornado, namely you want to reduce the amount of
sharing.
The Corey system has the idea of address ranges in an application, which is similar to the Region concept in
Tornado. In Tornado, the application doesn't know anything about the Region; the application only sees its address
space, but the OS partitions the page table into different regions if it sees that user threads are accessing different
regions of the address space. That way we can have concurrency among the page fault service handling for the
different regions.
Corey has a similar idea, except that here the address ranges are exposed to the application. In other words, a
thread says, this is the range in which I'm going to operate, and if the kernel knows the address range in which a
particular thread is going to operate, it can use that as a hint for where to run the thread. If a bunch of threads are
touching the same address range, maybe you want to put them on the same processor.
Similarly, shares is another concept, and the idea here is that an application thread can say that here is the data
structure, here is a system facility that I'm going to use, but I don’t want to share it with anybody. An example
would be a file. The file descriptor is a data structure that the OS maintains. If a thread of that application opens
the file and it's not going to share that file with anybody else, it can communicate that intent through the shares
mechanism. That hint is useful for the kernel to optimize shared data structures. In particular if you have a multi-
core processor and if I have threads of an application running on multiple cores, I don't have to worry about the
consistency of that file descriptor across all these cores. That gives an opportunity for reducing the amount of
work that the OS has to do in managing shared data structures.
Another facility is dedicated cores. If you have multiple cores, you might as well dedicate some of the cores
for kernel activity. And that way, we can confine the locality of kernel data structures to a few cores. And not
have to worry about moving data between the cores.
So all of the techniques being proposed in the Corey system are really trying to attack the fundamental problem
that there is a huge latency involved when you have to communicate across cores, or when you have to go outside
the core into the memory subsystem. All of these facilities are trying to reduce the amount of inter-core
communication and core-to-memory communication.
252 - Virtualization

Through this lesson, I'm sure you have a good understanding and appreciation of the hard work that goes into the
implementation of an operating system on a shared memory multiprocessor, ensuring scalability of the basic
mechanisms like synchronization, communication, and scheduling. This is not done just once. It has to be done
for every new parallel architecture that comes to market that has a vastly different memory hierarchy compared
to its predecessors. Can we reduce the pain point of individually optimizing every operating system that runs on
a multi-processor? And how about device drivers, that form a big part of the code base of an operating system?
Do we have to reimplement them for every flavor of operating system that runs on a new machine? Can we
leverage third-party device drivers from the OEMs to reduce the pain point?

253 - Virtualization to the Rescue

To alleviate some of the pain points that I just mentioned, what we want to ask is the question, can virtualization
help? We've seen how virtualization is a powerful technique for hosting multiple operating systems images on
the same hardware without a significant loss of performance in the context of a single processor. Now, the
question is, can this idea be extended to a multiprocessor?

This is the thought experiment that was carried out at Stanford, in the cellular disco project.
Cellular Disco combines the idea of virtualization with the need for scalability in a parallel operating system.
Commensurate with the underlying hardware, there is a thin virtualization layer, which is the cellular disco
layer. The cellular disco layer manages the hardware resources namely CPU, the I/O devices, memory and so on.
Now the hairiest part in dealing with any operating system is the I/O management. Even in a desktop and PC
environment, most of the code is really third-party code (device driver code) that is sitting inside the operating
system.
In this thought experiment, what cellular disco does is to show by construction that you can alleviate some of
the pain points in building an operating system, especially with this I/O management.
So I'm going to focus on just the I/O management part and on how I/O is handled by the cellular disco layer
sitting in the middle between the virtual machine and the physical hardware.
• This particular thought experiment was conducted on a machine called the Origin 2000 from SGI (Silicon
Graphics). It's a 32 processor machine.
• The host operating system is a flavor of a UNIX operating system called IRIX.
• The VMM layer, Cellular Disco, sits in between the guest operating system and the host operating system.
• The way virtualization is done is a standard virtual machine trick: trap and emulate.
Let's just walk through what happens on an I/O request.

To run our Cellular Disco prototype, we first boot the IRIX 6.4 operating system with a minimal amount of
memory. Cellular Disco is implemented as a multi-threaded kernel process that spawns a thread on each CPU.
The threads are pinned to their designated processors to prevent the IRIX scheduler from interfering with the
control of the virtual machine monitor over the machine's CPUs. Subsequent actions performed by the monitor
violate the IRIX process abstraction, effectively taking over the control of the machine from the operating system.
After saving the kernel registers of the host operating system, the monitor installs its own exception handlers and
takes over all remaining system memory. The host IRIX 6.4 operating system remains dormant but can be
reactivated any time Cellular Disco needs to use a device driver.
Whenever one of the virtual machines created on top of Cellular Disco requests an I/O operation, the request is
handled by the procedure illustrated in Figure 3.
1. The I/O request causes a trap into Cellular Disco.
2. Cellular Disco checks access permissions and forwards the request to the host IRIX by restoring the saved
kernel registers and exception vectors.
3. Then Cellular Disco requests the host kernel to issue the appropriate I/O request to the device.
From the perspective of the host operating system, it looks as if Cellular Disco had been running all the time just
like any other well-behaved kernel process. After IRIX initiates the I/O request, control returns to Cellular Disco,
which puts the host kernel back into the dormant state.
4. Upon I/O completion the hardware raises an interrupt, which is handled by Cellular Disco because the
exception vectors have been overwritten.
5. To allow the host drivers to properly handle I/O completion, the Cellular Disco monitor reactivates the
dormant IRIX, making it look as if the I/O interrupt had just been posted.
6. Finally, Cellular Disco posts a virtual interrupt to the virtual machine to notify it of the completion of its
I/O request.
Since some drivers require that the kernel be aware of time, Cellular Disco forwards all timer interrupts in addition
to device interrupts to the host IRIX.
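
The six-step flow above can be summarized in a compact sketch. This is written as ordinary C++ functions purely for clarity; the real mechanism is traps, exception vectors, and saved kernel registers, and the names here are illustrative, not from the actual system.

    struct IORequest { int device; long offset; long length; };

    namespace host_irix {                      // dormant host kernel holding the device drivers
        void issueIO(const IORequest&) { /* the device driver does the real work */ }
        void handleIOCompletion()      { /* driver-side completion processing */ }
    }

    namespace cellular_disco {                 // thin virtualization layer
        bool permitted(const IORequest&) { return true; }   // step 2: check access permissions

        void postVirtualInterruptToGuest() { /* delivery into the VM elided */ }

        void onGuestIORequest(const IORequest& r) {          // step 1: the guest's I/O traps here
            if (!permitted(r)) return;
            // steps 2-3: reactivate the dormant host kernel and forward the request
            host_irix::issueIO(r);
            // control returns here; the host goes dormant again until the interrupt arrives
        }

        void onHardwareInterrupt() {                          // step 4: the monitor owns the vectors
            host_irix::handleIOCompletion();                  // step 5: let the host drivers finish
            postVirtualInterruptToGuest();                    // step 6: notify the guest VM
        }
    }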
The thought experiment was really to show by construction how to develop an OS for new hardware, without
completely rewriting the OS, by exploiting facilities that already exist in the host operating system. This should
remind you of how it was shown by construction that a microkernel design can be as efficient as a monolithic
design. Similarly, it was shown by construction that the Cellular Disco VMM can manage the resources of a
multiprocessor as well as a native OS.
More importantly, the overhead of providing the services needed by the guest OS in this virtualized setting can
be kept low; in other words, the virtualization can be efficient. It was shown that the overhead stays within 10%
for many applications that run on the guest operating system.
Instead of modifying the operating system, our approach inserts a software layer between the hardware and
the operating system. By applying an old idea in a new context, we show that our virtual machine monitor (called
Cellular Disco) is able to supplement the functionality provided by the operating system and to provide new
features.

254 - Conclusion

So that completes the last portion of this lesson module, where we visited the structure of parallel operating
systems. In particular we looked at Tornado as a case study.
This completes the second major module in the advanced operating systems course, namely parallel systems.
I expect you to carefully read the papers we have covered in this module which I've listed in the required readings
for the course and which served as the inspiration for the lectures that I gave you in this module.
I want to emphasize once more the importance of reading and understanding the performance sections of all
the papers, both to understand the techniques and methodologies therein, even if the actual results may not be
that relevant due to the dated nature of the systems on which they were implemented.
