A Pipelined Multi-core MIPS Machine: Hardware Implementation and Correctness Proof
Mikhail Kovalev, Silvia Melitta Müller, Wolfgang J. Paul
Lecture Notes in Computer Science 9000
Volume Editors
Mikhail Kovalev
Wolfgang J. Paul
Saarland University, Saarbrücken, Germany
E-mail: {kovalev,wjp}@wjpserver.cs.uni-saarland.de
Silvia M. Müller
IBM-Labor Böblingen, Böblingen, Germany
E-mail: [email protected]
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Digital Gates and Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Some Basic Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Clocked Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Digital Clocked Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 The Detailed Hardware Model . . . . . . . . . . . . . . . . . . . . . . 44
3.3.3 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Drivers and Main Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 Open Collector Drivers and Active Low Signal . . . . . . . . 55
3.5.2 Tristate Drivers and Bus Contention . . . . . . . . . . . . . . . . . 56
3.5.3 The Incomplete Digital Model for Drivers . . . . . . . . . . . . 60
3.5.4 Self Destructing Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.5 Clean Operation of Tristate Buses . . . . . . . . . . . . . . . . . . . 64
5 Arithmetic Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Adder and Incrementer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Arithmetic Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Arithmetic Logic Unit (ALU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 Shift Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5 Branch Condition Evaluation Unit . . . . . . . . . . . . . . . . . . . . . . . . . 113
7 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.1 MIPS ISA and Basic Implementation Revisited . . . . . . . . . . . . . . 162
7.1.1 Delayed PC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.1.2 Implementing the Delayed PC . . . . . . . . . . . . . . . . . . . . . . 163
7.1.3 Pipeline Stages and Visible Registers . . . . . . . . . . . . . . . . 164
7.2 Basic Pipelined Processor Design . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.2.1 Transforming the Sequential Design . . . . . . . . . . . . . . . . . . 167
7.2.2 Scheduling Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.2.3 Use of Invisible Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.2.4 Software Condition SC-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.2.5 Correctness Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.2.6 Correctness Proof of the Basic Pipelined Design . . . . . . . 178
7.3 Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.3.1 Hits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.3.2 Forwarding Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.3.3 Software Condition SC-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.3.4 Scheduling Functions Revisited . . . . . . . . . . . . . . . . . . . . . . 192
7.3.5 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.4 Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.4.1 Stall Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.4.2 Hazard Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.4.3 Correctness Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.4.4 Scheduling Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.4.5 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7.4.6 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
1 Introduction
Building on [12] and [6], we present at the gate level the construction of a
multi-core MIPS machine with “basic” pipelined processors and prove that
it works. “Basic” means that the processors only implement the part of the
instruction set architecture (ISA) that is visible in user mode; we call it ISA-u.
Extending it to the full architecture ISA-sp, which is visible in system programmers
mode, would require adding, among other things, the following mechanisms: i) local
and inter processor interrupts, ii) store buffers, and iii) memory
management units (MMUs). We plan to do this as future work. In Sect. 1.1
we present reasons why we think the results might be of interest. In Sect. 1.2
we give a short overview of the book.
1.1 Motivation
There are several reasons why we wrote this book, and they might also motivate
other scientists to read it.
Lecture Notes
The book contains the lecture notes of the third author’s lectures on Computer
Architecture 1 as given in the summer semester 2013 at Saarland University.
The purpose of ordinary architecture lectures is to enable students to draw
the building plans of houses, bridges, etc., and hopefully, also to explain why
they won’t collapse. Similarly, we try in our lectures on computer architecture
to enable students to draw the building plans of processors and to explain
why they work. We do this by presenting in the classroom a complete gate
level design of a RISC processor. We present correctness proofs because, for
nontrivial designs, they are the fastest way we know to explain why the designs
work. Because we live in the age of multi-core computing we attempted to
treat the design of such a processor in the classroom within a single semester.
With the help of written lecture notes this happened to work out. Indeed, a
student who had only 6 weeks of previous experience with hardware design
succeeded in implementing the processor presented here on a field programmable
gate array (FPGA) within 6 weeks after the end of the lectures [8].
To the best of our knowledge, this book contains the first correctness proof
for the gate level implementation of a multi-core processor.
As a building block for the processor design, the book contains a gate level
implementation of a cache consistent shared memory system and a proof that
it is sequentially consistent. That such shared memory systems can be imple-
mented is in a sense the fundamental folklore theorem of multi-core comput-
ing: i) everybody believes it; experimental evidence given by the computers
on our desk is indeed overwhelming, ii) proofs are widely believed to exist,
but iii) apparently nobody has seen the entire proof; for an explanation why
the usual model checked results for cache coherence protocols fail to prove
the whole theorem see the introduction of Chap. 8. To the best of our knowl-
edge, this book contains the first complete correctness proof for a gate level
implementation of a cache based sequentially consistent shared memory.
Proofs in this book are presented on paper; such proofs are often called “paper-
and-pencil proofs”. Because they are the work of humans they can contain er-
rors and gaps, which are hopefully small and easily repaired. With the help of
computers one can debug proofs produced by humans: one enters them in com-
puter readable form into so called computer aided verification systems (CAV
systems) and then lets the computer check the correctness and completeness
of the proof. Technically, this reduces to a syntax check in a formal language
of what the system accepts as a proof. This process is called formal verifica-
tion. Formal, i.e., machine readable proofs, are engineering objects. Like any
other engineering objects, it makes a lot of sense to construct them from a
building plan. Sufficiently precise paper and pencil proofs serve this purpose
extremely well. Proofs published in [12] led, for instance, very quickly to the
formal verification of single core processors of industrial complexity [1,3]. The
proofs in this book are, therefore, also meant as blueprints for later formal
verification work of shared memory systems and multi-core processors. This
explains why we have included some lengthy proofs of the bookkeeping type
in this text. They are only meant as help for verification engineers. At the
beginning of chapters, we give hints to these proofs: they can be skipped at a
first reading and should not be presented in the classroom. There is, however,
a benefit of going through these proofs: afterwards you feel more comfortable
when you skip them in the classroom.
Fig. 1. Functional correctness (1) is shown on level n. Proof obligation (3) is not
necessary for the proof of (1) but has to be discharged on level n in order to guarantee
implementation (2) at level n − 1. If level n is considered in isolation it drops out of
the blue
1.2 Overview
Chapter 2 contains some very basic material about number formats and
Boolean algebra. In the section on congruences we establish simple lemmas
which we use in Chap. 5 to prove the remarkable fact that binary numbers
and two’s complement numbers are correctly added and subtracted (modulo
2n ) by the very same hardware.
In Chap. 3, we define the digital and the physical hardware model and
show that, in common situations, the digital model is an abstraction of the
detailed model. For the operation of drivers, buses, and main memory compo-
nents we construct circuits which are controlled completely by digital signals,
and show in the detailed model that the control circuits operate the buses
without contention and the main memory without glitches. We show that
these considerations are crucial by constructing a bus control which i) has no
bus contention in the (incomplete) digital model and ii) has bus contention
for roughly 1/3 of the time in the detailed model. In the physical world, such
a circuit destroys itself due to short circuits.
Chapter 4 contains a collection of various RAM constructions that we need
later. Arithmetic logic units, shifters, etc. are treated in Chap. 5.
In Chap. 6, the basic MIPS instruction set architecture is defined and a
sequential reference implementation is given. We make the memory of this
machine as wide as the cache line size we use later. Thus, shifters have to
be provided for the implementation of load and store operations. Intuitively,
it is obvious that these shifters work correctly. Thus, in [12], where we did
not aim at formal verification, a proof that loads and stores were correctly
implemented with the help of these shifters was omitted. Subsequent formal
verification work only considered loads and stores of words and double words;
this reduced the shifters to trivial hardware which was easy to deal with. As
this text is also meant as a help for verification engineers, we included the
correctness proof for the full shifters. These proofs argue, in the end, about
the absence of carries in address computations across cache line boundaries
for aligned accesses. Writing these arguments down turned out to be slightly
trickier than anticipated.
With the exception of the digital hardware model and the shifters for load
and store operations, the book until this point basically contains updated and
revised material from the first few chapters in [12]. In Chap. 7, we obtain a
We begin in Sect. 2.1 with very basic definitions of intervals of integers. Be-
cause this book is meant as a building plan for formal proofs, we cannot make
definitions like [1 : 10] = {1, 2, . . . , 10} because CAV systems don’t understand
such definitions. So we replace them by fairly obvious inductive definitions.
We also have to deal with the minor technical nuisance that, usually, sequence
elements are numbered from left to right, but in number representations, it is
much nicer to number them from right to left.
Section 2.2 on modulo arithmetic was included for several reasons. i) The
notation mod k is overloaded: it is used to denote both the congruence relation
modulo a number k or the operation of taking the remainder of an integer
division by k. We prefer our readers to clearly understand this. ii) Fixed
point arithmetic is modulo arithmetic, so we will clearly have to make use
of it. The most important reason, however, is that iii) addition/subtraction
of binary numbers and of two’s complement numbers is done by exactly the
same hardware1. When we get to this topic this will look completely intuitive
and, therefore, there should be a very simple proof justifying this fact. Such a
proof can be found in Sect. 5.1; it hinges on a simple lemma about the solution
of congruence equations from this section.
The very short Sect. 2.3 on geometric sums is simply there to remind the
reader of the proof of the formula for the computation of geometric sums,
which is much easier to memorize than the formula itself.
Section 2.4 introduces the binary number format, presents the school
method for binary addition, and proves that it works. Although this looks
simple and familiar and the correctness proof of the addition algorithms is
only a few lines long, the reader should treat this result with deep respect:
it is probably the first time that he or she sees a proof of the fact that the
addition algorithm he or she learned at school always works. The Old Romans, who
were fabulous engineers in spite of their clumsy number systems, would have
loved to see this proof.
1
Except for the computation of overflow and negative signals.
2.1 Basics
2.1.1 Numbers, Sets, and Logical Connectives
We denote by
N = {0, 1, 2, . . .}
the set of natural numbers including zero, by
N+ = {1, 2, . . .}
the set of positive natural numbers, and by
Z = {. . . , −2, −1, 0, 1, 2, . . .}
the set of integers. For integers i ≤ j, intervals of integers are defined inductively by
[i : i] = {i}
[i : j + 1] = [i : j] ∪ {j + 1} .
For a singleton set {x} we denote by ε the operator which extracts its unique element:
ε{x} = x .
For finite sets A, we denote by #A the cardinality, i.e., the number of elements
in A.
Given a function f operating on a set A and a set A1 ⊆ A, we denote by
f (A1 ) the image of set A1 under function f , i.e.,
f (A1 ) = {f (a) | a ∈ A1 } .
a : [0 : n − 1] → A .
A^n = {a | a : [0 : n − 1] → A} .
2. We can also number the elements of a sequence from left to right starting with 1. Then we write
a = (a_1, . . . , a_n) = a[1 : n]
and the sequence is defined as
a : [1 : n] → A .
The set A^n of such sequences is then formalized as
A^n = {a | a : [1 : n] → A} .
3. We can also number the elements in a sequence a from right to left starting
with 0. Then we write
a = (a_{n−1}, . . . , a_0) = a[n − 1 : 0] ,
a : [0 : n − 1] → A .
A^n = {a | a : [0 : n − 1] → A} .
Thus, the direction of ordering does not show up in the formalization yet.
The reason is, that the interval [0 : n − 1] is a set, and elements of sets
are unordered. The difference, however, will show up when we formalize
operations on sequences.
The concatenation operator ◦ is defined for sequences a[n − 1 : 0], b[m − 1 : 0]
numbered from right to left and starting with 0 as
∀i ∈ [n + m − 1 : 0] : (a ◦ b)[i] = b[i] if i < m, and a[i − m] if i ≥ m ,
or, respectively, for sequences a[0 : n − 1], b[0 : m − 1] numbered from left to
right as
∀i ∈ [0 : n + m − 1] : (a ◦ b)[i] = a[i] if i < n, and b[i − n] if i ≥ n .
For sequences a[1 : n] and b[1 : m] numbered from left to right, concatenation
is defined as
∀i ∈ [1 : n + m] : (a ◦ b)[i] = a[i] if i ≤ n, and b[i − n] if i > n .
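To make the effect of the numbering direction concrete, here is a small Python sketch (ours, not part of the book) that models bit-strings numbered from right to left as Python lists indexed by the bit position, and implements the concatenation operator ◦ as defined above.

```python
# Sketch: bit-strings a[n-1:0] numbered from right to left, modeled as lists
# where a[i] is the bit at position i (so the Python list reads "backwards"
# compared to the usual left-to-right way of writing the string).

def concat(a, b):
    """(a ◦ b)[i] = b[i] for i < m and a[i - m] for i >= m, with m = len(b)."""
    m = len(b)
    return [b[i] if i < m else a[i - m] for i in range(len(a) + m)]

def show(a):
    """Print a right-to-left numbered bit-string the usual way, a[n-1] first."""
    return "".join(str(bit) for bit in reversed(a))

# Example: a = 10 (a_1 = 1, a_0 = 0), b = 11; a ◦ b = 1011.
a = [0, 1]          # a[0] = 0, a[1] = 1
b = [1, 1]
print(show(concat(a, b)))   # prints 1011
```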
The n-fold repetition of a symbol (or sequence) x is written informally as
x^n = x ◦ · · · ◦ x   (n times)
and formally by
x^1 = x
x^{n+1} = x ◦ x^n .
For example,
1^2 = 11
0^4 = 0000 .
In these examples and later in the book, we often omit ◦ when denoting the
concatenation of bit-strings x1 and x2 :
x1 x2 = x1 ◦ x2 .
a = (an−1 , . . . , a0 ) .
We have to point out that in mathematics the three letter word “mod” is not
only used for the relation defined above. It is also used as a binary operator in
which case (a mod k) denotes the representative of a in [0 : k − 1]. Let a, b ∈ Z
and k ∈ N+ . Then,
From Lemma 2.3 we infer a simple but useful lemma about the solution of
equivalences mod k.
Lemma 2.5 (solution of equivalences). Let k be even and x ≡ y mod k,
then
1. x ∈ [0 : k − 1] → x = (y mod k) ,
2. x ∈ [−k/2 : k/2 − 1] → x = (y tmod k) .
Σ_{i=0}^{n−1} 2^i = 2^n − 1 .
⟨a⟩ = Σ_{i=0}^{n−1} a_i · 2^i .
For example,
⟨100⟩ = 4
⟨111⟩ = 7
⟨1 0^n⟩ = 2^n .
⟨1^n⟩ = Σ_{i=0}^{n−1} 2^i = 2^n − 1 ,
i.e., the largest binary number representable with n bits corresponds to the
natural number 2n − 1.
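The binary number interpretation ⟨·⟩ and the examples above are easy to replay in a few lines of Python. The following sketch is our own illustration (the helper names binval and bits are not from the book); it keeps the right-to-left numbering, so bit i carries weight 2^i.

```python
def binval(a):
    """<a> = sum of a_i * 2^i for a bit-string a[n-1:0] given as a list a[i]."""
    return sum(bit << i for i, bit in enumerate(a))

def bits(s):
    """Turn the usual left-to-right string notation into the list form."""
    return [int(c) for c in reversed(s)]

assert binval(bits("100")) == 4
assert binval(bits("111")) == 7
n = 5
assert binval(bits("1" + "0" * n)) == 2 ** n      # <1 0^n> = 2^n
assert binval(bits("1" * n)) == 2 ** n - 1        # <1^n> = 2^n - 1
print("binary interpretation examples check out")
```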
Note that binary number interpretation is an injective function.
Lemma 2.7 (binary representation injective). Let a, b ∈ B^n. Then,
⟨a⟩ = ⟨b⟩ → a = b .
Proof. Assume a ≠ b and let j be the largest index with a_j ≠ b_j; without loss
of generality a_j = 1 and b_j = 0. Since the bits above position j agree, we get
⟨a⟩ − ⟨b⟩ = Σ_{i=0}^{j} a_i · 2^i − Σ_{i=0}^{j} b_i · 2^i
≥ 2^j − Σ_{i=0}^{j−1} 2^i
= 1
by Lemma 2.6.
Let n ∈ N+. We denote by
⟨B^n⟩ = {⟨a⟩ | a ∈ B^n}
the set of natural numbers that have a binary representation of length n. Since
0 ≤ ⟨a⟩ ≤ Σ_{i=0}^{n−1} 2^i = 2^n − 1 ,
we deduce
⟨B^n⟩ ⊆ [0 : 2^n − 1] .
As ⟨·⟩ is injective and
#⟨B^n⟩ = #B^n = 2^n = #[0 : 2^n − 1] ,
we observe that ⟨·⟩ is bijective and get the following lemma.
Lemma 2.8 (natural numbers with binary representation). For n ∈ N+ we have
⟨B^n⟩ = [0 : 2^n − 1] .
For x ∈ ⟨B^n⟩ we denote the binary representation of x of length n by bin_n(x):
bin_n(x) = ε{a | a ∈ B^n ∧ ⟨a⟩ = x} .
To shorten notation even further, we write x_n instead of bin_n(x):
x_n = bin_n(x) .
It is often useful to decompose n bit binary representations a[n − 1 : 0] into an
upper part a[n − 1 : m] and a lower part a[m − 1 : 0]. The connection between
the numbers represented is stated in Lemma 2.9.
Lemma 2.9 (decomposition). Let a ∈ B^n and n ≥ m. Then,
⟨a[n − 1 : 0]⟩ = ⟨a[n − 1 : m]⟩ · 2^m + ⟨a[m − 1 : 0]⟩ .
Proof.
⟨a[n − 1 : 0]⟩ = Σ_{i=m}^{n−1} a_i · 2^i + Σ_{i=0}^{m−1} a_i · 2^i
= Σ_{j=0}^{n−1−m} a_{m+j} · 2^{m+j} + ⟨a[m − 1 : 0]⟩
= 2^m · Σ_{j=0}^{n−1−m} a_{m+j} · 2^j + ⟨a[m − 1 : 0]⟩
= 2^m · ⟨a[n − 1 : m]⟩ + ⟨a[m − 1 : 0]⟩ .
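Lemma 2.9 is easy to check exhaustively for small n. The following Python sketch is our own illustration (reusing the binval idea from the earlier sketch); it compares both sides of the decomposition for all a ∈ B^n and all split positions m ≤ n.

```python
from itertools import product

def binval(a):
    # <a> with a given as a tuple (a_0, ..., a_{n-1}), i.e. bit i has weight 2^i
    return sum(bit << i for i, bit in enumerate(a))

n = 6
for a in product((0, 1), repeat=n):          # a = (a_0, ..., a_{n-1})
    for m in range(n + 1):
        low, high = a[:m], a[m:]             # a[m-1:0] and a[n-1:m]
        # Lemma 2.9: <a[n-1:0]> = <a[n-1:m]> * 2^m + <a[m-1:0]>
        assert binval(a) == binval(high) * 2 ** m + binval(low)
print("decomposition lemma verified for n =", n)
```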
We obviously have
⟨a[n − 1 : 0]⟩ ≡ ⟨a[m − 1 : 0]⟩ mod 2^m .
Using Lemma 2.5, we infer the following lemma.
Proof. By induction on n. For n = 0 this follows directly from (1). For the
induction step we conclude from n − 1 to n:
The following simple lemma allows breaking the addition of two long num-
bers into two additions of shorter numbers. It is useful, among other things,
for proving the correctness of recursive addition algorithms (as applied in
recursive hardware constructions of adders and incrementers).
Lemma 2.12 (decomposition of binary addition). For a, b ∈ B^n, for
d, e ∈ B^m, and for c_0, c′, c′′ ∈ B, let
⟨c′ t⟩ = ⟨d⟩ + ⟨e⟩ + c_0 and ⟨c′′ s⟩ = ⟨a⟩ + ⟨b⟩ + c′ ,
then
⟨a d⟩ + ⟨b e⟩ + c_0 = ⟨c′′ s t⟩ .
Repeatedly using Lemma 2.9, we have
T_n = {[a] | a ∈ B^n}
Proof. The first line is trivial. The second line follows from
[a] − ⟨a⟩ = −a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩ − (a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩)
= −a_{n−1} · 2^n .
This shows the third line. The fourth line follows from
[¬a] = −¬a_{n−1} · 2^{n−1} + Σ_{i=0}^{n−2} ¬a_i · 2^i
= −(1 − a_{n−1}) · 2^{n−1} + Σ_{i=0}^{n−2} (1 − a_i) · 2^i
= −2^{n−1} + Σ_{i=0}^{n−2} 2^i + a_{n−1} · 2^{n−1} − Σ_{i=0}^{n−2} a_i · 2^i
= −1 − [a] . (Lemma 2.6)
We conclude the discussion of binary numbers and two’s complement numbers
with a lemma that provides a subtraction algorithm for binary numbers.
Lemma 2.15 (subtraction for binary numbers). Let a, b ∈ Bn . Then,
a − b ∈ Bn .
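As an illustration only (the exact identity below and the helper names are ours, not a restatement of Lemma 2.15), the following Python sketch checks the standard trick behind binary subtraction algorithms: subtracting ⟨b⟩ modulo 2^n is the same as adding the bitwise complement of b plus one, because ⟨¬b⟩ = 2^n − 1 − ⟨b⟩.

```python
from itertools import product

def binval(a):
    return sum(bit << i for i, bit in enumerate(a))

def complement(a):
    return tuple(1 - bit for bit in a)

n = 5
for a in product((0, 1), repeat=n):
    for b in product((0, 1), repeat=n):
        # <a> - <b>  =  <a> + <complement(b)> + 1   (mod 2^n)
        lhs = (binval(a) - binval(b)) % 2 ** n
        rhs = (binval(a) + binval(complement(b)) + 1) % 2 ** n
        assert lhs == rhs
print("binary subtraction via complement-and-add-one checked for n =", n)
```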
C = {0, 1}
V = {x0 , x1 , . . .}
F = {f0 , f1 , . . .} .
and denote the number of arguments of function f_i by n_i. Now we can
define the set BE of Boolean expressions by the following rules:
1. Constants and variables are Boolean expressions:
C ∪ V ⊂ BE .
2. Boolean expressions can be negated:
e ∈ BE → (¬e) ∈ BE .
3. Two Boolean expressions can be combined by a binary operator:
e, e′ ∈ BE ∧ ◦ ∈ {∧, ∨, ⊕} → (e ◦ e′) ∈ BE .
1. Substitute a_i for x_i:
x_i(a) = a_i .
2. If e = (¬e′), then evaluate e(a) by evaluating e′(a) and negating the result
according to the predefined meaning of negation in Table 3:
(¬e′)(a) = ¬e′(a) .
3. If e = (e′ ◦ e′′), then evaluate e(a) by evaluating e′(a) and e′′(a) and then
combining the results according to the predefined meaning of ◦ in Table 3:
(e′ ◦ e′′)(a) = e′(a) ◦ e′′(a) .
4. Expressions of the form e = f_i(e_1, . . . , e_{n_i}) can only be evaluated if the
symbol f_i has an interpretation as a function
f_i : B^{n_i} → B .
In this case evaluate f_i(e_1, . . . , e_{n_i})(a) by evaluating the arguments e_j(a), sub-
stituting the results into f and evaluating f:
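The evaluation rules above translate almost literally into a small recursive evaluator. The following Python sketch is ours (the book works with abstract expressions, not with any concrete data structure): expressions are nested tuples with ('not', e) for negation, (op, e1, e2) for op ∈ {'and', 'or', 'xor'}, variables written 'x1', 'x2', ..., and the constants 0 and 1.

```python
def evaluate(e, a):
    """Evaluate a Boolean expression e under substitution a = (a_1, ..., a_n)."""
    if e in (0, 1):                       # rule for constants
        return e
    if isinstance(e, str):                # variable 'x3' -> a_3 (rule 1)
        return a[int(e[1:]) - 1]
    if e[0] == 'not':                     # rule 2: evaluate and negate
        return 1 - evaluate(e[1], a)
    op, e1, e2 = e                        # rule 3: evaluate both sides, combine
    v1, v2 = evaluate(e1, a), evaluate(e2, a)
    if op == 'and':
        return v1 & v2
    if op == 'or':
        return v1 | v2
    return v1 ^ v2                        # 'xor'

# Example: e = (x1 AND (NOT x2)) OR x3 under a = (1, 0, 0) evaluates to 1.
e = ('or', ('and', 'x1', ('not', 'x2')), 'x3')
print(evaluate(e, (1, 0, 0)))             # prints 1
```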
The following small example shows that this very formal and detailed set of
rules captures our usual way of evaluating expressions:
e = e′ ,
where e and e′ are expressions involving variables x = x[1 : n]. They come in
two flavors:
• Identities. An equation e = e′ is an identity iff for any substitution of the
variables a = a[1 : n] ∈ B^n, expressions e and e′ evaluate to the same value
in B:
∀a ∈ B^n : e(a) = e′(a) .
• Equations which one wants to solve. A substitution a = a[1 : n] ∈ B^n
solves equation e = e′ if e(a) = e′(a).
We observe that identities and equations we want to solve do differ formally
in the implicit quantification. If not stated otherwise, we usually assume equa-
tions to be of the first type, i.e., to be implicitly quantified over all free vari-
ables. This is also the case with definitions of functions, where the left-hand
side of an equation represents an entity being defined. For instance, the fol-
lowing definition of the function
f (x1 , x2 ) = x1 ∧ x2
is the same as
∀a, b ∈ B : f (a, b) = a ∧ b .
We may also write
e ≡ e′
to stress that a given equation is an identity, or to avoid brackets in case
this equation is a definition and the right-hand side itself contains an equality
sign.
In case we talk about several equations in a single statement (this is often
the case when we solve equations), we assume implicit quantification over the
whole statement rather than over every single equation. For instance,
e1 = e2 ↔ e3 = 0
is the same as
∀a ∈ Bn : e1 (a) = e2 (a) ↔ e3 (a) = 0
and means that, for any given substitution a, equations e1 and e2 evaluate
to the same value if and only if equation e3 evaluates to 0. In other words,
equations e1 = e2 and e3 = 0 have the same set of solutions.
In Boolean algebra there is a very simple connection between the solution
of equations and identities. An identity e ≡ e′ holds iff equations e = 1 and
e′ = 1 have the same set of solutions.
Lemma 2.16 (identity from solving equations). Given Boolean expres-
sions e and e′ with inputs x[1 : n], we have
e ≡ e′ ↔ ∀a ∈ B^n : (e(a) = 1 ↔ e′(a) = 1) .
Proof. The direction from left to right is trivial. For the other direction we
distinguish cases:
• e(a) = 1. Then e′(a) = 1 by hypothesis.
• e(a) = 0. Then e′(a) = 1 would by hypothesis imply the contradiction
e(a) = 1. Because in Boolean algebra e′(a) ∈ B we conclude e′(a) = 0.
Thus, we have e(a) = e′(a) for all a ∈ B^n.
2.6.1 Identities
• Commutativity:
x1 ∧ x2 ≡ x2 ∧ x1
x1 ∨ x2 ≡ x2 ∨ x1
x1 ⊕ x2 ≡ x2 ⊕ x1
• Associativity:
(x1 ∧ x2 ) ∧ x3 ≡ x1 ∧ (x2 ∧ x3 )
(x1 ∨ x2 ) ∨ x3 ≡ x1 ∨ (x2 ∨ x3 )
(x1 ⊕ x2 ) ⊕ x3 ≡ x1 ⊕ (x2 ⊕ x3 )
• Distributivity:
• Identity:
x1 ∧ 1 ≡ x1
x1 ∨ 0 ≡ x1
• Idempotence:
x1 ∧ x1 ≡ x1
x1 ∨ x1 ≡ x1
• Annihilation:
x1 ∧ 0 ≡ 0
x1 ∨ 1 ≡ 1
• Absorption:
x1 ∨ (x1 ∧ x2 ) ≡ x1
x1 ∧ (x1 ∨ x2 ) ≡ x1
• Complement:
x1 ∧ ¬x1 ≡ 0
x1 ∨ ¬x1 ≡ 1
• Double negation:
¬¬x1 ≡ x1
• De Morgan’s laws:
¬(x1 ∧ x2) ≡ ¬x1 ∨ ¬x2
¬(x1 ∨ x2) ≡ ¬x1 ∧ ¬x2
Each of these identities can be proven in a simple brute force way: if the
identity has n variables, then for each of the 2n possible substitutions of the
variables the left and right hand sides of the identities are evaluated with the
help of Table 3. If for each substitution the left hand side and the right hand
side evaluate to the same value, then the identity holds. For the first of de
Morgan’s laws this is illustrated in Table 4.
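The brute-force method is a one-liner on a machine. The following sketch is ours; it checks the first of De Morgan's laws by enumerating all 2^n substitutions, exactly as Table 4 does by hand, and can be reused for any of the identities above.

```python
from itertools import product

def check_identity(lhs, rhs, n):
    """Return True iff lhs(a) == rhs(a) for all substitutions a in B^n."""
    return all(lhs(*a) == rhs(*a) for a in product((0, 1), repeat=n))

# De Morgan: NOT(x1 AND x2) == (NOT x1) OR (NOT x2), checked over all 4 substitutions.
assert check_identity(lambda x1, x2: 1 - (x1 & x2),
                      lambda x1, x2: (1 - x1) | (1 - x2), 2)
# Absorption: x1 OR (x1 AND x2) == x1
assert check_identity(lambda x1, x2: x1 | (x1 & x2),
                      lambda x1, x2: x1, 2)
print("identities hold")
```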
(e1 ∨ e2 ) = 1 ↔ e1 = 1 ∨ e2 = 1 .
∀f : Bn → B : ∃e : f (x) ≡ e .
Then,
Thus, we have
m(a) = 1 ↔ x = a . (3)
6
The term switching function comes from electrical engineering and stands for a
Boolean function.
We define the support S(f) of f as the set of arguments a where f takes the
value f(a) = 1:
S(f) = {a | a ∈ B^n ∧ f(a) = 1} .
If the support is empty, then e = 0 computes f. Otherwise we set
e = ∨_{a ∈ S(f)} m(a) .
Then,
Thus, equations e = 1 and f (x) = 1 have the same solutions. By Lemma 2.16
we conclude
e ≡ f (x) .
The expression e constructed in the proof of Lemma 2.20 is called the com-
plete disjunctive normal form of f. As an example, we consider the complete
disjunctive normal forms of the sum and carry functions s and c′ defined in Table 2:
c′(a, b, c) ≡ ¬a ∧ b ∧ c ∨ a ∧ ¬b ∧ c ∨ a ∧ b ∧ ¬c ∨ a ∧ b ∧ c (4)
s(a, b, c) ≡ ¬a ∧ ¬b ∧ c ∨ ¬a ∧ b ∧ ¬c ∨ a ∧ ¬b ∧ ¬c ∨ a ∧ b ∧ c . (5)
These can be simplified to
c′(a, b, c) ≡ a ∧ b ∨ b ∧ c ∨ a ∧ c
s(a, b, c) ≡ a ⊕ b ⊕ c .
The correctness can be checked in the usual brute force way by trying all 8
assignments of values in B3 to the variables of the expressions, or by applying
the identities listed in Sect. 2.6.1.
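The construction of the complete disjunctive normal form in the proof of Lemma 2.20 can also be run mechanically. The following Python sketch is ours; it builds the DNF of the full adder carry and sum functions from their truth tables and checks it, as well as the simplified forms from the text, by brute force.

```python
from itertools import product

def dnf(f, n):
    """Complete disjunctive normal form of f : B^n -> B, given as its support S(f)."""
    return [a for a in product((0, 1), repeat=n) if f(*a) == 1]

def eval_dnf(minterms, a):
    """Evaluate the DNF: 1 iff some minterm matches the substitution a exactly."""
    return 1 if any(m == a for m in minterms) else 0

carry = lambda a, b, c: 1 if a + b + c >= 2 else 0    # majority of a, b, c
s     = lambda a, b, c: (a + b + c) % 2               # parity of a, b, c

for f in (carry, s):
    minterms = dnf(f, 3)
    for a in product((0, 1), repeat=3):
        assert eval_dnf(minterms, a) == f(*a)

# The simplified forms from the text agree as well:
for a, b, c in product((0, 1), repeat=3):
    assert carry(a, b, c) == ((a & b) | (b & c) | (a & c))
    assert s(a, b, c) == (a ^ b ^ c)
print("DNFs of carry and sum check out")
```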
In Sect. 3.1 we introduce the classical model of digital circuits. This includes
the classical definition of the depth d(g) of a gate g in a circuit as the length
of a longest path from an input of the circuit to the gate. For the purpose
of timing analysis in a later section, we also introduce the function sp(g)
measuring the length of a shortest path from an input of the circuit to gate
g. We present the classical proof by pigeon hole principle that the depth of
gates is well defined. By induction on the depth of gates we then conclude the
classical result that the semantics of switching circuits is well defined.
A few basic digital circuits are presented for later use in Sect. 3.2. This is
basically the same collection of circuits as presented in [12].
In Sect. 3.3 we introduce two hardware models: i) the usual digital model
consisting of digital circuits and 1-bit registers as presented in [11, 12] and
ii) a detailed model involving propagation delays, set up and hold times as
presented, e.g., in [10,14]. Working out the proof sketch from [10], we formalize
timing analysis and show by induction on depth that, with proper timing
analysis, the detailed model is simulated by the digital model. This justifies
the use of the digital model as long as we use only gates and registers.
In the very simple Sect. 3.4 we define n-bit registers which are composed
of 1-bit registers in order to use them as single components h.R of hardware
configurations h.
As we aim at the construction of memory systems, we extend in Sect.
3.5 both circuit models with open collector drivers, tristate drivers, buses,
and a model of main memory. As new parameters we have to consider in the
detailed model the enable and disable times of drivers. Also – as main memory
is quite slow – we have to postulate that, during accesses to main memory, its
input signals should be stable in the detailed model, i.e., there should be no
glitches on the input signals to the memory. We proceed to construct digital
interface circuitry for controlling buses and main memory and show in the
detailed model that, with that circuitry, buses and main memory are properly
operated. These results permit to abstract buses, main memory and their
interface circuitry to the digital model. So in later constructions, we only
have to worry about proper operation of the interface circuitry in the digital
world, and we do not have to reconsider the detailed model.
For readers who suspect that we might be paranoid, we also prove that
the proof obligations for the interface circuitry which we impose on the digital
model cannot be derived from the digital model itself. Indeed, we construct
a bus control that i) is provably free of contention in the digital model and
which ii) has – for common technology parameters – bus contention for about
1/3 of the time. As “bus contention” translates in the real world to “short
circuit”, such circuits destroy themselves.
Thus, the introduction of the detailed hardware model solves a very real
problem which is usually ignored in previous work, such as [12]. One would
suspect that the absence of such an argument leads to constructions that
malfunction in one way or another. In [12], buses are replaced by multiplexers,
so bus control is not an issue. But the processor constructed there has caches
for data and instructions, which are backed up by a main memory. Moreover,
the interface circuitry for the instruction cache as presented in [12] might
produce glitches1 . On the other hand, the (digital) design in [12] was formally
verified in subsequent work [1,3]; it was also put on a field programmable gate
array (FPGA) and ran correctly immediately after the first power up without
ever producing an error. If there are no malfunctions where one would worry
about them in view of later insights, one looks for explanations. It turned
out that the hardware engineer who transferred the design to the FPGA had
made a single change to the design (without telling anybody about it): he
had put a register stage in front of the main memory. In the physical design,
this register served as interface circuitry to the main memory and happened
to conform to all conditions presented in Sect. 3.5. Thus, although the digital
portion of the processor was completely verified, the design in the book still
contained a bug, which is only visible in the detailed model. The bug was
fixed without proof (indeed without being recognized) by the introduction of
the register stage in front of the memory. In retrospect, in 2001 the design
was not completely verified; that it ran immediately after power up involved
luck.
A few constructions for control automata are presented for later use in Sect.
3.6. This is basically the same collection of automata as presented in [12].
1
In the design from [12] the glitches can be produced on the instruction memory
address by the multiplexer between pc and dpc as described in Chap. 7.
In = {xn−1 , . . . , x0 , 0, 1}
Sig(C) = In ∪ G .
Depending on its type, every gate has one or two inputs which are connected
to signals of the circuit. We denote the input signals connected to a gate g ∈ G
of a circuit C by in1(g), in2(g) for gates with two inputs (AND, OR, ⊕) and
by in1(g) for gates with a single input (inverter). Note that we denote the
output signal of a gate g ∈ G simply by g.
At first glance it is very easy to define how a circuit should work. For a cir-
cuit C, we define the values s(a) of signals s ∈ Sig(C) for a given substitution
a = a[n − 1 : 0] ∈ Bn of the input signals:
2
Intuitively, the reader may think of g ∈ G consisting of two parts, one that
uniquely identifies the particular gate of the circuit (e.g., a name) and another
that specifies the type of the gate (AND, OR, ⊕, inverter).
1. If s = x_i is an input, then
∀i ∈ [n − 1 : 0] : x_i(a) = a_i .
2. If s is an inverter, then
s(a) = ¬in1(s)(a) .
3. If s is a ◦-gate with ◦ ∈ {∧, ∨, ⊕}, then
s(a) = in1(s)(a) ◦ in2(s)(a) .
The length of a path s[0 : m] is the number of gates on it, i.e., m.
∃i, j : i < j ∧ si = sj .
3
This proof uses the so called pigeonhole principle. If k + 1 pigeons are sitting in
k holes, then one hole must have at least two pigeons.
Since every path in a circuit has finite length, one can define for each signal
s the depth d(s) of s as the number of gates on a longest path from an input
to s:
d(s) = max{m | ∃ path s[0 : m] : s0 ∈ In ∧ sm = s} .
For later use we also define the length sp(s) of a shortest such path as
sp(s) = min{m | ∃ path s[0 : m] : s0 ∈ In ∧ sm = s} .
A circuit with inputs a, c and outputs c′, s satisfying
⟨c′ s⟩ = a + c
is called a half adder. Symbol and implementation are
shown in Fig. 6. The circuit in Fig. 7(b) is called a multiplexer or short mux.
[Figures: full adder (FA) and half adder (HA) — (a) symbol, (b) implementation]
For multiplexers we use the symbol from Fig. 7(a). The n-bit multiplexer or
short n-mux in Fig. 8(b) consists of n multiplexers with a common select
signal s. Its inputs and outputs satisfy
z[n − 1 : 0] = x[n − 1 : 0] if s = 0, and y[n − 1 : 0] if s = 1 .
For n-muxes we use the symbol from Fig. 8(a). Figure 9(a) shows the symbol
for an n-bit inverter. Its inputs and outputs satisfy
y[n − 1 : 0] = x[n − 1 : 0] .
Fig. 7. Multiplexer — (a) symbol, (b) implementation
[Fig. 8. n-bit multiplexer: (a) symbol, (b) implementation]
[Fig. 9. n-bit inverter: (a) symbol, (b) implementation]
Fig. 11. Implementation of an n-bit ◦-tree for ◦ ∈ {∧, ∨, ⊕}
An n-zero tester has inputs a[n − 1 : 0] and outputs zero, nzero satisfying
zero ≡ a = 0^n
nzero ≡ a ≠ 0^n .
It can be implemented by
nzero(a[n − 1 : 0]) = ∨_{i=0}^{n−1} a_i ,  zero = ¬nzero .
[Fig. 12. n-zero tester]
[Fig. 13. n-bit equality tester]
The inputs a[n − 1 : 0], b[n − 1 : 0] and outputs eq, neq of an n-bit equality
tester in Fig. 13 satisfy
eq ≡ a = b , neq ≡ a ≠ b .
Fig. 14. Implementation of an n-bit decoder
Fig. 15. Recursive construction of n-bit half decoder
An n-half decoder has inputs x[n − 1 : 0] and outputs Y[2^n − 1 : 0] satisfying
Y[2^n − 1 : 0] = 0^{2^n − ⟨x⟩} 1^{⟨x⟩} ,
i.e., input x is interpreted as a binary number and decoded into a unary num-
ber. The remaining output bits are filled with zeros. A recursive construction
of n-half decoders is shown in Fig. 15. For the construction of n-half decoders
from (n − 1)-half decoder, we divide the index range into upper and lower
half:
L = [2^{n−1} − 1 : 0] , H = [2^n − 1 : 2^{n−1}] .
Also we divide x[n − 1 : 0] into the leading bit x_{n−1} and the low order bits
x′ = x[n − 2 : 0] .
In the induction step we then conclude
Y[H] ◦ Y[L] = (x_{n−1} ∧ U[L]) ◦ (x_{n−1} ∨ U[L])
= 0^{2^{n−1}} ◦ 0^{2^{n−1} − ⟨x′⟩} 1^{⟨x′⟩}  if x_{n−1} = 0,  and  0^{2^{n−1} − ⟨x′⟩} 1^{⟨x′⟩} ◦ 1^{2^{n−1}}  if x_{n−1} = 1
= 0^{2^n − ⟨x′⟩} 1^{⟨x′⟩}  if x_{n−1} = 0,  and  0^{2^n − (2^{n−1} + ⟨x′⟩)} 1^{2^{n−1} + ⟨x′⟩}  if x_{n−1} = 1
= 0^{2^n − ⟨x_{n−1} x′⟩} 1^{⟨x_{n−1} x′⟩}
= 0^{2^n − ⟨x⟩} 1^{⟨x⟩} .
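The recursion of Fig. 15 can be simulated directly. The following Python sketch is our own illustration; it mirrors the recursive construction and checks the specification Y = 0^{2^n − ⟨x⟩} 1^{⟨x⟩} exhaustively for a small n.

```python
from itertools import product

def binval(x):
    return sum(bit << i for i, bit in enumerate(x))

def half_decode(x):
    """n-half decoder: x[n-1:0] -> Y[2^n-1:0] with Y = 0^(2^n - <x>) 1^<x>.
    Entry j of the returned list is output bit Y[j]."""
    n = len(x)
    if n == 1:
        return [x[0], 0]                       # Y[1:0] = 0^(2 - x0) 1^(x0)
    u = half_decode(x[:-1])                    # (n-1)-half decoder on x[n-2:0]
    top = x[-1]                                # leading bit x_{n-1}
    low = [top | bit for bit in u]             # Y[L] = x_{n-1} OR U[L]
    high = [top & bit for bit in u]            # Y[H] = x_{n-1} AND U[L]
    return low + high                          # positions 0 .. 2^n - 1

n = 4
for x in product((0, 1), repeat=n):            # x = (x_0, ..., x_{n-1})
    y = half_decode(list(x))
    expected = [1] * binval(x) + [0] * (2 ** n - binval(x))
    assert y == expected
print("half decoder matches its specification for n =", n)
```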
An n-parallel prefix circuit P P◦ (n) for an associative function ◦ : B × B → B
is a circuit with inputs x[n − 1 : 0] and outputs y[n − 1 : 0] satisfying
y0 = x0 , yi+1 = xi+1 ◦ yi . (6)
For even n a recursive construction of an n-parallel prefix circuit based on
◦-gates is shown in Fig. 16. For odd n one can realize an (n − 1)-bit parallel
prefix from Fig. 16 and compute output yn−1 as
yn−1 = xn−1 ◦ yn−2
using one extra ◦-gate.
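The same recursion can be played through in software. The following Python sketch is ours; it implements the recursive parallel prefix construction for an associative operation and compares it against the straightforward sequential computation of (6).

```python
import random

def parallel_prefix(x, op):
    """Recursive parallel prefix: returns y with y_0 = x_0, y_{i+1} = op(x_{i+1}, y_i)."""
    n = len(x)
    if n == 1:
        return [x[0]]
    if n % 2 == 1:                              # odd n: solve n-1, one extra op
        y = parallel_prefix(x[:-1], op)
        return y + [op(x[-1], y[-1])]
    xp = [op(x[2 * i + 1], x[2 * i]) for i in range(n // 2)]   # x'_i = x_{2i+1} o x_{2i}
    yp = parallel_prefix(xp, op)                               # half-size instance
    y = []
    for i in range(n):
        if i == 0:
            y.append(x[0])
        elif i % 2 == 1:
            y.append(yp[i // 2])                # y_{2i+1} = y'_i
        else:
            y.append(op(x[i], yp[i // 2 - 1]))  # y_{2i} = x_{2i} o y'_{i-1}
    return y

# Check against the sequential definition, e.g. with OR as the operation.
op = lambda a, b: a | b
for _ in range(100):
    x = [random.randint(0, 1) for _ in range(random.randint(1, 16))]
    ref = [x[0]]
    for bit in x[1:]:
        ref.append(op(bit, ref[-1]))
    assert parallel_prefix(x, op) == ref
print("parallel prefix construction agrees with the sequential definition")
```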
For the correctness of the construction, we first observe that
x′_i = x_{2i+1} ◦ x_{2i}
y_{2i} = x_{2i} ◦ y′_{i−1}
y_{2i+1} = y′_i .
We first show that odd outputs of the circuit satisfy (6). For i = 0 we have
y_1 = y′_0 (construction)
= x′_0 (ind. hypothesis PP◦(n/2))
= x_1 ◦ x_0 . (construction)
For i > 0 we conclude
y_{2i+1} = y′_i (construction)
= x′_i ◦ y′_{i−1} (ind. hypothesis PP◦(n/2))
= (x_{2i+1} ◦ x_{2i}) ◦ y′_{i−1} (construction)
= x_{2i+1} ◦ (x_{2i} ◦ y′_{i−1}) (associativity)
= x_{2i+1} ◦ y_{2i} . (construction)
Fig. 16. Recursive construction of an n-bit parallel prefix circuit of the function ◦ for an even n
Fig. 17. Recursive construction of an (n, A)-OR tree
For the even outputs we have
y_0 = x_0 (construction)
i > 0 → y_{2i} = x_{2i} ◦ y′_{i−1} (construction)
= x_{2i} ◦ y_{2i−1} . (construction)
An (n, A)-OR tree has A many input vectors b[i] ∈ Bn with i ∈ [0 : A − 1],
where b[i][j] with j ∈ [0 : n − 1] is the j-th bit of input vector b[i]. The outputs
of the circuit out[n − 1 : 0] satisfy
out[j] = ∨_{i=0}^{A−1} b[i][j] .
The implementation of (n, A)-OR trees, for the special case where A is a power
of two, is shown in Fig. 17.
4
In this book we do not present such hardware.
Fig. 18. A digital clocked circuit. Every output signal x[i]in of circuit c is the data
input of the corresponding register x[i] and every output x[i]ce produced by circuit
c is the clock enable input of the corresponding register
h′ = δ_H(h, reset) .
[Fig. 19. Example of a digital clocked circuit]
reset^t = 1 for t = −1, and reset^t = 0 for t ≥ 0 .
At power up, register values are binary but unknown. We denote this sequence
of unknown binary values at startup by a[n − 1 : 0]:
x^{−1}[n − 1 : 0] = a[n − 1 : 0] ∈ B^n .
The current value of a circuit signal y in cycle t is defined according to the
previously introduced circuit semantics:
y^t = ¬in1(y)^t if y is an inverter, and y^t = in1(y)^t ◦ in2(y)^t if y is a ◦-gate .
Let x[n − 1 : 0]in^t and x[n − 1 : 0]ce^t be the register input and clock enable
signals computed from the current configuration x^t[n − 1 : 0] and the current
value of the reset signal reset^t. Then the register value x^{t+1}[i] of the next
hardware configuration x^{t+1}[n − 1 : 0] = δ_H(x^t[n − 1 : 0], reset^t) is defined as
x^{t+1}[i] = x[i]in^t if x[i]ce^t = 1, and x^t[i] if x[i]ce^t = 0 ,
i.e., when the clock enable signal of register x[i] is active in cycle t, the register
value of x[i] in cycle t + 1 is the value of the data input signal in cycle t;
otherwise, the register value does not change.
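The register update rule of the digital clocked circuit model is just a function from the current configuration to the next one. The following Python sketch is our own minimal rendering (the circuit is given as plain functions, which is our assumption, not the book's formalism); it steps such a model through a few cycles of the single-register example discussed next.

```python
def step(x, reset, data_in, clock_en):
    """One cycle of the digital clocked circuit model:
    x[i] is updated with its data input iff its clock enable is 1."""
    ins = data_in(x, reset)      # x[i]in^t, computed by the circuit
    ces = clock_en(x, reset)     # x[i]ce^t
    return [ins[i] if ces[i] == 1 else x[i] for i in range(len(x))]

# Example: one register that is always clocked, gets 0 during reset,
# and otherwise gets its own inverse.
data_in = lambda x, reset: [0 if reset else 1 - x[0]]
clock_en = lambda x, reset: [1]

x = [1]                                   # unknown power-up value, here arbitrarily 1
x = step(x, 1, data_in, clock_en)         # cycle -1: reset
for t in range(6):
    assert x[0] == t % 2                  # x^t = t mod 2 for t >= 0
    x = step(x, 0, data_in, clock_en)
print("x^t = t mod 2, as claimed for the example circuit")
```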
As an example, consider the digital clocked circuit from Fig. 19. There is
only one register, thus we abbreviate x = x[0]. For cycle −1 we have
x^{−1} = a[0]
reset^{−1} = 1
xce^{−1} = 1
xin^{−1} = 0 .
For cycles t ≥ 0 we have
reset^t = 0
xce^t = 1
xin^t = y^t = ¬x^t ,
and hence
∀t ≥ 0 : x^t = (t mod 2) .
y : R → {0, 1, Ω}
e(c) = γ + c · τ .
Inspired by data sheets from hardware manufacturers, registers and gates have
six timing parameters:
• ρ: the minimal propagation delay of register outputs after clock edges,
• σ: the maximal propagation delay of register outputs after clock edges (we
require 0 ≤ ρ < σ),
• ts: setup time of register input and clock enable before clock edges,
• th: hold time of register input and clock enable after clock edges,
• α: minimal propagation delay of gates, and
• β: maximal propagation delay of gates5 (we require 0 < α < β).
5
Defining such delays from voltage levels of electrical signals is nontrivial and can
go wrong in subtle ways. For the deduction of a negative propagation delay from
the data of a very serious hardware catalogue, see [5].
Fig. 20. Detailed timing of a register x[i] with stable inputs and ce = 1
This is a simplification. Setup and hold times can be different for register
inputs and clock enable signals. Also, the propagation delays of different types
of gates are, in general, different. Generalizing our model to this situation is
easy but requires more notation.
Let y be any signal. The requirement that this signal satisfies the setup
and hold times of registers at clock edge c is defined by
stable(y, c) ↔ ∃a ∈ B : ∀t ∈ [e(c) − ts, e(c) + th] : y(t) = a .
The behavior of a register x[i] with stable input and clock enable at edge t is
illustrated in Fig. 20.
For c ∈ N ∪ {−1} and t ∈ (e(c) + ρ, e(c + 1) + ρ], we define the register
value x[i](t) and output at time t by a case split:
• Clocking the register at edges c ≥ 0. The clock enable signal is 1 at edge
e(c) and the setup and hold times for the input and clock enable signals
are met:
x[i]ce(e(c)) ∧ stable(x[i]in, c) ∧ stable(x[i]ce, c) .
Then the data input at edge e(c) becomes the new value of the register,
and it becomes visible (at the latest) at time σ after clock edge e(c).
• Not clocking the register at edges c ≥ 0. The clock enable signal is 0 at
edge e(c) and the setup and hold times for it are met:
x[i]ce(e(c)) ∧ stable(x[i]ce, c) .
ρ to σ after regular clocking, and ii) the entire time interval if there was a
violation of the stability conditions of any kind. Usually, a physical register
will settle in this situation quickly into an unknown logical value, but in
rare occasions the register can “hang” at a voltage level not recognized as
0 or 1 for a long time. This is called ringing or metastability.
Formally, we define the register semantics of the detailed hardware model in
the following way:
x[i](t) =
  a[i]           if reset(t)
  x[i]in(e(c))   if t ∈ [e(c) + σ, e(c + 1) + ρ] ∧ stable(x[i]in, c)
                    ∧ stable(x[i]ce, c) ∧ x[i]ce(e(c)) ∧ ¬reset(t)
  x[i](e(c))     if t ∈ (e(c) + ρ, e(c + 1) + ρ] ∧ stable(x[i]ce, c)
                    ∧ ¬x[i]ce(e(c)) ∧ ¬reset(t)
  Ω              otherwise .
Notice that during regular clocking in, the output is unknown between e(c)+ρ
and e(c)+σ. This is the case even if x[i]in(e(c)) = x[i](e(c)), i.e., when writing
the same value the register currently contains. In this case a glitch on the
register output can occur. A glitch (or a spike) is a situation when a signal
has the same digital value x ∈ B in cycle t and t + 1 but in the physical model
it temporarily has a value not recognized as x. The only way to guarantee
constant register outputs during the time period is not to clock the register
during that time.
We require the reset signal to behave like an output of a register which is
clocked at cycles −1 and 0 and is not clocked afterwards (Fig. 21):
reset(t) =
  1   if t ∈ [e(−1) + σ, e(0) + ρ]
  Ω   if t ∈ (e(0) + ρ, e(0) + σ)
  0   otherwise .
[Fig. 22. Regular signal propagation reg(y, t) and signal holding hold(y, t) at a ◦-gate]
We show a simple lemma, which guarantees that the output from a register
does not have any glitches if this register is not clocked.
Lemma 3.3 (glitch-free output of non-clocked register). Assume stable
data is clocked into register x[i] at edge e(c):
stable(x[i]in, c) ∧ stable(x[i]ce, c) ∧ x[i]ce(e(c)) .
Assume further that the register is not clocked for the following K − 1 clock
edges:
∀k ∈ [1 : K − 1] : stable(x[i]ce, c + k) ∧ ¬x[i]ce(e(c + k)) .
Then the value x[i]in(e(c)) is visible at the output of register x[i] from time
e(c) + σ to e(c + K) + ρ:
∀t ∈ [e(c) + σ, e(c + K) + ρ] : x[i](t) = x[i]in(e(c)) .
Proof. One shows by an easy induction on k
∀t ∈ [e(c) + σ, e(c + 1) + ρ] : x[i](t) = x[i]in(e(c))
and
∀k ∈ [1 : K − 1] : ∀t ∈ (e(c + k) + ρ, e(c + k + 1) + ρ] : x[i](t) = x[i]in(e(c)) .
For the definition of the value y(t) of gates g at time t in the detailed model,
we distinguish three cases (see Fig. 22):
• Regular signal propagation. Here, all input signals are binary and stable for
the maximal propagation delay β before t. For inverters y this is captured
by the following predicate:
reg(y, t) ↔ ∃a ∈ B : ∀t′ ∈ [t − β, t] : in1(y)(t′) = a .
Gate y in this case outputs ¬a at time t. For ◦-gates y we define
reg(y, t) ↔ ∃a, b ∈ B : ∀t′ ∈ [t − β, t] : in1(y)(t′) = a ∧ in2(y)(t′) = b .
Then gate y outputs a ◦ b at time t.
• Signal holding. Here, signal propagation is not regular anymore but it was
regular at some time during the minimal propagation delay α before t:
hold(y, t) ↔ ¬reg(y, t) ∧ ∃t′ ∈ [t − α, t] : reg(y, t′) .
The gate y in this case still holds the old value y(t′) at time t. We will
show that the value y(t′) is well defined for all t′.
• All other cases where we cannot give any guarantees about y(t).
y(t_1) = y(t_2) .
Proof. The proof is illustrated in Fig. 23. Without loss of generality, we have
t1 < t2 . Let z ∈ {in1(y), in2(y)} be any input of y. From reg(y, t1 ) we infer
From
0 < t2 − t 1 < α < β
we infer
t2 − β < t1 < t2 .
Thus,
t1 ∈ (t2 − β, t2 ).
From reg(y, t2 ) we get
and hence,
z(t2 ) = z(t1 ) = a .
For ◦-gates y we have
For values t satisfying hold(y, t), we define lreg(y, t) as the last time t′ before
t when signal propagation was regular:
Timing analysis is performed in the detailed model in order to ensure that all
register inputs x[i]in and clock enables x[i]ce are stable at clock edges. We
capture the conditions for correct timing by
After a reminder that d(y) and sp(y) are the lengths of longest and shortest
paths from the inputs to y, we define the minimal and the maximal propaga-
tion delays of arbitrary signals y relative to the clock edges:
tmin(y) = ρ + sp(y) · α,
tmax(y) = σ + d(y) · β .
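Computing tmin and tmax for every signal is a simple longest-/shortest-path computation over the circuit DAG. The following sketch is ours, with an ad-hoc circuit description and illustrative parameter values (none of these names or numbers come from the book); it then derives the smallest admissible cycle time from tmax(y) + ts ≤ τ.

```python
# Circuit as a DAG: each gate maps to the list of its input signals; register
# outputs and circuit inputs are the sources with sp = d = 0.
gates = {
    'g1': ['x0', 'x1'],
    'g2': ['g1'],          # an inverter fed by g1
    'g3': ['g2', 'x2'],
}
sources = ['x0', 'x1', 'x2']

rho, sigma, alpha, beta, ts = 1, 2, 1, 3, 1      # example timing parameters

d, sp = {}, {}
for s in sources:
    d[s] = sp[s] = 0
for g, ins in gates.items():                     # gates listed in topological order
    d[g] = 1 + max(d[i] for i in ins)            # longest path to g
    sp[g] = 1 + min(sp[i] for i in ins)          # shortest path to g

tmin = {y: rho + sp[y] * alpha for y in d}
tmax = {y: sigma + d[y] * beta for y in d}

tau = max(tmax.values()) + ts                    # smallest admissible cycle time
print("tmax:", tmax)
print("choose cycle time tau >=", tau)
```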
6
The input signals 0 and 1 of a circuit do in fact have no propagation delay.
However, giving a precise definition that takes this into account would make
things unnecessarily complicated here since we would need to define and argue
about the longest and shortest path without the 0/1 signals. Instead, we prefer
to overestimate and keep things simple by using already existing definitions.
c equals the value y(e(c + 1)) at the end of the cycle. In other words, with
correct timing the digital model is an abstraction of the detailed model:
y^c = y(e(c + 1)) .
∀y : tmax(y) + ts ≤ τ
and if for all inputs x[i]in and clock enable signals x[i]ce of registers we have
∀i : th ≤ tmin(x[i]in) ∧ th ≤ tmin(x[i]ce),
then
1. ∀y, c : ∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] : y(t) = y^c ,
2. ∀i, c : c ≥ 0 → stable(x[i]in, c) ∧ stable(x[i]ce, c).
Proof. By induction on the depth d(y) of signals. Let the statement hold for
signals of depth d − 1 and let y be a ◦-gate of depth d. We show that it holds
for y.
Consider Fig. 24. There are inputs z1 , z2 of y such that
Hence,
tmax(y) = tmax(z1 ) + β
tmin(y) = tmin(z2 ) + α .
Since we have
and thus,
We conclude
and
e(c + 1) + tmin(y) − α ∈ [t − β, t]
which implies
In case reg(y, t) doesn’t hold, we get hold(y, t). We have to show that
y(lreg(y, t)) = y^c .
y(lreg(y, t)) = y^c .
We now continue with the proof of Lemma 3.5. For the induction base c = −1
and the signals coming from the registers x[i] with d(x[i]) = 0, we have
tmin(x[i]) = ρ ∧ tmax(x[i]) = σ .
From the initialization rules in the digital and detailed models we get
and conclude the proof of part 1 by Lemma 3.6. For part 2 there is nothing
to show.
[Fig. 26. An n-bit register built from 1-bit registers R[n − 1], . . . , R[0] with a common clock enable Rce]
3.4 Registers
So far we have shown that there is one basic hardware model, namely the
detailed one, but with correct timing it can be abstracted to the digital model
(Lemma 3.5). From now on we assume correct timing and stick to the usual
digital model unless we need to prove properties not expressible in this model
– like the absence of glitches.
Although all memory components can be built from 1-bit registers, it is
inconvenient to refer to all memory bits in a computer by numbering them
with an index i of a clocked circuit input x[i]. It is more convenient to deal
with hardware configurations h and to gather groups of such bits into certain
memory components h.M . For M we introduce n-bit registers h.R. In Chap.
4 we add to this no less than 9 (nine) random access memory (RAM) designs.
As before, in a hardware computation with memory components, we have
h^{t+1} = δ_H(h^t, reset^t) .
An n-bit register R consists simply of n many 1-bit registers R[i] with a
common clock enable signal Rce as shown in Fig. 26.
Register configurations are n-tuples:
h.R ∈ Bn .
Given input signals Rin(h^t) and Rce(h^t), we obtain from the semantics of the
basic clocked circuit model:
h^{t+1}.R = Rin(h^t) if Rce(h^t) = 1, and h^t.R if Rce(h^t) = 0 .
Recall that, from the initialization rules for 1-bit registers, after power up
register content is binary but unknown (metastability is extremely rare):
h^0.R ∈ B^n .
[Fig. 27. Open collector (OC) driver: symbol and detailed timing]
collector drivers, and main memory. For hardware consisting only of gates,
inverters, and registers, we have shown in Lemma 3.5 that a design that works
in the digital model also works in the detailed hardware model. For tristate
drivers and main memory this will not be the case.
A single open collector driver y and its detailed timing is shown in Fig. 27. If
the input yin is 0, then the open collector driver also outputs 0. If the input
is 1, then the driver is disabled. In detailed timing diagrams, an undefined
value due to disabled outputs is usually drawn as a horizontal line in the
middle between 0 and 1. In the jargon of hardware designers this is called the
high impedance state or high Z or simply Z. In order to specify behavior and
operating conditions of open collector and tristate drivers, we have to permit
Z as a signal value for drivers y. Thus, we have
y : R → {0, 1, Ω, Z} .
For the propagation delay of open collector drivers, we use the same parame-
ters α and β as for gates. Regular signal propagation is defined the same way
as for inverters:
reg(y, t) ↔ ∃a ∈ B : ∀t′ ∈ [t − β, t] : yin(t′) = a .
Fig. 28. Open collector drivers y_i connected by a bus b
The bus value b(t) is then defined as
b(t) = 0 if ∃i : y_i(t) = 0, 1 if ∀i : y_i(t) = Z, and Ω otherwise .
but this abstracts away an important detail: glitches on a driver input can
propagate to the bus, for instance when other drivers are disabled. This will
not be an issue for the open collector buses constructed here. It is, however,
an issue in the control of real time buses [10].
By de Morgan’s law, one can use open collector buses together with some
inverters to compute the logical OR of signals ui :
b = ∨_i u_i = ¬( ∧_i ¬u_i ) . (7)
Tristate drivers y are controlled by output enable signals yoe7 . Symbol and
timing are shown in Fig. 30. Only when the output enable signal is active, a
tristate driver propagates the data input yin to the output y. Like ordinary
switching, enabling and disabling tristate drivers involves propagation delays.
7
Like clock enable signals, we model them as active high, but in data sheets for
real hardware components they are usually active low.
[Fig. 29. n-bit open collector driver: (a) symbol, (b) implementation]
[Fig. 30. Tristate driver: symbol and detailed timing]
For simplicity, we use the same timing parameters as for gates. Regular signal
propagation is defined as for gates:
Observe that a glitch on an output enable signal can produce a glitch in signal
y. In contrast to glitches on open collector buses this will be an issue in our
designs involving main memory.
Like open collector drivers, the outputs of tristate drivers can be connected
via so called tristate buses. The clean way to operate a tristate bus b with
Fig. 31. Tristate drivers y_i connected by a bus b
Fig. 32. Switching enable signals of drivers at the same clock edge
If this invariant is maintained, the following definition of the bus value b(t) at
time t is well defined:
b(t) = y_i(t) if ∃i : y_i(t) ≠ Z, and Z otherwise .
The invariant excludes a design like in Fig. 32, where drivers y0 and y1 are
switched on and off at the same clock edge8 . In order to understand the
possible problem with such a design consider a rising clock edge when R0 =
y0 oe is turned on and R1 = y1 oe is turned off. This can lead to a situation as
shown in Fig. 33.
There, we assume that the propagation delay of R0 is ρ = 1 and the
propagation delay of R1 is σ = 2. Similarly, assume that the enable time of y0
8
This is not unheard of in practice.
Fig. 33. Possible timing when enable signals are switched at the same clock edge
[Fig. 34. Output stage of a driver: output y between pull-up R1 (to VCC) and pull-down R2 (to GND)]
R1    R2    y
low   high  1
high  low   0
high  high  Z
low   low   short circuit
Fig. 35. Short circuit via the bus b when two drivers are enabled at the same time
Fig. 35 the short circuit is still possible via the low resistance path
GN D → y0 → b → y1 → V CC .
This occurs when two drivers are simultaneously enabled and one of the drivers
drives 0 while the other driver drives 1. Exactly this situation occurs tem-
porarily in the real-valued time interval [r + 2, r + 3] after each rising clock
edge r. In the jargon of hardware designers this is called – temporary – bus
contention, which clearly sounds much better than “temporary short circuit”.
But even with the nicer name it remains of course a short circuit. In the best
case, it increases power consumption and shortens the life time of the driver.
The spikes in power consumption can have the side effect that power supply
voltage falls under specified levels; maybe not always, but sporadically when
power consumption in other parts of the hardware is high. Insufficient supply
voltage then will tend to produce sporadic non reproducible failures in other
parts of the hardware.
The bad part – as we will demonstrate later – is the very natural looking
condition:
y_i^t ≠ Z ∧ y_j^t ≠ Z → i = j . (10)
The good part, i.e., (9), correctly models the behavior of drivers for times after
clock edges where all propagation delays have occurred and when registers are
updated. Indeed, if we consider a bus b driven by drivers yi as a gate with
depth
we can immediately extend Lemma 3.5 to circuits with buses and drivers of
both kinds.
Lemma 3.7 (timing and simulation with drivers and buses). Assume
that (8) holds for all tristate buses and assume the correct timing
∀y : tmax(y) + ts ≤ τ
and
∀i : th ≤ tmin(x[i]in) ∧ th ≤ tmin(x[i]ce) .
Then,
1. ∀y, c : ∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] : y(t) = y^c ,
2. ∀i, c : c ≥ 0 → stable(x[i]in, c) ∧ stable(x[i]ce, c).
This justifies the use of the digital model as far as register update is concerned.
The lemma has, however, a hypothesis coming from the detailed model. Re-
placing it simply by what we call the bad part of the digital model, i.e., by
(10), is the highway to big trouble. First of all, observe that our design in Fig.
33, which switched enable signals at the same clock edge, satisfies it. But in
the detailed model (and the real world) we can do worse. We can construct
hardware that destroys itself by the short circuits caused by bus contention
but which is contention free according to the (bad part of) the digital model.
In what follows we will do some arithmetic on time intervals [a, b] where signals
change. In our computations of these time bounds we use the following rules:
c + [a, b] = [c + a, c + b]
c · [a, b] = [c · a, c · b]
[a, b] + [c, d] = [a + c, b + d]
c + (a, b) = (c + a, c + b)
c · (a, b) = (c · a, c · b) .
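This interval arithmetic is easy to mechanize. The following sketch is ours, with illustrative parameter values (not from the book); it encodes the rules and replays the interval bounds used in the contention analysis below, so the reader can vary the technology parameters.

```python
# Intervals [a, b] as tuples; the rules c + [a,b], c * [a,b], [a,b] + [c,d].
def shift(c, iv):
    return (c + iv[0], c + iv[1])

def scale(c, iv):
    return (c * iv[0], c * iv[1])

def add(iv1, iv2):
    return (iv1[0] + iv2[0], iv1[1] + iv2[1])

rho, sigma, alpha, beta = 1, 2, 1, 3        # illustrative technology parameters
c = 10                                      # length of the delay line
T = 0                                       # a clock edge

ab = (alpha, beta)
t2 = shift(T, add((rho, sigma), ab))                    # rise of the pulse
t3 = shift(T, add((rho, sigma), scale(c + 2, ab)))      # fall of the pulse
t4 = shift(T, add((rho, sigma), scale(2, ab)))          # driver enabled
t5 = shift(T, add((rho, sigma), scale(c + 3, ab)))      # driver disabled
contention = (t4[1], t5[0])                             # guaranteed overlap
print("contention at least during", contention)
```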
Lemma 3.8 (self destructing hardware). For any ε > 0 there is a design
satisfying (10) which produces continuous bus contention for at least a fraction
α/β − ε of the total time.
Fig. 36. Generating a pulse of arbitrary width by a sufficiently long delay line
[Fig. 37. Timing diagram of the c-pulse generator]
Proof. The key to the construction is the parametrized design of Fig. 36. The
timing diagram in Fig. 37 shows that the entire design produces a pulse of
length growing with c; hence, we call it a c-pulse generator.
Signal u goes up at time t. The chain of c AND gates just serves as a delay
line. The result is finally inverted. Thus, the inverted signal falls in time interval t1 with
t1 = t + (c + 1) · [α, β] .
The final AND gate produces a pulse v with a rise time in interval t2 and a
fall time in interval t3 satisfying
t2 = t + [α, β]
t3 = t + (c + 2) · [α, β] .
v^t = (u^t ∧ ¬u^t) = 0 ,
which is indeed correct after propagation delays are over – and that is all
the digital model captures. Now consider the design in Fig. 38. In the digital
model, v1 and v2 are always zero. The only driver ever enabled in the digital
model is y3 . Thus, the design satisfies (10) for the digital model.
Now consider the timing diagram in Fig. 39. At each clock edge T , one of
registers Ri has a rising edge in time interval
t1 = T + [ρ, σ] ,
which generates a pulse with rising edge in time interval t2 and falling edge
in time interval t3 satisfying
[Fig. 38. Generating contention with two pulse generators]
[Fig. 39. Timing analysis for the period of bus contention from t4 to t5]
t2 = T + [ρ, σ] + [α, β]
t3 = T + [ρ, σ] + (c + 2) · [α, β] .
The corresponding driver output yi switches one gate delay later, i.e., in intervals t4 and t5 with
t4 = T + [ρ, σ] + 2 · [α, β]
t5 = T + [ρ, σ] + (c + 3) · [α, β] .
[Fig. 40. Set-clear flip-flop: a 1-bit register R with data input set ∧ ¬reset and clock enable set ∨ clr ∨ reset]
We choose the cycle time
τ (c) = σ + (c + 3) · β ,
such that the timing diagram fits exactly into one clock cycle. In the next
cycle, we then have the same situation for the other register and driver. We
have contention on bus b at least during time interval
C = T + [σ + 2 · β, ρ + (c + 3) · α]
of length
ℓ(c) = ρ + (c + 3) · α − (σ + 2 · β) .
Asymptotically we have ℓ(c)/τ (c) → α/β, i.e., for sufficiently large c the contention lasts for at least a fraction α/β − ε of the total time. This proves the lemma.
We now construct control logic for tristate buses. We begin with a digital
specification, construct a control logic satisfying this (incomplete) specifica-
tion and then show in the detailed model i) that the bus is free of contention
and ii) that signals are free of glitches while we guarantee their presence on
the bus.
As a building block of the control, we use the set-clear flip-flops from
Fig. 40. This is simply a 1-bit register which is set to 1 by activation of the
set signal and to 0 by activation of the clr signal (without activation of the
set signal). During reset, i.e., during cycle −1, the flip-flops are forced to zero.
[Fig. 41. Registers Rj connected to a bus b by tristate drivers yj]
R^0 = 0

R^{t+1} = 1     if set^t
          0     if ¬set^t ∧ clr^t
          R^t   otherwise
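The digital behavior of the set-clear flip-flop can be paraphrased by a small Python sketch (ours; an executable restatement of the equations above, with set taking precedence over clr).

```python
# Digital model of a set-clear flip-flop: R^0 = 0 (forced by reset),
# R^{t+1} = 1 if set^t, 0 if (not set^t) and clr^t, else R^t.

def sc_flipflop_run(set_seq, clr_seq):
    R = 0                      # value after reset (cycle -1 forces 0)
    trace = [R]
    for s, c in zip(set_seq, clr_seq):
        if s:
            R = 1
        elif c:
            R = 0
        # otherwise R keeps its value
        trace.append(R)
    return trace

# set in cycle 0, clear in cycle 3
print(sc_flipflop_run([1, 0, 0, 0, 0], [0, 0, 0, 1, 0]))
# [0, 1, 1, 1, 0, 0]
```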
For the specification of the bus control we consider "time" intervals of cycles
Ti = [ai : bi ]
and a function
send : N → [1 : k]
specifying for each i ∈ N the unique index j = send(i) such that register Rj
is "sending" on the bus during "time" interval Ti :

b^t = Rj^t   if ∃i : j = send(i) ∧ t ∈ Ti
      Z      otherwise .
[Fig. 42. Generation of output enable signals yj oe by set-clear flip-flops]
We also require that the first interval starts only after the reset cycle:
0 < a0 .
[Fig. 43. Example of idealized timing of the tristate bus control]
An example of idealized timing of the tristate bus control is shown in Fig. 43.
Unit send(i) is operating on the bus in two consecutive time intervals Ti and
Ti+1 (the value of the register Rsend(i) can be updated in the last cycle of the
first interval) and then its driver is disabled. Between the end bi+1 of interval
Ti+1 and the start ai+2 of the next interval Ti+2 , there is at least one cycle
where no driver is enabled in the digital model.
In the digital model, we immediately conclude
yj oe^t ≡ ∃i : send(i) = j ∧ t ∈ Ti ,

b^t = Rj^t   if ∃i : send(i) = j ∧ t ∈ Ti
      Z      otherwise ,

[Fig. 44. Timing diagram for clean bus control in case unit j is sending in the interval Ti and is giving up the bus afterwards. Timing of signals yj oeset, yj oeclr, and Rj ce are idealized. Other timings are detailed.]
Proof. Note that the hypotheses of this lemma are all digital. Thus, we can
prove them entirely in the digital world.
Consider the timing diagram in Fig. 44. For the outputs of the set-clear
flip-flop yj oe, we get after reset
[Figure: symbol (a) and implementation (b) of an n-bit tristate driver with data inputs x[n − 1 : 0], enable oe, and outputs y[n − 1 : 0]]
[Figure: main memory bus signals, among them b.ad (address), b.mmreq, and b.mmack]
The main memory combines a read only memory (ROM) with ordinary random access memory (RAM)9 . The standard use for this is to store boot code in the
read only portion of the memory. Since, after power up, the memory content
of RAM is unknown, computation will not start in a meaningful way unless at
least some portion of memory contains code that is known after power up. The
reset mechanism of the hardware ensures that processors start by executing
the program stored in the ROM. This code usually contains a so called boot
loader which accesses a large and slow memory device – like a disk – to load
further programs, e.g., an operating system to be executed, from the device.
For the purpose of storing a boot loader, we assume the main memory to
behave as a ROM for addresses a = 029−r b, where b ∈ Br and r < 29.
Operating conditions of the main memory are formulated in the following
definitions and requirements:
1. Stable inputs. In general, accesses to main memory last several cycles.
During such an access, the inputs must be stable:
b.mmreq^q ∧ ¬b.mmack^q ∧ X ∈ mmin(q) → b.X^{q+1} = b.X^q ,
where mmin(q) is the set of inputs of an access active in cycle q:

mmin(q) = {b.ad, b.mmreq, b.mmw} ∪  {b.d}   if b.mmw^q
                                    ∅       otherwise .
3. Memory liveness. If the inputs are stable, we may assume liveness for
the main memory, i.e., every request should be eventually served:
b.mmreq^t → ∃t′ ≥ t : b.mmack^{t′} .
4. Effect of write operations. If the inputs are stable and the write access
is on, then in the next cycle after the acknowledgement, the data from b.d
is written to the main memory at the address specified by b.ad:
mm^{q+1}(x) = b.d^q      if x = b.ad^q ∧ b.mmack^q ∧ b.mmw^q ∧ x[28 : r] ≠ 0^{29−r}
              mm^q(x)    otherwise .
The writes only affect the memory content if they are performed to ad-
dresses larger than 029−r 1r .
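A short Python sketch (our own paraphrase of the rule above): a write takes effect in the cycle after the acknowledgement and has no effect on the ROM portion of the address space. The constant R_BITS is illustrative only.

```python
# Sketch of the main-memory write effect. Line addresses are modeled as
# integers of 29 bits; the ROM portion consists of the addresses 0 .. 2^r - 1.
R_BITS = 4   # illustrative value of r

def mm_write(mm, ad, data, mmack, mmw):
    """Return the memory content of the next cycle (mm: line address -> data)."""
    rom = (ad >> R_BITS) == 0      # x[28:r] = 0^{29-r}, i.e., address in ROM portion
    if mmack and mmw and not rom:
        mm = dict(mm)
        mm[ad] = data
    return mm

mem = {0: "boot", 16: "old"}
mem = mm_write(mem, 0, "new", mmack=True, mmw=True)    # ignored: ROM portion
mem = mm_write(mem, 16, "new", mmack=True, mmw=True)   # performed
print(mem)   # {0: 'boot', 16: 'new'}
```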
5. Effect of read operations. If the inputs are stable, then, in the last
cycle of the read access, the data from the main memory specified by b.ad
is put on b.d:
6. Tristate driver enable. The driver mmbd connecting the main memory
to bus b.d is never enabled outside of a read access:
[Fig. 47. Registers and their drivers on bus components b.X]
We extend the control of the tristate bus from Sect. 3.5.5 to a control of
the components of the main memory bus. We consider k units U (j) with
j ∈ [1 : k] capable of accessing main memory. Each unit has output registers
mmreqj , mmwj , aj , and dj and an input register Qj . They are connected to
the bus b accessing main memory in the obvious way: bus components b.X
with X ∈ {ad, mmreq, mmw} occur only as inputs to the main memory. The
situation shown in Fig. 47 for unit U (j) is simply a special case of Fig. 41
with
[Fig. 48. Registers and main memory with their drivers on bus component b.d]
Rj = Xj
yj = Xj bd
b = b.X .
As shown in Fig. 48, bus components b.d can be driven both by the units and
by the main memory. If main memory drives the data bus, the data on the bus
can be clocked into input register Qj of unit U (j). If the data bus is driven
by a unit, the data on the bus can be stored in main memory. We treat main
memory simply as unit k + 1. Then we almost have a special case of Fig. 41
with
Rj = dj         for 1 ≤ j ≤ k
yj = dj bd      if 1 ≤ j ≤ k
     mmbd       if j = k + 1
b = b.d .
Signal b.mmack is broadcast by main memory. Thus, bus control is not neces-
sary for this signal. We want to extend the proof of Lemma 3.9 to show that
all four tristate buses given above are operated in a clean way. We also use the
statement of Lemma 3.9 to show that the new control produces memory input
without glitches in the sense of the main memory specification. The crucial
signals governing the construction of the control are the main memory request
signals mmreqj . We compute them in set-clear flip-flops; they are cleared at
reset:
∀j : mmreqj^0 = 0 .
For the set and clear signals of the memory request, we use the following
discipline:
• a main memory request signal is only set when all request signals are off
• at most one request signal is turned on at a time (this requires some sort
of bus arbitration):
mmreqj setq−1 →
(∀x ∈ [q : ack(q) − 1] : ¬mmreqj clrx ) ∧ mmreqj clrack(q) .
Now we can define access intervals Ti = [ai : bi ]. The start cycle ai of interval
Ti is occurrence number i of the event that any signal mmreqj turns on. In
the end cycle bi , the corresponding acknowledgement occurs:
For bus components b.X with X ∈ {ad, mmreq, mmw}, we say that a unit
U (j) is sending in interval Ti if its request signal is on at the start of the
interval:
send(i) = j ↔ mmreqjai = 1 .
Controlling the bus components b.X with X ∈ {ad, mmreq, mmw} (which
occur only as inputs to the main memory) as prescribed in Lemma 3.9, we
conclude
For the data component b.d of the bus, we define unit j to be sending if its
request signal is on in cycle ai and the request is a write request. We define
the main memory to be sending (send′(i) = k + 1) if the request in cycle ai
is a read request:

send′(i) = j       if mmreqj^{ai} = 1 ∧ mmwj^{ai}
           k + 1   if ∃j : mmreqj^{ai} = 1 ∧ ¬mmwj^{ai} .
3.9 and (14) in the specification of the main memory. For write operations,
we conclude by Lemma 3.9:
Under reasonable assumptions for timing parameters and cycle time τ , this
completes the proof of (11) of the main memory specification requiring that
glitches are absent in main memory input.
Lemma 3.10 (clean operation of memory). Let ρ + α ≥ th and σ + β +
mmts ≤ τ . Then,
Equation (13) is needed for timing analysis. In order to meet set up times for
the data of input Qj in of registers Qj on bus b.d, it obviously suffices if
mmts ≥ ts .
However, a larger lower bound for parameter mmts will follow from the con-
struction of particular control automata in Chap. 8.
δA : Z × I → Z
η : Z × I → O
Z = [0 : k − 1] , z0 = 0 .
h0 .S = code(0) .
Circuits out (like output) and nexts are constructed such that the automaton
is simulated in the following sense: if h.S = z, i.e., state z is encoded by the
hardware, then
1. out(h) = η(z), i.e., automaton and hardware produce the same output,
2. nexts(h) = code(δA (z, in(h))), i.e., in the next cycle the hardware h .S
encodes the next state δA (z, in(h)).
[Fig. 50. Naive implementation of a Moore automaton]
The following lemma states correctness of the construction shown in Fig. 50.
Lemma 3.11 (Moore automaton). Let h.S = code(z) and δA (z, in(h)) = z′. Then,
out(h) = η(z) ∧ h′.S = code(z′) .
For all i ∈ [0 : γ − 1], we construct the i’th output simply by OR-ing together
all bits S[x] where η(x)[i] = 1, i.e., such that the i-th output is on in state x
of the automaton:
out(h)[i] = ⋁_{x : η(x)[i]=1} h.S[x] .
A straightforward argument shows the first claim of the lemma. Assume h.S =
z. Then,
h.S[x] = 1 ↔ x = z .
Hence,
out(h)[i] = 1 ↔ ⋁_{x : η(x)[i]=1} h.S[x] = 1
            ↔ ∃x : η(x)[i] = 1 ∧ h.S[x] = 1
            ↔ η(z)[i] = 1 .
δi,j : Bσ → B
i.e., function δi,j (in) is on if input in takes the automaton from state i to state
j. Boolean formulas for functions δi,j can be constructed by Lemma 2.20. For
each state j, component nexts[j], which models the next state function, is
turned on in states x, which transition under input in to state j according to
the automaton’s transition function:
nexts(h)[j] = ⋁_x h.S[x] ∧ δx,j (in(h)) .
Now assume
h.S = code(z) ∧ δA (z, in(h)) = z′ .
Then
nexts(h)[j] = 1 ↔ ⋁_x h.S[x] ∧ δx,j (in(h)) = 1
             ↔ δz,j (in(h)) = 1
             ↔ δA (z, in(h)) = j .
Hence,
nexts(h)[j] = 1   if j = z′
              0   otherwise .
Thus,
code(z′) = nexts(h)
         = h′.S .
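The one-hot (unary) state encoding and the OR/AND constructions for out and nexts are easy to mimic in software. The following Python sketch (names are ours) simulates the construction of Fig. 50 for an arbitrary automaton and can be used to check the two simulation claims above.

```python
# Sketch: one-hot implementation of a Moore automaton as in Fig. 50.
# S is the one-hot state vector; out[i] ORs S[x] over states x with eta(x)[i] = 1;
# nexts[j] ORs S[x] over states x that move to j under the current input.

def code(z, k):
    """Unary (one-hot) encoding of state z with k bits."""
    return [1 if x == z else 0 for x in range(k)]

def out(S, eta, gamma):
    return [int(any(S[x] and eta(x)[i] for x in range(len(S)))) for i in range(gamma)]

def nexts(S, inp, delta):
    k = len(S)
    return [int(any(S[x] and delta(x, inp) == j for x in range(k))) for j in range(k)]

# Tiny example: modulo-3 counter with Moore output "state is 0".
k, gamma = 3, 1
delta = lambda z, i: (z + i) % k
eta = lambda z: [1 if z == 0 else 0]

S = code(0, k)                                       # h^0.S = code(0)
for inp in [1, 1, 1, 0]:
    assert out(S, eta, gamma) == eta(S.index(1))     # claim 1 of the simulation
    S = nexts(S, inp, delta)                         # claim 2: S stays one-hot
print(S)                                             # back to code(0)
```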
The previous construction has the disadvantage that the propagation delay of
circuit out tends to contribute to the cycle time of the circuitry controlled by
the automaton. Therefore, in the construction of Fig. 51 the outputs out(h) are precomputed and clocked into a register outR.
[Fig. 51. Implementation of a Moore automaton with precomputed outputs]
The following lemma states correctness of the construction shown in Fig. 51.
Lemma 3.12 (Moore automaton with precomputed outputs). For h =
h^t and t ≥ 0, let
h.S = code(z) ∧ δA (z, in(h)) = z′ .
Then,
h′.S = code(z′) ∧ h′.outR = η(z′) .
We have reset(h) = 0, and hence, Sin(h) = nexts(h). From above we have
h′.S = nexts(h) = code(z′)
and
h′.outR = out(h) = η(z′) .
[Fig. 52. Simple implementation of a Mealy automaton]
[Fig. 53. Separate realization of Moore and Mealy components]
We describe two optimizations that can reduce the delay of outputs of Mealy
automata. The first one is trivial. We divide the output components out[j] into
two classes: i) Mealy components η[k](z, in), which have a true dependency
on the input variables, and ii) Moore components that can be written as
η[k](z), i.e., that only depend on the current state. Suppose we have α Mealy
components and β Moore components with γ = α + β. Obviously, one can
precompute the Moore components as in a Moore automaton and realize the
Mealy components as in the previous construction of Mealy automata. The
resulting construction is shown without further correctness proof in Fig. 53.
However, quite often, more optimization is possible since Mealy compo-
nents usually depend only on very few input bits of the automaton. As an
example, consider a Mealy output η(z, in)[j] = f (z, in[1 : 0]) depending only on the two input bits in[1 : 0].
For x, y ∈ B, we derive Moore outputs fx,y (z) that precompute η(z, in)[j] if
in[1 : 0] = xy:
fx,y (z) = f (z, xy) .
Output η(z, in)[j] in this case is computed as
η(z, in)[j] = fin[1:0] (z) .
[Fig. 54. Partial precomputation of a Mealy output depending on two input bits]
Now, we precompute automata outputs fx,y (Sin(h)) and store them in regis-
ters fx,y R as shown in Fig. 54.
As for precomputed Moore signals, one shows
h.S = code(z) → h.fx,y R = fx,y (z) .
For the output outα [j] of the multiplexer tree, we conclude
outα [j](h) = h.fin[1:0] R
= fin[1:0] (z)
= η(z, in[1 : 0])[j] .
This construction has the advantage that only the multiplexers contribute to
the delay of the control signals generated by the automaton. In general, for
Mealy signals which depend on k input bits, we have k levels of multiplexers.
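The following Python sketch (our illustration, not the book's notation) mimics the partial precomputation of Fig. 54 for a Mealy output depending on two input bits: the four Moore functions f_{xy} are evaluated on the next state and stored in registers, and the actual output is selected by a two-level multiplexer tree driven by in[1:0].

```python
# Sketch of partial precomputation of a Mealy output depending on in[1:0].
# f(z, xy) is an arbitrary (illustrative) Mealy component.

def f(z, xy):
    return (z + 2 * xy[0] + xy[1]) % 2

def precompute(next_z):
    """Values clocked into the registers f_{xy}R at the end of the cycle."""
    return {(x, y): f(next_z, (x, y)) for x in (0, 1) for y in (0, 1)}

def mux_tree(fxy_regs, in10):
    """Two multiplexer levels select f_{in[1:0]}; only mux delay is in this path."""
    in1, in0 = in10
    low = fxy_regs[(0, in0)]          # first level: select on in[0]
    high = fxy_regs[(1, in0)]
    return high if in1 else low       # second level: select on in[1]

z_next = 5
regs = precompute(z_next)
for in10 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert mux_tree(regs, in10) == f(z_next, in10)
print("mux tree reproduces the Mealy output")
```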
4
Nine Shades of RAM
[Fig. 55. Symbol for an (n, a)-SRAM]
h.S : Ba → Bn .
∀x : h0 .S(x) ∈ Bn .
The output of the RAM is the register content selected by the address input:
Sout(h) = h.S(Sa(h)) .
For addresses x ∈ Ba we define the next state transition function for SRAM
as
h′.S(x) = Sin(h)   if Sa(h) = x ∧ Sw(h) = 1
          h.S(x)   otherwise .
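As a reference for the constructions that follow, here is a tiny Python model of the (n, a)-SRAM specification (our own sketch): the state maps a-bit addresses to values, the output is the register selected by Sa, and a write updates exactly the addressed register.

```python
# Executable model of the (n, a)-SRAM specification.
class SRAM:
    def __init__(self, a_bits, init=0):
        self.S = {x: init for x in range(2 ** a_bits)}   # h.S : B^a -> B^n

    def sout(self, sa):
        return self.S[sa]                                # Sout(h) = h.S(Sa(h))

    def step(self, sa, sin, sw):
        """Next state: h'.S(x) = Sin if x = Sa and Sw, else h.S(x)."""
        if sw:
            self.S[sa] = sin

ram = SRAM(a_bits=3)
ram.step(sa=5, sin=0b1010, sw=1)
print(ram.sout(5), ram.sout(4))   # 10 0
```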
The implementation of an SRAM is shown in Fig. 56. We use 2a many n-bit
registers R(i) with i ∈ [0 : 2a − 1] and an a-decoder with outputs X[2a − 1 : 0]
satisfying
X[i] = 1 ↔ i = Sa(h) .
The inputs of register R(i) are defined as
h.R(i) in = Sin(h)
h.R(i) ce = Sw(h) ∧ X[i] .
[Fig. 56. Implementation of an (n, a)-SRAM from an a-decoder, registers R(i), and an (n, 2^a)-OR tree]
h′.R(i) = Sin(h)   if i = Sa(h) ∧ Sw(h)
          h.R(i)   otherwise ,
and for the output of the OR tree we get
Sout(h) = h.R(Sa(h)) .
As a result, when we choose
h.S(x) = h.R(x)
as the defining equation of our abstraction relation, the presented construction
implements an SRAM.
[Fig. 57. Symbol of an (n, a)-ROM]
[Fig. 58. Implementation of an (n, a)-ROM: an a-decoder selects the hard-wired line S(ia ), which is collected by an (n, 2^a)-OR tree]
Sout(h) = h.S(Sa(h)) .
[Fig. 59. Symbol of an (n, a)-multi-bank RAM]
[Fig. 60. Byte i of the modify circuit: a multiplexer controlled by bw[i] selects between byte(i, x) and byte(i, y)]
For the definition of the next state, we first introduce the auxiliary function modify.
This function selects bytes from two provided strings according to the provided
byte write signals. Let y, x ∈ B8k and bw ∈ Bk . Then, for all i ∈ [0 : k − 1],
byte(i, modify(x, y, bw)) = byte(i, y)   if bw[i] = 1
                            byte(i, x)   if bw[i] = 0 ,
i.e., for all i with active bw[i] one replaces byte i of x by byte i of y. The next
state of multi-bank RAM is then defined as
h′.S(x) = modify(h.S(x), Sin(h), Sbw(h))   if x = Sa(h)
          h.S(x)                           otherwise .
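The byte-select function modify and the resulting next-state function are easy to state in Python; the following sketch (ours) uses integers for 8k-bit values and is only meant to illustrate the definition.

```python
# Sketch of modify(x, y, bw): replace byte i of x by byte i of y whenever bw[i] = 1.
def byte(i, v):
    return (v >> (8 * i)) & 0xFF

def modify(x, y, bw, k):
    res = 0
    for i in range(k):
        src = y if (bw >> i) & 1 else x
        res |= byte(i, src) << (8 * i)
    return res

# Multi-bank RAM next state: only the addressed line is modified.
def mbram_step(S, sa, sin, sbw, k):
    S = dict(S)
    S[sa] = modify(S[sa], sin, sbw, k)
    return S

S = {0: 0x11223344, 1: 0xAABBCCDD}
S = mbram_step(S, sa=1, sin=0x00000099, sbw=0b0001, k=4)   # write only byte 0
print(hex(S[1]))   # 0xaabbcc99
```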
As shown in Fig. 60, each byte of the output of a modify circuit is simply
computed by an 8-bit wide multiplexer.
The straightforward construction of a multi-bank RAM uses k separate so
called banks. These are (8, a)-RAMs S (i) for i ∈ [0 : k − 1]. For each i, bank
S (i) is wired as shown in Fig. 61:
[Fig. 61. Bank i of an (n, a)-multi-bank RAM]
S (i) a = Sa(h)
S (i) in = byte(i, Sin(h))
S (i) out = byte(i, Sout(h))
S (i) w = Sbw(h)[i] .
For the new state of the multi-bank RAM and address x = Sa(h), we have
[Fig. 62. Symbol of an (n, a)-CS RAM]
As a result, we have
The symbol of an (n, a)-cache state RAM or CS RAM is shown in Fig. 62.
This type of RAM is used later for holding the status bits of caches. It has
two extra inputs:
• a control signal Sinv – on activation, a special value is forced into all
registers of the RAM. Later, we will use this to set a value that indicates
that all cache lines are invalid2 and
• an n-bit input Svinv providing this special value. This input is usually
wired to a constant value in Bn .
Activation of Sinv takes precedence over ordinary write operations:
2
I.e., not a copy of meaningful data in our programming model. We explain this
in much more detail later.
[Fig. 63. Modified register R(i) of an (n, a)-CS RAM: a multiplexer controlled by Sinv selects between Sin and Svinv]
[Fig. 64. Symbol of an (n, a)-SPR RAM]
h′.S(x) = Svinv(h)   if Sinv(h) = 1
          Sin(h)     if x = Sa(h) ∧ Sw(h) = 1 ∧ Sinv(h) = 0
          h.S(x)     otherwise .
The changes in the implementation for each register R(i) are shown in Fig. 63.
The clock enable is also activated by Sinv and the data input comes from a
multiplexer:
An (n, a)-SPR RAM as shown in Fig. 64 is used for the realization of special
purpose register files and in the construction of fully associative caches. It
behaves both as an (n, a)-RAM and as a set of 2a many n-bit registers. It has
the following inputs and outputs:
[Fig. 65. Implementation of an (n, a)-SPR RAM: each register R(i) has data inputs Sin and Sdin[i], clock enables derived from Sw, X[i], and Sce[i], and output Sdout[i]; Sout is collected by an (n, 2^a)-OR tree]
Sout(h) = h.S(Sa(h))
Sdout(h)[i] = h.S(bina (i)) .
Register updates to R(i) can be performed either by Sin for regular writes
or by Sdin[i] if the special clock enables are activated. Special writes take
precedence over ordinary writes:
h′.S(x) = Sdin(h)[x]   if Sce(h)[x] = 1
          Sin(h)       if Sce(h)[x] = 0 ∧ Sw(h) = 1 ∧ x = Sa(h)
          h.S(x)       otherwise .
A single address decoder with outputs X[i] and a single OR-tree suffices.
Figure 65 shows the construction satisfying
[Fig. 66. Symbol of an (n, a)-GPR RAM]
An (n, a)-GPR RAM is a three-port RAM that we use later for general purpose
registers. As shown in Fig. 66, it has the following inputs and outputs:
• an n-bit data input Sin,
• three a-bit address inputs Sa, Sb, Sc,
• a write signal Sw, and
• two n-bit data outputs Souta, Soutb.
As for ordinary SRAM, the state of the 3-port RAM is a mapping
h.S : Ba → Bn .
Souta(h) = h.S(Sa(h))
Soutb(h) = h.S(Sb(h)) .
[Fig. 67. Implementation of an (n, a)-GPR RAM: three a-decoders with outputs Z[i], X[i], Y [i] and two (n, 2^a)-OR trees producing Souta and Soutb]
[Fig. 68. Symbol of an (n, a)-2-port RAM]
A general (n, a)-2-port RAM is shown in Fig. 68. This is a RAM with the
following inputs and outputs:
• two data inputs Sina, Sinb,
• two addresses Sa, Sb,
• two write signals Swa, Swb.
The data outputs are determined by the addresses as in the 3-port RAM for
general purpose registers:
Souta(h) = h.S(Sa(h))
Soutb(h) = h.S(Sb(h)) .
The 2-port RAM allows simultaneous writes to two addresses. In case both
write signals are active and both addresses are equal we have to
resolve the conflict: the write via the a port takes precedence:
h′.S(x) = Sina(h)   if x = Sa(h) ∧ Swa(h) = 1
          Sinb(h)   if x = Sb(h) ∧ Swb(h) = 1 ∧ ¬(x = Sa(h) ∧ Swa(h) = 1)
          h.S(x)    otherwise .
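A short Python sketch of the conflict resolution (our paraphrase): a simultaneous write to the same address through both ports is resolved in favor of port a.

```python
# Next-state function of the (n, a)-2-port RAM with port-a precedence.
def twoport_step(S, sina, sa, swa, sinb, sb, swb):
    S = dict(S)
    if swb:
        S[sb] = sinb          # tentative write via port b ...
    if swa:
        S[sa] = sina          # ... overridden by port a on the same address
    return S

S = {x: 0 for x in range(8)}
S = twoport_step(S, sina=1, sa=3, swa=1, sinb=2, sb=3, swb=1)
print(S[3])   # 1 -- port a wins the conflict
```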
X[i] = 1 ↔ i = Sa(h)
Y [i] = 1 ↔ i = Sb(h) .
Figure 69 shows the changes to each register R(i) . Clock enable is activated in
case a write via the a address or via the b address occurs. The input is chosen
from the corresponding data input by a multiplexer:
[Fig. 69. Modified register R(i) of an (n, a)-2-port RAM: a multiplexer controlled by X[i] ∧ Swa selects between Sina and Sinb]
[Fig. 70. Bank i of an (n, a)-2-port multi-bank RAM]
[Fig. 71. Implementation of a 2-port ROM: hard-wired lines S(ia ) selected by two a-decoders and collected by two (n, 2^a)-OR trees]
Souta(h) = h.S(Sa(h))
Soutb(h) = h.S(Sb(h)) .
Write operations, however, only affect addresses larger than 0a−r 1r . Moreover,
we only need the writes to be performed through port b of the memory (port
a will only be used for instruction fetches):
h′.S(x) = modify(h.S(x), Sin(h), Sbw(h))   if x[a − 1 : r] ≠ 0^{a−r} ∧ x = Sb(h)
          h.S(x)                           otherwise .
[Fig. 72. Symbol of an (n, r, a)-2-port multi-bank RAM-ROM]
[Fig. 73. Implementation of an (n, r, a)-2-port multi-bank RAM-ROM from a 2-port multi-bank RAM S1, an (n, r)-ROM S2, zero testers on the high-order address bits, and output multiplexers]
Exactly as the name indicates, an (n, a)-2-port CS RAM is a RAM with all
features of a 2-port RAM and a CS RAM . Its symbol is shown in Fig. 74.
Inputs and outputs are:
• two data inputs Sina, Sinb,
• two addresses Sa, Sb,
• two write signals Swa, Swb,
• a control signal Sinv, and
• an n-bit input Svinv providing a special data value.
Address decoding, data output generation, and execution of writes are as for
2-port RAMs. In write operations, activation of signal Sinv takes precedence
over ordinary writes.
[Fig. 75. Modified register R(i) of an (n, a)-2-port CS RAM: multiplexers controlled by X[i] ∧ Swa and Sinv select among Sina, Sinb, and Svinv]
The changes in the implementation for each register R(i) are shown in Fig. 75.
The signals generated are:
5
Arithmetic Circuits
For later use in processors with the MIPS instruction set architecture (ISA),
we construct several circuits: as the focus in this book is on correctness and not
so much on efficiency of the constructed machine, only the most basic adders
and incrementers are constructed in Sect. 5.1. For more advanced construc-
tions see, e.g., [12]. An arithmetic unit (AU) for binary and two’s complement
numbers is studied in Sect. 5.2. In our view, understanding the correctness
proofs of this section is a must for anyone wishing to understand fixed point
arithmetic.
With the help of the AU we construct in Sect. 5.3 an arithmetic logic
unit (ALU) for the MIPS ISA in a straightforward way. Differences to [12]
are simply due to differences in the encoding of ALU operations between the
MIPS ISA considered here and the DLX ISA considered in [12].
The shift unit considered in Sect. 5.4 is also basically from [12]. Shift
units are not completely trivial. We recommend covering this material in the
classroom.
As branch instructions in the DLX and the MIPS instruction set archi-
tectures are treated in quite different ways, the new Sect. 5.5 with a branch
condition evaluation unit had to be included here.
[Fig. 76. Symbol of an n-adder]
[Fig. 77. Recursive construction of a carry chain adder]
It is often convenient to ignore the carry out bit cn of the n-adder and to talk
only about the sum bits s[n − 1 : 0]. With the help of Lemma 2.10, we can
then rewrite the specification of the n-adder as
Throwing away the carry bit cn and using Lemma 2.10, we can rewrite this
as
⟨s⟩ = (⟨a⟩ + c0 ) mod 2^n .
We use the symbol from Fig. 78 for n-incrementers. Obviously, incrementers
can be constructed from n-adders by tying the b input to 0n . As shown in
[Fig. 78. Symbol of an n-incrementer]
[Fig. 79. Recursive construction of a carry chain incrementer]
Sect. 3.2, a full adder whose b input is tied to zero can be replaced with a
half adder. This yields the construction of carry chain incrementers shown in
Fig. 79.
[Fig. 80. Symbol of an n-arithmetic unit]
Data Paths
We introduce special symbols +n and −n to denote addition and subtraction
of n-bit binary numbers mod 2^n :
a +n b = binn ((⟨a⟩ + ⟨b⟩) mod 2^n)
a −n b = binn ((⟨a⟩ − ⟨b⟩) mod 2^n) .
The following lemma asserts that, for signed and unsigned numbers, the sum
bits s can be computed in exactly the same way.
Lemma 5.1 (computing sum bits). Let S denote the exact result, i.e.,
S = [a] ± [b] for u = 0 and S = ⟨a⟩ ± ⟨b⟩ for u = 1, where + is taken for
sub = 0 and − for sub = 1. Compute the sum bits as
s = a +n b   if sub = 0
    a −n b   if sub = 1 ,
then
[s] = S tmod 2^n   if u = 0
⟨s⟩ = S mod 2^n    if u = 1 .
Proof. For u = 1 this follows directly from the definitions. For u = 0 we have
from Lemma 2.14 and Lemma 2.2:
[s] = (S tmod 2n ) .
The main data paths of an n-AU are shown in Fig. 81. The following lemma
asserts that the sum bits are computed correctly.
d = b ⊕ sub = b    if sub = 0
              ¬b   if sub = 1 .
[Fig. 81. Data paths of an n-arithmetic unit]
From the specification of an n-adder, Lemma 2.10, and the subtraction algo-
rithm for binary numbers (Lemma 2.15), we conclude
⟨s⟩ = (⟨a⟩ + ⟨b⟩) mod 2^n        if sub = 0
      (⟨a⟩ + ⟨¬b⟩ + 1) mod 2^n   if sub = 1
    = (⟨a⟩ + ⟨b⟩) mod 2^n        if sub = 0
      (⟨a⟩ − ⟨b⟩) mod 2^n        if sub = 1 .
Application of binn (·) to both sides completes the proof of the lemma.
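The data paths of Fig. 81 can be checked exhaustively for small n. The following Python sketch (ours) computes the sum bits with an adder fed by d = b ⊕ sub and carry-in sub and verifies the claim of Lemma 5.1 for n = 4.

```python
# Exhaustive check of Lemma 5.1 for n = 4: the same sum bits are correct
# for binary (u = 1) and two's complement (u = 0) operands.
N = 4

def twoc(x):                       # [x] for an n-bit pattern given as an integer
    return x - (1 << N) if x >> (N - 1) else x

def tmod(S):                       # S tmod 2^n: the representative of S in T_n
    return ((S + (1 << (N - 1))) % (1 << N)) - (1 << (N - 1))

for a in range(1 << N):
    for b in range(1 << N):
        for sub in (0, 1):
            d = b ^ ((1 << N) - 1) if sub else b       # d = b XOR sub (bitwise)
            s = (a + d + sub) % (1 << N)                # n-adder with carry-in sub
            S_bin = a - b if sub else a + b             # exact result, u = 1
            S_twoc = twoc(a) - twoc(b) if sub else twoc(a) + twoc(b)   # u = 0
            assert s == S_bin % (1 << N)                # <s> = S mod 2^n
            assert twoc(s) == tmod(S_twoc)              # [s] = S tmod 2^n
print("Lemma 5.1 checked for n =", N)
```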
Negative Bit
We start with the case u = 0, i.e., with two’s complement numbers. We have
S = [a] ± [b]
= [a] + [d] + sub
≤ 2n−1 − 1 + 2n−1 − 1 + 1
= 2n − 1,
S ≥ −2n−1 − 2n−1
= −2n .
Thus,
S ∈ Tn+1 .
According to Lemma 2.14 we use sign extension to extend operands to n + 1
bits:
[a] = [an−1 a]
[d] = [dn−1 d] .
sn = an−1 ⊕ dn−1 ⊕ cn ,
and conclude
S = [s[n : 0]].
Again by Lemma 2.14 this is negative if and only if the sign bit sn is 1:
S < 0 ↔ sn = 1 .
For the case u = 1, i.e., for binary numbers, a negative result can only occur
in the case of subtraction, i.e., if sub = 1. In this case we argue along the lines
of the correctness proof for the subtraction algorithm:
S = ⟨a⟩ − ⟨b⟩
  = ⟨a⟩ − [0b]
  = ⟨a⟩ + [1¬b] + 1
  = ⟨a⟩ + ⟨¬b⟩ − 2^n + 1
  = ⟨cn s[n − 1 : 0]⟩ − 2^n
  = 2^n · (cn − 1) + ⟨s[n − 1 : 0]⟩ .
Thus,
u = 1 → neg = sub ∧ ¬cn ,
and together with Lemma 5.3 we can define the negative bit computation.
Lemma 5.4 (negative bit).
Overflow Bit
We compute the overflow bit only for the case of two’s complement numbers,
i.e., when u = 0. We have
We claim
S ∈ Tn ↔ cn−1 = cn .
If cn = cn−1 we obviously have S = [s], thus S ∈ Tn . If cn = 1 and cn−1 = 0
we have
−2n + [s] ≤ −2n + 2n−1 − 1 = −2n−1 − 1 < −2n−1
and if cn = 0 and cn−1 = 1, we have
cn ≠ cn−1 ↔ cn ⊕ cn−1 = 1 ,
and thus we compute the overflow bit as
ovf = ¬u ∧ (cn ⊕ cn−1) .
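The claimed formulas for the negative and overflow bits can likewise be checked exhaustively for small n; the sketch below (ours) compares them against the exact result.

```python
# Exhaustive check (n = 4) of the negative and overflow bit computation:
#   u = 0:  neg = a_{n-1} XOR d_{n-1} XOR c_n,  ovf = c_n XOR c_{n-1}
#   u = 1:  neg = sub AND NOT c_n
N = 4

def twoc(x):
    return x - (1 << N) if x >> (N - 1) else x

for a in range(1 << N):
    for b in range(1 << N):
        for sub in (0, 1):
            d = b ^ ((1 << N) - 1) if sub else b
            total = a + d + sub
            cn = (total >> N) & 1                         # carry out of position n-1
            cn1 = ((a % (1 << (N - 1))) + (d % (1 << (N - 1))) + sub) >> (N - 1) & 1
            S0 = twoc(a) - twoc(b) if sub else twoc(a) + twoc(b)   # exact, u = 0
            S1 = a - b if sub else a + b                            # exact, u = 1
            an1, dn1 = (a >> (N - 1)) & 1, (d >> (N - 1)) & 1
            assert (an1 ^ dn1 ^ cn) == (1 if S0 < 0 else 0)         # neg, u = 0
            assert (sub & (1 - cn)) == (1 if S1 < 0 else 0)         # neg, u = 1
            in_range = -(1 << (N - 1)) <= S0 < (1 << (N - 1))
            assert (cn ^ cn1) == (0 if in_range else 1)             # ovf, u = 0
print("neg and ovf formulas checked for n =", N)
```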
[Fig. 82. Symbol of an n-arithmetic logic unit]
The results that must be generated are specified in Table 6. There are three
groups of operations:
• Arithmetic operations.
• Logical operations. At first sight, the result b[n/2 − 1 : 0]0^{n/2} might appear
odd. This ALU function is later used to load the upper half of an n-bit
constant using the immediate fields of an instruction.
• Test and set instructions. They compute an n-bit result 0n−1 z where only
the last bit is of interest. The result of these instructions can be computed
by performing a subtraction in the AU and then testing the negative bit.
Figure 83 shows the fairly obvious data paths of an n-ALU. The missing signals
are easily constructed. We subtract if af [1] = 1. For test and set operations
with af [3] = 1, output z is simply the negative bit neg. The overflow bit can
only differ from zero if we are doing an arithmetic operation. Thus, we have
[Fig. 83. Data paths of an n-arithmetic logic unit]
sub = af [1]
z = neg
u = af [0]
ovfalu = ovf ∧ af [3] ∧ af [2] .
or, equivalently, as
Proof.
j + i = j − (−i)
≡ j − (n − i) mod n
Here, we only build shifters for numbers n which are a power of two:
n = 2k , k∈N.
Basic building blocks for all following shifter constructions are (n, b)-cyclic
left shifters or short (n, b)-SLCs for b ∈ [1 : n − 1]. They have
• input a[n − 1 : 0] ∈ Bn for the data to be shifted,
• input s ∈ B indicating whether to shift or not,
• data outputs a′[n − 1 : 0] ∈ Bn satisfying
a′ = slc(a, b)   if s = 1
     a           otherwise
[Fig. 84. Implementation of an (n, b)-SLC ((n, b)-cyclic left shifter)]
[Fig. 85. Implementation of an n-SLC from a chain of (n, 2^i)-SLCs controlled by the bits b[i] of the shift distance, with intermediate results r(i)]
r = slc(a, b) .
[Fig. 86. Implementation of an n-SRLC (cyclic n-right-left shifter)]
which follows from the subtraction algorithm for binary numbers (Lemma
2.15).
The output d of the multiplexer then satisfies
⟨d⟩ = ⟨b⟩               if f = 0
      (n − ⟨b⟩) mod n   if f = 1 .
[Fig. 87. Symbol of an n-shift unit]
[Fig. 88. Right-left shifter of an n-shift unit]
[Fig. 89. Mask computation of an n-shift unit]
[Fig. 90. Result computation of an n-shift unit]
With i = ⟨b⟩ the output r of the right-left shifter satisfies
r = a[n − i − 1 : 0] a[n − 1 : n − i]   if sf [1] = 0
    a[i − 1 : 0] a[n − 1 : i]           if sf [1] = 1 .
For each index j ∈ [0 : n − 1], the multiplexer in Fig. 90 replaces the shifter
output r[j] by the f ill bit if this is indicated by the mask bit mask[j]. As a
result we get
sures = a[n − i − 1 : 0] fill^i   if sf [1] = 0
        fill^i a[n − 1 : i]       if sf [1] = 1 .
By setting
f ill = sf [0] ∧ an−1 ,
we conclude
sures = sll(a, i)   if sf = 00
        srl(a, i)   if sf = 10
        sra(a, i)   if sf = 11 .
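The composition of cyclic shifter, mask, and fill bit can be mirrored in software. The Python sketch below (our own names and bit ordering, with index 0 the least significant bit) implements the cyclic shifter, the mask, and the result computation for n = 8 and checks it against Python's shift operators.

```python
# Sketch of the n-shift unit: cyclic shifter + mask + fill bit (n = 8 here).
N = 8

def bits(x):
    return [(x >> i) & 1 for i in range(N)]           # index 0 = least significant

def from_bits(bs):
    return sum(b << i for i, b in enumerate(bs))

def slc(a, i):                                         # cyclic left shift by i
    return ((a << i) | (a >> (N - i))) & ((1 << N) - 1) if i else a

def shift_unit(a, i, sf):
    right = sf >> 1 & 1
    arith = sf & 1
    r = slc(a, (N - i) % N) if right else slc(a, i)    # right shift = left by n - i
    keep = [(j >= i) if not right else (j < N - i) for j in range(N)]   # mask bits
    fill = arith and (a >> (N - 1)) & 1                # fill = sf[0] AND a_{n-1}
    rb = bits(r)
    return from_bits([rb[j] if keep[j] else fill for j in range(N)])

a = 0b10110010
mask_n = (1 << N) - 1
signed_a = a - (1 << N) if (a >> (N - 1)) else a
for i in range(N):
    assert shift_unit(a, i, 0b00) == (a << i) & mask_n          # sll
    assert shift_unit(a, i, 0b10) == a >> i                      # srl (logical)
    assert shift_unit(a, i, 0b11) == (signed_a >> i) & mask_n    # sra (arithmetic)
print("sll, srl, sra checked for n =", N)
```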
[Fig. 91. Symbol of an n-branch condition evaluation unit]
[Fig. 92. Computation of auxiliary signals in an n-branch condition evaluation unit]
6
A Basic Sequential MIPS Machine
We define the basic MIPS instruction set architecture (ISA) without delayed
branch, interrupt mechanism and devices. The first Sect. 6.1 of this chapter is
very short. It contains a very compact summary of the instruction set archi-
tecture (and the assembly language) in the form of tables, which define the
ISA if one knows how to interpret them. In Sect. 6.2 we provide a succinct
and completely precise interpretation of the tables, leaving out only the co-
processor instructions and the system call instruction. From this we derive in
Sect. 6.3 the hardware of a sequential, i.e., not pipelined, MIPS processor and
provide a proof that this processor construction is correct.
This chapter differs from its counterpart in [12] in several ways:
• The ISA is MIPS instead of DLX. Most of the resulting modifications are
already handled in the control logic of the ALU and the shift unit1 .
• The machine implements each instruction in one very long hardware cycle
and uses only precomputed control. It is not meant to be an efficient se-
quential implementation and serves later only as a reference machine. This
turns most portions of the correctness proof into straightforward bookkeep-
ing exercises, which would be terribly boring if presented in the classroom.
We included this bookkeeping only as a help for readers, who want to use
this book as a blueprint for formal proofs.
• Because the byte addressable memory of the ISA is embedded in the imple-
mentation into a 64-bit wide hardware memory, shifters have to be used
both for the load and store operations of words, half words, and bytes.
In [12] the memory is 32 bits wide, the shifters for loads and stores are
present; they must be used for accesses of half words or bytes. However, [12]
provides no proof that with the help of these shifters loads and stores of
half words or bytes work correctly. Subsequent formal correctness proofs
for hardware from [12] as presented in [1, 3, 6] restricted loads and stores
to word accesses, and thus, did not provide these proofs either. We present
1
In contrast to [9] we do not tie register 0 to 0. We also do not consider interrupts
and address translation in this book.
these proofs here; they hinge on the software condition that accesses are
aligned and turn out to be not completely trivial.
6.1 Tables
In the “Effect” row of the tables we use the following shorthands: m =
md (ea(c)) where ea(c) = rs(c)+32 sxtimm(c), rx = gpr(rx(c)) for x ∈ {t, s, d}
(except for the coprocessor instructions, where rd = rd(c) and rt = rt(c)), and
iindex stands for iindex(c)2 . In the table for J-type instructions, R31 stands
for gpr(315 ). Arithmetic operations + and − are modulo 232 . Sign extension
is denoted by sxt and zero extension by zxt.
6.1.1 I-type
opcode Instruction Syntax d Effect
Data Transfer
100 000 lb lb rt rs imm 1 rt = sxt(m)
100 001 lh lh rt rs imm 2 rt = sxt(m)
100 011 lw lw rt rs imm 4 rt = m
100 100 lbu lbu rt rs imm 1 rt = 024 m
100 101 lhu lhu rt rs imm 2 rt = 016 m
101 000 sb sb rt rs imm 1 m = rt[7:0]
101 001 sh sh rt rs imm 2 m = rt[15:0]
101 011 sw sw rt rs imm 4 m = rt
Arithmetic, Logical Operation, Test-and-Set
001 000 addi addi rt rs imm rt = rs + sxt(imm)
001 001 addiu addiu rt rs imm rt = rs + sxt(imm)
001 010 slti slti rt rs imm rt = ([rs] < [sxt(imm)] ? 132 : 032 )
001 011 sltiu sltiu rt rs imm rt = (rs < sxt(imm) ? 132 : 032 )
001 100 andi andi rt rs imm rt = rs ∧ zxt(imm)
001 101 ori ori rt rs imm rt = rs ∨ zxt(imm)
001 110 xori xori rt rs imm rt = rs ⊕ zxt(imm)
001 111 lui lui rt imm rt = imm016
opcode rt Instr. Syntax Effect
Branch
000 001 00000 bltz bltz rs imm pc = pc + ([rs] < 0 ? sxt(imm00) : 432 )
000 001 00001 bgez bgez rs imm pc = pc + ([rs] ≥ 0 ? sxt(imm00) : 432 )
000 100 beq beq rs rt imm pc = pc + (rs = rt ? sxt(imm00) : 432 )
000 101 bne bne rs rt imm pc = pc + (rs = rt ? sxt(imm00) : 432 )
000 110 00000 blez blez rs imm pc = pc + ([rs] ≤ 0 ? sxt(imm00) : 432 )
000 111 00000 bgtz bgtz rs imm pc = pc + ([rs] > 0 ? sxt(imm00) : 432 )
2
Formal definitions for predicates and functions used here are given in Sect. 6.2.
6.1.2 R-type
6.1.3 J-type
[Fig. 93. Visible data structures of MIPS ISA]
In this section we give the precise formal interpretation of the basic MIPS
ISA without the coprocessor instructions and the system call instruction.
A basic MIPS configuration c has only three user visible data structures
(Fig. 93):
• c.pc ∈ B32 – the program counter (PC).
• c.gpr : B5 → B32 – the general purpose register (GPR) file consisting of 32
registers, each 32 bits wide. For register addresses x ∈ B5 , the content of
general purpose register x in configuration c is denoted by c.gpr(x) ∈ B32 .
• c.m : B32 → B8 – the processor memory. It is byte addressable; addresses
have 32 bits. Thus, for memory addresses a ∈ B32 , the content of memory
location a in configuration c is denoted by c.m(a) ∈ B8 .
Program counter and general purpose registers belong to the central process-
ing unit (CPU).
Let K be the set of all basic MIPS configurations. A mathematical defini-
tion of the ISA will be given by a function
δ:K→K,
where
c′ = δ(c)
is the configuration reached from configuration c, if the next instruction is
executed. An ISA computation is a sequence (ci ) of ISA configurations with
i ∈ N satisfying
c0 .pc = 032
ci+1 = δ(ci ) ,
i.e., initially the program counter points to address 032 and in each step one
instruction is executed. In the remainder of this section we specify the ISA
simply by specifying function δ, i.e., by specifying c′ = δ(c) for all configura-
tions c.
Recall that for numbers y ∈ N we abbreviate the binary representation
of y with n bits as
yn = binn (y) ,
e.g., 18 = 00000001 and 38 = 00000011. For memories m : B32 → B8 , ad-
dresses a ∈ B32 , and numbers d of bytes, we define the content of d consecutive
memory bytes starting at address a informally by
m1 (a) = m(a)
md+1 (a) = m(a +32 d32 ) ◦ md (a) .
Because all instructions are 4 bytes long, one requires that instructions are
aligned on 4 byte boundaries 3 or, equivalently, that
c.pc[1 : 0] = 00 .
The six high order bits of the current instruction are called the op-code:
opc(c) = I(c)[31 : 26] .
There are three instruction types: R-, J-, and I-type. The current instruction
type is determined by the following predicates:
3
In case this condition is violated a so called misalignment interrupt is raised. Since
we do not treat interrupts in our construction, we require all ISA computations
to have only aligned instructions.
31 26 25 21 20 16 15 0
I opc rs rt imm
31 26 25 21 20 16 15 11 10 6 5 0
R opc rs rt rd sa f un
31 26 25 0
J opc iindex
Fig. 94. Types and fields of MIPS instructions
Depending on the instruction type, the bits of the current instruction are sub-
divided as shown in Fig. 94. Register addresses are specified in the following
fields of the current instruction:
rs(c) = I(c)[25 : 21]
rt(c) = I(c)[20 : 16]
rd(c) = I(c)[15 : 11] .
The remaining fields are
f un(c) = I(c)[5 : 0]
sa(c) = I(c)[10 : 6]
imm(c) = I(c)[15 : 0]
iindex(c) = I(c)[25 : 0] .
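For readers who want to experiment, a small Python sketch (ours) extracts these fields from a 32-bit instruction word; the bit numbering follows Fig. 94.

```python
# Sketch: extracting the MIPS instruction fields of Fig. 94 from a 32-bit word.
def field(I, hi, lo):
    return (I >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(I):
    return {
        "opc":    field(I, 31, 26),
        "rs":     field(I, 25, 21),
        "rt":     field(I, 20, 16),
        "rd":     field(I, 15, 11),   # R-type only
        "sa":     field(I, 10, 6),    # R-type only
        "fun":    field(I, 5, 0),     # R-type only
        "imm":    field(I, 15, 0),    # I-type only
        "iindex": field(I, 25, 0),    # J-type only
    }

# addi r3 r1 7  (opcode 001000)
I = (0b001000 << 26) | (1 << 21) | (3 << 16) | 7
f = decode(I)
print(f["opc"], f["rs"], f["rt"], f["imm"])   # 8 1 3 7
```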
In case of sign extension, Lemma 2.14 guarantees that the value of the constant
interpreted as a two’s complement number does not change:
[sxtimm(c)] = [imm(c)] .
• Branches are of I-Type and are recognized by the three leading bits of the
opcode:
6.2.3 ALU-Operations
We can now go through the ALU-operations in the tables one by one and give
them precise interpretations. We do this for two examples.
add(c)
addi(c)
This table defines functions alures(a, b, af, i) and ovf (a, b, af, i). As we do
not treat interrupts in this book, we use only the first of these functions here.
We observe that in all ALU operations a function of the ALU is performed.
The left operand is always
lop(c) = c.gpr(rs(c)) .
For R-type operations the right operand is the register specified by the rt
field of R-type instructions. For I-type instructions it is the sign extended
immediate operand if opc(c)[2] = I(c)[28] = 0 or zero extended immediate
operand if opc(c)[2] = 1. Thus, we define immediate fill bit ifill (c), extended
immediate constant xtimm(c), and right operand rop(c) in the following way:
ifill (c) = imm(c)[15]   if opc(c)[2] = 0
            0            if opc(c)[2] = 1
          = imm(c)[15] ∧ ¬opc(c)[2]
          = imm(c)[15] ∧ ¬I(c)[28]

xtimm(c) = sxtimm(c)   if opc(c)[2] = 0
           zxtimm(c)   if opc(c)[2] = 1
         = ifill (c)^16 imm(c)

rop(c) = c.gpr(rt(c))   if rtype(c)
         xtimm(c)       otherwise .
Comparing Table 6 with the tables for I-type and R-type instructions we see
that bits af [2 : 0] of the ALU control can be taken from the low order fields of
the opcode for I-type instructions and from the low order bits of the function
field for R-type instructions:
af (c)[2 : 0] = f un(c)[2 : 0]   if rtype(c)
               opc(c)[2 : 0]    otherwise
             = I(c)[2 : 0]      if rtype(c)
               I(c)[28 : 26]    otherwise .
For bit af [3] things are more complicated. For R-type instructions it can be
taken from the function code. For I-type instructions it must only be forced
to 1 for the two test and set operations, which can be recognized by opc(c)[2 :
1] = 01:
af (c)[3] = f un(c)[3]                 if rtype(c)
            ¬opc(c)[2] ∧ opc(c)[1]     otherwise
          = I(c)[3]                    if rtype(c)
            ¬I(c)[28] ∧ I(c)[27]       otherwise .
The i-input of the ALU distinguishes for af [3 : 0] = 0111 between the lui-
instruction of I-type for i = 0 and the nor-instruction of R-type for i = 1.
Thus, we set it to itype(c). The result of the ALU computed with these inputs
is denoted by
alu(c) →
c′.gpr(x) = ares(c)    if x = rdes(c)
            c.gpr(x)   otherwise
c′.m = c.m
c′.pc = c.pc +32 432 .
Shift operations come in two flavors: i) for f un(c)[2] = 0 the shift distance
sdist(c) is an immediate operand specified by the sa field of the instruction.
For f un(c)[2] = 1 the shift distance is specified by the last bits of the register
specified by the rs field:
sdist(c) = sa(c)                 if f un(c)[2] = 0
           c.gpr(rs(c))[4 : 0]   if f un(c)[2] = 1 .
The left operand that is shifted is always the register specified by the rt-field:
slop(c) = c.gpr(rt(c)) .
and the control bits sf [1 : 0] are taken from the low order bits of the function
field:
sf (c) = f un(c)[1 : 0] .
The result of the shift unit computed with these inputs is denoted by
For shift operations the destination register is always specified by the rd field.
Thus, the shift unit operations can be summarized as
su(c) →
c′.gpr(x) = sres(c)    if x = rd(c)
            c.gpr(x)   otherwise
c′.m = c.m
c′.pc = c.pc +32 432 .
Jump and link instructions are used to implement calls of procedures. Besides
setting the PC to the branch target, they prepare the so called link address
and save it in a register. For the R-type instruction jalr, this register is
specified by the rd field. J-type instruction jal does not have an rs field, and
the link address is stored in register 31 (= 15 ). Branch and jump instructions
do not change the memory.
Therefore, for the update of registers in branch and jump instructions, we
have:
jb(c) →
c′.gpr(x) = linkad(c)   if jalr(c) ∧ x = rd(c) ∨ jal(c) ∧ x = 1^5
            c.gpr(x)    otherwise
c′.m = c.m .
byte(i, a) = a[8 · (i + 1) − 1 : 8 · i] .
Proof. Let a ∈ B8 and b ∈ B8·d . Then,
byte(i, a ◦ b) = (a ◦ b)[8 · (i + 1) − 1 : 8 · i]
              = a                             if i = d
                b[8 · (i + 1) − 1 : 8 · i]    if i < d
              = a            if i = d
                byte(i, b)   if i < d .
The state of byte addressable memory with 32-bit addresses is modeled as a
mapping
m : B32 → B8 ,
where for each address x ∈ B32 one interprets m(x) ∈ B8 as the current
value of memory location x. Recall that we defined the content md (x) of d
consecutive locations starting at address x by
m1 (x) = m(x)
md+1 (x) = m(x +32 d32 ) ◦ md (x) .
Load and store operations access a certain number d(c) ∈ {1, 2, 4} of bytes
of memory starting at a so called effective address ea(c). Letters b,h, and w
in the mnemonics define the width: b stands for d = 1 resp. a byte access; h
stands for d = 2 resp. a half word access, and w stands for d = 4 resp. a word
access. Inspection of the instruction tables gives
d(c) = 1   if opc(c)[0] = 0
       2   if opc(c)[1 : 0] = 01
       4   if opc(c)[1 : 0] = 11
     = 1   if I(c)[26] = 0
       2   if I(c)[27 : 26] = 01
       4   if I(c)[27 : 26] = 11 .
Note that the immediate constant is sign extended. Thus, negative offsets can
be realized in the same way as negative branch distances. Effective addresses
are required to be aligned. If we interpret them as binary numbers they have
to be divisible by the width d(c):
d(c) | ⟨ea(c)⟩
or equivalently
d(c) = 2 → ea(c)[0] = 0 ∧ d(c) = 4 → ea(c)[1 : 0] = 00 .
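A short Python sketch (ours) of the width computation from the two low-order opcode bits together with the alignment check; the opcodes follow the I-type table above.

```python
# Sketch: access width d(c) from opc[1:0] and the alignment requirement d(c) | <ea(c)>.
def width(opc):
    if opc & 1 == 0:            # opc[0] = 0   -> byte access
        return 1
    if opc & 0b11 == 0b01:      # opc[1:0] = 01 -> half word
        return 2
    return 4                    # opc[1:0] = 11 -> word

def aligned(ea, d):
    return ea % d == 0          # equivalently: the low-order log2(d) bits of ea are 0

print(width(0b100011), aligned(ea=8, d=4))   # lw: 4 True
print(width(0b100001), aligned(ea=3, d=2))   # lh: 2 False
```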
Stores
A store instruction takes the low order d(c) bytes of the register specified by
the rt field and stores them as md(c) (ea(c)). The PC is incremented by 4 (but
we have already defined that on page 128). Other memory bytes and register
values are not changed:
s(c) →
c′.m(x) = byte(i, c.gpr(rt(c)))   if x = ea(c) +32 i32 ∧ i < d(c)
          c.m(x)                  otherwise
c′.gpr = c.gpr
A word of caution in case you plan to enter this into a CAV system: the
first case of the "definition" of c′.m(x) is very well understandable for humans,
but actually it is a shorthand for the following: if
∃i < d(c) : x = ea(c) +32 i32 ,
then update c.m(x) with the hopefully unique i satisfying this condition. In
this case we can compute this i by solving the equation
resp.
x = (ea(c) + i mod 232 ) .
From alignment we conclude
ea(c) + i ≤ 232 − 1 .
Hence,
(ea(c) + i mod 232 ) = ea(c) + i .
And we have to solve
x = ea(c) + i
as
i = x − ea(c) .
This turns the above definition into
c′.m(x) = byte(⟨x⟩ − ⟨ea(c)⟩, c.gpr(rt(c)))   if ⟨x⟩ − ⟨ea(c)⟩ ∈ [0 : d(c) − 1]
          c.m(x)                              otherwise ,
Loads
Loads, like stores, access d(c) bytes of memory starting at address ea(c). The
result is stored in the low order d(c) bytes of the destination register, which is
specified by the rt field of the instruction. This leaves 32 − 8 · d(c) bits of the
destination register to be filled by some bit fill (c). For unsigned loads (with a
suffix “u” in the mnemonics) the fill bit is zero; otherwise it is sign extended
by the leading bit of c.md(c) (ea(c)). In this way a load result lres(c) ∈ B32 is
computed and the general purpose register specified by the rt field is updated.
Other registers and the memory are left unchanged:
u(c) = opc(c)[2]
fill (c) = 0                                  if u(c)
           c.m(ea(c) +32 (d(c) − 1)32 )[7]    otherwise
lres(c) = fill (c)^{32−8·d(c)} c.md(c) (ea(c))
l(c) →
c′.gpr(x) = lres(c)    if x = rt(c)
            c.gpr(x)   otherwise
c′.m = c.m .
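To summarize the memory semantics, here is a small Python sketch (ours) of the byte-addressable memory together with the store effect and the load result defined above; sign extension uses the leading bit of the loaded bytes.

```python
# Sketch of the ISA memory semantics: m_d(x), the store effect, and the load result.
def m_read(m, x, d):
    """m_d(x): d consecutive bytes starting at byte address x (least significant first)."""
    return [m.get((x + i) % 2**32, 0) for i in range(d)]

def store(m, ea, reg, d):
    """c'.m(ea + i) = byte(i, reg) for i < d; all other locations unchanged."""
    m = dict(m)
    for i in range(d):
        m[(ea + i) % 2**32] = (reg >> (8 * i)) & 0xFF
    return m

def load(m, ea, d, unsigned):
    data = m_read(m, ea, d)
    val = sum(b << (8 * i) for i, b in enumerate(data))
    fill = 0 if unsigned else (data[-1] >> 7) & 1        # leading bit of m_d(ea)
    if fill:
        val |= ((1 << (32 - 8 * d)) - 1) << (8 * d)      # fill^{32-8d} prefix
    return val

mem = store({}, ea=0x100, reg=0xFFFFFF80, d=1)           # sb of byte 0x80
print(hex(load(mem, 0x100, 1, unsigned=False)))           # lb : 0xffffff80
print(hex(load(mem, 0x100, 1, unsigned=True)))            # lbu: 0x80
```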
We collect all previous definitions of destination registers for the general pur-
pose register file into
Cad(c) = 1^5     if jal(c)
         rd(c)   if rtype(c)
         rt(c)   otherwise .
Also we collect the data gprin to be written into the general purpose register
file. For technical reasons, we define on the way an intermediate result C that
collects the possible GPR input from arithmetic, shift, and jump instructions:
C(c) = sres(c)     if su(c)
       linkad(c)   if jal(c) ∨ jalr(c)
       ares(c)     otherwise
gprin(c) = lres(c)   if l(c)
           C(c)      otherwise .
Finally, we collect in a general purpose register write signal all situations when
some general purpose register is updated:
Now we can summarize the MIPS ISA in three rules concerning the updates
of PC, general purpose registers, and memory:
31 3 2 0
a a.l a.o
Fig. 95. Line address a.l and offset a.o of a byte address a
c′.pc = btarget(c)       if jbtaken(c)
        c.pc +32 432     otherwise
c′.gpr(x) = gprin(c)    if x = Cad(c) ∧ gprw(c)
            c.gpr(x)    otherwise
c′.m(x) = byte(i, c.gpr(rt(c)))   if x = ea(c) +32 i32 ∧ i < d(c) ∧ s(c)
          c.m(x)                  otherwise .
∀i > 0 : ci .pc[1 : 0] = 00 ∧
ls(ci ) → (d(ci ) = 2 → ea(ci )[0] = 0 ∧
d(ci ) = 4 → ea(ci )[1 : 0] = 00) .
As illustrated in Fig. 95, we divide addresses a ∈ B32 into line address a.l ∈
B29 and offset a.o ∈ B3 as
a.l = a[31 : 3]
a.o = a[2 : 0] .
When we later introduce caches, a.l will be the address of a cache line in the
cache and a.o will denote the offset of a byte in the cache line.
For the time being we will assume that there is a code region CR ⊂ B29
such that all instructions are fetched from addresses with a line address in CR.
We also assume that there is a data region DR ⊂ B29 such that all addresses
of loads and stores have a line address in DR:
∀i : ci .pc.l ∈ CR
∀i : ls(ci ) → ea(ci ).l ∈ DR .
For the time being we will also assume that these regions are disjoint:
DR ∩ CR = ∅ .
Moreover, in the next section we require the code region CR to always include
the addresses which belong to the ROM portion of the hardware memory.
Let m : B32 → B8 be a byte addressable memory like c.m in the ISA specifi-
cation, and let cm : B29 → B64 be a line addressable memory like h.m in the
intended hardware implementation. Let A ⊆ B29 be a set of line addresses like
CR and DR. We define in a straightforward way a relation cm ∼A m stating
that with respect to the addresses in A memory m is embedded in memory
cm by
∀a ∈ A : cm(a) = m8 (a03 ) .
Thus, illustrating with dots, each line of memory cm contains 8 consecutive
bytes of memory m, namely
We are interested in localizing the single bytes of sequences md (x) in the line
addressable memory cm. We are only interested in access widths which are
powers of two and at most 8 bytes6 :
d ∈ {2k | k ∈ [0 : 3]} .
Also we are only interested in so called accesses (x, d) which are aligned in
the following sense: if d = 2k with k ≥ 1 (i.e., to more than a single byte),
then the last k bits of address x must all be zero:
d = 2k ∧ k ≥ 1 → x[k − 1 : 0] = 0k .
For accesses of this nature and i < d, the expressions x.o +32 i32 that are used
in Lemma 6.2 to localize bytes of md (x) in byte addressable memory have
three very desirable properties: i) their numerical value is at most 7, hence,
ii) computing their representative mod 8 in B3 gives the right result, and iii)
all bytes are embedded in the same cache line. This is shown in the following
technical lemma.
Lemma 6.3 (properties of aligned addresses). Let (x, d) be aligned and
i < d. Then,
1. x.o + i ≤ 7,
2. x.o +3 i3 = x.o + i,
3. x +32 i32 = x.l ◦ (x.o +3 i3 ).
6
Double precision floating point numbers are 8 bytes long.
x.o + i = ⟨x[2 : k] ◦ x[k − 1 : 0]⟩ + i   if k ≤ 2
          ⟨x[k − 1 : 0]⟩ + i              if k = 3
        = ⟨x[2 : k] ◦ 0^k⟩ + i            if k ≤ 2
          ⟨0^3⟩ + i                       if k = 3
        ≤ 7 − (2^k − 1) + d − 1           if k ≤ 2
          7                               if k = 3
        = 7 .
3. We write
x = x.l ◦ x.o
i32 = 029 ◦ i3 .
Adding the offset and the line components separately, we get by part 2 of
the lemma
Hence,
(x + i32 mod 232 ) = x.l ◦ (x.o +3 i3 ) .
Applying bin32 ( ) to this equation proves part 3 of the lemma.
∀a ∈ A : cm(a) = m8 (a03 ) .
Then access (a03 , 8) is aligned and Lemma 6.4 can be reformulated for single
bytes:
a ◦ i3 = x = x.l ◦ x.o
and get
byte(x.o, cm(x.l)) = m(x) .
For the opposite direction of the proof we assume
We instantiate
x = x.l ◦ x.o = a ◦ i3 ,
and get for all i < 8 and a ∈ A
We further derive
which implies
cm(a) = m8 (a03 ) .
Finally, we can formulate for aligned accesses (x, d), how the single bytes of
consecutive sequences md (x) are embedded in memory cm.
Proof.
Concatenating bytes we can rewrite the embedding relation for aligned word
accesses.
Lemma 6.7 (embedding for word accesses). Let relation cm ∼A m hold,
x ∈ B32 , x.l ∈ A, and x[1 : 0] = 00. Then,
m4 (x) = cm(x.l)[31 : 0]    if x[2] = 0
         cm(x.l)[63 : 32]   if x[2] = 1 .
sim(c, h) ≡
1. h.pc = c.pc ∧
2. h.gpr = c.gpr ∧
3. h.m ∼CR c.m ∧
4. h.m ∼DR c.m,
i.e., every hardware memory location h.m(a) for a ∈ CR ∪ DR contains the
contents of eight ISA memory locations:
[Fig. 96. Data paths of the sequential MIPS machine: stages IF, ID, M, and WB with the instruction memory port, instruction decoder, nextpc environment, ALU and SU environments, the sh4s and sh4l shifter environments, and the hardware memory m]
6.3.6 Initialization
8
sh4s is a shorthand for “shift for store” and sh4l is a shorthand for “shift for
load”.
[Fig. 97. PC initialization and the instruction port environment]
c0 .gpr = h0 .gpr
c0 .m8 (a000) = h0 .m(a) .
The treatment of the instruction fetch stage is short. The instruction port of
the hardware memory h.m is addressed with bits
ima(h) = h.pc[31 : 3] .
It satisfies
[Fig. 98. The instruction decoder computing predicates p, instruction fields F, Cad, and the function fields af, sf, bf from the instruction I]
Using Lemma 6.7 we conclude that the hardware instruction I(h) fetched by
the circuitry in Fig. 97 is
I(h) = h.m(h.pc[31 : 3])[63 : 32]   if h.pc[2] = 1
       h.m(h.pc[31 : 3])[31 : 0]    if h.pc[2] = 0
     = c.m4 (c.pc[31 : 2]00)   (Lemma 6.7, sim.3)
     = c.m4 (c.pc)             (alignment)
     = I(c) .
Lemma 6.10 (instruction fetch).
I(h) = I(c)
Many functions f of the ISA specification depend only on the current instruction, i.e., they can be written in the form
f (c) = f ′(I(c)) .
For example, rtype(c) = rtype′(I(c)).
Predicates
This trivial transformation, however, gives a straightforward way to construct
circuits for all predicates p(c) from the ISA specification that depend only on
the current instruction:
• Construct a boolean formula for p′. This is always possible by Lemma 2.20.
In the above example
rtype′(I) = ¬I[31] ∧ ¬I[30] ∧ ¬I[29] ∧ ¬I[28] ∧ ¬I[27] ∧ ¬I[26] .
• Translate the formula into a circuit and connect the inputs of the circuit to
the hardware instruction register. The output p(h) of the circuit satisfies
p(h) = p′(I(h))
     = p′(I(c))   (Lemma 6.10)
     = p(c) .
Thus, the instruction decoder produces correct instruction predicates.
Lemma 6.11 (instruction predicates). For all predicates p depending only
on the current instruction:
p(h) = p(c) .
Instruction Fields
All instruction fields F have the form
F (c) = I(c)[m : n] .
Compute the hardware version as
F (h) = I(h)[m : n]
= I(c)[m : n] (Lemma 6.10)
= F (c) .
[Fig. 99. C address computation]
C Address
The output Cad(h) in Fig. 99 computes the address of the destination register
for the general purpose register file. By Lemmas 6.11 and 6.12 it satisfies
Cad(h) = 1^5     if jal(h)
         rd(h)   if rtype(h)
         rt(h)   otherwise
       = 1^5     if jal(c)
         rd(c)   if rtype(c)
         rt(c)   otherwise
       = Cad(c) .
The fill bit ifill (c) is a predicate and imm(c) is a field of the instruction. Thus,
we can compute the extended immediate constant in hardware as
xtimm(h) = xtimm(c) .
Figure 100 shows the computation of the function fields af ,i, sf , and bf for
the ALU, the shift unit, and the branch condition evaluation unit.
[Fig. 100. Computation of function fields for ALU, SU, and BCE]
One shows
i(h) = i(c)
sf (h) = sf (c)
bf (h) = bf (c)
in the same way. Bit af [3](c) is a predicate. Thus, af (h) is computed in the
function decoder as a predicate and we get by Lemma 6.11
af [3](h) = af [3](c) .
Cad(h) = Cad(c)
af (h) = af (c)
i(h) = i(c)
sf (h) = sf (c)
bf (h) = bf (c)
[Fig. 101. General purpose register file]
[Fig. 102. The branch condition evaluation unit and its operands]
[Fig. 103. Incrementing an aligned PC with a 30-incrementer]
Thus, the branch condition evaluation unit produces the correct result.
Lemma 6.16 (BCE result).
bres(h) = bres(c)
Incremented PC
[Fig. 104. The next PC environment]
Next PC Computation
The circuit computing the next PC input, which was left open in Fig. 97 when
we treated the instruction fetch, is shown in Fig. 104.
Predicates p ∈ {jr, jalr, jump, b} are computed in the instruction decoder.
Thus, we have
p(c) = p(h)
by Lemma 6.11.
We compute jbtaken in the obvious way and conclude by Lemma 6.16
jbtaken(h) = jump(h) ∨ b(h) ∧ bres(h)
= jump(c) ∨ b(c) ∧ bres(c)
= jbtaken(c) .
We have by Lemmas 6.15, 6.17, and 6.12
A(h) = c.gpr(rs(c))
pcinc(h) = c.pc +32 432
imm(h)[15]14 imm(h)00 = imm(c)[15]14 imm(c)00
= bdist(c) .
We conclude
btarget(h) = c.pc +32 bdist(c)                      if b(c)
             c.gpr(rs(c))                           if jr(c) ∨ jalr(c)
             (c.pc +32 432 )[31 : 28]iindex(c)00    if j(c) ∨ jal(c)
           = btarget(c) .
Exploiting
reset(h) = 0
and the semantics of register updates we conclude
h′.pc = nextpc(h)
      = btarget(c)      if jbtaken(c)
        c.pc +32 432    otherwise
      = c′.pc .
We begin with the treatment of the execute stage. The ALU environment is
shown in Fig. 105. For the ALU’s left operand, we have
lop(h) = A(h)
= c.gpr(rs(c)) (Lemma 6.15)
= lop(c) .
[Fig. 105. ALU environment]
rop(h) = B(h)            if rtype(h)
         xtimm(h)        otherwise
       = c.gpr(rt(c))    if rtype(c)
         xtimm(c)        otherwise
       = rop(c) .
For the result ares of the ALU, we get
ares(h) = alures(lop(h), rop(h), itype(h), af (h)) (Sect. 5.3)
= alures(lop(c), rop(c), itype(c), af (c)) (Lemma 6.11)
= ares(c) .
We summarize in the following lemma.
Lemma 6.19 (ALU result).
ares(h) = ares(c)
Note that in contrast to previous lemmas the proof of this lemma is not just
bookkeeping; it involves the not so trivial correctness of the ALU implemen-
tation from Sect. 5.3.
[Fig. 106. Shift unit environment]
Using the non trivial correctness of the shift unit implementation from
Sect. 5.4 we get
sres(h) = sres(c)
The value linkad that is saved in jump and link instructions is identical with
the incremented PC pcinc from the next PC environment (Lemma 6.17):
[Fig. 107. Collecting results into signal C]
[Fig. 108. Effective address computation]
[Fig. 109. Shifter for store operations in the sh4s-environment aligning operand B with the 64-bit wide memory]
[Fig. 110. Computation of byte write signals bw[7 : 0] in the sh4s-environment]
ea(h) = ea(c)
Figure 109 shows a shifter construction and the data inputs for the data port
of the hardware memory h.m. The shifter construction serves to align the B
operand with the 64-bit wide memory. A second small shifter construction
generating the byte write signals is shown in Fig. 110.
The initial mask signals are generated as
bw(h) = 08 .
By alignment we have
Similarly we have for the large shifter from Fig. 109 and for i < d(c)
Using B(h) = c.gpr(rt(c)) from Lemma 6.15, we summarize this for the
shifters supporting the store operations.
Lemma 6.23 (shift for store). If s(c) = 1, i.e., if a store operation is per-
formed in ISA configuration c, then
[Fig. 111. Wiring of the hardware memory]
In the memory stage we access port b of hardware memory h.m with the
line address ea.l and the signals bw(h)[7 : 0] and dmin(h) constructed above.
Figure 111 shows wiring of the hardware memory. We proceed to prove the
induction step for h.m.
Lemma 6.24 (hardware memory).
and we will prove the lemma for the data region in this form.
By induction hypotheses, sim.3, and sim.4 we have
h′.m(a) = h.m(a)
        = c.m8 (a000)    (sim.3, sim.4)
        = c′.m8 (a000) .
[Fig. 112. Shifter for load operations in the sh4l-environment]
The only remaining stage is the write back stage. A shifter construction sup-
porting load operations is shown in Fig. 112. Assume l(c) holds, i.e., a load
instruction is executed. Because c.m ∼DR h.m holds by induction hypothesis,
we can use Lemma 6.6 to locate for i < d(c) the bytes to be loaded in h.m
and subsequently – using memory semantics – in dmout(h). Then we simply
track the effect of the two shifters taking into account that a 24-bit left shift
is the same as an 8-bit right shift:
byte(i, c.md(c) (ea(c)))
= byte(ea(c).o + i, h.m(ea(c).l)) (Lemma 6.6)
= byte(ea(h).o + i, h.m(ea(h).l)) (Lemma 6.22)
= byte(ea(h).o + i, dmout(h)) (H)
= byte(ea(h)[1 : 0] + i, G(h)) (H)
= byte(ea(h)[0] + i, H(h)) (H)
= byte(i, J(h)) . (H)
By setting
fill(h) = J(h)[7] ∧ lb(h) ∨ J(h)[15] ∧ lh(h)
we conclude
s(c) ∧ d(c) = 4 → fill(h) = fill(c) .
Fig. 113. Fill bit computation for loads
Similar to the mask smask for store operations we generate a load mask
As shown in Fig. 113 we insert the fill bit at positions i where the correspond-
ing mask bit lmask[i] is zero:
lres(h)[i] = fill(h)    if lmask(h)[i] = 0
             J(h)[i]    if lmask(h)[i] = 1 .
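The load path can be illustrated in the same manner. The sketch below is again
only an illustration with an assumed little-endian byte numbering within a line,
not the hardware of Fig. 112 and Fig. 113: it models the word selection by ea[2],
the two cyclic left shifts controlled by ea[1] and ea[0], and the insertion of
the fill bit at the positions masked out by the load mask.

    # Hypothetical sketch of the sh4l load path: extract d bytes starting at
    # offset ea.o from the 64-bit memory output and sign/zero extend them.
    def sh4l(dmout: int, ea: int, d: int, signed: bool) -> int:
        """dmout: 64-bit line read from memory, ea: effective address,
        d: 1, 2 or 4 bytes (lb/lh/lw), signed: lb/lh vs. lbu/lhu."""
        G = (dmout >> (32 * ((ea >> 2) & 1))) & 0xFFFFFFFF   # select word by ea[2]
        def slc(x, n):                     # 32-bit cyclic left shift
            return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF if n else x
        # a 24-bit left shift is the same as an 8-bit right shift
        H = slc(G, 16 * ((ea >> 1) & 1))
        J = slc(H, 24 * (ea & 1))
        lmask_bits = 8 * d                 # load mask: the low d bytes are kept
        val = J & ((1 << lmask_bits) - 1)
        fill = (val >> (lmask_bits - 1)) & 1 if signed else 0   # fill bit
        if fill:                           # insert fill where lmask is 0
            val |= ((1 << 32) - (1 << lmask_bits))
        return val

    if __name__ == "__main__":
        line = 0x1122334455667788
        print(hex(sh4l(line, ea=0x6, d=2, signed=True)))   # halfword at offset 6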
Figure 114 shows the last multiplexer connecting the data input of the general
purpose register file with intermediate result C and the result lres coming from
the sh4l-environment. The write signal gprw of the general purpose register
file and the predicates su, jal, jalr, l controlling the muxes are predicates p
computed in the instruction decoder. By Lemma 6.11 we have for them
p(c) = p(h) .
Fig. 114. Computing the data input of the general purpose register file
Using RAM semantics, induction hypothesis sim.2, and Lemma 6.14 we com-
plete the induction step for the general purpose register file:
h'.gpr(x) = gprin(h)   if gprw(h) ∧ x = Cad(h)
            h.gpr(x)   otherwise
          = gprin(c)   if gprw(c) ∧ x = Cad(c)
            c.gpr(x)   otherwise
          = c'.gpr(x) .
This concludes the proof of Lemma 6.8 as well as the correctness proof of the
entire (simple) processor.
7
Pipelining
In this chapter we deviate from [12] and present somewhat simpler proofs in
the spirit of [6]. Pipelining without speculative instruction fetch introduces
delay slots after branch and jump instructions. The corresponding simple changes
to ISA and reference implementation are presented in Sect. 7.1.
In Sect. 7.2 we use what we call invisible registers to partition the refer-
ence implementation into pipeline stages. Replacing the invisible registers by
pipeline registers and controlling the updates of the pipeline stages by a very
simple stall engine we produce a basic pipelined implementation of the MIPS
ISA. As in [12] and [6] we use scheduling functions which, for all pipeline
stages k and hardware cycles t, keep track of the number of the sequential
instruction I(k, t) which is processed in cycle t in stage k of the pipelined
hardware. The correctness proof intuitively then hinges on two observations:
1. The circuits of stage k in the sequential hardware σ and the pipelined
hardware π are almost identical; the one difference (for the instruction
address ima) is handled by an interesting special case in the proof.
2. If X_π is a signal of circuit stage k of the pipelined machine and X_σ is its
counterpart in the sequential reference machine, then the value of X_π in
cycle t equals the value of X_σ before execution of instruction I(k, t). In
algebra X_π^t = X_σ^{I(k,t)} .
Although we are claiming to follow the simpler proof pattern from [6] the cor-
rectness proof presented here comes out considerably longer than its counter
parts in [12] and [6]. The reason is a slight gap in the proof as presented in [12]:
the second observation above is almost but not quite true. In every cycle it
only holds for the signals which are used in the processing of the instruction
I(k, t) currently processed in stage k. Proofs with slight gaps are wrong1 and
should be fixed. Fixing the gap discussed here is not hard: one formalizes the
concept of signals used for the processing of an instruction and then does the
^1 Just as husbands who are almost never cheating are not true husbands.
bookkeeping, which is lengthy and should not be presented fully in the
classroom. In [6], where the correctness of the pipelined hardware was formally
proven, the author clearly had to fix this problem, but he dealt with it in a
different way: he introduced extra hardware in the reference implementation,
which forces signals that are not used to zero. This makes observation 2
above strictly true.
Forwarding circuits and their correctness are studied in Sect. 7.3. The
material is basically from [12] but we work out the details of the pipe fill
phase more carefully.
The elegant general stall engine in Sect. 7.4 is from [6]. Like in [6], where
the liveness of pipelined processors is formally proven, the theory of scheduling
functions with general stall engines is presented here in much greater detail
than in [12]. The reason for this effort becomes only evident at the very end of
this book: due to possible interference between bus scheduler of the memory
system and stall engines of the processors, liveness of pipelined multi-core
machines is a delicate and nontrivial matter.
7.1 MIPS ISA and Basic Implementation Revisited
What we have presented so far – both in the definition of the ISA and in
the implementation of the processor – was a sequential version of MIPS. For
pipelined machines we introduce one change to the ISA.
So far in an ISA computation (ci ) new program counters ci+1 .pc were
computed by instruction I(ci ) and the next instruction
was fetched with this PC. In the new ISA the instruction fetch after a new
PC computation is delayed by 1 instruction. This is achieved by leaving the
next PC computation unchanged but i) introducing a delayed PC (DPC) c.dpc
which simply stores the PC of the previous instruction and ii) fetching instruc-
tions with this delayed PC. At the start of computations the two program
counters are initialized such that the first two instructions are fetched from
addresses 0^{32} and 4_{32}. Later we always obtain the delayed PC from the value
of the regular PC in the previous cycle:
c^0.dpc = 0^{32}
c^0.pc = 4_{32}
c^{i+1}.dpc = c^i.pc
I(c^i) = c^i.m_4(c^i.dpc) .
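The effect of the delayed PC on instruction fetch can be made concrete with a
few lines of simulation. In the following sketch the function next_pc stands for
the unchanged next PC computation and m4 for the 4-byte instruction fetch; both
are assumed stand-ins for this illustration, not the decoder of Chap. 6.

    # Minimal sketch of the delayed-PC fetch discipline (delay slots).
    def run(m4, next_pc, steps):
        dpc, pc = 0x00000000, 0x00000004          # c^0.dpc = 0^32, c^0.pc = 4_32
        trace = []
        for _ in range(steps):
            instr = m4(dpc)                        # I(c^i) = c^i.m4(c^i.dpc)
            trace.append((dpc, instr))
            dpc, pc = pc, next_pc(pc, instr)       # c^{i+1}.dpc = c^i.pc
        return trace

    if __name__ == "__main__":
        # toy program: the instruction at address 8 jumps to address 0x20
        code = {0x0: "i0", 0x4: "i1", 0x8: "j 0x20", 0xc: "delay-slot", 0x20: "target"}
        m4 = lambda a: code.get(a, "nop")
        nxt = lambda pc, i: 0x20 if i == "j 0x20" else pc + 4
        for a, i in run(m4, nxt, 6):
            print(hex(a), i)
        # the instruction at 0xc (the delay slot) is fetched before the jump takes effect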
The reason for this change is technical and stems from the fact that, in basic
5-stage pipelines, instruction fetch and next PC computation are distributed
over two pipeline stages. The introduction of the delayed PC permits to model
the effect of this in the sequential ISA. In a nutshell, PC and DPC are a tiny
bit of visible pipeline in an otherwise completely sequential programming
model.
The 4 bytes after a jump or branch instruction are called a delay slot,
because the instruction in the delay slot is always executed before the branch
or jump takes effect.
The semantics of jump and branch instructions stays unchanged. This
means that for computation of the link address and of the jump target we
still use the regular PC and not the delayed PC. For instance, for the link
address we have
linkad(c) = c.pc +32 432 .
In case there are no jumps or branches in delay slots2 and the current in-
struction I(ci ) = ci .m4 (ci .dpc) is a jump or branch instruction, we have for
i > 0:
The changes in the simple non pipelined implementation for the new ISA are
completely obvious and are shown in Fig. 115.
The resulting new design σ is a sequential implementation of the MIPS ISA
for pipelined machines. We denote hardware configurations of this machine
by hσ . The simulation relation sim(c, hσ ) from Sect. 6.3.4 is extended with
the obvious coupling for the DPC:
sim(c, hσ ) ≡
^2 A software condition which one has to maintain for the ISA to be meaningful.
1. hσ .pc = c.pc ∧
2. hσ .dpc = c.dpc ∧
3. hσ .gpr = c.gpr ∧
4. hσ .m ∼CR c.m ∧
5. hσ .m ∼DR c.m.
For ISA computations (ct ) of the new pipelined instruction set one shows
in the style of the previous chapter under the same software conditions the
correctness of the modified implementation for the new (and real) instruction
set.
Lemma 7.1 (MIPS with delayed PC). There is an initial ISA configura-
tion c0 such that
∀t ≥ 0 : sim(ct , htσ ) .
Note that absence of jump or branch instructions in the delay slot is necessary
for the ISA to behave in the expected way, but is not needed for the correctness
proof of the sequential MIPS implementation.
When designing processor hardware one tries to solve a fairly well defined
optimization problem that is formulated and studied at considerable length
in [12]. In this text we focus on correctness proofs and only remark that one
tries i) to spend (on average) as few as possible hardware cycles per executed
ISA instruction and ii) to keep the cycle time (as, e.g., introduced in the
detailed hardware model) as small as possible. In the first respect the present
design is excellent. With a single processor one cycle per instruction is hard
to beat. As far as cycle time is concerned, it is a perfect disaster: the circuits
of every single stage contribute to the cycle time.
In a basic 5 stage pipeline one partitions the circuits of the sequential
design into 5 circuit stages cir(i) with i ∈ [0 : 4], such that
• the circuit stages have roughly the same delay which then is roughly 1/5
of the original cycle time and
• connections between circuit stages are as simple as possible.
We have already introduced the stages in Fig. 96 of Chap. 6. That the cycle
times in each stage are roughly equal cannot be shown here, because we have
not introduced a detailed and realistic enough delay model. The interested
reader is referred to [12].
Simplicity of inter-stage connections is desirable, because in pipelined
implementations most of these connections are realized as register stages. And
registers cost money without computing anything new. For a study of how
much relative cost is increased by such registers we refer the reader again
to [12].
We conclude this section by a bookkeeping exercise about the interconnec-
tions between the circuit stages. We stress that we do almost nothing at all
here. We simply add the delayed PC and redraw Fig. 96 according to some
very simple rules:
1. Whenever a signal crosses downwards from one stage to the next we draw a
dotted box around it and rename it (before or after it crosses the boundary).
Exceptions to this rule are the signal ima used for instruction fetch and the
signals rs and rt used to address the GPR file during instruction decode.
For those signals we don’t draw a box, since we do not pipeline them later.
2. We collapse the circuits between stages into circles labelled cir(i).
The result is shown in Fig. 116. We observe two kinds of pipeline stages: i)
circuit stages cir(i) and ii) register stages reg(k) consisting either of registers
or memories of the sequential design or of dotted boxes for renamed signals.
Most of the figure should be self explaining, we add a few remarks:
• Circuit stage cir(1) and register stage reg(1) are the IF stage. cir(1) con-
sists only of the instruction port environment, which is presently read only
and hence behaves like a circuit. Signal I contains the instruction that was
fetched.
• Circuit stage cir(2) and register stage reg(2) are the ID stage. The circuit
stage consists of the instruction decoder and the next PC environment.
Signals A and B have been renamed before they enter circuit stage cir(2).
Signal Bin is only continued under the synonym B, but signal Ain is both
used in the next PC environment and continued under the synonym A.
The signals going from the instruction decoder to the next PC environment
are denoted by i2nextpc:
Register stage 2 contains program counters pc and dpc, the operands A and
B fetched from the GPRs, the incremented PC renamed to link address
linkad, and the signals i2ex going from the instruction decoder to the EX
stage:
Fig. 116. Arranging the sequential MIPS design into pipeline stages
• Circuit stage cir(3) and register stage reg(3) are the execute stage. The
circuit stage comprises the ALU-environment, the shift unit environment,
an incrementer for the computation of linkad, multiplexers for the collection
of ares, sres, and linkad into intermediate result C, an adder for
the computation of the effective address, and the sh4s-environment.
Register stage 3 contains a version C.3 of intermediate result C, the effec-
tive address ea.3, the byte write signals bw.3, the data input dmin for the
hardware memory, and the copy con.3 of the control signals.
• Circuit stage cir(4) and register stage reg(4) are the M stage. The circuit
stage consists only of wires; so we have not drawn it here. Register stage
4 contains a version C.4 of C, the output of the data port dmout.4 as well
as versions con.4 and ea.4 of the control signals and the effective address.
Note that we also have included the hardware memory m itself in this
register stage.
• Circuit stage cir(5) and register stage reg(5) are the WB stage. The circuit
stage contains the sh4l-environment (controlled by ea.4.o) and a multi-
plexer collecting C.4 and result lres of the sh4l-environment into the data
input gprin of the general purpose register file. Register stage 5 consists
of the general purpose register file.
For the purpose of constructing a first pipelined implementation of a MIPS
processor we can simplify this picture even further:
• We distinguish in register stages k only between visible (in ISA) registers
pc, dpc and memories m, gpr on one side and other signals x.k on the
other side.
• Straight connections by wires, which in Fig. 116 are drawn as straight
lines, are now included into the circuits cir(i)3 .
• For k ∈ [1 : 5] circuit stage cir(k) is input for register stage k + 1 and for
k ∈ [1 : 4] register stage k is input to circuit stage cir(k + 1). We only
hint these connections with small arrows and concentrate on the other
connections.
We obtain Fig. 117. In the next section we will transform this simple figure
with very little effort into a pipelined implementation of a MIPS processor.
R_σ^t = h_σ^t.R
R_π^t = h_π^t.R
X_σ^t = X(h_σ^t)
X_π^t = X(h_π^t) .
For signals or registers only occurring in the pipelined design, we drop the sub-
script π. If an equation holds for all cycles (like equations describing hardware
construction) we drop the index t.
Fig. 118. Tracking full register stages with a basic stall engine
• For indices k of register stages, we collect in reg(k) all registers and memo-
ries of register stage k. We use a common clock enable uek for all registers
of reg(k).
• Initially after reset all register stages except the program counters contain
no meaningful data. In the next 5 cycles they are filled one after another
(Table 8). We introduce the hardware from Fig. 118 to keep track of this.
There are 5 full bits f ull[0 : 4], where the bit f ullkt means that circuit
stage cir(k + 1) contains meaningful data in cycle t.
Formally we define
full_0 = 1
∀k ≥ 1 : full_k^0 = 0
∀k ≥ 1 : full_k^{t+1} = full_{k−1}^t .
We show
full[0 : 4]^t = 1^{t+1} 0^{4−t}   if t ≤ 3        (17)
                1^5               if t ≥ 4
by the following simple lemma.
Lemma 7.2 (full bits).
∀k, t ≥ 0 : full_k^t = 0   if t < k
                       1   if t ≥ k
t ≥ 1 > 0 = k :  full_0^t = 1 .
t < k :  full_k^0 = 0 .
Thus, the lemma holds after reset. Assume the lemma holds for cycle t.
Then
full_k^{t+1} = full_{k−1}^t
             = 0   if t < k − 1
               1   if t ≥ k − 1
             = 0   if t + 1 < k
               1   if t + 1 ≥ k .
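The statement of Lemma 7.2 is easy to check by iterating the defining equations;
the following one-screen sketch does exactly that and prints the pattern of
equation (17) (an illustrative check, not part of the proof).

    # Full bits of the basic stall engine: full_0 = 1, full_k^0 = 0 for k >= 1,
    # full_k^{t+1} = full_{k-1}^t.
    full = [1, 0, 0, 0, 0]
    for t in range(6):
        print(t, full)                     # 1^{t+1} 0^{4-t} for t <= 3, else 1^5
        full = [1] + full[:-1]             # shift the pipe-fill wave down by one stage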
Full bits being set to 0 prevent the update of register stages. This is also
called stalling a register stage; we call the hardware therefore a basic stall
engine. Other stall engines are introduced later.
• For any register stage k we update registers and memories in reg(k) only if
their input contains meaningful data, which is the case when the previous
stage is full. As illustrated in Fig. 119 we set the update enable signal as
uek = f ullk−1 .
For registers in stage 1 we have a special case, where the signal f ull0 is
not coming from a register, but is always tied to 1.
• For memories m and gpr we take the precomputed signals bw.3 and gprw.4
from the precomputed control and AND them, respectively, with the cor-
responding update enable signals to get the new write signals:
bwπ = bw.3π ∧ ue4
gprwπ = gprw.4π ∧ ue5 .
• The address of the instruction is now computed as shown in Fig. 120 as
ima_π = dpc_π.l   if full_1 = 0
        pc_π.l    if full_1 = 1 .
This has the remarkable effect that we fetch from the PC in all cycles
except the very first one. Thus, the important role of the delayed PC
In the sequential design, there was a trivial correspondence between the hard-
ware cycle t and the instruction I(ct ) executed in that cycle. In the pipelined
design π the situation is more complicated, because in 5 stages there are
up to 5 different instructions which are in various stages of completion. For
instructions I(ci ) of the sequential computation we use the shorthand
Ii = I(ci ) .
We introduce scheduling functions
I : [1 : 5] × N → N ,
which keep track of the instructions being processed every cycle in every circuit
stage. Intuitively, if
I(k, t) = i ,
then the registers of register stage k in cycle t are in the state before executing
instruction Ii . In case register stage k−1 is full, this is equivalent to saying that
instruction Ii during cycle t is being processed in circuit stage k. If register
stage k − 1 is not full, then circuit stage k does not have any meaningful input
during cycle t, but Ii will be the next instruction which will eventually be
processed by circuit stage k when register stage k − 1 becomes full. For both
cases if I(k, t) = i we say that instruction Ii is in circuit stage k during cycle
t. Note that if some stages of the pipeline are not full, then one instruction is
said to be present in several circuit stages simultaneously.
Formally the functions are defined with the help of the update enable
signals ue_k in the following way:
∀k : I(k, 0) = 0
I(1, t + 1) = I(1, t) + 1   if ue_1^t = 1
              I(1, t)       otherwise
∀k ≥ 2 : I(k, t + 1) = I(k − 1, t)   if ue_k^t = 1
                       I(k, t)       otherwise ,
i.e., after reset every stage is in the state before executing instruction I0 . In
circuit stage 1 we fetch a new instruction and increase the scheduling function
every cycle (in the basic stall engine introduced thus far, stage 1 is always
updated). A register stage k which has a predecessor stage k − 1 is updated
or not in cycle t as indicated by the uet signal. If it is updated then the data
of instruction I(k, t) is written into the registers of stage k, and the circuit
stage k in cycle t + 1 gets the instruction from the previous stage. Later we
prove an easy lemma showing that this instruction is equal to the instruction
in stage k in cycle t increased by one. If register stage k is not updated in cycle
t then the scheduling function for this stage stays the same. Table 9 shows
the development of the scheduling function in our basic pipeline for the first
6 cycles.
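The definition of the scheduling functions can be animated in the same style;
the sketch below uses ue_k^t = full_{k−1}^t of the basic stall engine and
reproduces the pipe-fill behaviour described above (an illustration, not part
of the proof).

    # Scheduling functions I(k, t) for the basic (never-stalling) pipeline.
    K = 5
    full = [1, 0, 0, 0, 0]                 # full_0 .. full_4
    I = [0] * (K + 1)                      # I[1..5]; index 0 unused
    for t in range(7):
        print("t =", t, "I(1..5) =", I[1:])
        ue = full[:]                       # ue_k^t = full_{k-1}^t
        newI = I[:]
        newI[1] = I[1] + 1 if ue[0] else I[1]
        for k in range(2, K + 1):
            newI[k] = I[k - 1] if ue[k - 1] else I[k]
        I = newI
        full = [1] + full[:-1]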
The definition of the scheduling functions can be viewed in the follow-
ing way: imagine we extend each register stage reg(k) by a so-called ghost
register I(k, ·) that can store arbitrary natural numbers. In real machines
that is of course impossible because registers are finite, but for the purpose
of mathematical argument we can add the ghost registers to the construction
and update them like all other registers of their stage by uek . If we initialize
the ghost register I(1, ·) of stage 1 with 0 and increase it by 1 every cycle,
then the pipeline of ghost registers simply clocks the index of the current
instruction through the pipeline together with the real data.
Augmenting real configurations for the purpose of mathematical argument
by ghost components is a useful proof technique. No harm is done to the real
construction as long as no information flows from the ghost components to
the real components.
With the help of Lemma 7.2 we show the following property of the schedul-
ing functions.
Lemma 7.3 (scheduling functions). For all k ≥ 1 and for all t
I(k, t) = 0           if t < k
          t − k + 1   if t ≥ k .
t<k , I(k, t) = 0 .
This shows the base case of the induction. Assume the lemma holds for t. In
the induction step we consider two cases:
I(k′, t) = I(k, t) + Σ_{j=k′}^{k−1} full_j^t .
Not all registers or memories R are used in all instructions I(ci ). In the cor-
rectness theorem we need to show correct simulation of invisible registers only
in situations when they are used. Therefore, we define for each invisible reg-
ister X a predicate used(X, I) which must at least be true for all instructions
I, which require register X to be used for the computation. Some invisible
registers will always be correctly simulated, though not all of them are always
used. We define
Invisible register A is used when the GPR memory is addressed with rs, and
B is used when it is accessed with rt.
We first define auxiliary predicates A-used(I) and B-used(I) that we will
need later. Recall that in Sect. 6.3.8 we have written the functions f (c) and
the predicates p(c) that only depend on the current instruction I(c) as
We will use the same notation here. Inspection of the tables summarizing the
MIPS ISA gives
A-used(I) = alur′(I) ∨ (su′(I) ∧ fun′(I)[2]) ∨ jr′(I) ∨ jalr′(I) ∨ (itype′(I) ∧ ¬lui′(I))
B-used(I) = s′(I) ∨ beq′(I) ∨ bne′(I) ∨ su′(I) ∨ alur′(I) .
used(A, I) = A-used(I)
used(B, I) = B-used(I) .
Registers C.3 and C.4 are used when the GPR memory is written but no load
is performed:
^4 The notation is obviously redundant here, but later we also use A-used and
B-used as hardware signals.
Registers ea.3 and ea.4 are used in load and store operations:
used(dmin, I) = s (I) .
used(dmout.4, I) = l (I) .
Hence,
I(2, t) − I(5, t) ≤ 3 . (18)
Assume in cycle t instruction I(2, t) = i is in circuit stage 2, i.e., the ID stage.
Then signals rs and rt of this instruction overtake up to 3 instructions in
circuit stages 2,3, and 4. If any of these overtaken instructions write to some
general purpose register x and instruction i tries to read it - as in our basic
design directly from the general purpose register file, then the data read will
be stale; more recent data from an overtaken instruction is on the way to the
GPR but has not reached it yet. For the time being we will simply formulate
a software condition SC-1 saying that this situation does not occur; we only
prove that the basic pipelined design π works for ISA computations (ci ) which
obey this condition. In later sections we will improve the design and get rid
of the condition.
Therefore we formalize for x ∈ B5 and ISA configurations c two predicates:
• writesgpr(x, i) - meaning ISA configuration ci writes gpr(x):
Now we can define the new software condition SC-1: for all i and x, if Ii
writes gpr(x), then instructions Ii+1 , Ii+2 , Ii+3 don’t read gpr(x):
writesgpr(x, i) → ∀j ∈ [i + 1 : i + 3] : ¬readsgpr(x, j) .
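SC-1 can also be phrased as a simple check over a finite ISA computation. The
sketch below takes the predicates writesgpr and readsgpr as oracles (their
definitions are the ones indicated above) and merely makes the quantifier
structure explicit.

    # Check SC-1 on a finite ISA computation of n instructions:
    # if instruction i writes gpr(x), then i+1, i+2, i+3 must not read gpr(x).
    def sc1_holds(n, writes_gpr, reads_gpr):
        for i in range(n):
            for x in range(32):                         # x ranges over B^5
                if writes_gpr(x, i):
                    for j in range(i + 1, min(i + 4, n)):
                        if reads_gpr(x, j):
                            return False
        return True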
Now that we can express what instruction I(k, t) is in stage k in cycle t and
whether an invisible register is used in that instruction, we can formulate the
invariant coupling states htπ of the pipelined machine with the set of states
hiσ of the sequential machine that are processed in cycle t of the pipelined
machine, i.e., the set
{hσI(k,t) | k ∈ [1 : 5]} .
We intend to prove by induction on t the following simulation.
Lemma 7.5 (basic pipeline). Assume software condition SC-1, alignment,
and no self modification. For k ∈ [1 : 5] let R ∈ reg(k) be a register or memory
of register stage k. Then,
R_π^t = R_σ^{I(k,t)}     if vis(R)
        R_σ^{I(k,t)−1}   if full_k^t ∧ ¬vis(R) ∧ used(R, I_σ^{I(k,t)−1}) .
By Lemma 7.1 we already know sim(ct , htσ ). In particular we have for predi-
cates p only depending on the current instruction I:
p(ci ) = p(hiσ ) .
Thus, Lemma 7.5 also establishes a simulation between the pipelined compu-
tation (htπ ) and the ISA computation (ci ).
Except for the subtraction of 1 from I(k, t) for non visible registers, the
induction hypothesis is quite intuitive: pipelined data htπ .R in stage k in cycle
We denote by inv(k, t) the statement of Lemma 7.5 for stage k and cycle t.
For t = 0 we have
f ullk0 = 1 ↔ k = 0 .
Thus, there is nothing to show for invisible registers. Initially we also have
∀k : I(k, 0) = 0 .
The initial content of general purpose registers and hardware memory of the
sequential machine is defined by the content of the pipelined machine after
reset:
Thus, we have
∀k : inv(k, 0) .
No Updates
Assume the lemma holds for t. We show for each stage k separately that the
lemma holds for stage k and t + 1. For all stages we always proceed in the
same way. There are two cases. The easy case is
uetk = 0 ,
i.e., register stage k is not updated in cycle t. By the definition of full bits we
know
full_k^{t+1} = full_{k−1}^t = ue_k^t = 0 .
Thus, for invisible registers R ∈ reg(k) there is nothing to show either. For
the scheduling functions uetk = 0 implies
I(k, t + 1) = I(k, t) .
Recall, that the byte write signals for the hardware memory and the write
signal for the GPR memory are defined as
This shows inv(k, t + 1) for stages k that are not updated in cycle t.
I(k, t + 1) = I(k − 1, t)
= I(k, t) + 1 .
I(k, t) = i .
• For k = 4, data memories in stage reg(4) always have identical byte write
signals bw and have the same effective address input ea and data input
dmin in case of a store operation:
bw_π^t = bw_σ^i
s_σ^i → ea_π^t = ea_σ^i ∧ dmin_π^t = dmin_σ^i .
• For k = 5, GPRs in stage reg(5) always have identical GPR write signal
gprw, and have the same write address Cad and the same data input gprin
in case instruction i is writing to the GPRs:
gprw_π^t = gprw_σ^i
gprw_σ^i → Cad_π^t = Cad_σ^i ∧ gprin_π^t = gprin_σ^i .
Proof. Let R ∈ reg(k). The proof hinges on Lemma 7.6 and splits cases in the
obvious way:
• R ∈ {pc, dpc} is a visible register. Because the register has in both ma-
chines the same input, it gets updated in both machines in the same way:
R_π^{t+1} = Rin_π^t
          = Rin_σ^i
          = R_σ^{i+1}
          = R_σ^{I(k,t+1)} .
R_π^{t+1} = Rin_π^t
          = Rin_σ^i
          = R_σ^i
          = R_σ^{I(k,t)}
          = R_σ^{I(k,t+1)−1} .
¬siσ → bwσi = 08 .
Moreover, from the software conditions we know that eaiσ .l ∈ DR and that
the data region DR is disjoint from the ROM portion of the hardware
memory. Then for all a ∈ B29 we get
m_π^{t+1}(a) = modify(m_π^t(a), dmin_π^t, bw_π^t)   if a = ea_π^t.l
               m_π^t(a)                              otherwise
             = modify(m_σ^i(a), dmin_σ^i, bw_σ^i)    if a = ea_σ^i.l
               m_σ^i(a)                              otherwise
             = m_σ^{i+1}(a)
             = m_σ^{I(k,t+1)}(a) .
It remains to prove hypothesis P (k, t) of Lemma 7.7 for each stage k separately
under the assumption that the simulation relation holds for all stages in cycle
t and update enable for stage k is on.
Lemma 7.8 (proof obligations basic pipeline).
(∀k : inv(k , t)) ∧ uetk → P (k, t) .
Proof. We prove the statement of the lemma by a case split on stage k. For
each circuit stage cir(k) we identify a set of input signals in(k) of the stage
which are identical in cycle t of π and in the configuration i of σ:
in(k)tπ = in(k)iσ .
We then show that these inputs determine the relevant outputs Rin, dmin,
etc. of the circuit stage. Because the circuit stages are identical in both ma-
chines, this suffices for showing that the outputs which are used have identi-
cal values. Unfortunately, the proofs require simple but tedious bookkeeping
about the invisible registers used. The only real action is in the proofs for
signals Ain, Bin, and ima.
Stage IF (k=1)
We first consider the address input ima of the instruction port. We consider
the multiplexer in Fig. 120, which selects between visible registers pc, dpc ∈
reg(2), and distinguish two cases:
• t = 0. Then full_1^t = 0 and I(1, 0) = I(2, 0) = 0. We conclude with
inv(2, 0):
ima_π^0 = dpc_π^0.l
        = dpc_σ^{I(2,0)}.l
        = dpc_σ^0.l
        = ima_σ^0
        = ima_σ^{I(1,0)} .
• t ≥ 1. Then full_1^t = 1. By Lemma 7.4 we have
i = I(1, t) = I(2, t) + full_1^t = I(2, t) + 1 .
Using inv(2, t) and the definition of the delayed PC we conclude:
ima_π^t = pc_π^t.l
        = pc_σ^{I(2,t)}.l
        = dpc_σ^{I(2,t)+1}.l
        = dpc_σ^i.l
        = ima_σ^i .
From the software condition we know that imaiσ ∈ CR and that the content
of the code region does not change during the execution. Using inv(4, t) we
get
Thus, the instruction port environment has in both machines the same input;
therefore it produces the same output:
Iin tπ = Iin iσ ,
i.e., we have shown P (1, t) and thus by Lemma 7.7 inv(1, t + 1).
Stage ID (k = 2)
From uet2 = f ull1t we know that t > 0. Hence, by Lemma 7.4 we have:
I(1, t) = I(2, t) + 1 .
Let
i = I(2, t) = I(1, t) − 1 .
There are three kinds of input signals for the circuits cir(2) of this stage:
• Signal from invisible register I ∈ reg(1). It is always used. With inv(1, t)
we get:
Iπt = IσI(1,t)−1 = Iσi .
This already determines the inputs of invisible registers con.2 and i2ex:
It also determines the signals i2nextpc from the instruction decoder to the
next PC environment so that we have for these signals:
i2nextpctπ = i2nextpciσ .
• Signals from visible registers pc, dpc ∈ reg(2) which are inputs to the next
PC environment. From inv(2, t) we get immediately:
• For inputs Ain and Bin of circuit stage cir(2) we have to make use of
software condition SC-1, which is stated on the MIPS ISA level. Hence, we
assume here that the sequential MIPS implementation is correct (Lemma
7.1), i.e., that we always have
∀j ≥ 0 : sim(cj , hjσ ).
used(A, Iσi ).
Let
x = rstπ = rsiσ = rs(ci ) ,
i.e., instruction I(2, t) reads gpr(x):
By (18) we have
I(5, t) ≤ I(2, t) + 3 = i + 3 .
If any of instructions I(3, t), I(4, t), I(5, t) would write gpr(x), this would
violate software condition SC-1. Thus,
Hence,
gprσI(2,t) (x) = gprσI(5,t) (x) .
Using inv(5, t) we conclude:
For the input to invisible register linkad we have from inv(2, t) and because
t ≥ 1:
linkadin_π^t = pcinc_π^t
             = pc_π^t +_{32} 4_{32}
             = pc_σ^i +_{32} 4_{32}
             = linkadin_σ^i .
It remains to argue about the inputs of visible registers pc and dpc, i.e., about
signals nextpc and register pc which is the input of dpc. For the input pc of
dpc we have from inv(2, t) and because t ≥ 1:
For the computation of the nextpc signal there are four cases:
• beq_σ^i ∨ bne_σ^i. This is the easiest case, because it implies used(A, I_σ^i) ∧
used(B, I_σ^i) and we have
in_π^t = in_σ^i
for all inputs in ∈ {A, B, i2nextpc} of the next PC environment. Because
the environment is identical in both machines we conclude
jbtaken_π^t = jbtaken_σ^i
btarget_π^t = btarget_σ^i
nextpc_π^t = nextpc_σ^i .
Stage EX (k = 3)
From uet3 = f ull2t we know that t > 1. Hence, by Lemma 7.4 we have:
I(2, t) = I(3, t) + 1 .
Let
i = I(3, t) = I(2, t) − 1 .
We have to consider three kinds of input signals for the circuits cir(3) of this
stage:
• Invisible registers i2ex, con.2, and linkad. They are always used. Using
inv(2, t) we get:
Because con.2 = con.3in this shows P (3, t) for the pipelined control regis-
ter con.3,
• Invisible registers A and B. From inv(2, t) we have:
We proceed to show P (3, t) for registers dmin, bw.3, register ea, and register
C.3 separately:
• For ea we have
smask_π^t = smask_σ^i
          = 0000 .
Thus, we get
bw.3in_π^t = 0^8
           = bw_σ^i
           = bw.3in_σ^i .
Stage M (k = 4)
From uet4 = f ull3t we know that t > 2. Hence, by Lemma 7.4 we have:
I(3, t) = I(4, t) + 1 .
Let
i = I(4, t) = I(3, t) − 1 .
We have to argue about 3 kinds of signals:
• X ∈ {dmin, con.3, ea.3, C.3}. From inv(3, t) we have:
This shows P (4, t) for the data inputs of registers con.4, ea.4, and C.4.
• dmout.4. We have
bw.3_π^t = bw_σ^{I(3,t)−1}
         = bw_σ^i
bw_π^t = bw.3_π^t ∧ ue_4^t
       = bw_σ^i ,
For the effective address and the data input to the hardware memory we
have
s_σ^i → used(ea.3, I_σ^i) ∧ used(dmin, I_σ^i) .
As shown above, this implies in case of s_σ^i:
dmin_π^t = dmin_σ^i
ea_π^t = ea.3_π^t
       = ea_σ^i .
Stage WB (k = 5)
From uet5 = f ull4t we know that t > 3. Hence, by Lemma 7.4 we have
I(4, t) = I(5, t) + 1 .
Let
i = I(5, t) = I(4, t) − 1 .
We only have to consider the input registers of the stage and to show P (5, t)
for the general purpose register file:
• All input registers are invisible, thus let X ∈ {C.4, dmout.4, ea.4, con.4}.
From inv(4, t) we have:
used(X, I_σ^i) → X_π^t = X_σ^i .
• Signal gprw.4 is a component of con.4. Thus, we have:
gprw.4_π^t = gprw_σ^i
gprw_π^t = gprw.4_π^t ∧ ue_5^t
         = gprw_σ^i .
Signal Cad.4 is a component of con.4. Thus
Cad.4_π^t = Cad_σ^i .
Assume gprwσi , i.e., the general purpose register file, is written. We have
to consider two subcases:
– A load is performed. Then dmout.4 and ea.4 are both used, load result
lres is identical for both computations and the data input gprin for the
general purpose register file comes for both computations from lres:
l_σ^i → used(dmout.4, I_σ^i) ∧ used(ea.4, I_σ^i)
gprin_π^t = lres_π^t
          = lres_σ^i
          = gprin_σ^i .
– No load is performed. Then C.4 is used and it is put to the data input
gprin:
¬l_σ^i → used(C.4, I_σ^i)
gprin_π^t = C.4_π^t
          = C_σ^i
          = gprin_σ^i .
This completes the proof of P (5, t), the proof of Lemma 7.8, and the
correctness proof of the basic pipeline design.
7.3 Forwarding
Software condition SC-1 forbids to read a general purpose register gpr(x)
that is written by instruction i in the following three instructions i + 1, i + 2,
and i + 3. We needed this condition because with the basic pipelined machine
constructed so far we had to wait until the written data had reached the
general purpose register file, simply because that’s where we accessed them.
This situation is greatly improved by the forwarding circuits studied in this
section.
7.3.1 Hits
• The instruction in stage k must write to the general purpose register file:
gprw.k t .
Second, in case we have a hit in stage reg(2) or reg(3) and the instruction
is not a load instruction, then the data we want to fetch into A or B can be
found as the input of the C register of the following circuit stage, i.e., as C.3in
or C.4in. In case of a hit in stage reg(4) we can find the required data at the
data input gprin of the general purpose register file, even for loads.
Fig. 121. Forwarding circuit For_A
recent instruction producing a hit. This is the “top” instruction in the pipe
(i.e., with the smallest k) producing a hit:
top_A[k] = hit_A[k] ∧ ∧_{j<k} ¬hit_A[j]
top_B[k] = hit_B[k] ∧ ∧_{j<k} ¬hit_B[j] .
Figure 121 shows the forwarding circuit For_A. If we find nothing to forward
we access the general purpose register file as in the basic design. We have:
Ain = C.3in     if top_A[2]
      C.4in     if top_A[3]
      gprin     if top_A[4]
      gprouta   otherwise .
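The hit, top, and multiplexer equations fit into a few lines. In the following
sketch the per-stage signals are passed in explicitly; the packaging into a
single function is of course an assumption of this illustration, not the
circuit of Fig. 121.

    # Forwarding circuit For_A: pick the data of the most recent instruction
    # ahead in the pipe that writes the register addressed by rs.
    def forward_A(rs, full, gprw, cad, c3in, c4in, gprin, gprouta):
        """full[k], gprw[k], cad[k] describe the instruction in register stage
        k (k = 2, 3, 4); c3in/c4in/gprin are the candidate forwarding data."""
        hit = {k: full[k] and gprw[k] and cad[k] == rs for k in (2, 3, 4)}
        top = {k: hit[k] and not any(hit[j] for j in (2, 3, 4) if j < k)
               for k in (2, 3, 4)}
        if top[2]:
            return c3in          # hit in reg(2): forward C.3in
        if top[3]:
            return c4in          # hit in reg(3): forward C.4in
        if top[4]:
            return gprin         # hit in reg(4): data is at the GPR write port
        return gprouta           # no hit: read the GPR file as in the basic design

    # Example: the instruction in stage 3 writes register 5, nothing else hits.
    full = {2: True, 3: True, 4: True}
    gprw = {2: False, 3: True, 4: False}
    cad = {2: 0, 3: 5, 4: 0}
    print(forward_A(5, full, gprw, cad, "C.3in", "C.4in", "gprin", "gprouta"))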
Further, let
i = I(2, t) = I(5, t) + s(t) ,
let s(t) > 0 and 0 ≤ j < s(t). Then
I(2 + j, t) = i − j   and   full_{2+j}^t ,
i.e., any instruction i − j between i and i − s(t) is found in the full register
stage 2 + j between stages 2 and 5.
Proof. From Lemma 7.3 we get
I(2, t) = 0       if t < 2
          t − 1   if t ≥ 2
        = I(5, t) + s(t)
        ≥ 1 .
Hence, t ≥ 2 and
I(2, t) = t − 1 ≥ s(t) .
Thus,
t ≥ s(t) + 1 > 1 + j ,
which implies
t≥2+j .
Applying again Lemma 7.3 we get
I(2 + j, t) = 0                 if t < 2 + j
              t − (2 + j) + 1   if t ≥ 2 + j
            = t − 1 − j
            = i − j .
From Lemma 7.2 we get
full_{2+j}^t = (t ≥ j + 2) = 1 .
The only case in the proof affected by the addition of the two forwarding
circuits F orA and F orB is the proof of P (2, t) in Lemma 7.8 for signals Ain
and Bin. Also the order in which proof obligations P (k, t) are shown becomes
important: one proves P (2, t) after P (3, t), P (4, t), and P (5, t).
We present the modified proof for Ain. The proof for Bin is completely
analogous. Assume
uet2 = f ull1t = 1 ,
and let
i = I(2, t) .
Our goal is to show that the forwarding circuit outputs the same content of
the GPR register file, as we get in the sequential configuration i:
∀j ≥ 0 : sim(cj , hjσ ).
k ∈ [2 : 4] ∧ f ullkt .
Then by Lemma 7.2 stage k and all preceding stages must be full in cycle t
∀j ≤ k : f ulljt ,
and we can use induction hypothesis inv(j, t) for the invisible registers. Let
k =2+α with α ∈ [0 : 2] .
For the scheduling function for stages k and k + 1 we get by Lemma 7.4
I(2 + α, t) = I(k, t)
            = I(2, t)                           if k = 2
              I(2, t) − Σ_{j=2}^{k−1} full_j^t   if k > 2
            = i − (k − 2)
            = i − α
I(3 + α, t) = I(k + 1, t)
            = I(k, t) − full_k^t
            = i − α − 1 .
x = rs_σ^i ∧ k = 2 + α ∧ full_k^t .
Then,
hit_A^t[k] = writesgpr(x, i − α − 1) .
Proof. For the hit signal under consideration we can conclude with inv(k, t)
for the invisible registers Cad.k and gprw.k and with inv(1, t) for the signal
rs:
Now assume
hitA [k]t ∧ k = 2 + α ∧ x = rsiσ .
Then by Lemma 7.10 we have writesgpr(x, i − α − 1). For α ∈ [0 : 1] and
exploiting the fact that instruction i reads GPR x we can also conclude from
software condition SC-2 that instruction i − α − 1 is not a load instruction:
This in turn implies that registers C.3 and C.4 are used by instruction i−α−1,
i.e.,
used(C.(3 + α), I(ci−α−1 )) ,
and that the content of these registers is written into register x by this in-
struction. Thus, we can apply P (3, t) and P (4, t) to conclude
If we have hit_A[2 + α]^t for α = 2 we conclude from gprw_σ^{i−α−1} and the proof
of P (5, t)
gprin_π^t = gprin_σ^{I(5,t)}
          = gprin_σ^{I(3+α,t)}
          = gprin_σ^{i−α−1}
          = gpr_σ^{i−α}(x) .
The proof of P (2, t) for Ain can now be completed. There are two major cases:
Fig. 122. Signals between data paths and control and the stall engine
7.4 Stalling
In this last section of the pipelining chapter we use a non-trivial stall engine,
which permits us to improve the pipelined machine π such that we can drop
software condition SC-2. As shown in Fig. 122 the new stall engine receives
from every circuit stage cir(k) an input signal hazk indicating that register
stage reg(k) should not be clocked, because correct input signals are not
available.
In case a hazard signal hazk is active the improved stall engine will stall the
corresponding circuit stage cir(k), but it will keep clocking the other stages
if this is possible without overwriting instructions. Care has to be taken that
the resulting design is live, i.e., that stages generating hazard signals are not
blocking each other.
The stall engine we use here was first presented in [6]. It is quickly described
but is far from trivial. The signals involved for stages k are:
• full signals f ullk for k ∈ [0 : 4],
• update enable signals uek for k ∈ [1 : 5],
• stall signals stallk indicating that stage k should presently not be clocked
for k ∈ [1 : 6]; the stall signal for stage 6 is only introduced to make
definitions more uniform,
• hazard signal hazk generated by circuit stage k for k ∈ [1 : 5].
As before, circuit stage 1 is always full (i.e., f ull0 =1) and circuit stages 2 to
5 are initially empty. Register stage reg(6) does not exist, and thus it is never
stalled:
full_0 = 1
full[1 : 4]^0 = 0^4
stall_6 = 0 .
We specify the new stall engine with 3 equations. Only full circuit stages k
with full input registers (registers reg(k − 1)) are stalled. This happens in two
situations: if a hazard signal is generated in circuit stage k or if the subsequent
stage k + 1 is stalled and clocking registers in stage k would overwrite data
needed in the next stage:
stall_k = full_{k−1} ∧ (haz_k ∨ stall_{k+1}) .
Stage k is updated, when the preceding stage k − 1 is full and stage k itself is
not stalled:
ue_k = full_{k−1} ∧ ¬stall_k
     = full_{k−1} ∧ ¬(full_{k−1} ∧ (haz_k ∨ stall_{k+1}))
     = full_{k−1} ∧ (¬full_{k−1} ∨ ¬(haz_k ∨ stall_{k+1}))
     = full_{k−1} ∧ ¬full_{k−1} ∨ full_{k−1} ∧ ¬(haz_k ∨ stall_{k+1})
     = full_{k−1} ∧ ¬(haz_k ∨ stall_{k+1}) .
A stage is full in cycle t + 1 in two situations: i) if new data were clocked in
during the preceding cycle or ii) if it was full before and the old data had to
stay where they are because the next stage was stalled:
full_k^{t+1} = ue_k^t ∨ full_k^t ∧ stall_{k+1}^t .
Because
stall_{k+1} ∧ full_k = stall_{k+1} ,
this can be simplified to
full_k^{t+1} = ue_k^t ∨ stall_{k+1}^t .
The corresponding hardware is shown in Fig. 123.
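The three defining equations of the stall engine translate directly into a
cycle-level simulation. The sketch below computes stall, ue, and the next full
bits from the hazard signals of one cycle; it illustrates the equations above,
not the gate-level hardware of Fig. 123.

    # One cycle of the general stall engine for a 5-stage pipe.
    def stall_engine_step(full, haz):
        """full[0..4] are the current full bits (full[0] is tied to 1);
        haz[1..5] are the hazard signals; returns (ue[1..5], new full[0..4])."""
        stall = [0] * 7                               # stall[1..6], stall[6] = 0
        for k in range(5, 0, -1):                     # stall_k uses stall_{k+1}
            stall[k] = full[k - 1] and (haz[k] or stall[k + 1])
        ue = [None] + [int(full[k - 1] and not stall[k]) for k in range(1, 6)]
        new_full = [1] + [int(ue[k] or (full[k] and stall[k + 1])) for k in range(1, 5)]
        return ue[1:], new_full

    # Example: pipe completely full, stage 2 raises a hazard.
    ue, full = stall_engine_step([1, 1, 1, 1, 1], haz=[None, 0, 1, 0, 0, 0])
    print(ue, full)   # stages 1 and 2 are stalled, stages 3..5 keep running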
The correctness statement formulated in Lemma 7.5 stays the same as before.
Software conditions SC-1 resp. SC-2 are dropped. Only alignment of memory
accesses and the disjointness of the code and data regions are assumed.
The correctness proof follows the pattern of previous proofs, but due to the
non-trivial stall engine the arguments about scheduling functions now become
considerably more complex. Before we can adapt the overall proof we have to
show the counterparts of Lemmas 7.4 and 7.9 for the new stall engine. We
begin with three auxiliary technical results.
i.e., if a full stage k − 1 is clocked then the previous data are clocked into the
next stage.
Proof. By contradiction. Assume
0 = ue_k
  = full_{k−1} ∧ ¬stall_k
  = ¬stall_k .
Thus,
stall_k = 1
stall_{k−1} = full_{k−2} ∧ (haz_{k−1} ∨ stall_k)
            = full_{k−2}
ue_{k−1} = full_{k−2} ∧ ¬stall_{k−1}
         = stall_{k−1} ∧ ¬stall_{k−1}
         = 0 .
  = stall_{k+1}^t
  = full_k^t ∧ (haz_{k+1}^t ∨ stall_{k+2}^t)
  = 0 .
¬full_{k−1}^t ∨ ¬ue_k^t → I(k, t + 1) = I(k, t) ,
i.e., the scheduling function of stage k that does not have a full input stage
k − 1 or that is not clocked, stays the same.
Proof. By the definitions of the scheduling functions we have
¬full_{k−1}^t → ¬ue_k^t .
Table 10. Case split according to bits full_{k−1}^t, ue_{k−1}^t, and ue_k^t in the proof of
Lemma 7.15

full_{k−1}^t   ue_{k−1}^t   ue_k^t   I(k−1, t)   I(k−1, t+1)   I(k, t+1)   full_{k−1}^{t+1}
     0             0          0          i            i             i              0
     0             1          0          i           i+1            i              1
     1             0          0         i+1          i+1            i              1
     1             0          1         i+1          i+1           i+1             0
     1             1          1         i+1          i+2           i+1             1
i.e., after stage k is clocked, it is full and its scheduling function is increased
by one.
Proof. In the proof we distinguish two cases. If k = 1 then ue_1^t implies
I(1, t + 1) = I(1, t) + 1 .
For k ≥ 2 we have
ue_k^t → full_{k−1}^t ∧ full_k^{t+1} .
Thus we have by the definition of the scheduling functions and the induction
hypothesis of Lemma 7.14
I(k, t + 1) = I(k − 1, t)
            = I(k, t) + full_{k−1}^t
            = I(k, t) + 1 .
Lemma 7.14 for t + 1 is now proven by a case split. Let
I(k, t) = i .
The major case split is according to bit full_{k−1}^t as shown in Table 10:
• full_{k−1}^t = 0. By Lemma 7.13 and the induction hypothesis we have
¬full_{k−1}^{t+1} ∧ I(k − 1, t + 1) = I(k − 1, t) = i .
• full_{k−1}^t = 1. By induction hypothesis we have
I(k − 1, t) = I(k, t) + 1 = i + 1 .
Moreover,
ue_{k−1}^t → ue_k^t
by Lemma 7.11:
– ue_{k−1}^t = 1. Then,
full_{k−1}^{t+1} = ue_{k−1}^t ∨ stall_k^t
                 = 1 .
– ue_{k−1}^t = 0. Then,
full_{k−1}^{t+1} = stall_k^t
ue_k^t = full_{k−1}^t ∧ ¬stall_k^t
       = ¬stall_k^t
       = ¬full_{k−1}^{t+1} .
From Lemma 7.14 we conclude the same formula as for the machine without
stalling for any k′ < k:
I(k′, t) = I(k, t) + Σ_{j=k′}^{k−1} full_j^t .
As before, let
s(t) = Σ_{j=2}^{4} full_j^t .
Let
i = I(2, t) = I(5, t) + s(t).
For j ∈ [−1 : s(t) − 1] we define numbers a(j, t) by^5
a(−1, t) = 1
a(j, t) = min{x | x > a(j − 1, t) ∧ full_x^t} .
Then,
∀j ∈ [0 : s(t) − 1] : full_{a(j,t)}^t ∧ I(a(j, t), t) = i − j .
Proof. The lemma follows by an easy induction on j. For j = −1 there is
nothing to show. Assume the lemma holds for j. By the minimality of a(j+1, t)
we have
∀x : a(j, t) < x < a(j + 1, t) → ¬full_x^t .
By Lemma 7.14 we get
I(a(j + 1, t), t) = I(a(j, t), t) − Σ_{x=a(j,t)}^{a(j+1,t)−1} full_x^t
                 = I(a(j, t), t) − 1
                 = I(2, t) − j − 1
                 = I(2, t) − (j + 1) .
We are now also able to state the version of Lemma 7.10 for the machine with
stalling.
^5 For j ∈ [0 : s(t) − 1], the function a(j, t) returns the index of the (j + 1)-th full
stage, starting to count from stage 2.
Lemma 7.17 (hit signal with stalling). Let i = I(2, t) and the numbers
s(t) and a(j, t) be defined as in the previous lemma. Further, let α ∈ [0 :
s(t) − 1] and let
x = rs_σ^i ∧ k = a(α, t) ∧ full_k^t .
Then,
hit_A^t[k] = writesgpr(x, i − α − 1) .
Proof. For the hit signal under consideration we can conclude with inv(k, t),
inv(2, t) and Lemma 7.16:
hit_A[k]^t ≡ full_k^t ∧ Cad.k_π^t = rs_π^t ∧ gprw.k_π^t
           ≡ Cad_σ^{I(k,t)−1} = rs_σ^i ∧ gprw_σ^{I(k,t)−1}
           ≡ Cad_σ^{i−α−1} = x ∧ gprw_σ^{i−α−1}
           ≡ writesgpr(x, i − α − 1) .
¬writesgpr(x, i − α − 1) .
Hence, we have
∀j ∈ [i − α, i − 1] : ¬writesgpr(x, j) .
Hence, we get
Ain_π^t = gprin_σ^{i−α−1}
        = gpr_σ^{i−α}(x)
        = gpr_σ^i(x) .
∀k ∈ [2 : 4] : ¬hit_A^t[k] .
∀j ∈ [i − s(t) : i − 1] : ¬writesgpr(x, j) .
Ain_π^t = gprouta_π^t
        = gpr_σ^{I(5,t)}(x)
        = gpr_σ^{i−s(t)}(x)
        = gpr_σ^i(x) .
7.4.6 Liveness
We have to show that all active hazard signals are eventually turned off, so
that no stage is stalled forever. By the definition of the stall signals we have
¬stall_{k+1}^t ∧ ¬haz_k^t → ¬stall_k^t ,
i.e., a stage whose successor stage is not stalled and whose hazard signal is off
is not stalled either. From
we conclude
k ≥ 3 → ¬stallk ,
i.e., stages k ≥ 3 are never stalled. Stages k with empty input stage k − 1 are
never stalled. Thus it suffices to show the following lemma.
i.e., with a full input stage, stage 2 is not stalled for more than 2 successive
cycles.
Proof. From the definitions of the signals in the stall engine we conclude
successively:
stall_2^t = stall_2^{t+1} = 1
full_1^{t+1} = 1
ue_2^t = ue_2^{t+1} = 0 .
Using
stall3 = stall4 = 0
we conclude successively
full_2^{t+1} = full_2^{t+2} = 0
ue_3^{t+1} = 0
full_3^{t+2} = 0 .
Thus in cycle t + 2 stages 2 and 3 are both empty, hence the hit signals of
these stages are off:
which implies
haz2t+2 = 0 .
8
Caches and Shared Memory
In this chapter we implement a cache based shared memory system and prove
that it is sequentially consistent. Sequential consistency means: i) answers of
read accesses to the memory system behave as if all accesses to the memory
system were performed in some sequential order and ii) this order is consis-
tent with the local order of accesses [7]. Cache coherence is maintained by the
classical MOESI protocol as introduced in [16]. That a sequentially consistent
shared memory system can be built at the gate level is in a sense the funda-
mental result of multi-core computing. Evidence that it holds is overwhelm-
ing: such systems are since decades part of commercial multi-core processors.
Much to our surprise, when preparing the lectures for this chapter, we found
in the open literature only one (undocumented) published gate level design of
a cache based shared memory system [17]. Closely related to our subject, there
is of course also an abundance of literature in the model checking community
showing for a great variety of cache protocols, that desirable invariants - in-
cluding cache coherence - are maintained, if accesses to the memory system
are performed atomically at arbitrary caches in an arbitrary sequential order.
In what follows we will call this variety of protocols atomic protocols. For a
survey on the verification techniques for cache coherence protocols see [13],
and for the model checking of the MOESI protocol we refer the reader to [4].
Atomic protocols and shared memory hardware differ in several important
aspects:
• Accesses to shared memory hardware are as often as possible performed in
parallel. After all, the purpose of multi-core computing is gaining speed by
parallelism. If memory accesses were sequential as in the atomic protocols,
memory would be a sequential bottleneck.
• Accesses to cache based hardware memory systems take one, two, or many
more hardware cycles. Thus, they are certainly not performed in an atomic
fashion.
Fortunately, we will be able to use the model checked invariants literally as
lemmas in the hardware correctness proof presented here, but very considerable
extra proof effort will be required to establish a simulation between the
hardware computation and the atomic protocol. After it is established one
can easily conclude sequential consistency of the hardware system, because
the atomic computation is sequential to begin with.
In Sect. 8.1 we introduce what we call abstract caches and show that the
common basic types of single caches (direct mapped, k-way set associative,
fully associative) can be modelled as abstract caches. This will later permit
to simplify notation considerably. It also permits to unify most of the theory
of shared memory constructions for all basic cache types. However, presently
our definition of abstract caches does not yet include eviction addresses. The
construction we present involves direct mapped caches, and we have to deal
with eviction addresses below the abstract cache level. Modifying this small
part of the construction and the corresponding arguments to other types of
basic caches should not be hard. Modification of the definition of abstract
caches such that they can be used completely as a black box is still future
work. In the classroom it suffices to show that direct mapped caches can be
modeled as abstract caches.
In Sect. 8.2 we develop formalism permitting to deal with i) atomic proto-
cols, ii) hardware shared memory systems, and iii) simulations between them.
It is the best formalism we can presently come up with. Suggestions for im-
provement are welcome. If one aims at correctness proofs there is no way to
avoid this section (or an improved version of it) in the classroom.
Section 8.3 formulates in the framework of Sect. 8.2 the classical theory
of the atomic MOESI protocol together with some auxiliary technical results
that are needed later. Also we have enriched the standard MOESI protocol
by a treatment of compare-and-swap (CAS) operations. We did this for two
reasons: i) compare-and-swap operations are essential for the implementation
of locks. Thus, multi-core machines without such operations are of limited use
to put it mildly, ii) compare-and-swap is not a read followed by a conditional
write; it is an atomic read followed by a conditional write, and this makes a
large difference for the implementation.
A hardware-level implementation of the protocol for the direct mapped
caches is presented in Sect. 8.4. It has the obvious three parts: i) data paths,
ii) control automata, and iii) bus arbiter. The construction of data paths
and control automata is not exactly straightforward. Caches in the data part
generally consist of general 2-port RAMs, because they have to be able to
serve their processor and to participate in the snooping bus protocol at the
same time. We have provided each processor with two control automata: a
master automaton processing requests of the processor and a slave automaton
organizing the cache response to the requests of other masters on the bus. The
arbiter does round robin scheduling of bus requests. One should sleep an extra
hour at night before presenting this material in the classroom.
The correctness proof for the shared memory system is presented in
Sect. 8.5. An outline of the proof is given at the start of the section. Roughly
speaking, the proof contains the following kinds of arguments: i) showing that
bus arbitration guarantees at any time that at most one master automaton
is not idle, ii) showing the absence of bus contention (except on open collector
buses^1), among other things by showing that during global transactions
(involving more than one cache) master and participating slave automata
stay “in sync”, iii) concluding that control signals and data “somehow corre-
sponding to the atomic protocol” are exchanged via the buses, iv) abstracting
memory accesses in the sense of Sect. 8.2 from the hardware computation and
ordering them sequentially by their end cycle; it turns out that for accesses
with identical end cycles it does not matter how we order them among each
other, and v) showing (by induction on the end cycles of accesses) that the
data exchanged via the buses are exactly the data exchanged by the atomic
protocol, if it were run in the memory system configuration at the end cycle
of the access. This establishes simulation and allows us to conclude that
cache invariants are maintained in the hardware computation (because hardware
simulates the atomic protocol, and there it was model-checked that the
invariants are maintained).
Many of the arguments of parts i) to iv) are tedious bookkeeping; in the
classroom it suffices to just state the corresponding lemmas and to present only
a few typical proofs. However, even in this preliminary/bookkeeping phase of
the proof the order of arguments is of great importance: the absence of bus
contention often hinges on the cache invariants. Part v) is not only hard; it
turns out that it is also highly dependent on properties of the particular cache
protocol we are using. Thus, reinspection of the corresponding portions of the
proof is necessary, if one wants to establish shared memory correctness for a
different protocol.
^1 We do not use open collector buses to communicate with the main memory. Hence,
we do not worry about absence of bus contention on them.
proof of the shared memory construction of the subsequent sections will then
to a very large extent be based on abstract caches.
For the states, we use the synonyms and names from Table 11.
In the digital model, main memory is simply a line addressable memory
with configuration
mm : B29 → B64 .
An abstract cache configuration aca has the following components:
• data memory aca.data : B29 → B64 - simply a line addressable memory,
• state memory aca.s : B29 → S mapping each line address a to its current
state aca.s(a).
We denote the set of all possible abstract cache configurations by Kaca .
If a cache line a with a ∈ B29 has state I, i.e., aca.s(a) = I, then the data
aca.data(a) of this cache line is considered invalid or meaningless, otherwise
it is considered valid. When cache line a has valid data, we also say that we
have an abstract cache hit in cache line a:
ahit(aca, a) ≡ aca.s(a) ≠ I .
^2 If line size was larger than the width of the memory bus, one would have to use
sectored caches. This would mildly complicate the control automata.
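An abstract cache is just a pair of partial maps from line addresses, and the
hit predicate only looks at the state component. The following sketch is a
direct transcription of these definitions; the representation as Python
dictionaries is of course ours, and the state names are those of Table 11.

    # An abstract cache configuration aca with components aca.data and aca.s,
    # both indexed by 29-bit line addresses; lines not present are in state I.
    def make_aca():
        return {"data": {}, "s": {}}

    def ahit(aca, a):
        """Abstract cache hit: line a has valid data, i.e., its state is not I."""
        return aca["s"].get(a, "I") != "I"

    aca = make_aca()
    a = 0x1ABCDEF0 >> 3                    # line address of a byte address
    aca["s"][a] = "E"                      # some valid state from Table 11
    aca["data"][a] = 0xDEADBEEFDEADBEEF
    print(ahit(aca, a), ahit(aca, a + 1))  # True False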
Fig. 124. An abstract cache aca and a main memory mm are abstracted to a single
memory m(h)
Fig. 125. Many caches ca(i) and a main memory mm are abstracted to a shared
memory m(h)
In this definition, valid data in the cache hide the data in main memory.
A much more practical and interesting situation arises if P many abstract
caches aca(i) are coupled with a main memory mm as shown in Fig. 125 to
get the abstraction of a shared memory. We intend to connect such a shared
memory system with p processors. The number of caches will be P = 2p. For
i ∈ [0 : p−1] we will connect processor i with cache aca(2i), which will replace
the instruction port of the data memory, and with cache aca(2i + 1), which
will replace the data port of the data memory.
Again, we want to get a memory abstraction by hiding the data in main
memory by the data in caches. But this only works if we have an invari-
ant stating coherence resp. consistency of caches, namely that valid data in
different caches are identical:
212 8 Caches and Shared Memory
a a.t a.c
τ
The purpose of cache coherence protocols like the one considered in this
chapter is to maintain this invariant. With this invariant the following definition
of an implemented memory m is well defined:
m(a) = aca(i).data(a)   if ahit(aca(i), a)
       mm(a)            otherwise .
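The abstraction of caches and main memory to a single memory can be transcribed
just as directly; the sketch below also checks the coherence invariant, which
is exactly what makes the choice of the supplying cache irrelevant. The dict
representation is the same illustrative one as above.

    # Memory abstraction m for abstract caches aca(0..P-1) and main memory mm.
    def ahit(aca, a):
        return aca["s"].get(a, "I") != "I"          # line a holds valid data

    def m(caches, mm, a):
        vals = [c["data"][a] for c in caches if ahit(c, a)]
        assert all(v == vals[0] for v in vals)      # coherence: valid copies agree
        return vals[0] if vals else mm[a]           # valid cache data hides mm

    caches = [{"s": {7: "S"}, "data": {7: 0x11}},
              {"s": {7: "O"}, "data": {7: 0x11}}]
    mm = {7: 0x99, 8: 0x22}
    print(hex(m(caches, mm, 7)), hex(m(caches, mm, 8)))   # 0x11 0x22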
All cache constructions considered here use the decomposition of byte
addresses ad ∈ B^32 into three components as shown in Fig. 126:
• a line offset ad.o ∈ B^3 within lines,
• a cache line address ad.c ∈ B^ℓ. This is the (short) address used to address
the (small) RAMs constituting the cache,
• a tag ad.t ∈ B^τ with
τ + ℓ + 3 = 32 ,
which completes cache line addresses to line addresses:
For line addresses a ∈ B29 this gives a decomposition into two components as
shown in Fig. 127.
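The address decompositions are plain bit slicing. The sketch below uses a
generic cache line address width l (written ℓ above); its concrete value
depends on the cache size and is a parameter of this illustration.

    # Split a 32-bit byte address ad into tag ad.t, cache line address ad.c
    # and line offset ad.o, with tag width tau = 32 - l - 3.
    def split_byte_address(ad, l):
        o = ad & 0x7                         # ad.o : 3-bit offset within the line
        c = (ad >> 3) & ((1 << l) - 1)       # ad.c : l-bit cache line address
        t = ad >> (3 + l)                    # ad.t : (32 - l - 3)-bit tag
        return t, c, o

    def split_line_address(a, l):
        """Line addresses a in B^29 decompose into tag and cache line address."""
        return a >> l, a & ((1 << l) - 1)

    print(split_byte_address(0xDEADBEEF, l=8))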
We structure the hardware configurations h of our constructions by
introducing cache components h.ca. Direct mapped caches have the following
cache line addressable components:
• data memory h.ca.data : B^ℓ → B^64 implemented as a multi-bank RAM,
• tag memory h.ca.tag : B^ℓ → B^τ implemented as an ordinary static RAM, and
• state memory h.ca.s : B^ℓ → B^5 implemented as a cache state RAM.
Fig. 128. Data paths of a direct mapped cache h.ca
^3 That caches are smaller than main memory is achieved by mapping many line
addresses to the same cache line address.
As shown in Fig. 129, k-way associative caches (also called set associative
caches) consist of k copies h.ca(i) of direct mapped caches for i ∈ [0 : k − 1]
which are called ways. Individual hit signals hhit(i) , cache data out signals
cadout(i) , and cache state out signals casout(i) are computed in each way i as
A hit in any of the individual caches constitutes a hit in the set associative
cache:
hhit(h.ca, a) = ∨_i hhit^{(i)}(h.ca, a) .
Joint data output cadout(h.ca, a) and state output casout(h.ca, a) are ob-
tained by multiplexing the individual data and state outputs under control of
the individual hit signals:
cadout(h.ca, a) = ∨_i (cadout(i) (h.ca, a) ∧ hhit(i) (h.ca, a))
casout(h.ca, a) = ∨_i (casout(i) (h.ca, a) ∧ hhit(i) (h.ca, a)) .
Initialization and update of the cache must maintain the invariant that valid
tags in different ways belonging to the same cache line address are distinct:

h.ca(i).s(c) ≠ I ∧ h.ca(j).s(c) ≠ I ∧ i ≠ j → h.ca(i).tag(c) ≠ h.ca(j).tag(c) .

This implies that for every line address a, a hit can occur in at most one way.

Lemma 8.2 (hit unique).

hhit(i) (h.ca, a) ∧ hhit(j) (h.ca, a) → i = j .

Proof. Assume hhit(i) (h.ca, a) ∧ hhit(j) (h.ca, a). Then,

h.ca(i).s(a.c) ≠ I ∧ h.ca(j).s(a.c) ≠ I

and

h.ca(i).tag(a.c) = a.t = h.ca(j).tag(a.c) ,

hence i = j by the tag invariant.

For the abstraction aca(h) of the k-way associative cache, a hit in the hardware corresponds to an abstract hit:

hhit(h.ca, a) ≡ ∃i : hhit(h.ca(i) , a)
≡ aca(h).s(a) ≠ I
≡ ahit(aca(h), a) .

In case of an abstract hit ahit(aca(h), a) we also have by Lemma 8.2 a unique concrete hit hhit(h.ca(i) , a). For the data and state outputs of the k-way associative cache, we conclude

cadout(h.ca, a) = cadout(h.ca(i) , a) = aca(h).data(a) = acadout(aca(h), a)
casout(h.ca, a) = casout(h.ca(i) , a) = aca(h).s(a) = acasout(aca(h), a) .
Fully associative caches store their 2^α cache lines in three RAMs with α-bit wide addresses. These RAMs have the same components h.ca.s, h.ca.tag, and h.ca.data as
direct mapped caches, but data for any line address a can be stored at any
cache line and is addressed with a cache line address b ∈ Bα :
• Data memory h.ca.data : Bα → B64 is implemented as an SPR RAM.
• Tag memory h.ca.tag : Bα → B29 is implemented as an SPR RAM. The
tag RAM has width 29 so that it can store entire line addresses.
• State memory h.ca.s : Bα → B5 is implemented as an SPR RAM extended
with the invalidation option of a cache state RAM4 .
⁴ We leave the construction of such a RAM as an easy exercise for the reader.
A hit for the entire fully associative cache occurs if at least one of the lines
contains the valid data for a:
hhit(h.ca, a) = ∨_b hhit(b) (h.ca, a) .
Along the lines of the proof of Lemma 8.2 this permits us to show the uniqueness of cache lines producing a hit.
Lemma 8.4 (fully associative hit unique).
hhit(b) (h.ca, a) ∧ hhit(b′ ) (h.ca, a) → b = b′ .
For a hit at cache line b we conclude for the data and state outputs:

cadout(h.ca, a) = h.ca.data(b)
= aca(h).data(a)
= acadout(aca(h), a)
casout(h.ca, a) = h.ca.s(b)
= aca(h).s(a)
= acasout(aca(h), a) .
So far, we have not yet explained how to update caches. For different types
of concrete caches this is done in different ways. In what follows we elaborate
details only for direct mapped caches.
8.2 Notation
We summarize a large portion of the notation we are going to use in the
remainder of this book.
8.2.1 Parameters
m : B29 → B64 .
ms.aca(i).s(a) ≠ I ∧ ms.aca(j).s(a) ≠ I →
ms.aca(i).data(a) = ms.aca(j).data(a) .
This definition would be shorter if memory systems were tensors. Then a slice
would simply be the submatrix with coordinate a5 .
Then the abstract cache component of a memory system slice is defined like a
row of a matrix:
acc.r → acc.bw = 0^8
acc.cas → acc.bw ∈ {0^4 1^4 , 1^4 0^4 } .
For CAS accesses, we define the predicate test(acc, d), which compares
acc.cdata with the upper or the lower word of the data d ∈ B64 depending on
the byte write signal acc.bw[0]:
test(acc, d) ≡ acc.cdata = { d[63 : 32]   if ¬acc.bw[0]
                             d[31 : 0]    if acc.bw[0] } .
As the name suggests, access sequences are finite or infinite sequences of ac-
cesses. As with caches and abstract caches we use the same notation acc both
for single accesses and access sequences. Access sequences come in two flavors:
• Sequential access sequences. These are simply mappings acc : N → Kacc
in the infinite case and acc : [0 : n − 1] → Kacc for some n in the finite
case.
• Multi-port access sequences
acc : [0 : P − 1] × N → Kacc ,

where acc(i, k) is access number k issued at port i. The effect of a single access acc on a memory m is specified by a transition function

δM : Km × Kacc → Km
m′ = δM (m, acc) .
For CAS accesses, if the data comparison test(acc, m(acc.a)) succeeds, we call
the CAS access positive. Otherwise we call it negative.
The answers dataout(m, acc) of read or CAS accesses are defined as

dataout(m, acc) = m(acc.a) .

Void and flush accesses do not have any effect on the memory and do not produce an answer.
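A minimal Python sketch of this sequential memory semantics, covering reads, writes, and CAS accesses, is given below. Accesses are modeled as dictionaries with the component names used in the text, lines as 8-byte values; the placement of the upper word d[63 : 32] within the byte string is a simplifying assumption of the sketch.

    # Sketch of the sequential memory semantics delta_M and dataout for read,
    # write, CAS, flush and void accesses.
    def modify(old: bytes, data: bytes, bw: str) -> bytes:
        """Replace byte k of old by byte k of data wherever bw[k] = '1'."""
        return bytes(d if b == '1' else o for o, d, b in zip(old, data, bw))

    def test(acc: dict, d: bytes) -> bool:
        """Compare acc.cdata with the upper or lower word of d, selected by bw[0]."""
        word = d[4:8] if acc['bw'][0] == '0' else d[0:4]
        return acc['cdata'] == word

    def delta_M(m: dict, acc: dict) -> dict:
        a = acc['a']
        if acc.get('w') or (acc.get('cas') and test(acc, m[a])):
            m = dict(m)                        # writes and positive CAS modify the line
            m[a] = modify(m[a], acc['data'], acc['bw'])
        return m                               # reads, flushes, negative CAS: unchanged

    def dataout(m: dict, acc: dict):
        """Answer of read and CAS accesses: the (old) content of line acc.a."""
        if acc.get('r') or acc.get('cas'):
            return m[acc['a']]
        return None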
The change of memory state by sequential access sequences acc of accesses
and the corresponding outputs dataout[i] are defined in the obvious way by
Let m′ = Δ_M^x (m, acc[0 : x − 1]). Then,

Δ_M^{x+y} (m, acc[0 : x + y − 1]) = Δ_M^y (m′ , acc[x : x + y − 1]) .
seq : [0 : P − 1] × N → N ,
then for read accesses the answer msdout(ms, acc, i, k) to access acc(i, k) of
the multi-port access sequence acc is the same as the answer to access seq(i, k)
of the sequential access sequence acc :
ca(i)t = ht .ca(i)
aca(i)t = aca(ht .ca(i)) .
ca(i).Y t = ht .ca(i).Y
aca(i).X t = aca(ht .ca(i)).X .
with
ms(h).mm = h.mm
ms(h).aca(i) = aca(h.ca(i)) ,
m(h) = m(ms(h)) .
8.3.1 Invariants
Throughout this section we abbreviate

mm = ms.mm
aca = ms.aca .
One calls the data in a cache line clean if this data are known to be the
same as in the main memory, otherwise it is called dirty. A line is exclusive
if the line is known to be only in one cache, otherwise it is called shared. The
intended meaning of the states is:
• E – exclusive clean (the data are in one cache and are clean).
• S – shared (the data might be in other caches and might be not clean).
• M – exclusive modified (the data are in one cache and might be not clean).
• O – owned (the data might be in other caches and might be not clean; the
cache with this line in owned state is responsible for writing it back to the
memory or sending it on demand to other caches).
• I – invalid (the data are meaningless).
This intended meaning is formalized in a crucial set of state invariants:
1. States E and M are exclusive; in other caches the line is invalid:
aca(i).s(a) ∈ {E, M } ∧ j ≠ i → aca(j).s(a) = I ,
2. state E is clean:
aca(i).s(a) = E → aca(i).data(a) = mm(a) .
3. Shared lines, i.e., lines in state S, are clean or they have an owner:
aca(i).s(a) = S → aca(i).data(a) = mm(a) ∨ ∃j ≠ i : aca(j).s(a) = O .
4. Data in lines in nonexclusive state are identical:
aca(i).s(a) = S ∧ aca(j).s(a) ∈ {O, S} →
aca(i).data(a) = aca(j).data(a) .
5. If a line is non-exclusive, i.e., in state S or O, other copies must be invalid or in a non-exclusive state. Moreover, the owner is unique:

aca(i).s(a) = S ∧ j ≠ i → aca(j).s(a) ∈ {I, O, S}
aca(i).s(a) = O ∧ j ≠ i → aca(j).s(a) ∈ {I, S} .
We introduce the notation sinv(ms)(a) to denote that the state invariants
hold for cache line address a with a system aca of abstract caches and main
memory mm. For cycle numbers t, we denote by SINV (t) the fact that the
state invariants hold for the memory system ms(h) abstracted from the hard-
ware for all cycles t ∈ [0 : t], i.e., from cycle 0 after reset until t:
sinv(ms) ≡ ∀a : sinv(ms)(a)
SINV (t) ≡ ∀t ∈ [0 : t] : sinv(ms(ht )) .
One easily checks that the state invariants hold if all cache lines are invalid.
In the hardware construction, this will be the state of caches after reset.
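The state invariants can be phrased directly as a checker. The following Python sketch tests sinv for a single line address a over a list of abstract caches (as in the earlier sketches) and a main memory dictionary; it is only an executable restatement of invariants 1-5 above.

    # Executable restatement of the state invariants sinv(ms)(a).
    def sinv(acas, mm, a) -> bool:
        states = [aca.s.get(a, 'I') for aca in acas]
        datas = [aca.data.get(a) for aca in acas]
        for i, si in enumerate(states):
            others = [s for j, s in enumerate(states) if j != i]
            if si in ('E', 'M') and any(s != 'I' for s in others):
                return False                               # 1. E and M are exclusive
            if si == 'E' and datas[i] != mm.get(a):
                return False                               # 2. E is clean
            if si == 'S' and datas[i] != mm.get(a) and 'O' not in others:
                return False                               # 3. S is clean or has an owner
            if si == 'S':
                for j, sj in enumerate(states):
                    if j != i and sj in ('O', 'S') and datas[i] != datas[j]:
                        return False                       # 4. non-exclusive data identical
                if any(s not in ('I', 'O', 'S') for s in others):
                    return False                           # 5. no E/M next to S
            if si == 'O' and any(s not in ('I', 'S') for s in others):
                return False                               # 5. the owner is unique
        return True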
We stress the fact that the atomic protocol is a sequential protocol operating
on a multi-port memory system ms. Its semantics is defined by two functions:
• A transition function
where
ms′ = δ1 (ms, acc, i)
defines the new memory system if single access acc is applied to (cache)
port i of memory system ms.
• An output function
where
d = pdout1(ms, acc, i)
specifies for read and CAS accesses (i.e., accesses with acc.r or acc.cas)
memory system output d as response to access acc at port i in memory
system configuration ms.
We abbreviate
without contacting the other caches; for some flushes it still may have to
write back a cache line to main memory. In case i) the table entry specifies
the next state of the cache line. The table does not explicitly state how data
are to be processed; this is implicitly specified by the fact that we aim at a
memory construction and by the state invariants. We will make this explicit
in Sect. 8.3.4.
In case there is more than a single state in the master table entry, the
protocol is run in four steps. Three steps concern the exchange of signals
belonging to the memory protocol and the next state computation. The fourth
step involves the processing of the data and is only implicitly specified.
1. Out of three master protocol signals Ca, im, bc the master activates the
ones specified in the table entry. These signals are broadcast to the other
caches ca(j), j ≠ i, which are called the slaves of the access. The intuitive
meaning of the signals is:
• Ca – intention of the master to cache line acc.a after the access is
processed.
• im – intention of the master to modify (write) the line.
• bc – intention of the master to broadcast the line after the write has been performed. This signal is activated after a write or a positive CAS hit with non-exclusive data.
2. The slaves j determine the local state aca(j).s(acc.a) of cache line acc.a,
which determines the row of Table 131(b) to be used. The column is
determined by the values of the master protocol signals Ca, im, and bc.
Each slave aca(j) goes to a new state as prescribed in the slave table entry
and activates two slave protocol signals ch(j) and di(j) as indicated by
the slave table entry used. The intuitive meaning of the signals is:
• ch(j) – cache hit in slave aca(j).
• di(j) – data intervention by slave aca(j). Slave aca(j) has the cache
line needed by the master and will put it on a bus, from which the
master can read it.
The individual signals are ORed together (in active low form on an open
collector bus) and made accessible to the master as
ch = ∨_{j≠i} ch(j) ,   di = ∨_{j≠i} di(j) .
3. The master determines the new state of the cache line accessed as a func-
tion of the slaves’ responses as indicated by the table entry used. The
notation ch ? X : Y is an expression borrowed from C and means

ch ? X : Y = { X   if ch
               Y   if ¬ch } .
4. The master processes the data. This step is discussed in Sect. 8.3.4.
We extract from the tables three sets of switching functions. They correspond
to phases of the protocol, and we specify them in the order, in which they are
used in the protocol:
• C1 – this function is used by the master. It depends on a state⁶ s ∈ S and the type of the access acc.type ∈ B4 , which encodes whether the access is a read, write, CAS, or flush. The function C1 computes the master protocol signals C1.Ca, C1.im, and C1.bc. Thus,

C1 : S × B4 → B3 .

The component functions C1.X are defined by translating the master protocol table, i.e., looking up the corresponding cell (s, type) in Table 131(a) and choosing the necessary protocol bits accordingly.

⁶ Recall that we encode the cache states in the unary form, as bit vectors in B5 with exactly one bit set.
Fig. 132. Symbols for circuits C1, C2, and C3 computing the protocol signals and next state functions of the MOESI protocol
For the following definitions we assume sinv(ms), i.e., that the state invariants
hold for the memory system ms before the (sequential and atomic) processing
of access acc at port i.
For all components x of an access acc, we abbreviate
x = acc.x .
Note that a in this section and below, where applicable, denotes the line
address acc.a. Also, the functions we define depend on arguments ms.aca and
ms.mm. For brevity of notation we will omit these arguments most of the
time – but not always – in the remainder of this section. We now proceed to
define the effect of applying accesses acc to port i of the memory system ms
by specifying functions ms′ = δ1 (ms, acc, i) and d = pdout1(ms, acc, i).
We only specify the components that do change. We define a hit at atomic
abstract cache aca(i) by
hit(aca, a, i) ≡ aca(i).s(a) ≠ I .
We further identify local read and write accesses. A local read access is either
a read hit or a CAS hit with the negative test result. A local write is either a
write hit to exclusive data or a positive CAS hit to exclusive data:
rlocal(aca, acc, i) = hit(aca, a, i) ∧ (r ∨ cas ∧ ¬test(acc, aca(i).data(a)))
wlocal(aca, acc, i) = hit(aca, a, i) ∧ (w ∨ cas ∧ test(acc, aca(i).data(a)))
∧ aca(i).s(a) ∈ {E, M } .
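Reusing the access representation and the test predicate from the sketch of the sequential memory semantics, the classification of local accesses can be written as follows; as before, all names are illustrative.

    # Classification of local accesses of the atomic protocol.
    def hit(acas, a, i) -> bool:
        return acas[i].s.get(a, 'I') != 'I'

    def rlocal(acas, acc, i) -> bool:
        a = acc['a']
        return hit(acas, a, i) and (
            acc.get('r') or (acc.get('cas') and not test(acc, acas[i].data[a])))

    def wlocal(acas, acc, i) -> bool:
        a = acc['a']
        return (hit(acas, a, i)
                and (acc.get('w') or (acc.get('cas') and test(acc, acas[i].data[a])))
                and acas[i].s[a] in ('E', 'M'))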
⁷ In Sect. 8.3.4 we treat cache hits on CAS with the negative test result the same way as read hits and do not update the state of the cache line.
Flush
A flush invalidates abstract cache line a and writes back the cache line in case
it is modified or owned:
f → aca′ (i).s(a) = I ∧ (aca(i).s(a) ∈ {M, O} → mm′ (a) = aca(i).data(a)) .
Note that we allow invalidation of any cache line, even the one which is initially
in state E, S, or I. The main memory in that case is not updated.
Local write accesses update the local cache line addressed by a and change
the state to M :
wlocal(aca, acc, i) →
aca′ (i).data(a) = modify (aca(i).data(a), data, bw)
aca′ (i).s(a) = C3(aca(i).s(a), acc.type, ∗).ps
= M .
In case of positive CAS hits we also need to specify the output of the memory
system. We do this later in this section.
Global Accesses
global(aca, acc, i) →

mprot = C1(aca(i).s(a), acc.type)
∀j ≠ i : sprot(j) = C2(aca(j).s(a), mprot)
∀X ∈ {ch, di} : sprot.X = ∨_{j≠i} sprot(j).X

∀j : aca′ (j).s(a) = { C3(aca(i).s(a), acc.type, sprot.ch).ps   if i = j
                       C2(aca(j).s(a), mprot).ss                if i ≠ j } .
Next, we specify the data broadcast bdata via the bus during a global transac-
tion. If the broadcast signal mprot(i).bc is active then the master broadcasts
the modified result modify (aca(i).data(a), data, bw). If a slave activates the
data intervention signal then it is responsible for putting the data on the bus.
The intervening slave j is unique by the state invariants sinv(ms). In other
cases the data are fetched from the main memory:

global(aca, acc, i) →

bdata = { modify (aca(i).data(a), data, bw)   if mprot(i).bc
          aca(j).data(a)                      if sprot(j).di
          mm(a)                               otherwise } .
During a global access the caches signaling a cache hit sprot(j).ch store the
broadcast result if the master activates a broadcast signal mprot(i).bc:

mprot(i).bc ∧ sprot(j).ch → aca′ (j).data(a) = bdata .
Note that in case of a write hit or a positive CAS hit the master and the
affected slaves store the same data for address a.
For global CAS accesses we define the test data as the old value stored in
cache i if we have a hit (which means that the broadcast signal is on) or as
the data obtained from the bus otherwise:
global(aca, acc, i) ∧ cas → tdata = { aca(i).data(a)   if mprot(i).bc
                                      bdata            otherwise } .
Negative CAS misses are grouped together with the regular read misses into global reads:

rglobal(aca, acc, i) = global(aca, acc, i) ∧ (r ∨ cas ∧ ¬test(acc, tdata))
wglobal(aca, acc, i) = global(aca, acc, i) ∧ (w ∨ cas ∧ test(acc, tdata)) .

In case of a global read the master copies the missing cache line from the bus without modifications:

rglobal(aca, acc, i) → aca′ (i).data(a) = bdata .

In case of a global write the master either modifies its old data in case of a hit (which means that the broadcast signal is on) or modifies the data obtained from the bus:

wglobal(aca, acc, i) →
aca′ (i).data(a) = { modify (aca(i).data(a), data, bw)   if mprot(i).bc
                     modify (bdata, data, bw)            otherwise } .
Answers of Reads
For a read request or a CAS request, we return either the data from the local
cache or the data fetched from the bus:
r ∨ cas → pdout1(ms, acc, i) = { aca(i).data(a)   if hit(aca, a, i)
                                 bdata            otherwise } .
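The data flow of global accesses can be summarized in a small Python sketch: bdata selects between the master's modified line, an intervening slave, and main memory, and pdout1 returns the answer of reads and CAS accesses. The protocol bits mprot_bc and sprot_di are treated as given inputs here; in the protocol they are computed by the circuits C1 and C2 from the tables of Fig. 131.

    # Sketch of the data flow of a global access.
    def bdata(acas, mm, acc, i, mprot_bc: bool, sprot_di: list):
        a = acc['a']
        if mprot_bc:                                   # master broadcasts its modified line
            return modify(acas[i].data[a], acc['data'], acc['bw'])
        for j, di in enumerate(sprot_di):
            if j != i and di:                          # the intervening slave is unique
                return acas[j].data[a]
        return mm[a]                                   # otherwise fetch from main memory

    def pdout1(acas, acc, i, bus_data):
        """Answer of reads and CAS: local data on a hit, bus data otherwise."""
        a = acc['a']
        if acc.get('r') or acc.get('cas'):
            return acas[i].data[a] if hit(acas, a, i) else bus_data
        return None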
Iterated Transitions
Let ms′ = Δ_1^x (ms, acc[0 : x − 1], is[0 : x − 1]). Then,

Δ_1^{x+y} (ms, acc, is) = Δ_1^y (ms′ , acc[x : x + y − 1], is[x : x + y − 1]) .
In the atomic execution of the MOESI protocol the state invariants are pre-
served.
Lemma 8.9 (invariants maintained). Let ms′ = δ1 (ms, acc, i). Then,

sinv(ms) → sinv(ms′ ) .
Proof. The proof of this lemma is error prone, so it is usually shown by model
checking [4, 13].
An easy proof shows that we have achieved two more goals that were stated
before.
Lemma 8.10 (memory abstraction 1 step). Let ms′ = δ1 (ms, acc, i) and the state invariants sinv(ms) hold. Then,

• the resulting memory abstraction m(ms′ ) behaves as if access acc had been applied with ordinary memory semantics to the previous memory abstraction m(ms):

m(ms′ ) = δM (m(ms), acc) ,

• the response to read accesses is equal to the response given by the memory abstraction m(ms):

pdout1(ms, acc, i) = dataout(m(ms), acc) .
The following technical lemma formalizes the fact that the abstract protocol
with an access acc only operates on memory system slice Π(ms, acc.a). The
reader might have observed that this address does not even occur in the tables
specifying the protocol, because everybody understands that line address acc.a (alone) is concerned in each cache. Readers familiar with cache designs
will of course observe that read, write, or CAS accesses acc can trigger flushes
evicting cache lines with line addresses a ≠ acc.a; but these are treated as
separate accesses in our arguments.
Lemma 8.12 (properties 1 step). Let ms′ = δ1 (ms, acc, i) and a = acc.a.
Then,
2. Slices different from slice a of the memory system are not changed
b ≠ a → Π(ms′ , b) = Π(ms, b) .
Proof. The proof for every case is based on the following arguments.
1. For local reads we specified no change of ms.
2. In the definition of function δ1 we only specified components that change.
Slices other than slice a were not among them.
3. This is a simple bookkeeping exercise, where one has to compare all parts
of the definition of function δ1 for the two memory system configurations
ms1 and ms2 8
4. Bookkeeping exercise.
p → ca Interface
Recall that our main memory (Sect. 3.5.6) behaves as a ROM for addresses 0^{29−r} b, where b ∈ B^r for some small r. As a result, any write performed to
this memory region has no effect. Yet, in the sequential memory semantics
given in Sect. 8.2.4, we consider the whole memory to be writable. To resolve
Fig. 133. The timing diagram for a k-cycle write access followed by two consecutive 1-cycle read accesses
that problem we add a software condition, stating that the processors never
issue write and CAS requests to addresses smaller than or equal to 0^{29−r} 1^r .
ca → p Interface
p ↔ ca Protocol
We need to define a protocol for interaction between a processor and its caches
(data and instruction cache). Communication between processor p and its
cache ca is done under the following rules:
• p starts a request by activating preq,
• ca in the same cycle acknowledges the request by raising (Mealy9 ) signal
mbusy (unless a one-cycle access is performed),
• ca finishes with lowering mbusy, and
• p disables preq in the next cycle.
⁹ Recall that a Mealy output of the control automaton is a function of the input and the current state.
The timing diagram for a k-cycle (write) cache access is depicted in Fig. 133. Cycle t is the first cycle of an access iff

preq^t ∧ (¬preq^{t−1} ∨ ¬mbusy^{t−1}) .
Observe that 1-cycle accesses are desirable and indeed possible (in case of
local reads, including negative CAS hits). Then signal mbusy is not raised
at all and the processor can immediately start a new request in cycle t + 1.
The timing diagram for two consecutive 1−cycle read accesses is depicted in
Fig. 133.
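A small Python sketch of the handshake, viewing preq and mbusy as Boolean traces indexed by cycle, may help; the formula for the first cycle of an access is the one stated above, and the helper names are illustrative.

    # Sketch of the processor/cache handshake of Fig. 133.
    def first_cycle(preq, mbusy, t: int) -> bool:
        """Cycle t starts an access iff preq is raised and no access was pending."""
        return preq[t] and (t == 0 or not preq[t - 1] or not mbusy[t - 1])

    def last_cycle(mbusy, t: int) -> int:
        """End cycle of an access starting in t: the first cycle >= t with mbusy low."""
        end = t
        while end < len(mbusy) and mbusy[end]:
            end += 1
        return end

    # A 1-cycle read (mbusy never raised) followed immediately by a new request:
    preq, mbusy = [True, True, True, False], [False, True, False, False]
    assert first_cycle(preq, mbusy, 0) and last_cycle(mbusy, 0) == 0
    assert first_cycle(preq, mbusy, 1) and last_cycle(mbusy, 1) == 2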
Once the processor request signal is raised, inputs from the processor must be stable¹⁰ until the cache takes away the mbusy signal. In order to formalize this condition we identify the cache input signals of cache ca(i) in cycle t as

cain(i, t) = {pa, type, pbw, preq} ∪ { {pdin}          if ca(i).pwt
                                       {pdin, pcdin}   if ca(i).pcast
                                       ∅               otherwise } .
Memory Bus
The memory bus b is subdivided into 4 sets of bus lines. The first three sets are
already known from the main memory specification in Sect. 3.5 and contain
regular tristate lines. The corresponding outputs are connected to these lines
via the tristate drivers. The fourth set of lines supports the cache protocol
and is an open collector bus:
• b.d ∈ B64 – for transmitting data contained in a cache line,
• b.ad ∈ B29 – memory line address,
• b.mmreq ∈ B, b.mmw ∈ B, b.mmack ∈ B – memory protocol lines,
• b.prot ∈ B5 – cache protocol lines.
ca ↔ b Interface
The following dedicated registers are used between cache ca(i) and bus b:
• ca(i).bdout ∈ B64 – cache data output to the bus,
¹⁰ Stability in the digital sense is enough here; the processors never access main memory directly.
For bus arbitration cache ca(i) sends a request

ca(i).req

to the arbiter and waits until the arbiter activates signal grant[i]. Construction
of the arbiter makes sure that at most one grant signal is active at a time.
Register ca(i).req is implemented as a set-clear flip-flop. Control signals for
all set-clear flip-flops are generated by the control automata of the cache.
As shown in Fig. 134, the cache protocol signals are inverted before they
are put on the open collector bus and before they are clocked from the bus into
a register. Thus, by de Morgan’s law we have for every component x ∈ [0 : 4]:
¬b.prot[x] = ¬( ∧_j ¬ca(j).bprotout[x] )
= ∨_j ca(j).bprotout[x] .
When several slaves signal a data intervention, further bus arbitration appears
to be necessary, since only one cache should access the bus at a time. However,
Fig. 134. Using de Morgan's law to compute the OR of active high signals on the open collector bus b.prot
arbitration is not necessary as long as only one slave will forward the required
cache line. This is guaranteed by the cache coherency protocol, where we
do not raise di in case of a miss on data in state S. However, the protocol
provides that all caches keep the same data when it is shared, so that we
could in principle forward the data if we arbitrate the data intervention. A
possible arbitration algorithm for data intervention in a “shared clean miss”
case would be to select ca(i) with the smallest i, s.t., di is active. This can be
efficiently implemented using a parallel prefix OR circuit.
The data paths for the data RAM, state RAM, and tag RAM are presented
in Figs. 135, 137, 136 respectively.
The control signals for the data paths are generated by the control au-
tomata described in the following section. Let us try to get a first under-
standing of the designs.
In general RAMs are controlled from two sides: i) from the processor
side using port a and ii) from the bus side using port b. Auxiliary registers
ca(i).dataouta , ca(i).tagouta , ca(i).souta , and ca(i).soutb latch the outputs
of the RAMs. We have introduced these registers in order to prevent situa-
tions, where reads and writes to the same address of a RAM are performed
during the same cycle¹¹ . Note that such situations are not problematic in our hardware model, because our RAMs are edge triggered. However, in many technologies RAMs are pulse triggered, and then reads and writes to the same address in the same cycle lead to undefined behaviour. With the auxiliary registers the design of this book is simply implementable in far more technologies, in particular in the FPGA based implementation from [8]. For the purpose of the correctness proof we show in Lemma 8.24 that they always have the same data as the current output of the edge triggered RAM. Hence, in the remainder of the book we simplify our reasoning about the data paths by simply replacing these registers with wires.

¹¹ Our construction guarantees that we never perform accesses to different ports of the same RAM with the same cache line address in a single cycle. As a result, reads and writes to the same cache line address in a single cycle can only occur through the same port (our RAM construction does allow reading and writing the same port in a single cycle). Outputs of port b of data and tag RAMs are never used in cycles when these ports are being written. Hence, we do not introduce auxiliary registers for them.
Simple construction of a modify circuit is given in Sect. 4.2.2. For any kind
of write, data to be written y and byte write signals bw come from the
processor. The result is written to the data RAM via port a.
• Flushes. Except for times when the cache is filling up, a cache miss access
is generally preceded by a flush: a so called victim line with some eviction
line address va is evicted from the cache in order to make space for the
missing line. In a direct mapped cache the eviction address has cache line
address
va.c = pa.c .
The victim line is taken from output outa of the data RAM and put on
the bus via register bdout.
Fig. 135. Data paths of the data RAM
• Global write accesses. This includes write misses, positive CAS misses,
write hits to shared data, and positive CAS hits to shared data. In case
of a cache miss, the missing line is clocked from the bus into register
bdin. From there it becomes an input to the modifier. The output of the
modifier is written back to the data RAM at port a. In case of a hit to
shared data, the cache line is fetched from port a and stored temporarily
in register dataouta . After that it goes to the modify circuit; the output
of the modifier is written to the RAM via port a and is broadcast on the
bus via register bdout.
• Global read accesses. This includes read misses and negative CAS misses.
The missing line is clocked from the bus into register bdin. The modifier
with byte write signals bw = 08 is used as a data path for the missing
cache line. It is output to the processor via signal pdout and written into
the data RAM via input ina.
• Data intervention. The line address is clocked from the bus into register
badin. The intervention line missing in some other cache is taken from
output outb of the data RAM and put on the bus via register bdout.
Signal test is used to denote the result of the CAS test both for local and
global accesses:
test ≡ pcdin = { data(pa.c)[63 : 32]   if ¬bw[0] ∧ phit
                 data(pa.c)[31 : 0]    if bw[0] ∧ phit
                 bdin[63 : 32]         if ¬bw[0] ∧ ¬phit
                 bdin[31 : 0]          if bw[0] ∧ ¬phit } .
The tag RAM is very much wired like a tag RAM in an ordinary direct mapped
cache. It is addressed from the processor side by signal pa and from the bus
side by register badin . We operate the data paths for the tag RAM under the
following rules:
• New tags are only written into the tag RAM from the processor side.
• Hit signal phit for the processor side is computed from outputs souta and
tagouta of state and tag RAMs respectively or from the outputs of auxil-
iary registers souta and tagouta , depending on whether port a of these
RAMs is written in this cycle or not. Lemma 8.24 allows us to treat these
auxiliary registers simply as wires, which results in the desired definition
of the phit signal:
Signal bhit for the bus side is computed using outputs soutb and tagoutb
directly, because signal bhit is never used in cycles when state and tag
RAMs are written through port b:
Fig. 136. Data paths of the tag RAM
• For global accesses, the processor address can be put on the bus via register
badout .
• For flushes, the tag of the victim address is taken from output outa of the
tag RAM. The victim line address is then
va = ca(i).tagouta ◦ pa.c .
As before, addressing from the processor side is by signal pa and from the bus
side by register badin. Some control signals come from the control automata
and are explained in Sect. 8.4.3. The data paths of the state RAM use the
circuits C1, C2, and C3 from Sect. 8.3.3 which compute the memory protocol
signals and the next state of cache lines. We operate the data paths for the
state RAM under the following rules:
• The current master state is read from output outa of the state RAM.
• The new state ps is computed by circuit C3 and is written back to input
ina of the state RAM. As one of the inputs to C3 we provide the type
of the access. This type depends not only on the processor request, but
also on the current state of the master automaton (Sect. 8.4.3): if the
automaton is in state f lush then we calculate the new state for a flush
access (which is I independent of other inputs of C3); there is also a case
when we perform a flush access while we are still in wait without going
to state f lush – this corresponds to an invalidation of a clean line without
writing it back to memory. For more explanations refer to the description
of states wait and f lush in Sect. 8.4.4.
• For global accesses, the master protocol signals are computed by circuit C1
and put on the bus via register mprotout. The mux on top of C1 provides
the invalid state in case if we don’t have a processor hit. The mux on top
of register mprotout is used to clear the master protocol signals after a
run of the protocol.
• If the cache works as a slave, it determines the slave response with circuit
C2 using the state from output outb of the state RAM and puts it on
the bus via register sprotout. The mux on top of circuit C2 forwards the
effect of local writes whose line address conflicts with the line address of
the current global access (so that we don’t have to wait 1 cycle until the
modified state is written to the RAM in case of a local write). The mux
on top of register sprotout is used to clear the slave response after a run
of the protocol.
• The new state of a slave ss is computed by circuit C2 and is written back
to input inb of the state RAM.
Fig. 137. Data paths of the state RAM
Fig. 138. State diagram of the master automaton
We define state automata for the master and the slave case in order to imple-
ment the cache coherency protocol. In general the protocol is divided into 3
phases:
• Master phase 1: Ca, im, bc are computed and put on the bus.
• Slave phase: slave responds by computing and sending ch, di, generating
new slave state ss .
• Master phase 2: master computes new state ps .
For local accesses, only the last step of the protocol is performed (master
phase 2).
The state diagrams for the master and slave automata are presented in
Figs. 138 and 139.
Automata States
The sets of states belonging to local and global transactions are defined as
L = {idle, localw}
G = M \L.
Fig. 139. State diagram of the slave automaton
We also define sets of states that correspond to warm and hot phases of global
transactions:
W = G \ {wait}
H = W \ {f lush} .
We use the following notation for the set of states A ∈ {M, S, L, G, W, H}:
A(i)t ≡ { z(i)t ∈ A    if A ≠ S
          zs(i)t ∈ A   if A = S }

A(i)[t:t′ ] ≡ ∀q ∈ [t : t′ ] : A(i)q .
Statements without index t are implicitly quantified for all cycles t. For tran-
sitions numbered with (n) in Figs. 138 and 139, we mean with (n)(i)t that
the condition holds for the automata of cache i in cycle t.
When talking about automata states and transitions we often omit explic-
itly specifying index i in case when it is clear from the context or when the
statement is implicitly quantified for all cache indices i.
Note that a snoop conflict is discovered one cycle after the address is actually
on the bus (we have to clock data from the bus to register badin first).
A crucial decision on whether to handle an access locally or globally is
performed when the master automaton is in state idle and a processor request
is active. For decisions to go local, the master additionally has to ensure that
no snoop conflict is raised. In case of a global CAS hit we have to reassure
that the decision to go global is still correct in the last cycle of stage wait.
With these prerequisites at hand, we now continue with the actual state
transitions and generated control signals of the automata, starting with state
idle. For every state we write the generated signals in the form

signal := condition ,
meaning that this signal is raised in the given state if the condition is satisfied.
If we omit stating the condition, then the signal is always high in a given state.
State idle
In the idle state the signal mbusy is deactivated if either a local read is
performed (which can be finished in one cycle) or there is no processor request
at all:
Note that mbusy is a Mealy signal and thus does not need to be precomputed
before the clock edges. There are three possible transitions starting from state
idle.
1. Transition (1): idle → idle.
This transition is taken if there is a snoop conflict or if we have a local
read or there is no request from the processor to its cache at all:
ca(i).dataouta ce := (2)
ca(i).souta ce := (2) .
With the transition into state wait we activate signal ca(i).reqset to issue a
request for the bus to the arbiter (Sect. 8.4.5). Additionally, we clock register
ca(i).souta which might be used in the next cycle in place of the output of
the state RAM:
ca(i).souta ce := (3)
ca(i).reqset := (3) .
State localw
In state localw the master changes its state ps for the given cache line from
E to M , see Fig. 137. The activated signals are
ca(i).swa
ca(i).datawa
¬ca(i).mbusy .
Signal swa is used in the state RAM (Fig. 137) and signal datawa is used in
the data RAM (Fig. 135). In state localw we always go back directly to idle
in the next cycle.
State wait
In state wait, the processor waits for its request to be granted by the bus
arbiter (Sect. 8.4.5).
In the last cycle of wait we have to repeat the test for global CAS hits. In
case we went from idle to wait under condition
it may happen that during the time when we are waiting for the bus our
slave automaton updates the cache data or the cache state for address pa.c.
An update of the state is not a problem, because from S and O the line can
go only to S, O, or I, which means that we still need to perform a global
transaction. Yet, if the data RAM has been updated, the outcome of the local
test might not be the same anymore. In this case we should not start a global
transaction at all; instead, we should return to idle. We call such an access
delayed local.
There are four transitions starting from state wait:
1. Transition: wait → wait.
While ¬grant[i], the automaton stays in wait.
2. Transition (5): wait → f lush.
When the request is granted, but the cache line is occupied and dirty, the
automaton goes to state f lush:
We use the output of the auxiliary register souta instead of the output of
port a of the state RAM here because signal (5) is used below to generate
the write enable signal to port a of the state RAM. By Lemma 8.24
(auxiliary registers) we always have:
ca(i).bdoutce := (5)
ca(i).badoutce := (5) ∨ (4)
ca(i).mmreqset := (5)
ca(i).mmwset := (5)
ca(i).bdoutoeset := (5)
ca(i).badoutoeset := (5) ∨ (4)
ca(i).mmreqoeset := (5)
ca(i).mmwoeset := (5)
ca(i).mprotoutce := (4)
ca(i).reqclr := (9)
¬ca(i).mbusy := (9)
Signal reqclr is used to clear the request to the bus arbiter (Sect. 8.4.5) in
case of (9). Note that in case of (4) we have to load the master protocol data
for transmission on the bus.
There is a case when the cache line is occupied by another line (i.e., the
current tag of the cache line does not match to the tag of the processor
address) but we still go to state m0. This happens when the line is clean;
hence, it can be evicted without writing it back to the memory. In this case
we write I to the state RAM (as guaranteed by the output of the circuit C3 in
Fig. 137). This write is not necessary for the correctness of implementation:
we could simply “evict” this line later in state w, by overwriting the current
tag in the tag RAM with the tag of the processor address (such a write would
make the overwritten cache line invalid in the abstract cache). In the proof,
however, that would force us to simulate two accesses simultaneously: a global
access for the line addressed by pa and a flush access for the evicted line. To
avoid this confusion and to (greatly) simplify the proofs, we prefer to do this
invalidation explicitly by writing I to the state RAM on a transition from
wait to m0. Note, that for the generation of signal swa we use the output of
the auxiliary register souta instead of the output of port a of the state RAM.
By Lemma 8.24 (auxiliary registers) we have in this case:
ca(i).souta = ca(i).s(ca(i).pa.c) .
State f lush
In state f lush, we write the cache line that needs to be evicted to memory.
The following signals are set:
ca(i).bdoutoeclr := (6)
ca(i).mmreqoeclr := (6)
ca(i).mmreqclr := (6)
ca(i).mmwoeclr := (6)
ca(i).mmwclr := (6)
ca(i).badoutce := (6)
ca(i).mprotoutce := (6)
ca(i).swa := (6) .
When we leave the state f lush we have to load master data for transmission on the bus¹² . After the flush is done, we write I to the state RAM. Just
as the same kind of write performed in state wait, this write is not strictly
necessary for the correctness of implementation. Yet it makes the proofs much
easier. Particularly, in Lemma 8.64 we can simulate the 1-step flush access
immediately when the master automaton leaves the state flush, and do not
have to wait until the tag of the evicted line gets overwritten in state w.
There are two transitions starting from state f lush.
1. Transition: f lush → f lush.
While ¬b.mmack, we stay in f lush since the memory is still busy.
2. Transition (6): f lush → m0.
When the b.mmack signal gets active, the memory access is finished and
the automaton proceeds to state m0:
(6) = b.mmack .
State m0

During the m0 phase (1 cycle), master protocol data are transmitted on the bus. In the next cycle, the master automaton always advances to state m1.
The following transitions in the slave automaton start from state sidle.
1. Transition (7): sidle → s1.
Slave i leaves the sidle state iff some master j is in state m0 transmitting signal Ca and j ≠ i:
ca(i).mprotince := (7)
ca(i).badince := (7) .
¹² The multiplexer controlled by signal phit in Fig. 137 makes sure that we forward the invalid state as an input to circuit C1. Note that for the computation of signal phit in this case we use the output of port a of the state RAM, rather than the output of the auxiliary register souta , even though port a of the state RAM is written in the same cycle. The reason why this behaviour is acceptable here is simple: we went to state f lush because the tag of the line in the cache did not match the tag of the requested line. In state f lush the tag RAM is not updated. Hence, the requested line stays invalid in the abstract cache and the output of the state RAM is simply ignored in the computation of signal phit.
States m1 and s1
During these states (1 cycle), the master does nothing and the slave computes
response signals. If the slave doesn’t have the requested data it goes to state
idle , where it waits until signal Ca is removed from the bus. The snoop
conflict starts to be visible in this phase.
Two transitions are starting in state s1.
1. Transition (8): s1 → sidle′ .
If the slave does not have an active bhit signal, then it goes to state sidle′ :
(8) = ¬bhit .
2. Transition: s1 → s2.
If the slave doesn’t go to sidle′ , then it goes to state s2.
The following signal is raised in the slave:
ca(i).sprotoutce := bhit .
State sidle′

The slave waits until Ca is removed from the bus, then moves to sidle.
States m2 and s2
During these states (1 cycle), the slave response signals are transmitted on
the bus. The following signal is raised in the master:
ca(i).sprotince .
States m3 and s3
Recall, that in a global transaction the master either has to read the data
from the memory or from another cache (in case of a cache miss) or has to
broadcast the data to other caches (in case of a cache hit in a shared or owned
state). In state m3 (1 cycle), the master makes a decision whether a memory
access must be performed in the mdata phase. This depends on whether di
was active on the bus during stage m2. In case of a cache hit the master
prepares the data for broadcasting. The following signals are raised in the
master:
The following signals are raised in the slave (preparing the data intervention
if necessary):
ca(i).bdoutce := ca(i).sprotout.di
ca(i).bdoutoeset := ca(i).sprotout.di .
States mdata and sdata

During this phase the master performs a memory access if necessary and
reads the data from the bus. The data are either provided by a slave or are
provided by the main memory. In case the data are provided by a slave, the
mdata and sdata phases only last for 1 cycle. If the data are provided by the
main memory, the automata stay in states mdata and sdata as long as there
is an active memory request:
¬b.mmack ∧ b.mmreq.
Leaving this state, the master clears control signals. The following signals are
raised in the master:
The slave has to clock the broadcast data or clear the data output enable
signal. When leaving state sdata, the b output of the state RAM is clocked
into register ca(i).soutb , which will be used in the next cycle:
ca(i).bdince := ca(i).mprotin.bc
ca(i).bdoutoeclr := ca(i).sprotout.di
ca(i).soutb ce := b.mmack ∨ ¬b.mmreq .
States w and sw
During this phase (1 cycle) the master and the slave write the results of the
transaction into their RAMs (data, tag, and state). The following signals are
raised in the master:
ca(i).datawa
ca(i).tagwa
ca(i).swa
ca(i).reqclr
¬ca(i).mbusy .
The following signals are raised in the slave:
ca(i).datawb := ca(i).mprotin.bc
ca(i).swb
ca(i).sprotoutce
ca(i).sprotz .
Master Arbitration
In case of master arbitration we have to ensure fairness. Fairness means that
every request to access the bus is finally granted. The arbiter collects requests
req(i) from the caches and chooses exactly one cache that will get the per-
mission to run on the bus. The winner is identified by the active grant[i]
signal.
The implementation of a fair arbiter is presented in Fig. 140. For the
computation of nextgrant, we use circuit f 1, which finds the first 1 in a
bit-string starting from the smallest index:
For example, for bit positions 2p − 1, . . . , 0:

X = 1 · · · 1 1 0 · · · 0
Y = 0 · · · 0 1 0 · · · 0
During the initialization phase we set grant[0] to 1 and all other grant signals
to 0. This guarantees that we always have an active grant signal, even if we
don’t have any active requests.
Note that if cache i gets a permission to run the bus it will maintain
this permission until it lowers its req signal (it will always be the winner in
the f 1 circuit). A cache may get access to the bus in two consecutive memory
accesses, however, only if there are no waiting requests from other caches. The
master lowers the req signal when it leaves stage w. Thus, when the master
returns to idle a new set of grant signals is computed and another cache may
start its bus access in the next cycle.
Our construction of the master arbiter guarantees that every request to access
the bus is finally granted if the following conditions are satisfied:
1. Without grant, request stays stable:

req(i)t ∧ ¬grant[i]t → req(i)t+1 .

2. With grant, the request is eventually lowered:

req(i)t ∧ grant[i]t → ∃t′ ≥ t : ¬req(i)t′ .
The first condition is true, since in state wait signal req(i) stays active and
we do not leave the state before grant[i] holds. The second condition holds
due to system liveness: being in the warm phase, the master automaton will,
eventually, always reach state idle and lower its request signal.
Lemma 8.13 (arbiter fairness).
req(i)t → ∃t′ ≥ t : grant[i]t′
Proof. We only give a sketch of the proof here. In the proof we show that the
distance between the index of the current master and the index of any other
requesting cache i is strictly monotonic, decreasing with each arbitration. Let
one(X) be defined as¹³

one(X) = i ↔ X[i] = 1 .

¹³ Well-defined only if string X contains exactly one bit that is set.
Then,
one(nextgrant) = { min{j ≥ one(grant) | req[j] = 1}   if such j exists
                   min{j | req[j] = 1}                otherwise } .
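The arbitration rule can be sketched in Python as follows; req and grant are lists of Booleans with exactly one grant bit set, and the behaviour for the (irrelevant) case of no pending request is an assumption of the sketch.

    # Sketch of the arbitration rule stated above: the grant moves to the
    # first requesting cache at an index >= the current grant position and
    # wraps around to the smallest requesting index otherwise; if nobody
    # requests, the grant is simply kept.
    def next_grant(req, grant):
        cur = grant.index(True)
        ge = [j for j in range(cur, len(req)) if req[j]]
        any_req = [j for j in range(len(req)) if req[j]]
        winner = ge[0] if ge else (any_req[0] if any_req else cur)
        return [j == winner for j in range(len(req))]

    assert next_grant([False, True, False, True], [False, False, True, False]) == \
        [False, False, False, True]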
By induction one can show that the distance M (i, t) between the index of cache i and the index of the current master is decreasing with every new arbitration (i.e., when the grant register is clocked with the new value).
8.4.6 Initialization
We assume the reset signal to be active in cycle −1. The following signals are
activated during reset:
• Signals mprotz −1 , mprotoutce−1 , sprotz −1 , sprotoutce−1 . This ensures that master and slave protocols are initialized correctly.
• All cache lines are invalidated and all control automata start in their initial states:

∀i, x : ca(i).s(x)0 = I
∀i : idle(i)0 ∧ sidle(i)0 .
• Signal reqclr−1 guarantees that caches don’t request the bus until they get to the wait state.
• Signals bdoutoeclr−1 , badoutoeclr −1 and signals mmreqclr−1 , mmwclr−1 ,
mmwoeclr−1 , mmreqoeclr−1 make sure that no master automaton gets
the bus before requesting it.
• The grant signal for the cache 0 is set to 1 and all other grant signals are
set to 0:
grant[i]0 = { 1   if i = 0
              0   otherwise } .
Recall that for cache abstraction aca(i) = aca(ca(i)) we use the same defini-
tion as was introduced for direct mapped caches in Sect. 8.1.2:
aca(i).s(a) = { ca(i).s(a.c)   if hhit(ca(i), a)
                I              otherwise }

aca(i).data(a) = { ca(i).data(a.c)   if hhit(ca(i), a)
                   ∗                 otherwise } .
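Reusing the DirectMappedCache sketch from Sect. 8.1, this abstraction reads in Python as follows; None stands for the unspecified value ∗.

    # The cache abstraction aca(ca) for a direct mapped hardware cache.
    def aca_state(ca, a):
        _, c = ca.split(a)
        return ca.s[c] if ca.hhit(a) else 'I'

    def aca_data(ca, a):
        _, c = ca.split(a)
        return ca.data[c] if ca.hhit(a) else None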
We proceed with the correctness proof of the shared memory system in the
following order:
1. We show properties of the bus arbitration guaranteeing that the warm
phases of global transactions don’t overlap.
2. We show that caches not involved in global accesses output ones to the
open collector buses, i.e., they do not disturb signal transmission by other
caches.
3. We show that control automata run in sync during global accesses.
4. This permits us to show that tristate buses are properly controlled.
5. We show that protocol data are exchanged in the intended way.
6. This permits us to show that data are exchanged in the intended way.
7. Aiming at a simulation between hardware and the atomic MOESI cache
system, we identify the accesses of the hardware computation.
8. We prove a technical lemma stating that accesses acc(i, k) of the atomic
protocol only depend on cache lines with line address acc(i, k).a and only
modify such cache lines.
9. In Lemma 8.64 we show simulation between the hardware model executing
a given access and the atomic model executing the same access.
10. We order hardware accesses by their end cycle and relate the hardware
computation with the computation of the atomic protocol in Lemma 8.65.
Moreover, we show that the state invariants are maintained by the hard-
ware computation.
11. In Lemma 8.67 we show that our hardware memory system is a sequen-
tially consistent shared memory.
In lemmas and theorems given in this section we abbreviate
• (A) – automata construction,
• (HW) – hardware construction,
• (IH) – induction hypothesis.
8.5.1 Arbitration
Now we state the very crucial lemma about the uniqueness of the automaton
in the warm phase.
Lemma 8.18 (warm unique). Only one processor at a time can be in a
warm phase:
W (i) ∧ W (j) → i = j .
Proof. W (i)∧W (j) implies grant[i]∧grant[j] by Lemma 8.17 (grant at warm).
By Lemma 8.14 (grant unique) one concludes i = j.
Since the signals between caches are transmitted via an open collector bus,
we want both slaves and masters to stay silent when they do not participate
in a global transaction.
Lemma 8.19 (silent slaves). When a slave is not participating in the pro-
tocol, it puts slave response 00 on the control bus:
Lemma 8.21 (idle slaves). If no automaton is in a hot phase, then all slaves
are idle:
(∀i : ¬H(i)t ) → ∀j : sidle(j)t .
Proof. For all i, we have after reset idle(i)0 , where idle ∉ H, and sidle(i)0 . Thus, the
lemma holds initially. The induction step requires to argue about all states
and can only be completed at the end of the section.
The next lemma explains how in a hot phase the master and the slave states
are synchronized.
Lemma 8.22 (sync). Consider a hot phase of master i lasting from cycle t to t′ , i.e., we have

¬H(i)t−1 ∧ H(i)[t:t′ ] ∧ ¬H(i)t′ +1 .
Then,
1. For the master i we have
m0(i)t ∧ m1(i)t+1 ∧ m2(i)t+2 ∧ m3(i)t+3 ∧
mdata(i)[t+4:t′ −1] ∧ w(i)t′ ∧ idle(i)t′ +1 .
Proof. Part (1) follows directly by (A). For the proof of parts (2), (3), and (4), recall that we are proving both lemmas together by induction on t¹⁴ . Our induction hypothesis is stated for all cycles q ≤ t and consists of two parts:

• ∀q ≤ t : ∀t′ : ¬H(i)q−1 ∧ H(i)[q:t′ ] ∧ ¬H(i)t′ +1 → (2), (3), (4), (sync)
• ∀q ≤ t : (∀i : ¬H(i)q ) → ∀j : sidle(j)q . (idle slaves)

¹⁴ Observe that for Lemma (sync) t is the start time of the hot phase.
The base case (t = 0) is trivial by (A). In the proof of the induction step from
t − 1 to t, the induction hypothesis trivially implies both statements in case
q < t. Hence, we only need to do the proof for the case q = t.
For the induction step of (sync), we have
¬H(i)t−1 ∧ H(i)t ∧ H(i)[t:t′ ] ∧ ¬H(i)t′ +1
and conclude
(wait(i)t−1 ∨ f lush(i)t−1 ) ∧ grant[i]t−1 .

By Lemma 8.14 (grant unique) it follows that

∀j ≠ i : ¬grant[j]t−1 .

Since the states wait and f lush are not in H, and since by Lemma 8.17 (grant at warm) no other cache can be in a warm phase without the grant, we get

∀j : ¬H(j)t−1 .
Using (idle slaves) as part of the induction hypothesis for cycle t − 1 we get
∀j : sidle(j)t−1 .
Using Lemma 8.20 (silent master), part (1) of Lemma (sync), (A), and (HW)
we then conclude for the cycles s ∈ [t − 1 : t ]:
b.mprot.Cas = ∨_j mprotout(j).Cas = { 0   if s ∈ {t − 1, t′ }
                                      1   if s ∈ [t : t′ − 1] } .
By Lemmas 8.17, 8.14 (grant at warm, grant unique) we know that the grant
signals are stable during cycles s ∈ [t − 1 : t′ ]:

grant[i]s ∧ ∀j ≠ i : ¬grant[j]s .
Parts (2), (3), and (4) follow now by construction of the slave automata and
observing that the exit conditions for states mdata and sdata are identical.
For the induction step of (idle slaves), we consider a cycle t such that
∀i : ¬H(i)t . We make the usual case distinction:
• ∀i : ¬H(i)t−1 . By (IH) we have

∀j : sidle(j)t−1 .

By Lemma 8.20 (silent master) no master puts Ca on the bus, hence

b.mprot.Cat−1 = 0

and by (A) all slaves stay in sidle:

∀j : sidle(j)t .
Now we are able to argue about the uniqueness of the di signal put on the
bus by the slaves.
Recall that the predicate SINV (t), introduced in Sect. 8.3.1, denotes that the state invariants hold for the memory system ms(h) for all cycles t′ ≤ t:

SINV (t) ≡ ∀t′ ≤ t : sinv(ms(ht′ )) .
In the following lemma and in many other lemmas in this chapter we assume that the predicate SINV (t) holds. Later, we apply these lemmas in the proof of a very important Lemma 8.64 (1 step). Then we use Lemma (1 step) as the main argument in the induction step of Lemma 8.65 (relating hardware with atomic protocol), where we in turn make sure that the predicate SINV (t) actually holds.
From (HW) we also know that ca(j).badin t−1 = ca(i).badin t−1 . From
SINV (t) and (A) it follows
Proof. By a case split on the state of master and slave automata in cycle t:
• Let w(i)t hold. By (A) we have mdata(i)t−1 , which implies for X ∈
{dataouta, tagouta, souta}
X (i)t = X(i)t−1 .
Applying Lemma 8.22 (sync) we know that the slave automaton of cache
i is in state sidle in cycle t − 1:
sidle(i)t−1 .
Since we don’t activate write enable signals for the RAMs in states sidle
and mdata, we know that the content of RAMs doesn’t change from t − 1
to t. By Lemma 8.16 (request at global), definition of mbusy, and stability
of processor inputs we get
pa(i)t = pa(i)t−1 .
Hence,
• Let localw(i)t hold. As in the previous case we conclude idle(i)t−1 and for
X ∈ {dataouta, souta}
X (i)t = X(i)t−1 .
Because transition (2) was taken in cycle t−1, we know that signals preq(i)
and mbusy(i) were high in cycle t − 1. Hence, from stability of processor
inputs we get
pa(i)t = pa(i)t−1 .
which implies that ports b of RAMs were either not written, or were written
with a cache line address different from pa(i)t .c. The a ports of RAMs are
not clocked in state idle at all. Hence, for all outputs X we get
X(i)t = X(i)t−1
sdata(i)t−2 ∨ s3(i)t−2 .
Port b of the state RAM is never written in sdata. Port a is written only
in states w, f lush, and localw. By Lemmas 8.22 and 8.18 (sync, warm
unique) we conclude that the master automaton of cache i can not be in
states w or f lush in cycle t − 1. In case localw(i)t−1 holds, we know that
in cycle t − 2 there was no snoop conflict on the bus:
pa(i)t−2 .c ≠ badin(i)t−2 .c ,
pa(i)t−2 = pa(i)t−1 ,
which implies
pa(i)t−1 .c ≠ badin(i)t−1 .c .
Hence, port a of the state RAM could only be written with a different
cache line address and we get
• Let wait(i)t ∧ grant[i]t hold. In the previous cycle the master automaton
of cache i was in state wait or idle. Hence, the register souta was clocked:
By (A) we know that port a of the state RAM was not clocked in cycle t−1.
We show by contradiction that port b of the RAM is also not clocked in
Recall that in Sect. 3.5 we defined a discipline for the clean operation of the
tristate bus with and without the main memory. This discipline guaranteed
the absence of bus contention and the absence of glitches on the bus during
the transmission interval.
For the control of the tristate bus without the main memory we introduced
the function send(i) = j and the time intervals Ti = [ai : bi ] when unit j
is transmitting the data on the bus. We allowed unit j to transmit in two
consecutive time intervals Ti and Ti+1 where ai+1 = bi without disabling and
re-enabling the driver. If this is not the case, i.e., the driver of unit j is disabled
in cycle bi , then there must be at least one cycle between two consecutive time
intervals:
bi + 1 < ai+1 .
The sending register X is always clocked in cycle ai − 1 and must not be
clocked in the time interval [ai : bi − 1]. The flip-flop controlling the output
enable signal Xoe must be set in cycle ai − 1 (unless the same unit is sending
in the consecutive time interval, i.e., only if ai − 1 = bi−1 ) and must be cleared
in cycle bi (again, only if bi + 1 = ai+1 ).
A tristate bus with the main memory is controlled in the same way. The
start of the time interval Ti = [ai : bi ] of the memory access is identified by an
activation of signal mmreq(j) and the end of the time interval is determined
by the memory acknowledge signal. Note that for other inputs to the main
memory the time intervals are allowed to be larger than the time interval for
mmreq.
¹⁵ Observe that this proof works only because we have a one cycle delay in the generation of the grant signal: master j still has the request to the arbiter high in state w(j)q . Hence, in cycle q + 1 the signal grant[j] is high, even though cache j is already in state idle. This proof is the only place where we rely on this one cycle delay in the generation of grant signals. With a more aggressive arbitration, i.e., if the request to the arbiter was lowered one cycle earlier, one would have to forward the value written to the state RAM as an input to register souta in cycle q.
For a register X of cache i driving the bus we consider the set of cycles in which its output enable signal is active:

Cy(X, i) = {t | Xoe(i)t } .
For each of the signals X concerned, we will formulate Lemma (X) charac-
terizing the set Cy(X, i) (Lemmas 8.28, 8.29, 8.30, 8.31). Clearly, the flip-flop
controlling the output enable driver of register X is set in cycle t if

t ∉ Cy(X, i) ∧ t + 1 ∈ Cy(X, i)

and cleared in cycle t if

t ∈ Cy(X, i) ∧ t + 1 ∉ Cy(X, i) .
To make sure that our construction satisfies the control discipline we show
the following properties:
• sets Cy(X, i) and Cy(X, j) are disjoint for i ≠ j and there is at least one cycle in between cycles t ∈ Cy(X, i) and t′ ∈ Cy(X, j); hence, there can be no bus contention (as guaranteed by Lemma 3.9 (tristate bus control)),
• for cycles t ∈ Cy(X, i) registers X(i) can be clocked in cycle t only if cycle
t + 1 does not belong to Cy(X, i) (with the exception of the signal badout,
as discussed above); the set-clear flip-flops controlling the drivers obey the
same rules; thus, the enabled drivers do not produce glitches on the bus,
• we show that the disabled drivers are not redundantly cleared; thus, the
disabled drivers do not produce glitches on the bus,
• one could also show that register X is always clocked in cycle t ∉ Cy(X, i) if t + 1 ∈ Cy(X, i); yet, the absence of bus contention and the absence of glitches does not really depend on this condition, so we do not bother.
Our next goal is to show the absence of bus contention. This involves a case
distinction. The easy case deals with signals which are active only during the
warm phases of master states, i.e., satisfying
∀t : t ∈ Cy(X, i) → W (i)t .
It will turn out that these are all signals except bdout. We deal with bus
contention for the latter signal at the end of the next section in Lemma 8.39.
For the easy case we formulate the following lemma.
Lemma 8.25 (no contention). Assume signal X satisfies

∀t : t ∈ Cy(X, i) → W (i)t .

Then,

i ≠ j ∧ t ∈ Cy(X, i) → {t, t + 1} ∩ Cy(X, j) = ∅ .
Proof. By contradiction. For i ≠ j, first assume t ∈ Cy(X, i) ∩ Cy(X, j). By hypothesis we have W (i)t ∧ W (j)t . By Lemma 8.18 (warm unique) we conclude i = j.
Now let t ∈ Cy(X, i) ∧ t + 1 ∉ Cy(X, i). Assume t + 1 ∈ Cy(X, j). The
only way for the automaton i to leave the warm phase is to go from w to idle.
Hence, we have w(i)t . By Lemmas 8.16, 8.17, and 8.15 (request at global, grant
at warm, grant stable) this implies grant[i]t+1 and gives a contradiction by
Lemmas 8.17 and 8.14 (grant at warm, grant unique).
Showing absence of glitches for signals of the form mmreq(i), mmw(i),
bdout(i), and badout (i) involves two statements: one for enabled and one for
disabled drivers.
Lemma 8.26 (no glitches, enabled). Let X ∈ {mmreq, mmw, bdout, badout}
and let t, t + 1 be consecutive cycles in Cy(X, i). Then in the first of these
cycles registers X and Xoe are not clocked.
The only exception is badoutce(i), which might be clocked when flush(i)t ∧
m0(i)t+1 holds.
Proof. For each of the signals X concerned, this lemma follows directly from
Lemmas 8.28, 8.29, 8.30, and 8.31 characterizing Cy(X, i) and (A).
A glitch on the output enable signal Xoe of a disabled driver might propagate
to the output of the driver and thus on the bus.
Lemma 8.27 (no glitches, disabled). Let X ∈ {mmreq, mmw, bdout, badout}
and let t, t + 1 be consecutive cycles not in Cy(X, i). Then the output enable
signal Xoe is not redundantly cleared:
t ∉ Cy(X, i) ∧ t + 1 ∉ Cy(X, i) → ¬Xoeclr(i)t .
Proof. For each signal concerned, the proof follows again directly from Lemmas
8.28, 8.29, 8.30, and 8.31 and automata construction.
The next few lemmas characterize the sets Cy(X, i).
Lemma 8.28 (mmw). We write to the main memory only in state flush:
t ∈ Cy(mmw, i) ↔ flush(i)t .
Proof. The direction
t ∈ Cy(mmw, i) → flush(i)t
follows by automata construction. For the direction
flush(i)t → t ∈ Cy(mmw, i)
we conclude from
flush(i)t ∧ ¬mmwoe(i)t+1
by automata construction
mmwoeclr(i)t ∧ m0(i)t+1 .
The proofs of all other lemmas characterizing sets Cy(X, i) follow very similar
patterns and we, therefore, just formulate the lemmas without proofs. For
many of the following lemmas it is convenient to define for each cycle t and
cache i the cycle ez(t, i) when the master automaton did the most recent
change of states, i.e., the cycle when the master left the previous state before
entering the current state z(i)t :
z(i)t ≠ idle → ez(t, i) = max{t' | t' < t ∧ z(i)t' ≠ z(i)t } .
Lemma 8.30 (badout ). The bus address always comes from the master dur-
ing the entire warm phase:
t ∈ Cy(badout , i) ↔ W (i)t .
Observe that the output enable signal badoutoe for this signal stays constantly
1 during the entire warm phase, while the content of the address register badout
changes after flush. The last signal bdout treated here can be activated both
by masters and by slaves.
by masters and by slaves.
Lemma 8.31 (bdout). Signal bdout(i) is put on the bus by the master in
state mdata in case of a global write access, by the slave in state sdata if it
intervenes after a miss, or by a master in case of a flush access:
We see that, with the exception of X = bdout, all signals satisfy the hypothesis
of Lemma 8.25 (no contention), thus we can summarize in
Lemma 8.32 (no contention 2).
For the bus address data b.ad, we have by (HW), (A), and Lemmas 8.30 and
8.32 (badout , no contention 2):
badin(j)t+1 = b.adt = badout(i)t .
Lemma 8.35 (m1). During m1 the affected slaves load their answer sprotout
with the output of circuit C2. The content of registers X ∈ {badout, mprotout}
of the master and registers Y ∈ {badin, mprotin} of the slaves is unchanged.
Let m1(i)t . Then for all j, s.t., s1(j)t ∧ ¬(8)(j)t :
X(i)t+1 = X(i)t
Y (j)t+1 = Y (j)t
sprotout(j)t+1 = C2(soutb(j)t , mprotin(j)t )
= C2(aca(j).st (badin (j)t ), mprotin(j)t ) .
Proof. Analogous to the proof of Lemma 8.33 (before m0).
Lemma 8.36 (m2). During m2 the protocol answer of the slaves is broadcast.
The content of registers X ∈ {badout, mprotout} of the master and registers
Y ∈ {badin, mprotin, sprotout} of the slaves is unchanged. Let m2(i)t . Then,
X(i)t+1 = X(i)t
Y (j)t+1 = Y (j)t
sprotin(i)t+1 = ⋁j sprotout(j)t .
sidle(j)t−2 ∧ s1(j)t−1 .
sprotout(i).dit = 0.
By Lemma 8.22 (sync), if slave j does not have a hit (i.e., ¬bhit(j)t−1 ), then
it goes to state sidle (j)t and by Lemma 8.19 (silent slaves) we have for all
such j:
sprotout(j).dit = 0.
If slave j does have a hit, then we have s1(j)t−1 ∧¬(8)(j)t−1 . From the protocol
and its correct implementation in circuit C2 we conclude using Lemma 8.35
(m1):
flush(i)q+1 ∨ m0(i)q+1 .
mprotout(i).bcq−2 ∧ sprotout(j).diq−2 .
are not possible by Lemmas 8.21 and 8.22 (idle slaves, sync). Finally, the
case sdata(j)q is ruled out by Lemma 8.23 (di unique).
Now that we know that the tristate drivers are properly controlled, it is very
easy to state the effect of data transferred via the buses.
Lemma 8.40 (f lush transfer). Assume SINV (t) and consider a maximal
time interval [s : t] when master i is in state flush:
Proof. By (A) and (HW) we have for the start cycle s of the time interval:
Let X ∈ {mmreq, mmw, badout , bdout}. Then we have by Lemma 8.26 (no
glitches, enabled):
∀q ∈ [s : t − 1] : X(i)q = X(i)q+1 .
By Lemmas 8.32 (no contention 2), 8.39 (bdout contention) and Lemmas 8.28,
8.29, 8.30, 8.31 characterising Cy(X, i), we get for the bus component b.X:
By Lemmas 8.26 (no glitches, enabled) and 8.27 (no glitches, disabled) we
can conclude that the rules for operations with the main memory defined in
Sect. 3.5.7 are fulfilled. Hence, by Lemma 3.10 we have absence of glitches in
the main memory inputs. The lemma follows now from the specification of
the main memory16 .
mprotout(i).bct−1 ∧ mdata(i)t .
bdoutoe(i)t .
b.dt = bdout(i)t .
16 Additionally, one has to argue here that the memory write is not performed
to an address in ROM, i.e., that badout(i)s [28 : r] ≠ 029−r . Intuitively this is
true, because the software condition introduced in Sect. 8.4.1 guarantees that
processors never issue write or CAS requests to such addresses. Hence, a cache
line with the line address less than or equal to 029−r 1r cannot be in states M or O.
Formally, one can prove this by maintaining a simple invariant for such addresses
and we leave that as an easy exercise for the reader.
mdata(i)t ∧ sprotout(j).dit−1 .
bdin(i)t+1 = bdout(j)t .
Lemma 8.44 (mdata miss no intervention). Assume SINV (t) and con-
sider a maximal time interval [s : t] when the master is in state mdata:
¬mprotout(i).bcs−1 ∧ ¬sprotin(i).dis−1 .
Proof. This lemma is proven along the lines of Lemma 8.40 (f lush transfer).
Note that from (A) and stability of processor inputs it follows that
• An access (i, k) is a local read if it ends in state idle:
idle(i)e(i,k) .
• An access (i, k) is a delayed local read if it ends in state wait and is not a
flush access:
wait(i)e(i,k) ∧ ¬flushend (i, e(i, k)) .
• An access (i, k) is a local write if it ends in state localw:
localw(i)e(i,k) .
• An access (i, k) is a global access if it ends in state w:
w(i)e(i,k) .
• An access (i, k) is a flush if the condition for the end of a flush access is
satisfied:
flushend(i, e(i, k)) .
A local access is either a local read, a local write, or a delayed local read. The
correspondence between the three introduced classifications of accesses is shown
in Table 13.
Start cycles s(i, k) of accesses are defined in the following way. Local reads
start and end in the same cycle. Delayed local reads also start and end in the
same cycle. Local writes start 1 cycle before they end. Global accesses start
in the cycle when their hot phase begins. Flushes ending in state flush start
when the master enters state flush. Flushes ending in state wait start in the
same cycle.
With the help of the end cycles e(i, k) alone we define the parameters of
acc(i, k) of the sequential computation. We start with flush accesses ending
in state flush, i.e., with the case
flush(i)e(i,k) .
The address comes from badout at the end of the access. The rest is obvious:
For flush accesses ending in state wait, i.e., for the case
the tag of the address is taken from the tag RAM in cycle e(i, k) while the
cache line address is copied from the processor input. Let pa = pa(i)e(i,k) ,
then:
For all other accesses, we construct acc(i, k) from the processor input at the
end of the access t = e(i, k) (note that the processor inputs don’t change
during the access):
For accesses acc(i, k) which are not flushes, we also define the last cycle d(i, k)
before or during (in case of a local operation) the access, when master i was
making a decision to go either global or local. For all CAS accesses which end
in stage w (i.e., global CAS accesses) and for those CAS accesses which end
in stage wait (i.e., delayed local reads), this will be the last cycle of wait. For
all other accesses, this is the last cycle when master automaton i was in state
idle:
d(i, k) = max{q | q ≤ s(i, k) ∧ wait(i)q } if acc(i, k).cas ∧ (w(i)e(i,k) ∨ wait(i)e(i,k) ), and
d(i, k) = max{q | q ≤ s(i, k) ∧ idle(i)q } otherwise.
Note that for global CAS accesses and for delayed local reads we actually
perform the test twice: the first time when we leave state idle and the second
time when we leave state wait. For these accesses we take into consideration only
the results of the second case. However, we also partially depend on the results
of the first test, because only the first test guarantees us that we do not have
a positive CAS hit in an exclusive state. In the proof of Lemma 8.46 we show
that if a positive exclusive CAS hit was not signalled at the time of the first
test, then it would also not be signalled at the time of the second test.
Further, we aim at lemmas stating that the conditions for local and global
accesses are stable during an access: if we made the decision based on
the cache content later during the access, we would get the same result.
We now show some crucial lemmas for these accesses. For the predicates
defined in Sect. 8.3.4, we use the following shorthands:
Lemma 8.46 (global end cycle). Let acc(i, k) be an access. Then the global
test is successful in cycle d(i, k) iff the access ends in stage w:
Proof. We first prove the direction from left to right. We show w(i)e(i,k) by
contradiction. Let
¬flush(i)e(i,k) .
global(i, k, acad(i,k)) .
This implies by (A) that condition (3) for the master automaton did not
hold in cycle d(i, k), which contradicts global(i, k, acad(i,k)).
For the direction from right to left, we have ¬acc(i, k).f by definition of e(i, k).
By automata construction we find cycles t, t , and t , such that
which implies
¬phit(i)t ∨ test(i)t .
In case ¬phit(i)t holds we obviously have global(i, k, acat ). For case
phit(i)t ∧ test(i)t , we still have to show aca(i).st (a) ∈ {O, S}. Let this
be not true, i.e.,
aca(i).st (a) ∈ {E, M } .
During cycles q ∈ [t : t ] the master automaton was in state wait or in
state idle. The tag RAM of cache i during these cycles is not updated. If
in any such cycle the state RAM of cache i gets updated with address a.c
(this can only happen at cycle q if sw(i)q holds), then we can conclude by
(HW) and the construction of circuit C2:
Hence, the only possible case for line a to be in one of the exclusive states
in t is to be in the same state in cycle t :
aca(i).st (a) ∈ {E, M } .
But this by (A) and (HW) contradicts the fact that we moved from state
idle to state wait in cycle t .
• ¬acc(i, k).cas. By definition we have t = d(i, k). Since we have moved
to state wait in this cycle, we have (3)(i)d(i,k) and the lemma follows by
(HW) and (A).
In the very same way we show a lemma for the local accesses.
Lemma 8.47 (local end cycle). Let acc(i, k) be an access. Then
1. the test for a local read is successful in cycle d(i, k) iff the access ends in
stage idle or in stage wait:
2. the test for a local write is successful in cycle d(i, k) iff the access ends in
stage localw:
Proof. We first show the direction from right to left for both statements. By
definition of e(i, k) we have
preq(i)e(i,k) ∧ ¬mbusy(i)e(i,k) .
(9)(i)e(i,k) ,
(2)(i)d(i,k) .
∀j : ¬H(j)t .
Hence, by Lemma 8.21 (idle slaves) we know that all slaves are idle (including
slave i):
∀j : sidle(j)t .
By (A) we get Xwb(i)t = 0 and the lemma holds.
Lemma 8.49 (stable master). Assume acc(i, k).f ∨ global(i, k, acad(i,k)),
i.e., acc(i, k) is a flush or a global access. Then, during the entire access,
abstract cache i does not change:
Proof. For X ∈ {data, tag, s} the master automaton activates write signals
Xwa only in cycle q = e(i, k). These writes update the cache only after the
end of the access.
If access acc(i, k) is a flush, then for any cycle q under consideration we
have
wait(i)q ∧ grant[i]q ∨ W (i)q
and conclude the statement by Lemmas 8.17 and 8.48 (grant at warm, slave
write at hot). If access acc(i, k) is a global read or write, then by Lemma 8.22
(sync) for the slave automaton of cache i we have sidle(i)q . In this state the
slave automaton does not activate a write signal Xwb.
The following lemma states that in the last cycle of wait the RAMs of a waiting
cache are not updated, unless a flush access is performed on a transition from
wait to m0.
aca(i)q = aca(i)q+1 .
aca(i).sq (pa(i)q ) = I.
Since we do not update the tag RAM, the line stays invalid in cycle q + 1.
Another lemma argues that in the last cycle of f lush the state of the abstract
cache line addressed by pa is not changed. As a result, the output of circuit
C1 is the same as if it was computed one cycle later.
Lemma 8.51 (last cycle of flush). Let flush(i)t ∧ ¬flush(i)t+1 . Then
Proof. The idea behind the proof is simple: when we made a decision to go to
state flush, the cache line addressed by pa had a different tag. Hence, in the
abstract cache this line was invalid. Since we do not write the tag RAM in
the cycles under consideration, the line stays invalid when we leave the state
flush.
Formally, let q be the last cycle when we were in state wait before the
flush:
q = max{t' | wait(i)t' ∧ t' < t} .
Since in cycle q we made a decision to go to state flush, it holds that
Lemma 8.48 (slave write at hot) guarantees that in between cycles q and t we
don't activate signal tagwb(i). Since we also don't write the tag RAM from
the master side in states wait and flush, and the processor inputs are stable,
we get
pa(i)q .t = pa(i)t .t
= pa(i)t+1 .t
ca(i).tag q (pa(i)q .c) = ca(i).tag t (pa(i)t .c)
= ca(i).tag t+1 (pa(i)t+1 .c) .
Hence, we conclude
Proof. Access acc(r, s) cannot be on the same master by Lemma 8.45 (local
order) and it cannot be global by Lemma 8.18 (warm unique). Thus, it is
a local access. It cannot be a delayed local read because that would imply
grant[r]s(r,s) which gives a contradiction by Lemmas 8.17 and 8.14 (grant at
warm, grant unique). Hence, the decision to go local is made in stage idle:
acc(r, s) cannot start later than s(i, k) because by Lemma 8.34 (m0) in cycle
s(i, k) + 1 slave r has already clocked in address a:
badin(r)s(i,k)+1 = a .
This gives a snoop conflict for address a in cache r in cycle s(i, k) + 1 and the
access acc(r, s) cannot start.
Lemma 8.53 (stable local). Let acc(i, k) be a local access. Then during the
access abstract cache i does not change:
Proof. By Lemma 8.47 (local end cycle), by the definition of s(i, k), and by
(A) we have
Lemma 8.54 (overlapping accesses with flush). Assume SINV (t). Let
acc(i, k) be a flush with address a = acc(i, k).a ending in cycle e(i, k) = t.
Let acc(r, s) be any access to address a except a local read. Then the time
intervals of the two accesses are disjoint. Thus, only local reads can overlap
with flushes:
e(r, s) = s(r, s) + 1 .
Hence, access acc(r, s) has started in cycle q or in cycle q − 1 with a cache hit
in an exclusive state:
On the other hand, flush accesses like acc(i, k) are preceded by a cache hit in
state wait at the eviction address in cycle s(i, k) − 1 or s(i, k). We split cases:
By (A) and by Lemmas 8.49, 8.50 (stable master, last cycle of wait) we
get
∀u ∈ [s(i, k) − 1 : e(i, k)] : aca(i).su (a) = I .
For u = q this contradicts the state invariants.
• wait(i)s(i,k) . This implies s(i, k) = e(i, k) = q and
aca(i).sq (a) = I .
Lemma 8.55 (overlapping accesses with local write). Assume SINV (t).
Let acc(i, k) be a local write access with address a = acc(i, k).a ending in cycle
e(i, k) = t. Let acc(r, s) be another local access with address a. Then it cannot
overlap with acc(i, k):
and
d(r, s) = s(r, s).
By (A) and by Lemmas 8.53, 8.50 (stable local, last cycle of wait) we conclude
wlocal(i, k, acad(i,k) ).
Fig. 142. Possible overlaps between accesses to the same cache address a (the rows of the figure show global, flush, delayed local, local write, and local read accesses over time)
Thus,
The last three lemmas can be summarized as follows. The only possible over-
laps between accesses to the same cache address a are (see Fig. 142):
1. a flush with local reads,
2. a global access with local reads or local writes; in this case a local access
ends at most 1 cycle after the start of the global access,
3. a local read with other local reads and with delayed local reads.
If we are interested in accesses to the same address a and ending at the same
cycle t we are left only with the first and the third cases. Formally, let
#E(a, t) ≤ 1
∨ (∀(i, k) ∈ E(a, t) : idle(i)e(i,k) ∨ (wait(i)e(i,k) ∧ ¬acc(i, k).g))
∨ (∃(i, k) ∈ E(a, t) : acc(i, k).f ∧ ∀(r, s) ∈ E(a, t) :
(r, s) = (i, k) → idle(r)e(r,s) ) .
Proof. Trivial by using Lemmas 8.47, 8.46 (local end cycle, global end cycle)
and overlapping lemmas.
Let predicate P (i, k, a, t) be true if access (i, k) ends in cycle t, accesses address
a, and is not a local read or a delayed local read:
We can now identify all cache lines that get modified in cycle t.
Lemma 8.57 (unchanged cache lines). Let X ∈ {s, data}, then
Ports b of state and data RAMs are updated only when the slave automaton
of cache i is in state sw(i)t . Port b of the tag RAM is never written.
A write to a state or a data RAM can modify at most one line in the
abstract cache. A write to a tag RAM can modify at most 2 lines. We do the
proof by considering all possible updates to cache RAMs.
• w(i)t . In this state all the RAMs are updated via ports a with the address
pa(i)t . In the abstract cache i this update can possibly modify the lines
global(i, k, acad(i,k) )
and conclude
P (i, k, acc(i, k).a, t) .
Hence, for line address a = acc(i, k).a we are done. The line b can possibly
be affected by the update only if
i.e., if we are overwriting the tag with the new value. In that case the cache
line addressed by b becomes invalid:
aca(i).st+1 (b) = I .
Our goal is to show that (in this case) this line was already invalid in cycle
t and, hence, no change to the memory slice b in cache i has occurred. By
Lemma 8.49 (stable master) we have
aca(i)t = aca(i)s(i,k) .
In cycle s(i, k), the automaton of cache i was in state m0. Hence, in the
previous cycle we had:
The tag RAMs are not updated on both of these transitions, as well as in
the time interval [s(i, k) : t − 1]. Hence,
Thus, if the tags did not match in cycle t, then they also did not match in
cycle s(i, k) − 1:
ca(i).tag s(i,k)−1 (a.c) = a.t .
Hence, independently of whether the automaton of cache i was in state
wait or in state f lush in cycle s(i, k), we get by (A) and (HW):
ca(i).ss(i,k) (a.c) = I .
By Lemmas 8.22, 8.46 (sync, global end cycle) there exists cache j and
access (j, k), s.t., w(j)t and
P (j, k, pa(j)t , t) .
a = pa(j)t
and we are done. For slave i we get by the data transfer lemmas:
badin(i)t = pa(j)t .
By Lemma 8.22 (sync), we know that there was a bus hit on slave i in
cycle s(j, k) + 1:
bhit(i)s(j,k)+1 .
Applying data transfer lemmas again this gives us
Since local write accesses never overwrite tags (global and flush accesses
in the interval [s(j, k) : t] can not occur at cache j by Lemmas 8.17 and
8.14 (grant at warm, grant unique)), we conclude
• localw(i)t . In this state ports a of data and state RAMs are updated with
the address pa(i)t . In the abstract cache i this update can only modify the
line
a = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .
By Lemma 8.47 (local end cycle) there must be an access acc(i, k) ending
in cycle t:
P (i, k, pa(i)t , t) .
Hence, we only have to show that
a = pa(i)t .
This is very easy. By Lemma 8.53 (stable local) and by (A) we get
aca(i).st (pa(i)t ) = I .
Hence,
ca(i).tag t (pa(i)t .c) = pa(i)t .t .
• flush(i)t ∧ m0(i)t+1 . In this case port a of the state RAM is updated with
the address pa(i)t . In the abstract cache i this update can only modify the
line
a = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .
By definition, there is a flush access acc(i, k) ending in cycle t:
P (i, k, badout(i)t , t) .
badout(i)t .c = pa(i)t .c
badout(i)t .t = ca(i).tag t (pa(i)t .c) .
Hence,
a = badout(i)t .
• wait(i)t ∧ m0(i)t+1 ∧ swa(i)t . Port a of the state RAM is updated with
the address pa(i)t .c. In the abstract cache i this update can only modify
the line
a = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .
By definition, there is a flush access (i, k) to address a ending in cycle t
P (i, k, a, t) .
The next two lemmas make sure that (i) the state of a slave does not change
after it computes its response and until the global access ends and (ii) that
the slave decision to participate or not participate in the transaction stays
stable during the entire access.
Lemma 8.58 (stable slaves). Assume SINV (t). Let acc(i, k) be an access
to address a = acc(i, k).a ending in cycle t = e(i, k) in state w(i)t . Let X ∈
{s, data} and q ∈ [s(i, k) + 2 : t] then
Proof. By Lemmas 8.52 and 8.54 (overlapping accesses with global, overlap-
ping accesses with flush) we know that no local write or flush access to address
a can end in slave j in cycles q ∈ [s(i, k) + 2 : t]. Hence, we conclude the proof
by Lemma 8.57 (unchanged cache lines).
Lemma 8.59 (stable slave decision). Assume SINV (t). Let acc(i, k) be
an access to address a = acc(i, k).a ending in cycle t = e(i, k) in state w(i)t .
Then
∀j ≠ i : bhit(j)s(i,k)+1 = bhit(j)t .
Proof. By Lemma 8.30 (badout), we know that the address is put on the bus
by the master during the entire hot phase. By (A), we know that the register
badout(i) is not clocked during the hot phase. Hence, for all q ∈ [s(i, k) : t]
b.adq = a .
By Lemma 8.22 (sync) for all slaves j ≠ i we have
s1(j)s(i,k)+1 .
By Lemma 8.58 (stable slaves) we get for cycle q ∈ [s(i, k) + 2 : t] and RAMs
X ∈ {s, data}:
aca(j).X q (a) = aca(j).X t (a) .
Now, we consider cases on whether there is a bus hit in state s(i, k) + 1 or
not:
• ¬bhit(j)s(i,k)+1 . Hence,
aca(j).ss(i,k)+1 (a) = I .
• bhit(j)s(i,k)+1 . Hence,
aca(j).ss(i,k)+1 (a) ≠ I .
Now we can state the crucial lemmas that guarantee stability of decisions to
go for a global or a local transaction.
Lemma 8.60 (stable local decision). Let access acc(i, k) end in cycle t =
e(i, k). Let in cycle d(i, k) the decision for a local read or a local write hold.
Then the same decision holds in cycle t:
Proof. Expand the definition of rlocal or wlocal and apply Lemma 8.53 (stable
local).
Lemma 8.61 (stable global decision). Assume SINV (t). Let acc(i, k) be
a global access, i.e., in cycle d(i, k) we have global(i, k, acad(i,k)). Then we
could have reached the same decision in cycle t = e(i, k):
Proof. Assume global(i, k, acad(i,k)) and let a = acc(i, k).a. We expand the
definition of global and observe
aca(i)s(i,k) = aca(i)e(i,k) .
global(i, k, acas(i,k) ) .
By Lemma 8.46 (global end cycle) and the definition of s(i, k) , we have
m0(i)s(i,k) ∧ w(i)e(i,k) .
aca(i).ss(i,k) (a) = I ,
The state RAM of cache i is updated only if there is another global access
ending in q and if cache i is participating in that access, i.e., if sw(i)q holds.
We split cases on the state of line a in cycle d(i, k):
• ca(i).sd(i,k) (a.c) = I. Hence, line a is invalid in aca(i). We show by con-
tradiction that this line stays invalid until s(i, k). Let q be the first cycle
when line a gets updated:
Then by (A) there exists cache j with w(j)q and there exists an access
(j, k') ending at cycle q. Applying data transfer lemmas and Lemma 8.30
(badout) we get
acc(j, k').c = pa(j)q .c = a.c = b.adq .c = b.ads(j,k')+1 .c .
By (A), slave i had a bus hit in state s(j, k') + 1. Hence, by Lemma 8.59
(stable slave decision) it also has a bus hit in state q. This implies
ca(i).sq (a.c) = I
aca(i).ss(i,k) (a) = I .
• ca(i).sd(i,k) (a.c) ∈ {S, O}. In this case, by (HW) and by the construction
of circuit C2, any update to the state RAM can only change the state of
the line to S, O, or I. Hence, we can conclude for this case
global(i, k, acas(i,k) ) .
If acc(i, k) is a CAS access and we are not in flush in cycle s(i, k) − 1, then
we have
wait(i)d(i,k) ∧ s(i, k) = d(i, k) + 1 .
aca(i)d(i,k) = aca(i)s(i,k) ,
which implies
global(i, k, acas(i,k) ) .
For the implication from left to right, we apply Lemma 8.60 (stable local
decision) and get the proof.
For the other direction, we prove by contradiction. Let
Then we have
By Lemmas 8.60 and 8.61 (stable local decision, stable global decision) we get
2. If access (i, k) ends in t and P (i, k, a, t) does not hold, then in the atomic
protocol access acc(i, k) applied to port i does not change slice a of mem-
ory system ms(ht ):
3. At most one access ending in cycle t can change slice a both in the hard-
ware computation and in the atomic protocol:
4. If P (i, k, a, t) holds and access (i, k) is not a global access, then the content
of abstract cache j ≠ i for address a is not changed at cycle t. Let X ∈
{data, s}, then
5. If P (i, k, a, t) holds and access (i, k) does not end in state f lush, then the
content of the main memory for address a is not changed at cycle t:
flush(i)t ∧ m0(i)t+1 .
P (i, k, badout(i)t , t) .
rlocal(i, k, acat ) .
acc(i, k).a = a.
By Lemma 8.57 (unchanged cache lines), a cache line can change in cycle
t only in two cases:
By part 3 of the lemma we are now proving, we conclude that there are
no other accesses to address a ending in cycle t:
∀r ≠ i : ∀k' : ¬P (r, k', a, t) .
∀k' : ¬P (j, k', a, t) .
Since (i, k) is the only access to address a ending in cycle t and it does
not end in state w, we conclude
flush(j)t ∧ m0(j)t+1 .
P (j, k', badout(j)t , t) .
By Lemma 8.40 (flush transfer) we conclude that the only address a that
is modified in the memory is a = badout(i)t , and get a contradiction by
part 3 of the lemma we are now proving.
We are now ready to establish a crucial simulation result between the se-
quential computation of the atomic protocol and the hardware computation.
Essentially it states that an access acc(i, k) of the hardware computation end-
ing in cycle t has the same effect as the same access acc(i, k) applied to port
i and memory system ms(ht ) of the atomic protocol.
Proof. The second statement is trivial for local reads and delayed local reads.
We will show the second statement for other accesses together with the first
statement.
By part 1 of Lemma 8.63 (unchanged memory slices) Π(ms(ht ), a) only
changes in cycles t + 1 following cycles t when a flush, a local write access, or
a global access with address a ends. Thus, for ¬∃(i, k) : P (i, k, a, t) there is
nothing left to show.
Next, we observe by part 3 of Lemma 8.63 (unchanged memory slices) that
in any cycle t there is at most one access acc(i, k) satisfying the conditions of
the predicate P (i, k, a, t) for any given address a. Thus, the statement of the
lemma is well defined. By definition of e(i, k) and automata construction we
have
localw(i)t ∨ w(i)t ∨ ((flush(i)t ∨ wait(i)t ) ∧ m0(i)t+1 ) .
Now we split cases on the kind of access to address a ending in cycle t:
• Access acc(i, k) ends in state localw. Hence, s(i, k)+ 1 = t and by Lemmas
8.47, 8.60 (local end cycle, stable local decision) we have
local(i, k, acat) .
aca(i)s(i,k) = aca(i)s(i,k)+1 .
In state localw we write to cache address pa(i).c via ports a of the data
and the state RAMs. The tag RAM is not updated. The writes via port a
always have precedence over the writes via port b. Hence, even if there was
a write to pa(i).c via port b of the data or the state RAM, it would not
have any effect (below, in the proof for the global accesses, we show that
simultaneous writes via ports a and b to the same address never occur).
Hence, by (A), (HW), and the definition of the one step protocol we can
conclude:
Π(ms(ht+1 ), a) = Π(δ1 (ms(ht ), acc(i, k), i), a) .
In case of a CAS access we also have by automata and hardware construc-
tion:
pdout(i)t = aca(i).datat (a) = pdout1(ms(ht ), acc(i, k), i) .
• Access acc(i, k) ends in state w. By Lemmas 8.46 and 8.61 (global end
cycle, stable global decision) we have
global(acat, acc(i, k), i) .
By part 5 of Lemma 8.63 (unchanged memory slices) we know that the
memory content for address a is unchanged:
mmt+1 (a) = mmt (a) .
Using Lemma 8.49 (stable master) we get
∀q ∈ [s(i, k) : t] : aca(i)q = aca(i)t .
Moreover, by Lemmas 8.50 and 8.51 (last cycle of wait, last cycle of flush)
we have
aca(i).ss(i,k)−1 (a) = aca(i).st (a) .
By Lemma 8.58 (stable slaves) we get for slaves j ≠ i, cycle q ∈ [s(i, k)+2 :
t], and RAMs X ∈ {s, data}:
aca(j).X q (a) = aca(j).X t (a) .
Using the protocol transfer lemmas and the stability of processor inputs
we get for all slaves j ≠ i:
mprotin(j)s(i,k)+1 = mprotout(i)s(i,k)
= C1(aca(i).s(a)s(i,k)−1 , ptype(i)s(i,k)−1 )
= C1(aca(i).s(a)t , ptype(i)t )
sprotout(j)s(i,k)+2 = C2(aca(j).ss(i,k)+2 (a), mprotin(j)s(i,k)+1 ).(ch, di)
= C2(aca(j).st (a), mprotout(i)t ).(ch, di)
sprotin(i)s(i,k)+3 = ⋁j sprotout(j)s(i,k)+2
= ⋁j C2(aca(j).st (a), mprotout(i)t ).(ch, di) .
By Lemma 8.22 (sync) all slaves in cycle t are either in state sidle (if they
do not participate in the transaction) or are in state sw (if they participate
in the transaction). For participating slaves we have by Lemmas 8.22 and
8.59 (sync, stable slave decision):
bhit(j)t .
Hence,
ca(j).tag t (a.c) = a.t .
For not participating slaves by the same arguments we get
¬bhit(j)t
and
aca(j).st (a) = I .
By part 3 of Lemma 8.63 (unchanged memory slices) we know that no
other global, flush or local write accesses to address a end in cycle t. If
some access to address b ≠ a, where b.c = a.c, on cache j ends in cycle t,
then this access cannot be a flush or a global access by Lemmas 8.17 and
8.14 (grant at warm, grant unique). If it is a local write, then by Lemmas
8.47 and 8.53 (local end cycle, stable local) and by (A) we get:
Hence, such an access can occur only on a not participating cache j and
for such cache it then holds:
aca(j).st+1 (a) = I .
now follows from the data transfer lemmas. If acc(i, k) is a read or a CAS
access, the statement
Then #E(t) is the number of accesses ending in cycle t, and the number NE (t)
of accesses that have ended before cycle t is defined by
NE (0) = 0
NE (t + 1) = NE (t) + #E(t) .
We number accesses acc(i, j) according to their end time and accesses with
the same end time arbitrarily. Thus, accesses ending before t get sequential
numbers [0 : NE (t) − 1] and accesses ending at t get numbers from set Q(t) =
[NE (t) : NE (t + 1) − 1]. Thus,
seq(E(0)) = [0 : NE (1) − 1]
seq(E(t)) = [NE (t) : NE (t + 1) − 1] .
If a flush access and one or more local reads to the same address end in cycle
t, we order the flush access last:
(i, k), (i', k') ∈ E(a, t) ∧ acc(i, k).f → seq(i', k') < seq(i, k) . (20)
is[seq(i, k)] = i .
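The numbering can be made concrete with a small sketch. The record fields and the helper name below are ours; we simply order accesses by end cycle and, within one cycle, put flush accesses last, which in particular satisfies (20).

```python
# A small sketch (assumptions: each hardware access is given as a record with
# its cache index i, local number k, end cycle e, address a, and flush flag f)
# of the sequential numbering: accesses are ordered by end cycle, accesses with
# the same end cycle essentially arbitrarily, except that flushes come last.

from dataclasses import dataclass

@dataclass(frozen=True)
class Acc:
    i: int      # cache / port index
    k: int      # local access number on that cache
    e: int      # end cycle e(i, k)
    a: int      # cache line address acc(i, k).a
    f: bool     # flush access?

def sequentialize(accesses: list[Acc]) -> dict[tuple[int, int], int]:
    """Assign sequence numbers seq(i, k) respecting end-cycle order and (20)."""
    # sort by end cycle; within one cycle, non-flush accesses come first
    ordered = sorted(accesses, key=lambda x: (x.e, x.f))
    return {(acc.i, acc.k): n for n, acc in enumerate(ordered)}

accs = [Acc(0, 3, e=7, a=0x40, f=True), Acc(1, 5, e=7, a=0x40, f=False)]
print(sequentialize(accs))   # the local read gets number 0, the flush number 1
```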
We can now relate the hardware computation with the computation of the
atomic protocol and show that the state invariants hold for the hardware
computation.
Lemma 8.65 (relating hardware with atomic protocol). The following
statements hold for cycle t and hardware configuration ht :
1. The first NE (t) sequential atomic accesses lead exactly to the same ab-
stract memory system configuration ms as the first t cycles of the hardware
computation:
SINV (t) .
3. The memory abstraction after the first t cycles equals the memory ab-
straction after NE (t) sequential atomic memory accesses:
aca(i).s0 (a) = I
nx = NE (t) + x − 1 .
Then we have
seq(ix , kx ) = nx .
Then,
acc(ix , kx ) = acc [nx ] and ix = is[nx ] .
We also define a sequence of memory system configurations msx by
ms0 = ms(ht )
x > 0 → msx = δ1 (msx−1 , acc [nx ], ix ) .
By part 2 of the induction hypothesis the state invariants hold for ms0 :
Using Lemma 8.9 we conclude by induction that the state invariants hold for
all memory systems msx under consideration:
∀x ∈ [1 : #E(t)] : sinv(msx ) .
∀y ≠ x : ¬P (iy , ky , a, t) .
Π(msx−1 , a) = Π(ms0 , a)
Π(ms#E(t) , a) = Π(msx , a) .
Π(ms#E(t) , a) = Π(msx , a)
= Π(δ1 (msx−1 , acc [nx ], ix ), a)
= Π(δ1 (ms0 , acc [nx ], ix ), a) .
Using the definition of acc [nx ] and part 1 of Lemma 8.64 (1 step) we conclude
Π(ms#E(t) , a) = Π(δ1 (ms(ht ), acc(ix , kx ), ix ), a) if ∃x : P (ix , kx , a, t), and
Π(ms#E(t) , a) = Π(ms(ht ), a) otherwise. In both cases this equals
Π(ms(ht+1 ), a) .
Hence,
ms#E(t) = ms(ht+1 ) .
SINV (t).
∀y ≠ x : ¬P (iy , ky , a, t) .
∀y < x : ¬P (iy , ky , a, t) .
Proof. Using Lemma 8.66, (21), Lemma 8.11, and recalling the definition
m(h) = m(ms(h)) of the hardware memory we get
8.5.11 Liveness
Having fairness of the bus arbiter (Lemma 8.13) and liveness of the main
memory (Sect. 3.5.6), the liveness proof of the shared memory construction
becomes trivial. Assuming stability of processor inputs defined in Sect. 8.4.1,
we can also guarantee that signal mbusy is off when there is no processor
request. This simple but important property of the memory system is very
helpful when we show liveness of the multi-core processor in Sect. 9.3.9. We
state the desired properties in the following lemma and leave the proof of the
lemma as an easy exercise for the reader.
Lemma 8.68 (liveness of shared memory).
1. preq(i)t → ∃t' ≥ t : ¬mbusy(i)t' ,
2. ¬preq(i)t → ¬mbusy(i)t .
Note that the mbusy signal from cache i is guaranteed to eventually go away
even if for some cache j ≠ i signal preq(j) always stays high. This is the case
because the master automaton of cache j, if it is in the hot phase, eventually
reaches state idle where it lowers its request to the arbiter and gives up own-
ership of the bus (Sect. 8.4.5). This property of the shared memory system
is very important for us, since in the next section we construct a processor
where the preq signal to the instruction cache can possibly stay high until the
mbusy signal for the data cache goes away. As a result, when proving liveness
of that construction (Lemma 9.18) we heavily rely on the fact that a memory
access to the data cache eventually ends (i.e., the mbusy signal from the data
cache goes away), even if the processor request to the instruction cache stays
high for the duration of the entire access to the data memory.
9
A Multi-core Processor
We finally are able to specify a multi-core MIPS machine, build it, and show
that it works. Clearly the plan is to take pipelined MIPS machines from
Chap. 7 and connect them to the shared memory system from Chap. 8. Before
we can do this, however, we have to address a small technical problem: the
pipelined machine was obtained by a transformation from a sequential refer-
ence implementation, and that machine does not have a compare-and-swap
operation. Thus, we have to add an introductory Sect. 9.1, where we augment
the sequential instruction set with a compare-and-swap instruction. This turns
out to be an instruction with 4 register addresses, where we accommodate the
fourth address in the sa field of an R-type instruction. In order to process
such instructions in a pipelined fashion we now also need in the sequential
reference implementation the ability to read three register operands and to
write one register operand in a single hardware cycle. Had we treated
interrupts here, we would have a special purpose register file as in [12], and
we could take the third read operand from there. Here, we simply add a third
read port to the general purpose register file using technology from Chap. 4.
In Sect. 9.2 we specify the ISA of multi-core MIPS and give a reference
implementation with sequential processors. Like the hardware, ISA and ref-
erence implementation can be made completely deterministic, but in order to
hide implementation details they are modelled for the user in a nondetermin-
istic way: processors execute instructions one at a time; there is a stepping
function s specifying for each time n the processor s(n) executing an instruc-
tion at step n. We later derive this function from the implementation, but
the user does not know it; thus, programs have to work for any such stepping
function. Specifying multi-core ISA turns out to be very easy: we split the
sequential MIPS configuration into i) memory and ii) processor (everything
else). A multi-core configuration has a single (shared) memory component and
multiple processor components. In step n an ordinary sequential MIPS step
is executed with processor s(n) and the shared memory.
In the multi-core reference implementation, we hook the sequential refer-
ence processors to a memory (which now has to be able to execute compare-
We start with extending the MIPS ISA from Sect. 6.2 with the compare-and-
swap operation. We define it as an R-type instruction with the function bits
being all ones:
cas(c) ≡ opc(c) = 06 ∧ fun(c) = 16 .
Recall that previously we defined the effective address of load and store in-
structions as
ea(c) = c.gpr(rs(c)) +32 sxt(imm(c)) .
In R-type instructions we do not have an immediate constant. So for the
effective memory address of a CAS instruction we simply take the value from
the GPR file addressed by the rs field:
cas(c) → d(c) = 4 .
The comparison data for a CAS operation is taken from a GPR specified by
field sa of the instruction:
1 An alternative would be to take this value from a dedicated SPR (special purpose
register file), but we do not consider SPRs in this book.
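A hedged, ISA-level sketch of the CAS semantics may help here. The text above fixes that the effective address comes from gpr[rs], the comparison data from gpr[sa], and that the access is word-sized; the use of gpr[rt] as the swap value and gpr[rd] as the destination of the loaded word is only an assumption made for this illustration, not the book's register assignment.

```python
# A hedged ISA-level sketch of the CAS instruction. Fixed by the text: the
# effective address is taken from gpr[rs], the comparison data from gpr[sa],
# and the access is word-sized (d(c) = 4). The choice of gpr[rt] as the swap
# value and gpr[rd] as the destination of the loaded word is an assumption
# made only for this illustration.

def cas_step(mem: dict[int, int], gpr: list[int], rs: int, rt: int, rd: int, sa: int):
    ea = gpr[rs] & ~0x3          # word-aligned effective address
    old = mem.get(ea, 0)         # the memory word is read in any case
    if old == gpr[sa]:           # CAS test against the comparison register
        mem[ea] = gpr[rt]        # swap only if the test succeeds
    gpr[rd] = old                # the loaded value is returned to the GPR file
    return mem, gpr
```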
Fig. 143. Schematic view of a simple MIPS machine supporting the CAS instruction
Fig. 144. Computing the data input of the GPR
Fig. 145. Effective address computation
For computing the memory byte write signals in case of CAS accesses, we have
to take into consideration the result of the CAS test. Hence, we first have to
read the data from the hardware memory and only then decide whether we
need to perform a write. This is possible, because the construction of a 2-port
multi-bank RAM-ROM from Sect. 4.3.3 allows reading and writing the same
address through port b in a single cycle2 .
We now split the computation of byte write signals into two parts. First,
the environment sh4s computes the byte write signals assuming that the result
of the CAS test has succeeded. The construction of the circuit stays the same
as in the previous designs shown in Fig. 110, but the initial smask signals are
now calculated as
smask(h)[3 : 0] = I(h)[27]2 I(h)[26]1 if s(h), 14 if cas(h), and 04 otherwise
= s(h) ∧ I(h)[27]2 I(h)[26]1 ∨ cas(h) ∧ 14 .
For the shifted version of the byte write signals, this gives us
Fig. 146. Implementation of circuit mask4cas
Circuit mask4cas shown in Fig. 146 first computes the signal castest, where
castest(h) ≡ D(h) = dmout(h)[63 : 32] if ea(h)[2] = 1, and D(h) = dmout(h)[31 : 0] if ea(h)[2] = 0,
and then uses this signal to mask all active byte write signals in case the CAS
test was not successful.
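The interplay of sh4s and mask4cas can be summarized in a small sketch (a simplification, not the gate-level circuit: the sub-word shifting of store masks by the low effective-address bits is omitted, and all function and parameter names are ours).

```python
# A compact sketch of the byte-write logic around mask4cas: smask is computed
# as if the CAS test succeeds, the mask is shifted to the accessed word of the
# 8-byte memory line, and mask4cas zeroes all byte-write signals when the CAS
# comparison fails. Simplification: sub-word shifting for sh/sb is omitted.

def byte_write_signals(is_store: bool, is_cas: bool, opc_bits: tuple[int, int],
                       ea_bit2: int, D: int, dmout: int) -> int:
    """Return the 8 byte-write signals bw[7:0] as a bit mask."""
    if is_store:
        b27, b26 = opc_bits                      # I(h)[27], I(h)[26] of the store
        smask = (b27 << 3) | (b27 << 2) | (b26 << 1) | 1
    elif is_cas:
        smask = 0b1111                           # CAS always writes a full word
    else:
        smask = 0
    bw = smask << (4 * ea_bit2)                  # select upper or lower word
    word = (dmout >> 32) if ea_bit2 else (dmout & 0xFFFFFFFF)
    castest = (D == word)                        # CAS comparison against memory
    if is_cas and not castest:
        bw = 0                                   # mask4cas kills the write
    return bw
```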
These are all the modifications one has to make to the sequential reference
hardware implementation to introduce support for the CAS instruction. Ad-
ditionally, we extend the software condition on disjoint code and data regions
to handle the CAS instructions:
ls(c) ∨ cas(c) → ea(c).l ∈ DR.
Correctness of the construction is stated in
Lemma 9.1 (MIPS with CAS correct). Let alignment hold and let code
and data regions be disjoint. Then
sim(c, h) → sim(c , h ) .
Proof. For the case when we don’t execute a CAS instruction, i.e., ¬cas(c),
we observe that the signals generated by the hardware are the same as in the
sequential MIPS machine from Chap. 6. Hence, we simply use Lemma 6.8 and
we are done.
If cas(c) holds, then we consider the cases when the CAS test succeeds and
when it fails. In both cases the proof is completely analogous to the proofs
from Chap. 6 and we omit it here.
Recall that MIPS configurations c have components c.pc, c.dpc, c.gpr, and
c.m. For the purpose of defining the programmer’s view of the multi-core
MIPS machine, we collect the first three components of c into a processor
configuration:
c.p = (c.p.pc, c.p.dpc, c.p.gpr) .
We denote by Kp the set of processor configurations. A MIPS configuration
now consists of a processor configuration and memory configuration:
c = (c.p, c.m) .
The next state function c' = δ(c) is split into a next processor component δp
and a next memory component δm :
c' = δ(c)
= (c'.p, c'.m)
= (δp (c.p, c.m), δm (c.p, c.m)) .
Note that this step function is unknown to the programmer; we will eventually
construct it from the hardware. Programs, thus, have to perform well for all
fair step functions.
Initially, we require
We now define the multi-core computation (mcn ) where mcn is the configu-
ration before step n:
mcn+1 .p(x) = δp (mcn .p(x), mcn .m) if x = s(n), and mcn+1 .p(x) = mcn .p(x) if x ≠ s(n),
mcn+1 .m = δm (mcn .p(s(n)), mcn .m) .
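A minimal executable sketch of this step relation follows; the function names are ours, and delta_p and delta_m stand for the next processor and next memory functions of the sequential ISA, passed in as parameters.

```python
# One step of the nondeterministic multi-core ISA computation: in step n only
# processor s(n) makes an ordinary sequential MIPS step against the shared
# memory; all other processor configurations are unchanged.

def mc_step(proc, mem, s_n, delta_p, delta_m):
    """proc is a list of processor configurations, mem the shared memory,
    s_n = s(n) the processor stepped in this step."""
    new_proc = list(proc)
    new_proc[s_n] = delta_p(proc[s_n], mem)      # only processor s(n) changes
    new_mem = delta_m(proc[s_n], mem)            # memory stepped with s(n)
    return new_proc, new_mem
```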
h' = δH (h) .
acc.a = ea(h).l
acc.f = 0
acc.r = l(h)
(acc.w, acc.cas) = (s(h), cas(h)) if ea(h).l[28 : r] ≠ 029−r , and (0, 0) otherwise
acc.data = dmin(h)
acc.cdata = D(h)
acc.bw = bw(h) .
3
Turning our construction into a real sequential implementation would require a
scheduler and a number of multiplexors connecting the shared memory to the
processors.
In case instruction I(h) is neither a load, a store, nor a CAS instruction, all
bits f , w, r, and cas of access dacc(h) are off and we have a void access. Recall
that a void access does not update memory and does not produce an answer.
In case a write or a CAS is performed to an address in the ROM region we
also have a void access4 .
In the same way we construct instruction fetch access acc = iacc(h) as
acc.a = ima(h)
acc.r = 1
acc.w = 0
acc.cas = 0
acc.f = 0 .
Proof. We get the first and the second statements simply by unfolding the
definitions and applying the semantics of the 2-port multi-bank RAM-ROM:
dmout(h) = h.m(ea(h).l)
= dataout(h.m, dacc(h))
imout(h) = h.m(ima(h))
= dataout(h.m, iacc(h)) .
For the third statement we have by the hardware construction and the se-
mantics of the 2-port multi-bank RAM-ROM:
h'.m(b) = modify(h.m(b), dmin(h), bw'(h)) if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ castest(h)), and h.m(b) otherwise
4 Note that this situation never occurs if the reference hardware computation is
simulated by an ISA computation and disjointness of data and code regions holds.
The only reason why we consider it possible here is because we want to specify
the multi-core reference hardware before we show that it is simulated by a multi-
core ISA. Hence, at that point we cannot yet assume that there are no writes to
the ROM portion of the hardware memory.
= modify(h.m(b), dmin(h), bw(h)) if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ castest(h)), and h.m(b) otherwise .
For the original byte write signals generated by sh4s environment, in case of
a CAS access we have
04 14 ea(h)[2] = 0
cas(h) → bw(h) =
14 04 ea(h)[2] = 1 ,
which gives us
cas(h) → bw(h)[0] = ¬ea(h)[2] .
For the result of the CAS test, we have by construction of circuit mask4cas:
D(h) = dmout(h)[63 : 32] ea(h)[2] = 1
castest(h) =
D(h) = dmout(h)[31 : 0] ea(h)[2] = 0
D(h) = h.m(ea(h).l)[63 : 32] ¬bw(h)[0]
=
D(h) = h.m(ea(h).l)[31 : 0] bw(h)[0]
= test(dacc(h), h.m(ea(h).l)) .
Hence,
h'.m(b) = modify(h.m(b), dmin(h), bw(h)) if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ castest(h)), and h.m(b) otherwise
= modify(h.m(b), dmin(h), bw(h)) if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ test(dacc(h), h.m(ea(h).l))), and h.m(b) otherwise
= δM (h.m, dacc(h)) .
h' = δH (h.p, h.m)
= (δhp (h.p, h.m), δhm (h.m, dacc(h)))
= (δhp (h.p, h.m), δM (h.m, dacc(h))) .
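The memory update derived above can be paraphrased by a small sketch. The access record, the line layout (8 bytes per line address), and the byte order inside a line are assumptions of this illustration only.

```python
# A small executable sketch of delta_M: writes and successful CAS accesses
# update the addressed memory line; reads and void accesses leave it unchanged.

from dataclasses import dataclass

@dataclass
class DAcc:
    a: int        # line address
    w: bool       # write access
    cas: bool     # compare-and-swap access
    bw: int       # 8 byte-write signals as a bit mask
    data: bytes   # 8 data bytes
    cdata: bytes  # 4 bytes of comparison data

def modify(line: bytes, data: bytes, bw: int) -> bytes:
    """Overwrite byte j of line with byte j of data whenever bw[j] = 1."""
    return bytes(data[j] if (bw >> j) & 1 else line[j] for j in range(8))

def cas_test(acc: DAcc, line: bytes) -> bool:
    """Compare the comparison data with the addressed word; the word is
    selected by bw[0] as in the castest derivation above (the byte order
    inside a line is an assumption of this sketch)."""
    word = line[0:4] if acc.bw & 1 else line[4:8]
    return acc.cdata == word

def delta_M(mem: dict[int, bytes], acc: DAcc) -> dict[int, bytes]:
    """Apply one data access to the line-addressed memory."""
    line = mem.get(acc.a, bytes(8))
    if acc.w or (acc.cas and cas_test(acc, line)):
        mem = dict(mem)
        mem[acc.a] = modify(line, acc.data, acc.bw)
    return mem
```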
For the case the reset signal is off, the computation (hn ) of the multi-core
reference implementation is simply defined by
and
hn+1 .p(q) = hn .p(q) for q ≠ s(n) .
We state an equivalent definition in the following lemma.
Lemma 9.4 (computation of the reference machine). Assume resetn =
0, then
h0 .p(q).dpc = 032
h0 .p(q).pc = 432 .
msim(mc, h) ≡
∀n : msim(mcn , hn ) .
and
and obtain
msim(mc0 , h0 ) .
For the induction step, we conclude for processor s(n) from the induction
hypothesis
By Lemma 9.1 , i.e., the correctness of the basic sequential hardware for one
step, we conclude
For processors q ≠ s(n) that are not stepped, we have by induction hypoth-
esis
Lemma 9.5 obviously implies that write or CAS accesses to the ROM portion
of h.m never occur. Hence, we have
cas(h) ≡ dacc(h).cas
s(h) ≡ dacc(h).w.
For processor IDs q and local step numbers i, we define the step numbers
pseq(q, i) when local step i is executed on processor q:
ic(q, 0) = 0
ic(q, n + 1) = ic(q, n) + 1 if s(n) = q, and ic(q, n + 1) = ic(q, n) otherwise .
∀m ∈ [0 : n − 1] : s(m) ≠ q .
Hence,
pseq(q, 0) = n .
In case i > 0, let
and
j0 < . . . < ji−1 .
A trivial induction shows
∀x ≤ i − 1 : jx = pseq(q, x) .
Because
∀m ∈ [ji−1 + 1 : n − 1] : s(m) ≠ q ,
we conclude
Hence, up to configuration hpseq(q,i) , processor q has already been stepped i
times. The next step to be executed in hpseq(q,i) is step number i of processor
q, which is the (i + 1)-st local step of this processor.
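Both definitions are easy to transcribe; the following sketch (names ours) computes ic and pseq directly from a step function given as a finite list.

```python
# A direct transcription, as a sketch, of the instruction counters ic(q, n) and
# the step numbers pseq(q, i), for a step function s given as a finite list.

def ic(q: int, n: int, s: list[int]) -> int:
    """Number of steps of processor q among the first n steps."""
    return sum(1 for m in range(n) if s[m] == q)

def pseq(q: int, i: int, s: list[int]) -> int:
    """Step number in which processor q executes its local step i."""
    return [n for n in range(len(s)) if s[n] == q][i]

s = [0, 1, 1, 0, 2, 0]
print(ic(0, 4, s), pseq(0, 2, s))   # 2 steps of processor 0 before step 4;
                                    # its third local step is global step 5
```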
For processor IDs q and step numbers i, we define the local hardware
configurations hq,i of processor q before local step i. We start with the mul-
tiprocessor hardware configuration hpseq(q,i) in which processor q makes step
i; then we construct a single processor configuration hq,i by taking the pro-
cessor component of the processor that is stepped, i.e., q, and the memory
component from the shared memory:
The following lemma asserts, for every q, that, as far as the processor com-
ponents are concerned, the local configurations hq,i behave as in an ordinary
single processor hardware computation; the shared memory of course can
change between steps i and i + 1 of the same processor.
Lemma 9.8 (local computations).
hq,0 .p = h0 .p(q)
hq,i+1 .p = δH (hq,i .p, hpseq(q,i) .m).p
Next we show a technical result relating the local computations with the
overall computation.
Lemma 9.9 (relating local and overall computations).
hn .p(q) = hq,ic(q,n) .p
For the induction step assume the lemma holds for n. Let i = ic(q, n). We
distinguish two cases:
• q = s(n). Then,
ic(q, n + 1) = i + 1 .
By induction hypothesis, Lemma 9.7, and Lemma 9.8 we get
• q = s(n). Then,
ic(q, n + 1) = ic(q, n) = i
and by induction hypothesis and Lemma 9.7 we get
We define the instruction fetch access iacc(q, i) and the data access dacc(q, i)
in local step i of processor q as
iacc(q, i) = iacc(hq,i )
dacc(q, i) = dacc(hq,i ) .
9.3.1 Notation
ica.X = X(2q)
dca.X = X(2q + 1) .
Registers D.2 and D.3 are used only for CAS instructions:
The forwarding engine is extended to forward the data into D.2 during the
instruction decode stage.
CAS instructions load data from the memory just as regular loads do. Hence,
we have to update the hazard signals:
hazA = A-used ∧ ⋁k∈[2,3] (topA [k] ∧ (con.k.l ∨ con.k.cas))
hazB = B-used ∧ ⋁k∈[2,3] (topB [k] ∧ (con.k.l ∨ con.k.cas)) .
The stall engine has to additionally generate a hazard signal when the D data
can not be forwarded:
haz2 = hazA ∨ hazB ∨ hazD
hazD = D-used ∧ ⋁k∈[2,3] (topD [k] ∧ (con.k.l ∨ con.k.cas)) .
Every MIPS processor in the multi-core system has an instruction cache and a
data cache. We connect the instruction cache ica = ca(2q) to MIPS processor
q in the following way:
ica.pa = imaπ
ica.pw = 0
ica.pr = 1
ica.pcas = 0
ica.preq = 1
Iinπ = ica.pdout
haz1 = ica.mbusy .
The data cache dca = ca(2q + 1) is connected to processor q in the following
way:
dca.pa = ea.3.lπ
dca.pw = con.3.sπ
dca.pr = con.3.lπ
dca.pcas = con.3.casπ
dca.bw = bw.3π
dca.pdin = dminπ
dca.pcdin = D.3π
dca.preq = f ull3 ∧ (con.3.sπ ∨ con.3.lπ ∨ con.3.casπ )
dmoutπ = dca.dout
haz4 = dca.mbusy .
Recall that the stall engine is defined as
stallk = f ullk−1 ∧ (hazk ∨ stallk+1 )
uek = f ullk−1 ∧ ¬stallk
f ullkt+1 = uetk ∨ stallk+1
t
.
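One cycle of this stall engine can be simulated with a few lines; in the sketch below stage 0 is assumed to be always full and stall for the stage below the last one is 0, which are assumptions of the illustration (the exact stage range is as in Chap. 7), not statements of the book.

```python
# One cycle of the stall engine: haz[k] and full[k] are the hazard and full
# signals of cycle t for stages 1..4; the function returns the stall and ue
# signals of cycle t and the full bits of cycle t + 1.

def stall_engine_step(full: dict[int, int], haz: dict[int, int]):
    stall, ue, full_next = {5: 0}, {}, {}           # stall_5 = 0 assumed
    for k in range(4, 0, -1):                       # evaluate from the bottom up
        full_km1 = 1 if k == 1 else full[k - 1]     # full_0 = 1 assumed
        stall[k] = full_km1 & (haz[k] | stall[k + 1])
        ue[k] = full_km1 & (1 - stall[k])
        full_next[k] = ue[k] | stall[k + 1]
    return stall, ue, full_next
```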
In stage 1 we always perform a memory access, but we clock the results of the
access to the registers only if we don’t have a stall2 signal coming from the
stage below. As a result, we might perform the same access to the instruction
cache several times, until we are actually able to update the register stage
below. In Sect. 9.3.9 we show that this kind of behaviour does not produce
any deadlocks.
• For instruction cache 2q, if the request signal preq(2q) and the memory
busy signal mbusy(2q) are both on, the inputs to the instruction cache
remain stable:
preq(2q)t ∧ mbusy(2q)t → preq(2q)t+1 ∧ imaπq,t+1 = imaπq,t .
The instruction address is taken either from the PC or from the DPC register
depending on whether stage 1 is full or not. Hence, we split cases on values
of full1q,t and full1q,t+1 :
• if ¬full1q,t we have
Lemma 9.12 (update enable implies access end). An active update en-
able signal denotes the end of a memory access:
1. For data cache 2q + 1, if the update enable signal of stage 4 is activated
and there is a processor request, then a non-flush access ends:
ue4q,t ∧ preq(2q + 1)t → ∃k : e(2q + 1, k) = t ∧ ¬acc(2q + 1, k).f .
2. For instruction cache 2q, if the update enable signal of stage 1 is activated,
then a read access ends:
ue1q,t → ∃k : e(2q, k) = t ∧ acc(2q, k).r .
= 1 .
Hence,
con.3.sπq,t ∨ con.3.lπq,t ∨ con.3.casπq,t .
Thus, the update is due to a memory access and does not come from an
instruction that does not require accessing the memory. Because ue4q,t holds,
we have
Thus, we have someend(2q + 1, t). The ending access cannot be a flush access,
because by the construction of the control automata of the caches the mbusy
signal stays active during flush accesses.
For instruction caches we have by hypothesis
Hence,
stall1q,t = 0 .
Because
stall1q,t = full0q,t ∧ (haz1q,t ∨ stall2q,t )
haz1q,t = mbusy(2q)t = 0
preq(2q)t = 1 .
Thus, we have someend(2q, t). We argue as above that the ending access is
not a flush access. Moreover, we know that write and CAS accesses do not
occur at instruction caches and conclude the proof.
We come to a subtle point which is crucial for the liveness of the system.
Lemma 9.13 (access end implies update enable). When a read, write,
or CAS access ends, the corresponding stage is updated, unless there is a stall
signal coming from the stage below:
1. For data cache 2q + 1, we have
Because
we conclude
full3q,t = 1 .
Because
we conclude
stall2q,t = 0
haz1q,t = mbusy(2q)t = 0
stall1q,t = full0 ∧ (haz1q,t ∨ stall2q,t ) = 0
ue1q,t = full0 ∧ ¬stall1q,t = 1 .
I(q, k, 0) = 0
I(q, 1, t + 1) = I(q, 1, t) + 1 if ue1q,t , and I(q, 1, t) otherwise
I(q, k, t + 1) = I(q, k − 1, t) if uekq,t , and I(q, k, t) otherwise .
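A direct transcription of these recurrences as a sketch (the stage range and the container types are ours):

```python
# Scheduling functions I(q, k, t) of one processor q, computed from per-cycle
# update-enable signals ue[t][k] given as a list of dicts indexed by stage.

def scheduling_functions(ue: list[dict[int, int]], stages=(1, 2, 3, 4)):
    I = {k: 0 for k in stages}                      # I(q, k, 0) = 0
    history = [dict(I)]
    for ue_t in ue:
        new_I = dict(I)
        for k in stages:
            if ue_t.get(k, 0):
                new_I[k] = I[k] + 1 if k == 1 else I[k - 1]
            # otherwise I(q, k, t + 1) = I(q, k, t)
        I = new_I
        history.append(dict(I))
    return history                                   # history[t] = I(q, ., t)
```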
For the scheduling functions of the multi-core system, we state the counterpart
of Lemma 7.14.
Lemma 9.14 (scheduling functions difference multi-core). Let k ≥ 2.
Then for all q:
I(q, k − 1, t) = I(q, k, t) + fullk−1q,t .
Proof. Completely analogous to the proof of Lemma 7.14.
PS (t) = {q | ue4q,t = 1} ,
NS (0) = 0
NS (t + 1) = NS (t) + #PS (t) .
Thus, in every cycle t we step #PS (t) processors. For every t we will define
the values s(m) of the step function s for m ∈ [NS (t) : NS (t + 1) − 1] such
that
s([NS (t) : NS (t + 1) − 1]) = PS (t) .
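One admissible construction of s is sketched below (ordering the processors of a cycle by ID; any order satisfying the displayed property would do, and the function name is ours).

```python
# A sketch of one admissible step function s: in every cycle t the processors
# of P_S(t) are stepped; within a cycle we simply order them by processor ID.

def build_step_function(PS: list[set[int]]) -> list[int]:
    """PS[t] is the set of processors with ue_4 = 1 in cycle t; the result maps
    step numbers [0 : N_S(T) - 1] to processor IDs."""
    s = []
    for stepped in PS:                 # cycles in increasing order
        s.extend(sorted(stepped))      # s([N_S(t) : N_S(t+1) - 1]) = PS[t]
    return s

print(build_step_function([{0}, set(), {1, 2}, {0, 2}]))   # [0, 1, 2, 0, 2]
```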
Any step function with this property would work, but we will later choose a
particular function which makes the proof (slightly) easier. For any function
with the above property the following easy lemma holds.
Lemma 9.15 (relating instruction count with scheduling functions).
For every processor q the scheduling function I(q, 4, t) of the pipelined machine
at time t counts the instructions completed ic(q, t) on the sequential reference
implementation:
ic(q, NS (t)) = I(q, 4, t) .
Proof. By induction on t. For t = 0 both sides of the equation are 0. For the
induction step we assume
I(q, 4, t + 1) = I(q, 3, t)
= I(q, 4, t) + 1.
• q ∉ PS (t). This implies
my = NS (t) + y − 1
qy = s(my ) .
z ≠ y → qz ≠ qy ,
Thus,
pseq(q1 , I(q1 , 4, t)) = m1 = NS (t)
and
pseq(q#PS (t) , I(q#PS (t) , 4, t)) = NS (t + 1) − 1 .
We define the linear data access sequence dacc by
By Lemma 8.6 we get the following relation for the hardware memory of the
reference machine.
Lemma 9.16 (hardware memory of the reference computation).
For the correctness result of the multi-core system, we assume as before align-
ment and the absence of self modifying code. Recall that for R ∈ reg(k) the
single-core system simulation (correctness) theorem had the form
Rπt = RσI(k,t) if vis(R), and Rπt = RσI(k,t)−1 if fullkt ∧ ¬vis(R) ∧ used(R, IσI(k,t)−1 ).
For the multi-core machine we aim at a theorem of the same kind. We have,
however, to couple it with an additional statement correlating the memory
abstraction m(htπ ) of the pipelined machine with the hardware memory hσ .m
of the sequential reference implementation. We correlate the memory m(htπ )
of the pipelined machine π with the memory of the sequential machine σ after
NS (t) sequential steps:
The main result of this book asserts the simulation of the pipelined multi-core
machine π by the sequential multi-core reference implementation σ.
s : [0 : NS (t) − 1] → [0 : P − 1] ,
such that
• for all stages k, registers R ∈ reg(k), and all processor IDs q, let
I(q, k, t) = i ,
then
Rπq,t = Rσq,i if vis(R), and Rπq,t = Rσq,i−1 if fullkt ∧ ¬vis(R) ∧ used(R, Iσq,i−1 ) ,
• a ∈ CR ∪ DR → m(htπ )(a) = hσNS (t) .m(a) .
A meaningful initial program can only be guaranteed if the initial code region
CR is realized in the main memory as a ROM. We choose the size of the ROM
in hσ .m(a) to be the same as the size of the read only region in hπ .mm.
Compared to the proof for a single pipelined processor, the proof of the
induction step changes only for the instruction and memory stages, i.e., for
k = 1 and k = 4, and only for the memory components and their outputs. In
what follows, we only present these parts of the proof. We first consider stage
k = 4 and consider processors q with ue4q,t = 1, resp. with q ∈ PS (t).
Using the formalism from Sect. 9.3.7, we have for y ∈ [1 : #PS (t)], step
numbers my of steps performed during cycle t, and processor IDs qy of pro-
cessors that perform these steps:
my = NS (t) + y − 1
qy = s(my ) .
I(qy , 3, t) = I(qy , 4, t) + 1
= iy + 1
= ic(qy , NS (t)) + 1 .
All registers R of register stage 3 are invisible. From the induction hypothesis
we get
We split cases on the value preq(2qy + 1)t of the memory request signal of the
data cache of processor qy . Recall that it is defined as
preq(2qy + 1)t = (con.3.lqy ,t ∨ con.3.sqy ,t ∨ con.3.casqy ,t ) ∧ full3qy ,t
= con.3.lqy ,t ∨ con.3.sqy ,t ∨ con.3.casqy ,t .
is a void access.
• If preq(2qy + 1)t = 1, we conclude with part 1 of Lemma 9.12 that there
exists a number ky , such that a data access acc(2qy + 1, ky ) ends in cycle
t. Because for the input registers R of the memory stage we have shown
a ∈ CR ∪ DR → m(hπt+1 )(a) = hσNS (t+1) .m(a) .
seq(E(0)) = [0 : NE (1) − 1]
seq(E(t)) = [NE (t) : NE (t + 1) − 1] ,
Let dacc [0 : v−1] be the subsequence of the data access sequence dacc [NS (t) :
NS (t + 1) − 1] consisting only of the write and CAS accesses. Because reads
and void accesses don’t change the memory abstraction we get
Lemmas 9.13 and 9.12 guarantee that, if a write or a CAS memory access
ends in a cache of processor q in cycle t, then it is an access to the data cache
By Lemma 8.64 (1 step), part 2 of Lemma 8.10, (23), and part 2 of Lemma
9.10 we get
pdoutπ (2qy + 1)t = pdout1(ms(htπ ), dacc[my ], 2qy + 1) (Lemma 8.64)
= m(htπ )(a) (Lemma 8.10)
= hσNS (t) .m(a)
= hσmy .m(a)
= hσpseq(qy ,i) .m(a) (23)
= dataout(hσpseq(qy ,i) .m, dacc(qy , i))
= dmoutσqy ,i . (Lemma 9.10)
a = imaπq,t = imaσq,i ∈ CR .
9.3.9 Liveness
In the liveness proof of the multi-core processor we argue that every stage
which is being stalled is eventually updated.
Proof. The order of stages for which we prove the statement of the lemma is
important. For liveness of the upper stages, we use liveness of the lower stages.
Recall that the signals in the stall engine are defined as
6 Note that, due to the fact that we keep the request signal to the instruction cache
active until the stall signal from the previous stage is removed, there may be
several accesses acc(2q, r1 ), acc(2q, r2 ), . . . corresponding to the access iacc(q, i).
Fortunately, all these accesses are reads which don't modify the state of the
memory abstraction. As a result, we don't care about the exact reconstruction of
the access sequence to the instruction cache and talk only about the existence of
an access which ends in the same cycle when we activate signal ue1 .
From Lemma 8.68 (liveness of shared memory) we know that the processor
request to the data cache must be active in cycle t if the mbusy signal is
high:
dca.mbusy^t → dca.preq^t .

Moreover, there exists a cycle t' > t such that

¬dca.mbusy^{t'} ∧ ∀t'' ∈ [t : t') : dca.mbusy^{t''}

holds. From Lemma 9.11 (stable inputs of accesses) we get that the registers of stage 3 are not updated:

∀t'' ∈ [t : t') : ¬ue_3^{t''} ,

which implies

∀t'' ∈ [t : t'] : dca.preq^{t''} .

Hence, there is a data access to the data cache ending in cycle t', which implies ue_4^{t'} by Lemma 9.13.
• For k = 3 we have
Using liveness of stage 4, we find the smallest cycle t' > t such that ¬stall_4^{t'} holds and get

full_3^{t'+1} = (ue_3^{t'} ∨ stall_4^{t'}) = 0 .

Hence, at cycle t' + 1 both stages 2 and 3 are not full. This implies

¬haz_2^{t'+1} ∧ ¬stall_3^{t'+1} .

Thus, we have ¬stall_2^{t'+1} and get a contradiction.
The case when we have haz_2^t ∧ topX[3]^t is proven in the same way using liveness of stage 4. For the last case we have

stall_3^t ∧ ¬haz_2^t ,

haz_1 = ica.mbusy .

haz_1^t ∧ ¬stall_2^t .

In the proof of the first case we have already considered the same situation for cycle t'.
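To illustrate the shape of this bottom-up argument, the following C sketch simulates a generic stall engine with placeholder hazard inputs. It is not the stall engine of this book; the only point it illustrates is that, once the hazards of the lower stages are removed, the stall signals of the upper stages disappear and every previously stalled stage is updated.

#include <stdbool.h>
#include <stdio.h>

#define K      5    /* number of stages (illustrative)   */
#define CYCLES 20   /* finite prefix of the computation  */

int main(void) {
    /* placeholder hazard pattern: the hazard of stage 4 (e.g. a busy
       data cache) is removed after three cycles                      */
    bool haz[CYCLES][K + 2] = {{false}};
    haz[0][4] = haz[1][4] = haz[2][4] = true;

    bool full[K + 2] = {false};
    full[1] = true;                       /* stage 1 holds an instruction */

    for (int t = 0; t < CYCLES; t++) {
        bool stall[K + 2] = {false};
        bool ue[K + 2]    = {false};

        /* stalls propagate upwards: a stage stalls on its own hazard or
           because it is full and the stage below it stalls              */
        for (int k = K; k >= 1; k--)
            stall[k] = haz[t][k] || (full[k] && stall[k + 1]);

        /* a stage is updated when it does not stall and its predecessor
           is full (stage 1 is always fed)                               */
        for (int k = 1; k <= K; k++)
            ue[k] = (k == 1 || full[k - 1]) && !stall[k];

        /* next full bits: freshly updated, or kept because the stage
           below still stalls                                            */
        bool next_full[K + 2] = {false};
        for (int k = 1; k <= K; k++)
            next_full[k] = ue[k] || (full[k] && stall[k + 1]);

        for (int k = 1; k <= K; k++) {
            if (stall[k])   printf("cycle %2d: stage %d stalled\n", t, k);
            else if (ue[k]) printf("cycle %2d: stage %d updated\n", t, k);
        }

        for (int k = 1; k <= K; k++) full[k] = next_full[k];
    }
    return 0;
}

In the run above, stages 1 to 4 are frozen while the placeholder hazard is active and are all updated in the first cycle after it disappears, mirroring the order of the proof: liveness of the lower stages yields liveness of the upper ones.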