
Mikhail Kovalev · Silvia M. Müller · Wolfgang J. Paul

A Pipelined Multi-core MIPS Machine
Hardware Implementation and Correctness Proof

LNCS 9000 · Tutorial
Lecture Notes in Computer Science 9000
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Mikhail Kovalev · Silvia M. Müller · Wolfgang J. Paul (Eds.)

A Pipelined Multi-core MIPS Machine
Hardware Implementation and Correctness Proof
Volume Editors
Mikhail Kovalev
Wolfgang J. Paul
Saarland University, Saarbrücken, Germany
E-mail: {kovalev,wjp}@wjpserver.cs.uni-saarland.de
Silvia M. Müller
IBM-Labor Böblingen, Böblingen, Germany
E-mail: [email protected]

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-319-13905-0 e-ISBN 978-3-319-13906-7
DOI 10.1007/978-3-319-13906-7
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: Applied for

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues


© Springer International Publishing Switzerland 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Markus Richter, Heidelberg
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

This book is based on the third author’s lectures on computer architecture, as given in the summer semester of 2013 at Saarland University. It contains a
gate level construction of a multi-core machine with pipelined MIPS processor
cores and a sequentially consistent shared memory. This opens the way to the
formal verification of synthesizable hardware for multi-core processors in the
future.
We proceed in three steps: i) we review pipelined single core processor
constructions and their correctness proofs, ii) we present a construction of a
cache-based shared memory which is kept consistent by the MOESI protocol
and show that it is sequentially consistent, and iii) we integrate the pipelined
processor cores into the shared memory and show that the resulting hardware
simulates the steps of an abstract multi-core MIPS machine in some order. In
the last step the reference machine consists of MIPS cores and a single memory,
where the cores, together with the shared memory, execute instructions in a
nondeterministic order.
In the correctness proofs of the last two steps a new issue arises. Construc-
tions are in a gate level hardware model and thus deterministic. In contrast
the reference models against which correctness is shown are nondeterministic.
The development of the additional machinery for these proofs and the cor-
rectness proof of the shared memory at the gate level are the main technical
contributions of this work.

October 2014

Mikhail Kovalev
Silvia M. Müller
Wolfgang J. Paul
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Number Formats and Boolean Algebra . . . . . . . . . . . . . . . . . . . . 7


2.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Numbers, Sets, and Logical Connectives . . . . . . . . . . . . . . 8
2.1.2 Sequences and Bit-Strings . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Modulo Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Geometric Sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Two’s Complement Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Boolean Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.1 Identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.2 Solving Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.3 Disjunctive Normal Form . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Digital Gates and Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Some Basic Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Clocked Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Digital Clocked Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 The Detailed Hardware Model . . . . . . . . . . . . . . . . . . . . . . 44
3.3.3 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Drivers and Main Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 Open Collector Drivers and Active Low Signal . . . . . . . . 55
3.5.2 Tristate Drivers and Bus Contention . . . . . . . . . . . . . . . . . 56
3.5.3 The Incomplete Digital Model for Drivers . . . . . . . . . . . . 60
3.5.4 Self Destructing Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.5 Clean Operation of Tristate Buses . . . . . . . . . . . . . . . . . . . 64

3.5.6 Specification of Main Memory. . . . . . . . . . . . . . . . . . . . . . . 69


3.5.7 Operation of Main Memory via a Tristate Bus . . . . . . . . 72
3.6 Finite State Transducers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.1 Realization of Moore Automata . . . . . . . . . . . . . . . . . . . . . 76
3.6.2 Precomputing Outputs of Moore Automata . . . . . . . . . . . 78
3.6.3 Realization of Mealy Automata . . . . . . . . . . . . . . . . . . . . . 80
3.6.4 Precomputing Outputs of Mealy Automata . . . . . . . . . . . 81

4 Nine Shades of RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83


4.1 Basic Random Access Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Single-Port RAM Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.1 Read Only Memory (ROM) . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.2 Multi-bank RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.3 Cache State RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.4 SPR RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 Multi-port RAM Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.1 3-port RAM for General Purpose Registers . . . . . . . . . . . 92
4.3.2 General 2-port RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3.3 2-port Multi-bank RAM-ROM . . . . . . . . . . . . . . . . . . . . . . 95
4.3.4 2-port Cache State RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5 Arithmetic Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Adder and Incrementer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Arithmetic Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Arithmetic Logic Unit (ALU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 Shift Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5 Branch Condition Evaluation Unit . . . . . . . . . . . . . . . . . . . . . . . . . 113

6 A Basic Sequential MIPS Machine . . . . . . . . . . . . . . . . . . . . . . . . . 117


6.1 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.1 I-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.2 R-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.3 J-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 MIPS ISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2.1 Configuration and Instruction Fields . . . . . . . . . . . . . . . . . 120
6.2.2 Instruction Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.3 ALU-Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.2.4 Shift Unit Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.5 Branch and Jump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2.6 Sequences of Consecutive Memory Bytes . . . . . . . . . . . . . 129
6.2.7 Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.8 ISA Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 A Sequential Processor Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.1 Software Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.2 Hardware Configurations and Computations . . . . . . . . . . 134

6.3.3 Memory Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


6.3.4 Defining Correctness for the Processor Design . . . . . . . . . 138
6.3.5 Stages of Instruction Execution . . . . . . . . . . . . . . . . . . . . . 140
6.3.6 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.3.7 Instruction Fetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3.8 Instruction Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.9 Reading from General Purpose Registers . . . . . . . . . . . . . 147
6.3.10 Next PC Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3.11 ALU Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.12 Shift Unit Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.3.13 Jump and Link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.3.14 Collecting Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3.15 Effective Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3.16 Shift for Store Environment . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3.17 Memory Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3.18 Shifter for Load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.19 Writing to the General Purpose Register File . . . . . . . . . 159

7 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.1 MIPS ISA and Basic Implementation Revisited . . . . . . . . . . . . . . 162
7.1.1 Delayed PC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.1.2 Implementing the Delayed PC . . . . . . . . . . . . . . . . . . . . . . 163
7.1.3 Pipeline Stages and Visible Registers . . . . . . . . . . . . . . . . 164
7.2 Basic Pipelined Processor Design . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.2.1 Transforming the Sequential Design . . . . . . . . . . . . . . . . . . 167
7.2.2 Scheduling Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.2.3 Use of Invisible Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.2.4 Software Condition SC-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.2.5 Correctness Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.2.6 Correctness Proof of the Basic Pipelined Design . . . . . . . 178
7.3 Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.3.1 Hits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.3.2 Forwarding Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.3.3 Software Condition SC-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.3.4 Scheduling Functions Revisited . . . . . . . . . . . . . . . . . . . . . . 192
7.3.5 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.4 Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.4.1 Stall Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.4.2 Hazard Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.4.3 Correctness Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.4.4 Scheduling Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.4.5 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7.4.6 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

8 Caches and Shared Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207


8.1 Concrete and Abstract Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.1.1 Abstract Caches and Cache Coherence . . . . . . . . . . . . . . . 210
8.1.2 Direct Mapped Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.1.3 k-way Associative Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
8.1.4 Fully Associative Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.2.1 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.2.2 Memory and Memory Systems . . . . . . . . . . . . . . . . . . . . . . 219
8.2.3 Accesses and Access Sequences . . . . . . . . . . . . . . . . . . . . . . 220
8.2.4 Sequential Memory Semantics . . . . . . . . . . . . . . . . . . . . . . . 221
8.2.5 Sequentially Consistent Memory Systems . . . . . . . . . . . . . 222
8.2.6 Memory System Hardware Configurations . . . . . . . . . . . . 223
8.3 Atomic MOESI Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.3.1 Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.3.2 Defining the Protocol by Tables . . . . . . . . . . . . . . . . . . . . . 226
8.3.3 Translating the Tables into Switching Functions . . . . . . . 228
8.3.4 Algebraic Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8.3.5 Properties of the Atomic Protocol . . . . . . . . . . . . . . . . . . . 234
8.4 Gate Level Design of a Shared Memory System . . . . . . . . . . . . . . 235
8.4.1 Specification of Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.4.2 Data Paths of Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
8.4.3 Cache Protocol Automata . . . . . . . . . . . . . . . . . . . . . . . . . . 247
8.4.4 Automata Transitions and Control Signals . . . . . . . . . . . . 249
8.4.5 Bus Arbiter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
8.4.6 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
8.5 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
8.5.1 Arbitration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
8.5.2 Silent Slaves and Silent Masters . . . . . . . . . . . . . . . . . . . . . 263
8.5.3 Automata Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . 264
8.5.4 Control of Tristate Drivers . . . . . . . . . . . . . . . . . . . . . . . . . 269
8.5.5 Protocol Data Transmission . . . . . . . . . . . . . . . . . . . . . . . . 274
8.5.6 Data Transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
8.5.7 Accesses of the Hardware Computation . . . . . . . . . . . . . . 279
8.5.8 Relation with the Atomic Protocol . . . . . . . . . . . . . . . . . . 301
8.5.9 Ordering Hardware Accesses Sequentially . . . . . . . . . . . . . 305
8.5.10 Sequential Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
8.5.11 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310

9 A Multi-core Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311


9.1 Compare-and-Swap Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.1.1 Introducing CAS to the ISA . . . . . . . . . . . . . . . . . . . . . . . . 312
9.1.2 Introducing CAS to the Sequential Processor . . . . . . . . . 313
9.2 Multi-core ISA and Reference Implementation . . . . . . . . . . . . . . . 317
9.2.1 Multi-core ISA Specification . . . . . . . . . . . . . . . . . . . . . . . . 317

9.2.2 Sequential Reference Implementation . . . . . . . . . . . . . . . . 318


9.2.3 Simulation Relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9.2.4 Local Configurations and Computations . . . . . . . . . . . . . . 323
9.2.5 Accesses of the Reference Computation . . . . . . . . . . . . . . 325
9.3 Shared Memory in the Multi-core System . . . . . . . . . . . . . . . . . . . 326
9.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
9.3.2 Invisible Registers and Hazard Signals . . . . . . . . . . . . . . . 327
9.3.3 Connecting Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
9.3.4 Stability of Inputs of Accesses . . . . . . . . . . . . . . . . . . . . . . . 330
9.3.5 Relating Update Enable Signals and Ends of Accesses . . 331
9.3.6 Scheduling Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
9.3.7 Stepping Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
9.3.8 Correctness Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
9.3.9 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
1 Introduction

Building on [12] and [6], we present at the gate level the construction of a
multi-core MIPS machine with “basic” pipelined processors and prove that
it works. “Basic” means that the processors only implement the part of the
instruction set architecture (ISA) that is visible in user mode; we call it ISA-u.
Extending it to the full architecture ISA-sp, that is visible in system program-
mers mode, we would have to add among other things the following mecha-
nisms: i) local and inter processor interrupts, ii) store buffers, and iii) memory
management units (MMUs). We plan to do this as future work. In Sect. 1.1
we present reasons why we think the results might be of interest. In Sect. 1.2
we give a short overview of the book.

1.1 Motivation
There are several reasons why we wrote this book, and they might also motivate other scientists to read it.

Lecture Notes

The book contains the lecture notes of the third author’s lectures on Computer
Architecture 1 as given in the summer semester 2013 at Saarland University.
The purpose of ordinary architecture lectures is to enable students to draw
the building plans of houses, bridges, etc., and hopefully, also to explain why
they won’t collapse. Similarly, we try in our lectures on computer architecture
to enable students to draw the building plans of processors and to explain
why they work. We do this by presenting in the classroom a complete gate
level design of a RISC processor. We present correctness proofs because, for
nontrivial designs, they are the fastest way we know to explain why the designs
work. Because we live in the age of multi-core computing we attempted to
treat the design of such a processor in the classroom within a single semester.
With the help of written lecture notes this happened to work out. Indeed, a
student who had only 6 weeks of previous experience with hardware design
succeeded in implementing the processor presented here on a field programmable
gate array (FPGA) within 6 weeks after the end of the lectures [8].

Multi-core Processor Correctness

To the best of our knowledge, this book contains the first correctness proof
for the gate level implementation of a multi-core processor.

Shared Memory Correctness

As a building block for the processor design, the book contains a gate level
implementation of a cache consistent shared memory system and a proof that
it is sequentially consistent. That such shared memory systems can be imple-
mented is in a sense the fundamental folklore theorem of multi-core comput-
ing: i) everybody believes it; experimental evidence given by the computers
on our desk is indeed overwhelming, ii) proofs are widely believed to exist,
but iii) apparently nobody has seen the entire proof; for an explanation why
the usual model checked results for cache coherence protocols fail to prove
the whole theorem see the introduction of Chap. 8. To the best of our knowl-
edge, this book contains the first complete correctness proof for a gate level
implementation of a cache based sequentially consistent shared memory.

Building Plans for Formal Verification

Proofs in this book are presented on paper; such proofs are often called “paper-
and-pencil proofs”. Because they are the work of humans they can contain er-
rors and gaps, which are hopefully small and easily repaired. With the help of
computers one can debug proofs produced by humans: one enters them in com-
puter readable form into so called computer aided verification systems (CAV
systems) and then lets the computer check the correctness and completeness
of the proof. Technically, this reduces to a syntax check in a formal language
of what the system accepts as a proof. This process is called formal verifica-
tion. Formal, i.e., machine readable proofs, are engineering objects. Like any
other engineering objects, it makes a lot of sense to construct them from a
building plan. Sufficiently precise paper and pencil proofs serve this purpose
extremely well. Proofs published in [12] led, for instance, very quickly to the
formal verification of single core processors of industrial complexity [1,3]. The
proofs in this book are, therefore, also meant as blueprints for later formal
verification work of shared memory systems and multi-core processors. This
explains why we have included some lengthy proofs of the bookkeeping type
in this text. They are only meant as help for verification engineers. At the
beginning of chapters, we give hints to these proofs: they can be skipped at a
first reading and should not be presented in the classroom. There is, however,
a benefit of going through these proofs: afterwards you feel more comfortable
when you skip them in the classroom.

Fig. 1. Functional correctness (1) is shown on level n. Proof obligation (3) is not necessary for the proof of (1) but has to be discharged on level n in order to guarantee implementation (2) at level n − 1. If level n is considered in isolation it drops out of the blue.

“Obscure” Proof Obligations

In recent research we have encountered a general and quite hideous source of gaps in formal verification work. Systems are composed of successive layers,
where on each layer n the user sees a certain computational model, and the
model on layer n is implemented in the simplest case simply in the model of
layer n − 1 . Typical layers are i) digital gates and registers, ii) instruction set
architecture (ISA), iii) assembly language, iv) high level language (+ maybe
assembly), and v) operating system kernel. These models often are very well
established and are successfully used for such a long time that one does not
worry about their completeness in places where one should. This can lead to
the situation established in Fig. 1. Imagine we want to prove a property of
a system component specified and implemented on level n. We successfully
finish the proof (1) that the implementation meets the specification on level n,
both on paper and in the formal verification work. We overlook the fact that
level n is implemented (2) in level n − 1 and that certain not so well published
implementation details lead to proof obligations (3) on the higher level n. If we
restrict attention to the single level n, these proof obligations are completely
obscure and drop out of the blue; their absence does not show up in the proof
of the desired correctness property (1). Of course, in mathematics, there are
no obscure proof obligations, and proof obligation (3) is only overlooked on
level n if we disregard the fact that simulation of level n by lower levels has to
be established. We give two examples of this effect and remedy one of them
in this book:
• The documentation of instruction set architectures tends to hide the pro-
cessor pipeline from the user. This does not always work. Between certain
operations (like the write of an instruction during page fault handling and
a subsequent fetch), one had better wait long enough (which exposes the pipe
depth) or drain the pipe (which is a meaningless operation in a usual ISA
model where instruction execution is sequential). The same pipe drain
should occur between an update of a page table and its subsequent use
by the memory management unit (MMU). In the scenario studied here,
these problems do not occur: there is no MMU yet and – as in early work
on sequential processor verification – we assume for the time being that code is not self modifying. Identifying for a multi-core processor all such
implementation dependent proof obligations, which are obscure at the ISA
level alone, is a major research project and part of the future work outlined
below.
• Digital hardware is not the lowest system layer one has to consider in hard-
ware design. If one studies hardware with different clock domains, one has
to consider the more detailed hardware model from the data sheets of
hardware vendors; within each clock domain one can use timing analysis
to establish a simulation theorem justifying the use of digital design rules
for gates and registers except at the borders of the clock domain. This
approach was sketched in [10] and is worked out in Chap. 3 here. But even
in a single clock domain it turns out that for the drivers, the buses, and
the main memory components considered here, the digital model does not
suffice to derive all proof obligations for the digital control circuits of the
units. Additional proof obligations that arise from the detailed hardware
model have to be discharged by the digital control logic. This explains our
treatment of the detailed hardware model here. Note, by the way, that
in [12] and subsequent formal verification work only the digital hardware
model was considered. As a result, the control for the main memory as
published in [12] violates the obligations coming from the detailed model.
That the formally verified hardware did not crash (indeed it ran imme-
diately without ever crashing) was luck: the hardware implementer had
secretly changed the verified design in a single place which happened to
fix the problem. For details see the introduction of Chap. 3.

Basis for Future Work

In the preceding paragraph, we have pointed out the existence of software conditions which have to be met by ISA programs and which stem from im-
plementation details of the processor hardware. We have given examples for
such conditions even in the context of single core processors. It is of paramount
importance to identify all such software conditions at the ISA level, because
without such knowledge it is simply impossible to guarantee that any pro-
gram (written directly in ISA or compiled or assembled into ISA) performs as
specified: any software conditions we have overlooked and which are not met
by our software might bring the system down. For multi-core processors oper-
ating in system mode this appears now as a quite nontrivial task. As outlined
in [2, 15] we intend to proceed in the following way:
• Define without software conditions the full ISA-sp visible by system pro-
grammers for a multi-core machine which is at the same time sufficiently
realistic and sufficiently small to permit gate level implementation by a
group working in an academic environment. In [15] this has been done
for a machine called MIPS-86. It has an x86-like memory system and
a MIPS instruction set. In addition to the machine considered here it
has store buffers, MMUs, advanced programmable interrupt controllers (APICs) and mechanisms for inter processor interrupts. The model still
has to be augmented to permit multiple memory modes.
• Augment the design of this book by the missing units. Augment the proofs
of this book to show that the machine built this way implements the
specified ISA-sp. Identify the software conditions which make the proof
work.

1.2 Overview
Chapter 2 contains some very basic material about number formats and
Boolean algebra. In the section on congruences we establish simple lemmas
which we use in Chap. 5 to prove the remarkable fact that binary numbers
and two’s complement numbers are correctly added and subtracted (modulo
2^n) by the very same hardware.
In Chap. 3, we define the digital and the physical hardware model and
show that, in common situations, the digital model is an abstraction of the
detailed model. For the operation of drivers, buses, and main memory compo-
nents we construct circuits which are controlled completely by digital signals,
and show in the detailed model that the control circuits operate the buses
without contention and the main memory without glitches. We show that
these considerations are crucial by constructing a bus control which i) has no
bus contention in the (incomplete) digital model and ii) has bus contention
for roughly 1/3 of the time in the detailed model. In the physical world, such
a circuit destroys itself due to short circuits.
Chapter 4 contains a collection of various RAM constructions that we need
later. Arithmetic logic units, shifters, etc. are treated in Chap. 5.
In Chap. 6, the basic MIPS instruction set architecture is defined and a
sequential reference implementation is given. We make the memory of this
machine as wide as the cache line size we use later. Thus, shifters have to
be provided for the implementation of load and store operations. Intuitively,
it is obvious that these shifters work correctly. Thus, in [12], where we did
not aim at formal verification, a proof that loads and stores were correctly
implemented with the help of these shifters was omitted. Subsequent formal
verification work only considered loads and stores of words and double words;
this reduced the shifters to trivial hardware which was easy to deal with. As
this text is also meant as a help for verification engineers, we included the
correctness proof for the full shifters. These proofs argue, in the end, about
the absence of carries in address computations across cache line boundaries
for aligned accesses. Writing these arguments down turned out slightly more
tricky than anticipated.
With the exception of the digital hardware model and the shifters for load
and store operations, the book until this point basically contains updated and
revised material from the first few chapters in [12]. In Chap. 7, we obtain a
pipelined machine from the reference implementation by a fairly simple transformation. However, in contrast to the old pipelined construction from [12],
we incorporate improvements from [6]. As a result, both the transformation
and the correctness proof are simpler. A small gap in the corresponding proof
in [12] is closed. The improved stall engine controlling pipeline stalls from [6]
is used. Liveness is treated in greater detail than in [12].
The main new technical material is in the last two chapters. In Chap. 8,
we construct a shared memory system whose caches are kept consistent by
the MOESI protocol [16] and we prove that it is sequentially consistent.
In Chap. 9, we extend the specification of the basic MIPS ISA to the
multi-core case. We construct the multi-core machine by hooking the pipelined
processors from Chap. 7 into the shared memory system from Chap. 8. This
construction is fairly straightforward. Then, we combine previous correctness
proofs into an overall functional correctness proof of the multi-core machine.
Intuitively, it is quite clear why this works. Still, making this intuition precise
required the introduction of a modest amount of new proof machinery. We
conclude Chap. 9 by proving liveness of the multi-core machine.
2 Number Formats and Boolean Algebra

We begin in Sect. 2.1 with very basic definitions of intervals of integers. Be-
cause this book is meant as a building plan for formal proofs, we cannot make
definitions like [1 : 10] = {1, 2, . . . , 10} because CAV systems don’t understand
such definitions. So we replace them by fairly obvious inductive definitions.
We also have to deal with the minor technical nuisance that, usually, sequence
elements are numbered from left to right, but in number representations, it is
much nicer to number them from right to left.
Section 2.2 on modulo arithmetic was included for several reasons. i) The notation mod k is overloaded: it is used to denote both the congruence relation modulo a number k and the operation of taking the remainder of an integer division by k. We prefer our readers to clearly understand this. ii) Fixed
point arithmetic is modulo arithmetic, so we will clearly have to make use
of it. The most important reason, however, is that iii) addition/subtraction
of binary numbers and of two’s complement numbers is done by exactly the
same hardware1. When we get to this topic this will look completely intuitive
and, therefore, there should be a very simple proof justifying this fact. Such a
proof can be found in Sect. 5.1; it hinges on a simple lemma about the solution
of congruence equations from this section.
The very short Sect. 2.3 on geometric sums is simply there to remind the
reader of the proof of the formula for the computation of geometric sums,
which is much easier to memorize than the formula itself.
Section 2.4 introduces the binary number format, presents the school
method for binary addition, and proves that it works. Although this looks
simple and familiar and the correctness proof of the addition algorithm is
only a few lines long, the reader should treat this result with deep respect:
it is probably the first time that he or she sees a proof of the fact that the
addition algorithm learned at school always works. The Old Romans, who
were fabulous engineers in spite of their clumsy number systems, would have
loved to see this proof.

¹ Except for the computation of overflow and negative signals.


Integers are represented in computers as two’s complement numbers. In Sect. 2.5, we introduce this number format and derive a small number of basic identities for such numbers. From this we derive a subtraction algorithm for binary numbers, which is quite different from the school method, and show that it works. Sections 2.2, 2.3, and 2.5 are the basis of our construction of
an arithmetic unit later.
Finally, in Sect. 2.6 on Boolean Algebra, we provide a very short proof of
the fundamental result that Boolean functions can be computed using Boolean
expressions in disjunctive normal form. This result can serve to construct all
small circuits – e.g., in the control logic – where we only specify their func-
tionality and do not bother to specify a concrete realization. The proof is
intuitive and looks simple, but it will give us the occasion to explain formally
the difference between what is often called “two kinds of equations”: i) identities² e(x) = e′(x), which hold for all x, and ii) equations e(x) = e′(x) that we want to solve by determining the set of all x such that the equation holds³.
The reader will notice that this might be slightly subtle, because both kinds
of equations have exactly the same form.

2.1 Basics
2.1.1 Numbers, Sets, and Logical Connectives

We denote by
N = {0, 1, 2, . . .}
the set of natural numbers including zero, by

N+ = {1, 2, . . .}

the set of natural numbers excluding zero, by

Z = {. . . , −2, −1, 0, 1, 2, . . .}

the set of integers, and by


B = {0, 1}
the set of Boolean values.
For integers i, j with i ≤ j, we define the interval of integers from i to j
by
[i : j] = {i, i + 1, . . . , j} .
Strictly speaking, definitions using three dots are never precise; they resemble
intelligence tests, where the author hopes that all readers who are forced
to take the test arrive at the same solution. Usually, one can easily find a
² In German: Identitäten.
³ In German: Bestimmungsgleichung.

Table 1. Logical connectives and quantifiers


x∧y and
x∨y or
¬x , x̄ not
x⊕y exclusive or, + modulo 2
x→y implies
x↔y if and only if
∀x for all
∃x exists

corresponding and completely precise recursive definition (without three dots) in such a situation. Thus, we define the interval [i : j] in a rigorous way as

[i : i] = {i}
[i : j + 1] = [i : j] ∪ {j + 1} .
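To make this recursive style of definition concrete, here is a minimal Python sketch; the function name interval is our own and only serves as an illustration of the definition above.

```python
def interval(i, j):
    # [i : i] = {i},  [i : j + 1] = [i : j] ∪ {j + 1}
    if j == i:
        return {i}
    return interval(i, j - 1) | {j}

assert interval(1, 10) == set(range(1, 11))
```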

The Hilbert ε-operator picks an element ε(A) from a set A. Applied to a singleton set, it returns the unique element of the set:

ε{x} = x .

For finite sets A, we denote by #A the cardinality, i.e., the number of elements
in A.
Given a function f operating on a set A and a set A1 ⊆ A, we denote by
f (A1 ) the image of set A1 under function f , i.e.,

f (A1 ) = {f (a) | a ∈ A1 } .

In statements and predicates we use the logical connectives and quantifiers


from Table 1. For the negation of x we can write ¬x as well as x̄.
In computer science literature, logarithms are to the base of two, unless
explicitly stated otherwise. This text is no exception.

2.1.2 Sequences and Bit-Strings

A sequence a of n many elements a_i = a[i] = a(i) from set A, where n ∈ N+ and i ∈ [0 : n − 1], can come in many flavors. In this book we use three of them.
1. A sequence numbered from left to right starting with 0 is written as

a = (ai ) = (a0 , . . . , an−1 ) = a[0 : n − 1]

and is formalized without the dots as a mapping

a : [0 : n − 1] → A .

With this formalization the set A^n of sequences of length n with elements from A is defined as

A^n = {a | a : [0 : n − 1] → A} .

2. A sequence numbered from left to right starting with 1 is written as

a = (a1 , . . . , an ) = a[1 : n]

and is defined as
a : [1 : n] → A .
The set A^n of such sequences is then formalized as

A^n = {a | a : [1 : n] → A} .

3. We can also number the elements in a sequence a from right to left starting
with 0. Then we write

a = (an−1 , . . . , a0 ) = a[n − 1 : 0] ,

which, surprisingly, has the same formalization as a sequence starting with 0 and numbered from left to right:

a : [0 : n − 1] → A .

The set A^n is again defined as

A^n = {a | a : [0 : n − 1] → A} .

Thus, the direction of ordering does not show up in the formalization yet.
The reason is that the interval [0 : n − 1] is a set, and elements of sets
are unordered. The difference, however, will show up when we formalize
operations on sequences.
The concatenation operator ◦ is defined for sequences a[n − 1 : 0], b[m − 1 : 0] numbered from right to left and starting with 0 as

∀i ∈ [n + m − 1 : 0] : (a ◦ b)[i] = b[i] if i < m, and a[i − m] if i ≥ m ,

or, respectively, for sequences a[0 : n − 1], b[0 : m − 1] numbered from left to right as

∀i ∈ [0 : n + m − 1] : (a ◦ b)[i] = a[i] if i < n, and b[i − n] if i ≥ n .

For sequences a[1 : n] and b[1 : m] numbered from left to right, concatenation is defined as

∀i ∈ [1 : n + m] : (a ◦ b)[i] = a[i] if i ≤ n, and b[i − n] if i > n .

Concatenation of sequences a with single symbols b ∈ A is handled by treating elements b as sequences with one element b = b[0].
Let i ≤ j and j ≤ n − 1. Then for sequences a[0 : n − 1] we define a
subsequence a[i : j] as

a[i : j] = c[0 : j − i] with c[k] = a[i + k]

and for sequences a[n − 1 : 0] a subsequence a[j : i] is defined as

a[j : i] = c[j − i : 0] with c[k] = a[i + k] .

For sequences a[1 : n] and indices i ≤ j, where 1 ≤ i and j ≤ n, subsequence a[i : j] is defined as

a[i : j] = c[1 : j − i + 1] with c[k] = a[i + k − 1] .

A single element x ∈ B is called a bit. A sequence a ∈ B^n is called a bit-string. For bits x ∈ B and natural numbers n ∈ N+, a bit-string obtained by repeating x exactly n times is defined in the format of an intelligence test by

x^n = x . . . x   (n times)

and formally by

x^1 = x
x^{n+1} = x ◦ x^n .

Examples of such strings are

1^2 = 11
0^4 = 0000 .

In these examples and later in the book, we often omit ◦ when denoting the
concatenation of bit-strings x1 and x2 :

x1 x2 = x1 ◦ x2 .

When dealing with the construction of RAMs and memory systems, it is sometimes convenient to talk about bytes rather than individual bits. Function byte(i, x) is used to extract the i-th byte from a bit-string x ∈ B^{8k}, where i ∈ [0 : k − 1]:

byte(i, x) = x[8 · i + 7 : 8 · i] .
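The right-to-left numbering and the byte function can be mimicked with ordinary Python strings; in the sketch below (helper names bit and byte are ours) a bit-string is a Python string whose last character is bit 0.

```python
def bit(x, i):
    # bit i of bit-string x, counted from the right (the last character is bit 0)
    return x[len(x) - 1 - i]

def byte(i, x):
    # byte(i, x) = x[8*i + 7 : 8*i], i.e., the eight bits at positions 8*i, ..., 8*i + 7
    return x[len(x) - 8 * (i + 1): len(x) - 8 * i]

x = "00000001" + "11111111"      # two bytes; byte 0 is the rightmost one
assert byte(0, x) == "11111111"
assert byte(1, x) == "00000001"
assert bit(x, 8) == "1"          # lowest bit of byte 1
```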

Complement ā of a bit-string a ∈ B^n is defined in an obvious way:

ā = (ā_{n−1}, . . . , ā_0) .

For logical connectives ◦ ∈ {∧, ∨, ⊕}, bit-strings a, b ∈ B^n, and a bit c ∈ B, we borrow the notation from vector calculus to define the corresponding bit-operations on bit-strings⁴:

a[n − 1 : 0] ◦ b[n − 1 : 0] = (a_{n−1} ◦ b_{n−1}, . . . , a_0 ◦ b_0)
c ◦ b[n − 1 : 0] = (c ◦ b_{n−1}, . . . , c ◦ b_0) .

2.2 Modulo Computation


There are infinitely many integers and every computer can only store finitely
many numbers. Thus, computer arithmetic cannot possibly work like ordinary
arithmetic. Fixed point arithmetic⁵ is usually performed modulo 2^n for some n. We review basics about modulo computation.
For integers a, b ∈ Z and natural numbers k ∈ N+ , one defines a and b to
be congruent mod k or equivalent mod k iff they differ by an integer multiple
of k:
a ≡ b mod k ↔ ∃z ∈ Z : a − b = z · k .
Congruence mod k has a number of important properties which we formalize
below.
Let R be a relation between elements of a set A. We say that R is reflexive
if we have aRa for all a ∈ A. We say that R is symmetric if aRb implies bRa.
We say that R is transitive if aRb and bRc imply aRc. If all three properties
hold, R is called an equivalence relation on A.

Lemma 2.1 (congruence properties). Congruence mod k is an equivalence relation.
Proof. We show that the properties of an equivalence relation are satisfied:
• Reflexivity: For all a ∈ Z we have a − a = 0 · k. Thus, a ≡ a mod k and
congruence mod k is reflexive.
• Symmetry: Let a ≡ b mod k with a − b = z · k. Then, b − a = −z · k. Thus,
b ≡ a mod k.
• Transitivity: Let a ≡ b mod k with a − b = z · k and b ≡ c mod k with
b − c = u · k. Then, a − c = (z + u) · k, and thus, a ≡ c mod k.


⁴ Note that here ◦ is not to be confused with the concatenation operator.
⁵ The only arithmetic considered in this book. For the construction of floating point units see [12].

Lemma 2.2 (plus, minus equivalence). Let a, a′, b, b′ ∈ Z and k ∈ N+ with a ≡ a′ mod k and b ≡ b′ mod k. Then,

a + b ≡ a′ + b′ mod k
a − b ≡ a′ − b′ mod k .

Proof. Let a − a′ = u · k and b − b′ = v · k, then we have

a + b − (a′ + b′) = a − a′ + b − b′ = (u + v) · k
a − b − (a′ − b′) = a − a′ − (b − b′) = (u − v) · k ,

which imply the desired congruences. □

Two numbers r and s in an interval of the form [i : i + k − 1] that are both
equivalent to a mod k are identical.
Lemma 2.3 (equality from equivalence). Let i ∈ Z, k ∈ N+ , and let
r, s ∈ [i : i + k − 1], then
a ≡ r mod k ∧ a ≡ s mod k → r = s .
Proof. By symmetry we have s ≡ a mod k and by transitivity we get s ≡
r mod k. Thus, r − s = z · k for an integer z. We conclude z = 0 because
|r − s| < k. □


Let R be an equivalence relation on A. A subset B ⊂ A is called a system of representatives if and only if for every a ∈ A there is exactly one r ∈ B with
aRr. The unique r ∈ B satisfying aRr is called the representative of a in B.
Lemma 2.4 (system of representatives). For i ∈ Z and k ∈ N+ , the
interval of integers [i : i + k − 1] is a system of representatives for equivalence
mod k.
Proof. Let a ∈ Z. We define the representative r(a) by
f (a) = max{j ∈ Z | a − k · j ≥ i}
r(a) = a − f (a) · k.

Then r(a) ≡ a mod k and r(a) ∈ [i : i + k − 1]. Uniqueness follows from Lemma 2.3.
Note that in case i = 0, f(a) is the result of the integer division of a by k:

f(a) = ⌊a/k⌋ ,

and

r(a) = a − ⌊a/k⌋ · k

is the remainder of this division. □

We have to point out that in mathematics the three letter word “mod” is not
only used for the relation defined above. It is also used as a binary operator in
which case (a mod k) denotes the representative of a in [0 : k − 1]. Let a, b ∈ Z
and k ∈ N+ . Then,

(a mod k) = ε{b | a ≡ b mod k ∧ b ∈ [0 : k − 1]} .

Thus, (a mod k) is the remainder of the integer division of a by k for a ≥ 0. In order to stress when mod is used as a binary operator, we always write (a mod k) in brackets. For later use in the theory of two’s complement numbers, we define another modulo operator. Let a, b ∈ Z and k = 2 · k′ be an even number with k′ ∈ N+. Then,

(a tmod k) = ε{b | a ≡ b mod k ∧ b ∈ [−k/2 : k/2 − 1]} .

From Lemma 2.3 we infer a simple but useful lemma about the solution of
equivalences mod k.
Lemma 2.5 (solution of equivalences). Let k be even and x ≡ y mod k,
then
1. x ∈ [0 : k − 1] → x = (y mod k) ,
2. x ∈ [−k/2 : k/2 − 1] → x = (y tmod k) .
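For readers who like to experiment, both binary operators are easy to express in Python; % already returns the representative in [0 : k − 1], and tmod shifts it into [−k/2 : k/2 − 1]. The function names below are our own.

```python
def mod(a, k):
    # representative of a in [0 : k-1]; Python's % already follows this convention
    return a % k

def tmod(a, k):
    # representative of a in [-k/2 : k/2 - 1], defined for even k
    assert k % 2 == 0
    r = a % k
    return r if r < k // 2 else r - k

assert mod(-3, 16) == 13
assert tmod(13, 16) == -3
assert tmod(7, 16) == 7
```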

2.3 Geometric Sums


For q ≠ 1 we consider

S = Σ_{i=0}^{n−1} q^i ,

the geometric sum over q. Then,

q · S = Σ_{i=1}^{n} q^i
q · S − S = q^n − 1
S = (q^n − 1) / (q − 1) .
For q = 2 we state this in the following lemma.
Lemma 2.6 (geometric sum). For n ∈ N+,

Σ_{i=0}^{n−1} 2^i = 2^n − 1 .

We will use this lemma in the next section.



2.4 Binary Numbers

For bit-strings a = a[n − 1 : 0] ∈ B^n we denote by

⟨a⟩ = Σ_{i=0}^{n−1} a_i · 2^i

the interpretation of bit-string a as a binary number. We call a the binary representation of length n of the natural number ⟨a⟩. Examples of bit-strings interpreted as binary numbers are given below:

⟨100⟩ = 4
⟨111⟩ = 7
⟨1 0^n⟩ = 2^n .

Applying Lemma 2.6, we get

⟨1^n⟩ = Σ_{i=0}^{n−1} 2^i = 2^n − 1 ,

i.e., the largest binary number representable with n bits corresponds to the natural number 2^n − 1.
Note that binary number interpretation is an injective function.
Lemma 2.7 (binary representation injective). Let a, b ∈ B^n. Then,

⟨a⟩ = ⟨b⟩ → a = b .

Proof. Let j = max{i | a_i ≠ b_i} be the largest index where strings a and b differ. Without loss of generality assume a_j = 1 and b_j = 0. Then,

⟨a⟩ − ⟨b⟩ = Σ_{i=0}^{j} a_i · 2^i − Σ_{i=0}^{j} b_i · 2^i
≥ 2^j − Σ_{i=0}^{j−1} 2^i
= 1

by Lemma 2.6. □

Let n ∈ N+. We denote by

B_n = {⟨a⟩ | a ∈ B^n}

the set of natural numbers that have a binary representation of length n. Since


0 ≤ ⟨a⟩ ≤ Σ_{i=0}^{n−1} 2^i = 2^n − 1 ,

we deduce

B_n ⊆ [0 : 2^n − 1] .

As ⟨·⟩ is injective and

#B_n = #B^n = 2^n = #[0 : 2^n − 1] ,

we observe that ⟨·⟩ is bijective and get the following lemma.

Lemma 2.8 (natural numbers with binary representation). For n ∈ N+ we have

B_n = [0 : 2^n − 1] .
For x ∈ B_n we denote the binary representation of x of length n by bin_n(x):

bin_n(x) = ε{a | a ∈ B^n ∧ ⟨a⟩ = x} .

To shorten notation even further, we write x_n instead of bin_n(x):

x_n = bin_n(x) .
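As a quick sanity check, ⟨·⟩ and bin_n can be transcribed in a few lines of Python; the names val and bin_n below are our own, and bit i is stored at list index i (right-to-left indexing).

```python
def val(a):
    # <a> : interpret the bit list a as a binary number, a[i] has weight 2^i
    return sum(a[i] * 2 ** i for i in range(len(a)))

def bin_n(x, n):
    # the unique a in B^n with <a> = x, defined for x in [0 : 2^n - 1]
    assert 0 <= x < 2 ** n
    return [(x >> i) & 1 for i in range(n)]

assert val([0, 0, 1]) == 4          # <100> = 4
assert val(bin_n(13, 4)) == 13
```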
It is often useful to decompose n-bit binary representations a[n − 1 : 0] into an upper part a[n − 1 : m] and a lower part a[m − 1 : 0]. The connection between the numbers represented is stated in Lemma 2.9.

Lemma 2.9 (decomposition). Let a ∈ B^n and n ≥ m. Then,

⟨a[n − 1 : 0]⟩ = ⟨a[n − 1 : m]⟩ · 2^m + ⟨a[m − 1 : 0]⟩ .
Proof.

⟨a[n − 1 : 0]⟩ = Σ_{i=m}^{n−1} a_i · 2^i + Σ_{i=0}^{m−1} a_i · 2^i
= Σ_{j=0}^{n−1−m} a_{m+j} · 2^{m+j} + ⟨a[m − 1 : 0]⟩
= 2^m · Σ_{j=0}^{n−1−m} a_{m+j} · 2^j + ⟨a[m − 1 : 0]⟩
= 2^m · ⟨a[n − 1 : m]⟩ + ⟨a[m − 1 : 0]⟩ . □

We obviously have

⟨a[n − 1 : 0]⟩ ≡ ⟨a[m − 1 : 0]⟩ mod 2^m .

Using Lemma 2.5, we infer the following lemma.

Table 2. Binary addition of 1-bit numbers a, b with carry c


a b c c′ s
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1

Lemma 2.10 (decomposition mod equality). For a ∈ B^n and m ≤ n,

⟨a[m − 1 : 0]⟩ = (⟨a[n − 1 : 0]⟩ mod 2^m) .

Intuitively speaking, taking a binary number modulo 2^m means “throwing away” the bits with position m or higher.
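On machine integers this is literally masking out the low m bits; a tiny exhaustive Python check (parameters m and n chosen arbitrarily):

```python
m, n = 4, 8
for x in range(2 ** n):
    # keeping the low m bits equals taking the remainder modulo 2^m
    assert x & (2 ** m - 1) == x % 2 ** m
```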
Table 2 specifies the addition algorithm for binary numbers a, b of length 1 and a carry-bit c. The binary representation (c′, s) ∈ B^2 of the sum of bits a, b, c ∈ B is computed as

⟨c′ s⟩ = a + b + c .

For the addition of n-bit numbers a[n − 1 : 0] and b[n − 1 : 0] with carry in c_0, we first observe for the sum S:

S = ⟨a[n − 1 : 0]⟩ + ⟨b[n − 1 : 0]⟩ + c_0
≤ 2^n − 1 + 2^n − 1 + 1
= 2^{n+1} − 1 .

Thus, the sum S ∈ B_{n+1} can be represented as a binary number s[n : 0] with n + 1 bits. For the computation of the sum bits we use the method for long addition that we learn in elementary school for decimal numbers. We denote by c_i the carry from position i − 1 to position i and compute (c_{i+1}, s_i) using the basic binary addition of Table 2:

⟨c_{i+1} s_i⟩ = a_i + b_i + c_i
s_n = c_n .     (1)

That this algorithm indeed computes the sum of the input numbers is asserted in Lemma 2.11.

Lemma 2.11 (correctness of binary addition). Let a, b ∈ B^n and let c_0 ∈ B. Further, let c_n ∈ B and s ∈ B^n be computed according to the addition algorithm described above. Then,

⟨c_n s[n − 1 : 0]⟩ = ⟨a[n − 1 : 0]⟩ + ⟨b[n − 1 : 0]⟩ + c_0 .

Proof. By induction on n. For n = 1 this follows directly from (1). For the induction step we conclude from n − 1 to n:

⟨a[n − 1 : 0]⟩ + ⟨b[n − 1 : 0]⟩ + c_0
= (a_{n−1} + b_{n−1}) · 2^{n−1} + ⟨a[n − 2 : 0]⟩ + ⟨b[n − 2 : 0]⟩ + c_0
= (a_{n−1} + b_{n−1}) · 2^{n−1} + ⟨c_{n−1} s[n − 2 : 0]⟩     (induction hypothesis)
= (a_{n−1} + b_{n−1} + c_{n−1}) · 2^{n−1} + ⟨s[n − 2 : 0]⟩
= ⟨c_n s_{n−1}⟩ · 2^{n−1} + ⟨s[n − 2 : 0]⟩     ((1))
= ⟨c_n s[n − 1 : 0]⟩ .     (Lemma 2.9) □
The following simple lemma allows breaking the addition of two long numbers into two additions of shorter numbers. It is useful, among other things, for proving the correctness of recursive addition algorithms (as applied in recursive hardware constructions of adders and incrementers).

Lemma 2.12 (decomposition of binary addition). For a, b ∈ B^n, for d, e ∈ B^m, and for c_0, c′, c″ ∈ B, let

⟨d⟩ + ⟨e⟩ + c_0 = ⟨c′ t[m − 1 : 0]⟩
⟨a⟩ + ⟨b⟩ + c′ = ⟨c″ s[n − 1 : 0]⟩ ,

then

⟨ad⟩ + ⟨be⟩ + c_0 = ⟨c″ st⟩ .

Proof. Repeatedly using Lemma 2.9, we have

⟨ad⟩ + ⟨be⟩ + c_0 = ⟨a⟩ · 2^m + ⟨d⟩ + ⟨b⟩ · 2^m + ⟨e⟩ + c_0
= (⟨a⟩ + ⟨b⟩) · 2^m + ⟨c′ t⟩
= (⟨a⟩ + ⟨b⟩ + c′) · 2^m + ⟨t⟩
= ⟨c″ s⟩ · 2^m + ⟨t⟩
= ⟨c″ st⟩ . □

2.5 Two’s Complement Numbers


For bit-strings a[n − 1 : 0] ∈ B^n, we denote by

[a] = −a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩

the interpretation of a as a two’s complement number. We refer to a as the two’s complement representation of [a].
For n ∈ N+, we denote by

T_n = {[a] | a ∈ B^n}

the set of integers that have a two’s complement representation of length n. Since

T_n = {[0b] | b ∈ B^{n−1}} ∪ {[1b] | b ∈ B^{n−1}}
= B_{n−1} ∪ {−2^{n−1} + x | x ∈ B_{n−1}}
= [0 : 2^{n−1} − 1] ∪ {−2^{n−1} + x | x ∈ [0 : 2^{n−1} − 1]} ,     (Lemma 2.8)

we have the following lemma.

Lemma 2.13 (integers with two’s complement representation). Let n ∈ N+. Then,

T_n = [−2^{n−1} : 2^{n−1} − 1] .

By twoc_n(x) we denote the two’s complement representation of x ∈ T_n:

twoc_n(x) = ε{a | a ∈ B^n ∧ [a] = x} .

We summarize basic properties of two’s complement numbers.

Lemma 2.14 (properties of two’s complement numbers). Let a ∈ B^n. Then, the following holds:

[0a] = ⟨a⟩     (embedding)
[a] ≡ ⟨a⟩ mod 2^n
[a] < 0 ↔ a_{n−1} = 1     (sign bit)
[a_{n−1} a] = [a]     (sign extension)
−[a] = [ā] + 1 .

Proof. The first line is trivial. The second line follows from

[a] − ⟨a⟩ = −a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩ − (a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩)
= −a_{n−1} · 2^n .

If a_{n−1} = 0 we have [a] = ⟨a[n − 2 : 0]⟩ ≥ 0. If a_{n−1} = 1 we have

[a] = −2^{n−1} + ⟨a[n − 2 : 0]⟩
≤ −2^{n−1} + 2^{n−1} − 1     (Lemma 2.8)
= −1 .

This shows the third line. The fourth line follows from

[a_{n−1} a] = −a_{n−1} · 2^n + ⟨a[n − 1 : 0]⟩
= −a_{n−1} · 2^n + a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩
= −a_{n−1} · 2^{n−1} + ⟨a[n − 2 : 0]⟩
= [a] .

For the last line we observe that x̄ = 1 − x for x ∈ B. Then,

[ā] = −ā_{n−1} · 2^{n−1} + Σ_{i=0}^{n−2} ā_i · 2^i
= −(1 − a_{n−1}) · 2^{n−1} + Σ_{i=0}^{n−2} (1 − a_i) · 2^i
= −2^{n−1} + Σ_{i=0}^{n−2} 2^i + a_{n−1} · 2^{n−1} − Σ_{i=0}^{n−2} a_i · 2^i
= −1 − [a] .     (Lemma 2.6) □

We conclude the discussion of binary numbers and two’s complement numbers
with a lemma that provides a subtraction algorithm for binary numbers.
Lemma 2.15 (subtraction for binary numbers). Let a, b ∈ Bn . Then,

a − b ≡ a + b + 1 mod 2n .

If additionally a − b ≥ 0, we have

a − b = (a + b + 1 mod 2n ) .

Proof. By Lemma 2.14 we have

⟨a⟩ − ⟨b⟩ = ⟨a⟩ − [0b]
          = ⟨a⟩ + [1b̄] + 1
          = ⟨a⟩ − 2n + ⟨b̄⟩ + 1
          ≡ ⟨a⟩ + ⟨b̄⟩ + 1 mod 2n .

The extra hypothesis ⟨a⟩ − ⟨b⟩ ≥ 0 implies

⟨a⟩ − ⟨b⟩ ∈ [0 : 2n − 1] .

The second claim now follows from Lemma 2.5.
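The following Python sketch (helper names are ours) mirrors the definitions of this section: it interprets bit strings as two's complement numbers, computes twocn (x), and checks the subtraction algorithm of Lemma 2.15 on one example.

```python
# Two's complement interpretation [a] and subtraction via <a> + <not b> + 1.
def binval(a):                        # <a>, bit i stored at index i
    return sum(bit << i for i, bit in enumerate(a))

def twoval(a):                        # [a] = -a_{n-1} * 2^(n-1) + <a[n-2:0]>
    n = len(a)
    return -a[-1] * (1 << (n - 1)) + binval(a[:-1])

def twoc(n, x):                       # twoc_n(x): the a in B^n with [a] = x
    return [(x >> i) & 1 for i in range(n)]

n = 4
for x in range(-(1 << (n - 1)), 1 << (n - 1)):   # T_n = [-2^(n-1) : 2^(n-1) - 1]
    assert twoval(twoc(n, x)) == x

# subtraction for binary numbers: <a> - <b> ≡ <a> + <not b> + 1  (mod 2^n)
a, b = [0, 1, 1, 0], [1, 1, 0, 0]
lhs = (binval(a) - binval(b)) % (1 << n)
rhs = (binval(a) + binval([1 - bit for bit in b]) + 1) % (1 << n)
assert lhs == rhs
```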




2.6 Boolean Algebra


We consider Boolean expressions with constants 0 and 1, variables x0 , x1 , . . .,
a, b, . . ., and function symbols ¬, ∧, ∨, ⊕, f (. . .), g(. . .), . . .. Four of the function
symbols have predefined semantics as specified in Table 3.
For a more formal definition one collects the constants, Boolean variables,
and Boolean function symbols allowed into sets

Table 3. Boolean operators


x y ¬x x ∧ y x ∨ y x ⊕ y
0 0 1 0 0 0
0 1 1 0 1 1
1 0 0 0 1 1
1 1 0 1 1 0

C = {0, 1}
V = {x0 , x1 , . . .}
F = {f0 , f1 , . . .} .

and denotes by ni the number of arguments of function fi . Now we can
define the set BE of Boolean expressions by the following rules:
1. Constants and variables are Boolean expressions:

C ∪ V ⊂ BE .

2. If e is a Boolean expression, then (¬e) is also a Boolean expression:

e ∈ BE → (¬e) ∈ BE .

3. If e and e′ are Boolean expressions then so is (e ◦ e′ ), where ◦ is a binary
connector:

e, e′ ∈ BE ∧ ◦ ∈ {∧, ∨, ⊕} → (e ◦ e′ ) ∈ BE .

4. If fi is a symbol for a function with ni arguments, then we can obtain a


Boolean expression fi (e1 , . . . , eni ) by substituting the function arguments
with Boolean expressions ej :

(∀j ∈ [1 : ni ] : ej ∈ BE) → fi (e1 , . . . , eni ) ∈ BE .

5. All Boolean expressions are formed by the above rules.


We call a Boolean expression pure if it uses only the predefined connectives
and doesn’t use any other function symbols.
In order to save brackets, one uses the convention that ¬ binds stronger
than ∧ and that ∧ binds stronger than ∨. Thus, ¬x1 ∧ x2 ∨ x3 is an abbreviation
for
¬x1 ∧ x2 ∨ x3 = ((¬x1 ) ∧ x2 ) ∨ x3 .
We denote expressions e depending on variables x = x[1 : n] by e(x). Variables
xi can take values in B. Thus, x = x[1 : n] can take values in Bn . We denote
the result of evaluation of expression e ∈ BE with a bit-string a ∈ Bn of inputs
by e(a) and get a straightforward set of rules for evaluating expressions:

1. Substitute ai for xi :
xi (a) = ai .
2. If e = (¬e′ ), then evaluate e(a) by evaluating e′ (a) and negating the result
according to the predefined meaning of negation in Table 3:

(¬e′ )(a) = ¬e′ (a) .

3. If e = (e′ ◦ e″ ), then evaluate e(a) by evaluating e′ (a) and e″ (a) and then
combining the results according to the predefined meaning of ◦ in Table 3:

(e′ ◦ e″ )(a) = e′ (a) ◦ e″ (a) .
4. Expressions of the form e = fi (e1 , . . . , eni ) can only be evaluated if the
symbol fi has an interpretation as a function

fi : Bni → B .

In this case evaluate fi (e1 , . . . , eni )(a) by evaluating arguments ej (a), sub-
stituting the result into f and evaluating f :

fi (e1 , . . . , eni )(a) = fi (e1 (a), . . . , eni (a)) .

The following small example shows that this very formal and detailed set of
rules captures our usual way of evaluating expressions:

(x1 ∧ x2 )(0, 1) = x1 (0, 1) ∧ x2 (0, 1)


= 0∧1
=0.

Boolean equations, therefore, are written as

e = e′ ,

where e and e′ are expressions involving variables x = x[1 : n]. They come in
two flavors:
• Identities. An equation e = e′ is an identity iff for any substitution of the
variables a = a[1 : n] ∈ Bn , expressions e and e′ evaluate to the same value
in B:
∀a ∈ Bn : e(a) = e′ (a) .
• Equations which one wants to solve. A substitution a = a[1 : n] ∈ Bn
solves equation e = e′ if e(a) = e′ (a).
We observe that identities and equations we want to solve do differ formally
in the implicit quantification. If not stated otherwise, we usually assume equa-
tions to be of the first type, i.e., to be implicitly quantified over all free vari-
ables. This is also the case with definitions of functions, where the left-hand

side of an equation represents an entity being defined. For instance, the fol-
lowing definition of the function

f (x1 , x2 ) = x1 ∧ x2

is the same as
∀a, b ∈ B : f (a, b) = a ∧ b .
We may also write
e ≡ e′
to stress that a given equation is an identity or to avoid brackets in case
this equation is a definition and the right-hand side itself contains an equality
sign.
In case we talk about several equations in a single statement (this is often
the case when we solve equations), we assume implicit quantification over the
whole statement rather than over every single equation. For instance,

e1 = e2 ↔ e3 = 0

is the same as
∀a ∈ Bn : e1 (a) = e2 (a) ↔ e3 (a) = 0
and means that, for any given substitution a, equations e1 and e2 evaluate
to the same value if and only if equation e3 evaluates to 0. In other words,
equations e1 = e2 and e3 = 0 have the same set of solutions.
In Boolean algebra there is a very simple connection between the solution
of equations and identities. An identity e ≡ e′ holds iff equations e = 1 and
e′ = 1 have the same set of solutions.
Lemma 2.16 (identity from solving equations). Given Boolean expres-
sions e and e′ with inputs x[1 : n], we have

e ≡ e′ ↔ ∀a ∈ Bn : (e(a) = 1 ↔ e′ (a) = 1) .

Proof. The direction from left to right is trivial. For the other direction we
distinguish cases:
• e(a) = 1. Then e′ (a) = 1 by hypothesis.
• e(a) = 0. Then e′ (a) = 1 would by hypothesis imply the contradiction
e(a) = 1. Because in Boolean algebra e′ (a) ∈ B we conclude e′ (a) = 0.
Thus, we have e(a) = e′ (a) for all a ∈ Bn . 


2.6.1 Identities

In this section we provide a list of useful identities of Boolean algebra.



• Commutativity:

x1 ∧ x2 ≡ x2 ∧ x1
x1 ∨ x2 ≡ x2 ∨ x1
x1 ⊕ x2 ≡ x2 ⊕ x1

• Associativity:

(x1 ∧ x2 ) ∧ x3 ≡ x1 ∧ (x2 ∧ x3 )
(x1 ∨ x2 ) ∨ x3 ≡ x1 ∨ (x2 ∨ x3 )
(x1 ⊕ x2 ) ⊕ x3 ≡ x1 ⊕ (x2 ⊕ x3 )

• Distributivity:

x1 ∧ (x2 ∨ x3 ) ≡ (x1 ∧ x2 ) ∨ (x1 ∧ x3 )


x1 ∨ (x2 ∧ x3 ) ≡ (x1 ∨ x2 ) ∧ (x1 ∨ x3 )

• Identity:

x1 ∧ 1 ≡ x1
x1 ∨ 0 ≡ x1

• Idempotence:

x1 ∧ x1 ≡ x1
x1 ∨ x1 ≡ x1

• Annihilation:

x1 ∧ 0 ≡ 0
x1 ∨ 1 ≡ 1

• Absorption:

x1 ∨ (x1 ∧ x2 ) ≡ x1
x1 ∧ (x1 ∨ x2 ) ≡ x1

• Complement:

x1 ∧ ¬x1 ≡ 0
x1 ∨ ¬x1 ≡ 1

• Double negation:
¬¬x1 ≡ x1

Table 4. Verifying the first of de Morgan’s laws


x1 x2 x1 ∧ x2 ¬(x1 ∧ x2 ) ¬x1 ¬x2 ¬x1 ∨ ¬x2
0 0 0 1 1 1 1
0 1 0 1 1 0 1
1 0 0 1 0 1 1
1 1 1 0 0 0 0

• De Morgan’s laws:
¬(x1 ∧ x2 ) ≡ ¬x1 ∨ ¬x2
¬(x1 ∨ x2 ) ≡ ¬x1 ∧ ¬x2

Each of these identities can be proven in a simple brute force way: if the
identity has n variables, then for each of the 2n possible substitutions of the
variables the left and right hand sides of the identities are evaluated with the
help of Table 3. If for each substitution the left hand side and the right hand
side evaluate to the same value, then the identity holds. For the first of de
Morgan’s laws this is illustrated in Table 4.
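The brute force method is easily mechanized. The following Python sketch (names are ours) enumerates all substitutions and confirms the first of de Morgan's laws, just as Table 4 does.

```python
# Brute-force identity check: enumerate all 2^n substitutions of the variables.
from itertools import product

def is_identity(e1, e2, n):
    """e1 ≡ e2 iff both evaluate to the same value for every a in B^n."""
    return all(e1(*a) == e2(*a) for a in product((0, 1), repeat=n))

# first of de Morgan's laws: ¬(x1 ∧ x2) ≡ ¬x1 ∨ ¬x2
lhs = lambda x1, x2: 1 - (x1 & x2)
rhs = lambda x1, x2: (1 - x1) | (1 - x2)
assert is_identity(lhs, rhs, 2)
```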

2.6.2 Solving Equations

We consider expressions e and ei (where 1 ≤ i ≤ n), involving a vector


of variables x. We derive three basic lemmas about the solution of Boolean
equations. For a ∈ B we define

e^a =  e     a = 1
       ¬e    a = 0 .

Inspection of the semantics of ¬ in Table 3 immediately gives the rule for
solving negation.
Lemma 2.17 (solving negation). Given a Boolean expression e(x) and a ∈
B, we have
e^a = 1 ↔ e = a .
Inspection of the semantics of ∧ in Table 3 gives
(e1 ∧ e2 ) = 1 ↔ e1 = 1 ∧ e2 = 1 .
Induction on n results in the rule for solving conjunction.
Lemma 2.18 (solving conjunction). Given Boolean expressions ei (x),
where 1 ≤ i ≤ n, we have

( ∧_{i=1}^{n} ei ) = 1 ↔ ∀i ∈ [1 : n] : ei = 1 .

From the semantics of ∨ in Table 3, we have

(e1 ∨ e2 ) = 1 ↔ e1 = 1 ∨ e2 = 1 .

Induction on n yields the rule for solving disjunction.


Lemma 2.19 (solving disjunction). Given Boolean expressions ei , where
1 ≤ i ≤ n, we have

( ∨_{i=1}^{n} ei ) = 1 ↔ ∃i ∈ [1 : n] : ei = 1 .

2.6.3 Disjunctive Normal Form

Let f : Bn → B be a switching function6 and let e be a Boolean expression


with variables x. We say that e computes f iff the identity f (x) ≡ e holds.

Lemma 2.20 (computing switching function by Boolean expression).


Every switching function is computed by some Boolean expression:

∀f : Bn → B : ∃e : f (x) ≡ e .

Moreover, expression e is pure.


Proof. Let b ∈ B and let xi be a variable. We define the literal

xi^b =  xi     b = 1
        ¬xi    b = 0 .

Then by Lemma 2.17,

xi^b = 1 ↔ xi = b . (2)
Let a = a[1 : n] ∈ Bn and let x = x[1 : n] be a vector of variables. We define
the monomial
m(a) = ∧_{i=1}^{n} xi^{ai} .

Then,

m(a) = 1 ↔ ∀i ∈ [1 : n] : xi^{ai} = 1     (Lemma 2.18)
         ↔ ∀i ∈ [1 : n] : xi = ai         (2)
         ↔ x = a .

Thus, we have
m(a) = 1 ↔ x = a . (3)
6
The term switching function comes from electrical engineering and stands for a
Boolean function.

We define the support S(f ) of f as the set of arguments a, where f takes the
value f (a) = 1:
S(f ) = {a | a ∈ Bn ∧ f (a)} .
If the support is empty, then e = 0 computes f . Otherwise we set

e = ∨_{a ∈ S(f )} m(a) .

Then,

e = 1 ↔ ∃a ∈ S(f ) : m(a) = 1 (Lemma 2.19)


↔ ∃a ∈ S(f ) : a = x (3)
↔ x ∈ S(f )
↔ f (x) = 1 .

Thus, equations e = 1 and f (x) = 1 have the same solutions. By Lemma 2.16
we conclude
e ≡ f (x) .



The expression e constructed in the proof of Lemma 2.20 is called the com-
plete disjunctive normal form of f . As an example, we consider the complete
disjunctive normal forms of the sum function s and the carry function c′ defined in Table 2:

c′ (a, b, c) ≡ ¬a ∧ b ∧ c ∨ a ∧ ¬b ∧ c ∨ a ∧ b ∧ ¬c ∨ a ∧ b ∧ c (4)
s(a, b, c) ≡ ¬a ∧ ¬b ∧ c ∨ ¬a ∧ b ∧ ¬c ∨ a ∧ ¬b ∧ ¬c ∨ a ∧ b ∧ c . (5)

Simplified Boolean expressions for the same functions are

c′ (a, b, c) ≡ a ∧ b ∨ b ∧ c ∨ a ∧ c
s(a, b, c) ≡ a ⊕ b ⊕ c .

The correctness can be checked in the usual brute force way by trying all 8
assignments of values in B3 to the variables of the expressions, or by applying
the identities listed in Sect. 2.6.1.
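The construction in the proof of Lemma 2.20 can be executed directly. The following Python sketch (names are ours) computes the support of a switching function and prints its complete disjunctive normal form; applied to the carry function it reproduces (4).

```python
# Complete disjunctive normal form, built as in the proof of Lemma 2.20.
from itertools import product

def carry(a, b, c):                   # c'(a, b, c): the majority function
    return (a & b) | (a & c) | (b & c)

def support(f, n):
    """S(f) = {a in B^n | f(a) = 1}; its elements are the monomials of the DNF."""
    return [a for a in product((0, 1), repeat=n) if f(*a) == 1]

def monomial_str(a):
    return " ∧ ".join(("x%d" % (i + 1)) if bit else ("¬x%d" % (i + 1))
                      for i, bit in enumerate(a))

def eval_dnf(monomials, x):
    # m(a) = 1 iff x = a; the DNF is the disjunction over all monomials
    return int(any(tuple(x) == a for a in monomials))

n = 3
e = support(carry, n)
print(" ∨ ".join(monomial_str(a) for a in e))
assert all(eval_dnf(e, x) == carry(*x) for x in product((0, 1), repeat=n))
```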

In the remainder of this book, we return to the usual mathematical notation,


using the equality sign for both identities and equations to be solved. We will
only use the equivalence sign when defining predicates with the equality sign
in the right-hand side. Whether we deal with identities or whether we solve
equations will (hopefully) be clear from the context.
3 Hardware

In Sect. 3.1 we introduce the classical model of digital circuits. This includes
the classical definition of the depth d(g) of a gate g in a circuit as the length
of a longest path from an input of the circuit to the gate. For the purpose
of timing analysis in a later section, we also introduce the function sp(g)
measuring the length of a shortest path from an input of the circuit to gate
g. We present the classical proof by pigeon hole principle that the depth of
gates is well defined. By induction on the depth of gates we then conclude the
classical result that the semantics of switching circuits is well defined.
A few basic digital circuits are presented for later use in Sect. 3.2. This is
basically the same collection of circuits as presented in [12].
In Sect. 3.3 we introduce two hardware models: i) the usual digital model
consisting of digital circuits and 1-bit registers as presented in [11, 12] and
ii) a detailed model involving propagation delays, set up and hold times as
presented, e.g., in [10,14]. Working out the proof sketch from [10], we formalize
timing analysis and show by induction on depth that, with proper timing
analysis, the detailed model is simulated by the digital model. This justifies
the use of the digital model as long as we use only gates and registers.
In the very simple Sect. 3.4 we define n-bit registers which are composed
of 1-bit registers in order to use them as single components h.R of hardware
configurations h.
As we aim at the construction of memory systems, we extend in Sect.
3.5 both circuit models with open collector drivers, tristate drivers, buses,
and a model of main memory. As new parameters we have to consider in the
detailed model the enable and disable times of drivers. Also – as main memory
is quite slow – we have to postulate that, during accesses to main memory, its
input signals should be stable in the detailed model, i.e., there should be no
glitches on the input signals to the memory. We proceed to construct digital
interface circuitry for controlling buses and main memory and show in the
detailed model that, with that circuitry, buses and main memory are properly
operated. These results permit to abstract buses, main memory and their
interface circuitry to the digital model. So in later constructions, we only


have to worry about proper operation of the interface circuitry in the digital
world, and we do not have to reconsider the detailed model.
For readers who suspect that we might be paranoid, we also prove that
the proof obligations for the interface circuitry which we impose on the digital
model cannot be derived from the digital model itself. Indeed, we construct
a bus control that i) is provably free of contention in the digital model and
which ii) has – for common technology parameters – bus contention for about
1/3 of the time. As “bus contention” translates in the real world to “short
circuit”, such circuits destroy themselves.
Thus, the introduction of the detailed hardware model solves a very real
problem which is usually ignored in previous work, such as [12]. One would
suspect that the absence of such an argument leads to constructions that
malfunction in one way or another. In [12], buses are replaced by multiplexers,
so bus control is not an issue. But the processor constructed there has caches
for data and instructions, which are backed up by a main memory. Moreover,
the interface circuitry for the instruction cache as presented in [12] might
produce glitches1 . On the other hand, the (digital) design in [12] was formally
verified in subsequent work [1,3]; it was also put on a field programmable gate
array (FPGA) and ran correctly immediately after the first power up without
ever producing an error. If there are no malfunctions where one would worry
about them in view of later insights, one looks for explanations. It turned
out that the hardware engineer who transferred the design to the FPGA had
made a single change to the design (without telling anybody about it): he
had put a register stage in front of the main memory. In the physical design,
this register served as interface circuitry to the main memory and happened
to conform to all conditions presented in Sect. 3.5. Thus, although the digital
portion of the processor was completely verified, the design in the book still
contained a bug, which is only visible in the detailed model. The bug was
fixed without proof (indeed without being recognized) by the introduction of
the register stage in front of the memory. In retrospect, in 2001 the design
was not completely verified; that it ran immediately after power up involved
luck.
A few constructions for control automata are presented for later use in Sect.
3.6. This is basically the same collection of automata as presented in [12].

3.1 Digital Gates and Circuits

In a nutshell, we can think of hardware as consisting of three kinds of compo-


nents which are interconnected by wires: gates, storage elements, and drivers.
Gates are: AND-gates, OR-gates, ⊕-gates, and inverters. In circuit schematics
we use the symbols from Fig. 2.

1
In the design from [12] the glitches can be produced on the instruction memory
address by the multiplexer between pc and dpc as described in Chap. 7.
Fig. 2. Symbols for gates in circuit schematics: AND (output a ∧ b), OR (a ∨ b), XOR (a ⊕ b), and inverter (ā)

Fig. 3. Illustration of inputs and outputs of a circuit C: inputs 1, 0, x0 , . . . , xn−1 feed a network of wires and gates producing the outputs y0 , . . . , yt−1

A circuit C consists of a finite set G of gates2 , a sequence of input signals


x[n − 1 : 0], a set N of wires that connect them, as well as a sequence of
output signals y[t − 1 : 0] ⊆ Sig(C) chosen from all signals of circuit C (as
illustrated in Fig. 3). Special inputs 0 and 1 are always available to be used
in a circuit.
The signals Sig(C) of the circuit consist of the inputs

In = {xn−1 , . . . , x0 , 0, 1}

and of (the outputs of) the gates

Sig(C) = In ∪ G .

Depending on its type, every gate has one or two inputs which are connected
to signals of the circuit. We denote the input signals connected to a gate g ∈ G
of a circuit C by in1(g), in2(g) for gates with two inputs (AND, OR, ⊕) and
by in1(g) for gates with a single input (inverter). Note that we denote the
output signal of a gate g ∈ G simply by g.
At first glance it is very easy to define how a circuit should work. For a cir-
cuit C, we define the values s(a) of signals s ∈ Sig(C) for a given substitution
a = a[n − 1 : 0] ∈ Bn of the input signals:
2
Intuitively, the reader may think of g ∈ G consisting of two parts, one that
uniquely identifies the particular gate of the circuit (e.g., a name) and another
that specifies the type of the gate (AND, OR, ⊕, inverter).

Fig. 4. Examples of cycles in circuits

1. If s = xi is an input, then

∀i ∈ [n − 1 : 0] : xi (a) = ai .

2. If s is an inverter, then
s(a) = ¬in1(s)(a) .
3. If s is a ◦-gate with ◦ ∈ {∧, ∨, ⊕}, then

s(a) = in1(s)(a) ◦ in2(s)(a) .

Unfortunately, this is not always a definition. For counterexamples, see Fig.


4. Due to the cycles, one cannot find an order in which the above definition
can be applied. Fortunately, defining and then forbidding cycles solves the
problem.
A path from s0 to sm in C is a sequence of signals (s[0 : m]) such that for
all i < m we have
si = in1(si+1 ) ∨ si = in2(si+1 ) .
The length ℓ(s[0 : m]) of this path is

ℓ(s[0 : m]) = m .

The path is a cycle if s0 = sm . One requires circuits to be free of cycles.


Lemma 3.1 (length of a path in a circuit). Every path (without cycles)
in a circuit with set G of gates has length at most #G.
Proof. By contradiction. Assume a path s[0 : k] with k > #G exists in the
circuit. All si are gates except possibly s0 which might be an input. Thus, a
gate must be (at least) twice on the path:

∃i, j : i < j ∧ si = sj .

Then s[i : j] is a cycle3 . 




3
This proof uses the so called pigeonhole principle. If k + 1 pigeons are sitting in
k holes, then one hole must have at least two pigeons.

Since every path in a circuit has finite length, one can define for each signal
s the depth d(s) of s as the number of gates on a longest path from an input
to s:
d(s) = max{m | ∃ path s[0 : m] : s0 ∈ In ∧ sm = s} .
For later use we also define the length sp(s) of a shortest such path as

sp(s) = min{m | ∃ path s[0 : m] : s0 ∈ In ∧ sm = s} .

The definitions imply that d and sp satisfy

d(s) =  0                                     s ∈ In
        d(in1(s)) + 1                         s is an inverter
        max{d(in1(s)), d(in2(s))} + 1         otherwise

sp(s) = 0                                     s ∈ In
        sp(in1(s)) + 1                        s is an inverter
        min{sp(in1(s)), sp(in2(s))} + 1       otherwise .

By straightforward induction, we show that the output of a circuit is well-


defined.
Lemma 3.2 (well-defined circuit output). Let d(s) = n, then output s(a)
of the circuit is well defined.
Proof. By induction on n. If n = 0, then s is an input and s(a) is clearly well
defined by the first rule. If n > 0, then we have d(in1(s)) < n. If s is not
an inverter, we also have d(in2(s)) < n. By induction hypothesis in1(s)(a)
and in2(s)(a) (for the case if s is not an inverter) are well defined. We now
conclude that s(a) is well defined by the second and third rules. 
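As an illustration of the circuit model, the following Python sketch (the dictionary representation of a circuit and all names are ours) computes d(s), sp(s), and the signal values s(a) for a small cycle-free circuit, namely a full adder.

```python
# A full adder as a cycle-free circuit: signal -> (gate type, inputs).
circuit = {
    "axb":  ("xor", "a", "b"),
    "s":    ("xor", "axb", "c"),
    "ab":   ("and", "a", "b"),
    "t":    ("and", "axb", "c"),
    "cout": ("or",  "ab", "t"),
}
inputs = {"a", "b", "c", "0", "1"}

def d(s):       # depth: length of a longest path from an input to s
    return 0 if s in inputs else 1 + max(d(z) for z in circuit[s][1:])

def sp(s):      # length of a shortest path from an input to s
    return 0 if s in inputs else 1 + min(sp(z) for z in circuit[s][1:])

def value(s, env):
    if s in ("0", "1"):
        return int(s)
    if s in inputs:
        return env[s]
    op, *args = circuit[s]
    v = [value(z, env) for z in args]
    return {"and": v[0] & v[1], "or": v[0] | v[1],
            "xor": v[0] ^ v[1], "not": 1 - v[0]}[op]

env = {"a": 1, "b": 1, "c": 0}
assert (value("cout", env), value("s", env)) == (1, 0)   # 1 + 1 + 0
assert d("cout") == 3 and sp("cout") == 2
```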


3.2 Some Basic Circuits


Boolean expressions can be translated into circuits in a very intuitive way. In
Fig. 5(b) we have translated the simple formulas from (4) for c′ (a, b, c) and
s(a, b, c) into a circuit. With inputs (a, b, c) and outputs (c′ , s) this circuit
satisfies
⟨c′ , s⟩ = a + b + c .
A circuit satisfying this condition is called a full adder. We use the symbol
from Fig. 5(a) to represent this circuit in subsequent constructions. When the
b-input of a full adder is known to be zero, the specification simplifies to

⟨c′ , s⟩ = a + c .

The resulting circuit is called a half adder. Symbol and implementation are
shown in Fig. 6. The circuit in Fig. 7(b) is called a multiplexer or short mux.
Fig. 5. Full adder: (a) symbol, (b) implementation

Fig. 6. Half adder: (a) symbol, (b) implementation

Its inputs and outputs satisfy

z =  x    s = 0
     y    s = 1 .

For multiplexers we use the symbol from Fig. 7(a). The n-bit multiplexer or
short n-mux in Fig. 8(b) consists of n multiplexers with a common select
signal s. Its inputs and outputs satisfy

z[n − 1 : 0] =  x[n − 1 : 0]    s = 0
                y[n − 1 : 0]    s = 1 .

For n-muxes we use the symbol from Fig. 8(a). Figure 9(a) shows the symbol
for an n-bit inverter. Its inputs and outputs satisfy

y[n − 1 : 0] = ¬x[n − 1 : 0] .

n-bit inverters are simply realized by n separate inverters as shown in


Fig. 9(b). For ◦ ∈ {∧, ∨, ⊕}, Fig. 10(a) shows symbols for n-bit ◦-gates. Their
Fig. 7. Multiplexer: (a) symbol, (b) implementation

Fig. 8. n-bit multiplexer: (a) symbol, (b) implementation

Fig. 9. n-bit inverter: (a) symbol, (b) implementation

inputs and outputs satisfy


z[n − 1 : 0] = x[n − 1 : 0] ◦ y[n − 1 : 0]
u[n − 1 : 0] = v ◦ y[n − 1 : 0] .
n-bit ◦-gates are simply realized in the first case by n separate ◦-gates as shown
in Fig. 10(b). In the second case all left inputs of the gates are connected to
the same input v. An n-bit ◦-tree has inputs a[n − 1 : 0] and a single output
b satisfying
b = ◦_{i=1}^{n} ai .
Fig. 10. Gates for n-bit wide inputs: (a) symbol, (b) implementation

Fig. 11. Implementation of an n-bit ◦-tree for ◦ ∈ {∧, ∨, ⊕}

Recursive construction is shown in Fig. 11.


The inputs a[n − 1 : 0] and outputs zero and nzero of an n-zero tester
shown in Fig. 12 satisfy

zero ≡ (a = 0n )
nzero ≡ (a ≠ 0n ) .

The implementation uses

nzero(a[n − 1 : 0]) = ∨_{i=0}^{n−1} ai ,    zero = ¬nzero .
Fig. 12. n-bit zero tester: (a) symbol, (b) implementation

Fig. 13. n-bit equality tester: (a) symbol, (b) implementation

The inputs a[n − 1 : 0], b[n − 1 : 0] and outputs eq, neq of an n-bit equality
tester in Fig. 13 satisfy

eq ≡ (a = b) ,    neq ≡ (a ≠ b) .

The implementation uses

neq(a[n − 1 : 0]) = nzero(a[n − 1 : 0] ⊕ b[n − 1 : 0]) ,    eq = ¬neq .

An n-decoder is a circuit with inputs x[n − 1 : 0] and outputs y[2n − 1 : 0]
satisfying
∀i : yi = 1 ↔ ⟨x⟩ = i .
A recursive construction with k = ⌈n/2⌉ is shown in Fig. 14. For the correctness,
one argues in the induction step
Fig. 14. Implementation of an n-bit decoder

Fig. 15. Recursive construction of an n-bit half decoder

y[i · 2k + j] = 1 ↔ V [i] = 1 ∧ U [j] = 1                    (construction)
              ↔ ⟨x[n − 1 : k]⟩ = i ∧ ⟨x[k − 1 : 0]⟩ = j      (ind. hypothesis)
              ↔ ⟨x[n − 1 : k]x[k − 1 : 0]⟩ = i · 2k + j .    (Lemma 2.9)

An n-half decoder has inputs x[n − 1 : 0] and outputs y[2n − 1 : 0] satisfying


y = 0^(2^n − ⟨x⟩) 1^⟨x⟩ ,

i.e., input x is interpreted as a binary number and decoded into a unary num-
ber. The remaining output bits are filled with zeros. A recursive construction
of n-half decoders is shown in Fig. 15. For the construction of n-half decoders
from (n − 1)-half decoder, we divide the index range into upper and lower
half:
L = [2n−1 − 1 : 0] , H = [2n − 1 : 2n−1 ] .
Also we divide x[n − 1 : 0] into the leading bit xn−1 and the low order bits
x′ = x[n − 2 : 0] .
In the induction step we then conclude

Y [H] ◦ Y [L] = xn−1 ∧ U [L] ◦ (xn−1 ∨ U [L])

              =  0^(2^(n−1)) ◦ 0^(2^(n−1) − ⟨x′⟩) 1^⟨x′⟩              : xn−1 = 0
                 0^(2^(n−1) − ⟨x′⟩) 1^⟨x′⟩ ◦ 1^(2^(n−1))              : xn−1 = 1

              =  0^(2^n − ⟨x′⟩) 1^⟨x′⟩                                : xn−1 = 0
                 0^(2^n − (2^(n−1) + ⟨x′⟩)) 1^(2^(n−1) + ⟨x′⟩)        : xn−1 = 1

              = 0^(2^n − ⟨xn−1 x′⟩) 1^⟨xn−1 x′⟩
              = 0^(2^n − ⟨x⟩) 1^⟨x⟩ .
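The recursive construction of Fig. 15 can be checked by a small executable model. The following Python sketch (a functional model, not a hardware description; names are ours) builds the n-half decoder recursively and verifies y = 0^(2^n − ⟨x⟩) 1^⟨x⟩ for small n.

```python
# Functional model of the recursive n-half decoder of Fig. 15.
def half_decoder(x):
    """x = x[n-1:0] with bit i at index i; returns y[2^n - 1 : 0], bit j at index j."""
    n = len(x)
    if n == 1:
        return [x[0], 0]                   # y1 y0 = 0 x0
    u = half_decoder(x[:-1])               # U[L]: (n-1)-half decoder of x' = x[n-2:0]
    xm = x[-1]                             # leading bit x_{n-1}
    low  = [xm | ui for ui in u]           # Y[L] = x_{n-1} ∨ U[L]
    high = [xm & ui for ui in u]           # Y[H] = x_{n-1} ∧ U[L]
    return low + high                      # Y[H] ◦ Y[L], low indices first

for n in range(1, 5):
    for v in range(1 << n):
        x = [(v >> i) & 1 for i in range(n)]
        y = half_decoder(x)
        assert y == [1] * v + [0] * ((1 << n) - v)   # y = 0^(2^n - <x>) 1^<x>
```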
An n-parallel prefix circuit P P◦ (n) for an associative function ◦ : B × B → B
is a circuit with inputs x[n − 1 : 0] and outputs y[n − 1 : 0] satisfying
y0 = x0 , yi+1 = xi+1 ◦ yi . (6)
For even n a recursive construction of an n-parallel prefix circuit based on
◦-gates is shown in Fig. 16. For odd n one can realize an (n − 1)-bit parallel
prefix from Fig. 16 and compute output yn−1 as
yn−1 = xn−1 ◦ yn−2
using one extra ◦-gate.
For the correctness of the construction, we first observe that

x′i = x2i+1 ◦ x2i
y2i = x2i ◦ y′i−1
y2i+1 = y′i .
We first show that odd outputs of the circuit satisfy (6). For i = 0 we have

y1 = y′0            (construction)
   = x′0            (ind. hypothesis P P◦ (n/2))
   = x1 ◦ x0 .      (construction)

For i > 0 we conclude

y2i+1 = y′i                       (construction)
      = x′i ◦ y′i−1               (ind. hypothesis P P◦ (n/2))
      = (x2i+1 ◦ x2i ) ◦ y′i−1    (construction)
      = x2i+1 ◦ (x2i ◦ y′i−1 )    (associativity)
      = x2i+1 ◦ y2i .             (construction)
Fig. 16. Recursive construction of an n-bit parallel prefix circuit of the function ◦ for an even n

Fig. 17. Recursive construction of an (n, A)-OR tree

For even outputs of the circuit, we easily conclude

y0 = x0                       (construction)
i > 0 → y2i = x2i ◦ y′i−1     (construction)
            = x2i ◦ y2i−1 .   (construction)

An (n, A)-OR tree has A many input vectors b[i] ∈ Bn with i ∈ [0 : A − 1],
where b[i][j] with j ∈ [0 : n − 1] is the j-th bit of input vector b[i]. The outputs
of the circuit out[n − 1 : 0] satisfy
out[j] = ∨_{i=0}^{A−1} b[i][j] .

The implementation of (n, A)-OR trees, for the special case where A is a power
of two, is shown in Fig. 17.
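As an executable illustration of the parallel prefix construction of Fig. 16, the following Python sketch (a functional model; names are ours) implements the recursion, including the extra gate for odd n, and compares the result against specification (6) for ◦ = ∨.

```python
# Recursive parallel prefix PP_◦(n); ◦ is passed as a binary function on bits.
import operator

def pp(op, x):
    """x = x[n-1:0] with bit i at index i; returns y with y_0 = x_0, y_{i+1} = x_{i+1} ◦ y_i."""
    n = len(x)
    if n == 1:
        return [x[0]]
    if n % 2 == 1:                               # odd n: one extra gate for y_{n-1}
        y = pp(op, x[:-1])
        return y + [op(x[-1], y[-1])]
    xp = [op(x[2 * i + 1], x[2 * i]) for i in range(n // 2)]   # x'_i = x_{2i+1} ◦ x_{2i}
    yp = pp(op, xp)                                            # y' = PP_◦(n/2)(x')
    y = []
    for i in range(n // 2):
        y.append(x[2 * i] if i == 0 else op(x[2 * i], yp[i - 1]))   # even outputs
        y.append(yp[i])                                             # odd outputs
    return y

x = [1, 0, 1, 1, 0, 1, 0, 0]
y = pp(operator.or_, x)
assert y == [int(any(x[: i + 1])) for i in range(len(x))]   # y_i = x_i ∨ ... ∨ x_0
```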

3.3 Clocked Circuits


We introduce two computational models in which processors are constructed
and their correctness is proven. We begin with the usual digital hardware
model, where time is counted in hardware cycles and signals are binary-valued.
Afterwards, we present a more general, detailed hardware model that is moti-
vated by the data sheets of hardware manufacturers. There, time is real-valued
and signals may assume the digital values in B as well as a third value Ω. The
detailed hardware model allows arguing about
• hardware with multiple clock domains4 , e.g., in the real time systems that
control cars or airplanes [10, 14], and
• the presence and absence of glitches. Glitches are an issue in the construc-
tion of memory systems: accesses to dynamic RAM tend to take several
hardware cycles and inputs have to be constant in the digital sense and free
of glitches during this time. The latter requirement cannot be expressed
in the usual digital model, and thus the lemmas establishing their absence
in our construction would be isolated from the remainder of the theory
without the detailed model.
We explain how timing analysis is performed in the detailed model and then
show that, with proper timing analysis, the digital model is an abstraction
of the detailed model. Thus, in the end, all constructions are correct in the
detailed model. Where glitches do not matter – i.e., everywhere except the
access to dynamic RAM – we can simply work in the usual and much more
comfortable digital model (without having to resort to the detailed hardware
model to prove the absence of glitches).

3.3.1 Digital Clocked Circuits

A digital clocked circuit, as illustrated in Fig. 18, has four components:


• a special reset input,
• special inputs 0 and 1,
• a sequence x[n − 1 : 0] of 1-bit registers, and
• a circuit with inputs x[n − 1 : 0], reset, 0, and 1 and outputs x[n − 1 : 0]in
and x[n − 1 : 0]ce.

4
In this book we do not present such hardware.
Fig. 18. A digital clocked circuit. Every output signal x[i]in of circuit c is the data input of the corresponding register x[i] and every output x[i]ce produced by circuit c is the clock enable input of the corresponding register

Each register x[i] has


• a data input x[i]in,
• a clock enable input x[i]ce, and
• a register value x[i] which is also the output signal of the register.
In the digital model we assume that register values as well as all other signals
always are in B.
A hardware configuration h of a clocked circuit is a snapshot of the current
values of the registers:
h = x[n − 1 : 0] ∈ Bn .
A hardware computation is a sequence of hardware configurations where the
next configuration h′ is computed from the current configuration h and the
current value of the reset signal by a next hardware configuration function δH :

h′ = δH (h, reset) .

In a hardware computation, we count cycles (steps of the digital model) using


natural numbers t ∈ N ∪ {−1}. The hardware configuration in cycle t of a
hardware computation is denoted by ht = xt [n − 1 : 0] and the value of signal
y during cycle t is denoted by y t .
The values of the reset signal are fixed. Reset is on in cycle −1 and off
ever after:
Fig. 19. Simple clocked circuit with a single register


resett =  1    t = −1
          0    t ≥ 0 .
At power up, register values are binary but unknown. We denote this sequence
of unknown binary values at startup by a[n − 1 : 0]:
x−1 [n − 1 : 0] = a[n − 1 : 0] ∈ Bn .
The current value of a circuit signal y in cycle t is defined according to the
previously introduced circuit semantics:

yt =  ¬in1(y)t              y is an inverter
      in1(y)t ◦ in2(y)t     y is a ◦-gate .

Let x[n − 1 : 0]int and x[n − 1 : 0]cet be the register input and clock enable
signals computed from the current configuration xt [n − 1 : 0] and the current
value of the reset signal resett . Then the register value xt+1 [i] of the next
hardware configuration xt+1 [n − 1 : 0] = δH (xt [n − 1 : 0], resett ) is defined as

xt+1 [i] =  x[i]int    x[i]cet = 1
            xt [i]     x[i]cet = 0 ,

i.e., when the clock enable signal of register x[i] is active in cycle t, the register
value of x[i] in cycle t + 1 is the value of the data input signal in cycle t;
otherwise, the register value does not change.
As an example, consider the digital clocked circuit from Fig. 19. There is
only one register, thus we abbreviate x = x[0]. For cycle −1 we have
x−1 = a[0]
reset−1 = 1
xce−1 = 1
xin−1 = 0 .

Hence, x0 = 0. For cycles t ≥ 0 we have

resett = 0
xcet = 1
xint = yt = ¬xt .

Hence, we get xt+1 = ¬xt . An easy induction on t shows that

∀t ≥ 0 : xt = (t mod 2) .
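The digital model of this example is easy to simulate. The following Python sketch (a behavioural model of the circuit of Fig. 19 under the stated reset convention; names are ours) reproduces xt = t mod 2.

```python
# Digital-model simulation of the single-register circuit of Fig. 19.
def next_state(x, reset):
    xce = 1                        # the register is clocked in every cycle
    xin = 0 if reset else 1 - x    # during reset the input is 0, afterwards ¬x
    return xin if xce else x

x = 1                              # unknown binary value a[0] at power up; pick 1
trace = []
for t in range(-1, 6):             # cycle -1 has reset = 1, afterwards reset = 0
    reset = 1 if t == -1 else 0
    x = next_state(x, reset)
    trace.append(x)                # register value in cycle t + 1

assert trace == [0, 1, 0, 1, 0, 1, 0]   # x^t = t mod 2 for t >= 0
```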

3.3.2 The Detailed Hardware Model

In the detailed hardware model, time is real-valued. Circuit signals y (which


include register outputs) are functions

y : R → {0, 1, Ω}

where Ω stands for an either undefined or metastable value.


A circuit in the detailed hardware model is clocked by a clock signal ck
which alternates between 0 and 1 in a regular fashion. When the clock signal
switches from 0 to 1, we call this a clock edge. In order to arrive at the
abstraction of the digital hardware model later, we count clock edges – clock
edge i marks the start of cycle i in the detailed model. The circuit clock has
two parameters:
• the time γ where clock edge 0 occurs,
• the cycle time τ between consecutive clock edges.
For c ∈ N ∪ {−1} this defines the position e(c) of clock edge c as

e(c) = γ + c · τ .

Inspired by data sheets from hardware manufacturers, registers and gates have
six timing parameters:
• ρ: the minimal propagation delay of register outputs after clock edges,
• σ: the maximal propagation delay of register outputs after clock edges (we
require 0 ≤ ρ < σ),
• ts: setup time of register input and clock enable before clock edges,
• th: hold time of register input and clock enable after clock edges,
• α: minimal propagation delay of gates, and
• β: maximal propagation delay of gates5 (we require 0 < α < β).

5
Defining such delays from voltage levels of electrical signals is nontrivial and can
go wrong in subtle ways. For the deduction of a negative propagation delay from
the data of a very serious hardware catalogue, see [5].
Fig. 20. Detailed timing of a register x[i] with stable inputs and ce = 1

This is a simplification. Setup and hold times can be different for register
inputs and clock enable signals. Also, the propagation delays of different types
of gates are, in general, different. Generalizing our model to this situation is
easy but requires more notation.
Let y be any signal. The requirement that this signal satisfies the setup
and hold times of registers at clock edge c is defined by

stable(y, c) ↔ ∃a ∈ B : ∀t ∈ [e(c) − ts, e(c) + th] : y(t) = a .

The behavior of a register x[i] with stable input and clock enable at edge t is
illustrated in Fig. 20.
For c ∈ N ∪ {−1} and t ∈ (e(c) + ρ, e(c + 1) + ρ], we define the register
value x[i](t) and output at time t by a case split:
• Clocking the register at edges c ≥ 0. The clock enable signal is 1 at edge
e(c) and the setup and hold times for the input and clock enable signals
are met:
x[i]ce(e(c)) ∧ stable(x[i]in, c) ∧ stable(x[i]ce, c) .
Then the data input at edge e(c) becomes the new value of the register,
and it becomes visible (at the latest) at time σ after clock edge e(c).
• Not clocking the register at edges c ≥ 0. The clock enable signal is 0 at
edge e(c) and the setup and hold times for it are met:

¬x[i]ce(e(c)) ∧ stable(x[i]ce, c) .

The output stays unchanged for the entire period.


• Register initialization. This is happening when the reset signal (reset(t))
is high. In this situation we assume that the register x[i] outputs some
value a[i] ∈ B, which is unknown but is not Ω.
• Any other situation, where the voltage cannot be guaranteed to be recog-
nized as a known logical 0 or 1. This includes i) the transition period from
Fig. 21. Detailed timing of the reset signal

ρ to σ after regular clocking, and ii) the entire time interval if there was a
violation of the stability conditions of any kind. Usually, a physical register
will settle in this situation quickly into an unknown logical value, but in
rare occasions the register can “hang” at a voltage level not recognized as
0 or 1 for a long time. This is called ringing or metastability.
Formally, we define the register semantics of the detailed hardware model in
the following way:


x[i](t) =  a[i]             reset(t)
           x[i]in(e(c))     t ∈ [e(c) + σ, e(c + 1) + ρ] ∧ stable(x[i]in, c)
                            ∧ stable(x[i]ce, c) ∧ x[i]ce(e(c)) ∧ ¬reset(t)
           x[i](e(c))       t ∈ (e(c) + ρ, e(c + 1) + ρ] ∧ stable(x[i]ce, c)
                            ∧ ¬x[i]ce(e(c)) ∧ ¬reset(t)
           Ω                otherwise .

Notice that during regular clocking in, the output is unknown between e(c)+ρ
and e(c)+σ. This is the case even if x[i]in(e(c)) = x[i](e(c)), i.e., when writing
the same value the register currently contains. In this case a glitch on the
register output can occur. A glitch (or a spike) is a situation when a signal
has the same digital value x ∈ B in cycle t and t + 1 but in the physical model
it temporarily has a value not recognized as x. The only way to guarantee
constant register outputs during the time period is not to clock the register
during that time.
We require the reset signal to behave like an output of a register which is
clocked at cycles −1 and 0 and is not clocked afterwards (Fig. 21):


reset(t) =  1    t ∈ [e(−1) + σ, e(0) + ρ]
            Ω    t ∈ (e(0) + ρ, e(0) + σ)
            0    otherwise .

Special signals 1 and 0 are always said to be 1 or 0 respectively:


1(t) = 1
0(t) = 0 .
Fig. 22. Detailed timing of a gate y with two inputs

We show a simple lemma, which guarantees that the output from a register
does not have any glitches if this register is not clocked.
Lemma 3.3 (glitch-free output of non-clocked register). Assume stable
data is clocked into register x[i] at edge e(c):
stable(x[i]in, c) ∧ stable(x[i]ce, c) ∧ x[i]ce(e(c)) .
Assume further that the register is not clocked for the following K − 1 clock
edges:
∀k ∈ [1 : K − 1] : stable(x[i]ce, c + k) ∧ ¬x[i]ce(e(c + k)) .
Then the value x[i]in(e(c)) is visible at the output of register x[i] from time
e(c) + σ to e(c + K) + ρ:
∀t ∈ [e(c) + σ, e(c + K) + ρ] : x[i](t) = x[i]in(e(c)) .
Proof. One shows by an easy induction on k
∀t ∈ [e(c) + σ, e(c + 1) + ρ] : x[i](t) = x[i]in(e(c))
and
∀k ∈ [1 : K − 1] : ∀t ∈ (e(c + k) + ρ, e(c + k + 1) + ρ] : x[i](t) = x[i]in(e(c)) .


For the definition of the value y(t) of gates g at time t in the detailed model,
we distinguish three cases (see Fig. 22):
• Regular signal propagation. Here, all input signals are binary and stable for
the maximal propagation delay β before t. For inverters y this is captured
by the following predicate:
reg(y, t) ↔ ∃a ∈ B : ∀t′ ∈ [t − β, t] : in1(y)(t′ ) = a .
Gate y in this case outputs ¬a at time t. For ◦-gates y we define
reg(y, t) ↔ ∃a, b ∈ B : ∀t′ ∈ [t − β, t] : in1(y)(t′ ) = a ∧ in2(y)(t′ ) = b .
Then gate y outputs a ◦ b at time t.
Fig. 23. Illustrating the proof of Lemma 3.4

• Signal holding. Here, signal propagation is not regular anymore but it was
regular at some time during the minimal propagation delay α before t:

hold(y, t) ↔ ¬reg(y, t) ∧ ∃t′ ∈ [t − α, t] : reg(y, t′ ) .

The gate y in this case still holds the old value y(t′ ) at time t. We will
show that the value y(t′ ) is well defined for all t′ .

Lemma 3.4 (well-defined signal holding value). Assume hold(y, t) and


t1 , t2 ∈ [t − α, t] ∧ reg(y, t1 ) ∧ reg(y, t2 ). Then we have

y(t1 ) = y(t2 ) .

Proof. The proof is illustrated in Fig. 23. Without loss of generality, we have
t1 < t2 . Let z ∈ {in1(y), in2(y)} be any input of y. From reg(y, t1 ) we infer

∃a ∈ B : ∀t′ ∈ [t1 − β, t1 ] : z(t′ ) = z(t1 ) = a .

From
0 < t2 − t 1 < α < β
we infer
t2 − β < t1 < t2 .
Thus,
t1 ∈ (t2 − β, t2 ).
From reg(y, t2 ) we get

∀t′ ∈ [t2 − β, t2 ] : z(t′ ) = z(t2 )

and hence,
z(t2 ) = z(t1 ) = a .
For ◦-gates y we have

y(t1 ) = in1(t1 ) ◦ in2(t1 )


= in1(t2 ) ◦ in2(t2 )
= y(t2 ) .

For inverters the argument is equally simple. 




For values t satisfying hold(y, t), we define lreg(y, t) as the last value t′ before
t when signal propagation was regular:

lreg(y, t) = max{t′ | t′ < t ∧ reg(y, t′ )} .

Now we can complete the definition of the value of gate y at time t:




y(t) =  ¬in1(y)(t)              reg(y, t) ∧ y is an inverter
        in1(y)(t) ◦ in2(y)(t)   reg(y, t) ∧ y is a ◦-gate
        y(lreg(y, t))           hold(y, t)
        Ω                       otherwise .

3.3.3 Timing Analysis

Timing analysis is performed in the detailed model in order to ensure that all
register inputs x[i]in and clock enables x[i]ce are stable at clock edges. We
capture the conditions for correct timing by

∀i, c : stable(x[i]in, c) ∧ stable(x[i]ce, c) .

After a reminder that d(y) and sp(y) are the lengths of longest and shortest
paths from the inputs to y, we define the minimal and the maximal propaga-
tion delays of arbitrary signals y relative to the clock edges:

tmin(y) = ρ + sp(y) · α,
tmax(y) = σ + d(y) · β .

Note that, in the definitions of tmin(y) and tmax(y), we overestimate propa-


gation delays of signals which are calculated from the special input signals 0
and 1.6
In what follows, we define a sufficient condition for correct timing and
show that with this condition detailed and digital circuits simulate each other
in the sense that for all signals y the value y c in the digital model during cycle

6
The input signals 0 and 1 of a circuit do in fact have no propagation delay.
However, giving a precise definition that takes this into account would make
things unnecessarily complicated here since we would need to define and argue
about the longest and shortest path without the 0/1 signals. Instead, we prefer
to overestimate and keep things simple by using already existing definitions.

c equals the value y(e(c + 1)) at the end of the cycle. In other words, with
correct timing the digital model is an abstraction of the detailed model:

y c = y(e(c + 1)) .

Lemma 3.5 (timing and simulation). If for all signals y we have

∀y : tmax(y) + ts ≤ τ

and if for all inputs x[i]in and clock enable signals x[i]ce of registers we have

∀i : th ≤ tmin(x[i]in) ∧ th ≤ tmin(x[i]ce),

then
1. ∀y, c : ∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] : y(t) = y c ,
2. ∀i, c : c ≥ 0 → stable(x[i]in, c) ∧ stable(x[i]ce, c).

Proof. By induction on c. For each c, we show statement 1 with the help of


the following auxiliary lemma.

Lemma 3.6 (timing and simulation, 1 cycle). Let statement 1 of Lemma


3.5 hold in cycle c for all signals with depth 0:

∀t ∈ [e(c) + σ, e(c + 1) + ρ] : x[i](t) = x[i]c ,

then the same statement holds for all signals y in cycle c:

∀y : ∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] : y(t) = y c .

Proof. By induction on the depth d(y) of signals. Let the statement hold for
signals of depth d − 1 and let y be a ◦-gate of depth d. We show that it holds
for y.
Consider Fig. 24. There are inputs z1 , z2 of y such that

d(y) = d(z1 ) + 1 ∧ sp(y) = sp(z2 ) + 1 .

Hence,

tmax(y) = tmax(z1 ) + β
tmin(y) = tmin(z2 ) + α .

By induction we have for all inputs z of y

∀t ∈ [e(c) + tmax(z), e(c + 1) + tmin(z)] : z(t) = z c .

Since we have

tmin(z2 ) ≤ tmin(z) ∧ tmax(z) ≤ tmax(z1 )


Fig. 24. Computing tmin(y) and tmax(y)

for all inputs z of y, we get

[e(c) + tmax(y) − β, e(c + 1) + tmin(y) − α]


= [e(c) + tmax(z1 ), e(c + 1) + tmin(z2 )]
⊆ [e(c) + tmax(z), e(c + 1) + tmin(z)]

and thus,

∀t ∈ [e(c) + tmax(y) − β, e(c + 1) + tmin(y) − α] : z(t) = z c .

We conclude

∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y) − α] : reg(y, t)

and

y(t) = in1(y)(t) ◦ in2(y)(t)


= in1(y)c ◦ in2(y)c (ind. hypothesis)
= yc .

Now we have to show that

∀t ∈ (e(c + 1) + tmin(y) − α, e(c + 1) + tmin(y)] : y(t) = y c .

We choose an arbitrary t in this interval and do a case split on whether the


signal is regular or not. In case reg(y, t) holds (actually, it never does, but we
don’t bother showing that), we have for all inputs z of y

∀t′ ∈ [t − β, t] : z(t) = z(t′ ) .


Fig. 25. Stability of register input y

Since α < β, we have

e(c + 1) + tmin(y) − α ∈ [t − β, t]

and conclude for all inputs z of y

z(t) = z(e(c + 1) + tmin(y) − α) = z c ,

which implies

y(t) = in1(y)c ◦ in2(y)c


= yc .

In case reg(y, t) doesn’t hold, we get hold(y, t). We have to show that

y(lreg(y, t)) = y c .

If lreg(y, t) ∈ (e(c+1)+tmin(y)−α, t), then we observe that reg(y, lreg(y, t))


holds and proceed in the same way as in the first case. Otherwise, we have
lreg(y, t) = e(c + 1) + tmin(y) − α, which also implies

y(lreg(y, t)) = y c .




We now continue with the proof of Lemma 3.5. For the induction base c = −1
and the signals coming from the registers x[i] with d(x[i]) = 0, we have

tmin(x[i]) = ρ ∧ tmax(x[i]) = σ .

From the initialization rules in the digital and detailed models we get

∀t ∈ [e(−1) + σ, e(0) + ρ] : x[i](t) = x[i]−1 = a[i] ∈ B

and conclude the proof of part 1 by Lemma 3.6. For part 2 there is nothing
to show.

For the induction step we go from c to c + 1 and first show part 2 of


the lemma. Consider Fig. 25. For register inputs y = x[i]in and clock enable
signals y = x[i]ce we have from the induction hypothesis:
∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] : y(t) = y c .
From the lemma’s assumptions we get for all y = x[i]in, y = x[i]ce
∀y : th ≤ tmin(y) ∧ tmax(y) + ts ≤ τ ,
which implies
e(c) + tmax(y) ≤ e(c) + τ − ts
= e(c + 1) − ts,
e(c + 1) + tmin(y) ≥ e(c + 1) + th .
Thus,
[e(c + 1) − ts, e(c + 1) + th] ⊆ [e(c) + tmax(y), e(c + 1) + tmin(y)]
and
∀t ∈ [e(c + 1) − ts, e(c + 1) + th] : y(t) = y c .
We conclude stable(y, c + 1), which shows part 2 for c + 1.
Since all input and clock enable signals for register x[i] are stable at clock
edge c + 1, we get from the register semantics of the detailed model for all
t ∈ [e(c + 1) + σ, e(c + 2) + ρ]:

x[i](t) =  x[i]in(e(c + 1))    x[i]ce(e(c + 1)) = 1
           x[i](e(c + 1))      x[i]ce(e(c + 1)) = 0 .
Observing that for all y we have
e(c + 1) ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] ,
we conclude with part 1 of the induction hypothesis:
x[i](e(c + 1)) = x[i]c
x[i]in(e(c + 1)) = x[i]inc
x[i]ce(e(c + 1)) = x[i]cec .
Finally, from the register semantics of the digital model we conclude for t ∈
[e(c + 1) + σ, e(c + 2) + ρ]:

x[i](t) =  x[i]inc    x[i]cec = 1
           x[i]c      x[i]cec = 0
        = x[i]c+1 .
This shows part 1 of the lemma for all signals y with d(y) = 0. Applying
Lemma 3.6, we get part 1 for all other circuit signals and conclude the proof.
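Timing analysis as used in Lemma 3.5 amounts to computing tmax(y) and tmin(y) over the circuit and checking the two hypotheses. The following Python sketch does this for a small example circuit; the timing parameters and the circuit are invented for illustration only.

```python
# Timing analysis sketch: tmax(y) = σ + d(y)·β, tmin(y) = ρ + sp(y)·α.
rho, sigma = 0.1, 0.4      # register propagation delays (ρ < σ)
alpha, beta = 0.2, 0.5     # gate propagation delays (α < β)
ts, th = 0.3, 0.1          # setup and hold time
tau = 3.0                  # cycle time to be checked

circuit = {                # signal -> (gate type, inputs); registers have depth 0
    "inv":  ("not", "x0"),
    "g1":   ("and", "inv", "x1"),
    "x0in": ("or",  "g1", "x1"),      # register input (and clock enable) signals
}
registers_in = ["x0in"]
depth0 = {"x0", "x1", "0", "1"}

def d(s):    return 0 if s in depth0 else 1 + max(d(z) for z in circuit[s][1:])
def sp(s):   return 0 if s in depth0 else 1 + min(sp(z) for z in circuit[s][1:])
def tmax(s): return sigma + d(s) * beta
def tmin(s): return rho + sp(s) * alpha

all_signals = list(circuit) + list(depth0)
ok_cycle = all(tmax(s) + ts <= tau for s in all_signals)   # tmax(y) + ts ≤ τ
ok_hold  = all(th <= tmin(s) for s in registers_in)        # th ≤ tmin(x[i]in)
print(ok_cycle, ok_hold)   # True True for these parameters
```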


Fig. 26. An n-bit register

3.4 Registers
So far we have shown that there is one basic hardware model, namely the
detailed one, but with correct timing it can be abstracted to the digital model
(Lemma 3.5). From now on we assume correct timing and stick to the usual
digital model unless we need to prove properties not expressible in this model
– like the absence of glitches.
Although all memory components can be built from 1-bit registers, it is
inconvenient to refer to all memory bits in a computer by numbering them
with an index i of a clocked circuit input x[i]. It is more convenient to deal
with hardware configurations h and to gather groups of such bits into certain
memory components h.M . For M we introduce n-bit registers h.R. In Chap.
4 we add to this no less than 9 (nine) random access memory (RAM) designs.
As before, in a hardware computation with memory components, we have
ht+1 = δH (ht , resett ) .
An n-bit register R consists simply of n many 1-bit registers R[i] with a
common clock enable signal Rce as shown in Fig. 26.
Register configurations are n-tuples:
h.R ∈ Bn .
Given input signals Rin(ht ) and Rce(ht ), we obtain from the semantics of the
basic clocked circuit model:

ht+1 .R =  Rin(ht )    Rce(ht ) = 1
           ht .R       Rce(ht ) = 0 .
Recall that, from the initialization rules for 1-bit registers, after power up
register content is binary but unknown (metastability is extremely rare):
h0 .R ∈ Bn .

3.5 Drivers and Main Memory


In order to deal with main memory and its connection to caches and processor
cores, we introduce several new hardware components: tristate drivers, open
Fig. 27. Open collector driver and its timing diagram

collector drivers, and main memory. For hardware consisting only of gates,
inverters, and registers, we have shown in Lemma 3.5 that a design that works
in the digital model also works in the detailed hardware model. For tristate
drivers and main memory this will not be the case.

3.5.1 Open Collector Drivers and Active Low Signal

A single open collector driver y and its detailed timing is shown in Fig. 27. If
the input yin is 0, then the open collector driver also outputs 0. If the input
is 1, then the driver is disabled. In detailed timing diagrams, an undefined
value due to disabled outputs is usually drawn as a horizontal line in the
middle between 0 and 1. In the jargon of hardware designers this is called the
high impedance state or high Z or simply Z. In order to specify behavior and
operating conditions of open collector and tristate drivers, we have to permit
Z as a signal value for drivers y. Thus, we have

y : R → {0, 1, Ω, Z} .

For the propagation delay of open collector drivers, we use the same parame-
ters α and β as for gates. Regular signal propagation is defined the same way
as for inverters:

reg(y, t) ↔ ∃a ∈ B : ∀t′ ∈ [t − β, t] : yin(t′ ) = a .

The signal y generated by a single open collector driver is defined as




y(t) =  0                 reg(y, t) ∧ yin(t) = 0
        Z                 reg(y, t) ∧ yin(t) = 1
        y(lreg(y, t))     hold(y, t)
        Ω                 otherwise .

In contrast to other gates, it is allowed to connect the outputs of drivers by


wires which are often called buses. Fig. 28 shows k open collector drivers yi
with inputs yi in driving a bus b. If all the drivers connected to the open
collector bus are disabled, then in the physical design a pull-up resistor drives
1 on the bus. The bus value b(t) is then determined as
Fig. 28. Open collector drivers yi connected by a bus b



b(t) =  0    ∃i : yi (t) = 0
        1    ∀i : yi (t) = Z
        Ω    otherwise .

In the digital model, we simply get



bt = ∧_{i} yi int ,

but this abstracts away an important detail: glitches on a driver input can
propagate to the bus, for instance when other drivers are disabled. This will
not be an issue for the open collector buses constructed here. It is, however,
an issue in the control of real time buses [10].
By de Morgan’s law, one can use open collector buses together with some
inverters to compute the logical OR of signals ui :

b = ∧_{i} ūi = ¬( ∨_{i} ui ) . (7)

In control logic, it is often equally easy to generate or use an “active high”
signal u or its inverted “active low” version ū. By (7), open collector buses
compute an active low OR b of control signals ui without extra cost, if the
active low versions ūi are available.
n-bit open collector drivers are simply n open collector drivers in parallel.
Symbol and construction are shown in Fig. 29.
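In the digital model an open collector bus simply computes the AND of the driver inputs. The following Python sketch (names are ours) illustrates this together with the active low OR of (7).

```python
# Digital model of an open collector bus: a driver pulls the bus to 0 when its
# input is 0 and is disabled (Z) when its input is 1, so b = AND of the inputs.
def oc_bus(driver_inputs):
    b = 1
    for yin in driver_inputs:
        b &= yin
    return b

u = [0, 1, 0, 1]                      # active high control signals u_i
u_bar = [1 - ui for ui in u]          # active low versions driven onto the bus
b = oc_bus(u_bar)                     # b = AND of the inverted signals
assert b == 1 - int(any(u))           # i.e. b = ¬(u_1 ∨ ... ∨ u_k), an active low OR
```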

3.5.2 Tristate Drivers and Bus Contention

Tristate drivers y are controlled by output enable signals yoe7 . Symbol and
timing are shown in Fig. 30. Only when the output enable signal is active, a
tristate driver propagates the data input yin to the output y. Like ordinary
switching, enabling and disabling tristate drivers involves propagation delays.

7
Like clock enable signals, we model them as active high, but in data sheets for
real hardware components they are usually active low.
Fig. 29. Symbol and construction of an n-bit open collector driver

Fig. 30. Tristate driver and its timing diagram

Ignoring propagation delays, a tristate driver computes the following function:



tr(in, oe) =  in    oe = 1
              Z     oe = 0 .

For simplicity, we use the same timing parameters as for gates. Regular signal
propagation is defined as for gates:

reg(y, t) ↔ ∃a, b ∈ B : ∀t′ ∈ [t − β, t] : yin(t′ ) = a ∧ yoe(t′ ) = b .

The signal y generated by a single tristate driver is then defined as




y(t) =  tr(yin(t), yoe(t))    reg(y, t)
        y(lreg(y, t))         hold(y, t)
        Ω                     otherwise .

Observe that a glitch on an output enable signal can produce a glitch in signal
y. In contrast to glitches on open collector buses this will be an issue in our
designs involving main memory.
Like open collector drivers, the outputs of tristate drivers can be connected
via so called tristate buses. The clean way to operate a tristate bus b with
Fig. 31. Tristate drivers yi connected by a bus b

Fig. 32. Switching enable signals of drivers at the same clock edge

drivers yi as shown in Fig. 31 is to allow at any time t at most one driver to


produce a signal different from Z:

yi (t) ≠ Z ∧ yj (t) ≠ Z → i = j . (8)

If this invariant is maintained, the following definition of the bus value b(t) at
time t is well defined:

b(t) =  yi (t)    ∃i : yi (t) ≠ Z
        Z         otherwise .

The invariant excludes a design like in Fig. 32, where drivers y0 and y1 are
switched on and off at the same clock edge8 . In order to understand the
possible problem with such a design consider a rising clock edge when R0 =
y0 oe is turned on and R1 = y1 oe is turned off. This can lead to a situation as
shown in Fig. 33.
There, we assume that the propagation delay of R0 is ρ = 1 and the
propagation delay of R1 is σ = 2. Similarly, assume that the enable time of y0
8
This is not unheard of in practice.
Fig. 33. Possible timing when enable signals are switched at the same clock edge

Fig. 34. Output stage of a tristate driver as a pair of variable resistors

Table 5. Output y of a driver regulated by adjustment of two resistors R1 and R2 .

R1 R2 y
low high 1
high low 0
high high Z
low low short circuit

is α = 1 and the disable time of y1 is β = 2. The resulting signals at a rising


edge of clock ck are shown in the detailed timing diagram in Fig. 33. Note
that for 2 ≤ t ≤ 4 we have y0 (t) = 0 and y1 (t) = 1. This happens to produce
more problems than just a temporarily undefined bus value.
The output circuitry of a driver or a gate can be envisioned as a pair of
adjustable resistors as shown in Fig. 34. Resistor R1 is between the supply
voltage V CC and the drivers output y. The other resistor R2 is between the
output and ground GN D. Logical values 0 and 1 as high impedance state Z
can now be implemented by adjusting the values of the resistors as shown in
Table 5.
Of course the circuitry of a well designed single driver will never produce
a short circuit by adjusting both resistors to “low”. However, as shown in
Fig. 35. Short circuit via the bus b when two drivers are enabled at the same time

Fig. 35 the short circuit is still possible via the low resistance path

GN D → y0 → b → y1 → V CC .

This occurs when two drivers are simultaneously enabled and one of the drivers
drives 0 while the other driver drives 1. Exactly this situation occurs tem-
porarily in the real-valued time interval [r + 2, r + 3] after each rising clock
edge r. In the jargon of hardware designers this is called – temporary – bus
contention, which clearly sounds much better than “temporary short circuit”.
But even with the nicer name it remains of course a short circuit. In the best
case, it increases power consumption and shortens the life time of the driver.
The spikes in power consumption can have the side effect that power supply
voltage falls under specified levels; maybe not always, but sporadically when
power consumption in other parts of the hardware is high. Insufficient supply
voltage then will tend to produce sporadic non reproducible failures in other
parts of the hardware.

3.5.3 The Incomplete Digital Model for Drivers

Observe that there is a deceptively natural looking digital model of tristate
drivers which has a good and a bad part. The good part is

y =  yin    yoe = 1
     Z      otherwise .        (9)

The bad part – as we will demonstrate later – is the very natural looking
condition:

yit ≠ Z ∧ yjt ≠ Z → i = j . (10)
The good part, i.e., (9), correctly models the behavior of drivers for times after
clock edges where all propagation delays have occurred and when registers are
updated. Indeed, if we consider a bus b driven by drivers yi as a gate with
depth

d(b) = max{d(yi ) | i ∈ [1 : k]}


sp(b) = min{sp(yi ) | i ∈ [1 : k]}

we can immediately extend Lemma 3.5 to circuits with buses and drivers of
both kinds.
Lemma 3.7 (timing and simulation with drivers and buses). Assume
that (8) holds for all tristate buses and assume the correct timing

∀y : tmax(y) + ts ≤ τ

and
∀i : th ≤ tmin(x[i]in) ∧ th ≤ tmin(x[i]ce) .
Then,
1. ∀y, c : ∀t ∈ [e(c) + tmax(y), e(c + 1) + tmin(y)] : y(t) = y c ,
2. ∀i, c : c ≥ 0 → stable(x[i]in, c) ∧ stable(x[i]ce, c).
This justifies the use of the digital model as far as register update is concerned.
The lemma has, however, a hypothesis coming from the detailed model. Re-
placing it simply by what we call the bad part of the digital model, i.e., by
(10), is the highway to big trouble. First of all, observe that our design in Fig.
32, which switched enable signals at the same clock edge, satisfies it. But in
the detailed model (and the real world) we can do worse. We can construct
hardware that destroys itself by the short circuits caused by bus contention
but which is contention free according to the (bad part of) the digital model.

3.5.4 Self Destructing Hardware

In what follows we will do some arithmetic on time intervals [a, b] where signals
change. In our computations of these time bounds we use the following rules:

c + [a, b] = [c + a, c + b]
c · [a, b] = [c · a, c · b]
[a, b] + [c, d] = [a + c, b + d]
c + (a, b) = (c + a, c + b)
c · (a, b) = (c · a, c · b) .

Lemma 3.8 (self destructing hardware). For any ε > 0 there is a design
satisfying (10) which produces continuous bus contention for at least a fraction
α/β − ε of the total time.
Fig. 36. Generating a pulse of arbitrary width by a sufficiently long delay line

t
u

u
t1

v
t2 t3

Fig. 37. Timing diagram for a pulse generator

Proof. The key to the construction is the parametrized design of Fig. 36. The
timing diagram in Fig. 37 shows that the entire design produces a pulse of
length growing with c; hence, we call it a c-pulse generator.
Signal u goes up at time t. The chain of c AND gates just serves as a delay
line. The result is finally inverted. Thus, the inverted signal ¬u falls in time interval t1 with

t1 = t + (c + 1) · [α, β] .

The final AND gate produces a pulse v with a rise time in interval t2 and a
fall time in interval t3 satisfying

t2 = t + [α, β]
t3 = t + (c + 2) · [α, β] .

Note that, in the digital model, we have for all cycles t

    v^t = u^t ∧ ¬u^t = 0 ,

which is indeed correct after propagation delays are over – and that is all
the digital model captures. Now consider the design in Fig. 38. In the digital
model, v1 and v2 are always zero. The only driver ever enabled in the digital
model is y3 . Thus, the design satisfies (10) for the digital model.
Now consider the timing diagram in Fig. 39. At each clock edge T , one of
registers Ri has a rising edge in time interval

t1 = T + [ρ, σ] ,

which generates a pulse with rising edge in time interval t2 and falling edge
in time interval t3 satisfying
Fig. 38. Generating contention with two pulse generators

Fig. 39. Timing analysis for the period of bus contention from t4 to t5

t2 = T + [ρ, σ] + [α, β]
t3 = T + [ρ, σ] + (c + 2) · [α, β] .

Driver yi then enables in time interval t4 and disables in time interval t5 ,


satisfying

t4 = T + [ρ, σ] + 2 · [α, β]
t5 = T + [ρ, σ] + (c + 3) · [α, β] .

We choose a cycle time

τ (c) = σ + (c + 3) · β ,
Fig. 40. Symbol and implementation of a set-clear flip-flop

such that the timing diagram fits exactly into one clock cycle. In the next
cycle, we then have the same situation for the other register and driver. We
have contention on bus b at least during time interval

C = T + [σ + 2 · β, ρ + (c + 3) · α]

of length
    ℓ(c) = ρ + (c + 3) · α − (σ + 2 · β) .
Asymptotically we have

    lim_{c→∞} ℓ(c)/τ(c) = α/β .

Thus, we choose c such that

    ℓ(c)/τ(c) ≥ α/β − ε

and the lemma follows. □



For common technologies, the fraction α/β is around 1/3. Thus, we are talking
about a short circuit for roughly 1/3 of the time. We should mention that
this will overheat drivers to an extent that the packages of the chips tend to
explode.

3.5.5 Clean Operation of Tristate Buses

We now construct control logic for tristate buses. We begin with a digital
specification, construct a control logic satisfying this (incomplete) specifica-
tion and then show in the detailed model i) that the bus is free of contention
and ii) that signals are free of glitches while we guarantee their presence on
the bus.
As a building block of the control, we use the set-clear flip-flops from
Fig. 40. This is simply a 1-bit register which is set to 1 by activation of the
set signal and to 0 by activation of the clr signal (without activation of the
set signal). During reset, i.e., during cycle −1, the flip-flops are forced to zero.
Fig. 41. Registers Rj connected to a bus b by tristate drivers yj

    R^0 = 0

    R^{t+1} = { 1     set^t
                0     ¬set^t ∧ clr^t
                R^t   otherwise
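As an illustration (not part of the formal development), the digital behavior of a set-clear flip-flop can be mirrored by a few lines of Python; the function name next_R and the sample trace below are our own:

```python
# Sketch of the set-clear flip-flop transition rule: reset forces 0, set wins over clr.
def next_R(R, set_, clr):
    if set_:
        return 1
    if clr:                 # only effective when set is off
        return 0
    return R

R = 0                       # R^0 = 0 after reset
trace = [(1, 0), (0, 0), (0, 1), (0, 0)]   # (set, clr) per cycle
values = []
for s, c in trace:
    R = next_R(R, s, c)
    values.append(R)
assert values == [1, 1, 0, 0]
```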

We consider a situation as shown in Fig. 41 with registers Rj connected to a


bus b by tristate drivers yj for j ∈ [1 : k].
For i ∈ N we aim at intervals

Ti = [ai : bi ]

of cycles, s.t., ai ≤ bi , and a function

send : N → [1 : k]

specifying for each i ∈ N the unique index j = send(i) such that register Rj
is “sending” on the bus during “time” interval Ti :

    b^t = { R_j^t   ∃i : j = send(i) ∧ t ∈ T_i
            Z       otherwise .

The sending register Rj is clocked at cycle ai − 1 and is not clocked in the


time interval [ai : bi − 1]:

send(i) = j → Rj ceai −1 ∧ ∀t ∈ [ai : bi − 1] : ¬Rj cet .

At the end of a time interval Ti we consider two possible scenarios:


• Unit j = send(i) continues to operate on the bus in cycle bi + 1. In this
case we have
ai+1 = bi + 1 ∧ send(i + 1) = send(i) .
• Unit j gives up the bus in cycle bi . In order to guarantee that there is
enough time for activating signals in between two time intervals in this
case we require
bi + 1 < ai+1 .
Fig. 42. Generation of output enable signals y_j oe by set-clear flip-flops

Fig. 43. Idealized timing of clean tristate bus control

For the first time interval, we require

0 < a0 .

As shown in Fig. 42, control signals yj oe are generated as outputs of set-clear


flip-flops which in turn are controlled by signals yj oeset and yj oeclr. The rule
for generation of the latter signals is simple: for intervals Ti during which yj
is enabled (j = send(i)), the output enable signal is set in cycle ai − 1 (unless
unit j was sending in cycle ai − 1 = bi−1 ) and cleared in cycle bi (unless unit
j will be sending in cycle bi + 1 = ai+1 ):

    y_j oeset^t ≡ ∃i : send(i) = j ∧ t = a_i − 1 ∧ t ≠ b_{i−1}

    y_j oeclr^t ≡ ∃i : send(i) = j ∧ t = b_i ∧ t ≠ a_{i+1} − 1 .
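To make the rule concrete, here is a small Python sketch (purely illustrative; the interval and unit numbers are invented) that computes in which cycles the output enable flip-flop of a unit j is set and cleared for given access intervals T_i and a given send function:

```python
# Cycles in which yj_oe is set / cleared, following the two formulas above.
def oe_set_cycles(intervals, send, j):
    return {a - 1 for i, (a, b) in enumerate(intervals)
            if send[i] == j and not (i > 0 and a - 1 == intervals[i - 1][1])}

def oe_clr_cycles(intervals, send, j):
    return {b for i, (a, b) in enumerate(intervals)
            if send[i] == j and not (i + 1 < len(intervals) and b == intervals[i + 1][0] - 1)}

# unit 1 sends in T_0 = [3,5] and T_1 = [6,8] (back to back), unit 2 in T_2 = [11,12]
intervals = [(3, 5), (6, 8), (11, 12)]
send = [1, 1, 2]
assert oe_set_cycles(intervals, send, 1) == {2}   # not set again at cycle 5
assert oe_clr_cycles(intervals, send, 1) == {8}   # not cleared at cycle 5
```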

An example of idealized timing of the tristate bus control is shown in Fig. 43.
Unit send(i) is operating on the bus in two consecutive time intervals Ti and
Ti+1 (the value of the register Rsend(i) can be updated in the last cycle of the
first interval) and then its driver is disabled. Between the end bi+1 of interval
Ti+1 and the start ai+2 of the next interval Ti+2 , there is at least one cycle
where no driver is enabled in the digital model.
In the digital model, we immediately conclude
Fig. 44. Timing diagram for clean bus control in case unit j is sending in the interval Ti and is giving up the bus afterwards. Timing of signals yj oeset, yj oeclr, and Rj ce are idealized. Other timings are detailed

    y_j oe^t ≡ ∃i : send(i) = j ∧ t ∈ T_i

    b^t = { R_j^t   ∃i : send(i) = j ∧ t ∈ T_i
            Z       otherwise ,

as required in the digital specification. In the detailed model, we can show


more. Before we do that, recall that e(t) is the time of the clock edge starting
cycle t. Because we are arguing about cycles and time simultaneously we
denote cycles with q and times with t.
Lemma 3.9 (tristate bus control). Let a tristate bus be controlled by the
logic designed in this section. Then,
• after time t ≥ e(0) + σ + β there is no bus contention:

    t ≥ e(0) + σ + β ∧ y_i(t) ≠ Z ∧ y_j(t) ≠ Z → i = j ,

• if j = send(i), then the content of Rj is glitch free on the bus roughly


during Ti :

j = send(i) → ∀t ∈ [e(ai ) + σ + β, e(bi + 1) + ρ + α] : b(t) = Rjai .

Proof. Note that the hypotheses of this lemma are all digital. Thus, we can
prove them entirely in the digital world.
Consider the timing diagram in Fig. 44. For the outputs of the set-clear
flip-flop yj oe, we get after reset

e(0) + σ ≤ t ≤ e(1) + ρ → yj oe(t) = 0 .


For t > e(1) + ρ we get

    y_j oe(t) = { Ω   ∃i : send(i) = j ∧ t ∈ e(a_i) + (ρ, σ) ∧ a_i − 1 ≠ b_{i−1}
                  1   ∃i : send(i) = j ∧ t ∈ e(a_i) + (ρ, σ) ∧ a_i − 1 = b_{i−1}
                  1   ∃i : send(i) = j ∧ t ∈ [e(a_i) + σ, e(b_i + 1) + ρ]
                  Ω   ∃i : send(i) = j ∧ t ∈ e(b_i + 1) + (ρ, σ) ∧ b_i + 1 ≠ a_{i+1}
                  1   ∃i : send(i) = j ∧ t ∈ e(b_i + 1) + (ρ, σ) ∧ b_i + 1 = a_{i+1}
                  0   otherwise .
For the outputs y_j of the drivers, it follows that after reset

    e(0) + σ + β ≤ t ≤ e(1) + ρ + α → y_j(t) = Z .

For t > e(1) + ρ we get

    y_j(t) ≠ Z → ∃i : send(i) = j ∧ t ∈ [e(a_i) + ρ + α, e(b_i + 1) + σ + β] .

Hence,

    y_j(t) ≠ Z → ∃i : send(i) = j ∧ t ∈ (e(a_i), e(b_i + 2)) .

The first statement of the lemma is fulfilled because our requirements on the
time intervals where different units are operating on the bus imply

    e(1) ≤ e(a_0) ∧ e(b_i + 2) ≤ e(a_{i+1}) .
For the second statement of the lemma, we observe that the signal Rj ce is
active in cycle ai − 1 and can possibly become active again only in cycle bi .
Using Lemmas 3.3 and 3.5 we conclude that

    j = send(i) ∧ t ∈ [e(a_i) + σ, e(b_i + 1) + ρ] → R_j(t) = R_j(e(a_i + 1)) = R_j^{a_i} .

We have shown already about the output enable signals

    j = send(i) ∧ t ∈ [e(a_i) + σ, e(b_i + 1) + ρ] → y_j oe(t) = 1 .

Thus, for the driver values we get

    j = send(i) ∧ t ∈ [e(a_i) + σ + β, e(b_i + 1) + ρ + α] → y_j(t) = R_j^{a_i} ≠ Z .

From the first part of the lemma we conclude for the value of the bus

    j = send(i) ∧ t ∈ [e(a_i) + σ + β, e(b_i + 1) + ρ + α] → b(t) = R_j^{a_i} . □


Figure 45 shows symbol and implementation of an n-tristate driver. This
driver consists simply of n tristate drivers with a common output enable
signal.
Fig. 45. Symbol and construction of an n-tristate driver.

3.5.6 Specification of Main Memory

As a last building block for hardware, we introduce a main memory h.mm. It


is a line addressable memory

h.mm : B29 → B64

which is accessed via a tristate bus b with the following components:


• b.d ∈ B64 . In write operations, this is a cache line to be stored in main
memory. In the last cycle of read operations, b.d contains the data read
from main memory.
• b.ad ∈ B29 . The line address of main memory operations.
• b.mmreq ∈ B. The request signal for main memory operations.
• b.mmw ∈ B. The main memory write signal, which denotes that the cur-
rent main memory request is a write.
• b.mmack ∈ B. The main memory acknowledgement signal. The main mem-
ory activates it in the last cycle of a main memory operation.
An incomplete digital specification of main memory accesses is given by the
idealized timing diagram in Fig. 46.
It is often desirable to implement some small portion of the main memory
as a read only memory (ROM) and the remaining large part as a random
Fig. 46. Timing of main memory operations

access memory (RAM)9 . The standard use for this is to store boot code in the
read only portion of the memory. Since, after power up, the memory content
of RAM is unknown, computation will not start in a meaningful way unless at
least some portion of memory contains code that is known after power up. The
reset mechanism of the hardware ensures that processors start by executing
the program stored in the ROM. This code usually contains a so called boot
loader which accesses a large and slow memory device – like a disk – to load
further programs, e.g., an operating system to be executed, from the device.
For the purpose of storing a boot loader, we assume the main memory to
behave as a ROM for addresses a = 029−r b, where b ∈ Br and r < 29.
Operating conditions of the main memory are formulated in the following
definitions and requirements:
1. Stable inputs. In general, accesses to main memory last several cycles.
During such an access, the inputs must be stable:
    b.mmreq^t ∧ ¬b.mmack^t ∧ X ∈ mmin(t) → b.X^{t+1} = b.X^t ,

where mmin(q) is the set of inputs of an access active in cycle q:

    mmin(q) = {b.ad, b.mmreq, b.mmw} ∪ { {b.d}   b.mmw^q
                                         ∅       otherwise .

2. No spurious acknowledgements. The main memory should never raise


a b.mmack signal unless the b.mmreq signal is set:
¬b.mmreq t → ¬b.mmack t .
9
Some basic constructions of static RAMs and ROMs are given in Chapter 4.

3. Memory liveness. If the inputs are stable, we may assume liveness for
the main memory, i.e., every request should be eventually served:

    b.mmreq^t → ∃t' ≥ t : b.mmack^{t'} .

We denote the cycle in which the main memory acknowledges a request


active in cycle t by

ack(t) = min{x ≥ t | b.mmack x = 1} .

4. Effect of write operations. If the inputs are stable and the write access
is on, then in the next cycle after the acknowledgement, the data from b.d
is written to the main memory at the address specified by b.ad:


    mm^{q+1}(x) = { b.d^q      x = b.ad^q ∧ b.mmack^q ∧ b.mmw^q ∧ x[28 : r] ≠ 0^{29−r}
                    mm^q(x)    otherwise .

The writes only affect the memory content if they are performed to addresses larger than 0^{29−r}1^r (a small reference-model sketch of this update rule follows the list below).
5. Effect of read operations. If the inputs are stable, then, in the last
cycle of the read access, the data from the main memory specified by b.ad
is put on b.d:

b.mmreq q ∧ ¬b.mmwq → b.dack(q) = mmq (b.adq ) .

6. Tristate driver enable. The driver mmbd connecting the main memory
to bus b.d is never enabled outside of a read access:

¬(∃q : b.mmreq q ∧ ¬b.mmwq ∧ t ∈ [q, ack(q)]) → mmbdt = Z .
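The following Python fragment is a hedged reference model of the write semantics of property 4 above (our sketch only; the ROM width R and the sample addresses are invented): a write takes effect in the cycle after the acknowledgement unless the address lies in the ROM range.

```python
R = 4                                   # hypothetical ROM address width r

def mm_step(mm, ad, data, mmack, mmw):
    """One memory-state step; mm maps line addresses to line contents."""
    rom_range = (ad >> R) == 0          # upper 29 - r address bits all zero
    if mmack and mmw and not rom_range:
        mm = dict(mm)
        mm[ad] = data
    return mm

mm = {0x10: "old"}
mm = mm_step(mm, 0x10, "new", mmack=True, mmw=True)    # ordinary write
mm = mm_step(mm, 0x03, "boot?", mmack=True, mmw=True)  # ROM address: ignored
assert mm[0x10] == "new" and 0x03 not in mm
```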

Properties 1, 5, and 6 of the digital specification are incomplete with respect to


the detailed hardware model; since the absence of glitches is of importance, we
complete the specification in the detailed hardware model with the following
three conditions:
1. Timing of inputs. We require that, during a main memory access, in-
puts to main memory have to be free of glitches. The detailed specification
has a new timing parameter, namely a main memory set up time mmts.
This setup time has to be large enough to permit a reasonable control au-
tomaton (as specified in Sect. 3.6) to compute a next state and a response
before the next clock edge.
Let the memory request be active during time q. We require input com-
ponents b.X of the bus to have the digital value b.X q from time mmts
before edge e(q + 1) until hold time th after edge e(ack(q) + 1) :

t ∈ [e(q + 1) − mmts, e(ack(q) + 1) + th] ∧ b.mmreq q


∧ X ∈ mmin(q) → b.X(t) = b.X q . (11)
Fig. 47. Registers and their drivers on bus components b.X

2. Timing of outputs. We also have to specify the timing of responses


given by main memory. We require the memory acknowledgement signal
b.mmack(t) to have digital value b.mmack c from time mmts before edge
e(c + 1) until hold time th after edge e(c + 1):

t ∈ [e(c + 1) − mmts, e(c + 1) + th] → b.mmack(t) = b.mmack c . (12)

In case there is an active read request in cycle q, we also require the


data output from the memory on bus b.d to have digital value b.dack(q)
from time mmts before edge e(ack(q) + 1) until hold time th after edge
e(ack(q) + 1):

t ∈ [e(ack(q) + 1) − mmts, e(ack(q) + 1) + th] ∧ b.mmreq q


∧ ¬b.mmwq → b.d(t) = b.dack(q) . (13)

3. Absence of bus contention. Finally, we have to define the absence of


bus contention in the detailed model so that clean operation of the tristate
bus can be guaranteed. The mmbd-driver can only be outside the high Z
from start of the cycle starting a read access until the end of the cycle
following a read access:

¬(∃q : b.mmreq q ∧ ¬b.mmwq ∧ t ∈ (e(q), e(ack(q) + 2))) →


mmbd(t) = Z . (14)

3.5.7 Operation of Main Memory via a Tristate Bus

We extend the control of the tristate bus from Sect. 3.5.5 to a control of
the components of the main memory bus. We consider k units U (j) with
j ∈ [1 : k] capable of accessing main memory. Each unit has output registers
mmreqj , mmwj , aj , and dj and an input register Qj . They are connected to
the bus b accessing main memory in the obvious way: bus components b.X
with X ∈ {ad, mmreq, mmw} occur only as inputs to the main memory. The
situation shown in Fig. 47 for unit U (j) is simply a special case of Fig. 41
with
Fig. 48. Registers and main memory with their drivers on bus component b.d

Rj = Xj
yj = Xj bd
b = b.X .

As shown in Fig. 48, bus components b.d can be driven both by the units and
by the main memory. If main memory drives the data bus, the data on the bus
can be clocked into input register Qj of unit U (j). If the data bus is driven
by a unit, the data on the bus can be stored in main memory. We treat main
memory simply as unit k + 1. Then we almost have a special case of Fig. 41
with

    R_j = d_j            if 1 ≤ j ≤ k

    y_j = { d_j bd   1 ≤ j ≤ k
            mmbd     j = k + 1

    b = b.d .

Signal b.mmack is broadcast by main memory. Thus, bus control is not neces-
sary for this signal. We want to extend the proof of Lemma 3.9 to show that
all four tristate buses given above are operated in a clean way. We also use the
statement of Lemma 3.9 to show that the new control produces memory input
without glitches in the sense of the main memory specification. The crucial
signals governing the construction of the control are the main memory request
signals mmreqj . We compute them in set-clear flip-flops; they are cleared at
reset:
∀y : mmreqy0 = 0 .
For the set and clear signals of the memory request, we use the following
discipline:

• a main memory request signal is only set when all request signals are off

mmreqj setq → ∀y : mmreqyq = 0 ,

• at most one request signal is turned on at a time (this requires some sort
of bus arbitration):

    mmreq_j set^q ∧ mmreq_{j'} set^q → j = j' ,

• a request which starts in cycle q is kept on until the corresponding ac-


knowledgement in cycle ack(q) and is then turned off

mmreqj setq−1 →
(∀x ∈ [q : ack(q) − 1] : ¬mmreqj clrx ) ∧ mmreqj clrack(q) .

Now we can define access intervals Ti = [ai : bi ]. The start cycle ai of interval
Ti is occurrence number i of the event that any signal mmreqj turns on. In
the end cycle bi , the corresponding acknowledgement occurs:

a1 = min{x ≥ 0 : ∃j : mmreqj setx } + 1


bi = ack(ai )
ai+1 = min{x > bi : ∃j : mmreqj setx } + 1 .

For bus components b.X with X ∈ {ad, mmreq, mmw}, we say that a unit
U (j) is sending in interval Ti if its request signal is on at the start of the
interval:
send(i) = j ↔ mmreqjai = 1 .
Controlling the bus components b.X with X ∈ {ad, mmreq, mmw} (which
occur only as inputs to the main memory) as prescribed in Lemma 3.910 , we
conclude

    ∀t ∈ [e(a_i) + σ + β, e(b_i + 1) + ρ + α] : b.X(t) = X_{send(i)}^{a_i} .

For the data component b.d of the bus, we define unit j to be sending if its
request signal is on in cycle a_i and the request is a write request. We define
the main memory to be sending (send'(i) = k + 1) if the request in cycle a_i
is a read request:

    send'(i) = { j       mmreq_j^{a_i} = 1 ∧ mmw_j^{a_i}
                 k + 1   ∃j : mmreq_j^{a_i} = 1 ∧ ¬mmw_j^{a_i} .

Now we control all registers dataj for j ∈ [1 : k] as prescribed in Lemma 3.9.


Absence of bus contention for component b.d follows from the proof of Lemma
10
Note that for signals X ∈ {ad, mmreq} the corresponding time intervals when
they are driven on the bus can be larger than the time interval for signal mmreq.

3.9 and (14) in the specification of the main memory. For write operations,
we conclude by Lemma 3.9:

    send'(i) ≤ k → ∀t ∈ [e(a_i) + σ + β, e(b_i + 1) + ρ + α] : b.d(t) = d_{send'(i)}^{a_i} .

Under reasonable assumptions for timing parameters and cycle time τ , this
completes the proof of (11) of the main memory specification requiring that
glitches are absent in main memory input.
Lemma 3.10 (clean operation of memory). Let ρ + α ≥ th and σ + β +
mmts ≤ τ . Then,

X ∈ mmin(ai ) ∧ t ∈ [e(ai + 1) − mmts, e(ack(ai ) + 1) + th] →


b.X(t) = b.X ai .

Equation (13) is needed for timing analysis. In order to meet set up times for
the data inputs Q_j in of registers Q_j on bus b.d, it obviously suffices if

mmts ≥ ts .

However, a larger lower bound for parameter mmts will follow from the con-
struction of particular control automata in Chap. 8.

3.6 Finite State Transducers


Control automata (also called finite state transducers) are finite automata
which produce an output in every step. Formally, a finite state transducer
M is defined by a 6-tuple (Z, z0 , I, O, δA , η), where Z is a finite set of states,
I ⊆ Bσ is a finite set of input symbols, z0 ∈ Z is called the initial state, O ⊆ Bγ
is a finite set of output symbols,

δA : Z × I → Z

is the transition function, and

η :Z ×I →O

is the output function.


Such an automaton performs steps according to the following rules:
• the automaton is started in state z0 ,
• if the automaton is in state z and reads input symbol in, then it outputs
symbol η(z, in) and goes to state δA (z, in).
If the output function does not depend on the input, i.e., if it can be written
as
η:Z→O,
Fig. 49. Graphical representation of a transition z' = δA(z, i)

the automaton is called a Moore automaton. Otherwise, it is called a Mealy


automaton.
Automata are often visualized in graphical form. We will do this too in
Sect. 8.4.3 when we construct several automata for the control of a cache
coherence protocol. State z is drawn as an ellipse with z written inside. A
state transition
z  = δA (z, i)
is visualized by an arrow from state z to state z  with label i as shown in Fig.
49. Initial states are sometimes drawn as a double circle.
In what follows, we show how to implement control automata. We start
with the simpler Moore automata and then generalize the construction to
Mealy automata.

3.6.1 Realization of Moore Automata

Let k = #Z be the number of states of the automaton. Then states can be


numbered from 0 to k − 1, and we can rename the states with numbers from
0 to k − 1, taking 0 as the initial state:

Z = [0 : k − 1] , z0 = 0 .

We code the current state z in a register S ∈ Bk by simple unary coding:

    S = code(z) ↔ ∀i : S[i] = { 1   z = i
                                0   otherwise .

A completely straightforward and naive implementation is shown in Fig. 50.


By the construction of the reset logic, we get

h0 .S = code(0) .

Circuits out (like output) and nexts are constructed such that the automaton
is simulated in the following sense: if h.S = z, i.e., state z is encoded by the
hardware, then
1. out(h) = η(z), i.e., automaton and hardware produce the same output,
2. nexts(h) = code(δA (z, in(h))), i.e., in the next cycle the hardware h .S
encodes the next state δA (z, in(h)).
Fig. 50. Naive implementation of a Moore automaton

The following lemma states correctness of the construction shown in Fig. 50.

Lemma 3.11 (Moore automaton). Let

    h.S = code(z) ∧ δA(z, in(h)) = z' .

Then,

    out(h) = η(z) ∧ h'.S = code(z') .

For all i ∈ [0 : γ − 1], we construct the i-th output simply by OR-ing together
all bits S[x] where η(x)[i] = 1, i.e., such that the i-th output is on in state x
of the automaton:

    out(h)[i] = ⋁_{x : η(x)[i]=1} h.S[x] .

A straightforward argument shows the first claim of the lemma. Assume h.S = code(z). Then,

    h.S[x] = 1 ↔ x = z .

Hence,

    out(h)[i] = 1 ↔ ⋁_{x : η(x)[i]=1} h.S[x] = 1
               ↔ ∃x : η(x)[i] = 1 ∧ h.S[x] = 1
               ↔ η(z)[i] = 1 .

Lemma 2.16 gives


out(h)[i] = η(z)[i] .
For states i, j we define auxiliary switching functions

δi,j : Bσ → B

from the transition function δA of the automaton by

δi,j (in) = 1 ↔ δA (i, in) = j ,

i.e., function δi,j (in) is on if input in takes the automaton from state i to state
j. Boolean formulas for functions δi,j can be constructed by Lemma 2.20. For
each state j, component nexts[j], which models the next state function, is
turned on in states x, which transition under input in to state j according to
the automaton’s transition function:

    nexts(h)[j] = ⋁_x h.S[x] ∧ δ_{x,j}(in(h)) .

For the second claim of the lemma, let

    h.S = code(z)
    δA(z, in(h)) = z' .

For any next state j, we then have

    nexts(h)[j] = 1 ↔ ⋁_x h.S[x] ∧ δ_{x,j}(in(h)) = 1
                    ↔ δ_{z,j}(in(h)) = 1
                    ↔ δA(z, in(h)) = j .

Hence,

    nexts(h)[j] = { 1   j = z'
                    0   otherwise .

Thus,

    code(z') = nexts(h) = h'.S .
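For readers who like to see the construction executed, the following Python sketch simulates the unary-coded implementation (illustrative only; moore_step and the toy automaton are our own names): out ORs all S[x] with η(x)[i] = 1 and nexts ORs all S[x] ∧ δ_{x,j}(in).

```python
# One hardware cycle of the naive Moore automaton realization.
def moore_step(S, inp, delta, eta, gamma):
    k = len(S)
    out = [int(any(S[x] and eta(x)[i] for x in range(k))) for i in range(gamma)]
    nexts = [int(any(S[x] and delta(x, inp) == j for x in range(k))) for j in range(k)]
    return out, nexts

# toy automaton: 2 states, 1-bit input, output = current state number
delta = lambda z, inp: (z + inp) % 2
eta = lambda z: [z]
S = [1, 0]                                 # code(0)
out, S = moore_step(S, 1, delta, eta, 1)   # out = eta(0), S -> code(1)
assert out == [0] and S == [0, 1]
```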

3.6.2 Precomputing Outputs of Moore Automata

The previous construction has the disadvantage that the propagation delay of
circuit out tends to contribute to the cycle time of the circuitry controlled by
Fig. 51. Implementation of a Moore automaton with precomputed outputs

the automaton. This can be avoided by precomputing the output signals of a Moore automaton as a function of the next state signals as shown in Fig. 51.
As above, one shows

Sin(h) = code(z) → out(h) = η(z) .

For h = h−1 the reset signal is active and we have

Sin(h−1 ) = 0k−1 1 = code(0)


out(h−1 ) = η(0) .

Thus,

h0 .S = code(0) and h0 .outR = η(0) .

The following lemma states correctness of the construction shown in Fig. 51.
Lemma 3.12 (Moore automaton with precomputed outputs). For h = h^t and t ≥ 0, let

    h.S = code(z) ∧ δA(z, in(h)) = z' .

Then,

    h'.S = code(z') ∧ h'.outR = η(z') .

We have reset(h) = 0, and hence, Sin(h) = nexts(h). From above we have

    h'.S = nexts(h) = code(z')

and

    h'.outR = out(h) = η(z') .
Fig. 52. Simple implementation of a Mealy automaton

3.6.3 Realization of Mealy Automata


Figure 52 shows a simple implementation of a Mealy automaton. Compared to
the construction for Moore automata, only the generation of output signals
changes; the next state computation stays the same. Output η(z, in) now
depends both on the current state z and the current input in. For states
z and indices i of outputs, we derive from function η the set of switching
functions fz,i , where
fz,i (in) = 1 ↔ η(z, in)[i] = 1 .
Output is generated by

    out(h)[i] = ⋁_x h.S[x] ∧ f_{x,i}(in(h)) .

This generates the outputs of the automaton in the following way.


Lemma 3.13 (Mealy automaton).
h.S = code(z) → out(h) = η(z, in(h))
Again, the proof is straightforward:
    out(h)[i] = 1 ↔ ⋁_x h.S[x] ∧ f_{x,i}(in(h)) = 1
                  ↔ f_{z,i}(in(h)) = 1
                  ↔ η(z, in(h))[i] = 1 .
Fig. 53. Separate realization of Moore and Mealy components

3.6.4 Partial Precomputation of Outputs of Mealy Automata

We describe two optimizations that can reduce the delay of outputs of Mealy
automata. The first one is trivial. We divide the output components out[j] into
two classes: i) Mealy components η[k](z, in), which have a true dependency
on the input variables, and ii) Moore components that can be written as
η[k](z), i.e., that only depend on the current state. Suppose we have α Mealy
components and β Moore components with γ = α + β. Obviously, one can
precompute the Moore components as in a Moore automaton and realize the
Mealy components as in the previous construction of Mealy automata. The
resulting construction is shown without further correctness proof in Fig. 53.
However, quite often, more optimization is possible since Mealy compo-
nents usually depend only on very few input bits of the automaton. As an
example, consider a Mealy output depending only on two input bits:

η(z, in)[j] = f (z, in[1 : 0]) .

For x, y ∈ B, we derive Moore outputs fx,y (z) that precompute η(z, in)[j] if
in[1 : 0] = xy:
fx,y (z) = f (z, xy) .
Output η(z, in)[j] in this case is computed as

    η(z, in)[j] = f(z, in[1 : 0])
                = ⋁_{x,y∈B} (in[1 : 0] = xy) ∧ f_{x,y}(z) .
Fig. 54. Partial precomputation of a Mealy output depending on two input bits

Now, we precompute automata outputs fx,y (Sin(h)) and store them in regis-
ters fx,y R as shown in Fig. 54.
As for precomputed Moore signals, one shows
h.S = code(z) → h.fx,y R = fx,y (z) .
For the output outα [j] of the multiplexer tree, we conclude
outα [j](h) = h.fin[1:0] R
= fin[1:0] (z)
= η(z, in[1 : 0])[j] .
This construction has the advantage that only the multiplexers contribute to
the delay of the control signals generated by the automaton. In general, for
Mealy signals which depend on k input bits, we have k levels of multiplexers.
4
Nine Shades of RAM1

The processors of multi-core machines communicate via a shared memory in


a highly nontrivial way. Thus, not surprisingly, memory components play an
important role in the construction of such machines. We start in Sect. 4.1
with a basic construction of (static) random access memory (RAM). Next,
we derive in Sect. 4.2 five specialized designs: read only memory (ROM),
multi-bank RAM, cache state RAM, and special purpose register RAM (SPR
RAM). In Sect. 4.3 we then generalize the construction to multi-port RAM;
this is RAM with more than one address and data port. We need multi-port
RAMs in 4 flavours: 3-port RAM for the construction of general purpose
register files, general 2-port RAM, 2-port combined multi-bank RAM-ROM,
and 2-port cache state RAM.
For the correctness proof of a RAM construction, we consider a hard-
ware configuration h which has the abstract state of the RAM h.S as well
as the hardware components implementing this RAM. The abstract state of
the RAM is coupled with the state of its implementation by means of an ab-
straction relation. Given that both the abstract RAM specification and RAM
implementation have the same inputs, we show that their outputs are also
always the same.
The material in this section builds clearly on [12]. The new variations of
RAMs (like general 2-port RAM or 2-port cache state RAM), that we have
introduced, are needed in later chapters. Correctness proofs for the various
flavours of RAM are quite similar. Thus, if one lectures about this material,
it suffices to present only a few of them in the classroom.

4.1 Basic Random Access Memory


As illustrated in Fig. 55, an (n, a)-static RAM S or SRAM is a portion of a
clocked circuit with the following inputs and outputs:
1
The title of this chapter is inspired by the song “Forty Shades of Green” written
in 1959 by Johnny Cash and not by a recent novel.

Fig. 55. Symbol for an (n, a)-SRAM

• an n-bit data input Sin,


• an a-bit address input Sa,
• a write signal Sw, and
• an n-bit data output Sout.
Internally, the static RAM contains 2a many n-bit registers S(x) ∈ Bn . Thus,
it is modeled as a function

h.S : Ba → Bn .

The initial content of the RAM after reset is unknown:

∀x : h0 .S(x) ∈ Bn .

The output of the RAM is the register content selected by the address input:

Sout(h) = h.S(Sa(h)) .

For addresses x ∈ Ba we define the next state transition function for SRAM as

    h'.S(x) = { Sin(h)   Sa(h) = x ∧ Sw(h) = 1
                h.S(x)   otherwise .
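A minimal executable reference model of this specification (our sketch, with the state kept in a Python dictionary) looks as follows:

```python
# (n, a)-SRAM reference model: read the addressed register, write if Sw is on.
def sram_out(S, Sa):
    return S[Sa]

def sram_next(S, Sa, Sin, Sw):
    S = dict(S)
    if Sw:
        S[Sa] = Sin
    return S

S = {0: 0b1010, 1: 0b0000}
assert sram_out(S, 0) == 0b1010
S = sram_next(S, 1, 0b1111, Sw=1)
assert S[1] == 0b1111
```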
The implementation of an SRAM is shown in Fig. 56. We use 2a many n-bit
registers R(i) with i ∈ [0 : 2a − 1] and an a-decoder with outputs X[2a − 1 : 0]
satisfying
X(i) = 1 ↔ i = Sa(h) .
The inputs of register R(i) are defined as

h.R(i) in = Sin(h)
h.R(i) ce = Sw(h) ∧ X[i] .

For the next state computation we get


Fig. 56. Construction of an (n, a)-SRAM

    h'.R(i) = { Sin(h)    i = Sa(h) ∧ Sw(h)
                h.R(i)    otherwise .

The i-th input vector b[i] to the OR-tree is constructed as

    b[i] = X[i] ∧ h.R(i)
         = { h.R(i)   i = Sa(h)
             0^n      otherwise .

Thus,

    Sout(h) = ⋁_{i=0}^{2^a − 1} b[i] = h.R(Sa(h)) .
As a result, when we choose
h.S(x) = h.R(x)
as the defining equation of our abstraction relation, the presented construction
implements an SRAM.

4.2 Single-Port RAM Designs


4.2.1 Read Only Memory (ROM)
An (n, a)-ROM is a memory with a drawback and an advantage. The draw-
back: it can only be read. The advantage: its content is known after power
Fig. 57. Symbol of an (n, a)-ROM

Fig. 58. Construction of an (n, a)-ROM

up. It is modeled by a mapping S : Ba → Bn , which does not depend on the


hardware configuration h. The construction is obtained by a trivial variation
of the basic RAM design from Fig. 56: replace each register R(i) by the con-
stant input S(bina (i)) ∈ Bn . Since the ROM cannot be written, there are no
data in, write, or clock enable signals; the hardware constructed in this way
is a circuit. Symbol and construction are given in Figs. 57 and 58.

4.2.2 Multi-bank RAM

Let n = 8k be a multiple of 8. An (n, a)-multi-bank RAM S : Ba → B8k is


basically an (n, a)-RAM with separate bank write signals bw(i) for each byte
i ∈ [0 : k − 1] (see Fig. 59). It has
• a data input Sin ∈ B8k ,
• a data output Sout ∈ B 8k ,
• an address input Sa ∈ Ba , and
• bank write signals Sbw[i] ∈ B for i ∈ [0 : k − 1].
Data output is defined exactly as for the ordinary RAM:

Sout(h) = h.S(Sa(h)) .
Fig. 59. Symbol of an (n, a)-multi-bank RAM

Fig. 60. Computation of output byte i of a modify circuit

For the definition of the next state, we first introduce auxiliary function

modify : B8k × B8k × Bk → B8k .

This function selects bytes from two provided strings according to the provided
byte write signals. Let y, x ∈ B8k and bw ∈ Bk . Then, for all i ∈ [0 : k − 1],

    byte(i, modify(x, y, bw)) = { byte(i, y)   bw[i] = 1
                                  byte(i, x)   bw[i] = 0 ,

i.e., for all i with active bw[i] one replaces byte i of x by byte i of y. The next
state of the multi-bank RAM is then defined as

    h'.S(x) = { modify(h.S(x), Sin(h), Sbw(h))   x = Sa(h)
                h.S(x)                           otherwise .
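The byte-selection behavior of modify is easy to check with a small Python sketch (illustrative only; integers encode the 8k-bit strings, and k = 4 in the example):

```python
# byte(i, x) and modify(x, y, bw) for words encoded as Python integers.
def byte(i, x):
    return (x >> (8 * i)) & 0xFF

def modify(x, y, bw, k):
    res = 0
    for i in range(k):
        src = y if (bw >> i) & 1 else x      # take byte i from y where bw[i] = 1
        res |= byte(i, src) << (8 * i)
    return res

# replace only byte 0 of a 4-byte word
assert modify(0x11223344, 0xAABBCCDD, bw=0b0001, k=4) == 0x112233DD
```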

As shown in Fig. 60, each byte of the output of a modify circuit is simply
computed by an 8-bit wide multiplexer.
The straightforward construction of a multi-bank RAM uses k separate so
called banks. These are (8, a)-RAMs S (i) for i ∈ [0 : k − 1]. For each i, bank
S (i) is wired as shown in Fig. 61:
Fig. 61. Bank i of an (n, a)-multi-bank RAM

S (i) a = Sa(h)
S (i) in = byte(i, Sin(h))
S (i) out = byte(i, Sout(h))
S (i) w = Sbw(h)[i] .

We abstract the state h.S for this construction as

byte(i, h.S(x)) = h.S (i) (x) .

Correctness now follows in a lengthy – but completely straightforward – way


from the specification of ordinary RAM. For the outputs we have

byte(i, Sout(h)) = S (i) out(h) (construction)


= h.S (i) (S (i) a(h)) (construction)
= h.S (i) (Sa(h)) (construction)
= byte(i, h.S(Sa(h))) . (state abstraction)

For the new state of the multi-bank RAM and address x ≠ Sa(h), we have

byte(i, h .S(x)) = h .S (i) (x) (state abstraction)


= h.S (i) (x) (construction)
= byte(i, h.S(x)) . (state abstraction)

For the new state and address x = Sa(h):


Fig. 62. Symbol of an (n, a)-CS RAM

    byte(i, h'.S(x)) = h'.S(i)(x)                                    (state abstraction)
                     = { S(i)in(h)         S(i)w(h) = 1
                         h.S(i)(x)         S(i)w(h) = 0              (construction)
                     = { byte(i, Sin(h))   Sbw(h)[i] = 1
                         h.S(i)(x)         Sbw(h)[i] = 0             (construction)
                     = { byte(i, Sin(h))   Sbw(h)[i] = 1
                         byte(i, h.S(x))   Sbw(h)[i] = 0 .           (state abstraction)

As a result, we have

h .S(x) = modify (h.S(x), Sin(h), Sbw(h)) .

4.2.3 Cache State RAM

The symbol of an (n, a)-cache state RAM or CS RAM is shown in Fig. 62.
This type of RAM is used later for holding the status bits of caches. It has
two extra inputs:
• a control signal Sinv – on activation, a special value is forced into all
registers of the RAM. Later, we will use this to set a value that indicates
that all cache lines are invalid2 and
• an n-bit input Svinv providing this special value. This input is usually
wired to a constant value in Bn .
Activation of Sinv takes precedence over ordinary write operations:

2
I.e., not a copy of meaningful data in our programming model. We explain this
in much more detail later.
Fig. 63. Construction block of an (n, a)-cache state RAM

Fig. 64. Symbol of an (n, a)-SPR RAM



    h'.S(x) = { Svinv(h)   Sinv(h) = 1
                Sin(h)     x = Sa(h) ∧ Sw(h) = 1 ∧ Sinv(h) = 0
                h.S(x)     otherwise .

The changes in the implementation for each register R(i) are shown in Fig. 63.
The clock enable is also activated by Sinv and the data input comes from a
multiplexer:

    R(i)ce = Sinv(h) ∨ X[i] ∧ Sw(h)

    R(i)in = { Svinv(h)   Sinv(h) = 1
               Sin(h)     otherwise .

4.2.4 SPR RAM

An (n, a)-SPR RAM as shown in Fig. 64 is used for the realization of special
purpose register files and in the construction of fully associative caches. It
behaves both as an (n, a)-RAM and as a set of 2a many n-bit registers. It has
the following inputs and outputs:
Fig. 65. Construction of an (n, a)-SPR RAM

• an n-bit data input Sin,


• an a-bit address input Sa,
• an n-bit data output Sout,
• a write signal Sw,
• for each i ∈ [0 : 2a − 1] an individual n-bit data input Sdin[i] for register
R(i) ,
• for each i ∈ [0 : 2a − 1] an individual n-bit data output Sdout[i] for register
R(i) , and
• for each i ∈ [0 : 2a − 1] an individual clock enable signal Sce[i] for register
R(i) .
Ordinary data output is generated as usual, and the individual data outputs
are simply the outputs of the internal registers:

Sout(h) = h.S(Sa(h))
Sdout(h)[i] = h.S(bina (i)) .

Register updates to R(i) can be performed either by Sin for regular writes
or by Sdin[i] if the special clock enables are activated. Special writes take
precedence over ordinary writes:


    h'.S(x) = { Sdin(h)[x]   Sce(h)[x] = 1
                Sin(h)       x = Sa(h) ∧ Sw(h) = 1 ∧ Sce(h)[x] = 0
                h.S(x)       otherwise .
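The precedence of the special writes can be mirrored in a short Python reference model (a sketch under our own naming, not the book's notation):

```python
# SPR RAM next state: individual writes via Sdin[i]/Sce[i] win over the ordinary write.
def spr_next(S, Sa, Sin, Sw, Sdin, Sce):
    S = dict(S)
    for x in S:
        if Sce[x]:
            S[x] = Sdin[x]
        elif Sw and x == Sa:
            S[x] = Sin
    return S

S = {0: 1, 1: 2}
S = spr_next(S, Sa=0, Sin=7, Sw=1, Sdin={0: 9, 1: 5}, Sce={0: 1, 1: 0})
assert S == {0: 9, 1: 2}        # the special write wins at address 0
```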

A single address decoder with outputs X[i] and a single OR-tree suffices.
Figure 65 shows the construction satisfying
Fig. 66. Symbol of an (n, a)-GPR RAM

    R(i)ce = Sce(h)[i] ∨ X[i] ∧ Sw(h)

    R(i)in = { Sdin(h)[i]   Sce(h)[i] = 1
               Sin(h)       otherwise .

4.3 Multi-port RAM Designs


4.3.1 3-port RAM for General Purpose Registers

An (n, a)-GPR RAM is a three-port RAM that we use later for general purpose
registers. As shown in Fig. 66, it has the following inputs and outputs:
• an n-bit data input Sin,
• three a-bit address inputs Sa, Sb, Sc,
• a write signal Sw, and
• two n-bit data outputs Souta, Soutb.
As for ordinary SRAM, the state of the 3-port RAM is a mapping

h.S : Ba → Bn .

Reads are controlled by address inputs Sa(h) and Sb(h):

Souta(h) = h.S(Sa(h))
Soutb(h) = h.S(Sb(h)) .

Writing is performed under control of address input Sc(h):

    h'.S(x) = { Sin(h)   Sc(h) = x ∧ Sw(h) = 1
                h.S(x)   otherwise .
Fig. 67. Construction of an (n, a)-GPR RAM

The implementation shown in Fig. 67 is a straightforward variation of the


design for ordinary SRAM. One uses three different a-decoders with outputs
X[0 : 2a − 1],Y [0 : 2a − 1], Z[0 : 2a − 1] satisfying
X[i] = 1 ↔ i = Sa(h)
Y [i] = 1 ↔ i = Sb(h)
Z[i] = 1 ↔ i = Sc(h) .
Clock enable signals are derived from the decoded Sc address:
R(i) ce = Z[i] ∧ Sw(h) .
Outputs Souta, Soutb are generated by two (n, 2a )-bit OR-trees with inputs
a[i], b[i] satisfying
    a[i] = X[i] ∧ h.R(i)          Souta(h) = ⋁_i a[i]
    b[i] = Y[i] ∧ h.R(i)          Soutb(h) = ⋁_i b[i] .
Fig. 68. Symbol of an (n, a)-2-port RAM

4.3.2 General 2-port RAM

A general (n, a)-2-port RAM is shown in Fig. 68. This is a RAM with the
following inputs and outputs:
• two data inputs Sina, Sinb,
• two addresses Sa, Sb,
• two write signals Swa, Swb.
The data outputs are determined by the addresses as in the 3-port RAM for
general purpose registers:

Souta(h) = h.S(Sa(h))
Soutb(h) = h.S(Sb(h)) .

The 2-port RAM allows simultaneous writes to two addresses. In case both
write signals are active and both addresses point to the same register we have to
resolve the conflict: the write via the a port takes precedence:

    h'.S(x) = { Sina(h)   x = Sa(h) ∧ Swa(h) = 1
                Sinb(h)   x = Sb(h) ∧ Swb(h) = 1 ∧ ¬(x = Sa(h) ∧ Swa(h) = 1)
                h.S(x)    otherwise .
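The port-a precedence is captured by the following Python sketch (illustrative only): the write via port b is applied first and then possibly overwritten by port a.

```python
# 2-port RAM write semantics: on a simultaneous write to the same address, port a wins.
def twoport_next(S, Sa, Sina, Swa, Sb, Sinb, Swb):
    S = dict(S)
    if Swb:
        S[Sb] = Sinb
    if Swa:
        S[Sa] = Sina            # applied last, so port a takes precedence
    return S

S = {3: 0}
S = twoport_next(S, Sa=3, Sina=0xA, Swa=1, Sb=3, Sinb=0xB, Swb=1)
assert S[3] == 0xA
```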

Only two address decoders with outputs X[0 : 2a − 1], Y [0 : 2a − 1] are


necessary. They satisfy

X[i] = 1 ↔ i = Sa(h)
Y [i] = 1 ↔ i = Sb(h) .

Figure 69 shows the changes to each register R(i) . Clock enable is activated in
case a write via the a address or via the b address occurs. The input is chosen
from the corresponding data input by a multiplexer:
Fig. 69. Construction of an (n, a)-2-port RAM

Fig. 70. Bank i of an (n, a)-2-port multi-bank RAM

    R(i)ce = Swa(h) ∧ X[i] ∨ Swb(h) ∧ Y[i]

    R(i)in = { Sina(h)   Swa(h) ∧ X[i]
               Sinb(h)   otherwise .

As required in this implementation, writes via port a take precedence over


writes via port b to the same address.
Output is generated as for GPR RAMs.

4.3.3 2-port Multi-bank RAM-ROM

In Sect. 3.5.6, we have already explained why it is often desirable to implement


some small portion of the main memory as a ROM and the remaining large
part as a RAM. In the implementation of the sequential MIPS processor that
we construct in Chap. 6, every instruction is executed in a single cycle. Hence,
we need a memory construction, which allows us to fetch an instruction and to
access data in a single cycle. For this purpose we construct a 2-port multi-bank
RAM-ROM out of a 2-port multi-bank RAM and a 2-port ROM.
Fig. 71. Construction of an (n, a)-2-port ROM

A straightforward implementation of an (8k, a)-2-port multi-bank RAM uses


k many (8, a)-2-port RAMs. Wiring for bank i is shown in Fig. 70. Figure 71
shows the implementation of an (n, a)-2-port ROM.
For r < a we define a combined (n, r, a)-2-port multi-bank RAM-ROM,
where n = 8k, as a device that behaves for small addresses a = 0a−r b with
b ∈ Br like ROM and on the other addresses like RAM. Just like an ordinary
2-port RAM, we model the state of the (n, r, a)-2-port multi-bank RAM-ROM
as
h.S : Ba → Bn
and define its output as

Souta(h) = h.S(Sa(h))
Soutb(h) = h.S(Sb(h)) .

Write operations, however, only affect addresses larger than 0a−r 1r . Moreover,
we only need the writes to be performed through port b of the memory (port
a will only be used for instruction fetches):


    h'.S(x) = { modify(h.S(x), Sin(h), Sbw(h))   x[a − 1 : r] ≠ 0^{a−r} ∧ x = Sb(h)
                h.S(x)                           otherwise .
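As an illustration (our sketch; r, k and the sample addresses are invented), the read-only behavior of the small address range and the byte-wise write through port b can be modeled as follows:

```python
def ramrom_next(S, Sb, Sin, Sbw, r, k):
    """Port-b write of a 2-port multi-bank RAM-ROM; ROM addresses are read-only."""
    S = dict(S)
    if (Sb >> r) == 0:                     # upper a - r address bits all zero: ROM
        return S
    old, new = S[Sb], 0
    for i in range(k):                     # byte-wise merge controlled by Sbw
        src = Sin if (Sbw >> i) & 1 else old
        new |= ((src >> (8 * i)) & 0xFF) << (8 * i)
    S[Sb] = new
    return S

S = {0x0005: 0x11223344, 0x1005: 0x11223344}
S = ramrom_next(S, 0x0005, 0xAABBCCDD, Sbw=0xF, r=12, k=4)   # ROM address: unchanged
S = ramrom_next(S, 0x1005, 0xAABBCCDD, Sbw=0xF, r=12, k=4)   # RAM address: written
assert S[0x0005] == 0x11223344 and S[0x1005] == 0xAABBCCDD
```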

The symbol for an (n, r, a)-2-port multi-bank RAM-ROM and a straightfor-


ward implementation involving an (n, a)-2-port multi-bank RAM, an (n, r)-
2-port ROM, and two zero testers is shown in Figs. 72 and 73.
Fig. 72. Symbol of an (n, r, a)-2-port multi-bank RAM-ROM

Fig. 73. Construction of an (n, r, a)-2-port multi-bank RAM-ROM

4.3.4 2-port Cache State RAM

Exactly as the name indicates, an (n, a)-2-port CS RAM is a RAM with all
features of a 2-port RAM and a CS RAM . Its symbol is shown in Fig. 74.
Inputs and outputs are:
• two data inputs Sina, Sinb,
• two addresses Sa, Sb,
• two write signals Swa, Swb,
• a control signal Sinv, and
• an n-bit input Svinv providing a special data value.
Address decoding, data output generation, and execution of writes is as for
2-port RAMs. In write operations, activation of signal Sinv takes precedence
Fig. 74. Symbol of an (n, a)-2-port CS RAM

Fig. 75. Construction block of an (n, a)-2-port CS RAM

over everything else:




    h'.S(x) = { Svinv(h)   Sinv(h) = 1
                Sina(h)    Sinv(h) = 0 ∧ x = Sa(h) ∧ Swa(h) = 1
                Sinb(h)    Sinv(h) = 0 ∧ x = Sb(h) ∧ Swb(h) = 1
                           ∧ ¬(x = Sa(h) ∧ Swa(h) = 1)
                h.S(x)     otherwise .

The changes in the implementation for each register R(i) are shown in Fig. 75.
The signals generated are:

    R(i)ce = Sinv(h) ∨ X[i] ∧ Swa(h) ∨ Y[i] ∧ Swb(h)

    R(i)in = { Svinv(h)   Sinv(h) = 1
               Sina(h)    Sinv(h) = 0 ∧ X[i] ∧ Swa(h)
               Sinb(h)    otherwise .
5
Arithmetic Circuits

For later use in processors with the MIPS instruction set architecture (ISA),
we construct several circuits: as the focus in this book is on correctness and not
so much on efficiency of the constructed machine, only the most basic adders
and incrementers are constructed in Sect. 5.1. For more advanced construc-
tions see, e.g., [12]. An arithmetic unit (AU) for binary and two’s complement
numbers is studied in Sect. 5.2. In our view, understanding the correctness
proofs of this section is a must for anyone wishing to understand fixed point
arithmetic.
With the help of the AU we construct in Sect. 5.3 an arithmetic logic
unit (ALU) for the MIPS ISA in a straightforward way. Differences to [12]
are simply due to differences in the encoding of ALU operations between the
MIPS ISA considered here and the DLX ISA considered in [12].
Also the shift unit considered in Sect. 5.4 is basically from [12]. Shift
units are not completely trivial. We recommend to cover this material in the
classroom.
As branch instructions in the DLX and the MIPS instruction set archi-
tectures are treated in quite different ways, the new Sect. 5.5 with a branch
condition evaluation unit had to be included here.

5.1 Adder and Incrementer


An n-adder is a circuit with inputs a[n − 1 : 0] ∈ Bn , b[n − 1 : 0] ∈ Bn , c0 ∈ B
and outputs cn ∈ B and s[n − 1 : 0] ∈ Bn satisfying the specification
    ⟨cn s[n − 1 : 0]⟩ = ⟨a[n − 1 : 0]⟩ + ⟨b[n − 1 : 0]⟩ + c0 .
We use the symbol from Fig. 76 for n-adders.
A full adder is obviously a 1-adder. A recursive construction of a very
simple n−adder, called carry chain adder, is shown in Fig. 77. The correctness
follows directly from the correctness of the basic addition algorithm for binary
numbers (Lemma 2.11).
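The recursion of Fig. 77 corresponds to the usual ripple-carry computation. A small Python sketch (for illustration; bit vectors are lists with the least significant bit first) is:

```python
# Carry chain (ripple-carry) adder built from full adders.
def full_adder(a, b, c):
    s = a ^ b ^ c
    cout = (a & b) | (a & c) | (b & c)
    return cout, s

def carry_chain_add(a, b, c0):
    c, s = c0, []
    for ai, bi in zip(a, b):            # ripple the carry from position 0 upward
        c, si = full_adder(ai, bi, c)
        s.append(si)
    return c, s                         # (c_n, s[n-1:0])

cn, s = carry_chain_add([1, 1, 0, 1], [1, 0, 1, 0], 0)   # 11 + 5 = 16
assert cn == 1 and s == [0, 0, 0, 0]
```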

Fig. 76. Symbol of an n-adder

Fig. 77. Recursive construction of a carry chain adder

It is often convenient to ignore the carry out bit cn of the n-adder and to talk
only about the sum bits s[n − 1 : 0]. With the help of Lemma 2.10, we can
then rewrite the specification of the n-adder as

    ⟨s⟩ = (⟨a⟩ + ⟨b⟩ + c0) mod 2^n .

An n-incrementer is a circuit with inputs a[n − 1 : 0] ∈ Bn , c0 ∈ B and outputs
cn ∈ B and s[n − 1 : 0] ∈ Bn satisfying

    ⟨cn s[n − 1 : 0]⟩ = ⟨a[n − 1 : 0]⟩ + c0 .

Throwing away the carry bit cn and using Lemma 2.10, we can rewrite this as

    ⟨s⟩ = (⟨a⟩ + c0) mod 2^n .
We use the symbol from Fig. 78 for n-incrementers. Obviously, incrementers
can be constructed from n-adders by tying the b input to 0n . As shown in
Fig. 78. Symbol of an n-incrementer

Fig. 79. Recursive construction of a carry chain incrementer

Sect. 3.2, a full adder whose b input is tied to zero can be replaced with a
half adder. This yields the construction of carry chain incrementers shown in
Fig. 79.

5.2 Arithmetic Unit

The symbol of an n-arithmetic unit or short n-AU is shown in Fig. 80. It is a


circuit with the following inputs:
• operand inputs a = a[n − 1 : 0], b = b[n − 1 : 0] with a, b ∈ Bn ,
• control input u ∈ B distinguishing between unsigned (binary) and signed
(two’s complement) numbers,
• control input sub ∈ B indicating whether input b should be subtracted
from or added to input a,
and the following outputs:
• result s[n − 1 : 0] ∈ Bn ,
Fig. 80. Symbol of an n-arithmetic unit

• overflow bit ovf ∈ B, and


• negative bit neg ∈ B.
We define the exact result S ∈ Z of an arithmetic unit as

    S = { [a] + [b]   (u, sub) = 00
          [a] − [b]   (u, sub) = 01
          ⟨a⟩ + ⟨b⟩   (u, sub) = 10
          ⟨a⟩ − ⟨b⟩   (u, sub) = 11 .

For the result of the ALU, we pick the representative of the exact result in
Bn resp. Tn and represent it in the corresponding format

    s = { twocn(S tmod 2^n)   u = 0
          binn(S mod 2^n)     u = 1 ,

i.e., we have

    [s] = S tmod 2^n   if u = 0
    ⟨s⟩ = S mod 2^n    if u = 1 .
Overflow and negation signals are defined with respect to the exact result.
The overflow bit is computed only for the case of two’s complement numbers;
for binary numbers it is always 0 since the architecture we introduce later
does not consider unsigned overflows:

    ovf = { S ∉ Tn   u = 0
            0        u = 1

    neg = (S < 0) .
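A compact executable version of this specification (our sketch; au_spec is an invented name and works on integers rather than bit strings) is:

```python
# Reference model of the n-AU specification: exact result S, n-bit result s, flags.
def au_spec(a, b, u, sub, n):
    def tc(x):                                  # interpret x as two's complement
        return x - (1 << n) if x >> (n - 1) else x
    A, B = (a, b) if u else (tc(a), tc(b))
    S = A - B if sub else A + B
    s = S % (1 << n)                            # representative mod 2^n
    ovf = (not u) and not (-(1 << (n - 1)) <= S < (1 << (n - 1)))
    neg = S < 0
    return s, ovf, neg

assert au_spec(0b0111, 0b0001, u=0, sub=0, n=4) == (0b1000, True, False)  # 7 + 1 leaves T_4
assert au_spec(0b0010, 0b0101, u=1, sub=1, n=4) == (0b1101, False, True)  # 2 - 5 < 0
```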

Data Paths
We introduce special symbols +n and −n to denote addition and subtraction
of n-bit binary numbers mod 2n :
5.2 Arithmetic Unit 103

    a +n b = binn(⟨a⟩ + ⟨b⟩ mod 2^n)
    a −n b = binn(⟨a⟩ − ⟨b⟩ mod 2^n) .

The following lemma asserts that, for signed and unsigned numbers, the sum
bits s can be computed in exactly the same way.
Lemma 5.1 (computing sum bits). Compute the sum bits as

    s = { a +n b   sub = 0
          a −n b   sub = 1 ,

then

    [s] = S tmod 2^n   if u = 0
    ⟨s⟩ = S mod 2^n    if u = 1 .

Proof. For u = 1 this follows directly from the definitions. For u = 0 we have
from Lemma 2.14 and Lemma 2.2:

    [s] ≡ ⟨s⟩                            (mod 2^n)
        ≡ { ⟨a⟩ + ⟨b⟩   sub = 0
            ⟨a⟩ − ⟨b⟩   sub = 1          (mod 2^n)
        ≡ { [a] + [b]   sub = 0
            [a] − [b]   sub = 1          (mod 2^n)
        ≡ S                              (mod 2^n) .

From [s] ∈ Tn and Lemma 2.5 we conclude

    [s] = S tmod 2^n .   □



The main data paths of an n-AU are shown in Fig. 81. The following lemma
asserts that the sum bits are computed correctly.

Lemma 5.2 (correctness of arithmetic unit). The sum bits s[n − 1 : 0] in Fig. 81 satisfy

    s = { a +n b   sub = 0
          a −n b   sub = 1 .

Proof. From the construction of the circuit, we have

    d = b ⊕ sub
      = { b    sub = 0
          ¬b   sub = 1 .
Fig. 81. Data paths of an n-arithmetic unit

From the specification of an n-adder, Lemma 2.10, and the subtraction algorithm for binary numbers (Lemma 2.15), we conclude

    ⟨s⟩ = { ⟨a⟩ + ⟨b⟩        sub = 0
            ⟨a⟩ + ⟨¬b⟩ + 1   sub = 1 }   mod 2^n
        = { ⟨a⟩ + ⟨b⟩        sub = 0
            ⟨a⟩ − ⟨b⟩        sub = 1 }   mod 2^n .

Application of binn(·) to both sides completes the proof of the lemma. □


Negative Bit

We start with the case u = 0, i.e., with two’s complement numbers. We have

S = [a] ± [b]
= [a] + [d] + sub
≤ 2n−1 − 1 + 2n−1 − 1 + 1
= 2n − 1,
S ≥ −2n−1 − 2n−1
= −2n .

Thus,
S ∈ Tn+1 .
According to Lemma 2.14 we use sign extension to extend operands to n + 1
bits:
5.2 Arithmetic Unit 105

[a] = [an−1 a]
[d] = [dn−1 d] .

We compute an extra sum bit sn by the basic addition algorithm:

sn = an−1 ⊕ dn−1 ⊕ cn ,

and conclude
S = [s[n : 0]].
Again by Lemma 2.14 this is negative if and only if the sign bit sn is 1:

S < 0 ↔ sn = 1 .

As a result, we have the following lemma.


Lemma 5.3 (two’s complement negative bit).

u = 0 → neg = an−1 ⊕ dn−1 ⊕ cn

For the case u = 1, i.e., for binary numbers, a negative result can only occur
in the case of subtraction, i.e., if sub = 1. In this case we argue along the lines
of the correctness proof for the subtraction algorithm:

    S = ⟨a⟩ − ⟨b⟩
      = ⟨a⟩ − [0b]
      = ⟨a⟩ + [1¬b] + 1
      = ⟨a⟩ + ⟨¬b⟩ − 2^n + 1
      = ⟨cn s[n − 1 : 0]⟩ − 2^n
      = 2^n(cn − 1) + ⟨s[n − 1 : 0]⟩ ,   where s[n − 1 : 0] ∈ Bn .

If cn = 1 we have S = ⟨s⟩ ≥ 0. If cn = 0 we have

    S = −2^n + ⟨s[n − 1 : 0]⟩
      ≤ −2^n + 2^n − 1
      = −1 .

Thus,

    u = 1 → neg = sub ∧ ¬cn ,

and together with Lemma 5.3 we can define the negative bit computation.

Lemma 5.4 (negative bit).

    neg = ¬u ∧ (a_{n−1} ⊕ d_{n−1} ⊕ cn) ∨ u ∧ sub ∧ ¬cn
106 5 Arithmetic Circuits

Overflow Bit

We compute the overflow bit only for the case of two’s complement numbers,
i.e., when u = 0. We have

    S = [a] + [d] + sub
      = −2^{n−1}(a_{n−1} + d_{n−1}) + ⟨a[n − 2 : 0]⟩ + ⟨d[n − 2 : 0]⟩ + sub
      = −2^{n−1}(a_{n−1} + d_{n−1}) + ⟨c_{n−1} s[n − 2 : 0]⟩ − c_{n−1}2^{n−1} + c_{n−1}2^{n−1}
      = −2^{n−1}(a_{n−1} + d_{n−1} + c_{n−1}) + 2^{n−1}(c_{n−1} + c_{n−1}) + ⟨s[n − 2 : 0]⟩
      = −2^{n−1}⟨cn s_{n−1}⟩ + 2^n c_{n−1} + ⟨s[n − 2 : 0]⟩
      = −2^n cn − 2^{n−1} s_{n−1} + 2^n c_{n−1} + ⟨s[n − 2 : 0]⟩
      = 2^n(c_{n−1} − cn) + [s[n − 1 : 0]] .

We claim
S ∈ Tn ↔ cn−1 = cn .
If cn = cn−1 we obviously have S = [s], thus S ∈ Tn . If cn = 1 and cn−1 = 0
we have
−2n + [s] ≤ −2n + 2n−1 − 1 = −2n−1 − 1 < −2n−1
and if cn = 0 and cn−1 = 1, we have

2n + [s] ≥ 2n − 2n−1 > 2n−1 − 1 .

Thus, in the two latter cases, we have S ∉ Tn . Because

    cn ≠ c_{n−1} ↔ cn ⊕ c_{n−1} = 1 ,

we get the following lemma for the overflow bit computation.


Lemma 5.5 (overflow bit).

    ovf = ¬u ∧ (cn ⊕ c_{n−1})

5.3 Arithmetic Logic Unit (ALU)


Figure 82 shows a symbol for the n-ALU constructed here. Width n should
be even. The circuit has the following inputs:
• operand inputs a = a[n − 1 : 0], b = b[n − 1 : 0] with a, b ∈ Bn ,
• control inputs af [3 : 0] ∈ B4 and i ∈ B specifying the operation that the
ALU performs with the operands,
and the following outputs:
• result alures[n − 1 : 0] ∈ Bn ,
• overflow bit ovfalu ∈ B.
Fig. 82. Symbol of an n-arithmetic logic unit

Table 6. Specification of ALU operations

    af[3:0]  i  alures[31:0]                 ovfalu
    0000     ∗  a +n b                       [a] + [b] ∉ Tn
    0001     ∗  a +n b                       0
    0010     ∗  a −n b                       [a] − [b] ∉ Tn
    0011     ∗  a −n b                       0
    0100     ∗  a ∧ b                        0
    0101     ∗  a ∨ b                        0
    0110     ∗  a ⊕ b                        0
    0111     0  ¬(a ∨ b)                     0
    0111     1  b[n/2 − 1 : 0]0^{n/2}        0
    1010     ∗  0^{n−1}([a] < [b] ? 1 : 0)   0
    1011     ∗  0^{n−1}(⟨a⟩ < ⟨b⟩ ? 1 : 0)   0

The results that must be generated are specified in Table 6. There are three
groups of operations:

• Arithmetic operations.
• Logical operations. At first sight, the result b[n/2 − 1 : 0]0^{n/2} might appear
odd. This ALU function is later used to load the upper half of an n-bit
constant using the immediate fields of an instruction.
• Test and set instructions. They compute an n-bit result 0n−1 z where only
the last bit is of interest. The result of these instructions can be computed
by performing a subtraction in the AU and then testing the negative bit.

Figure 83 shows the fairly obvious data paths of an n-ALU. The missing signals
are easily constructed. We subtract if af [1] = 1. For test and set operations
with af [3] = 1, output z is simply the negative bit neg. The overflow bit can
only differ from zero if we are doing an arithmetic operation. Thus, we have
Fig. 83. Data paths of an n-arithmetic logic unit

sub = af [1]
z = neg
u = af [0]
ovfalu = ovf ∧ ¬af[3] ∧ ¬af[2] .
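For reference, the result selection of Table 6 can be prototyped in a few lines of Python (a sketch only; af is passed as a 4-character string and the two test cases are our own):

```python
def alu(a, b, af, i, n):
    """alures for the af encodings of Table 6; a sketch, not the circuit."""
    mask = (1 << n) - 1
    tc = lambda x: x - (1 << n) if x >> (n - 1) else x
    if af in ("0000", "0001"):
        return (a + b) & mask
    if af in ("0010", "0011"):
        return (a - b) & mask
    if af == "0100":
        return a & b
    if af == "0101":
        return a | b
    if af == "0110":
        return a ^ b
    if af == "0111":                     # i = 1: load upper half of b, i = 0: nor
        return ((b & ((1 << (n // 2)) - 1)) << (n // 2)) if i else (~(a | b)) & mask
    if af == "1010":
        return 1 if tc(a) < tc(b) else 0
    if af == "1011":
        return 1 if a < b else 0
    raise ValueError("af not in Table 6")

assert alu(0x0000FFFF, 0x00001234, "0111", 1, 32) == 0x12340000   # load-upper style
assert alu(5, 3, "1010", 0, 32) == 0                              # 5 < 3 is false
```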

5.4 Shift Unit


n-shift operations have two operands:
• a bit vector a[n − 1 : 0] ∈ Bn that is to be shifted and
• a shift distance i ∈ [0 : n − 1].
Shifts come in five flavors: cyclical left shift slc, cyclical right shift src, logical
left shift sll, logical right shift srl, and arithmetic right shift sra. The result
of such an n-shift has n bits and is defined as

    slc(a, i)[j] = a[j − i mod n]
    src(a, i)[j] = a[j + i mod n]

    sll(a, i)[j] = { a[j − i]   j ≥ i
                     0          otherwise

    srl(a, i)[j] = { a[j + i]   j ≤ n − 1 − i
                     0          otherwise

    sra(a, i)[j] = { a[j + i]   j ≤ n − 1 − i
                     a_{n−1}    otherwise

or, equivalently, as

    slc(a, i) = a[n − i − 1 : 0] a[n − 1 : n − i]
    src(a, i) = a[i − 1 : 0] a[n − 1 : i]
    sll(a, i) = a[n − i − 1 : 0] 0^i
    srl(a, i) = 0^i a[n − 1 : i]
    sra(a, i) = a_{n−1}^i a[n − 1 : i] .
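These five shifts are easy to prototype on bit strings written with the most significant bit first (a Python sketch for illustration; the final assertion checks Lemma 5.6 below for one instance):

```python
# Five n-shifts on bit strings; a[0] is the most significant bit a_{n-1}.
def slc(a, i): return a[i:] + a[:i]                       # cyclic left
def src(a, i): n = len(a); return a[n - i:] + a[:n - i]   # cyclic right
def sll(a, i): return a[i:] + "0" * i                     # logical left
def srl(a, i): n = len(a); return "0" * i + a[:n - i]     # logical right
def sra(a, i): n = len(a); return a[0] * i + a[:n - i]    # arithmetic right

a = "1011"
assert slc(a, 1) == "0111" and src(a, 1) == "1101"
assert sll(a, 2) == "1100" and srl(a, 2) == "0010" and sra(a, 2) == "1110"
assert src(a, 1) == slc(a, (4 - 1) % 4)                   # Lemma 5.6
```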

From the definition we immediately conclude how to compute right shifts


using left shifts.
Lemma 5.6 (left right shift).

src(a, i) = slc(a, n − i mod n)

Proof.

j + i = j − (−i)
≡ j − (n − i) mod n



Here, we only build shifters for numbers n which are a power of two:

n = 2k , k∈N.

Basic building blocks for all following shifter constructions are (n, b)-cyclic
left shifters or short (n, b)-SLCs for b ∈ [1 : n − 1]. They have
• input a[n − 1 : 0] ∈ Bn for the data to be shifted,
• input s ∈ B indicating whether to shift or not,
• data outputs a'[n − 1 : 0] ∈ Bn satisfying

    a' = { slc(a, b)   s = 1
           a           otherwise
Fig. 84. Implementation of an (n, b)-SLC ((n, b)-cyclic left shifter)

Fig. 85. Implementation of an n-SLC (cyclic n-left shifter)

Figure 84 shows a construction of an (n, b)-SLC.


A cyclic n-left shifter or short n-SLC is a circuit with
• data inputs a[n − 1 : 0] ∈ Bn ,
• control inputs b[k − 1 : 0] ∈ Bk , which provide the binary representation
of the shift distance,
• data outputs r[n − 1 : 0] ∈ Bn satisfying

r = slc(a, b) .

Figure 85 shows a construction of a cyclic n-SLC as a stack of (n, 2i )-SLCs.


An easy induction on i ∈ [0 : k − 1] shows

r(i) = slc(a, b[i : 0]) .
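The staged construction can be mimicked directly in software: stage i of the stack applies a cyclic left shift by 2^i exactly if control bit b[i] is set. The sketch below reuses slc from the previous sketch; the list encoding of bit vectors and n = 2^k are again assumptions.

def slc_stack(a, b_bits):
    """b_bits[i] is control bit b[i]; returns slc(a, <b>)."""
    r = a
    for i, s in enumerate(b_bits):           # stage i: an (n, 2^i)-SLC
        r = slc(r, 2 ** i) if s else r       # shift by 2^i iff the control bit is set
    return r

# After stage i the intermediate result equals slc(a, <b[i:0]>), hence the stack computes slc(a, <b>).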

A cyclic n-right-left shifter n-SRLC is a circuit with


• data inputs a[n − 1 : 0] ∈ Bn ,

Fig. 86. Implementation of an n-SRLC (cyclic n-right-left shifter)

• control inputs b[k − 1 : 0] ∈ Bk , which provide the binary representation


of the shift distance,
• a control input f ∈ B indicating the shift direction, and
• data outputs r[n − 1 : 0] ∈ Bn satisfying

r = slc(a, b)    f = 0
    src(a, b)    f = 1 .

Figure 86 shows a construction of n-SRLCs. The output c[k − 1 : 0] of the


k-incrementer satisfies

⟨c⟩ = (⟨¬b⟩ + 1 mod n)
    = (n − ⟨b⟩ mod n) ,

which follows from the subtraction algorithm for binary numbers (Lemma
2.15).
The output d of the multiplexer then satisfies

⟨d⟩ = ⟨b⟩              f = 0
      n − ⟨b⟩ mod n    f = 1 .

The correctness of the construction now follows from Lemma 5.6.


An n-shift unit n-SU (see Fig. 87) has
• inputs a[n − 1 : 0] providing the data to be shifted,
• inputs b[k − 1 : 0] determining the shift distance,
• inputs sf [1 : 0] determining the kind of shift to be executed, and

Fig. 87. Symbol of an n-shift unit

Fig. 88. Right-left shifter of an n-shift unit

Fig. 89. Mask computation of an n-shift unit

• outputs sures[n − 1 : 0] satisfying




sures = sll(a, b)    sf = 00
        srl(a, b)    sf = 10
        sra(a, b)    sf = 11 .

A construction of an n-SU is shown in Figs. 88, 89, 90.


Let i = b. Then the cyclic right-left shifter in Fig. 88 produces output

Fig. 90. Result computation of an n-shift unit

Table 7. Specification of branch condition evaluation


bf [3 : 0]   bcres
0010         [a] < 0
0011         [a] ≥ 0
100∗         a = b
101∗         a ≠ b
110∗         [a] ≤ 0
111∗         [a] > 0


r = a[n − i − 1 : 0]a[n − 1 : n − i]    sf [1] = 0
    a[i − 1 : 0]a[n − 1 : i]            sf [1] = 1 .

The output of the circuit in Fig. 89 produces a mask



mask = 0n−i 1i    sf [1] = 0
       1i 0n−i    sf [1] = 1 .

For each index j ∈ [0 : n − 1], the multiplexer in Fig. 90 replaces the shifter
output r[j] by the fill bit if this is indicated by the mask bit mask[j]. As a
result we get

sures = a[n − i − 1 : 0]filli    sf [1] = 0
        filli a[n − 1 : i]       sf [1] = 1 .

By setting
fill = sf [0] ∧ a[n − 1] ,
we conclude

sures = sll(a, i)    sf = 00
        srl(a, i)    sf = 10
        sra(a, i)    sf = 11 .
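The interplay of the right-left shifter, the mask, and the fill bit can be checked with a few lines of Python. The sketch reuses slc, src, sll, srl, and sra from the sketches above and takes the control bits sf[1] and sf[0] as separate arguments; these encodings are assumptions of the sketch.

def su_construction(a, i, sf1, sf0):
    n = len(a)
    r = slc(a, i) if sf1 == 0 else src(a, i)                       # right-left shifter (Fig. 88)
    mask = [1 if (j < i if sf1 == 0 else j >= n - i) else 0        # mask of Fig. 89
            for j in range(n)]
    fill = sf0 & a[n - 1]                                          # sign bit only for sra
    return [fill if mask[j] else r[j] for j in range(n)]           # result muxes of Fig. 90

# The construction agrees with the shift unit specification:
for bits in range(2 ** 8):
    a = [(bits >> j) & 1 for j in range(8)]
    for i in range(8):
        assert su_construction(a, i, 0, 0) == sll(a, i)
        assert su_construction(a, i, 1, 0) == srl(a, i)
        assert su_construction(a, i, 1, 1) == sra(a, i)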

5.5 Branch Condition Evaluation Unit


An n-BCE (see Fig. 91) has

Fig. 91. Symbol of an n-branch condition evaluation unit

Fig. 92. Computation of auxiliary signals in an n-branch condition evaluation unit

• inputs a[n − 1 : 0], b[n − 1 : 0] ∈ Bn ,


• inputs bf [3 : 0] ∈ B4 selecting the condition to be tested,
• output bcres ∈ B specified by Table 7.
The auxiliary circuit in Fig. 92 computes obvious auxiliary signals satisfying

d = b ∧ (bf [3] ∧ ¬bf [2])
  = b       bf [3 : 2] = 10
    0n      otherwise

eq ≡ a = d
   ≡ a = b       bf [3 : 2] = 10
     [a] = 0     otherwise

neq = ¬eq
lt = [a] < 0

le ≡ [a] < 0 ∨  a = b       bf [3 : 2] = 10
                [a] = 0     otherwise .

The result bcres can then be computed as

bcres ≡ bf [3 : 1] = 001 ∧ (¬bf [0] ∧ lt ∨ bf [0] ∧ ¬lt)
      ∨ bf [3 : 2] = 10 ∧ (¬bf [1] ∧ eq ∨ bf [1] ∧ ¬eq)
      ∨ bf [3 : 2] = 11 ∧ (¬bf [1] ∧ le ∨ bf [1] ∧ ¬le)
      ≡ ¬bf [3] ∧ ¬bf [2] ∧ bf [1] ∧ (bf [0] ⊕ lt)
      ∨ bf [3] ∧ ¬bf [2] ∧ (bf [1] ⊕ eq)
      ∨ bf [3] ∧ bf [2] ∧ (bf [1] ⊕ le) .
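A direct executable reading of Table 7, with a and b encoded as 32-bit unsigned integers and bf as a 4-character bit string (both encodings are assumptions of the sketch):

N = 32

def tc(x):                        # [x] : two's complement value
    return x - (1 << N) if x >> (N - 1) else x

def bcres(a, b, bf):
    """Branch condition of Table 7; rows ending in '*' depend only on bf[3:1]."""
    if bf == "0010":
        return tc(a) < 0
    if bf == "0011":
        return tc(a) >= 0
    if bf.startswith("100"):
        return a == b
    if bf.startswith("101"):
        return a != b
    if bf.startswith("110"):
        return tc(a) <= 0
    if bf.startswith("111"):
        return tc(a) > 0
    raise ValueError("bf code not specified in Table 7")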
6
A Basic Sequential MIPS Machine

We define the basic MIPS instruction set architecture (ISA) without delayed
branch, interrupt mechanism and devices. The first Sect. 6.1 of this chapter is
very short. It contains a very compact summary of the instruction set archi-
tecture (and the assembly language) in the form of tables, which define the
ISA if one knows how to interpret them. In Sect. 6.2 we provide a succinct
and completely precise interpretation of the tables, leaving out only the co-
processor instructions and the system call instruction. From this we derive in
Sect. 6.3 the hardware of a sequential, i.e., not pipelined, MIPS processor and
provide a proof that this processor construction is correct.
This chapter differs from its counterpart in [12] in several ways:
• The ISA is MIPS instead of DLX. Most of the resulting modifications are
already handled in the control logic of the ALU and the shift unit1 .
• The machine implements each instruction in one very long hardware cycle
and uses only precomputed control. It is not meant to be an efficient se-
quential implementation and serves later only as a reference machine. This
turns most portions of the correctness proof into straightforward bookkeep-
ing exercises, which would be terribly boring if presented in the classroom.
We included this bookkeeping only as a help for readers, who want to use
this book as a blueprint for formal proofs.
• Because the byte addressable memory of the ISA is embedded in the imple-
mentation into a 64-bit wide hardware memory, shifters have to be used
both for the load and store operations of words, half words, and bytes.
In [12] the memory is 32 bits wide, the shifters for loads and stores are
present; they must be used for accesses of half words or bytes. However, [12]
provides no proof that with the help of these shifters loads and stores of
half words or bytes work correctly. Subsequent formal correctness proofs
for hardware from [12] as presented in [1, 3, 6] restricted loads and stores
to word accesses, and thus, did not provide these proofs either. We present
1
In contrast to [9] we do not tie register 0 to 0. We also do not consider interrupts
and address translation in this book.


these proofs here; they hinge on the software condition that accesses are
aligned and turn out to be not completely trivial.

6.1 Tables
In the “Effect” row of the tables we use the following shorthands: m =
md (ea(c)) where ea(c) = rs(c)+32 sxtimm(c), rx = gpr(rx(c)) for x ∈ {t, s, d}
(except for the coprocessor instructions, where rd = rd(c) and rt = rt(c)), and
iindex stands for iindex(c)2 . In the table for J-type instructions, R31 stands
for gpr(315 ). Arithmetic operations + and − are modulo 232 . Sign extension
is denoted by sxt and zero extension by zxt.

6.1.1 I-type
opcode Instruction Syntax d Effect
Data Transfer
100 000 lb lb rt rs imm 1 rt = sxt(m)
100 001 lh lh rt rs imm 2 rt = sxt(m)
100 011 lw lw rt rs imm 4 rt = m
100 100 lbu lbu rt rs imm 1 rt = 024 m
100 101 lhu lhu rt rs imm 2 rt = 016 m
101 000 sb sb rt rs imm 1 m = rt[7:0]
101 001 sh sh rt rs imm 2 m = rt[15:0]
101 011 sw sw rt rs imm 4 m = rt
Arithmetic, Logical Operation, Test-and-Set
001 000 addi addi rt rs imm rt = rs + sxt(imm)
001 001 addiu addiu rt rs imm rt = rs + sxt(imm)
001 010 slti slti rt rs imm rt = ([rs] < [sxt(imm)] ? 132 : 032 )
001 011 sltiu sltiu rt rs imm rt = (rs < sxt(imm) ? 132 : 032 )
001 100 andi andi rt rs imm rt = rs ∧ zxt(imm)
001 101 ori ori rt rs imm rt = rs ∨ zxt(imm)
001 110 xori xori rt rs imm rt = rs ⊕ zxt(imm)
001 111 lui lui rt imm rt = imm016
opcode rt Instr. Syntax Effect
Branch
000 001 00000 bltz bltz rs imm pc = pc + ([rs] < 0 ? sxt(imm00) : 432 )
000 001 00001 bgez bgez rs imm pc = pc + ([rs] ≥ 0 ? sxt(imm00) : 432 )
000 100 beq beq rs rt imm pc = pc + (rs = rt ? sxt(imm00) : 432 )
000 101 bne bne rs rt imm pc = pc + (rs = rt ? sxt(imm00) : 432 )
000 110 00000 blez blez rs imm pc = pc + ([rs] ≤ 0 ? sxt(imm00) : 432 )
000 111 00000 bgtz bgtz rs imm pc = pc + ([rs] > 0 ? sxt(imm00) : 432 )

2
Formal definitions for predicates and functions used here are given in Sect. 6.2.

6.1.2 R-type

opcode fun Instruction Syntax Effect


Shift Operation
000000 000 000 sll sll rd rt sa rd = sll(rt,sa)
000000 000 010 srl srl rd rt sa rd = srl(rt,sa)
000000 000 011 sra sra rd rt sa rd = sra(rt,sa)
000000 000 100 sllv sllv rd rt rs rd = sll(rt,rs)
000000 000 110 srlv srlv rd rt rs rd = srl(rt,rs)
000000 000 111 srav srav rd rt rs rd = sra(rt,rs)
Arithmetic, Logical Operation
000000 100 000 add add rd rs rt rd = rs + rt
000000 100 001 addu addu rd rs rt rd = rs + rt
000000 100 010 sub sub rd rs rt rd = rs − rt
000000 100 011 subu subu rd rs rt rd = rs − rt
000000 100 100 and and rd rs rt rd = rs ∧ rt
000000 100 101 or or rd rs rt rd = rs ∨ rt
000000 100 110 xor xor rd rs rt rd = rs ⊕ rt
000000 100 111 nor nor rd rs rt rd = ¬(rs ∨ rt)
Test Set Operation
000000 101 010 slt slt rd rs rt rd = ([rs] < [rt] ? 132 : 032 )
000000 101 011 sltu sltu rd rs rt rd = (rs < rt ? 132 : 032 )
Jumps, System Call
000000 001 000 jr jr rs pc = rs
000000 001 001 jalr jalr rd rs rd = pc + 432 pc = rs
000000 001 100 sysc sysc System Call
Coprocessor Instructions
opcode fun rs Instruction Syntax Effect
010000 011 000 10000 eret eret Exception Return
010000 00100 movg2s movg2s rd rt spr[rd] := gpr[rt]
010000 00000 movs2g movs2g rd rt gpr[rt] := spr[rd]

6.1.3 J-type

opcode Instr. Syntax Effect


Jumps
000 010 j j iindex pc = (pc+432 )[31:28]iindex00
000 011 jal jal iindex pc = (pc+432 )[31:28]iindex00 R31 = pc + 432

Fig. 93. Visible data structures of MIPS ISA

6.2 MIPS ISA

In this section we give the precise formal interpretation of the basic MIPS
ISA without the coprocessor instructions and the system call instruction.

6.2.1 Configuration and Instruction Fields

A basic MIPS configuration c has only three user visible data structures
(Fig. 93):
• c.pc ∈ B32 – the program counter (PC).
• c.gpr : B5 → B32 – the general purpose register (GPR) file consisting of 32
registers, each 32 bits wide. For register addresses x ∈ B5 , the content of
general purpose register x in configuration c is denoted by c.gpr(x) ∈ B32 .
• c.m : B32 → B8 – the processor memory. It is byte addressable; addresses
have 32 bits. Thus, for memory addresses a ∈ B32 , the content of memory
location a in configuration c is denoted by c.m(a) ∈ B8 .
Program counter and general purpose registers belong to the central process-
ing unit (CPU).
Let K be the set of all basic MIPS configurations. A mathematical defini-
tion of the ISA will be given by a function

δ:K→K,

where
c′ = δ(c)
is the configuration reached from configuration c, if the next instruction is
executed. An ISA computation is a sequence (ci ) of ISA configurations with
i ∈ N satisfying

c0 .pc = 032
ci+1 = δ(ci ) ,

i.e., initially the program counter points to address 032 and in each step one
instruction is executed. In the remainder of this section we specify the ISA
simply by specifying function δ, i.e., by specifying c′ = δ(c) for all configura-
tions c.
Recall that for numbers y ∈ Bn we abbreviate the binary representation
of y with n bits as
yn = binn (y) ,
e.g., 18 = 00000001 and 38 = 00000011. For memories m : B32 → B8 , ad-
dresses a ∈ B32 , and numbers d of bytes, we define the content of d consecutive
memory bytes starting at address a informally by

md (a) = m(a +32 (d − 1)32 ) ◦ . . . ◦ m(a).

Formally, we define it in the inductive form as

m1 (a) = m(a)
md+1 (a) = m(a +32 d32 ) ◦ md (a) .
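The following small Python sketch mirrors this definition; memory is modeled as a dictionary from 32-bit addresses to byte values, which is an assumption of the sketch. Byte i of the returned list corresponds to byte(i, m_d(a)).

MASK32 = (1 << 32) - 1

def md(m, a, d):
    """m_d(a) as a list of bytes; index i is byte(i, m_d(a)) = m(a +_32 i_32)."""
    return [m.get((a + i) & MASK32, 0) for i in range(d)]

# Example: the four bytes of a word stored at address 0x100.
m = {0x100: 0xEF, 0x101: 0xBE, 0x102: 0xAD, 0x103: 0xDE}
assert md(m, 0x100, 4) == [0xEF, 0xBE, 0xAD, 0xDE]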

The current instruction I(c) to be executed in configuration c is defined by


the 4 bytes in memory addressed by the current program counter:

I(c) = c.m4 (c.pc) .

Because all instructions are 4 bytes long, one requires that instructions are
aligned on 4 byte boundaries 3 or, equivalently, that

c.pc[1 : 0] = 00 .

The six high order bits of the current instruction are called the op-code:

opc(c) = I(c)[31 : 26] .

There are three instruction types: R-, J-, and I-type. The current instruction
type is determined by the following predicates:

rtype(c) ≡ opc(c) = 0∗04


jtype(c) ≡ opc(c) = 04 1∗
itype(c) = ¬(rtype(c) ∨ jtype(c)) .

3
In case this condition is violated a so called misalignment interrupt is raised. Since
we do not treat interrupts in our construction, we require all ISA computations
to have only aligned instructions.

I-type:  opc = I[31 : 26]   rs = I[25 : 21]   rt = I[20 : 16]   imm = I[15 : 0]
R-type:  opc = I[31 : 26]   rs = I[25 : 21]   rt = I[20 : 16]   rd = I[15 : 11]   sa = I[10 : 6]   fun = I[5 : 0]
J-type:  opc = I[31 : 26]   iindex = I[25 : 0]
Fig. 94. Types and fields of MIPS instructions

Depending on the instruction type, the bits of the current instruction are sub-
divided as shown in Fig. 94. Register addresses are specified in the following
fields of the current instruction:

rs(c) = I(c)[25 : 21]


rt(c) = I(c)[20 : 16]
rd(c) = I(c)[15 : 11] .

For R-type instructions, ALU-functions to be applied to the register operands


can be specified in the function field:

f un(c) = I(c)[5 : 0] .

Three kinds of immediate constants are specified: the shift amount sa in R-


type instructions, the immediate constant imm in I-type instructions, and an
instruction index iindex in J-type (like jump) operations:

sa(c) = I(c)[10 : 6]
imm(c) = I(c)[15 : 0]
iindex(c) = I(c)[25 : 0] .

Immediate constant imm has 16 bits. In order to apply ALU functions to


it, the constant can be extended with 16 high order bits in two ways: zero
extension and sign extension:

zxtimm(c) = 016 imm(c)


sxtimm(c) = imm(c)[15]16 imm(c)
= I(c)[15]16 imm(c) .

In case of sign extension, Lemma 2.14 guarantees that the value of the constant
interpreted as a two’s complement number does not change:

[sxtimm(c)] = [imm(c)] .
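The field and immediate definitions above can be expressed compactly in Python. The sketch below assumes the instruction word I is given as a 32-bit unsigned integer; the helper names mirror the text.

def bits(I, hi, lo):                  # I[hi : lo] as an integer
    return (I >> lo) & ((1 << (hi - lo + 1)) - 1)

def opc(I):    return bits(I, 31, 26)
def rs(I):     return bits(I, 25, 21)
def rt(I):     return bits(I, 20, 16)
def rd(I):     return bits(I, 15, 11)
def sa(I):     return bits(I, 10, 6)
def fun(I):    return bits(I, 5, 0)
def imm(I):    return bits(I, 15, 0)
def iindex(I): return bits(I, 25, 0)

def zxtimm(I):                        # zero extension: 0^16 imm
    return imm(I)

def sxtimm(I):                        # sign extension: imm[15]^16 imm
    return imm(I) | (0xFFFF0000 if imm(I) >> 15 else 0)

# Example: an addi instruction (opcode 001000) with rt = 8 and imm = 0xFFFF (i.e., -1).
I = (0b001000 << 26) | (8 << 16) | 0xFFFF
assert (opc(I), rt(I), sxtimm(I)) == (0b001000, 8, 0xFFFFFFFF)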

6.2.2 Instruction Decoding


For every mnemonic mn of a MIPS instruction from the tables above, we
define a predicate mn(c) which is true, if instruction mn is to be executed in
configuration c. For instance,
lw(c) ≡ opc(c) = 100011
bltz(c) ≡ opc(c) = 05 1 ∧ rt(c) = 05
add(c) ≡ rtype(c) ∧ f un(c) = 105 .
The remaining predicates directly associated to the mnemonics of the assem-
bly language are derived in the same way from the tables. We group the basic
instruction set into 5 groups and define for each group a predicate that holds,
if an instruction from that group is to be executed:
• ALU-operations of I-type are recognized by the leading three bits of the
opcode, resp. I(c)[31 : 29]; ALU-operations of R-type - by the two leading
bits of the function code, resp. I(c)[5 : 4]:
alur(c) ≡ rtype(c) ∧ f un(c)[5 : 4] = 10
≡ rtype(c) ∧ I(c)[5 : 4] = 10
alui(c) ≡ itype(c) ∧ opc(c)[5 : 3] = 001
≡ itype(c) ∧ I(c)[31 : 29] = 001
alu(c) = alur(c) ∨ alui(c) .
• Shift unit operations are of R-type and are recognized by the three leading
bits of the function code. If bit f un(c)[2] of the function code is on, the
shift distance is taken from register specified by rs(c)4 :
su(c) ≡ rtype(c) ∧ f un(c)[5 : 3] = 000
≡ rtype(c) ∧ I(c)[5 : 3] = 000
suv(c) = su(c) ∧ f un(c)[2]
= su(c) ∧ I(c)[2] .
• Loads and stores are of I-type and are recognized by the three leading bits
of the opcode:
l(c) ≡ opc(c)[5 : 3] = 100
≡ I(c)[31 : 29] = 100
s(c) ≡ opc(c)[5 : 3] = 101
≡ I(c)[31 : 29] = 101
ls(c) = l(c) ∨ s(c)
≡ opc(c)[5 : 4] = 10
≡ I(c)[31 : 30] = 10 .
4
Mnemonics with suffix v as “variable”; one would expect instead for the other
shifts a suffix i as “immediate”.

• Branches are of I-Type and are recognized by the three leading bits of the
opcode:

b(c) ≡ itype(c) ∧ opc(c)[5 : 3] = 000


≡ itype(c) ∧ I(c)[31 : 29] = 000 .

• Jumps are defined in a brute force way:

jump(c) = jr(c) ∨ jalr(c) ∨ j(c) ∨ jal(c)


jb(c) = jump(c) ∨ b(c) .
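As an illustration, the grouping predicates translate into the following Python sketch, reusing the field helpers from the previous sketch; only the groups defined above are shown, and the integer encoding of instructions remains an assumption.

def rtype(I): return opc(I) in (0b000000, 0b010000)      # opc = 0*0^4
def jtype(I): return opc(I) in (0b000010, 0b000011)      # opc = 0^4 1*
def itype(I): return not (rtype(I) or jtype(I))

def alu(I):   return (rtype(I) and fun(I) >> 4 == 0b10) or (itype(I) and opc(I) >> 3 == 0b001)
def su(I):    return rtype(I) and fun(I) >> 3 == 0b000
def l(I):     return opc(I) >> 3 == 0b100
def s(I):     return opc(I) >> 3 == 0b101
def b(I):     return itype(I) and opc(I) >> 3 == 0b000

def jr(I):    return opc(I) == 0 and fun(I) == 0b001000
def jalr(I):  return opc(I) == 0 and fun(I) == 0b001001
def jump(I):  return jr(I) or jalr(I) or jtype(I)
def jb(I):    return jump(I) or b(I)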

6.2.3 ALU-Operations

We can now go through the ALU-operations in the tables one by one and give
them precise interpretations. We do this for two examples.

add(c)

The table specifies the effect as rd = rs + rt. This is to be interpreted as an


operation on the corresponding register contents: on the right hand side of
the equation – for configuration c, i.e., before execution of the instruction; on
the left hand side – for configuration c′:

c′.gpr(rd(c)) = c.gpr(rs(c)) +32 c.gpr(rt(c)) .

Other register contents and the memory content do not change:

c′.gpr(x) = c.gpr(x) for x ≠ rd(c)
c′.m = c.m .

The program counter is advanced by four bytes to the next instruction:

c′.pc = c.pc +32 432 .

addi(c)

The second operand is now the sign extended immediate constant:



c′.gpr(x) = c.gpr(rs(c)) +32 sxtimm(c)    x = rt(c)
            c.gpr(x)                      otherwise
c′.m = c.m
c′.pc = c.pc +32 432 .

It is clear how to derive precise specifications for the remaining ALU-


operations, but we take a shortcut exploiting the fact that we have already
constructed an ALU that was specified in Table 6.

This table defines functions alures(a, b, af, i) and ovf (a, b, af, i). As we do
not treat interrupts in this book, we use only the first of these functions here.
We observe that in all ALU operations a function of the ALU is performed.
The left operand is always

lop(c) = c.gpr(rs(c)) .
For R-type operations the right operand is the register specified by the rt
field of R-type instructions. For I-type instructions it is the sign extended
immediate operand if opc(c)[2] = I(c)[28] = 0 or zero extended immediate
operand if opc(c)[2] = 1. Thus, we define immediate fill bit ifill (c), extended
immediate constant xtimm(c), and right operand rop(c) in the following way:

ifill (c) = imm(c)[15]    opc(c)[2] = 0
            0             opc(c)[2] = 1
          = imm(c)[15] ∧ ¬opc(c)[2]
          = imm(c)[15] ∧ ¬I(c)[28]

xtimm(c) = sxtimm(c)    opc(c)[2] = 0
           zxtimm(c)    opc(c)[2] = 1
         = ifill (c)16 imm(c)

rop(c) = c.gpr(rt(c))    rtype(c)
         xtimm(c)        otherwise .


Comparing Table 6 with the tables for I-type and R-type instructions we see
that bits af [2 : 0] of the ALU control can be taken from the low order fields of
the opcode for I-type instructions and from the low order bits of the function
field for R-type instructions:

af (c)[2 : 0] = f un(c)[2 : 0]     rtype(c)
                opc(c)[2 : 0]      otherwise
              = I(c)[2 : 0]        rtype(c)
                I(c)[28 : 26]      otherwise .

For bit af [3] things are more complicated. For R-type instructions it can be
taken from the function code. For I-type instructions it must only be forced
to 1 for the two test and set operations, which can be recognized by opc(c)[2 :
1] = 01:

af (c)[3] = f un(c)[3]                  rtype(c)
            ¬opc(c)[2] ∧ opc(c)[1]      otherwise
          = I(c)[3]                     rtype(c)
            ¬I(c)[28] ∧ I(c)[27]        otherwise .

The i-input of the ALU distinguishes for af [3 : 0] = 0111 between the lui-
instruction of I-type for i = 0 and the nor-instruction of R-type for i = 1.
Thus, we set it to itype(c). The result of the ALU computed with these inputs
is denoted by

ares(c) = alures(lop(c), rop(c), af (c), itype(c)) .

Depending on the instruction type, the destination register rdes is specified


by the rd field or the rt field:

rdes(c) = rd(c)    rtype(c)
          rt(c)    otherwise .

A summary of all ALU operations is then

alu(c) →
c′.gpr(x) = ares(c)     x = rdes(c)
            c.gpr(x)    otherwise
c′.m = c.m
c′.pc = c.pc +32 432 .

6.2.4 Shift Unit Operations

Shift operations come in two flavors: i) for f un(c)[2] = 0 the shift distance
sdist(c) is an immediate operand specified by the sa field of the instruction;
ii) for f un(c)[2] = 1 the shift distance is specified by the last bits of the register
specified by the rs field:

sdist(c) = sa(c)                   f un(c)[2] = 0
           c.gpr(rs(c))[4 : 0]     f un(c)[2] = 1 .

The left operand that is shifted is always the register specified by the rt-field:

slop(c) = c.gpr(rt(c)) .

and the control bits sf [1 : 0] are taken from the low order bits of the function
field:
sf (c) = f un(c)[1 : 0] .
The result of the shift unit computed with these inputs is denoted by

sres(c) = sures(slop(c), sdist(c), sf (c)) .

For shift operations the destination register is always specified by the rd field.
Thus, the shift unit operations can be summarized as

su(c) →
c′.gpr(x) = sres(c)     x = rd(c)
            c.gpr(x)    otherwise
c′.m = c.m
c′.pc = c.pc +32 432 .

6.2.5 Branch and Jump

A branch condition evaluation unit was specified in Table 7. It computes the


function bcres(a, b, bf ). We use this function with the following parameters:
blop(c) = c.gpr(rs(c))
brop(c) = c.gpr(rt(c))
bf (c) = opc(c)[2 : 0] ◦ rt(c)[0]
= I(c)[28 : 26]I[16] .
and define the result of a branch condition evaluation as
bres(c) = bcres(blop(c), brop(c), bf (c)) .
The next program counter c .pc is usually computed as c.pc +32 432 . This
order is only changed in jump instructions or in branch instructions, where
the branch is taken, i.e., the branch condition evaluates to 1. We define
jbtaken(c) = jump(c) ∨ b(c) ∧ bres(c) .
In case of a jump or a branch taken, there are three possible jump targets

Branch Instructions b(c)

Branch instructions involve a relative branch. The PC is incremented by a


branch distance:
b(c) ∧ bres(c) →
bdist(c) = imm(c)[15]14 imm(c)00
btarget(c) = c.pc +32 bdist(c) .
Note that the branch distance is a kind of a sign extended immediate constant,
but due to the alignment requirement the low order bits of the jump distance
must be 00. Thus, one uses the 16 bits of the immediate constant for bits
[17 : 2] of the jump distance. Sign extension is used for the remaining bits.
Note also that address arithmetic is modulo 232 . We have
c.pc + bdist(c) ≡ [c.pc] + [imm(c)00] mod 232 .
Thus, backward jumps are realized with negative [imm(c)].

R-type Jumps jr(c) and jalr(c)

The branch target is specified by the rs field of the instruction:

jr(c) ∨ jalr(c) → btarget(c) = c.gpr(rs(c)) .

J-type Jumps j(c) and jal(c)

The branch target is computed in a rather peculiar way: i) the PC is incre-


mented by 4, ii) then bits [27 : 0] are replaced by the iindex field of the
instruction:

j(c) ∨ jal(c) → btarget(c) = (c.pc +32 432 )[31 : 28]iindex(c)00 .

Now we can define the next PC computation for all instructions as




btarget(c) = c.pc +32 bdist(c)                        b(c) ∧ bres(c)
             c.gpr(rs(c))                             jr(c) ∨ jalr(c)
             (c.pc +32 432 )[31 : 28]iindex(c)00      otherwise

c′.pc = btarget(c)       jbtaken(c)
        c.pc +32 432     otherwise .
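A minimal Python sketch of this next-PC computation: pc is a 32-bit integer, gpr a list of 32 register values, I the instruction word, and bres the value of the branch condition. The decode and field helpers (b, jr, jalr, jtype, rs, imm, iindex) reuse the sketches above; everything else is an assumption of the sketch.

MASK32 = (1 << 32) - 1

def bdist(I):
    """imm[15]^14 imm 00 : the sign extended, word aligned branch distance."""
    return (imm(I) << 2) | (0xFFFC0000 if imm(I) >> 15 else 0)

def next_pc(pc, gpr, I, bres):
    if b(I) and bres:                            # taken branch: relative target
        return (pc + bdist(I)) & MASK32
    if jr(I) or jalr(I):                         # absolute target from register rs
        return gpr[rs(I)]
    if jtype(I):                                 # j, jal: replace bits [27:0] of pc + 4
        return (((pc + 4) & MASK32) & 0xF0000000) | (iindex(I) << 2)
    return (pc + 4) & MASK32                     # default: next instruction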

Jump and Link jal(c) and jalr(c)

Jump and link instructions are used to implement calls of procedures. Besides
setting the PC to the branch target, they prepare the so called link address

linkad(c) = c.pc +32 432

and save it in a register. For the R-type instruction jalr, this register is
specified by the rd field. J-type instruction jal does not have an rd field, and
the link address is stored in register 31 (= 15 ). Branch and jump instructions
do not change the memory.
Therefore, for the update of registers in branch and jump instructions, we
have:

jb(c) →
c′.gpr(x) = linkad(c)    jalr(c) ∧ x = rd(c) ∨ jal(c) ∧ x = 15
            c.gpr(x)     otherwise
c′.m = c.m .

6.2.6 Sequences of Consecutive Memory Bytes

Recall that for i ∈ [0 : k − 1] we define byte i of string a ∈ B8·k as

byte(i, a) = a[8 · (i + 1) − 1 : 8 · i] .

A trivial observation is stated in the following lemma.


Lemma 6.1 (bytes of concatenation). Let a ∈ B8 , let b ∈ B8·d and let
c = a ◦ b. Then,

∀i ∈ [d : 0] : byte(i, c) = a             i = d
                            byte(i, b)    i < d

Proof.

byte(i, c) = (a ◦ b)[8 · (i + 1) − 1 : 8 · i]
           = a                              i = d
             b[8 · (i + 1) − 1 : 8 · i]     i < d
           = a                              i = d
             byte(i, b)                     i < d .



The state of byte addressable memory with 32-bit addresses is modeled as a
mapping
m : B32 → B8 ,
where for each address x ∈ B32 one interprets m(x) ∈ B8 as the current
value of memory location x. Recall that we defined the content md (x) of d
consecutive locations starting at address x by

m1 (x) = m(x)
md+1 (x) = m(x +32 d32 ) ◦ md (x) .

The following simple lemma is used to localize bytes in sequences of consecu-


tive memory locations.
Lemma 6.2 (bytes in sequences).

∀i < d : byte(i, md(x)) = m(x +32 i32 )

Proof. By induction on d. For d = 1 we have i = 0. Thus, i32 = 032 and

byte(0, m1(x)) = m(x) = m(x +32 032 ) = m(x +32 i32 ) .

For the induction step from d to d + 1, we have by Lemma 6.1, definition


md+1 (x), and the induction hypothesis:

∀i < d + 1 : byte(i, md+1 (x)) = m(x +32 d32 )       i = d
                                 byte(i, md (x))     i < d
                               = m(x +32 i32 )       i = d
                                 m(x +32 i32 )       i < d
                               = m(x +32 i32 ) .




6.2.7 Loads and Stores

Load and store operations access a certain number d(c) ∈ {1, 2, 4} of bytes
of memory starting at a so called effective address ea(c). Letters b,h, and w
in the mnemonics define the width: b stands for d = 1 resp. a byte access; h
stands for d = 2 resp. a half word access, and w stands for d = 4 resp. a word
access. Inspection of the instruction tables gives


d(c) = 1    opc(c)[0] = 0
       2    opc(c)[1 : 0] = 01
       4    opc(c)[1 : 0] = 11
     = 1    I(c)[26] = 0
       2    I(c)[27 : 26] = 01
       4    I(c)[27 : 26] = 11 .

Addressing is always relative to a register specified by the rs field. The offset


is specified by the immediate field:

ea(c) = c.gpr(rs(c)) +32 sxtimm(c) .

Note that the immediate constant is sign extended. Thus, negative offsets can
be realized in the same way as negative branch distances. Effective addresses
are required to be aligned. If we interpret them as binary numbers they have
to be divisible by the width d(c):

d(c) | ea(c)

or equivalently

ls(c) ∧ d(c) = 2 → ea(c)[0] = 0 , ls(c) ∧ d(c) = 4 → ea(c)[1 : 0] = 00 .
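A small sketch of the access width d(c), the effective address ea(c), and the alignment requirement; the decode and field helpers are reused from the sketches above, and gpr is a list of 32 register values (an assumption of the sketch).

MASK32 = (1 << 32) - 1

def d(I):
    """Access width in bytes (1, 2, or 4), from the low opcode bits."""
    if opc(I) & 0b1 == 0:
        return 1
    return 2 if opc(I) & 0b11 == 0b01 else 4

def ea(gpr, I):
    """Effective address: base register plus sign extended immediate, modulo 2^32."""
    return (gpr[rs(I)] + sxtimm(I)) & MASK32

def aligned(gpr, I):
    """d(c) must divide the effective address."""
    return ea(gpr, I) % d(I) == 0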

Stores

A store instruction takes the low order d(c) bytes of the register specified by
the rt field and stores them as md(c) (ea(c)). The PC is incremented by 4 (but

we have already defined that on page 128). Other memory bytes and register
values are not changed:

s(c) →
c′.m(x) = byte(i, c.gpr(rt(c)))     x = ea(c) +32 i32 ∧ i < d(c)
          c.m(x)                    otherwise
c′.gpr = c.gpr

A word of caution in case you plan to enter this into a CAV system: the
first case of the “definition” of c .m(x) is very well understandable for humans,
but actually it is a shorthand for the following: if

∃i : x = ea(c) +32 i32

then update c.m(x) with the hopefully unique i satisfying this condition. In
this case we can compute this i by solving the equation

x = ea(c) +32 i32

resp.
x = (ea(c) + i mod 232 ) .
From alignment we conclude

ea(c) + i ≤ 232 − 1 .

Hence,
(ea(c) + i mod 232 ) = ea(c) + i .
And we have to solve
x = ea(c) + i
as
i = x − ea(c) .
This turns the above definition into

c′.m(x) = byte(x − ea(c), c.gpr(rt(c)))     x − ea(c) ∈ [0 : d(c) − 1]
          c.m(x)                            otherwise ,

which is not so readable for humans.

Loads

Loads, like stores, access d(c) bytes of memory starting at address ea(c). The
result is stored in the low order d(c) bytes of the destination register, which is
specified by the rt field of the instruction. This leaves 32 − 8 · d(c) bits of the
destination register to be filled by some bit fill (c). For unsigned loads (with a

suffix “u” in the mnemonics) the fill bit is zero; otherwise it is sign extended
by the leading bit of c.md(c) (ea(c)). In this way a load result lres(c) ∈ B32 is
computed and the general purpose register specified by the rt field is updated.
Other registers and the memory are left unchanged:

u(c) = opc(c)[2]

fill (c) = 0                                       u(c)
           c.m(ea(c) +32 (d(c) − 1)32 )[7]         otherwise

lres(c) = fill (c)32−8·d(c) c.md(c) (ea(c))

l(c) →
c′.gpr(x) = lres(c)     x = rt(c)
            c.gpr(x)    otherwise
c′.m = c.m .

6.2.8 ISA Summary

We collect all previous definitions of destination registers for the general pur-
pose register file into


Cad(c) = 15       jal(c)
         rd(c)    rtype(c)
         rt(c)    otherwise .

Also we collect the data gprin to be written into the general purpose register
file. For technical reasons, we define on the way an intermediate result C that
collects the possible GPR input from arithmetic, shift, and jump instructions:


C(c) = sres(c)      su(c)
       linkad(c)    jal(c) ∨ jalr(c)
       ares(c)      otherwise

gprin(c) = lres(c)    l(c)
           C(c)       otherwise .

Finally, we collect in a general purpose register write signal all situations when
some general purpose register is updated:

gprw(c) = alu(c) ∨ su(c) ∨ l(c) ∨ jal(c) ∨ jalr(c) .

Now we can summarize the MIPS ISA in three rules concerning the updates
of PC, general purpose registers, and memory:

a = a.l ◦ a.o  with  a.l = a[31 : 3]  and  a.o = a[2 : 0]
Fig. 95. Line address a.l and offset a.o of a byte address a


c′.pc = btarget(c)       jbtaken(c)
        c.pc +32 432     otherwise

c′.gpr(x) = gprin(c)     x = Cad(c) ∧ gprw(c)
            c.gpr(x)     otherwise

c′.m(x) = byte(i, c.gpr(rt(c)))     x = ea(c) +32 i32 ∧ i < d(c) ∧ s(c)
          c.m(x)                    otherwise .

6.3 A Sequential Processor Design


From the ISA specification, we derive a hardware implementation of the basic
MIPS processor. It will execute every MIPS instruction in a single hardware
cycle, and it will be so close to the ISA specification that the correctness proof
is reduced to a very simple bookkeeping exercise. This basic implementation,
however, is far from naive. In the following chapter we turn this implementa-
tion into a provably correct pipelined processor design with almost ridiculously
little effort.

6.3.1 Software Conditions

As was required in the ISA specification, the hardware implementation only


needs to work if all memory accesses of the ISA computation (ci ) are aligned,
i.e.,

∀i > 0 : ci .pc[1 : 0] = 00 ∧
ls(ci ) → (d(ci ) = 2 → ea(ci )[0] = 0 ∧
d(ci ) = 4 → ea(ci )[1 : 0] = 00) .

As illustrated in Fig. 95, we divide addresses a ∈ B32 into line address a.l ∈
B29 and offset a.o ∈ B3 as

a.l = a[31 : 3]
a.o = a[2 : 0] .

When we later introduce caches, a.l will be the address of a cache line in the
cache and a.o will denote the offset of a byte in the cache line.

For the time being we will assume that there is a code region CR ⊂ B29
such that all instructions are fetched from addresses with a line address in CR.
We also assume that there is a data region DR ⊂ B29 such that all addresses
of loads and stores have a line address in DR:
∀i : ci .pc.l ∈ CR
∀i : ls(ci ) → ea(ci ).l ∈ DR .
For the time being we will also assume that these regions are disjoint:
DR ∩ CR = ∅ .
Moreover, in the next section we require the code region CR to always include
the addresses which belong to the ROM portion of the hardware memory.

6.3.2 Hardware Configurations and Computations


The hardware configuration h of the implementation has four components:
• a program counter h.pc ∈ B32 ,
• a general purpose register file h.gpr : B5 → B32 ,
• a double word addressable hardware memory h.m : B29 → B64 . In later
constructions it is replaced by separate data and instruction caches. Here
it is a single (64, r, 29)-2-port multi-bank RAM-ROM. We use port a of
this memory for instruction fetch from the code region and port b for
performing loads and stores to the data region. Hence, we refer to port
a as the instruction port and to port b as the data port of the hardware
memory h.m. We assume that the ROM portion of the memory is always
included into the code region:
CR ⊇ {029−r b | b ∈ Br }.
Wider memory speeds up aligned loads and stores of half words and words
and will later speed up communication between caches and main memory. For
the hardware, this comes at the price of shifters for loading or storing words,
half words, or bytes. We also need to develop some machinery for tracking the
byte addressed data in line addressable hardware memory.
A hardware computation is a sequence (ht ) with t ∈ N ∪ {−1}, where
configuration ht denotes the state of the hardware at cycle t. The hardware
construction given in the remainder of this chapter defines a hardware tran-
sition function
ht+1 = δH (ht , resett ) .
We assume the reset signal to be high in cycle −1 and to be low afterwards5 :

resett = 1    t = −1
         0    otherwise .
5
That is the reason why we count hardware states starting from −1 and ISA states
starting from 0.

6.3.3 Memory Embedding

Let m : B32 → B8 be a byte addressable memory like c.m in the ISA specifi-
cation, and let cm : B29 → B64 be a line addressable memory like h.m in the
intended hardware implementation. Let A ⊆ B29 be a set of line addresses like
CR and DR. We define in a straightforward way a relation cm ∼A m stating
that with respect to the addresses in A memory m is embedded in memory
cm by
∀a ∈ A : cm(a) = m8 (a03 ) .
Thus, illustrating with dots, each line of memory cm contains 8 consecutive
bytes of memory m, namely

cm(a) = m(a +32 732 ) ◦ . . . ◦ m(a) .

We are interested to localize the single bytes of sequences md (x) in the line
addressable memory cm. We are only interested in access widths, which are
powers of two and at most 8 bytes6 :

d ∈ {2k | k ∈ [0 : 3]} .

Also we are only interested in so called accesses (x, d) which are aligned in
the following sense: if d = 2k with k ≥ 1 (i.e., to more than a single byte),
then the last k bits of address x must all be zero:

d = 2k ∧ k ≥ 1 → x[k − 1 : 0] = 0k .

For accesses of this nature and i < d, the expressions x.o +32 i32 that are used
in Lemma 6.2 to localize bytes of md (x) in byte addressable memory have
three very desirable properties: i) their numerical value is at most 7, hence,
ii) computing their representative mod 8 in B3 gives the right result, and iii)
all bytes are embedded in the same cache line. This is shown in the following
technical lemma.
Lemma 6.3 (properties of aligned addresses). Let (x, d) be aligned and
i < d. Then,
1. ⟨x.o⟩ + i ≤ 7,
2. ⟨x.o +3 i3 ⟩ = ⟨x.o⟩ + i,
3. x +32 i32 = x.l ◦ (x.o +3 i3 ).

Proof. We prove three statements of the lemma one by one.


1. By alignment and because i < d = 2k we have

6
Double precision floating point numbers are 8 bytes long.

⟨x.o⟩ + i = ⟨x[2 : k] ◦ x[k − 1 : 0]⟩ + i     k ≤ 2
            ⟨x[k − 1 : 0]⟩ + i                k = 3
          = ⟨x[2 : k] ◦ 0k ⟩ + i              k ≤ 2
            ⟨03 ⟩ + i                         k = 3
          ≤ 7 − (2k − 1) + d − 1              k ≤ 2
            7                                 k = 3
          = 7 .

2. By the definition of +3 and part 1 of the lemma we have

⟨x.o⟩ + i = ⟨x.o⟩ + ⟨i3 ⟩
          = (⟨x.o⟩ + ⟨i3 ⟩ mod 8)
          = ⟨x.o +3 i3 ⟩ .

3. We write

x = x.l ◦ x.o
i32 = 029 ◦ i3 .

Adding the offset and the line components separately, we get by part 2 of
the lemma

⟨x.o⟩ + ⟨i3 ⟩ = ⟨0 ◦ (x.o +3 i3 )⟩
⟨x.l⟩ + ⟨029 ⟩ = ⟨x.l⟩ .

Because the carry of the addition of the offsets to position 4 is 0, we get
from Lemma 2.12:

⟨x⟩ + ⟨i32 ⟩ = ⟨x.l ◦ (x.o +3 i3 )⟩ < 232 .

Hence,
(⟨x⟩ + ⟨i32 ⟩ mod 232 ) = ⟨x.l ◦ (x.o +3 i3 )⟩ .
Applying bin32 (·) to this equation proves part 3 of the lemma.



In Lemma 6.2 we showed for all accesses (aligned or not)

∀i < d : byte(i, md(x)) = m(x +32 i32 ) .

Using Lemma 6.3 we specialize this for aligned accesses.


Lemma 6.4 (bytes in sequences for aligned accesses). Let (x, d) be
aligned and i < d. Then,

byte(i, md (x)) = m(x.l ◦ (x.o +3 i3 )) .



This allows a reformulation of the embedding relation ∼A :


Lemma 6.5 (embedding for bytes). Relation cm ∼A m holds iff for all
byte addresses x ∈ B32 with x.l ∈ A

byte(x.o, cm(x.l)) = m(x) .

Proof. Assume for line addresses a ∈ B29 we have

∀a ∈ A : cm(a) = m8 (a03 ) .

Then access (a03 , 8) is aligned and Lemma 6.4 can be reformulated for single
bytes:

∀i < 8 : byte(i, cm(a)) = byte(i, m8 (a03 ))


= m(a ◦ i3 ) .

Now we rewrite byte address a ◦ i3 as

a ◦ i3 = x = x.l ◦ x.o

and get
byte(x.o, cm(x.l)) = m(x) .
For the opposite direction of the proof we assume

∀x ∈ B32 : x.l ∈ A → byte(x.o, cm(x.l)) = m(x) .

We instantiate
x = x.l ◦ x.o = a ◦ i3 ,
and get for all i < 8 and a ∈ A

byte(i3 , cm(a)) = m(a ◦ i3 ) .

We further derive

∀i < 8 : byte(i, cm(a)) = byte(i3 , cm(a))


= m(a ◦ i3 )
= m(a ◦ (03 +3 i3 ))
= byte(i, m8 (a03 )) (Lemma 6.4),

which implies
cm(a) = m8 (a03 ) .


Finally, we can formulate for aligned accesses (x, d), how the single bytes of
consecutive sequences md (x) are embedded in memory cm.

Lemma 6.6 (embedding for aligned accesses). Let relation cm ∼A m


hold, (x, d) be aligned, x.l ∈ A, and i < d. Then,

byte(i, md(x)) = byte(x.o + i, cm(x.l)) .

Proof.

byte(i, md(x)) = m(x.l ◦ (x.o +3 i3 )) (Lemma 6.4)


= byte(x.o +3 i3 , cm(x.l)) (Lemma 6.5)
= byte(x.o + i, cm(x.l)) (Lemma 6.3)




For aligned word accesses (d = 4) and indices i < 4, we get an important
special case:

byte(i, m4 (x)) = byte(⟨x[2]00⟩ + i, cm(x.l))
                = byte(4 · x[2] + i, cm(x.l))
                = byte(i, cm(x.l))          x[2] = 0
                  byte(4 + i, cm(x.l))      x[2] = 1 .

Concatenating bytes we can rewrite the embedding relation for aligned word
accesses.
Lemma 6.7 (embedding for word accesses). Let relation cm ∼A m hold,
x ∈ B32 , x.l ∈ A, and x[1 : 0] = 00. Then,

m4 (x) = cm(x.l)[31 : 0]      x[2] = 0
         cm(x.l)[63 : 32]     x[2] = 1 .
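A small sketch of the embedding of the byte addressable ISA memory into the 64-bit wide, line addressable hardware memory, and of the word access case of Lemma 6.7. Here cm maps 29-bit line addresses to 8-byte values and m maps 32-bit byte addresses to bytes, both as Python dictionaries; these encodings are assumptions of the sketch.

def line(x):   return x >> 3          # x.l
def offset(x): return x & 0b111       # x.o

def embeds(cm, m, A):
    """cm ~_A m : every line a in A holds the eight ISA bytes m(a000), ..., m(a111)."""
    return all(cm[a] == sum(m.get((a << 3) | i, 0) << (8 * i) for i in range(8))
               for a in A)

def load_word(cm, x):
    """m_4(x) for an aligned word address x (x[1:0] = 00), read from the line memory."""
    lo_half = ((x >> 2) & 1) == 0     # x[2] = 0 selects bits [31:0], else bits [63:32]
    return (cm[line(x)] >> (0 if lo_half else 32)) & 0xFFFFFFFF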

6.3.4 Defining Correctness for the Processor Design

We define in a straightforward way a simulation relation sim(c, h) stating that


hardware configuration h encodes ISA configuration c by

sim(c, h) ≡

1. h.pc = c.pc ∧
2. h.gpr = c.gpr ∧
3. h.m ∼CR c.m ∧
4. h.m ∼DR c.m,
i.e., every hardware memory location h.m(a) for a ∈ CR ∪ DR contains the
contents of eight ISA memory locations:

c.m(a111) ◦ . . . ◦ c.m(a000) = h.m(a) .


By Lemma 6.5 this is equivalent to
∀x ∈ B32 : x.l ∈ CR ∪ DR → c.m(x) = byte(x[2 : 0], h.m(x[31 : 3])) .
We will construct the hardware such that one ISA instruction is emulated in
every hardware cycle and that we can show the following lemma.
Lemma 6.8 (MIPS one step). Let alignment and disjointness of code and
data regions hold for configuration ci . Then
∀i ≥ 0 : sim(ci , hi ) → sim(ci+1 , hi+1 ).
This is obviously the induction step in the correctness proof of the MIPS
construction.
Lemma 6.9 (MIPS correctness). There exists an initial ISA configuration
c0 such that
∀i ≥ 0 : sim(ci , hi ) .
At first glance it seems that the lemmas are utterly useless, because after
power up hardware registers come up with unknown binary content. With
unknown content of the code region, one does not know the program that is
executed, and thus, one cannot prove alignment and the disjointness of code
and data regions. That is of course a very real problem whenever one tries to
boot any real machine. We have already seen the solution: for the purpose of
booting, the code region occupies the bottom part of the address range:
CR = {029−r b | b ∈ Br }
for some small r < 29. The hardware memory is, therefore, realized as a
combined (64, r, 29)-2-port multi-bank RAM-ROM. The content of the ROM-
portion is known after power up and contains the boot loader. Thus, Lemma
6.9 works at least literally for programs in ROM. After the boot loader has
loaded more known programs, one can define new code and data regions CR 
and DR  and discharge (with extra arguments) the hypotheses of Lemma 6.8
for the new regions7 .
As already mentioned earlier, our hardware construction will closely fol-
low the ISA specification, and there will be many signals X occurring both
in the ISA specification and in the hardware implementation. We will distin-
guish them by their argument. X(c) is the ISA signal whereas X(h) is the
corresponding hardware signal.
7
In a pipelined processor, in between a switch to a new code and data region
one has to make sure that the pipe is drained. This is achieved, for instance,
with instruction eret (return from exception). Since in this book instruction eret
and other instructions which drain the pipe as a side effect are not specified, we
assume the code and data regions to be fixed initially and unchanged afterwards.
As a result, a slight adaptation of the proofs would be necessary if one wants to
argue formally about self-modifying code. For details refer to [12].

Fig. 96. Stages of a simple MIPS machine

6.3.5 Stages of Instruction Execution

Aiming at pipelined implementations later on, we construct the basic imple-


mentation of a MIPS processor in a very structured way. We split instruction

execution into 5 stages, that we describe first in a preliminary and somewhat


informal way (see Fig. 96):
• IF – instruction fetch. The program counter pc is used to access the in-
struction port of memory m in order to fetch the current instruction I.
• ID – instruction decode:
– In an instruction decoder predicates p and functions f depending only
on the current instruction are computed. The predicates p correspond
to the predicates in the ISA specification. Functions f include the func-
tion bits af and sf controlling ALU and shift unit as well as the ex-
tended immediate constant xtimm. Some trivial functions f simply
select some fields F of the instruction I, like rs or rt.
– The general purpose register file is accessed with addresses rs and rt.
For the time being we call the result A = gpr(rs) and B = gpr(rt).
– Result A is used to compute the next program counter in the next PC
environment.
• EX – execute. Using only results from the ID stage the following results
are computed by the following circuits:
– the link address linkad for the link-instructions by an incrementer,
– the result ares of the ALU by an ALU-environment,
– the result sures of the shift unit by a shift unit environment,
– the preliminary input C for the general purpose register file from
linkad, ares, and sres by a small multiplexer tree,
– effective address ea for loads and stores by an adder,
– the shifted operand dmin and the byte write signals bw[i] for store
instructions in an sh4s-environment8.
• M – memory. Store instructions update the hardware memory m through
the data port (port b). For store instructions, line ea.l of the hardware
memory is accessed. The output of the data port of the hardware memory
m is called dmout.
• WB – write back. For load instructions the output of the data port dmout
is shifted in an sh4l-environment and if necessary modified by a fill bit.
The result lres is combined with C to the data input gprin of the general
purpose register file. The gpr is updated.

6.3.6 Initialization

PC initialization and the instruction port environment imenv is shown in


Fig. 97. We have
h0 .pc = 032 = c0 .pc .
Thus, condition 1 of relation sim holds for i = 0. We take no precautions to
prevent writes to h.gpr or h.m during cycle −1 and define

8
sh4s is a shorthand for “shift for store” and sh4l is a shorthand for “shift for
load”.

Fig. 97. PC initialization and the instruction port environment

c0 .gpr = h0 .gpr
c0 .m8 (a000) = h0 .m(a) .

Hence, we can conclude


sim(c0 , h0 ) .
We assume that the software conditions stated in Sec. 6.3.1 hold for all ISA
computations that start from c0 .
From now on let i > 0 and c = ci , h = hi and assume sim(c, h). In the
remainder of the chapter we consider every stage of instruction execution in
detail and show that Lemma 6.8 holds for our construction, i.e., that simula-
tion relation
sim(c , h )
holds, where c = ci+1 and h = hi+1 . When in the proofs we invoke part k of
the simulation relation for k ∈ [1 : 4], we abbreviate this as (sim.k). When we
argue about hardware construction and semantics of hardware components,
like memories, we abbreviate by (H).

6.3.7 Instruction Fetch

The treatment of the instruction fetch stage is short. The instruction port of
the hardware memory h.m is addressed with bits

ima(h) = h.pc[31 : 3] = h.pc.l .

It satisfies

Fig. 98. Instruction decoder

h.pc[31 : 3] = c.pc[31 : 3] (sim.1)


∈ CR .

Using Lemma 6.7 we conclude that the hardware instruction I(h) fetched by
the circuitry in Fig. 97 is

I(h) = h.m(h.pc[31 : 3])[63 : 32]     h.pc[2] = 1
       h.m(h.pc[31 : 3])[31 : 0]      h.pc[2] = 0
     = c.m4 (c.pc[31 : 2]00)          (Lemma 6.7, sim.3)
     = c.m4 (c.pc)                    (alignment)
     = I(c) .

Thus, instruction fetch is correct.


Lemma 6.10 (instruction fetch).

I(h) = I(c)

6.3.8 Instruction Decoder

The instruction decoder belongs to the instruction decode stage. As shown in


Fig. 98 it computes the hardware version of functions f (c) that only depend
on the current instruction I(c), i.e., which can be written as

f (c) = f ′(I(c)) .

For example,

rtype(c) ≡ opc(c) = 0*04
         ≡ I(c)[31 : 26] = 0*04
rtype′(I[31 : 0]) ≡ I[31 : 26] = 0*04
rtype(c) = rtype′(I(c))

or

rd(c) = I(c)[15 : 11]
rd′(I[31 : 0]) = I[15 : 11]
rd(c) = rd′(I(c)) .

Predicates
This trivial transformation, however, gives a straightforward way to construct
circuits for all predicates p(c) from the ISA specification that depend only on
the current instruction:
• Construct a boolean formula for p . This is always possible by Lemma 2.20.
In the above example
rtype′(I) = ¬I[31] ∧ ¬I[29] ∧ ¬I[28] ∧ ¬I[27] ∧ ¬I[26] .
• Translate the formula into a circuit and connect the inputs of the circuit to
the hardware instruction register. The output p(h) of the circuit satisfies
p(h) = p (I(h))
= p (I(c)) (Lemma 6.10)
= p(c) .
Thus, the instruction decoder produces correct instruction predicates.
Lemma 6.11 (instruction predicates). For all predicates p depending only
on the current instruction:
p(h) = p(c) .

Instruction Fields
All instruction fields F have the form
F (c) = I(c)[m : n] .
Compute the hardware version as
F (h) = I(h)[m : n]
= I(c)[m : n] (Lemma 6.10)
= F (c) .

Lemma 6.12 (instruction fields). For all instruction fields F :


F (h) = F (c) .

Fig. 99. C address computation

C Address

The output Cad(h) in Fig. 99 computes the address of the destination register
for the general purpose register file. By Lemmas 6.11 and 6.12 it satisfies


Cad(h) = 15       jal(h)
         rd(h)    rtype(h)
         rt(h)    otherwise
       = 15       jal(c)
         rd(c)    rtype(c)
         rt(c)    otherwise
       = Cad(c) .

Extended Immediate Constant

The fill bit ifill (c) is a predicate and imm(c) is a field of the instruction. Thus,
we can compute the extended immediate constant in hardware as

xtimm(h) = ifill (h)16 imm(h)


= ifill (c)16 imm(c) (Lemmas 6.11 and 6.12)
= xtimm(c) .

Lemma 6.13 (immediate constant).

xtimm(h) = xtimm(c) .

Function Fields for ALU, SU, and BCE

Figure 100 shows the computation of the function fields af ,i, sf , and bf for
the ALU, the shift unit, and the branch condition evaluation unit.

Fig. 100. Computation of function fields for ALU, SU, and BCE

Outputs af (h)[2 : 0] satisfy by Lemmas 6.11 and 6.12



af (h)[2 : 0] = I(h)[2 : 0]       rtype(h)
                I(h)[28 : 26]     otherwise
              = I(c)[2 : 0]       rtype(c)
                I(c)[28 : 26]     otherwise
              = af (c) .

One shows

i(h) = i(c)
sf (h) = sf (c)
bf (h) = bf (c)

in the same way. Bit af [3](c) is a predicate. Thus, af (h) is computed in the
function decoder as a predicate and we get by Lemma 6.11

af [3](h) = af [3](c) .

We summarize in the following lemma.


Lemma 6.14 (C address and function fields).

Cad(h) = Cad(c)
af (h) = af (c)
i(h) = i(c)
sf (h) = sf (c)
bf (h) = bf (c)

That finishes – fortunately – the bookkeeping of what the instruction decoder


does.

Fig. 101. General purpose register file

6.3.9 Reading from General Purpose Registers


The general purpose register file h.gpr of the hardware as shown in Fig. 101
is a 3-port GPR RAM with two read ports and one write port. The a and b
addresses of the file are connected to rs(h) and rt(h). For the data outputs
gprouta and gproutb, we introduce the shorthands A and B
A(h) = gprouta(h)
= h.gpr(rs(h)) (H)
= c.gpr(rs(h)) (sim.2)
= c.gpr(rs(c)) (Lemma 6.12)
B(h) = c.gpr(rt(c)) . (similarly)
Thus, the outputs of the GPR RAM are correct.
Lemma 6.15 (GPR outputs).
A(h) = c.gpr(rs(c))
B(h) = c.gpr(rt(c))

6.3.10 Next PC Environment


Branch Condition Evaluation Unit
The BCE-unit is wired as shown in Fig. 102. By Lemmas 6.15 and 6.14 as
well as the correctness of the BCE implementation from Sect. 5.5 we have
bres(h) = bcres(A(h), B(h), bf (h))
= bcres(c.gpr(rs(c)), c.gpr(rt(c)), bf (c))
= bres(c) .

Fig. 102. The branch condition evaluation unit and its operands

Fig. 103. Incrementing an aligned PC with a 30-incrementer

Thus, the branch condition evaluation unit produces the correct result.
Lemma 6.16 (BCE result).

bres(h) = bres(c)

Incremented PC

The computation of an incremented PC as needed for the next PC environ-


ment as well as for the link instructions is shown in Fig. 103. Because the PC
can be assumed to be aligned9 the use of a 30-incrementer suffices. Using the
correctness of the incrementer from Sect. 5.1 we get

pcinc(h) = (h.pc[31 : 2] +30 130 )00


= (c.pc[31 : 2] +30 130 )00 (sim.1)
= c.pc[31 : 2]00 +32 130 00 (Lemma 2.12)
= c.pc +32 432 . (alignment)

Thus, the incremented PC is correct.


9
Otherwise a misalignment interrupt would be signalled.

Fig. 104. Next PC computation

Lemma 6.17 (incremented PC).


pcinc(h) = c.pc +32 432

Next PC Computation
The circuit computing the next PC input, which was left open in Fig. 97 when
we treated the instruction fetch, is shown in Fig. 104.
Predicates p ∈ {jr, jalr, jump, b} are computed in the instruction decoder.
Thus, we have
p(c) = p(h)
by Lemma 6.11.
We compute jbtaken in the obvious way and conclude by Lemma 6.16
jbtaken(h) = jump(h) ∨ b(h) ∧ bres(h)
= jump(c) ∨ b(c) ∧ bres(c)
= jbtaken(c) .
We have by Lemmas 6.15, 6.17, and 6.12
A(h) = c.gpr(rs(c))
pcinc(h) = c.pc +32 432
imm(h)[15]14 imm(h)00 = imm(c)[15]14 imm(c)00
= bdist(c) .

For the computation of the 30-bit adder, we argue as in Lemma 6.17:

s(h)00 = (h.pc[31 : 2] +30 imm(h)[15]14 imm(h))00


= (c.pc[31 : 2] +30 imm(c)[15]14 imm(c))00 (sim.1, Lemma 6.12)
= c.pc[31 : 2]00 +32 imm(c)[15]14 imm(c)00 (Lemma 2.12)
= c.pc +32 bdist(c) . (alignment)

We conclude


btarget(h) = c.pc +32 bdist(c)                        b(c)
             c.gpr(rs(c))                             jr(c) ∨ jalr(c)
             (c.pc +32 432 )[31 : 28]iindex(c)00      j(c) ∨ jal(c)
           = btarget(c) .

Exploiting
reset(h) = 0
and the semantics of register updates we conclude

h′.pc = nextpc(h)
      = btarget(c)       jbtaken(c)
        c.pc +32 432     otherwise
      = c′.pc .

Thus, we have shown the following lemma.


Lemma 6.18 (next PC).
h′.pc = c′.pc
This is sim.1 for the next configuration.

6.3.11 ALU Environment

We begin with the treatment of the execute stage. The ALU environment is
shown in Fig. 105. For the ALU’s left operand, we have

lop(h) = A(h)
= c.gpr(rs(c)) (Lemma 6.15)
= lop(c) .

For the right operand, it follows by Lemmas 6.15 and 6.13



Fig. 105. ALU environment


rop(h) = B(h)             rtype(h)
         xtimm(h)         otherwise
       = c.gpr(rt(c))     rtype(c)
         xtimm(c)         otherwise
       = rop(c) .

For the result ares of the ALU, we get

ares(h) = alures(lop(h), rop(h), af (h), itype(h))     (Sect. 5.3)
        = alures(lop(c), rop(c), af (c), itype(c))     (Lemma 6.11)
        = ares(c) .
We summarize in the following lemma.
Lemma 6.19 (ALU result).
ares(h) = ares(c)
Note that in contrast to previous lemmas the proof of this lemma is not just
bookkeeping; it involves the not so trivial correctness of the ALU implemen-
tation from Sect. 5.3.

6.3.12 Shift Unit Environment


The computation of the operands of the shift unit is shown in Fig. 106. The
left operand of the shifter is tied to B. Thus,
slop(h) = B(h)
= c.gpr(rt(c)) (Lemma 6.15)
= slop(c) .

Fig. 106. Shift unit environment

For the shift distance we have by Lemmas 6.12 and 6.15



sdist(h) = sa(h)                   f un(h)[2] = 0
           A(h)[4 : 0]             f un(h)[2] = 1
         = sa(c)                   f un(c)[2] = 0
           c.gpr(rs(c))[4 : 0]     f un(c)[2] = 1
         = sdist(c) .

Using the non trivial correctness of the shift unit implementation from
Sect. 5.4 we get

sres(h) = sures(slop(h), sdist(h), sf (h)) (Sect. 5.4)


= sures(slop(c), sdist(c), sf (c)) (Lemma 6.14)
= sres(c) .

Thus, the result produced by the shift unit is correct.


Lemma 6.20 (shift unit result).

sres(h) = sres(c)

6.3.13 Jump and Link

The value linkad that is saved in jump and link instructions is identical with
the incremented PC pcinc from the next PC environment (Lemma 6.17):

linkad(h) = pcinc(h) = c.pc +32 432 = linkad(c) . (15)



Fig. 107. Collecting results into signal C

Fig. 108. Effective address computation

6.3.14 Collecting Results


Figure 107 shows a small multiplexer-tree collecting results linkad, ares, and
sres into an intermediate result C. Using Lemmas 6.19, 6.20, and 6.11 as well
as (15) we conclude


C(h) = sres(h)       su(h)
       linkad(h)     jal(h) ∨ jalr(h)
       ares(h)       otherwise
     = C(c) .
Thus, the intermediate result C(h) is correct.
Lemma 6.21 (C result).
C(h) = C(c)

6.3.15 Effective Address


The effective address computation is shown in Fig. 108. We have
ea(h) = A(h) +32 imm(h)[15]16 imm(h) (Sect. 5.1)
= c.gpr(rs(c)) +32 sxtimm(c) (Lemmas 6.15 and 6.12)
= ea(c) .

Fig. 109. Shifter for store operations in the sh4s-environment

Fig. 110. Computation of byte write signals bw[7 : 0] in the sh4s-environment

Thus, the effective address computation is correct.


Lemma 6.22 (effective address).

ea(h) = ea(c)

6.3.16 Shift for Store Environment

Figure 109 shows a shifter construction and the data inputs for the data port
of the hardware memory h.m. The shifter construction serves to align the B
operand with the 64-bit wide memory. A second small shifter construction
generating the byte write signals is shown in Fig. 110.
The initial mask signals are generated as

smask(h)[3 : 0] = s(h) ∧ (I(h)[27]2 I(h)[26]1) .

One easily verifies that




smask(h) = 0000     ¬s(c)
           0001     s(c) ∧ d(c) = 1
           0011     s(c) ∧ d(c) = 2
           1111     s(c) ∧ d(c) = 4

resp.

smask(h)[i] = 1 ↔ s(c) ∧ i < d(c) .
In case s(c) = 0, we immediately get

bw(h) = 08 .

By alignment we have

(d(c) = 2 → ea(c)[0] = 0) ∧ (d(c) = 4 → ea(c)[1 : 0] = 00) .

Using ea(c) = ea(h) from Lemma 6.22 and correctness of SLC-shifters we


conclude for s(c) = 1:

e(h)[j] = 1 ↔ ∃i < d(c) : j = ea(c)[0] + i


f (h)[j] = 1 ↔ ∃i < d(c) : j = ea(c)[1 : 0] + i
bw(h)[j] = 1 ↔ ∃i < d(c) : j = ea(c)[2 : 0] + i .

Similarly we have for the large shifter from Fig. 109 and for i < d(c)

byte(i, B(h)) = byte(i + ⟨ea(c)[0]⟩, E(h))
              = byte(i + ⟨ea(c)[1 : 0]⟩, F (h))
              = byte(i + ⟨0ea(c)[1 : 0]⟩, dmin(h))
              = byte(i + ⟨1ea(c)[1 : 0]⟩, dmin(h))
              = byte(i + ⟨ea(c)[2 : 0]⟩, dmin(h)) .

Using B(h) = c.gpr(rt(c)) from Lemma 6.15, we summarize this for the
shifters supporting the store operations.
Lemma 6.23 (shift for store). If s(c) = 1, i.e., if a store operation is per-
formed in ISA configuration c, then

bw(h)[j] = 1 ↔ ∃i < d(c) : j = ea(c).o + i


∀i < d(c) : byte(i, c.gpr(rt(c))) = byte(i + ea(c).o, dmin(h)) .
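A small executable sketch of the net effect stated in Lemma 6.23: for an aligned store of d bytes at effective address ea, byte i of the rt register must appear at byte position ea.o + i of the 64-bit memory input dmin, and exactly the byte write signals bw[ea.o], ..., bw[ea.o + d − 1] are active. The hardware achieves this with the cyclic shifters of Figs. 109 and 110; the sketch below computes the same result directly, with integers modeling bit vectors (an assumption of the sketch).

def sh4s(B, ea, d):
    """Returns (dmin, bw) for an aligned store of d in {1, 2, 4} bytes of B at address ea."""
    o = ea & 0b111                                   # ea.o, the byte offset within the line
    bw = ((1 << d) - 1) << o                         # bw[j] = 1 iff j in [o : o + d - 1]
    data = B & ((1 << (8 * d)) - 1)                  # low d bytes of the rt register
    dmin = data << (8 * o)                           # byte i of B becomes byte o + i of dmin
    return dmin, bw

# Example: storing the word 0xDEADBEEF at an address with offset ea.o = 4
# activates bw = 11110000 and fills bytes 4..7 of dmin.
dmin, bw = sh4s(0xDEADBEEF, 0b100, 4)
assert bw == 0b11110000 and dmin == 0xDEADBEEF << 32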

This concludes the treatment of the execute stage.



Fig. 111. Wiring of the hardware memory

6.3.17 Memory Stage

In the memory stage we access port b of hardware memory h.m with the
line address ea.l and the signals bw(h)[7 : 0] and dmin(h) constructed above.
Figure 111 shows wiring of the hardware memory. We proceed to prove the
induction step for h.m.
Lemma 6.24 (hardware memory).

∀a ∈ CR ∪ DR : h′.m(a) = c′.m8 (a000).

Proof. By Lemma 6.5 this is equivalent to

∀x ∈ B32 : x.l ∈ CR ∪ DR : c .m(x) = byte(x[2 : 0], h .m(x[31 : 3])) .

and we will prove the lemma for the data region in this form.
By induction hypotheses, sim.3, and sim.4 we have

∀a ∈ CR ∪ DR : h.m(a) = c.m8 (a000).

For s(c) = 0 no store is executed and in the ISA computation we have c′.m =
c.m. In the hardware computation we have smask(h) = 04 and bw(h)[7 : 0] =
08 ; hence, h′.m = h.m. With the induction hypothesis we conclude trivially
for all a ∈ CR ∪ DR:

h .m(a) = h.m(a)
= c.m8 (a000) (sim.3, sim.4)
= c .m8 (a000) .

For s(c) = 1 we get from the ISA specification:



c′.m(x) = byte(i, c.gpr(rt(c)))     x = ea(c) +32 i32 ∧ i < d(c)
          c.m(x)                    otherwise .

For x ∈ B^32 and i < d(c) we derive:

x = ea(c) +32 i32 ↔ x = ea(c).l ◦ (ea(c).o +3 i3)                     (Lemma 6.3)
                  ↔ x.o = ea(c).o +3 i3 ∧ x.l = ea(c).l
                  ↔ ⟨x.o⟩ = ⟨ea(c).o +3 i3⟩ ∧ x.l = ea(c).l           (Lemma 2.7)
                  ↔ ⟨x.o⟩ = ⟨ea(c).o⟩ + i ∧ x.l = ea(c).l .           (Lemma 6.3)
Hence,

c′.m(x) = byte(i, c.gpr(rt(c)))    if x.o = ea(c).o + i ∧ i < d(c) ∧ x.l = ea(c).l
          c.m(x)                    otherwise .
In Lemma 6.22 we have already concluded that
ea(h) = ea(c).
Moreover, we know from the software conditions that ea(c).l ∈ DR. Thus, for
any a ∈ CR we have
h′.m(a) = h.m(a)
        = c.m8(a000)      (sim.3)
        = c′.m8(a000) .
For the hardware memory the specification of the 2-port multi-bank RAM-ROM gives for all a ∈ DR:

h′.m(a) = modify(h.m(a), dmin(h), bw(h))    if a = ea(c).l                    (16)
          h.m(a)                             otherwise .

With x ∈ B^32, a = x.l ∈ DR, j = x.o ∈ B^3, and the definition of function modify, we rewrite (16) as

byte(x.o, h′.m(x.l))
  = byte(j, h′.m(a))
  = byte(j, dmin(h))           if bw(h)[j] ∧ a = ea(c).l
    byte(j, h.m(a))            otherwise
  = byte(i, c.gpr(rt(c)))      if j = ea(c).o + i ∧ i < d(c) ∧ a = ea(c).l      (Lemma 6.23)
    byte(j, h.m(a))            otherwise
  = byte(i, c.gpr(rt(c)))      if j = ea(c).o + i ∧ i < d(c)
    c.m(x)                     otherwise                                        (sim.4, Lemma 6.5)
  = c′.m(x) .
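The byte-wise update semantics used in this step can also be stated as a few lines of Python. The sketch below is only an illustration of the modify function and of equation (16); the dictionary representation of the hardware memory and the function names are ours, not the book's.

```python
def modify(line, dmin, bw):
    """Replace exactly those bytes of an 8-byte memory line whose bw bit is set."""
    return [dmin[j] if bw[j] else line[j] for j in range(8)]

def data_port_write(m, ea_l, dmin, bw):
    """Equation (16): only line ea.l of the hardware memory can change in one cycle."""
    m_next = dict(m)
    m_next[ea_l] = modify(m.get(ea_l, [0] * 8), dmin, bw)
    return m_next

# Example: a 2-byte store to offset 2 of line 5 leaves all other bytes untouched.
m = {5: [0xAA] * 8}
m2 = data_port_write(m, 5, dmin=[0, 0, 0x34, 0x12, 0, 0, 0, 0],
                     bw=[0, 0, 1, 1, 0, 0, 0, 0])
assert m2[5] == [0xAA, 0xAA, 0x34, 0x12, 0xAA, 0xAA, 0xAA, 0xAA]
```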



Fig. 112. Shifter for load operations in the sh4l-environment: ea[2] selects the upper or lower word of dmout; the selected word G passes a (32, 16)-SLC controlled by ea[1] and a (32, 24)-SLC controlled by ea[0], yielding H and the result J

6.3.18 Shifter for Load

The only remaining stage is the write back stage. A shifter construction sup-
porting load operations is shown in Fig. 112. Assume l(c) holds, i.e., a load
instruction is executed. Because c.m ∼DR h.m holds by induction hypothesis,
we can use Lemma 6.6 to locate for i < d(c) the bytes to be loaded in h.m
and subsequently – using memory semantics – in dmout(h). Then we simply
track the effect of the two shifters taking into account that a 24-bit left shift
is the same as an 8-bit right shift:

byte(i, c.md(c)(ea(c)))
  = byte(ea(c).o + i, h.m(ea(c).l))       (Lemma 6.6)
  = byte(ea(h).o + i, h.m(ea(h).l))       (Lemma 6.22)
  = byte(ea(h).o + i, dmout(h))           (H)
  = byte(ea(h)[1 : 0] + i, G(h))          (H)
  = byte(ea(h)[0] + i, H(h))              (H)
  = byte(i, J(h)) .                       (H)

Hence, we can conclude the following lemma.


Lemma 6.25 (shift for load).

J(h)[8d(c) − 1 : 0] = c.md(c) (ea(c))

By setting
fill(h) = J(h)[7] ∧ lb(h) ∨ J(h)[15] ∧ lh(h)

Fig. 113. Fill bit computation for loads: lres[i] is J[i] where lmask[i] = 1 and the fill bit where lmask[i] = 0

we conclude

l(c) ∧ d(c) ≠ 4 → fill(h) = fill(c) .
Similar to the mask smask for store operations we generate a load mask

lmask(h) = I(h)[27]^16 I(h)[26]^8 1^8 .

In case of load operations (l(c) holds) it satisfies

lmask(h) = 0^24 1^8     if d(c) = 1
           0^16 1^16    if d(c) = 2
           1^32         if d(c) = 4
         = 0^(32−8·d(c)) 1^(8·d(c)) .

As shown in Fig. 113 we insert the fill bit at positions i where the corresponding mask bit lmask[i] is zero:

lres(h)[i] = fill(h)    if lmask(h)[i] = 0
             J(h)[i]    if lmask(h)[i] = 1 .

By Lemma 6.25 we show that the load result is correct.


Lemma 6.26 (load result).

lres(h) = fill(c)^(32−8·d(c)) c.md(c)(ea(c))
        = lres(c)
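As an informal illustration only, the following Python fragment mirrors the sh4l-environment of Figs. 112 and 113 at byte granularity; the flag signed stands for the predicates lb resp. lh, and the function name sh4l as well as the byte-list representation are ours.

```python
def sh4l(dmout_bytes, ea, d, signed):
    """Load path of Figs. 112/113: extract d bytes at offset ea mod 8 and extend.

    dmout_bytes -- the 8 bytes returned by the data port for line ea.l
    d           -- 1 (lb/lbu), 2 (lh/lhu) or 4 (lw); ea is assumed d-aligned
    signed      -- True for lb/lh (fill with the sign bit), False otherwise"""
    G = dmout_bytes[4:8] if (ea >> 2) & 1 else dmout_bytes[0:4]   # ea[2] selects the word
    J = [G[(j + ea % 4) % 4] for j in range(4)]   # combined effect of the two SLCs
    loaded = J[:d]                                # Lemma 6.25: J[8d-1:0] = m_d(ea)
    fill = 0xFF if signed and loaded[-1] & 0x80 else 0x00
    return loaded + [fill] * (4 - d)              # Lemma 6.26: fill bits above bit 8d-1

# Example: a signed halfword load of 0xFF80 placed at offset 6 of the memory line.
line = [0, 1, 2, 3, 4, 5, 0x80, 0xFF]
assert sh4l(line, ea=6, d=2, signed=True) == [0x80, 0xFF, 0xFF, 0xFF]
```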

6.3.19 Writing to the General Purpose Register File

Figure 114 shows the last multiplexer connecting the data input of the general
purpose register file with intermediate result C and the result lres coming from
the sh4l-environment. The write signal gprw of the general purpose register
file and the predicates su, jal, jalr, l controlling the muxes are predicates p
computed in the instruction decoder. By Lemma 6.11 we have for them

p(c) = p(h) .

Fig. 114. Computing the data input of the general purpose register file: predicate l selects between C and lres

Using Lemmas 6.21 and 6.26 we conclude

gprin(h) = lres(h)    if l(h)
           C(h)       otherwise
         = lres(c)    if l(c)
           C(c)       otherwise
         = gprin(c) .

Using RAM semantics, induction hypothesis sim.2, and Lemma 6.14 we complete the induction step for the general purpose register file:

h′.gpr(x) = gprin(h)    if gprw(h) ∧ x = Cad(h)
            h.gpr(x)    otherwise
          = gprin(c)    if gprw(c) ∧ x = Cad(c)
            c.gpr(x)    otherwise
          = c′.gpr(x) .

This concludes the proof of Lemma 6.8 as well as the correctness proof of the
entire (simple) processor.
7 Pipelining

In this chapter we deviate from [12] and present somewhat simpler proofs in the spirit of [6]. Pipelining without speculative instruction fetch introduces delay slots after branch and jump instructions. The corresponding simple changes to ISA and reference implementation are presented in Sect. 7.1.
In Sect. 7.2 we use what we call invisible registers to partition the refer-
ence implementation into pipeline stages. Replacing the invisible registers by
pipeline registers and controlling the updates of the pipeline stages by a very
simple stall engine we produce a basic pipelined implementation of the MIPS
ISA. As in [12] and [6] we use scheduling functions which, for all pipeline
stages k and hardware cycles t, keep track of the number of the sequential
instruction I(k, t) which is processed in cycle t in stage k of the pipelined
hardware. The correctness proof intuitively then hinges on two observations:
1. The circuits of stage k in the sequential hardware σ and the pipelined
hardware π are almost identical; the one difference (for the instruction
address ima) is handled by an interesting special case in the proof.
2. If Xπ is a signal of circuit stage k of the pipelined machine and Xσ is its counterpart in the sequential reference machine, then the value of Xπ in cycle t equals the value of Xσ before execution of instruction I(k, t). In algebra, Xπ^t = Xσ^I(k,t).
Although we are claiming to follow the simpler proof pattern from [6], the correctness proof presented here comes out considerably longer than its counterparts in [12] and [6]. The reason is a slight gap in the proof as presented in [12]:
the second observation above is almost but not quite true. In every cycle it
only holds for the signals which are used in the processing of the instruction
I(k, t) currently processed in stage k. Proofs with slight gaps are wrong1 and
should be fixed. Fixing the gap discussed here is not hard: one formalizes the
concept of signals used for the processing of an instruction and then does the

1 Just as husbands which are almost never cheating are not true husbands.


bookkeeping, which is lengthy and should not be presented fully in the classroom. In [6], where the correctness of the pipelined hardware was formally proven, the author clearly had to fix this problem, but he dealt with it in a different way: he introduced extra hardware in the reference implementation, which forces signals that are not used to zero. This makes observation 2
above strictly true.
Forwarding circuits and their correctness are studied in Sect. 7.3. The
material is basically from [12] but we work out the details of the pipe fill
phase more carefully.
The elegant general stall engine in Sect. 7.4 is from [6]. Like in [6], where
the liveness of pipelined processors is formally proven, the theory of scheduling
functions with general stall engines is presented here in much greater detail
than in [12]. The reason for this effort becomes only evident at the very end of
this book: due to possible interference between the bus scheduler of the memory system and the stall engines of the processors, liveness of pipelined multi-core
machines is a delicate and nontrivial matter.

7.1 MIPS ISA and Basic Implementation Revisited


7.1.1 Delayed PC

What we have presented so far – both in the definition of the ISA and in
the implementation of the processor – was a sequential version of MIPS. For
pipelined machines we introduce one change to the ISA.
So far in an ISA computation (ci ) new program counters ci+1 .pc were
computed by instruction I(ci ) and the next instruction

I(ci+1 ) = ci+1 .m4 (ci+1 .pc)

was fetched with this PC. In the new ISA the instruction fetch after a new
PC computation is delayed by 1 instruction. This is achieved by leaving the
next PC computation unchanged but i) introducing a delayed PC (DPC) c.dpc
which simply stores the PC of the previous instruction and ii) fetching instruc-
tions with this delayed PC. At the start of computations the two program
counters are initialized such that the first two instructions are fetched from
addresses 032 and 432 . Later we always obtain the delayed PC from the value
of the regular PC in the previous cycle:

c0 .dpc = 032
c0 .pc = 432
ci+1 .dpc = ci .pc
I(ci ) = ci .m4 (ci .dpc) .

The reason for this change is technical and stems from the fact that, in basic
5-stage pipelines, instruction fetch and next PC computation are distributed

over two pipeline stages. The introduction of the delayed PC permits to model
the effect of this in the sequential ISA. In a nutshell, PC and DPC are a tiny
bit of visible pipeline in an otherwise completely sequentially programming
model.
The 4 bytes after a jump or branch instruction are called a delay slot,
because the instruction in the delay slot is always executed before the branch
or jump takes effect.
The semantics of jump and branch instructions stays unchanged. This
means that for computation of the link address and of the jump target we
still use the regular PC and not the delayed PC. For instance, for the link
address we have
linkad(c) = c.pc +32 432 .
In case there are no jumps or branches in delay slots2 and the current in-
struction I(ci ) = ci .m4 (ci .dpc) is a jump or branch instruction, we have for
i > 0:

ci .dpc = ci−1 .pc


ci .pc = ci−1 .pc +32 432
= ci .dpc +32 432 ,

which for the link address gives us

linkad(ci ) = ci .pc +32 432


= ci .dpc +32 832 .

So in case of jal(ci ) or jalr(ci ) we save the return address, which points to


the first instruction after a delay slot. Relative jumps are also computed with
respect to the instruction in the delay slot in contrast to the ISA from the
previous chapter, where they were computed with respect to the branch or
jump instruction itself.
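The delayed-PC discipline is easy to animate. The Python sketch below steps an ISA computation with the two program counters; fetch and next_pc are placeholders for instruction fetch and the next-PC computation and are not definitions from the book.

```python
def run(fetch, next_pc, steps):
    """ISA computation with a delayed PC: fetch at c.dpc, compute nextpc from c.pc."""
    dpc, pc = 0, 4                        # c0.dpc = 0, c0.pc = 4
    for _ in range(steps):
        instr = fetch(dpc)                # I(c) = c.m4(c.dpc)
        dpc, pc = pc, next_pc(pc, instr)  # c'.dpc = c.pc, c'.pc = nextpc(c)
    return dpc, pc

# With straight-line code the two counters stay exactly one instruction apart,
# so a new PC computed by instruction i only redirects the fetch of instruction
# i + 2; the instruction fetched in between is the one in the delay slot.
dpc, pc = run(lambda a: None, lambda pc, i: pc + 4, steps=3)
assert (dpc, pc) == (12, 16)
```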

7.1.2 Implementing the Delayed PC

The changes in the simple non-pipelined implementation for the new ISA are
completely obvious and are shown in Fig. 115.
The resulting new design σ is a sequential implementation of the MIPS ISA
for pipelined machines. We denote hardware configurations of this machine
by hσ . The simulation relation sim(c, hσ ) from Sect. 6.3.4 is extended with
the obvious coupling for the DPC:

sim(c, hσ ) ≡

2 A software condition which one has to maintain for the ISA to be meaningful.

Fig. 115. Implementation of the delayed PC: on reset pc is initialized with 4^32 and dpc with 0^32; otherwise pc is loaded with nextpc and dpc with pc, and the instruction address ima is taken from bits [31 : 2]

1. hσ .pc = c.pc ∧
2. hσ .dpc = c.dpc ∧
3. hσ .gpr = c.gpr ∧
4. hσ .m ∼CR c.m ∧
5. hσ .m ∼DR c.m.
For ISA computations (ct ) of the new pipelined instruction set one shows
in the style of the previous chapter under the same software conditions the
correctness of the modified implementation for the new (and real) instruction
set.
Lemma 7.1 (MIPS with delayed PC). There is an initial ISA configura-
tion c0 such that
∀t ≥ 0 : sim(ct , htσ ) .
Note that absence of jump or branch instructions in the delay slot is necessary
for the ISA to behave in the expected way, but is not needed for the correctness
proof of the sequential MIPS implementation.

7.1.3 Pipeline Stages and Visible Registers

When designing processor hardware one tries to solve a fairly well defined
optimization problem that is formulated and studied at considerable length
in [12]. In this text we focus on correctness proofs and only remark that one
tries i) to spend (on average) as few as possible hardware cycles per executed
ISA instruction and ii) to keep the cycle time (as, e.g., introduced in the
detailed hardware model) as small as possible. In the first respect the present
design is excellent. With a single processor one cycle per instruction is hard
to beat. As far as cycle time is concerned, it is a perfect disaster: the circuits
of every single stage contribute to the cycle time.
In a basic 5 stage pipeline one partitions the circuits of the sequential
design into 5 circuit stages cir(i) with i ∈ [0 : 4], such that

• the circuit stages have roughly the same delay which then is roughly 1/5
of the original cycle time and
• connections between circuit stages are as simple as possible.
We have already introduced the stages in Fig. 96 of Chap. 6. That the cycle
times in each stage are roughly equal cannot be shown here, because we have
not introduced a detailed and realistic enough delay model. The interested
reader is referred to [12].
Simplicity of inter-stage connections is desirable, because in pipelined implementations most of these connections are realized as register stages. And
much relative cost is increased by such registers we refer the reader again
to [12].
We conclude this section by a bookkeeping exercise about the interconnec-
tions between the circuit stages. We stress that we do almost nothing at all
here. We simply add the delayed PC and redraw Fig. 96 according to some
very simple rules:
1. Whenever a signal crosses downwards from one stage to the next we draw a dotted box around it and rename it (before or after it crosses the boundary). Exceptions to this are the signal ima used for instruction fetch and the signals rs and rt used to address the GPR file during instruction decode. For those signals we don't draw a box, since we do not pipeline them later.
2. We collapse the circuits between stages into circles labelled cir(i).
The result is shown in Fig. 116. We observe two kinds of pipeline stages: i)
circuit stages cir(i) and ii) register stages reg(k) consisting either of registers
or memories of the sequential design or of dotted boxes for renamed signals.
Most of the figure should be self-explanatory; we add a few remarks:
• Circuit stage cir(1) and register stage reg(1) are the IF stage. cir(1) con-
sists only of the instruction port environment, which is presently read only
and hence behaves like a circuit. Signal I contains the instruction that was
fetched.
• Circuit stage cir(2) and register stage reg(2) are the ID stage. The circuit
stage consists of the instruction decoder and the next PC environment.
Signals A and B have been renamed before they enter circuit stage cir(2).
Signal Bin is only continued under the synonym B, but signal Ain is both
used in the next PC environment and continued under the synonym A.
The signals going from the instruction decoder to the next PC environment
are denoted by i2nextpc:

i2nextpc = (bf , jump, jr, jalr, b, imm, iindex) .

Register stage 2 contains program counters pc and dpc, the operands A and
B fetched from the GPRs, the incremented PC renamed to link address
linkad, and the signals i2ex going from the instruction decoder to the EX
stage:

Fig. 116. Arranging the sequential MIPS design into pipeline stages: register stage 2 holds pc, dpc, linkad, A, B, i2ex, and con.2; stage 3 holds C.3, ea.3, bw.3, dmin, and con.3; stage 4 holds C.4, ea.4, dmout.4, m, and con.4; stage 5 holds gpr

i2ex = (xtimm, af , sf , i, sa, smask) .


For some signals x there exist versions in various register stages k. In such
situations we denote the version in register stage reg(k) by x.k. In this
sense we find in all register stages k ≥ 2 versions con.k of control signals
that were precomputed in the instruction decoder. This group of signals
comprises predicates p.k, instruction fields F.k and the C-address Cad.k:

con.k = (. . . , p.k, . . . , F.k, . . . , Cad.k) .



• Circuit stage cir(3) and register stage reg(3) are the execute stage. The
circuit stage comprises the ALU-environment, the shift unit environment,
an incrementer for the computation of linkad, multiplexers for the collec-
tion of ares, sures, and linkad into intermediate result C, an adder for
the computation of the effective address, and the sh4s-environment.
Register stage 3 contains a version C.3 of intermediate result C, the effec-
tive address ea.3, the byte write signals bw.3, the data input dmin for the
hardware memory, and the copy con.3 of the control signals.
• Circuit stage cir(4) and register stage reg(4) are the M stage. The circuit
stage consists only of wires; so we have not drawn it here. Register stage
4 contains a version C.4 of C, the output of the data port dmout.4 as well
as versions con.4 and ea.4 of the control signals and the effective address.
Note that we also have included the hardware memory m itself in this
register stage.
• Circuit stage cir(5) and register stage reg(5) are the WB stage. The circuit
stage contains the sh4l-environment (controlled by ea.4.o) and a multi-
plexer collecting C.4 and result lres of the sh4l-environment into the data
input gprin of the general purpose register file. Register stage 5 consists
of the general purpose register file.
For the purpose of constructing a first pipelined implementation of a MIPS
processor we can simplify this picture even further:
• We distinguish in register stages k only between visible (in ISA) registers
pc, dpc and memories m, gpr on one side and other signals x.k on the
other side.
• Straight connections by wires, which in Fig. 116 are drawn as straight
lines, are now included into the circuits cir(i)3 .
• For k ∈ [1 : 5] circuit stage cir(k) is input for register stage k + 1 and for
k ∈ [1 : 4] register stage k is input to circuit stage cir(k + 1). We only
hint these connections with small arrows and concentrate on the other
connections.
We obtain Fig. 117. In the next section we will transform this simple figure
with very little effort into a pipelined implementation of a MIPS processor.

7.2 Basic Pipelined Processor Design


7.2.1 Transforming the Sequential Design into a Pipelined Design

We transform the sequential processor design σ of the last section into a


pipelined design π whose hardware configurations we will denote by hπ . We
also introduce some shorthands for registers or memories R and circuit signals
X in either design:
3 This does not change circuits cir(1) and cir(2).

imout cir(1)

x.1

Ain, Bin
cir(2)

pc dpc x.2

cir(3)

x.3

cir(4)
ima

m x.4

cir(5)

gpr
rs, rt

Fig. 117. Simplified view of register and circuit stages

Rσ^t = hσ^t.R
Rπ^t = hπ^t.R
Xσ^t = X(hσ^t)
Xπ^t = X(hπ^t) .

For signals or registers only occurring in the pipelined design, we drop the sub-
script π. If an equation holds for all cycles (like equations describing hardware
construction) we drop the index t.

Fig. 118. Tracking full register stages with a basic stall engine: full_0 is tied to 1, the registers full_1, . . . , full_4 are cleared on reset and otherwise each full_k is loaded from full_{k−1}

Table 8. Full bits track the filling of the register stages

  t         0  1  2  3  ≥4
  full_0^t  1  1  1  1  1
  full_1^t  0  1  1  1  1
  full_2^t  0  0  1  1  1
  full_3^t  0  0  0  1  1
  full_4^t  0  0  0  0  1

The changes to design σ are explained very quickly:


• We turn all dotted boxes of all register stages into pipeline registers with
the same name. Because their names do not occur in the ISA, they are only
visible in the hardware design but not to the ISA programmer. Therefore
they are called non visible or implementation registers. We denote visibility
of a register or memory R by predicate vis(R):

vis(R) = R ∈ {pc, dpc, m, gpr} .



• For indices k of register stages, we collect in reg(k) all registers and memo-
ries of register stage k. We use a common clock enable uek for all registers
of reg(k).
• Initially after reset all register stages except the program counters contain
no meaningful data. In the next 5 cycles they are filled one after another
(Table 8). We introduce the hardware from Fig. 118 to keep track of this.
There are 5 full bits f ull[0 : 4], where the bit f ullkt means that circuit
stage cir(k + 1) contains meaningful data in cycle t.
Formally we define

full_0 = 1
∀k ≥ 1 : full_k^0 = 0
∀k ≥ 1 : full_k^{t+1} = full_{k−1}^t .

We show

full[0 : 4]^t = 1^(t+1) 0^(4−t)    if t ≤ 3                                    (17)
                1^5                 if t ≥ 4

by the following simple lemma.
Lemma 7.2 (full bits).

∀k, t ≥ 0 : full_k^t = 0    if t < k
                       1    if t ≥ k

Proof. For k = 0 we have for all t ≥ 0: t ≥ 0 = k and full_0^t = 1. Thus, the lemma holds for k = 0. For k ≥ 1 the lemma is shown by induction on t. For t = 0 we have t < k and full_k^0 = 0. Thus, the lemma holds after reset. Assume the lemma holds for cycle t. Then

full_k^{t+1} = full_{k−1}^t
             = 0    if t < k − 1
               1    if t ≥ k − 1
             = 0    if t + 1 < k
               1    if t + 1 ≥ k .




Fig. 119. Updating register stage k under control of the full bit full_{k−1}: the outputs of reg(k − 1) pass through cir(k) into reg(k), which is clocked with ue_k

Fig. 120. Computation of the instruction address in the pipelined machine: full_1 selects pc (full_1 = 1) or dpc (full_1 = 0) as the source of ima_π

Full bits being set to 0 prevent the update of register stages. This is also
called stalling a register stage; we call the hardware therefore a basic stall
engine. Other stall engines are introduced later.
• For any register stage k we update registers and memories in reg(k) only if
their input contains meaningful data, which is the case when the previous
stage is full. As illustrated in Fig. 119 we set the update enable signal as

uek = f ullk−1 .
For registers in stage 1 we have a special case, where the signal f ull0 is
not coming from a register, but is always tied to 1.
• For memories m and gpr we take the precomputed signals bw.3 and gprw.4
from the precomputed control and AND them, respectively, with the cor-
responding update enable signals to get the new write signals:
bwπ = bw.3π ∧ ue4
gprwπ = gprw.4π ∧ ue5 .
• The address of the instruction is now computed as shown in Fig. 120 as

  ima_π = dpc_π.l    if full_1 = 0
          pc_π.l     if full_1 = 1 .

This has the remarkable effect that we fetch from the PC in all cycles
except the very first one. Thus, the important role of the delayed PC

is not in the hardware implementation but in the ISA, where it exposes


the effect of the fact that instruction fetch and next PC computation are
distributed over two stages. If we joined the two stages into a single one
(by omitting the I-register), we would gain back the original instruction
set, but we would ruin the efficiency of the design by roughly doubling the
cycle time.
These are all changes we make to the sequential design σ.

7.2.2 Scheduling Functions

In the sequential design, there was a trivial correspondence between the hard-
ware cycle t and the instruction I(ct ) executed in that cycle. In the pipelined
design π the situation is more complicated, because in 5 stages there are
up to 5 different instructions which are in various stages of completion. For
instructions I(ci ) of the sequential computation we use the shorthand
Ii = I(ci ) .
We introduce scheduling functions
I : [1 : 5] × N → N ,
which keep track of the instructions being processed every cycle in every circuit
stage. Intuitively, if
I(k, t) = i ,
then the registers of register stage k in cycle t are in the state before executing
instruction Ii . In case register stage k−1 is full, this is equivalent to saying that
instruction Ii during cycle t is being processed in circuit stage k. If register
stage k − 1 is not full, then circuit stage k does not have any meaningful input
during cycle t, but Ii will be the next instruction which will eventually be
processed by circuit stage k when register stage k − 1 becomes full. For both
cases if I(k, t) = i we say that instruction Ii is in circuit stage k during cycle
t. Note that if some stages of the pipeline are not full, then one instruction is
said to be present in several circuit stages simultaneously.
Formally the functions are defined with the help of the update enable functions ue_k in the following way:

∀k : I(k, 0) = 0

I(1, t + 1) = I(1, t) + 1    if ue_1^t = 1
              I(1, t)        otherwise

∀k ≥ 2 : I(k, t + 1) = I(k − 1, t)    if ue_k^t = 1
                        I(k, t)        otherwise ,

i.e., after reset every stage is in the state before executing instruction I0 . In
circuit stage 1 we fetch a new instruction and increase the scheduling function

Table 9. Scheduling functions for the first 6 cycles

  t        0  1  2  3  4  5
  I(1, t)  0  1  2  3  4  5
  I(2, t)  0  0  1  2  3  4
  I(3, t)  0  0  0  1  2  3
  I(4, t)  0  0  0  0  1  2
  I(5, t)  0  0  0  0  0  1
every cycle (in the basic stall engine introduced thus far, stage 1 is always
updated). A register stage k which has a predecessor stage k − 1 is updated
or not in cycle t as indicated by the uet signal. If it is updated then the data
of instruction I(k, t) is written into the registers of stage k, and the circuit
stage k in cycle t + 1 gets the instruction from the previous stage. Later we
prove an easy lemma showing that this instruction is equal to the instruction
in stage k in cycle t increased by one. If register stage k is not updated in cycle
t then the scheduling function for this stage stays the same. Table 9 shows
the development of the scheduling function in our basic pipeline for the first
6 cycles.
The definition of the scheduling functions can be viewed in the follow-
ing way: imagine we extend each register stage reg(k) by a so-called ghost register I(k, ·) that can store arbitrary natural numbers. In real machines that is of course impossible because registers are finite, but for the purpose of mathematical argument we can add the ghost registers to the construction and update them like all other registers of their stage by ue_k. If we initialize the ghost register I(1, ·) of stage 1 with 0 and increase it by 1 every cycle,
then the pipeline of ghost registers simply clocks the index of the current
instruction through the pipeline together with the real data.
Augmenting real configurations for the purpose of mathematical argument
by ghost components is a useful proof technique. No harm is done to the real
construction as long as no information flows from the ghost components to
the real components.
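The ghost-register view translates directly into a small simulation. The Python sketch below is our own illustration (not part of the book's formalism); it steps the full bits and the ghost registers of the basic stall engine and reproduces Tables 8 and 9.

```python
def simulate(cycles, stages=5):
    """Full bits and scheduling functions of the basic stall engine."""
    full = [1, 0, 0, 0, 0]                        # full_0 .. full_4 after reset
    I = {k: 0 for k in range(1, stages + 1)}      # ghost registers, I(k, 0) = 0
    rows = []
    for t in range(cycles):
        rows.append((t, full[:], [I[k] for k in range(1, stages + 1)]))
        ue = {k: full[k - 1] for k in range(1, stages + 1)}   # ue_k = full_{k-1}
        new_I = dict(I)
        new_I[1] = I[1] + 1 if ue[1] else I[1]
        for k in range(2, stages + 1):
            new_I[k] = I[k - 1] if ue[k] else I[k]
        I, full = new_I, [1] + full[:-1]          # full_k^{t+1} = full_{k-1}^t
    return rows

for t, full, sched in simulate(6):
    print(t, full, sched)      # t = 5 yields I(1..5, 5) = 5, 4, 3, 2, 1 as in Table 9
```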
With the help of Lemma 7.2 we show the following property of the schedul-
ing functions.
Lemma 7.3 (scheduling functions). For all k ≥ 1 and for all t

I(k, t) = 0            if t < k
          t − k + 1    if t ≥ k .

Proof. By induction on t. For t = 0 we have for all k: t < k and I(k, t) = 0. This shows the base case of the induction. Assume the lemma holds for t. In the induction step we consider two cases:

• k = 1: the claim of the lemma simplifies to

  I(1, t + 1) = (t + 1) − 1 + 1 = t + 1

  and we have by the definition of I(1, ·) and by induction hypothesis for t:

  I(1, t + 1) = I(1, t) + 1 = t + 1 .

• k ≥ 2: we have by Lemma 7.2

  ue_k^t = full_{k−1}^t = (t ≥ k − 1) .

  Thus, by the definition of I(k, ·) and by induction hypothesis for t we have

  I(k, t + 1) = I(k, t)               if ¬ue_k^t
                I(k − 1, t)           if ue_k^t
              = I(k, t)               if t < k − 1
                I(k − 1, t)           if t ≥ k − 1
              = 0                     if t < k − 1
                t − (k − 1) + 1       if t ≥ k − 1
              = 0                     if t + 1 < k
                t + 1 − k + 1         if t + 1 ≥ k .


The following lemma relates the indices I(k, t) and I(k − 1, t) in adjacent
pipeline stages. They differ by 1 iff stage k − 1 is full.
Lemma 7.4 (scheduling function difference). Let k > 1, then

I(k − 1, t) = I(k, t)        if full_{k−1}^t = 0
              I(k, t) + 1    if full_{k−1}^t = 1 .

Proof. By Lemmas 7.3 and 7.2 we have

I(k − 1, t) − I(k, t)
  = (0 if t < k − 1, t − (k − 1) + 1 if t ≥ k − 1) − (0 if t < k, t − k + 1 if t ≥ k)
  = 0 − 0                        if t < k − 1
    1 − 0                        if t = k − 1
    t − k + 2 − (t − k + 1)      if t > k − 1
  = 0    if t < k − 1
    1    if t ≥ k − 1
  = full_{k−1}^t .



From Lemma 7.4 we immediately conclude for k > 1

I(k, t) = I(k − 1, t) − full_{k−1}^t

and for any k′ < k we have

I(k′, t) = I(k, t) + Σ_{j=k′}^{k−1} full_j^t .

7.2.3 Use of Invisible Registers

Not all registers or memories R are used in all instructions I(ci ). In the cor-
rectness theorem we need to show correct simulation of invisible registers only
in situations when they are used. Therefore, we define for each invisible reg-
ister X a predicate used(X, I) which must at least be true for all instructions
I, which require register X to be used for the computation. Some invisible
registers will always be correctly simulated, though not all of them are always
used. We define

∀X ∈ {I, i2ex, linkad, con.2, con.3, con.4, bw.3} : used(X, I) = 1 .

Invisible register A is used when the GPR memory is addressed with rs, and
B is used when it is accessed with rt.
We first define auxiliary predicates A-used(I) and B-used(I) that we will
need later. Recall that in Sect. 6.3.8 we have written the functions f (c) and
the predicates p(c) that only depend on the current instruction I(c) as

f (c) = f  (I(c)) and p(c) = p (I(c)) .

We will use the same notation here. Inspection of the tables summarizing the
MIPS ISA gives

A-used(I) = alur (I) ∨ (su (I) ∧ f un (I)[2]) ∨ jr (I) ∨ jalr (I)
∨ (itype (I) ∧ ¬lui (I))
B-used(I) = s (I) ∨ beq  (I) ∨ bne (I) ∨ su (I) ∨ alur (I) .

Now we simply define4

used(A, I) = A-used(I)
used(B, I) = B-used(I) .

Registers C.3 and C.4 are used when the GPR memory is written but no load
is performed:
4 The notation is obviously redundant here, but later we also use A-used and B-used as hardware signals.

∀X ∈ {C.3, C.4} : used(X, I) = gprw (I) ∧ ¬l (I) .

Registers ea.3 and ea.4 are used in load and store operations:

∀X ∈ {ea.3, ea.4} : used(X, I) = l (I) ∨ s (I) .

Register dmin is used only in stores:

used(dmin, I) = s (I) .

Finally, dmout.4 is used in loads:

used(dmout.4, I) = l (I) .

7.2.4 Software Condition SC-1

We keep the software conditions of the sequential construction: alignment and


no self modification due to disjoint code and data regions.
The new condition comes from the connection from circuit stage 2 to
register stage 5 by the rs and rt signals. The scheduling functions for stages
2 and 5 are

I(2, t) = 0        if t < 2
          t − 1    if t ≥ 2

I(5, t) = 0        if t < 5
          t − 4    if t ≥ 5 .

Thus, the indices of the instructions in stages ID and WB differ by

I(2, t) − I(5, t) = 0 − 0              if t < 2
                    t − 1 − 0          if 2 ≤ t ≤ 4
                    t − 1 − (t − 4)    if t ≥ 5
                  = 0                  if t < 2
                    t − 1              if 2 ≤ t ≤ 4
                    3                  if t ≥ 5 .

Hence,
I(2, t) − I(5, t) ≤ 3 . (18)
Assume in cycle t instruction I(2, t) = i is in circuit stage 2, i.e., the ID stage.
Then signals rs and rt of this instruction overtake up to 3 instructions in
circuit stages 2,3, and 4. If any of these overtaken instructions write to some
general purpose register x and instruction i tries to read it - as in our basic
design directly from the general purpose register file, then the data read will
be stale; more recent data from an overtaken instruction is on the way to the

GPR but has not reached it yet. For the time being we will simply formulate
a software condition SC-1 saying that this situation does not occur; we only
prove that the basic pipelined design π works for ISA computations (ci ) which
obey this condition. In later sections we will improve the design and get rid
of the condition.
Therefore we formalize for x ∈ B5 and ISA configurations c two predicates:
• writesgpr(x, i) - meaning ISA configuration ci writes gpr(x):

writesgpr(x, i) ≡ gprw(ci ) ∧ Cad(ci ) = x .

• readsgpr(x, i) - meaning ISA configuration ci reads gpr(x). Reading gpr(x)


can occur via rs or the rt address, i.e., if A or B are used with address x:

readsgpr(x, i) ≡ used(A, I(ci )) ∧ rs(ci ) = x ∨ used(B, I(ci )) ∧ rt(ci ) = x .

Now we can define the new software condition SC-1: for all i and x, if Ii
writes gpr(x), then instructions Ii+1 , Ii+2 , Ii+3 don’t read gpr(x):

writesgpr(x, i) → ∀j ∈ [i + 1 : i + 3] : ¬readsgpr(x, j) .
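Software condition SC-1 can be phrased as a simple check over an instruction trace. In the Python sketch below the predicates writes_gpr and reads_gpr abstract the ISA predicates writesgpr and readsgpr; the trace representation and all names are purely illustrative and assumed to be supplied by an ISA simulator.

```python
def sc1_holds(n, writes_gpr, reads_gpr):
    """SC-1: if instruction i writes gpr(x), then i+1, i+2, i+3 do not read gpr(x)."""
    for i in range(n):
        for x in range(32):
            if writes_gpr(x, i) and any(reads_gpr(x, j)
                                        for j in range(i + 1, min(i + 4, n))):
                return False
    return True

# Toy trace of (destination, sources) pairs: instruction 0 writes register 3 and
# instruction 2 reads it, so the condition is violated.
trace = [(3, []), (4, [1]), (5, [3]), (0, [])]
writes = lambda x, i: trace[i][0] == x and x != 0
reads = lambda x, i: x in trace[i][1]
assert not sc1_holds(len(trace), writes, reads)
```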

7.2.5 Correctness Statement

Now that we can express what instruction I(k, t) is in stage k in cycle t and
whether an invisible register is used in that instruction, we can formulate the
invariant coupling states htπ of the pipelined machine with the set of states
hiσ of the sequential machine that are processed in cycle t of the pipelined
machine, i.e., the set
{hσI(k,t) | k ∈ [1 : 5]} .
We intend to prove by induction on t the following simulation.
Lemma 7.5 (basic pipeline). Assume software condition SC-1, alignment,
and no self modification. For k ∈ [1 : 5] let R ∈ reg(k) be a register or memory
of register stage k. Then,

Rπ^t = Rσ^I(k,t)            if vis(R)
       Rσ^(I(k,t)−1)        if full_k^t ∧ ¬vis(R) ∧ used(R, Iσ^(I(k,t)−1)) .

By Lemma 7.1 we already know sim(ct , htσ ). In particular we have for predi-
cates p only depending on the current instruction I:

p(ci ) = p(hiσ ) .

Thus, Lemma 7.5 also establishes a simulation between the pipelined compu-
tation (htπ ) and the ISA computation (ci ).
Except for the subtraction of 1 from I(k, t) for non visible registers, the
induction hypothesis is quite intuitive: pipelined data htπ .R in stage k in cycle

t is identical to the corresponding sequential data hσ^i.R resp. c^i.R or X(hσ^(i−1))
resp. X(ci−1 ), where i = I(k, t) is the index of the sequential instruction that
is executed in cycle t in stage k of the pipelined machine.
The subtraction of 1 can be motivated by the fact that in the pipelined
machine instructions i pass the pipeline from stage 1 to 5 unloading their
results into the visible registers of stage k, when they are clocked into reg(k).
Now assume register stage reg(k) contains a visible register R and a non visible
register Q and let I(k, t) = i. Then by the intuitive portion of the induction
hypothesis Rπt = Rσi . Thus, the previous instruction i − 1 is completed for the
visible register Rπ in stage k and the content of Rπ is the content of register
Rσ after execution of instruction i − 1 resp. before execution of instruction i.
The invisible register also contains the data produced by instruction i − 1 (!),
but since this register corresponds to a signal in ISA, then this signal has to
be defined as a function of ci−1 . Indeed, if we did define it as a function of ci
(as we do with the visible registers), then this would correspond to the data
produced by instruction i. Of course this is just motivation and is not to be
confused with a proof.

7.2.6 Correctness Proof of the Basic Pipelined Design

We denote by inv(k, t) the statement of Lemma 7.5 for stage k and cycle t.

Initialization of Register Stages

For t = 0 we have
f ullk0 = 1 ↔ k = 0 .
Thus, there is nothing to show for invisible registers. Initially we also have

∀k : I(k, 0) = 0 .

For visible registers one gets

pcπ^0 = 4 = pcσ^0 = pcσ^I(2,0)
dpcπ^0 = 0 = dpcσ^0 = dpcσ^I(2,0) .

The initial content of general purpose registers and hardware memory of the
sequential machine is defined by the content of the pipelined machine after
reset:

mπ^0 = mσ^0 = mσ^I(4,0)
gprπ^0 = gprσ^0 = gprσ^I(5,0) .

Thus, we have
∀k : inv(k, 0) .

No Updates

Assume the lemma holds for t. We show for each stage k separately that the
lemma holds for stage k and t + 1. For all stages we always proceed in the
same way. There are two cases. The easy case is

uetk = 0 ,

i.e., register stage k is not updated in cycle t. By the definition of full bits we
know
f ullkt+1 = f ullk−1
t
= uetk = 0 .
Thus, for invisible registers R ∈ reg(k) there is nothing to show either. For
the scheduling functions uetk = 0 implies

I(k, t + 1) = I(k, t) .

Recall, that the byte write signals for the hardware memory and the write
signal for the GPR memory are defined as

bwπ = bw.3π ∧ ue4


gprwπ = gprw.4π ∧ ue5 .

Hence, for visible registers or memories R of reg(k) we get by induction hy-


pothesis inv(k, t):
Rπt+1 = Rπt
= RσI(k,t)
= RσI(k,t+1) .

This shows inv(k, t + 1) for stages k that are not updated in cycle t.

Scheduling Functions for Updated Stages

Lemma 7.6 (scheduling functions for updated stages).

uetk → I(k, t + 1) = I(k, t) + 1


Proof. We have uetk = f ullk−1
t
= 1. For k = 1 we have by definition of the
scheduling functions:
I(1, t + 1) = I(1, t) + 1 .
For k ≥ 2, we have by Lemma 7.4 and the definition of the scheduling func-
tions:

I(k, t + 1) = I(k − 1, t)
= I(k, t) + 1 .




Proof Obligations for the Induction Step

The case uetk = f ullk−1


t
= 1 is handled for each stage separately. There are,
however, for stages k ∈ [1 : 5] and cycles t, general proof obligations P (k, t)
we have to show for visible registers, invisible registers, and memories in each
register stage reg(k), which will allow us to prove the induction step.
Let cycles t and instruction indices i correspond via

I(k, t) = i .

Then we define proof obligations P (k, t) in the following way:


• For k = 2, visible registers of register stage reg(2) in machines σ and π
have identical inputs:

R ∈ {pc, dpc} → Rintπ = Riniσ .

• For k = 4, data memories in stage reg(4) always have identical byte write
signals bw and have the same effective address input ea and data input
dmin in case of a store operation:

bwπt = bwσi
siσ → eatπ = eaiσ ∧ dmintπ = dminiσ .

• For k = 5, GPRs in stage reg(5) always have identical GPR write signal
gprw, and have the same write address Cad and the same data input gprin
in case if instruction i is writing to GPRs:

gprwπt = gprwσi
gprwσi → Cadtπ = Cadiσ ∧ gprintπ = gpriniσ .

• For any k, invisible registers of register stage reg(k) in machines σ and π


that are used have identical inputs:

R ∈ reg(k) ∧ ¬vis(R) ∧ used(R, Iσi ) → Rintπ = Riniσ .

Note that, in the sequential machine, an input to an invisible register in


cycle i is the same as the “value” of the invisible register in cycle i. This is
due to the fact that invisible “registers” are actually signals in a sequential
machine.
Very simple arguments show that P (k, t) and inv(k, t) implies inv(k, t + 1),
i.e., proving P (k, t) suffices to complete the induction step for stage k.
Lemma 7.7 (basic pipeline induction step).

uetk ∧ P (k, t) ∧ inv(k, t) → inv(k, t + 1)

Proof. Let R ∈ reg(k). The proof hinges on Lemma 7.6 and splits cases in the
obvious way:

• R ∈ {pc, dpc} is a visible register. Because the register has in both ma-
chines the same input, it gets updated in both machines in the same way:

Rπt+1 = Rintπ
= Riniσ
= Rσi+1
= RσI(k,t+1) .

• R is invisible and used in ci , where i = I(k, t) = I(k, t + 1) − 1. Then Rπ


is updated, but in the sequential machine Rin and R are just synonyms:

Rπt+1 = Rintπ
= Riniσ
= Rσi
= RσI(k,t)
= RσI(k,t+1)−1 .

• R = m is a hardware memory. From Sec. 6.3.16 we have

¬siσ → bwσi = 08 .

Moreover, from the software conditions we know that eaiσ .l ∈ DR and that
the data region DR is disjoint from the ROM portion of the hardware
memory. Then for all a ∈ B29 we get

t+1 modify (mtπ (a), dmintπ , bwπt ) a = eatπ .l
mπ (a) =
mtπ (a) otherwise

modify (miσ (a), dminiσ , bwσi ) a = eaiσ .l
=
miσ (a) otherwise .
= mi+1
σ (a)

= mσI(k,t+1) (a) .

• R = gpr is a GPR memory. Then we conclude:



t+1 gprintπ gprwπt ∧ x = Cadtπ
gprπ (x) =
gprπt (x) otherwise

gpriniσ gprwσi ∧ x = Cadiσ
=
gprσi (x) otherwise
= gprσi+1 (x)
= gprσI(k,t+1) (x) .



It remains to prove hypothesis P (k, t) of Lemma 7.7 for each stage k separately
under the assumption that the simulation relation holds for all stages in cycle
t and update enable for stage k is on.
Lemma 7.8 (proof obligations basic pipeline).
(∀k  : inv(k  , t)) ∧ uetk → P (k, t) .
Proof. We prove the statement of the lemma by a case split on stage k. For
each circuit stage cir(k) we identify a set of input signals in(k) of the stage
which are identical in cycle t of π and in the configuration i of σ:
in(k)tπ = in(k)iσ .
We then show that these inputs determine the relevant outputs Rin, dmin,
etc. of the circuit stage. Because the circuit stages are identical in both ma-
chines, this suffices for showing that the outputs which are used have identi-
cal values. Unfortunately, the proofs require simple but tedious bookkeeping
about the invisible registers used. The only real action is in the proofs for
signals Ain, Bin, and ima.

Stage IF (k=1)

We first consider the address input ima of the instruction port. We consider
the multiplexer in Fig. 120, which selects between visible registers pc, dpc ∈
reg(2), and distinguish two cases:
• t = 0. Then f ull1t = 0 and I(1, 0) = I(2, 0) = 0. We conclude with
inv(2, 0):
ima0π = dpc0π .l
= dpcI(2,0)
σ .l
= dpc0σ .l
= ima0σ
= imaI(1,0)
σ .
• t ≥ 1. Then f ull1t = 1. By Lemma 7.4 we have
i = I(1, t) = I(2, t) + f ull1t = I(2, t) + 1 .
Using inv(2, t) and the definition of the delayed PC we conclude:
imatπ = pctπ .l
= pcI(2,t)
σ .l
= dpcI(2,t)+1
σ .l
i
= dpcσ .l
= imaiσ .

From the software condition we know that imaiσ ∈ CR and that the content
of the code region does not change during the execution. Using inv(4, t) we
get

imouttπ = mtπ (imatπ )


= mI(4,t)
σ (imaiσ )
= m0σ (imaiσ )
= miσ (imaiσ )
= imoutiσ .

Thus, the instruction port environment has in both machines the same input;
therefore it produces the same output:

Iin tπ = Iin iσ ,

i.e., we have shown P (1, t) and thus by Lemma 7.7 inv(1, t + 1).

Stage ID (k = 2)

From uet2 = f ull1t we know that t > 0. Hence, by Lemma 7.4 we have:

I(1, t) = I(2, t) + 1 .

Let
i = I(2, t) = I(1, t) − 1 .
There are three kinds of input signals for the circuits cir(2) of this stage:
• Signal from invisible register I ∈ reg(1). It is always used. With inv(1, t)
we get:
Iπt = IσI(1,t)−1 = Iσi .
This already determines the inputs of invisible registers con.2 and i2ex:

R ∈ {con.2, i2ex} → Rintπ = Riniσ .

It also determines the signals i2nextpc from the instruction decoder to the
next PC environment so that we have for these signals:

i2nextpctπ = i2nextpciσ .

• Signals from visible registers pc, dpc ∈ reg(2) which are inputs to the next
PC environment. From inv(2, t) we get immediately:

R ∈ {pc, dpc} → Rπt = Rσi .



• For inputs Ain and Bin of circuit stage cir(2) we have to make use of
software condition SC-1, which is stated on the MIPS ISA level. Hence, we
assume here that the sequential MIPS implementation is correct (Lemma
7.1), i.e., that we always have

∀j ≥ 0 : sim(cj , hjσ ).

Let register A be used in instruction i to access gpr ∈ reg(5) via rs:

used(A, Iσi ).

Let
x = rstπ = rsiσ = rs(ci ) ,
i.e., instruction I(2, t) reads gpr(x):

readsgpr(x, I(2, t)) .

By (18) we have
I(5, t) ≤ I(2, t) + 3 = i + 3 .
If any of instructions I(3, t), I(4, t), I(5, t) would write gpr(x), this would
violate software condition SC-1. Thus,

∀k ∈ [3 : 5] : ¬writesgpr(x, I(k, t)) .

Hence,
gprσI(2,t) (x) = gprσI(5,t) (x) .
Using inv(5, t) we conclude:

Aintπ = gprπt (x)


= gprσI(5,t) (x)
= gprσI(2,t) (x)
= Ainiσ .

Arguing about signal Bin = B  in the same way we conclude

used(B, Iσi ) → Bintπ = Biniσ .

For the input to invisible register linkad we have from inv(2, t) and because
t ≥ 1:

linkadintπ = pcinctπ
= pctπ +32 432
= pciσ +32 432
= linkadiniσ .

It remains to argue about the inputs of visible registers pc and dpc, i.e., about
signals nextpc and register pc which is the input of dpc. For the input pc of
dpc we have from inv(2, t) and because t ≥ 1:

dpcintπ = pctπ = pciσ = dpciniσ .

For the computation of the nextpc signal there are four cases:
• beiσ ∨ bneiσ . This is the easiest case, because it implies used(A, Iσi ) ∧
used(B, Iσi ) and we have
intπ = iniσ
for all inputs in ∈ {A, B, i2nextpc} of the next PC environment. Because
the environment is identical in both machines we conclude

pcintπ = nextpctπ = nextpciσ = pciniσ

and are done.


• biσ ∧¬(beiσ ∨bneiσ ). Then we have used(A, Iσi ) and for signal d in the branch
evaluation unit we have
dtπ = 032 = diσ .
And hence,

jbtakentπ = jbtakeniσ
btargettπ = btargetiσ
nextpctπ = nextpciσ .

• jrσi ∨ jalrσi . Then used(A, Iσi ) and

nextpctπ = Aintπ = Ainiσ = nextpciσ .

• jσi ∨ jalσi . Then with inv(2, t) we have:

nextpctπ = (pctπ +32 432 )iindextπ 00


= (pciσ +32 432 )iindexiσ 00
= nextpciσ .

• In all other cases we have:

nextpctπ = pctπ +32 432 = pciσ +32 432 = nextpciσ .

This concludes the proof of P (2, t).



Stage EX (k = 3)

From uet3 = f ull2t we know that t > 1. Hence, by Lemma 7.4 we have:

I(2, t) = I(3, t) + 1 .

Let
i = I(3, t) = I(2, t) − 1 .
We have to consider three kinds of input signals for the circuits cir(3) of this
stage:
• Invisible registers i2ex, con.2, and linkad. They are always used. Using
inv(2, t) we get:

X ∈ {i2ex, con.2, linkad} → Xπt = XσI(2,t)−1 = Xσi .

Because con.2 = con.3in this shows P (3, t) for the pipelined control regis-
ter con.3,
• Invisible registers A and B. From inv(2, t) we have:

X ∈ {A, B} ∧ used(X, Iσi ) → Xπt = Xσi .

We proceed to show P (3, t) for registers dmin, bw.3, register ea, and register
C.3 separately:
• For ea we have

used(ea, Iσi ) = lσi ∨ siσ


lσi ∨ siσ → used(A, Iσi ) .

Register i2ex is always used. Because ea depends only on A and i2ex we


conclude from inv(2, t):
eaintπ = eainiσ .
• For bw.3, the signal smaskπt is also taken from i2ex, which is always used.
Hence, we have
smaskπt = smaskσi .
We do a case split on whether a store is performed or not:
– ¬siσ . In this case we have

smaskπt = smaskσi
= 0000 .

Thus, we get

bw.3intπ = 08
= bwσi
= bw.3iniσ .

– siσ . For this case we have already shown that


eaintπ = eainiσ .
Since computation of byte-write signals depends only on smask and
eain, we obviously get
bw.3intπ = bwσi = bw.3iniσ .
• For dmin we have
used(dmin, Iσi ) = siσ
siσ → itypeiσ ∧ ¬luiiσ
and hence,
used(dmin, Iσi ) → used(A, Iσi ) ∧ used(B, Iσi ) .
With inv(2, t) we conclude
Xπt = Xσi
for all inputs X of cir(3) and conclude trivially
dminintπ = dmininiσ .
• For C.3 we need a larger case split. We have
used(C.3, Iσi ) = aluiσ ∨ suiσ ∨ jalσi ∨ jalrσi .
This results in 4 subcases:
– alurσi ∨ suiσ ∧ f uniσ [3]. Then
used(A, Iσi ) ∧ used(B, Iσi ) .
With inv(2, t) we trivially conclude as above
C.3intπ = C.3iniσ .
– aluiiσ . Then used(A, Iσi ) and
roptπ = xtimmtπ = xtimmiσ = ropiσ .
Hence, alures is independent of B and we conclude
C.3intπ = alurestπ = aluresiσ = C.3iniσ .
– suiσ ∧ ¬f uniσ [3]. Then used(A, Iσi ) and
sdisttπ = satπ = saiσ = sdistiσ .
Hence sures is independent of B and we conclude
C.3intπ = surestπ = suresiσ = C.3iniσ .
– jalσi ∨ jalrσi . Then
C.3intπ = linkadtπ = linkadiσ = C.3iniσ .
This concludes the proof of P (3, t).

Stage M (k = 4)

From uet4 = f ull3t we know that t > 2. Hence, by Lemma 7.4 we have:

I(3, t) = I(4, t) + 1 .

Let
i = I(4, t) = I(3, t) − 1 .
We have to argue about 3 kinds of signals:
• X ∈ {dmin, con.3, ea.3, C.3}. From inv(3, t) we have:

used(X, Iσi ) → Xπt = XσI(3,t)−1 = Xσi .

This shows P (4, t) for the data inputs of registers con.4, ea.4, and C.4.
• dmout.4. We have

used(dmout.4, Iσi ) → lσi ∧ used(ea, Iσi ) .

Using inv(3, t) for ea and dmin as well as inv(4, t) for m we get:

dmouttπ = mtπ (ea.lπt )


= miσ (ea.lσI(3,t)−1 )
= dmoutiσ .

This shows P (4, t) for the input dmout of register dmout.4.


• memory inputs. The register bw.3 is always used. Thus, we have from
inv(3, t):

bw.3tπ = bwσI(3,t)−1
= bwσi
bwπt = bw.3tπ ∧ uet4
= bwσi ,

For the effective address and the data input to the hardware memory we
have
siσ → used(ea.3, Iσi ) ∧ used(dmin, Iσi ) .
As shown above, this implies in case of siσ :

dmintπ = dminiσ
eatπ = ea.3tπ
= eaiσ .

Stage WB (k = 5)

From uet5 = f ull4t we know that t > 3. Hence, by Lemma 7.4 we have
I(4, t) = I(5, t) + 1 .
Let
i = I(5, t) = I(4, t) − 1 .
We only have to consider the input registers of the stage and to show P (4, t)
for the general purpose register file:
• All input registers are invisible thus let X ∈ {C.4, dmout.4, ea.4, con.4}.
From inv(4, t) we have:
used(X, Iσi ) → Xπt = Xσi .
• Signal gprw.4 is a component of con.4. Thus, we have:
gprw.4tπ = gprwσi
gprwπt = gprw.4tπ ∧ uet5
= gprwσi .
Signal Cad.4 is a component of con.4. Thus
Cad.4tπ = Cadiσ .
Assume gprwσi , i.e., the general purpose register file, is written. We have
to consider two subcases:
– A load is performed. Then dmout.4 and ea.4 are both used, load result
lres is identical for both computations and the data input gprin for the
general purpose register file comes for both computations from lres:
lσi → used(dmout.4, Iσi ) ∧ used(ea.4, Iσi )
dmintπ = lrestπ
= lresiσ
= dminiσ .
– No load is performed. Then C.4 is used and it is put to the data input
gprin:
¬siσ → used(C.4, Iσi )
dmintπ = C.4tπ
= Cσi
= dminiσ .
This completes the proof of P (5, t), the proof of Lemma 7.8, and the
correctness proof of the basic pipeline design.



7.3 Forwarding
Software condition SC-1 forbids to read a general purpose register gpr(x)
that is written by instruction i in the following three instructions i + 1, i + 2,
and i + 3. We needed this condition because with the basic pipelined machine
constructed so far we had to wait until the written data had reached the
general purpose register file, simply because that’s where we accessed them.
This situation is greatly improved by the forwarding circuits studied in this
section.

7.3.1 Hits

The improvement is based on two very simple observations. Let cycle t be


the cycle when we want to read in circuit stage cir(2) register content gpr(x)
into register A or B. First, it is easy to recognize instructions in the deeper
register stages reg(k), with k ∈ [2 : 4], that write into gpr(x):
• The stage must be full:
f ullkt .
Otherwise it contains no meaningful data.
• The C-address must coincide with the rs address or the rt address (note
that these addresses are signals of circuit stage cir(2)):

Cad.k t = rst or Cad.k t = rtt .

• The instruction in stage k must write to the general purpose register file:

gprw.k t .

We introduce for registers A and B separate predicates characterizing this


situation:

hitA [k] ≡ f ullk ∧ Cad.k = rs ∧ gprw.k.


hitB [k] ≡ f ullk ∧ Cad.k = rt ∧ gprw.k .

Second, in case we have a hit in stage reg(2) or reg(3) and the instruction
is not a load instruction, then the data we want to fetch into A or B can be
found as the input of the C register of the following circuit stage, i.e., as C.3in
or C.4in. In case of a hit in stage reg(4) we can find the required data at the
data input grpin of the general purpose register file, even for loads.

7.3.2 Forwarding Circuits

All we have to do now is to construct circuits recognizing hits and forward-


ing the required data – where possible – to circuit stage cir(2). In case of
simultaneous hits in several stages we are interested in the data of the most

Fig. 121. Forwarding circuit For_A: top hits in stages 2, 3, 4 select C.3in, C.4in, or gprin; otherwise gprouta is used

recent instruction producing a hit. This is the “top” instruction in the pipe
(i.e., with the smallest k) producing a hit:

topA [k] = hitA [k] ∧ hitA [j].
j<k

topB [k] = hitB [k] ∧ hitB [j] .
j<k

Obviously, top hits are unique, i.e., for X ∈ {A, B} we have

topX [i] ∧ topX [j] → i = j .

Figure 121 shows the forwarding circuit For_A. If we find nothing to forward we access the general purpose register file as in the basic design. We have:

Ain = C.3in       if top_A[2]
      C.4in       if top_A[3]
      gprin       if top_A[4]
      gprouta     otherwise .
gprouta otherwise .

Construction of the forwarding circuit F orB is completely analogous.
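The hit and selection logic for operand A can be sketched in a few lines of Python; resolving the hits in order of increasing k realizes the top hit. The stage records used here (with fields full, Cad, gprw) are stand-ins for the pipeline registers of stages 2 to 4, and all names and the data layout are our own illustration.

```python
def hit(stage, rs):
    """hit_A[k]: full stage whose C-address equals rs and which writes the GPR file."""
    return stage["full"] and stage["gprw"] and stage["Cad"] == rs

def forward_A(rs, stages, C3in, C4in, gprin, gprouta):
    """Forwarding circuit For_A of Fig. 121; stages maps k in (2, 3, 4) to reg(k)."""
    if hit(stages[2], rs):
        return C3in            # most recent producer, currently in EX
    if hit(stages[3], rs):
        return C4in            # producer currently in M
    if hit(stages[4], rs):
        return gprin           # write-back data, correct even for loads
    return gprouta             # no hit: read the general purpose register file

# Example: the instructions in stages 3 and 4 both write register 7;
# the more recent one (stage 3) wins, so its C.4in value is forwarded.
stages = {2: {"full": 1, "gprw": 0, "Cad": 9},
          3: {"full": 1, "gprw": 1, "Cad": 7},
          4: {"full": 1, "gprw": 1, "Cad": 7}}
assert forward_A(7, stages, C3in="C3", C4in="C4", gprin="WB", gprouta="GPR") == "C4"
```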

7.3.3 Software Condition SC-2

Forwarding will only fail if instruction i is a load with destination gpr(x)


and the register x is read by one of the next two instructions i + 1 or i + 2.
Hence, we formulate a weaker software condition SC-2 which takes care of
such situations:

l(ci ) ∧ Cad(ci ) = x ∧ j ∈ [i + 1 : i + 2] → ¬readsgpr(x, j) .


The correctness statement formulated in Lemma 7.5 stays the same as before.
Only software condition SC-1 is replaced by conditions SC-2.

7.3.4 Scheduling Functions Revisited


For the correctness proof we need a very technical lemma which states in a
nutshell that in the pipeline instructions are not lost.
Lemma 7.9 (no instructions lost). Let s(t) be the number of full stages in between stages 2 and 5:

s(t) = Σ_{j=2}^{4} full_j^t .

Further, let

i = I(2, t) = I(5, t) + s(t) ,

let s(t) > 0 and 0 ≤ j < s(t). Then

I(2 + j, t) = i − j   and   full_{2+j}^t ,

i.e., any instruction i − j between i and i − s(t) is found in the full register stage 2 + j between stages 2 and 5.
Proof. From Lemma 7.3 we get

I(2, t) = 0        if t < 2
          t − 1    if t ≥ 2
        = I(5, t) + s(t)
        ≥ 1 .

Hence, t ≥ 2 and

I(2, t) = t − 1 ≥ s(t) .

Thus,

t ≥ s(t) + 1 > 1 + j ,

which implies

t ≥ 2 + j .

Applying again Lemma 7.3 we get

I(2 + j, t) = 0                    if t < 2 + j
              t − (2 + j) + 1      if t ≥ 2 + j
            = t − 1 − j
            = i − j .

From Lemma 7.2 we get

full_{2+j}^t = (t ≥ j + 2) = 1 .



7.3.5 Correctness Proof

The only case in the proof affected by the addition of the two forwarding
circuits F orA and F orB is the proof of P (2, t) in Lemma 7.8 for signals Ain
and Bin. Also the order in which proof obligations P (k, t) are shown becomes
important: one proves P (2, t) after P (3, t), P (4, t), and P (5, t).
We present the modified proof for Ain. The proof for Bin is completely
analogous. Assume
uet2 = f ull1t = 1 ,
and let
i = I(2, t) .
Our goal is to show that the forwarding circuit outputs the same content of
the GPR register file, as we get in the sequential configuration i:

Aintπ = gprσi (rsiσ )


= Ainiσ .

As before, in order to make use of the software condition which is stated on


the MIPS ISA level, we assume that the sequential MIPS implementation is
correct (Lemma 7.1), i.e., that we always have

∀j ≥ 0 : sim(cj , hjσ ).

Let us consider some full stage k ∈ [2 : 4]:

k ∈ [2 : 4] ∧ f ullkt .

Then by Lemma 7.2 stage k and all preceding stages must be full in cycle t

∀j ≤ k : f ulljt ,

and we can use induction hypothesis inv(j, t) for the invisible registers. Let

k =2+α with α ∈ [0 : 2] .

For the scheduling function for stages k and k + 1 we get by Lemma 7.4

I(2 + α, t) = I(k, t)
            = I(2, t)                                if k = 2
              I(2, t) − Σ_{j=2}^{k−1} full_j^t       if k > 2
            = i − (k − 2)
            = i − α

I(3 + α, t) = I(k + 1, t)
            = I(k, t) − full_k^t
            = i − α − 1 .

Lemma 7.10 (hit signal). Let i = I(2, t) and

x = rsiσ ∧ k = 2 + α ∧ f ullkt .

Then,
hittA [k] = writesgpr(x, i − α − 1) .
Proof. For the hit signal under consideration we can conclude with inv(k, t)
for the invisible registers Cad.k and gprw.k and with inv(1, t) for the signal
rs:

hitA [k]t ≡ f ullkt ∧ Cad.kπt = rstπ ∧ gprw.kπt


≡ CadσI(k,t)−1 = rsiσ ∧ gprwσI(k,t)−1
≡ Cadi−α−1
σ = x ∧ gprwσi−α−1
≡ writesgpr(x, i − α − 1) .



Now assume
hitA [k]t ∧ k = 2 + α ∧ x = rsiσ .
Then by Lemma 7.10 we have writesgpr(x, i − α − 1). For α ∈ [0 : 1] and
exploiting the fact that instruction i reads GPR x we can also conclude from
software condition SC-2 that instruction i − α − 1 is not a load instruction:

α ∈ [0 : 1] ∧ hitA [2 + α]t → ¬l(ci−α−1 ) .

This in turn implies that registers C.3 and C.4 are used by instruction i−α−1,
i.e.,
used(C.(3 + α), I(ci−α−1 )) ,
and that the content of these registers is written into register x by this in-
struction. Thus, we can apply P (3, t) and P (4, t) to conclude

C.(3 + α)intπ = C.(3 + α)I(3+α,t)


σ
i−α−1
= C.(3 + α)σ
= gprini−α−1
σ
= gprσi−α (x) .

If we have hitA [2 + α]t for α = 2 we conclude from gprwσi−α−1 and the proof
of P (5, t)

gprintπ = gprinI(5,t)
σ
= gprinI(3+α,t)
σ
= gprini−α−1
σ
= gprσi−α (x) .

The proof of P (2, t) for Ain can now be completed. There are two major cases:

• There exists a hit: ∃α ∈ [0 : 2] : hitA [2 + α]t . In this case we take the


smallest such α and have
topA [2 + α]t .
For the output Ain of forwarding circuit F orA we conclude

t C.(3 + α)int α ≤ 1
Ainπ =
gprintπ α=2
= gprσi−α (x) .
If α = 0 we have
gprσi−α (x) = gprσi (x)
and we are done. Otherwise we have
I(2 + α, t) = I(2, t) − α and α≥1.
Since all the stages up to stage 2 + α are full, we can apply Lemma 7.9 to
conclude
j ∈ [0 : α − 1] → f ull2+j
t
∧ I(2 + j, t) = i − j .
From ¬hitA [2 + j]t we conclude by Lemma 7.10 for all such j
¬writesgpr(x, i − j − 1) .
This implies again
gprσi−α (x) = gprσi (x)
and we are done.
• No hit exists: ∀α ∈ [0 : 2] : ¬hitA [2 + α]t . For the output Ain of the
forwarding circuit we have
Aintπ = gprπt (x)
= gprσI(5,t) (x) .
If I(5, t) = i we are done. Otherwise let s(t) denote the number of full
stages between 2 and 4 and we have
I(5, t) = i − s(t) .
Applying again Lemma 7.9 we conclude for the instructions i − j between
i and i − s(t)
j ∈ [0 : s(t) − 1] → f ull2+j
t
∧ I(2 + j, t) = i − j .
From ¬hitA [2 + j]t we conclude by Lemma 7.10 for all such j
¬writesgpr(x, i − j − 1) .
This implies
gprσI(5,t) (x) = gprσi−s(t) (x)
= gprσi (x)
and we are done.

Fig. 122. Signals between data paths and control and the stall engine: circuit stage cir(k) sends the hazard signal haz_k to the stall engine, which returns the update enable signal ue_k; the stall engine keeps the full bits full_{k−1}, full_k of the register stages

7.4 Stalling

In this last section of the pipelining chapter we use a non-trivial stall engine,
which permits us to improve the pipelined machine π such that we can drop
software condition SC-2. As shown in Fig. 122 the new stall engine receives
from every circuit stage cir(k) an input signal haz_k indicating that register
stage reg(k) should not be clocked, because correct input signals are not
available.
In case a hazard signal haz_k is active the improved stall engine will stall the
corresponding circuit stage cir(k), but it will keep clocking the other stages
if this is possible without overwriting instructions. Care has to be taken that
the resulting design is live, i.e., that stages generating hazard signals are not
blocking each other.

7.4.1 Stall Engine

The stall engine we use here was first presented in [6]. It is quickly described
but is far from trivial. The signals involved for stages k are:
• full signals full_k for k ∈ [0 : 4],
• update enable signals ue_k for k ∈ [1 : 5],
• stall signals stall_k indicating that stage k should presently not be clocked,
  for k ∈ [1 : 6]; the stall signal for stage 6 is only introduced to make
  definitions more uniform,
• hazard signals haz_k generated by circuit stage k for k ∈ [1 : 5].
As before, circuit stage 1 is always full (i.e., full_0 = 1) and circuit stages 2 to
5 are initially empty. Register stage reg(6) does not exist, and thus it is never
stalled:

full_0 = 1
full[1 : 4]^0 = 0^4
stall_6 = 0 .
We specify the new stall engine with 3 equations. Only full circuit stages k
with full input registers (registers reg(k − 1)) are stalled. This happens in two
situations: if a hazard signal is generated in circuit stage k or if the subsequent
stage k + 1 is stalled and clocking registers in stage k would overwrite data
needed in the next stage:

stall_k = full_{k−1} ∧ (haz_k ∨ stall_{k+1}) .

Stage k is updated, when the preceding stage k − 1 is full and stage k itself is
not stalled:

ue_k = full_{k−1} ∧ ¬stall_k
     = full_{k−1} ∧ ¬(full_{k−1} ∧ (haz_k ∨ stall_{k+1}))
     = full_{k−1} ∧ (¬full_{k−1} ∨ ¬(haz_k ∨ stall_{k+1}))
     = full_{k−1} ∧ ¬full_{k−1} ∨ full_{k−1} ∧ ¬(haz_k ∨ stall_{k+1})
     = full_{k−1} ∧ ¬(haz_k ∨ stall_{k+1}) .

A stage is full in cycle t + 1 in two situations: i) if new data were clocked in
during the preceding cycle or ii) if it was full before and the old data had to
stay where they are because the next stage was stalled:

full_k^{t+1} = ue_k^t ∨ full_k^t ∧ stall_{k+1}^t .

Because
(stall_{k+1} ∧ full_k) = stall_{k+1} ,
this can be simplified to

full_k^{t+1} = ue_k^t ∨ stall_{k+1}^t .
The corresponding hardware is shown in Fig. 123.
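The three defining equations translate directly into a cycle-by-cycle simulation. The
following Python sketch is our own illustration (the function name stall_engine_step
and the list encoding of the signals are not from the book); it computes the stall,
update enable, and new full signals of one hardware cycle from the current full bits
and the hazard signals.

    # Minimal sketch of one cycle of the stall engine for circuit stages 1..5.
    # full[0..4] are the full bits of register stages 0..4 (full[0] = 1 always),
    # haz[1..5] are the hazard signals of circuit stages 1..5 (haz[0] unused).
    def stall_engine_step(full, haz):
        K = 5                                   # circuit stages 1..5
        stall = [0] * (K + 2)                   # stall[6] = 0 by definition
        for k in range(K, 0, -1):               # stall_k depends on stall_{k+1}
            stall[k] = 1 if full[k - 1] and (haz[k] or stall[k + 1]) else 0
        ue = [0] * (K + 1)
        for k in range(1, K + 1):               # ue_k = full_{k-1} and not stall_k
            ue[k] = 1 if full[k - 1] and not stall[k] else 0
        new_full = full[:]                      # full_0 stays 1
        for k in range(1, K):                   # register stages 1..4
            new_full[k] = 1 if (ue[k] or stall[k + 1]) else 0
        return stall, ue, new_full

    # Example: an active haz_2 with all stages full stalls stages 1 and 2 only;
    # the instruction in stage 2 moves on, so stage 2 becomes empty.
    stall, ue, nf = stall_engine_step([1, 1, 1, 1, 1], [0, 0, 1, 0, 0, 0])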

7.4.2 Hazard Signals


In the new design only stage 2 generates a hazard signal, namely when A resp. B
is used and forwarding is desirable but not possible due to a hit in stage 2 or
3 which corresponds to a load:
haz_2 = haz_A ∨ haz_B
haz_A = A-used ∧ (top_A[2] ∧ con.2.l ∨ top_A[3] ∧ con.3.l)
haz_B = B-used ∧ (top_B[2] ∧ con.2.l ∨ top_B[3] ∧ con.3.l) .
For the time being, we set all other hazard signals to zero:
k ≠ 2 → haz_k = 0 .
This completes the construction of the new design.

Fig. 123. Hardware of one stage of the stall engine

7.4.3 Correctness Statement

The correctness statement formulated in Lemma 7.5 stays the same as before.
Software conditions SC-1 resp. SC-2 are dropped. Only alignment of memory
accesses and the disjointness of the code and data regions are assumed.

7.4.4 Scheduling Functions

The correctness proof follows the pattern of previous proofs, but due to the
non-trivial stall engine the arguments about scheduling functions now become
considerably more complex. Before we can adapt the overall proof we have to
show the counterparts of Lemmas 7.4 and 7.9 for the new stall engine. We
begin with three auxiliary technical results.

Lemma 7.11 (stall lemma 1). Let k ≥ 2. Then,

full_{k−1} ∧ ue_{k−1} → ue_k ,

i.e., if a full stage k − 1 is clocked then the previous data are clocked into the
next stage.
Proof. By contradiction. Assume

0 = ue_k
  = full_{k−1} ∧ ¬stall_k
  = ¬stall_k .

Thus,

stall_k = 1
stall_{k−1} = full_{k−2} ∧ (haz_{k−1} ∨ stall_k)
            = full_{k−2}
ue_{k−1} = full_{k−2} ∧ ¬stall_{k−1}
         = stall_{k−1} ∧ ¬stall_{k−1}
         = 0 .




Lemma 7.12 (stall lemma 2).

¬full_k^t ∧ ¬ue_k^t → ¬full_k^{t+1} ,

i.e., an empty stage k that is not clocked, stays empty.

Proof.

full_k^{t+1} = ue_k^t ∨ stall_{k+1}^t
             = stall_{k+1}^t
             = full_k^t ∧ (haz_{k+1}^t ∨ stall_{k+2}^t)
             = 0 .

Lemma 7.13 (stall lemma 3).

¬full_{k−1}^t ∨ ¬ue_k^t → I(k, t + 1) = I(k, t) ,

i.e., the scheduling function of stage k that does not have a full input stage
k − 1 or that is not clocked, stays the same.
Proof. By the definitions of the scheduling functions we have

¬ue_k^t → I(k, t + 1) = I(k, t) .

By the definition of the update enable signals we have

¬full_{k−1}^t → ¬ue_k^t .




We can now state the crucial counterpart of Lemma 7.4.

Lemma 7.14 (scheduling function difference with stalling). Let k ≥ 2.
Then,

I(k − 1, t) = I(k, t) + full_{k−1}^t .

Table 10. Case split according to bits full_{k−1}^t, ue_{k−1}^t, and ue_k^t in the proof of
Lemma 7.15

full_{k−1}^t  ue_{k−1}^t  ue_k^t   I(k − 1, t)  I(k − 1, t + 1)  I(k, t + 1)  full_{k−1}^{t+1}
0             0           0        i            i                i            0
0             1           0        i            i + 1            i            1
1             0           0        i + 1        i + 1            i            1
1             0           1        i + 1        i + 1            i + 1        0
1             1           1        i + 1        i + 2            i + 1        1

Proof. By induction on t. For t = 0 the lemma is obviously true because
initially we have full_{k−1}^0 = 0 and I(k − 1, 0) = I(k, 0) = 0 for all k ≥ 2.
For the induction step from t to t + 1 we assume that the lemma holds for
t and prove an auxiliary lemma.

Lemma 7.15 (stall lemma 4).

ue_k^t → I(k, t + 1) = I(k, t) + 1 ∧ full_k^{t+1} ,

i.e., after stage k is clocked, it is full and its scheduling function is increased
by one.
Proof. In the proof we distinguish two cases. If k = 1 then ue_1^t implies

I(1, t + 1) = I(1, t) + 1

by the definition of the scheduling functions. Now let k ≥ 2. By the definitions
of ue and full, we have

ue_k^t → full_{k−1}^t ∧ full_k^{t+1} .

Thus we have by the definition of the scheduling functions and the induction
hypothesis of Lemma 7.14

I(k, t + 1) = I(k − 1, t)
            = I(k, t) + full_{k−1}^t
            = I(k, t) + 1 .

Lemma 7.14 for t + 1 is now proven by a case split. Let

I(k, t) = i .

The major case split is according to bit full_{k−1}^t as shown in Table 10:
• full_{k−1}^t = 0. By Lemma 7.13 and the induction hypothesis we have

  I(k, t + 1) = I(k, t) = I(k − 1, t) = i .

  We consider subcases according to bit ue_{k−1}^t :



  – ue_{k−1}^t = 0. By Lemma 7.12 and the definitions of the scheduling func-
    tions we conclude

    ¬full_{k−1}^{t+1} ∧ I(k − 1, t + 1) = I(k − 1, t) = i .

  – ue_{k−1}^t = 1. By Lemma 7.15 and the induction hypothesis we get

    full_{k−1}^{t+1} ∧ I(k − 1, t + 1) = I(k − 1, t) + 1 = i + 1 .

  In both subcases we have

  I(k − 1, t + 1) = I(k, t + 1) + full_{k−1}^{t+1} .

• full_{k−1}^t = 1. By induction hypothesis we have

  I(k − 1, t) = I(k, t) + 1 = i + 1 .

  By the definition of scheduling functions and Lemma 7.15 we get

  (I(k − 1, t + 1), I(k, t + 1)) = { (i + 1, i)        ¬ue_{k−1}^t ∧ ¬ue_k^t
                                   { (i + 1, i + 1)    ¬ue_{k−1}^t ∧ ue_k^t
                                   { (i + 2, i + 1)    ue_{k−1}^t ∧ ue_k^t .

  We consider subcases according to bits ue[k − 1 : k]^t ∈ B^2, where

  ue_{k−1}^t → ue_k^t

  by Lemma 7.11:
  – ue_{k−1}^t = 1. Then,

    full_{k−1}^{t+1} = ue_{k−1}^t ∨ stall_k^t
                     = 1 .

  – ue_{k−1}^t = 0. Then,

    full_{k−1}^{t+1} = stall_k^t
    ue_k^t = full_{k−1}^t ∧ ¬stall_k^t
           = ¬stall_k^t
           = ¬full_{k−1}^{t+1} .

  In both subcases we have

  I(k − 1, t + 1) = I(k, t + 1) + full_{k−1}^{t+1} .



From Lemma 7.14 we conclude the same formula as for the machine without
stalling for any k' < k:

I(k', t) = I(k, t) + Σ_{j=k'}^{k−1} full_j^t .

The counterpart of Lemma 7.9 can now easily be shown.

Lemma 7.16 (no instructions lost with stalling). Let s(t) be the number
of full stages in between stages 2 and 5:

s(t) = Σ_{j=2}^{4} full_j^t .

Let
i = I(2, t) = I(5, t) + s(t) .
For j ∈ [−1 : s(t) − 1] we define numbers a(j, t) by⁵

a(−1, t) = 1
a(j, t) = min{x | x > a(j − 1, t) ∧ full_x^t} .

Then,
∀j ∈ [0 : s(t) − 1] : full_{a(j,t)}^t ∧ I(a(j, t), t) = i − j .
Proof. The lemma follows by an easy induction on j. For j = −1 there is
nothing to show. Assume the lemma holds for j. By the minimality of a(j+1, t)
we have
∀x : a(j, t) < x < a(j + 1, t) → ¬full_x^t .
By Lemma 7.14 we get

I(a(j + 1, t), t) = I(a(j, t), t) − Σ_{x=a(j,t)}^{a(j+1,t)−1} full_x^t
                  = I(a(j, t), t) − 1
                  = I(2, t) − j − 1
                  = I(2, t) − (j + 1) .



We are now also able to state the version of Lemma 7.10 for the machine with
stalling.

⁵ For j ∈ [0 : s(t) − 1], the function a(j, t) returns the index of the (j + 1)-th full
stage, starting to count from stage 2.

Lemma 7.17 (hit signal with stalling). Let i = I(2, t) and the numbers
s(t) and a(j, t) be defined as in the previous lemma. Further, let α ∈ [0 :
s(t) − 1] and let
x = rs_σ^i ∧ k = a(α, t) ∧ full_k^t .
Then,
hit_A^t[k] = writesgpr(x, i − α − 1) .
Proof. For the hit signal under consideration we can conclude with inv(k, t),
inv(2, t) and Lemma 7.16:

hit_A[k]^t ≡ full_k^t ∧ Cad.k_π^t = rs_π^t ∧ gprw.k_π^t
           ≡ Cad_σ^{I(k,t)−1} = rs_σ^i ∧ gprw_σ^{I(k,t)−1}
           ≡ Cad_σ^{i−α−1} = x ∧ gprw_σ^{i−α−1}
           ≡ writesgpr(x, i − α − 1) .



7.4.5 Correctness Proof


The correctness proof for the pipelined processor with forwarding and stalling
follows the lines of previous proofs. The reduction of the induction step to
the proof obligations P (k, t) and the subsequent proofs of P (3, t), P (4, t), and
P (5, t) relied only on Lemma 7.4 which is now simply replaced by Lemma
7.14.
The proof of P (1, t) is simpler. Let
i = I(1, t) .
We have by Lemma 7.14:

ima_π^t = { pc_π^t.l    full_1^t
          { dpc_π^t.l   ¬full_1^t
        = { pc_σ^{I(2,t)}.l    full_1^t
          { dpc_σ^{I(2,t)}.l   ¬full_1^t
        = { pc_σ^{i−1}.l   full_1^t
          { dpc_σ^i.l      ¬full_1^t
        = dpc_σ^i.l
        = ima_σ^i.l .

In the proof of P (2, t) for Ain recall that proof obligations P (k, t) only have
to be shown for cycles with active enable signals ue_k^t . For k = 2 we have
ue_2^t → ¬haz_2^t .
Let i = I(2, t) and x = rs_σ^i . Further, let the numbers s(t) and a(j, t) be
defined as in Lemma 7.16. We now consider two cases:

• There is an instruction in the pipeline in circuit stages 2, 3, or 4 which is
  writing to register x and this is the most recent instruction writing to this
  register:
  ∃k ∈ [2 : 4] : hit_A^t[k] ∧ top_A^t[k] .
  This implies that there exists α such that k = a(α, t). Applying Lemma
  7.17 we get
  writesgpr(x, i − α − 1) .
  For all α' ∈ [0 : α − 1] we know by definition of a(j, t) that

  a(α', t) < a(α, t) ∧ full_{a(α',t)}^t .

  Since we have chosen the most recent instruction writing to x we have

  ¬hit_A^t[a(α', t)]

  and get by Lemma 7.17

  ¬writesgpr(x, i − α' − 1) .

  Hence, we have

  ∀j ∈ [i − α, i − 1] : ¬writesgpr(x, j) .

  Using the hardware construction, Lemma 7.16, and inv(k, t) we derive

  Ain_π^t = { C.3in_π^t   k = 2
            { C.4in_π^t   k = 3
            { gprin_π^t   k = 4
          = { C.3in_σ^{i−α−1}   k = 2
            { C.4in_σ^{i−α−1}   k = 3
            { gprin_σ^{i−α−1}   k = 4 .

  Since we don't have an active hazard signal in stage 2, we can conclude
  that instruction i − α − 1 is not a load instruction:

  0 = con.k.l_π^t = con.k.l_σ^{I(k,t)−1} = con.k.l_σ^{i−α−1} .

  Hence, we get

  Ain_π^t = gprin_σ^{i−α−1}
          = gpr_σ^{i−α}(x)
          = gpr_σ^i(x) .

• No hit in stages 2 to 4 exists:

  ∀k ∈ [2 : 4] : ¬hit_A^t[k] .

  Applying Lemma 7.17 we get

  ∀j ∈ [i − s(t), i − 1] : ¬writesgpr(x, j)

  and further derive

  Ain_π^t = gprouta_π^t
          = gpr_σ^{I(5,t)}(x)
          = gpr_σ^{i−s(t)}(x)
          = gpr_σ^i(x) .

7.4.6 Liveness

We have to show that all active hazard signals are eventually turned off, so
that no stage is stalled forever. By the definition of the stall signals we have

¬stall_{k+1}^t ∧ ¬haz_k^t → ¬stall_k^t ,

i.e., a stage whose successor stage is not stalled and whose hazard signal is off
is not stalled either. From

stall_6 = haz_5 = haz_4 = haz_3 = 0

we conclude
k ≥ 3 → ¬stall_k ,
i.e., stages k ≥ 3 are never stalled. Stages k with empty input stage k − 1 are
never stalled. Thus it suffices to show the following lemma.

Lemma 7.18 (pipelined MIPS liveness).

full_1^t ∧ haz_2^t ∧ haz_2^{t+1} → ¬haz_2^{t+2} ,

i.e., with a full input stage, stage 2 is not stalled for more than 2 successive
cycles.
Proof. From the definitions of the signals in the stall engine we conclude suc-
cessively:

stall_2^t = stall_2^{t+1} = 1
full_1^{t+1} = 1
ue_2^t = ue_2^{t+1} = 0 .

Using
stall_3 = stall_4 = 0
we conclude successively

full_2^{t+1} = full_2^{t+2} = 0
ue_3^{t+1} = 0
full_3^{t+2} = 0 .

Thus in cycle t + 2 stages 2 and 3 are both empty, hence the hit signals of
these stages are off:

X ∈ {A, B} → hit_X[2]^{t+2} = hit_X[3]^{t+2} = 0 ,

which implies
haz_2^{t+2} = 0 .


8
Caches and Shared Memory

In this chapter we implement a cache based shared memory system and prove
that it is sequentially consistent. Sequential consistency means: i) answers of
read accesses to the memory system behave as if all accesses to the memory
system were performed in some sequential order and ii) this order is consis-
tent with the local order of accesses [7]. Cache coherence is maintained by the
classical MOESI protocol as introduced in [16]. That a sequentially consistent
shared memory system can be built at the gate level is in a sense the funda-
mental result of multi-core computing. Evidence that it holds is overwhelm-
ing: such systems have been part of commercial multi-core processors for decades.
Much to our surprise, when preparing the lectures for this chapter, we found
in the open literature only one (undocumented) published gate level design of
a cache based shared memory system [17]. Closely related to our subject, there
is of course also an abundance of literature in the model checking community
showing for a great variety of cache protocols, that desirable invariants - in-
cluding cache coherence - are maintained, if accesses to the memory system
are performed atomically at arbitrary caches in an arbitrary sequential order.
In what follows we will call this variety of protocols atomic protocols. For a
survey on the verification techniques for cache coherence protocols see [13],
and for the model checking of the MOESI protocol we refer the reader to [4].
Atomic protocols and shared memory hardware differ in several important
aspects:
• Accesses to shared memory hardware are as often as possible performed in
parallel. After all, the purpose of multi-core computing is gaining speed by
parallelism. If memory accesses were sequential as in the atomic protocols,
memory would be a sequential bottleneck.
• Accesses to cache based hardware memory systems take one, two, or many
more hardware cycles. Thus, they are certainly not performed in an atomic
fashion.
Fortunately, we will be able to use the model checked invariants literally as
lemmas in the hardware correctness proof presented here, but very consider-


able extra proof effort will be required to establish a simulation between the
hardware computation and the atomic protocol. After it is established one
can easily conclude sequential consistency of the hardware system, because
the atomic computation is sequential to begin with.
In Sect. 8.1 we introduce what we call abstract caches and show that the
common basic types of single caches (direct mapped, k-way set associative,
fully associative) can be modelled as abstract caches. This will later permit
us to simplify notation considerably. It also permits us to unify most of the theory
of shared memory constructions for all basic cache types. However, presently
our definition of abstract caches does not yet include eviction addresses. The
construction we present involves direct mapped caches, and we have to deal
with eviction addresses below the abstract cache level. Modifying this small
part of the construction and the corresponding arguments to other types of
basic caches should not be hard. Modification of the definition of abstract
caches such that they can be used completely as a black box is still future
work. In the classroom it suffices to show that direct mapped caches can be
modeled as abstract caches.
In Sect. 8.2 we develop formalism permitting to deal with i) atomic proto-
cols, ii) hardware shared memory systems, and iii) simulations between them.
It is the best formalism we can presently come up with. Suggestions for im-
provement are welcome. If one aims at correctness proofs there is no way to
avoid this section (or an improved version of it) in the classroom.
Section 8.3 formulates in the framework of Sect. 8.2 the classical theory
of the atomic MOESI protocol together with some auxiliary technical results
that are needed later. Also we have enriched the standard MOESI protocol
by a treatment of compare-and-swap (CAS) operations. We did this for two
reasons: i) compare-and-swap operations are essential for the implementation
of locks. Thus, multi-core machines without such operations are of limited use
to put it mildly; ii) compare-and-swap is not a read followed by a conditional
write; it is an atomic read followed by a conditional write, and this makes a
large difference for the implementation.
A hardware-level implementation of the protocol for the direct mapped
caches is presented in Sect. 8.4. It has the obvious three parts: i) data paths,
ii) control automata, and iii) bus arbiter. The construction of data paths
and control automata is not exactly straightforward. Caches in the data part
generally consist of general 2-port RAMs, because they have to be able to
serve their processor and to participate in the snooping bus protocol at the
same time. We have provided each processor with two control automata: a
master automaton processing requests of the processor and a slave automaton
organizing the cache response to the requests of other masters on the bus. The
arbiter does round robin scheduling of bus requests. One should sleep an extra
hour at night before presenting this material in the classroom.
The correctness proof for the shared memory system is presented in
Sect. 8.5. An outline of the proof is given at the start of the section. Roughly
speaking, the proof contains the following kinds of arguments: i) showing that

bus arbitration guarantees at any time that at most one master automaton
is not idle, ii) showing the absence of bus contention (except on open collec-
tor buses¹), among other things by showing that during global transactions
(involving more than one cache) master and participating slave automata
stay “in sync”, iii) concluding that control signals and data “somehow corre-
sponding to the atomic protocol” are exchanged via the buses, iv) abstracting
memory accesses in the sense of Sect. 8.2 from the hardware computation and
ordering them sequentially by their end cycle; it turns out that for accesses
with identical end cycles it does not matter how we order them among each
other, and v) showing (by induction on the end cycles of accesses) that the
data exchanged via the buses are exactly the data exchanged by the atomic
protocol, if it were run in the memory system configuration at the end cy-
cle of the access. This establishes simulation and allows us to conclude that
cache invariants are maintained in the hardware computation (because hard-
ware simulates the atomic protocol, and there it was model-checked that the
invariants are maintained).
Many of the arguments of parts i) to iv) are tedious bookkeeping; in the
classroom it suffices to just state the corresponding lemmas and to present only
a few typical proofs. However, even in this preliminary/bookkeeping phase of
the proof the order of arguments is of great importance: the absence of bus
contention often hinges on the cache invariants. Part v) is not only hard; it
turns out that it is also highly dependent on properties of the particular cache
protocol we are using. Thus, reinspection of the corresponding portions of the
proof is necessary, if one wants to establish shared memory correctness for a
different protocol.

8.1 Concrete and Abstract Caches


Caches are small and fast memories between the fast processor and the large
but slow main memory. Transporting data between main memory and cache
costs extra time, but this time is usually gained back because once data are in
the cache they are usually accessed several times (this is called locality) and
each of these accesses is much faster than an access to main memory. Also
caches are extra hardware units which increase cost. Because here we do not
consider hardware cost and cycle time, we cannot give quantitative arguments
why adding caches is cost effective. We refer the interested reader to [12]. Here
we are interested in explanations why the caches work.
There are three standard cache constructions: i) direct mapped, ii) k-way
associative, and iii) fully associative. In this section we review these three
constructions and then show that – as far as their memory content is concerned
– they all can be abstracted to what we call abstract caches. The correctness

¹ We do not use open collector buses to communicate with the main memory. Hence,
we do not worry about absence of bus contention on them.

Table 11. Synonyms and names of cache states s


s synonym name
10000 M modified
01000 O owned
00100 E exclusive
00010 S shared
00001 I invalid

proof of the shared memory construction of the subsequent sections will then
to a very large extent be based on abstract caches.

8.1.1 Abstract Caches and Cache Coherence

We use very specific parameters: an address length of 32 bits, line addresses


of 29 bits, a line size of 8 bytes². When it comes to states of cache lines, we
exclusively consider the 5 states of the MOESI protocol [16]. We code the 5
states of the MOESI protocol in unary in the state set

S = {00001, 00010, 00100, 01000, 10000} .

For the states, we use the synonyms and names from Table 11.
In the digital model, main memory is simply a line addressable memory
with configuration
mm : B29 → B64 .
An abstract cache configuration aca has the following components:
• data memory aca.data : B29 → B64 - simply a line addressable memory,
• state memory aca.s : B29 → S mapping each line address a to its current
state aca.s(a).
We denote the set of all possible abstract cache configurations by Kaca .
If a cache line a with a ∈ B29 has state I, i.e., aca.s(a) = I, then the data
aca.data(a) of this cache line is considered invalid or meaningless, otherwise
it is considered valid. When cache line a has valid data, we also say that we
have an abstract cache hit in cache line a:

ahit(aca, a) ≡ aca.s(a) ≠ I .

In case of a hit, we require the data output acadout(aca, a) of an abstract


cache to be aca.data(a) and the state output acasout(aca, a) to be aca.s(a):

ahit(aca, a) → acadout(aca, a) = aca.data(a) ∧ acasout(aca, a) = aca.s(a) .

² If line size was larger than the width of the memory bus, one would have to use
sectored caches. This would mildly complicate the control automata.

Fig. 124. An abstract cache aca and a main memory mm are abstracted to a single
memory m(h)

Fig. 125. Many caches ca(i) and a main memory mm are abstracted to a shared
memory m(h)

From a single abstract cache aca and a main memory mm as sketched in
Fig. 124 one can define an implemented memory m : B^29 → B^64 by

m(a) = { aca.data(a)   ahit(aca, a)
       { mm(a)         otherwise .

In this definition, valid data in the cache hide the data in main memory.
A much more practical and interesting situation arises if P many abstract
caches aca(i) are coupled with a main memory mm as shown in Fig. 125 to
get the abstraction of a shared memory. We intend to connect such a shared
memory system with p processors. The number of caches will be P = 2p. For
i ∈ [0 : p−1] we will connect processor i with cache aca(2i), which will replace
the instruction port of the data memory, and with cache aca(2i + 1), which
will replace the data port of the data memory.
Again, we want to get a memory abstraction by hiding the data in main
memory by the data in caches. But this only works if we have an invari-
ant stating coherence resp. consistency of caches, namely that valid data in
different caches are identical:

Fig. 126. Decomposition of a byte address ad into tag ad.t (τ bits), cache line
address ad.c (ℓ bits), and line offset ad.o (3 bits)

Fig. 127. Decomposition of a line address a into tag a.t (τ bits) and cache line
address a.c (ℓ bits)

aca(i).s(a) ≠ I ∧ aca(j).s(a) ≠ I → aca(i).data(a) = aca(j).data(a) .

The purpose of cache coherence protocols like the one considered in this chap-
ter is to maintain this invariant. With this invariant the following definition
of an implemented memory m is well defined:

m(a) = { aca(i).data(a)   ahit(aca(i), a)
       { mm(a)            otherwise .
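To make the abstraction concrete, the following Python sketch (our own illustration,
not part of the book's formal development; the names AbstractCache and
implemented_memory are ours) models a system of abstract caches over a main
memory and computes the implemented memory m under the coherence invariant.

    # Minimal sketch of abstract caches and the memory abstraction m(ms).
    # States are the MOESI states; 'I' marks an invalid (meaningless) line.
    class AbstractCache:
        def __init__(self):
            self.state = {}            # line address -> state, default 'I'
            self.data = {}             # line address -> 64-bit line data

        def s(self, a):
            return self.state.get(a, 'I')

        def ahit(self, a):             # abstract cache hit: line is valid
            return self.s(a) != 'I'

    def implemented_memory(caches, mm, a):
        # Valid data in some cache hide the data in main memory.  Cache
        # coherence guarantees that all valid copies of line a are identical,
        # so it does not matter which hitting cache we pick.
        for ca in caches:
            if ca.ahit(a):
                return ca.data[a]
        return mm.get(a, 0)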

8.1.2 Direct Mapped Caches

All cache constructions considered here use the decomposition of byte ad-
dresses ad ∈ B^32 into three components as shown in Fig. 126:
• a line offset ad.o ∈ B^3 within lines,
• a cache line address ad.c ∈ B^ℓ. This is the (short) address used to address
  the (small) RAMs constituting the cache,
• a tag ad.t ∈ B^τ with
  τ + ℓ + 3 = 32 ,
  which completes cache line addresses to line addresses:

  ad.l = ad.t ◦ ad.c .

For line addresses a ∈ B^29 this gives a decomposition into two components as
shown in Fig. 127.
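For concreteness, the decomposition can be computed with shifts and masks, as in
the following Python sketch (our own; the parameter name ell stands for the width
ℓ of the cache line address).

    # Split a 32-bit byte address into (tag, cache line address, line offset).
    # ell is the number of cache line address bits; tau = 32 - ell - 3.
    def decompose(ad, ell):
        offset = ad & 0x7                      # ad.o : 3 bits, offset within the line
        c = (ad >> 3) & ((1 << ell) - 1)       # ad.c : ell bits, cache line address
        tag = ad >> (3 + ell)                  # ad.t : remaining tau bits
        return tag, c, offset

    # The line address ad.l is the concatenation tag ∘ c:
    def line_address(ad, ell):
        tag, c, _ = decompose(ad, ell)
        return (tag << ell) | c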
We structure the hardware configurations h of our constructions by in-
troducing cache components h.ca. Direct mapped caches have the following
cache line addressable components:
• data memory h.ca.data : B^ℓ → B^64 implemented as a multi-bank RAM,
• tag memory h.ca.tag : B^ℓ → B^τ implemented as an ordinary static RAM,
  and
• state memory h.ca.s : B^ℓ → B^5 implemented as a cache state RAM.

Fig. 128. Data paths of a direct mapped cache h.ca

The standard construction of the data paths of a direct mapped cache is
shown in Fig. 128. Note that cache states are stored in a cache state RAM.
This permits making all cache lines invalid by activation of the inv signal.
Any data with line address a are stored at cache line address a.c. At any time
this is only possible for a single address a³. The tag a.t completing the cache
line address a.c to the line address a is stored in ca.tag(a.c).
The hardware hit signal is computed as

hhit(h.ca, a) ≡ h.ca.s(a.c) ≠ I ∧ h.ca.tag(a.c) = a.t .

We define the abstract cache aca(h) for a direct mapped cache by

aca(h).s(a) = { h.ca.s(a.c)   hhit(h.ca, a)
              { I             otherwise

aca(h).data(a) = { h.ca.data(a.c)   hhit(h.ca, a)
                 { ∗                otherwise ,

where ∗ simply indicates a “don’t care” entry for invalid data.

³ That caches are smaller than main memory is achieved by mapping many line
addresses to the same cache line address.

Fig. 129. Connection of way i to the data paths of a k-way associative cache

Lemma 8.1 (direct mapped cache abstraction). aca(h) is an abstract
cache.
Proof. The hardware hit signal hhit(h, a) is active for the addresses where the
abstract hit signal is on:

hhit(h.ca, a) ≡ h.ca.s(a.c) ≠ I ∧ h.ca.tag(a.c) = a.t
             ≡ aca(h).s(a) ≠ I
             ≡ ahit(aca(h), a) .

In case of an abstract hit ahit(aca(h), a) we also have a concrete hit
hhit(h.ca, a). For the data and state outputs of the direct mapped cache in
this case, we conclude

cadout(h.ca, a) = h.ca.data(a.c)
                = aca(h).data(a)
                = acadout(aca(h), a)
casout(h.ca, a) = h.ca.s(a.c)
                = aca(h).s(a)
                = acasout(aca(h), a) .


8.1.3 k-way Associative Caches

As shown in Fig. 129, k-way associative caches (also called set associative
caches) consist of k copies h.ca(i) of direct mapped caches for i ∈ [0 : k − 1]

which are called ways. Individual hit signals hhit(i), cache data out signals
cadout(i), and cache state out signals casout(i) are computed in each way i as

hhit(i)(h.ca, a) = hhit(h.ca(i), a)
cadout(i)(h.ca, a) = cadout(h.ca(i), a)
casout(i)(h.ca, a) = casout(h.ca(i), a) .

A hit in any of the individual caches constitutes a hit in the set associative
cache:

hhit(h.ca, a) = ⋁_i hhit(i)(h.ca, a) .

Joint data output cadout(h.ca, a) and state output casout(h.ca, a) are ob-
tained by multiplexing the individual data and state outputs under control of
the individual hit signals:

cadout(h.ca, a) = ⋁_i cadout(i)(h.ca, a) ∧ hhit(i)(h.ca, a)
casout(h.ca, a) = ⋁_i casout(i)(h.ca, a) ∧ hhit(i)(h.ca, a) .

Initialization and update of the cache must maintain the invariant that valid
tags in different ways belonging to the same cache line address are distinct:

i ≠ j ∧ h.ca(i).s(a.c) ≠ I ∧ h.ca(j).s(a.c) ≠ I →
    h.ca(i).tag(a.c) ≠ h.ca(j).tag(a.c) .

This implies that for every line address a, a hit can occur in at most one way.

Lemma 8.2 (hit unique).

hhit(i)(h.ca, a) ∧ hhit(j)(h.ca, a) → i = j

Proof. Assume

hhit(i)(h.ca, a) ∧ hhit(j)(h.ca, a) ∧ i ≠ j .

Then,

h.ca(i).s(a.c) ≠ I ∧ h.ca(j).s(a.c) ≠ I

and

h.ca(j).tag(a.c) = a.t = h.ca(i).tag(a.c) .

This contradicts the invariant.




We can now define aca'(h) by

aca'(h).s(a) = { h.ca(i).s(a.c)   hhit(i)(h.ca, a)
               { I                otherwise

aca'(h).data(a) = { h.ca(i).data(a.c)   hhit(i)(h.ca, a)
                  { ∗                   otherwise .

This is well defined by Lemma 8.2.

Lemma 8.3 (k-way associative cache abstraction). aca'(h) is an ab-
stract cache.
Proof. We have

hhit(h.ca, a) = ∃i : hhit(h.ca(i), a)
              ≡ aca'(h).s(a) ≠ I
              ≡ ahit(aca'(h), a) .

In case of an abstract hit ahit(aca'(h), a) we also have by Lemma 8.2 a unique
concrete hit hhit(h.ca(i), a). For the data and state outputs of the k-way
associative cache, we conclude

cadout(h.ca, a) = h.ca(i).data(a.c)
                = aca'(h).data(a)
                = acadout(aca'(h), a)
casout(h.ca, a) = h.ca(i).s(a.c)
                = aca'(h).s(a)
                = acasout(aca'(h), a) .
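The way selection implied by Lemma 8.2 can be illustrated as follows: since at most
one way hits, OR-ing the way outputs masked by the hit signals amounts to selecting
the hitting way. The Python sketch below is our own and reuses the
DirectMappedCache sketch from the previous subsection.

    # Sketch of the output selection of a k-way set associative cache.
    # 'ways' is a list of DirectMappedCache-like objects (see the sketch above).
    def kway_lookup(ways, tag, c):
        hits = [w.hhit(tag, c) for w in ways]
        assert sum(hits) <= 1          # Lemma 8.2: at most one way can hit
        if not any(hits):
            return False, None, 'I'    # miss
        i = hits.index(True)           # the unique hitting way
        return True, ways[i].data[c], ways[i].s[c]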




8.1.4 Fully Associative Caches

These caches have the same components h.ca.s, h.ca.tag, and h.ca.data as
direct mapped caches, but data for any line address a can be stored at any
cache line and are addressed with a cache line address b ∈ B^α:
• Data memory h.ca.data : B^α → B^64 is implemented as an SPR RAM.
• Tag memory h.ca.tag : B^α → B^29 is implemented as an SPR RAM. The
  tag RAM has width 29 so that it can store entire line addresses.
• State memory h.ca.s : B^α → B^5 is implemented as an SPR RAM extended
  with the invalidation option of a cache state RAM⁴.

Fig. 130. Data paths of a fully associative cache

A fully associative cache can be viewed as a k-way associative cache with
k = 2^α and ℓ = 0.
Figure 130 shows the data paths of a fully associative cache. The RAMs
are addressed by a cache line address d ∈ B^α which is only used for updating
the cache. For each of the RAMs X, one needs simultaneous access to all
register contents X[b] for every cache line address b ∈ B^α. This, together
with the 2^α equality testers that we use in the hit signal computation, makes
fully associative caches expensive.
A hit for line address a occurs at cache line address b if a can be found in
the tag RAM at address b and the state of this cache line is valid:

hhit(b)(h.ca, a) ≡ h.ca.tag(b) = a ∧ h.ca.s(b) ≠ I .

⁴ We leave the construction of such a RAM as an easy exercise for the reader.

A hit for the entire fully associative cache occurs if at least one of the lines
contains the valid data for a:

hhit(h.ca, a) = ⋁_b hhit(b)(h.ca, a) .

One maintains the invariant that valid tags are distinct:

b ≠ b' ∧ h.ca.s(b) ≠ I ∧ h.ca.s(b') ≠ I → h.ca.tag(b) ≠ h.ca.tag(b') .

Along the lines of the proof of Lemma 8.2 this permits us to show the unique-
ness of cache lines producing a hit.

Lemma 8.4 (fully associative hit unique).

hhit(b)(h.ca, a) ∧ hhit(b')(h.ca, a) → b = b'

Outputs are constructed as

cadout(h.ca, a) = ⋁_b ca.data(b) ∧ hhit(b)(h.ca, a)
casout(h.ca, a) = ⋁_b ca.s(b) ∧ hhit(b)(h.ca, a) .

We define aca''(h) by

aca''(h).s(a) = { h.ca.s(b)   hhit(b)(h.ca, a)
                { I           otherwise

aca''(h).data(a) = { h.ca.data(b)   hhit(b)(h.ca, a)
                   { ∗              otherwise .

Lemma 8.5 (fully associative cache abstraction). aca''(h) is an abstract
cache.
Proof. We have

hhit(h.ca, a) = ⋁_b hhit(b)(h.ca, a)
              ≡ aca''(h).s(a) ≠ I
              ≡ ahit(aca''(h), a) .

In case of an abstract hit ahit(aca''(h), a) we also have by Lemma 8.4 a
unique concrete hit hhit(b)(h.ca, a). For the data and state outputs of the
fully associative cache, we conclude

cadout(h.ca, a) = h.ca.data(b)
                = aca''(h).data(a)
                = acadout(aca''(h), a)
casout(h.ca, a) = h.ca.s(b)
                = aca''(h).s(a)
                = acasout(aca''(h), a) .



So far, we have not yet explained how to update caches. For different types
of concrete caches this is done in different ways. In what follows we elaborate
details only for direct mapped caches.

8.2 Notation
We summarize a large portion of the notation we are going to use in the
remainder of this book.

8.2.1 Parameters

Our construction uses the following parameters:


• p – denotes the number of processors. The set of processor IDs is [0 : p−1].
• P = 2p – denotes the number of caches. There is one instruction cache
and one data cache per processor. The set of cache indices is [0 : P − 1].

8.2.2 Memory and Memory Systems

The user visible memory model we aim at is a line addressable multi-bank


RAM, i.e., memory configurations are mappings

m : B29 → B64 .

The set of all possible memory configurations is denoted by Km .


A user visible memory will be realized by several flavors of memory sys-
tems. A memory system configuration has components:
• ms.mm : B29 → B64 . This is simply line addressable memory.
• ms.aca : [0 : P − 1] → Kaca . This is simply a sequence of abstract cache
configurations.
The set of memory system configurations is denoted by Kms . In memory
systems we will always keep data caches consistent, i.e., we maintain the
invariant

ms.aca(i).s(a) ≠ I ∧ ms.aca(j).s(a) ≠ I →
    ms.aca(i).data(a) = ms.aca(j).data(a) .

From memory systems ms we abstract memories m(ms) in a way described
before:

m(ms)(a) = { ms.aca(i).data(a)   ms.aca(i).s(a) ≠ I
           { ms.mm(a)            otherwise .

For line addresses a ∈ B^29, we project all components ms.mm(a) and
aca(i).X(a) with X ∈ {data, s} belonging to address a in the memory system
slice Π(ms, a):

Π(ms, a) = (ms.aca(0).data(a), ms.aca(0).s(a),
            ...,
            ms.aca(P − 1).data(a), ms.aca(P − 1).s(a),
            ms.mm(a)) .

This definition would be shorter if memory systems were tensors. Then a slice
would simply be the submatrix with coordinate a⁵.

⁵ Actually, we could choose notation coming closer to this if we define an abstract
cache configuration for cache i and line a as

    aca(i, a) = (aca(i, a).s, aca(i, a).data) .

Then the abstract cache component of a memory system slice is defined like a
row of a matrix:

    Π(ms, a) = (ms.aca([0 : P − 1], a), ms.mm(a)) .
would simply be the submatrix with coordinate a5 .

8.2.3 Accesses and Access Sequences

Memories will be accessed sequentially by accesses. Memory systems will be


accessed sequentially or in parallel by accesses.
The set of accesses is defined as Kacc . An access acc ∈ Kacc has the
following components:
• processor address acc.a[28 : 0] (line address),
• processor data acc.data[63 : 0] – the input data in case of a write or a
compare-and-swap (CAS),
• comparison data acc.cdata[31 : 0] – the data for comparison in case of a
CAS access,
• the byte write signals acc.bw[7 : 0] for write and CAS accesses,
• write signal acc.w,
• read signal acc.r,

• CAS signal acc.cas, and


• flush request acc.f .
At most one of the bits w, r, cas, or f must be on. In case none of these bits
is one, we call such an access void. A void access does not update memory
and does not produce an answer.
For technical reasons, we also require the byte write signals to be off in
read accesses and to mask one of the words in case of CAS accesses:

acc.r → acc.bw = 0^8
acc.cas → acc.bw ∈ {0^4 1^4, 1^4 0^4} .

For CAS accesses, we define the predicate test(acc, d), which compares
acc.cdata with the upper or the lower word of the data d ∈ B^64 depending on
the byte write signal acc.bw[0]:

test(acc, d) ≡ acc.cdata = { d[63 : 32]   ¬acc.bw[0]
                           { d[31 : 0]    acc.bw[0] .
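For illustration, the access format and the CAS test can be written down directly;
the following Python sketch is our own, with field names chosen to follow the
components listed above.

    from dataclasses import dataclass

    # Sketch of an access: line address, 64-bit data, 32-bit comparison data,
    # byte write signals, and the type bits r/w/cas/f (at most one of them set).
    @dataclass
    class Access:
        a: int = 0          # 29-bit line address
        data: int = 0       # 64-bit input data for write / CAS
        cdata: int = 0      # 32-bit comparison data for CAS
        bw: int = 0         # 8 byte write signals, bit i enables byte i
        r: bool = False
        w: bool = False
        cas: bool = False
        f: bool = False

    def test(acc, d):
        # Compare acc.cdata with the upper or lower word of d, selected by bw[0].
        word = d & 0xFFFFFFFF if acc.bw & 1 else (d >> 32) & 0xFFFFFFFF
        return acc.cdata == word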

As the name suggests, access sequences are finite or infinite sequences of ac-
cesses. As with caches and abstract caches we use the same notation acc both
for single accesses and access sequences. Access sequences come in two flavors:
• Sequential access sequences. These are simply mappings acc : N → Kacc
in the infinite case and acc : [0 : n − 1] → Kacc for some n in the finite
case.
• Multi-port access sequences

acc : [0 : P − 1] × N → Kacc ,

where acc(i, k) denotes access number k to cache (port) i.

8.2.4 Sequential Memory Semantics

Semantics of single accesses acc operating on a memory m is specified by a


memory update function

δM : Km × Kacc → Km

and the answers


dataout(m, acc) ∈ B64
of read and CAS accesses. Let

m' = δM(m, acc) .

Then memory is updated like a multi-bank memory:




m'(a) = { modify(m(a), acc.data, acc.bw)   acc.a = a ∧ (acc.w ∨ acc.cas ∧ test(acc, m(acc.a)))
        { m(a)                             otherwise .

For CAS accesses, if the data comparison test(acc, m(acc.a)) succeeds, we call
the CAS access positive. Otherwise we call it negative.
The answers dataout(m, acc) of read or CAS accesses are defined as

acc.r ∨ acc.cas → dataout(m, acc) = m(acc.a) .

Void and flush accesses do not have any effect on the memory and do not
produce an answer.
The change of memory state by sequential access sequences acc of accesses
and the corresponding outputs dataout[i] are defined in the obvious way by

Δ_M^0(m, acc) = m
Δ_M^{i+1}(m, acc) = δM(Δ_M^i(m, acc), acc[i])
dataout(m, acc)[i] = dataout(Δ_M^i(m, acc), acc[i]) .

An easy induction on y shows that performing x + y accesses is the same as
first performing x and then y accesses.

Lemma 8.6 (decomposition of access sequences). Let

m' = Δ_M^x(m, acc[0 : x − 1]) .

Then,

Δ_M^{x+y}(m, acc[0 : x + y − 1]) = Δ_M^y(m', acc[x : x + y − 1]) .
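The sequential memory semantics δM and its iteration can be prototyped from these
definitions. The Python sketch below is our own (the helper names modify, delta_M,
and run are not from the book) and reuses the Access class and the test function
from the previous sketch; it returns the updated memory together with the answers
of read and CAS accesses.

    def modify(old, data, bw):
        # Replace exactly the bytes of the 64-bit line enabled by bw
        # (byte i is assumed to occupy bits [8i+7 : 8i]).
        mask = 0
        for i in range(8):
            if bw & (1 << i):
                mask |= 0xFF << (8 * i)
        return (old & ~mask) | (data & mask)

    def delta_M(m, acc):
        # One step of sequential memory semantics; m maps line addresses to lines.
        m = dict(m)
        old = m.get(acc.a, 0)
        if acc.w or (acc.cas and test(acc, old)):
            m[acc.a] = modify(old, acc.data, acc.bw)
        return m

    def run(m, accs):
        # Iterated semantics Delta_M together with the answers dataout[i].
        answers = []
        for acc in accs:
            answers.append(m.get(acc.a, 0) if (acc.r or acc.cas) else None)
            m = delta_M(m, acc)
        return m, answers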

8.2.5 Sequentially Consistent Memory Systems

For multi-port access sequences acc, we denote by msdout(ms, acc, i, k) the


answer of the system to read or CAS access acc(i, k) if the initial configuration
of the memory system is ms.
A sequential ordering of the accesses is simply a bijective mapping

seq : [0 : P − 1] × N → N ,

which respects the local order of accesses, i.e., which satisfies

k < k' → seq(i, k) < seq(i, k') .

A memory system ms is called sequentially consistent if for any multi-port ac-
cess sequence acc there exists a sequential ordering seq satisfying the following
condition. Let the sequential access sequence acc' be defined as

acc'[seq(i, k)] = acc(i, k) ,

then for read accesses the answer msdout(ms, acc, i, k) to access acc(i, k) of
the multi-port access sequence acc is the same as the answer to access seq(i, k)
of the sequential access sequence acc':

acc(i, k).r ∨ acc(i, k).cas →
    msdout(ms, acc, i, k) = dataout(m(ms), acc')[seq(i, k)] .

By the definition of function dataout this is equivalent to

msdout(ms, acc, i, k) = dataout(m(ms), acc')[seq(i, k)]
                      = dataout(Δ_M^{seq(i,k)}(m(ms), acc'), acc'[seq(i, k)])
                      = Δ_M^{seq(i,k)}(m(ms), acc')(acc'[seq(i, k)].a)
                      = Δ_M^{seq(i,k)}(m(ms), acc')(acc(i, k).a) .
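For small traces, sequential consistency can be checked by exhibiting such an ordering.
The Python sketch below is our own (the trace format and function names are
assumptions); it tests whether a given ordering seq respects the local order and
reproduces the observed answers, reusing run from the previous sketch.

    def respects_local_order(seq, trace):
        # trace: list of (port i, local index k, access, observed answer).
        # seq maps (i, k) to a global position; local order must be preserved.
        return all(seq[(i, k)] < seq[(i2, k2)]
                   for (i, k, _, _) in trace
                   for (i2, k2, _, _) in trace
                   if i == i2 and k < k2)

    def explains_trace(seq, trace, m0):
        # Replay the accesses in the order given by seq on the abstracted memory
        # m0 and compare the answers of reads and CAS accesses with the trace.
        order = sorted(trace, key=lambda e: seq[(e[0], e[1])])
        _, answers = run(m0, [e[2] for e in order])
        return all(ans is None or ans == e[3] for e, ans in zip(order, answers))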

8.2.6 Memory System Hardware Configurations

We collect the components of a memory system into the following components


of hardware configuration h:
• main memory component h.mm,
• (direct mapped) cache components h.ca(i); in theses components we col-
lect cache RAMs h.ca(i).X for X ∈ {data, s, tag} which we have already
introduced, but later we also add registers h.ca(i).Y of the cache control
and the data paths of cache i.
We denote by
aca(i) = aca(h.ca(i))
the abstract cache abstracted from cache RAMs h.ca(i).X of cache i as ex-
plained in Sect. 8.1.2. For hardware cycles t, we abbreviate the states of hard-
ware cache i and abstract cache i in cycle t as

ca(i)^t = h^t.ca(i)
aca(i)^t = aca(h^t.ca(i)) .

For components X ∈ {data, s} of abstract caches and components Y of hard-
ware cache h.ca(i), we use the notation

ca(i).Y^t = h^t.ca(i).Y
aca(i).X^t = aca(h^t.ca(i)).X .

The hardware constitutes a memory system

ms(h) = (ms(h).mm, ms(h).aca)



with

ms(h).mm = h.mm
ms(h).aca(i) = aca(h.ca(i)) ,

which in turn permits the definition of a memory abstraction

m(h) = m(ms(h)) .

8.3 Atomic MOESI Protocol


We specify the MOESI protocol in five steps:
1. For any system of abstract caches ms.aca(i) and main memory ms.mm
we formulate the state invariants for the five states M , O, E, S, I involved
in the protocol.
2. We present the protocol in a way that is common in literature, namely by
tables prescribing how to run the protocol one access at a time. We give
this version of the protocol a special name and call it atomic, because it
performs each access sequentially in an atomic way without interference
of any other access.
3. We translate the master and slave tables into switching functions C1, C2,
and C3.
4. Using functions Ci we give an algebraic specification of the atomic MOESI
protocol.
5. We specify how the caches and, if applicable, the main memory exchange
data after the protocol information of step 2 has been exchanged.
We then review the classical proof that a system of caches ca(i) and main
memory mm running the atomic MOESI protocol behaves like memory. Ob-
serve that the atomic system is sequentially consistent for completely trivial
reasons: it runs sequentially.
The beauty of the protocol as introduced in [16] is that it permits a par-
allel implementation which nevertheless simulates the atomic protocol. The
first such construction in the open literature is the undocumented design of
OpenSPARC T1 and T2 processors [17]. Here we present, to the best of our
knowledge, the first such construction which is documented and accompa-
nied by a correctness proof. Indeed, the classical result that state invariants
are preserved in the atomic protocol is a crucial lemma in our proof. It is,
however, only a part of our main induction hypothesis.

8.3.1 Invariants

For the memory system ms under consideration, we abbreviate



mm = ms.mm
aca = ms.aca .
One calls the data in a cache line clean if this data are known to be the
same as in the main memory, otherwise it is called dirty. A line is exclusive
if the line is known to be only in one cache, otherwise it is called shared. The
intended meaning of the states is:
• E – exclusive clean (the data are in one cache and are clean).
• S – shared (the data might be in other caches and might be not clean).
• M – exclusive modified (the data are in one cache and might be not clean).
• O – owned (the data might be in other caches and might be not clean; the
cache with this line in owned state is responsible for writing it back to the
memory or sending it on demand to other caches).
• I – invalid (the data are meaningless).
This intended meaning is formalized in a crucial set of state invariants:
1. States E and M are exclusive; in other caches the line is invalid:

   aca(i).s(a) ∈ {E, M} ∧ j ≠ i → aca(j).s(a) = I .

2. State E is clean:

   aca(i).s(a) = E → aca(i).data(a) = mm(a) .

3. Shared lines, i.e., lines in state S, are clean or they have an owner:

   aca(i).s(a) = S → aca(i).data(a) = mm(a) ∨ ∃j ≠ i : aca(j).s(a) = O .

4. Data in lines in non-exclusive state are identical:

   aca(i).s(a) = S ∧ aca(j).s(a) ∈ {O, S} →
       aca(i).data(a) = aca(j).data(a) .

5. If a line is non-exclusive, i.e., in state S or O, other copies must be invalid
   or in a non-exclusive state. Moreover the owner is unique:

   aca(i).s(a) = S ∧ j ≠ i → aca(j).s(a) ∈ {I, O, S}
   aca(i).s(a) = O ∧ j ≠ i → aca(j).s(a) ∈ {I, S} .

We introduce the notation sinv(ms)(a) to denote that the state invariants
hold for cache line address a with a system aca of abstract caches and main
memory mm. For cycle numbers t, we denote by SINV(t) the fact that the
state invariants hold for the memory system ms(h) abstracted from the hard-
ware for all cycles t' ∈ [0 : t], i.e., from cycle 0 after reset until t:

sinv(ms) ≡ ∀a : sinv(ms)(a)
SINV(t) ≡ ∀t' ∈ [0 : t] : sinv(ms(h^{t'})) .
One easily checks that the state invariants hold if all cache lines are invalid.
In the hardware construction, this will be the state of caches after reset.
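The state invariants can be checked directly on a memory system configuration. The
following Python sketch is our own executable reading of the five invariants (it reuses
the AbstractCache sketch of Sect. 8.1.1 and is not part of the hardware construction).

    def sinv(caches, mm, a):
        # Check the five MOESI state invariants for line address a.
        st = [ca.s(a) for ca in caches]
        dat = [ca.data.get(a) for ca in caches]
        for i, si in enumerate(st):
            others = [st[j] for j in range(len(st)) if j != i]
            if si in ('E', 'M') and any(s != 'I' for s in others):
                return False                      # 1. E and M are exclusive
            if si == 'E' and dat[i] != mm.get(a, 0):
                return False                      # 2. E is clean
            if si == 'S' and dat[i] != mm.get(a, 0) and 'O' not in others:
                return False                      # 3. S is clean or has an owner
            if si == 'S':
                for j, sj in enumerate(st):
                    if j != i and sj in ('O', 'S') and dat[i] != dat[j]:
                        return False              # 4. non-exclusive copies agree
                if any(s not in ('I', 'O', 'S') for s in others):
                    return False                  # 5. no exclusive copy next to S
            if si == 'O' and any(s not in ('I', 'S') for s in others):
                return False                      # 5. other copies of an owned line are I or S
        return True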

Lemma 8.7 (invalid state satisfies invariants).

(∀a, i : aca(i).s(a) = I) → sinv(ms)

8.3.2 Defining the Protocol by Tables

We stress the fact that the atomic protocol is a sequential protocol operating
on a multi-port memory system ms. Its semantics is defined by two functions:
• A transition function

δ1 : Kms × Kacc × [0 : P − 1] → Kms ,

where
ms' = δ1(ms, acc, i)
defines the new memory system if single access acc is applied to (cache)
port i of memory system ms.
• An output function

pdout1 : Kms × Kacc × [0 : P − 1] → B64 ,

where
d = pdout1(ms, acc, i)
specifies for read and CAS accesses (i.e., accesses with acc.r or acc.cas)
memory system output d as response to access acc at port i in memory
system configuration ms.
We abbreviate

mm' = ms'.mm
aca' = ms'.aca .

The processing of accesses is summarized in Tables 131(a) and 131(b).


We first describe somewhat informally how the tables are interpreted. In
Sect. 8.3.4 we translate this description into an algebraic specification.
Every access acc(i, k) is processed by cache aca(i) which is called the
master of the access. Actions of the master are specified in Table 131(a). The
master determines the local state aca(i).s(acc(i, k).a) of cache line acc.a and
the type of the access, i.e., whether the access is a read, a write, a flush, or a
CAS. The state determines the row of the table to be used. The type of the
access determines the column. For a hit with a CAS access, we distinguish
cases when the data comparison succeeds (CAS+) and when it fails (CAS-).
In case of a cache miss on CAS, the master cannot predict whether the test
will succeed or fail and runs the protocol just like in case of a write miss.
There are two kinds of table entries in the master table: i) single states
and ii) others. A single state indicates that a cache can handle the access

master   read           write                  flush   CAS-          CAS+
M        M              M                      I       M             M
O        O              Ca, im, bc / ch?O:M    I       O             Ca, im, bc / ch?O:M
E        E              M                      I       E             M
S        S              Ca, im, bc / ch?O:M    I       S             Ca, im, bc / ch?O:M
I        Ca / ch?S:E    Ca, im / M             I       Ca, im / M    Ca, im / M
(a) Master state transitions

slave    Ca, ¬im, ¬bc        Ca, im, ¬bc             Ca, im, bc
         (read miss)         (write or CAS miss)     (write or CAS hit)
M        ch, di / O          di / I                  -
O        ch, di / O          di / I                  ch / S
E        ch, di / S          di / I                  -
S        ch / S              I                       ch / S
I        I                   I                       I
(b) Slave state transitions

Fig. 131. Protocol state transitions
(table entries list the activated protocol signals and, after the slash, the next state)

without contacting the other caches; for some flushes it still may have to
write back a cache line to main memory. In case i) the table entry specifies
the next state of the cache line. The table does not explicitly state how data
are to be processed; this is implicitly specified by the fact that we aim at a
memory construction and by the state invariants. We will make this explicit
in Sect. 8.3.4.
In case there is more than a single state in the master table entry, the
protocol is run in four steps. Three steps concern the exchange of signals
belonging to the memory protocol and the next state computation. The fourth
step involves the processing of the data and is only implicitly specified.
1. Out of three master protocol signals Ca, im, bc the master activates the
ones specified in the table entry. These signals are broadcast to the other
caches ca(j), j ≠ i, which are called the slaves of the access. The intuitive
meaning of the signals is:
• Ca – intention of the master to cache the line acc.a after the access is
processed.
• im – intention of the master to modify (write) the line.

• bc – intention of the master to broadcast the line after the write has
  been performed. This signal is activated after a write or a positive
  CAS hit with non-exclusive data.
2. The slaves j determine the local state aca(j).s(acc.a) of cache line acc.a,
which determines the row of Table 131(b) to be used. The column is
determined by the values of the master protocol signals Ca, im, and bc.
Each slave aca(j) goes to a new state as prescribed in the slave table entry
and activates two slave protocol signals ch(j) and di(j) as indicated by
the slave table entry used. The intuitive meaning of the signals is:
• ch(j) – cache hit in slave aca(j).
• di(j) – data intervention by slave aca(j). Slave aca(j) has the cache
line needed by the master and will put it on a bus, from which the
master can read it.
The individual signals are ORed together (in active low form on an open
collector bus) and made accessible to the master as

ch = ⋁_{j≠i} ch(j) ,   di = ⋁_{j≠i} di(j) .

3. The master determines the new state of the cache line accessed as a func-
tion of the slaves’ responses as indicated by the table entry used. The
notation ch ? X : Y is an expression borrowed from C and means

ch ? X : Y = { X   ch
             { Y   ¬ch .

4. The master processes the data. This step is discussed in Sect. 8.3.4.

8.3.3 Translating the Tables into Switching Functions

We extract from the tables three sets of switching functions. They correspond
to phases of the protocol, and we specify them in the order, in which they are
used in the protocol:
• C1 – this function is used by the master. It depends on a state⁶ s ∈ S and
the type of the access acc.type ∈ B4 , where

acc.type = (acc.r, acc.w, acc.cas, acc.f ) .

The function C1 computes the master protocol signals C1.Ca, C1.im, and
C1.bc. Thus,
⁶ Recall that we encode the cache states in the unary form as
  S = {00001, 00010, 00100, 01000, 10000} .



C1 : S × B4 → B3 .
The component functions C1.X are defined by translating the master pro-
tocol table, i.e., looking up the corresponding cell (s, type) in Table 131(a)
and choosing the necessary protocol bits accordingly:

∀X ∈ {Ca, im, bc} : C1(s, type).X = 1 ↔


master table entry (s, type) contains X.

In case of a cache hit on CAS (acc.cas = 1 and s ≠ I), we always choose


the column CAS+, since in case of CAS- the access is performed locally
and the protocol signals are not put on the bus anyway.
Using the construction of Lemma 2.20 for each component C1.X, the above
switching function can be turned into a switching circuit that we also call
C1. A symbol for this circuit is shown in Fig. 132.
• C2 – this function is used by slaves. It depends on a cache state s ∈ S
and the master protocol signals Ca, im, and bc. It computes slave protocol
signals C2.ch and C2.di, i.e., the slave response, and the next state C2.ss
for slaves. Thus,
C2 : S × B3 → B2 × S .
For component X ∈ {ch, di}, functions C2.X are defined by translating the
slave protocol table, i.e., looking up the corresponding cell (s, Ca, im, bc)
in Table 131(b) and choosing the necessary protocol bits accordingly:

∀X ∈ {ch, di} : C2(s, Ca, im, bc).X = 1 ↔


slave table entry (s, Ca, im, bc) contains X.

C2 also computes the next state of the slave:

C2(s, Ca, im, bc).ss = s' ↔
    slave table entry (s, Ca, im, bc) contains s' .

A symbol for the corresponding circuit is also shown in Fig. 132.


• C3 – this function depends on a state s ∈ S, the type of the access acc.type
and the slave response ch. It computes the next state C3.ps of the master.
Thus,
C3 : S × B4 × B → S .
The function is defined by translating the master protocol table:

C3(s, type, ch).ps = s' ↔
    master table entry (s, type) contains s'
    ∨ ∃s'' : ch ∧ master table entry (s, type) contains ch ? s' : s''
    ∨ ∃s'' : ¬ch ∧ master table entry (s, type) contains ch ? s'' : s' .

Fig. 132. Symbols for circuits C1, C2, and C3 computing the protocol signals and
next state functions of the MOESI protocol

In case of a cache hit on CAS (acc.cas = 1 and s ≠ I), we always choose
the column CAS+, since in case of CAS- the state of the master is not
changed at all and the output of C3 is simply ignored⁷.
The corresponding symbol for circuit C3 is shown in Fig. 132.
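The translation of the master table into C1 amounts to a table lookup. The following
Python sketch is our own reading of Table 131(a); encoding the table as a dictionary
is an assumption for illustration and stands in for the switching circuit obtained via
Lemma 2.20.

    # Master protocol signals (Ca, im, bc) from Table 131(a).  Only table
    # entries that actually activate signals are listed; every other
    # (state, type) combination is handled locally and sends no signals.
    # For CAS we use the CAS+ column, as explained above.
    C1_TABLE = {
        ('O', 'w'):   ('Ca', 'im', 'bc'),
        ('O', 'cas'): ('Ca', 'im', 'bc'),
        ('S', 'w'):   ('Ca', 'im', 'bc'),
        ('S', 'cas'): ('Ca', 'im', 'bc'),
        ('I', 'r'):   ('Ca',),
        ('I', 'w'):   ('Ca', 'im'),
        ('I', 'cas'): ('Ca', 'im'),
    }

    def C1(s, acc_type):
        # acc_type is one of 'r', 'w', 'cas', 'f' (cf. acc.type).
        signals = C1_TABLE.get((s, acc_type), ())
        return {'Ca': 'Ca' in signals, 'im': 'im' in signals, 'bc': 'bc' in signals}

    # Example: a write hit on a shared line requests caching, modification,
    # and a broadcast of the result.
    assert C1('S', 'w') == {'Ca': True, 'im': True, 'bc': True}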

8.3.4 Algebraic Specification

For the following definitions we assume sinv(ms), i.e., that the state invariants
hold for the memory system ms before the (sequential and atomic) processing
of access acc at port i.
For all components x of an access acc, we abbreviate
x = acc.x .
Note that a in this section and below, where applicable, denotes the line
address acc.a. Also, the functions we define depend on arguments ms.aca and
ms.mm. For brevity of notation we will omit these arguments most of the
time – but not always – in the remainder of this section. We now proceed to
define the effect of applying accesses acc to port i of the memory system ms
by specifying functions ms' = δ1(ms, acc, i) and d = pdout1(ms, acc, i).
We only specify the components that do change. We define a hit at atomic
abstract cache aca(i) by
hit(aca, a, i) ≡ aca(i).s(a) ≠ I .
We further identify local read and write accesses. A local read access is either
a read hit or a CAS hit with the negative test result. A local write is either a
write hit to exclusive data or a positive CAS hit to exclusive data:

rlocal(aca, acc, i) = hit(aca, a, i) ∧ (r ∨ cas ∧ ¬test(acc, aca(i).data(a)))
wlocal(aca, acc, i) = hit(aca, a, i) ∧ (w ∨ cas ∧ test(acc, aca(i).data(a)))
                      ∧ aca(i).s(a) ∈ {E, M} .
⁷ In Sect. 8.3.4 we treat cache hits on CAS with the negative test result the same
way as read hits and do not update the state of the cache line.

We say that an access to a cache system configuration aca at port i is local,
if it is a local read or a local write. An access is called global if it is not local
and not a flush:

local(aca, acc, i) = rlocal(aca, acc, i) ∨ wlocal(aca, acc, i)
global(aca, acc, i) = ¬local(aca, acc, i) ∧ ¬f .

Summarizing definitions given above, we have specified four types of accesses:
local reads (including negative CAS hits), local writes (including positive CAS
hits), global accesses, and flushes.
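These four access classes can be written down as predicates as well; the Python
sketch below is our own and builds on the AbstractCache and Access sketches from
the earlier sections.

    def hit(caches, acc, i):
        return caches[i].s(acc.a) != 'I'

    def rlocal(caches, acc, i):
        d = caches[i].data.get(acc.a, 0)
        return hit(caches, acc, i) and (acc.r or (acc.cas and not test(acc, d)))

    def wlocal(caches, acc, i):
        d = caches[i].data.get(acc.a, 0)
        return (hit(caches, acc, i)
                and (acc.w or (acc.cas and test(acc, d)))
                and caches[i].s(acc.a) in ('E', 'M'))

    def classify(caches, acc, i):
        if acc.f:
            return 'flush'
        if rlocal(caches, acc, i):
            return 'local read'
        if wlocal(caches, acc, i):
            return 'local write'
        return 'global'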
Now we define the transition function ms' = δ1(ms, acc, i) and the output
function d = pdout1(ms, acc, i) for every possible type of accesses. We aim at
the following:
• to maintain the state invariants, i.e., to have sinv(ms'),
• to have the resulting memory abstraction m(ms') behave as if access acc
  had been applied with ordinary memory semantics to the previous memory
  abstraction m(ms):

  m(ms') = δM(m(ms), acc) ,

• the response d to read accesses to be equal to the response given by the
  memory abstraction m(ms):

  pdout1(ms, acc, i) = dataout(m(ms), acc) = m(ms)(acc.a) .

Flush

A flush invalidates abstract cache line a and writes back the cache line in case
it is modified or owned:

f → aca'(i).s(a) = I ∧ (aca(i).s(a) ∈ {M, O} → mm'(a) = aca(i).data(a)) .

Note that we allow invalidation of any cache line, even the one which is initially
in state E, S, or I. The main memory in that case is not updated.

Local Write Accesses

Local write accesses update the local cache line addressed by a and change
the state to M:

wlocal(aca, acc, i) →
    aca'(i).data(a) = modify(aca(i).data(a), data, bw)
    aca'(i).s(a) = C3(aca(i).s(a), acc.type, ∗).ps
                 = M .

In case of positive CAS hits we also need to specify the output of the memory
system. We do this later in this section.

Global Accesses

For global accesses we run the MOESI protocol in an atomic way:

global(aca, acc, i) →
    mprot = C1(aca(i).s(a), acc.type)
    ∀j : sprot(j) = C2(aca(j).s(a), mprot)
    ∀X ∈ {ch, di} : sprot.X = ⋁_{j≠i} sprot(j).X
    ∀j : aca'(j).s(a) = { C3(aca(i).s(a), acc.type, sprot.ch).ps   i = j
                        { C2(aca(j).s(a), mprot).ss                i ≠ j .

Next, we specify the data broadcast bdata via the bus during a global transac-
tion. If the broadcast signal mprot(i).bc is active then the master broadcasts
the modified result modify(aca(i).data(a), data, bw). If a slave activates the
data intervention signal then it is responsible for putting the data on the bus.
The intervening slave j is unique by the state invariants sinv(ms). In other
cases the data are fetched from the main memory:

global(aca, acc, i) →
    bdata = { modify(aca(i).data(a), data, bw)   mprot(i).bc
            { aca(j).data(a)                     sprot(j).di
            { mm(a)                              otherwise .

During a global access the caches signaling a cache hit sprot(j).ch store the
broadcast result if the master activates a broadcast signal mprot(i).bc:

∀j ≠ i : global(aca, acc, i) ∧ mprot(i).bc ∧ sprot(j).ch →
    aca'(j).data(a) = bdata .

Note that in case of a write hit or a positive CAS hit the master and the
affected slaves store the same data for address a.
For global CAS accesses we define the test data as the old value stored in
cache i if we have a hit (which means that the broadcast signal is on) or as
the data obtained from the bus otherwise:

global(aca, acc, i) ∧ cas → tdata = { aca(i).data(a)   mprot(i).bc
                                    { bdata            otherwise .

Negative CAS misses are grouped together with the regular read misses into
global reads:

rglobal(aca, acc, i) = global(aca, acc, i) ∧ (r ∨ cas ∧ ¬test(acc, tdata)) .



A global write is a global access which is either a write or which is a CAS


with the positive result of the test:

wglobal(aca, acc, i) = global(aca, acc, i) ∧ (w ∨ cas ∧ test(acc, tdata)) .

In case of a global read the master copies the missing cache line from the bus
without modifications:

    rglobal(aca, acc, i) → aca'(i).data(a) = bdata .

In case of a global write the master either modifies its old data, if it
is a hit (which means that the broadcast signal is on), or modifies the data
obtained from the bus:

    wglobal(aca, acc, i) →
        aca'(i).data(a) = { modify(aca(i).data(a), data, bw)    mprot(i).bc
                            modify(bdata, data, bw)             otherwise .

Answers of Reads

For a read request or a CAS request, we return either the data from the local
cache or the data fetched from the bus:

    r ∨ cas → pdout1(ms, acc, i) = { aca(i).data(a)    hit(aca, a, i)
                                     bdata             otherwise .

Iterated Transitions

For memory systems ms, sequential access sequences acc', sequences is of
ports, and step numbers n, we define the effect of n steps of the atomic
protocol in the obvious way:

    Δ1^0(ms, acc', is) = ms
    Δ1^{n+1}(ms, acc', is) = δ1(Δ1^n(ms, acc', is), acc'[n], is[n]) .

The following lemma is proven by an easy induction on y.


Lemma 8.8 (decomposition of 1 step accesses). Let

    ms' = Δ1^x(ms, acc', is) .

Then,

    Δ1^{x+y}(ms, acc', is) = Δ1^y(ms', acc'[x : x + y − 1], is[x : x + y − 1]) .
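Operationally, Δ1 is just a fold of δ1 over the access and port sequences. The following small Python sketch (with δ1 passed in as an abstract function; all names are illustrative) spells this out together with the decomposition property of Lemma 8.8.

    def Delta1(n, ms, accs, ports, delta1):
        # Delta_1^n(ms, acc', is): apply n steps of the atomic protocol.
        for k in range(n):
            ms = delta1(ms, accs[k], ports[k])
        return ms

    def decomposition_holds(x, y, ms, accs, ports, delta1):
        # Lemma 8.8: x+y steps from ms equal y further steps from ms' = Delta_1^x(ms),
        # using the subsequences acc'[x : x+y-1] and is[x : x+y-1].
        ms_prime = Delta1(x, ms, accs, ports, delta1)
        return Delta1(x + y, ms, accs, ports, delta1) == \
               Delta1(y, ms_prime, accs[x:x + y], ports[x:x + y], delta1)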

8.3.5 Properties of the Atomic Protocol

In the atomic execution of the MOESI protocol the state invariants are pre-
served.
Lemma 8.9 (invariants maintained). Let ms' = δ1(ms, acc, i). Then,

    sinv(ms) → sinv(ms') .

Proof. The proof of this lemma is error prone, so it is usually shown by model
checking [4, 13]. 


An easy proof shows that we have achieved two more goals that were stated
before.
Lemma 8.10 (memory abstraction 1 step). Let ms' = δ1(ms, acc, i) and
let the state invariants sinv(ms) hold. Then,
• the resulting memory abstraction m(ms') behaves as if access acc had
been applied with ordinary memory semantics to the previous memory
abstraction m(ms):

    m(ms') = δM(m(ms), acc) ,

• the response to read accesses is equal to the response given by the memory
abstraction m(ms):

    pdout1(ms, acc, i) = dataout(m(ms), acc) = m(ms)(acc.a) .

By induction on y we show that the memory abstraction after y steps of the


atomic protocol is equal to the y memory steps applied to the initial memory
abstraction of the system.
Lemma 8.11 (memory abstraction 1 step iterative). Let the state in-
variants sinv(ms) hold. Then,

    m(Δ1^y(ms, acc', is)) = ΔM^y(m(ms), acc') .

The following technical lemma formalizes the fact that the abstract protocol
with an access acc only operates on memory system slice Π(ms, acc.a). The
reader might have observed that this address does not even occur in the tables
specifying the protocol, because everybody understands that line address
acc.a (alone) is concerned in each cache. Readers familiar with cache designs
will of course observe that read, write, or CAS accesses acc can trigger flushes
evicting cache lines with line addresses a ≠ acc.a; but these are treated as
separate accesses in our arguments.
Lemma 8.12 (properties 1 step). Let ms' = δ1(ms, acc, i) and a = acc.a.
Then,

1. Local read accesses don’t change the memory system:

    rlocal(ms.aca, acc, i) → ms' = ms .

2. Slices different from slice a of the memory system are not changed:

    b ≠ a → Π(ms', b) = Π(ms, b) .

3. Possible changes to slice a only depend on slice a:

∀ms1 , ms2 : Π(ms1 , a) = Π(ms2 , a) →


Π(δ1 (ms1 , acc, i), a) = Π(δ1 (ms2 , acc, i), a) .

4. Answers of reads to address a depend only on slice a:

∀ms1 , ms2 : Π(ms1 , a) = Π(ms2 , a) →


pdout1(ms1 , acc, i) = pdout1(ms2 , acc, i) .

Proof. The proof for every case is based on the following arguments.
1. For local reads we specified no change of ms.
2. In the definition of function δ1 we only specified components that change.
Slices other than slice a were not among them.
3. This is a simple bookkeeping exercise, where one has to compare all parts
of the definition of function δ1 for the two memory system configurations
ms1 and ms2. (This proof can be avoided if one defines function δ1 directly
as a function of slice Π(ms, a), but such a definition does not match the
hardware design so well.)
4. Bookkeeping exercise.



8.4 Gate Level Design of a Shared Memory System


We present the construction of a gate level design of a shared memory system
in the following order.
1. We specify in this section the interface between processors p(j) and caches
ca(i) and the interface between caches ca and the main memory bus b.
Bus b is extended by components b.mprot and b.sprot for the exchange of
protocol signals. These components are implemented as an open collector
bus.
2. We specify the data paths of each (direct mapped) cache ca(i). These data
paths have three obvious components for the data, tag, and state RAMs
of the cache. The data paths for the state RAM include circuits C1, C2,
and C3 introduced in Sect. 8.3.3 implementing the tables of the MOESI
protocol. Each cache ca(i) may have to serve two purposes simultaneously:

i) serving its processor as a master of accesses and ii) participating as a


slave in the protocol. Therefore, all RAMs ca(i).data, ca(i).tag, and ca(i).s
will be implemented as dual ported RAMs.
3. We present control automata. Each cache ca(i) has two such automata:
one for accesses where ca(i) is master and one for accesses when ca(i)
is slave. Thus, in a system with P caches we have 2P control automata.
Showing that master and slave automata are in some sense synchronized
while they are handling the same access will be a crucial part of the
correctness proof.
4. Accesses requiring cooperation of caches via the memory bus b are called
global accesses. In case several caches want to initiate a global access at the
same time (as masters) a bus arbiter has to grant the bus to one of them
and deny it to the others. Construction of the bus arbiter is presented at
the end of this section.
In this and the following chapter for signals and RAMs X of cache i we use
equivalent notations ca(i).X and X(i). We will also sometimes omit index (i)
if talking about internal computation of signals in a single cache.

8.4.1 Specification of Interfaces

We need to establish interfaces between


1. processors p and their caches; this is done by signals,
2. the caches ca(i) and the bus b; this is done (mostly) via dedicated registers.

p → ca Interface

Signals from a processor p to its cache ca(i):


• ca(i).preq ∈ B – processor request signal,
• ca(i).pdin ∈ B64 – processor data coming into the cache,
• ca(i).pcdin ∈ B64 – compare data for CAS requests,
• ca(i).pa ∈ B29 – processor line address,
• ca(i).type ∈ B3 = (ca(i).pr, ca(i).pw, ca(i).pcas) – type of an access; ex-
actly one of these bits should be active for every request,
• ca(i).bw ∈ B8 - byte write signals. They must be off for read requests and
half of them must be off for CAS requests:

    ca(i).preq ∧ ca(i).pr → ca(i).bw = 0^8
    ca(i).preq ∧ ca(i).pcas → ca(i).bw ∈ {0^4 1^4, 1^4 0^4} .

Recall that our main memory (Sect. 3.5.6) behaves as a ROM for addresses
0^{29−r}b, where b ∈ B^r for some small r. As a result, any write performed to
this memory region has no effect. Yet, in the sequential memory semantics
given in Sect. 8.2.4, we consider the whole memory to be writable. To resolve

that problem we add a software condition, stating that the processors never
issue write and CAS requests to addresses smaller than or equal to 0^{29−r}1^r:

    ca(i).preq ∧ ca(i).pa[28 : r] = 0^{29−r} → ca(i).bw = 0^8 .

ca → p Interface

Signals from cache to processor:


• ca(i).mbusy ∈ B – memory system is busy (generated by control automa-
ton of the cache),
• ca(i).pdout ∈ B64 – data output to processor.

p ↔ ca Protocol

We need to define a protocol for interaction between a processor and its caches
(data and instruction cache). Communication between processor p and its
cache ca is done under the following rules:
• p starts a request by activating preq,
• ca in the same cycle acknowledges the request by raising (Mealy9 ) signal
mbusy (unless a one-cycle access is performed),
• ca finishes with lowering mbusy, and
• p disables preq in the next cycle.

⁹ Recall that a Mealy output of the control automaton is a function of the input
and the current state.

[Fig. 133 (The timing diagram for a k-cycle write access followed by two consecutive 1-cycle read accesses): waveforms of preq, mbusy, ptype, pa, pdin, and pdout over cycles t−1, t, ..., t+k]
The timing diagram for a k-cycle (write) cache access is depicted in Fig. 133.
Cycle t is the first cycle of an access iff

¬mbusy t−1 ∧ preq t .

Cycle t' ≥ t is the last cycle of an access iff

    ¬mbusy^{t'} ∧ preq^{t'} .

Observe that 1-cycle accesses are desirable and indeed possible (in case of
local reads, including negative CAS hits). Then signal mbusy is not raised
at all and the processor can immediately start a new request in cycle t + 1.
The timing diagram for two consecutive 1−cycle read accesses is depicted in
Fig. 133.
Once the processor request signal is raised, inputs from the processor must
be stable (stability in the digital sense is enough here; the processors never
access main memory directly) until the cache takes away the mbusy signal. In order to formalize
this condition we identify the cache input signals of cache ca(i) in cycle t as


    cain(i, t) = {pa, type, pbw, preq} ∪ { {pdin}           ca(i).pw^t
                                           {pdin, pcdin}    ca(i).pcas^t
                                           ∅                otherwise

and then require

    ca(i).preq^t ∧ ca(i).mbusy^t ∧ X ∈ cain(i, t) → ca(i).X^{t+1} = ca(i).X^t .
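The handshake rules can be checked on cycle traces. The following Python sketch is an illustration only: traces are 0/1 lists, cycle −1 is treated as not busy, and the helper names are assumptions of the sketch.

    def first_cycle(preq, mbusy, t):
        # Cycle t starts an access iff  not mbusy^{t-1} and preq^t.
        prev_busy = mbusy[t - 1] if t > 0 else 0
        return (not prev_busy) and bool(preq[t])

    def last_cycle(preq, mbusy, t):
        # Cycle t ends an access iff  not mbusy^t and preq^t.
        return (not mbusy[t]) and bool(preq[t])

    def inputs_stable(preq, mbusy, inputs):
        # While preq and mbusy are both high, the request inputs (pa, type, pbw,
        # pdin, pcdin, ...) must not change from cycle t to t+1.
        for t in range(len(preq) - 1):
            if preq[t] and mbusy[t] and inputs[t + 1] != inputs[t]:
                return False
        return True

    # Example: a 3-cycle write (cycles 0..2) followed by a 1-cycle read (cycle 3).
    preq  = [1, 1, 1, 1, 0]
    mbusy = [1, 1, 0, 0, 0]
    assert first_cycle(preq, mbusy, 0) and last_cycle(preq, mbusy, 2)
    assert first_cycle(preq, mbusy, 3) and last_cycle(preq, mbusy, 3)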

Memory Bus

The memory bus b is subdivided into 4 sets of bus lines. The first three sets are
already known from the main memory specification in Sect. 3.5 and contain
regular tristate lines. The corresponding outputs are connected to these lines
via the tristate drivers. The fourth set of lines supports the cache protocol
and is an open collector bus:
• b.d ∈ B64 – for transmitting data contained in a cache line,
• b.ad ∈ B29 – memory line address,
• b.mmreq ∈ B, b.mmw ∈ B, b.mmack ∈ B – memory protocol lines,
• b.prot ∈ B5 – cache protocol lines.

ca ↔ b Interface

The following dedicated registers are used between cache ca(i) and bus b:
• ca(i).bdout ∈ B64 – cache data output to the bus,

• ca(i).bdin ∈ B64 – cache data input from the bus,


• ca(i).badout ∈ B29 – (master) line address output to the bus,
• ca(i).badin ∈ B29 – (slave) line address input from the bus,
• ca(i).mmreq ∈ B – cache request to the main memory (a set-clear flip-
flop),
• ca(i).mmw ∈ B – cache write signal to the main memory (a set-clear
flip-flop),
• ca(i).mprotout ∈ B3 , ca(i).sprotout ∈ B2 – master and slave protocol data
output to the bus,
• ca(i).mprotin ∈ B3 , ca(i).sprotin ∈ B2 – slave and master protocol data
input from the bus.
For the signal b.mmack we do not introduce a dedicated input register and
sample this signal directly from the bus.
For the tristate lines of the memory bus we use the control logic devel-
oped in Sect. 3.5. We use set-clear flip-flops for the generation of the output
enable signals Xoe with X ∈ {mmreq, mmw, bdout, badout}. The ownership
of the tristate lines of the memory bus is controlled by a fair master arbiter
(Sect. 8.4.5). To run a transaction on the bus, cache i raises a request signal

ca(i).req

to the arbiter and waits until the arbiter activates signal grant[i]. Construction
of the arbiter makes sure that at most one grant signal is active at a time.
Register ca(i).req is implemented as a set-clear flip-flop. Control signals for
all set-clear flip-flops are generated by the control automata of the cache.
As shown in Fig. 134, the cache protocol signals are inverted before they
are put on the open collector bus and before they are clocked from the bus into
a register.
[Fig. 134 (Using de Morgan's law to compute the OR of active high signals on the open collector bus b.prot): the master outputs ca(i).mprotout (Ca, im, bc) and slave outputs ca(i).sprotout (ch, di) drive the 5 lines of b.prot through inverting open collector (OC) drivers and are read back inverted into ca(i).mprotin and ca(i).sprotin]
Thus, by de Morgan's law we have for every component x ∈ [0 : 4]:

    ¬b.prot[x] = ¬(⋀_j ¬ca(j).bprotout[x])
               = ⋁_j ca(j).bprotout[x] .
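The effect of the open collector bus is easy to mimic in software: every cache drives the inverted signal, the wire behaves as the AND of all driven values, and inverting the wire once more yields the OR of the active-high outputs. A tiny Python sketch (names and encoding are illustrative):

    def bus_prot(bprotout_per_cache):
        # bprotout_per_cache: one 5-bit list per cache, active high.
        width = len(bprotout_per_cache[0])
        # wired AND of the inverted driven values
        wire = [all(1 - c[x] for c in bprotout_per_cache) for x in range(width)]
        # inverting the wire again gives the OR of the original signals
        return [1 - wire[x] for x in range(width)]

    # Example: cache 0 raises bit 2, cache 1 raises bit 0 -> the readers see the OR.
    assert bus_prot([[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]) == [1, 0, 1, 0, 0]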

The following synonyms for the protocol signals are used:

b.mprot.Ca = b.mprot[2] = ¬b.prot[4]


b.mprot.im = b.mprot[1] = ¬b.prot[3]
b.mprot.bc = b.mprot[0] = ¬b.prot[2]
b.sprot.ch = b.sprot[1] = ¬b.prot[1]
b.sprot.di = b.sprot[0] = ¬b.prot[0] .

When several slaves signal a data intervention, further bus arbitration appears
to be necessary, since only one cache should access the bus at a time. However,

arbitration is not necessary as long as only one slave will forward the required
cache line. This is guaranteed by the cache coherency protocol, where we
do not raise di in case of a miss on data in state S. However, the protocol
provides that all caches keep the same data when it is shared, so that we
could in principle forward the data if we arbitrate the data intervention. A
possible arbitration algorithm for data intervention in a “shared clean miss”
case would be to select ca(i) with the smallest i, s.t., di is active. This can be
efficiently implemented using a parallel prefix OR circuit.

8.4.2 Data Paths of Caches

The data paths for the data RAM, state RAM, and tag RAM are presented
in Figs. 135, 137, 136 respectively.
The control signals for the data paths are generated by the control au-
tomata described in the following section. Let us try to get a first under-
standing of the designs.
In general RAMs are controlled from two sides: i) from the processor
side using port a and ii) from the bus side using port b. Auxiliary registers
ca(i).dataouta′, ca(i).tagouta′, ca(i).souta′, and ca(i).soutb′ latch the outputs
of the RAMs. We have introduced these registers in order to prevent situations
where reads and writes to the same address of a RAM are performed
during the same cycle.¹¹ Note that such situations are not problematic in our
hardware model, because our RAMs are edge triggered. However, in many
technologies RAMs are pulse triggered, and then reads and writes to the
same address in the same cycle lead to undefined behaviour. With the auxiliary
registers the design of this book is easily implementable in far more
technologies, in particular in the FPGA based implementation from [8]. For
the purpose of the correctness proof we show in Lemma 8.24 that they always
have the same data as the current output of the edge triggered RAM. Hence,
in the remainder of the book we simplify our reasoning about the data paths
by simply replacing these registers with wires.

¹¹ Our construction guarantees that we never perform accesses to different ports
of the same RAM with the same cache line address in a single cycle. As a result,
reads and writes to the same cache line address in a single cycle can only occur
through the same port (our RAM construction does allow reading and writing the
same port in a single cycle). Outputs of port b of the data and tag RAMs are never
used in cycles when these ports are being written. Hence, we do not introduce
auxiliary registers for them.

Data Paths of the Data RAM

The data paths in Fig. 135 support the following operations:


• Local read accesses. This includes read hits and negative CAS hits. The
processor addresses port a of the RAM with pa. The hit is signaled by a
processor hit signal phit. This signal is produced by the data paths for the
tag RAM as shown in Fig. 136. Data RAM output outa is routed to the
data output pdout at the processor side.
• Local write accesses. This includes write hits to an exclusive state and
positive CAS hits to an exclusive state. It requires two cycles which to-
gether perform an operation. The cache line addressed by pa is read out
and temporarily stored in register dataouta . From there it becomes an
input to a modify circuit which computes the modify function

byte(i, x) bw[i] = 1
byte(i, modify(x, y, bw)) =
byte(i, y) bw[i] = 0 .

Simple construction of a modify circuit is given in Sect. 4.2.2. For any kind
of write, data to be written y and byte write signals bw come from the
processor. The result is written to the data RAM via port a.
• Flushes. Except for times when the cache is filling up, a cache miss access
is generally preceded by a flush: a so called victim line with some eviction
line address va is evicted from the cache in order to make space for the
missing line. In a direct mapped cache the eviction address has cache line
address
va.c = pa.c .
The victim line is taken from output outa of the data RAM and put on
the bus via register bdout.


[Fig. 135 (Data paths for the data RAM of a cache): port a is addressed by pa.c and port b by badin.c; inputs come from the processor (pdin, bw), the bus (bdin), and the modify circuit; outputs feed the auxiliary register dataouta′, the 32-eq CAS test, the processor output pdout, and the bus output register bdout]

• Global write accesses. This includes write misses, positive CAS misses,
write hits to shared data, and positive CAS hits to shared data. In case
of a cache miss, the missing line is clocked from the bus into register
bdin. From there it becomes an input to the modifier. The output of the
modifier is written back to the data RAM at port a. In case of a hit to
shared data, the cache line is fetched from port a and stored temporarily
in register dataouta . After that it goes to the modify circuit; the output
of the modifier is written to the RAM via port a and is broadcast on the
bus via register bdout.
• Global read accesses. This includes read misses and negative CAS misses.
The missing line is clocked from the bus into register bdin. The modifier
with byte write signals bw = 08 is used as a data path for the missing
cache line. It is output to the processor via signal pdout and written into
the data RAM via input ina.
• Data intervention. The line address is clocked from the bus into register
badin. The intervention line missing in some other cache is taken from
output outb of the data RAM and put on the bus via register bdout.
Signal test is used to denote the result of the CAS test both for local and
global accesses:


    test ≡ pcdin = { data(pa.c)[63 : 32]    ¬bw[0] ∧ phit
                     data(pa.c)[31 : 0]     bw[0] ∧ phit
                     bdin[63 : 32]          ¬bw[0] ∧ ¬phit
                     bdin[31 : 0]           bw[0] ∧ ¬phit .

Data Paths of the Tag RAM

The tag RAM is very much wired like a tag RAM in an ordinary direct mapped
cache. It is addressed from the processor side by signal pa and from the bus
side by register badin . We operate the data paths for the tag RAM under the
following rules:
• New tags are only written into the tag RAM from the processor side.
• Hit signal phit for the processor side is computed from outputs souta and
tagouta of the state and tag RAMs respectively, or from the outputs of the
auxiliary registers souta′ and tagouta′, depending on whether port a of these
RAMs is written in this cycle or not. Lemma 8.24 allows us to treat these
auxiliary registers simply as wires, which results in the desired definition
of the phit signal:

    phit ≡ pa.t = tag(pa.c) ∧ ¬s(pa.c).I .

Signal bhit for the bus side is computed using outputs soutb and tagoutb
directly, because signal bhit is never used in cycles when state and tag
RAMs are written through port b:
[Fig. 136 (Data paths for the tag RAM of a cache): port a is addressed by pa.c with input pa.t, port b by badin.c; (τ+1)-bit equality tests on tag and state outputs produce phit and bhit, and the tag output also feeds the bus address register badout]

bhit ≡ badin .t = tag(badin.c) ∧ ¬s(badin .c).I .

• For global accesses, the processor address can be put on the bus via register
badout.
• For flushes, the tag of the victim address is taken from output outa of the
tag RAM. The victim line address is then

    va = ca(i).tagouta′ ◦ pa.c .

It is put on the bus via register badout.
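The following Python sketch summarizes the direct mapped addressing used by the tag RAM: a 29-bit line address splits into a tag and a cache line index, phit combines a tag match with a non-invalid state, and the victim line address is the stored tag concatenated with the index. The index width and the dictionary-based RAM model are assumptions made for illustration; bhit is computed in exactly the same way from badin.

    C_BITS = 8                              # assumed number of cache line index bits

    def split(pa):
        # Return (pa.t, pa.c) for a 29-bit line address pa.
        return pa >> C_BITS, pa & ((1 << C_BITS) - 1)

    def phit(pa, tag_ram, state_ram):
        # phit = tag match and non-invalid state (states modelled as 'M','O','E','S','I').
        t, c = split(pa)
        return tag_ram.get(c) == t and state_ram.get(c, 'I') != 'I'

    def victim_address(pa, tag_ram):
        # va = tag(pa.c) concatenated with pa.c: the line currently occupying the slot.
        _, c = split(pa)
        return (tag_ram[c] << C_BITS) | c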

Data Paths of the State RAM

As before, addressing from the processor side is by signal pa and from the bus
side by register badin. Some control signals come from the control automata
and are explained in Sect. 8.4.3. The data paths of the state RAM use the
circuits C1, C2, and C3 from Sect. 8.3.3 which compute the memory protocol
signals and the next state of cache lines. We operate the data paths for the
state RAM under the following rules:
• The current master state is read from output outa of the state RAM.
• The new state ps is computed by circuit C3 and is written back to input
ina of the state RAM. As one of the inputs to C3 we provide the type
of the access. This type depends not only on the processor request, but
also on the current state of the master automaton (Sect. 8.4.3): if the
automaton is in state f lush then we calculate the new state for a flush
access (which is I independent of other inputs of C3); there is also a case
when we perform a flush access while we are still in wait without going
to state f lush – this corresponds to an invalidation of a clean line without
writing it back to memory. For more explanations refer to the description
of states wait and f lush in Sect. 8.4.4.
• For global accesses, the master protocol signals are computed by circuit C1
and put on the bus via register mprotout. The mux on top of C1 provides
the invalid state in case if we don’t have a processor hit. The mux on top
of register mprotout is used to clear the master protocol signals after a
run of the protocol.
• If the cache works as a slave, it determines the slave response with circuit
C2 using the state from output outb of the state RAM and puts it on
the bus via register sprotout. The mux on top of circuit C2 forwards the
effect of local writes whose line address conflicts with the line address of
the current global access (so that we don’t have to wait 1 cycle until the
modified state is written to the RAM in case of a local write). The mux
on top of register sprotout is used to clear the slave response after a run
of the protocol.
• The new state of a slave ss is computed by circuit C2 and is written back
to input inb of the state RAM.
[Fig. 137 (Data paths for the state RAM of a cache): port a is addressed by pa.c, port b by badin.c; circuit C1 computes the master protocol signals (phase 1) from the current state (forced to I if ¬phit) and the access type pr/pw/pcas, circuit C2 computes the slave response and the new slave state ss from outb and the master protocol received via b.mprot, and circuit C3 computes the new master state ps (phase 2) using the slave response ch; multiplexers clear mprotout/sprotout after a run of the protocol and forward locally written states on address conflicts]


[Fig. 138 (Master automaton): states idle, localw, wait, flush, m0, m1, m2, m3, mdata, w with transitions (1)-(6) and (9) as described in Sect. 8.4.4; states m0 through w form the hot phase]
Fig. 138. Master automaton

8.4.3 Cache Protocol Automata

We define state automata for the master and the slave case in order to imple-
ment the cache coherency protocol. In general the protocol is divided into 3
phases:
• Master phase 1: Ca, im, bc are computed and put on the bus.
• Slave phase: slave responds by computing and sending ch, di, generating
new slave state ss .
• Master phase 2: master computes new state ps .
For local accesses, only the last step of the protocol is performed (master
phase 2).
The state diagrams for the master and slave automata are presented in
Figs. 138 and 139.

Automata States

We define sets of master and slave automata states as

    M = {idle, localw, wait, flush, m0, m1, m2, m3, mdata, w}
    S = {sidle, sidle′, s1, s2, s3, sdata, sw} .

The sets of states belonging to local and global transactions are defined as

L = {idle, localw}
G = M \L.
[Fig. 139 (Slave automaton): states sidle, sidle′, s1, s2, s3, sdata, sw with transitions (7) and (8) as described in Sect. 8.4.4]
Fig. 139. Slave automaton

We also define sets of states that correspond to warm and hot phases of global
transactions:

    W = G \ {wait}
    H = W \ {flush} .
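Before turning to the overview in Table 12, here is a direct Python transcription of these state sets, convenient for quick sanity checks; the primed state sidle′ is written sidle_p since identifiers cannot carry primes.

    M = {'idle', 'localw', 'wait', 'flush', 'm0', 'm1', 'm2', 'm3', 'mdata', 'w'}
    S = {'sidle', 'sidle_p', 's1', 's2', 's3', 'sdata', 'sw'}

    L = {'idle', 'localw'}      # local transactions
    G = M - L                   # global transactions
    W = G - {'wait'}            # warm phase
    H = W - {'flush'}           # hot phase

    def in_phase(z, A):
        # A(i)^t for a master automaton state z and a set A of states
        return z in A

    assert in_phase('m0', H) and not in_phase('wait', W) and in_phase('flush', W)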

The overview on the states is given in Table 12.


We denote by z(i) and zs(i) the state of a master or a slave automaton i
respectively.
For states x ∈ M, we mean by x(i)^t the statement that master automaton i
is in state x during cycle t. Similarly, for x ∈ S we mean by x(i)^t the statement
that slave automaton i is in state x during cycle t:

    x(i)^t ≡ { z(i)^t = x     x ∈ M
               zs(i)^t = x    x ∈ S .

We use the following notation for sets of states A ∈ {M, S, L, G, W, H}:

    A(i)^t ≡ { z(i)^t ∈ A     A ≠ S
               zs(i)^t ∈ A    A = S

    A(i)^{[t:t']} ≡ ∀q ∈ [t : t'] : A(i)^q .

Statements without index t are implicitly quantified for all cycles t. For tran-
sitions numbered with (n) in Figs. 138 and 139, we mean with (n)(i)t that
the condition holds for the automata of cache i in cycle t.
When talking about automata states and transitions we often omit explic-
itly specifying index i in case when it is clear from the context or when the
statement is implicitly quantified for all cache indices i.
Table 12. An overview on the automata states

 #  master   intended work                                   slave    intended work
    state                                                    state
 0  idle     local read accesses (unless colliding with a    sidle    snooping on bus
             global transaction on the bus)
 1  localw   local write accesses (unless collision)         -        -
 2  wait     wait for the arbiter to grant bus access,       -        -
             compute Ca, im, bc
 3  flush    write back dirty line to mm,                    -        -
             compute Ca, im, bc
 4  m0       transmit Ca, im, bc via b.mprot                 -        -
 5  m1       wait for slave response                         s1       check for bhit, compute slave
                                                                      response ch, di
 6  m2       wait for slave response                         s2       transmit response ch, di on b.sprot
 6  -        -                                               sidle′   wait until Ca is lowered on the bus
                                                                      (in sidle a new transaction would start)
 7  m3       analyze slave signals, prepare memory access    s3       prepare data for data intervention
             or data broadcast (if necessary)                         (if necessary)
 8  mdata    wait for memory response (if necessary),        sdata    transmit data on the bus or read data
             read data from the bus (if necessary)                    from the bus (if necessary)
 9  w        write data, tag, s                              sw       write data, s (if necessary)

8.4.4 Automata Transitions and Control Signals


Before we consider the transition and control signals of the master and slave
automata we first introduce some auxiliary signals.
Local reads are identified by signal rlocal:
rlocal = phit ∧ (pr ∨ pcas ∧ ¬test) .
Local writes are indicated by signal wlocal:
wlocal = phit ∧ s(pa.c) ∈ {E, M } ∧ (pw ∨ pcas ∧ test) .
A local computation is performed if either rlocal or wlocal is active:
local = rlocal ∨ wlocal .
If a processor issues a request for data which are currently being processed in
a global transaction, handling the request locally is not possible. In this case
the signal snoopconflict is raised:

    snoopconflict ≡ ¬sidle ∧ pa.c = badin.c .

Note that a snoop conflict is discovered one cycle after the address is actually
on the bus (we have to clock data from the bus to register badin first).
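A minimal Python transcription of these auxiliary signals may help; the parameter names are illustrative, test stands for the result of the CAS test, and sidle indicates whether the slave automaton of the cache is idle.

    EXCLUSIVE = {'E', 'M'}

    def rlocal(phit, pr, pcas, test):
        return phit and (pr or (pcas and not test))

    def wlocal(phit, state, pw, pcas, test):
        return phit and state in EXCLUSIVE and (pw or (pcas and test))

    def local(phit, state, pr, pw, pcas, test):
        return rlocal(phit, pr, pcas, test) or wlocal(phit, state, pw, pcas, test)

    def snoopconflict(sidle, pa_c, badin_c):
        # Raised when the requested line collides with the line of the global
        # transaction currently snooped on the bus.
        return (not sidle) and pa_c == badin_c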
A crucial decision on whether to handle an access locally or globally is
performed when the master automaton is in state idle and a processor request
is active. For decisions to go local, the master additionally has to ensure that
no snoop conflict is raised. In case of a global CAS hit we have to reassure
that the decision to go global is still correct in the last cycle of stage wait.
With these prerequisites at hand, we now continue with the actual state
transitions and generated control signals of the automata, starting with state
idle. For every state we write the generated signals in the form

signal name := condition,

meaning that this signal is raised in the given state if the condition is satisfied.
If we omit stating the condition, then the signal is always high in a given state.

State idle

In the idle state the signal mbusy is deactivated if either a local read is
performed (which can be finished in one cycle) or there is no processor request
at all:

¬mbusy := ¬preq ∨ (rlocal ∧ ¬snoopconf lict) .

Note that mbusy is a Mealy signal and thus does not need to be precomputed
before the clock edges. There are three possible transitions starting from state
idle.
1. Transition (1): idle → idle.
This transition is taken if there is a snoop conflict or if we have a local
read or there is no request from the processor to its cache at all:

(1) = ¬preq ∨ snoopconf lict ∨ rlocal .

2. Transition (2): idle → localw.


This transition is taken if there is a local write request and no global
transaction currently accesses the respective data (no snoop conflict):

(2) = preq ∧ ¬snoopconf lict ∧ wlocal .

3. Transition (3): idle → wait.


This transition is taken if the processor request is not local:

(3) = preq ∧ ¬local .



In case we go to the localw state we clock registers ca(i).dataouta′ and
ca(i).souta′ which will be used in the next cycle in place of the outputs of the
data and state RAMs:

    ca(i).dataouta′ce := (2)
    ca(i).souta′ce := (2) .

With the transition into state wait we activate signal ca(i).reqset to issue a
request for the bus to the arbiter (Sect. 8.4.5). Additionally, we clock register
ca(i).souta′ which might be used in the next cycle in place of the output of
the state RAM:

    ca(i).souta′ce := (3)
    ca(i).reqset := (3) .

In idle we also transmit the content of ca(i).data(ca(i).pa.c) via ca(i).pdout


back to the processor which waits for ¬mbusy.

State localw

In state localw the master changes its state ps for the given cache line from
E to M , see Fig. 137. The activated signals are

ca(i).swa
ca(i).datawa
¬ca(i).mbusy .

Signal swa is used in the state RAM (Fig. 137) and signal datawa is used in
the data RAM (Fig. 135). In state localw we always go back directly to idle
in the next cycle.

State wait

In state wait, the processor waits for its request to be granted by the bus
arbiter (Sect. 8.4.5).
In the last cycle of wait we have to repeat the test for global CAS hits. In
case we went from idle to wait under condition

phit ∧ s(pa.c) ∈ {S, O} ∧ pcas ∧ test,

it may happen that during the time when we are waiting for the bus our
slave automaton updates the cache data or the cache state for address pa.c.
An update of the state is not a problem, because from S and O the line can
go only to S, O, or I, which means that we still need to perform a global
transaction. Yet, if the data RAM has been updated, the outcome of the local

test might not be the same anymore. In this case we should not start a global
transaction at all; instead, we should return to idle. We call such an access
delayed local.
There are four transitions starting from state wait:
1. Transition: wait → wait.
While ¬grant[i], the automaton stays in wait.
2. Transition (5): wait → f lush.
When the request is granted, but the cache line is occupied and dirty, the
automaton goes to state f lush:

    (5) = grant[i] ∧ ¬phit ∧ ca(i).souta′ ∈ {O, M} .

We use the output of the auxiliary register souta instead of the output of
port a of the state RAM here because signal (5) is used below to generate
the write enable signal to port a of the state RAM. By Lemma 8.24
(auxiliary registers) we always have:

    wait(i) ∧ grant[i] → ca(i).souta′ = ca(i).s(ca(i).pa.c) .

3. Transition (9): wait → idle.


If the conditions for a local read are satisfied by a CAS access, we go to
state idle and perform a delayed local access:

(9) = grant[i] ∧ phit ∧ pcas ∧ ¬test .

4. Transition (4): wait → m0.


If the request is granted, there is no cache line to be evicted, and the con-
dition (9) is not satisfied, then we go to m0 and start a global transaction:

(4) = grant[i] ∧ ¬(5) ∧ ¬(9) .

The following signals are set in state wait:

ca(i).bdoutce := (5)
ca(i).badoutce := (5) ∨ (4)
ca(i).mmreqset := (5)
ca(i).mmwset := (5)
ca(i).bdoutoeset := (5)
ca(i).badoutoeset := (5) ∨ (4)
ca(i).mmreqoeset := (5)
ca(i).mmwoeset := (5)
ca(i).mprotoutce := (4)
ca(i).reqclr := (9)
¬ca(i).mbusy := (9)
8.4 Gate Level Design of a Shared Memory System 253

    ca(i).swa := (4) ∧ ¬phit ∧ ca(i).souta′ ∈ {E, S}
    ca(i).souta′ce := ¬((5) ∨ (4) ∨ (9)) .

Signal reqclr is used to clear the request to the bus arbiter (Sect. 8.4.5) in
case of (9). Note that in case of (4) we have to load the master protocol data
for transmission on the bus.
There is a case when the cache line is occupied by another line (i.e., the
current tag of the cache line does not match to the tag of the processor
address) but we still go to state m0. This happens when the line is clean;
hence, it can be evicted without writing it back to the memory. In this case
we write I to the state RAM (as guaranteed by the output of the circuit C3 in
Fig. 137). This write is not necessary for the correctness of implementation:
we could simply “evict” this line later in state w, by overwriting the current
tag in the tag RAM with the tag of the processor address (such a write would
make the overwritten cache line invalid in the abstract cache). In the proof,
however, that would force us to simulate two accesses simultaneously: a global
access for the line addressed by pa and a flush access for the evicted line. To
avoid this confusion and to (greatly) simplify the proofs, we prefer to do this
invalidation explicitly by writing I to the state RAM on a transition from
wait to m0. Note, that for the generation of signal swa we use the output of
the auxiliary register souta instead of the output of port a of the state RAM.
By Lemma 8.24 (auxiliary registers) we have in this case:

ca(i).souta = ca(i).s(ca(i).pa.c) .
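Putting conditions (5), (9), and (4) together, the exit decision taken in state wait can be sketched as follows (souta stands for the primed auxiliary register; this is an illustration, not the control logic itself):

    def wait_next_state(grant, phit, souta, pcas, test):
        if not grant:
            return 'wait'                        # keep waiting for the arbiter
        if (not phit) and souta in {'O', 'M'}:   # (5): occupied dirty line -> write it back
            return 'flush'
        if phit and pcas and not test:           # (9): delayed local (negative CAS) -> back to idle
            return 'idle'
        return 'm0'                              # (4): start the global transaction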

State flush

In state flush, we write the cache line that needs to be evicted to memory.
The following signals are set:

ca(i).bdoutoeclr := (6)
ca(i).mmreqoeclr := (6)
ca(i).mmreqclr := (6)
ca(i).mmwoeclr := (6)
ca(i).mmwclr := (6)
ca(i).badoutce := (6)
ca(i).mprotoutce := (6)
ca(i).swa := (6) .

When we leave the state flush we have to load master data for transmission
on the bus.¹² After the flush is done, we write I to the state RAM. Just
as the same kind of write performed in state wait, this write is not strictly
necessary for the correctness of the implementation. Yet it makes the proofs much
easier. In particular, in Lemma 8.64 we can simulate the 1-step flush access
immediately when the master automaton leaves state flush, and do not
have to wait until the tag of the evicted line gets overwritten in state w.

¹² The multiplexer controlled by signal phit in Fig. 137 makes sure that we forward
the invalid state as an input to circuit C1. Note that for the computation of
signal phit in this case we use the output of port a of the state RAM, rather
than the output of the auxiliary register souta′, even though port a of the state
RAM is written in the same cycle. The reason why this behaviour is acceptable
here is simple: we went to state flush because the tag of the line in the cache did
not match the tag of the requested line. In state flush the tag RAM is not
updated. Hence, the requested line stays invalid in the abstract cache and the
output of the state RAM is simply ignored in the computation of signal phit.
There are two transitions starting from state f lush.
1. Transition: flush → flush.
While ¬b.mmack, we stay in flush since the memory is still busy.
2. Transition (6): flush → m0.
When the b.mmack signal gets active, the memory access is finished and
the automaton proceeds to state m0:

(6) = b.mmack .

States m0 and sidle

During the m0 phase (1 cycle), master protocol data are transmitted on the
bus. In the next cycle, master automaton always advances to state m1.
The following transitions in the slave automaton start from state sidle.
1. Transition (7): sidle → s1.
Slave i leaves the sidle state iff some master j is in state m0 transmitting
signal Ca and j = i:

(7) = b.mprot.Ca ∧ ¬grant[i] .

2. Transition : sidle → sidle.


The slave stays in sidle if signal Ca is not active on the bus or if its master
automaton got control of the bus.
There are no signals raised in the master in state m0. The following signals
are raised in the slave in state sidle:

ca(i).mprotince := (7)
ca(i).badince := (7) .


States m1 and s1

During these states (1 cycle), the master does nothing and the slave computes
response signals. If the slave doesn't have the requested data it goes to state
sidle′, where it waits until signal Ca is removed from the bus. The snoop
conflict starts to be visible in this phase.
Two transitions are starting in state s1.
1. Transition (8): s1 → sidle′.
If the slave does not have an active bhit signal, then it goes to state sidle′:
(8) = ¬bhit .

2. Transition: s1 → s2.
If the slave doesn't go to sidle′, then it goes to state s2.
The following signal is raised in the slave:

ca(i).sprotoutce := bhit .

State sidle′

The slave waits until Ca is removed from the bus, then moves back to sidle.

States m2 and s2

During these states (1 cycle), the slave response signals are transmitted on
the bus. The following signal is raised in the master:

ca(i).sprotince .

States m3 and s3

Recall, that in a global transaction the master either has to read the data
from the memory or from another cache (in case of a cache miss) or has to
broadcast the data to other caches (in case of a cache hit in a shared or owned
state). In state m3 (1 cycle), the master makes a decision whether a memory
access must be performed in the mdata phase. This depends on whether di
was active on the bus during stage m2. In case of a cache hit the master
prepares the data for broadcasting. The following signals are raised in the
master:

ca(i).mmreqset := ¬ca(i).mprotout.bc ∧ ¬ca(i).sprotin.di


ca(i).mmreqoeset := ¬ca(i).mprotout.bc ∧ ¬ca(i).sprotin.di
ca(i).bdoutce := ca(i).mprotout.bc
ca(i).bdoutoeset := ca(i).mprotout.bc .

The following signals are raised in the slave (preparing the data intervention
if necessary):

ca(i).bdoutce := ca(i).sprotout.di
ca(i).bdoutoeset := ca(i).sprotout.di .

States mdata and sdata

During this phase the master performs a memory access if necessary and
reads the data from the bus. The data are either provided by a slave or are
provided by the main memory. In case the data are provided by a slave, the
mdata and sdata phases only last for 1 cycle. If the data are provided by the
main memory, the automata stay in states mdata and sdata as long as there
is an active memory request:

¬b.mmack ∧ b.mmreq.

Leaving this state, the master clears control signals. The following signals are
raised in the master:

ca(i).bdince := b.mmreq ∧ b.mmack ∨ ca(i).sprotin.di


ca(i).mmreqclr := b.mmreq ∧ b.mmack
ca(i).mmreqoeclr := b.mmreq ∧ b.mmack
ca(i).badoutoeclr := b.mmack ∨ ¬b.mmreq
ca(i).bdoutoeclr := b.mmack ∨ ¬b.mmreq
ca(i).dataouta ce := b.mmack ∨ ¬b.mmreq
ca(i).souta ce := b.mmack ∨ ¬b.mmreq
ca(i).tagouta ce := b.mmack ∨ ¬b.mmreq
ca(i).mprotoutce := b.mmack ∨ ¬b.mmreq
ca(i).mprotz := b.mmack ∨ ¬b.mmreq .

The slave has to clock the broadcast data or clear the data output enable
signal. When leaving state sdata, the b output of the state RAM is clocked
into register ca(i).soutb , which will be used in the next cycle:

ca(i).bdince := ca(i).mprotin.bc
ca(i).bdoutoeclr := ca(i).sprotout.di
ca(i).soutb ce := b.mmack ∨ ¬b.mmreq .

States w and sw

During this phase (1 cycle) the master and the slave write the results of the
transaction into their RAMs (data, tag, and state). The following signals are
raised in the master:

ca(i).datawa
ca(i).tagwa
ca(i).swa
ca(i).reqclr
¬ca(i).mbusy .
The following signals are raised in the slave:
ca(i).datawb := ca(i).mprotin.bc
ca(i).swb
ca(i).sprotoutce
ca(i).sprotz .

8.4.5 Bus Arbiter


We need bus arbitration for masters trying to get control of the bus for ac-
cessing the main memory or communicating with other caches. We could also
do arbitration of slave di signals when they all are in a shared (S) state. Cur-
rently we assume that no di signals are raised in this case and the master
reads the main memory.

Master Arbitration
In case of master arbitration we have to ensure fairness. Fairness means that
every request to access the bus is finally granted. The arbiter collects requests
req(i) from the caches and chooses exactly one cache that will get the per-
mission to run on the bus. The winner is identified by the active grant[i]
signal.
The implementation of a fair arbiter is presented in Fig. 140.
[Fig. 140 (Arbiter for masters): the 2p-bit request vector req and the current grant register feed the nextgrant circuit, whose 2p-bit output is clocked into the grant register]
For the computation of nextgrant, we use circuit f1, which finds the first 1 in a
bit-string starting from the smallest index:

    f1(X)[i] = 1 ↔ min{j | X[j] = 1} = i .

Implementation of the f1 circuit is done in the following way.

1. Apply parallel prefix OR to input X:

    Y[i] = { X[0]              i = 0
             X[i] ∨ Y[i − 1]   i ≠ 0 .

2. Compute the result as follows:

    f1(X)[i] = { Y[0]               i = 0
                 Y[i] ∧ ¬Y[i − 1]   i ≠ 0 .

Implementation of the nextgrant circuit uses another instance of the parallel
prefix OR and circuit f1:

1. We apply parallel prefix OR to input grant:

    X[i] = { grant[0]               i = 0
             X[i − 1] ∨ grant[i]    i ≠ 0 .

2. We compute the conjunction Y, where

    Y[i] = X[i] ∧ req[i] .

3. We apply function f1 constructed above to compute nextgrant:

    nextgrant = { f1(Y)      ⋁_i Y[i]
                  f1(req)    otherwise .
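The following Python sketch mirrors the f1 and nextgrant constructions on bit lists indexed from 0 (the smallest index). It only illustrates the function computed by the circuits, not their gate-level structure or timing.

    def f1(x):
        # f1(x)[i] = 1 exactly for the smallest i with x[i] = 1.
        y, seen = [], False
        for bit in x:                       # serialised parallel prefix OR
            seen = seen or bool(bit)
            y.append(seen)
        out, prev = [], False
        for yi in y:
            out.append(1 if yi and not prev else 0)
            prev = yi
        return out

    def nextgrant(grant, req):
        # First requester at an index >= the current grant owner,
        # wrapping around to f1(req) if there is none.
        x, seen = [], False
        for g in grant:                     # prefix OR of grant
            seen = seen or bool(g)
            x.append(seen)
        y = [int(a and b) for a, b in zip(x, req)]
        return f1(y) if any(y) else f1(req)

    # Example: cache 1 holds the grant, caches 0 and 3 request.
    assert nextgrant([0, 1, 0, 0], [1, 0, 0, 1]) == [0, 0, 0, 1]   # cache 3 is served next
    assert nextgrant([0, 0, 0, 1], [1, 0, 0, 0]) == [1, 0, 0, 0]   # then the arbiter wraps around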
The computation of nextgrant is illustrated in Fig. 141.
[Fig. 141 (Computation of signal nextgrant): grant contains a single 1, X is the parallel prefix OR of grant, Y = X ∧ req masks out the requests below the granted index, and nextgrant = f1(Y) marks the first requester at or above it]


We clock the grant register in every cycle in which there is an active request:

    grantce = ⋁_i req(i) .

During the initialization phase we set grant[0] to 1 and all other grant signals
to 0. This guarantees that we always have an active grant signal, even if we
don’t have any active requests.
Note that if cache i gets a permission to run the bus it will maintain
this permission until it lowers its req signal (it will always be the winner in
the f 1 circuit). A cache may get access to the bus in two consecutive memory
accesses, however, only if there are no waiting requests from other caches. The
master lowers the req signal when it leaves stage w. Thus, when the master
returns to idle a new set of grant signals is computed and another cache may
start its bus access in the next cycle.

Fairness of the Master Arbiter

Our construction of the master arbiter guarantees that every request to access
the bus is finally granted if the following conditions are satisfied:
1. Without grant, request stays stable:

    req(i)^t ∧ ¬grant[i]^t → req(i)^{t+1} .

2. Every granted request is eventually taken away:

    grant[i]^t → ∃t' ≥ t : ¬req(i)^{t'} .

The first condition is true, since in state wait signal req(i) stays active and
we do not leave the state before grant[i] holds. The second condition holds
due to system liveness: being in the warm phase, the master automaton will,
eventually, always reach state idle and lower its request signal.
Lemma 8.13 (arbiter fairness).

    req(i)^t → ∃t' ≥ t : grant[i]^{t'}

Proof. We only give a sketch of the proof here. We show that the distance
between the index of the current master and the index of any other requesting
cache i strictly decreases with each arbitration. Let one(X) be defined as

    one(X) = i ↔ X[i] = 1

(well-defined only if string X contains exactly one bit that is set).

Then,

    one(nextgrant) = { min{j ≥ one(grant) | req[j] = 1}    such j exists
                       min{j | req[j]}                      otherwise .

We define the distance measure M:

    M(i, t) = { i − one(grant^t)         i ≥ one(grant^t)
                i − one(grant^t) + 2p    otherwise
             = (i − one(grant^t)) mod 2p .

By induction one can show that M (i, t) is decreasing with every new arbitra-
tion (i.e., when the grant register is clocked with the new value). 
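A quick numeric check of the measure argument on one example (not a proof): next_owner re-implements the effect of nextgrant on indices, and every owner drops its request before the next arbitration, as required by the fairness conditions above. The names and the parameter n (playing the role of 2p) are assumptions of the sketch.

    def next_owner(owner, req):
        # one(nextgrant): first requester at or after the current owner, wrapping around.
        candidates = [j for j in range(len(req)) if req[j]]
        at_or_after = [j for j in candidates if j >= owner]
        return min(at_or_after) if at_or_after else min(candidates)

    def M(i, owner, n):
        return (i - owner) % n              # the distance measure

    n, owner, req, i = 4, 1, [1, 0, 1, 1], 0
    while owner != i:
        new_owner = next_owner(owner, req)
        assert M(i, new_owner, n) < M(i, owner, n)   # the measure strictly decreases
        req[new_owner] = 0                           # the granted cache eventually lowers req
        owner = new_owner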


8.4.6 Initialization

We assume the reset signal to be active in cycle −1. The following signals are
activated during reset:
• Signals mprotz −1 , mprotoutce−1 , sprotz −1 , sprotoutce−1 . This ensures
that master and slave protocols are initialized correctly:

∀i : ca(i).mprotout0 = 000 ∧ ca(i).sprotout0 = 00 .

• Signal sinv −1 , which guarantees

∀i, x : ca(i).s(x)0 = I .

• Master and slave automata are initialized with idle states:

∀i : idle(i)0 ∧ sidle(i)0 .

• Signal reqclr−1 guarantees that caches don’t request the bus until they
get to the wait state.
• Signals bdoutoeclr−1 , badoutoeclr −1 and signals mmreqclr−1 , mmwclr−1 ,
mmwoeclr−1 , mmreqoeclr−1 make sure that no master automaton gets
the bus before requesting it.
• The grant signal for cache 0 is set to 1 and all other grant signals are
set to 0:

    grant[i]^0 = { 1    i = 0
                   0    otherwise .

8.5 Correctness Proof

Recall that for cache abstraction aca(i) = aca(ca(i)) we use the same defini-
tion as was introduced for direct mapped caches in Sect. 8.1.2:

ca(i).s(a.c) hhit(ca(i), a)
aca(i).s(a) =
I otherwise

ca(i).data(a.c) hhit(ca(i), a)
aca(i).data(a) =
∗ otherwise .
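For reference, a small Python model of this abstraction; the index width, the dictionary-based RAMs, and the reading of hhit as a plain tag match are assumptions made for the sketch.

    C_BITS = 8                              # assumed index width of the direct mapped cache

    def hhit(ca, a):
        # Hardware hit: slot a.c currently holds the tag of line address a.
        c, t = a & ((1 << C_BITS) - 1), a >> C_BITS
        return ca['tag'].get(c) == t

    def aca_state(ca, a):
        c = a & ((1 << C_BITS) - 1)
        return ca['s'][c] if hhit(ca, a) else 'I'

    def aca_data(ca, a):
        c = a & ((1 << C_BITS) - 1)
        return ca['data'][c] if hhit(ca, a) else None   # '*' (don't care) modelled as None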

We proceed with the correctness proof of the shared memory system in the
following order:
1. We show properties of the bus arbitration guaranteeing that the warm
phases of global transactions don’t overlap.
2. We show that caches not involved in global accesses output ones to the
open collector buses, i.e., they do not disturb signal transmission by other
caches.
3. We show that control automata run in sync during global accesses.
4. This permits to show that tristate buses are properly controlled.
5. We show that protocol data are exchanged in the intended way.
6. This permits to show that data are exchanged in the intended way.
7. Aiming at a simulation between hardware and the atomic MOESI cache
system, we identify the accesses of the hardware computation.
8. We prove a technical lemma stating that accesses acc(i, k) of the atomic
protocol only depend on cache lines with line address acc(i, k).a and only
modify such cache lines.
9. In Lemma 8.64 we show simulation between the hardware model executing
a given access and the atomic model executing the same access.
10. We order hardware accesses by their end cycle and relate the hardware
computation with the computation of the atomic protocol in Lemma 8.65.
Moreover, we show that the state invariants are maintained by the hard-
ware computation.
11. In Lemma 8.67 we show that our hardware memory system is a sequen-
tially consistent shared memory.
In lemmas and theorems given in this section we abbreviate
• (A) – automata construction,
• (HW) – hardware construction,
• (IH) – induction hypothesis.

8.5.1 Arbitration

We start with showing very basic properties of the bus arbitration.



Lemma 8.14 (grant unique).


grant[i] ∧ grant[j] → i = j
Proof. Trivial by construction of the arbiter. The output has the form f 1(x).


Lemma 8.15 (grant stable). During an active request a grant is not taken
away:
grant[i]t ∧ req(i)t → grant[i]t+1 .
Proof. By construction of the arbiter. 

Lemma 8.16 (request at global). Automata in a global phase request ac-
cess to the bus:
G(i)t → req(i)t .
Proof. By induction t − 1 → t. Trivially true for t = 0 because idle(i)0 , and
thus, ¬G(i)0 .
In the induction step we consider cycles t satisfying G(i)t (because other-
wise there is nothing to show) and argue here - and many times later - with
a very typical case distinction:
• ¬G(i)t−1 . By (A) we conclude
idle(i)t−1 ∧ (3)(i)t−1 ∧ wait(i)t ∧ reqset(i)t−1 .
By hardware construction of set/clear flip-flops we conclude
req(i)t .
• G(i)t−1 . Then by (A) we have ¬w(i)t−1 , and hence, ¬reqclr(i)t−1 . Using
(IH) and (HW) we conclude,
req(i)t = req(i)t−1 (HW )
= 1 . (IH)


Lemma 8.17 (grant at warm). A master can only be in the warm phase
if he is granted access to the bus:
W (i)t → grant[i]t .
Proof. By induction t − 1 → t. Nothing to show for t = 0. For the induction
step, consider t such that W (i)t :
• ¬W (i)t−1 . By automata construction we conclude
wait(i)t−1 ∧ ¬wait(i)t .
By (A) this implies grant[i]t−1 and by Lemma 8.16 (request at global) we
get req(i)t−1 . By Lemma 8.15 (grant stable) we conclude grant[i]t .

• W (i)t−1 . By Lemma 8.16 (request at global) we get req(i)t−1 . By (IH) we


get grant[i]t−1 and by Lemma 8.15 (grant stable) we get grant[i]t .



Now we state the very crucial lemma about the uniqueness of the automaton
in the warm phase.
Lemma 8.18 (warm unique). Only one processor at a time can be in a
warm phase:
W (i) ∧ W (j) → i = j .
Proof. W (i)∧W (j) implies grant[i]∧grant[j] by Lemma 8.17 (grant at warm).
By Lemma 8.14 (grant unique) one concludes i = j. 


8.5.2 Silent Slaves and Silent Masters

Since the signals between caches are transmitted via an open collector bus,
we want both slaves and masters to stay silent when they do not participate
in a global transaction.

Lemma 8.19 (silent slaves). When a slave is not participating in the pro-
tocol, it puts slave response 00 on the control bus:

    zs(i)^t ∈ {sidle, sidle′, s1} → sprotout(i)^t = 00 .

Proof. Proof by induction t − 1 → t. Reset ensures sidle0 and activates signal


sprotz which by (HW) clears the register. Thus, we have sprotout(i)0 = 00
and the lemma holds for t = 0.
Let t > 0 and zs(i)^t ∈ {sidle, sidle′, s1}. We consider two cases:
• zs(i)^{t−1} ∉ {sidle, sidle′, s1}. Then, by automata construction,

    zs(i)^{t−1} = sw ∧ sprotz(i)^{t−1} ∧ sprotoutce(i)^{t−1} .

Thus, the lemma holds by (HW).


• zs(i)^{t−1} ∈ {sidle, sidle′, s1}. Thus, we have ¬(s1(i)^{t−1} ∧ s2(i)^t). Therefore,
sprotout is not clocked (¬sprotoutce(i)t−1 ) (HW) and we get by (IH) and
register semantics

    sprotout(i)^t = sprotout(i)^{t−1}    (HW)
                  = 00 .                 (IH)



In exactly the same way one shows the next lemma.


Lemma 8.20 (silent master).

¬H(i) → mprotout(i) = 000



8.5.3 Automata Synchronization

This section contains two lemmas. We prove both of them simultaneously by


induction on the number of cycles t. Thus, the statements of both lemmas in
this section form together a single induction hypothesis.

Lemma 8.21 (idle slaves). If no automaton is in a hot phase, then all slaves
are idle:
(∀i : ¬H(i)t ) → ∀j : sidle(j)t .
Proof. For all i, we have after reset idle(i)^0 (and idle ∉ H) as well as sidle(i)^0. Thus, the
lemma holds initially. The induction step requires arguing about all states
and can only be completed at the end of the section. 


The next lemma explains how in a hot phase the master and the slave states
are synchronized.
Lemma 8.22 (sync). Consider a hot phase of master i lasting from cycle t
to t', i.e., we have

    ¬H(i)^{t−1} ∧ H(i)^{[t:t']} ∧ ¬H(i)^{t'+1} .

Then,
1. For the master i we have

    m0(i)^t ∧ m1(i)^{t+1} ∧ m2(i)^{t+2} ∧ m3(i)^{t+3} ∧
    mdata(i)^{[t+4:t'−1]} ∧ w(i)^{t'} ∧ idle(i)^{t'+1} .

2. The slave automaton of cache i doesn't leave state sidle:

    sidle(i)^{[t:t'+1]} .

3. For slaves that are not affected, i.e., for slaves j with j ≠ i and ¬bhit(j)^{t+1},
we have

    sidle(j)^t ∧ s1(j)^{t+1} ∧ sidle′(j)^{[t+2:t']} ∧ sidle(j)^{t'+1} .

4. The affected slaves, i.e., the slaves j with j ≠ i and bhit(j)^{t+1}, run in sync
with the master of the transaction:

    sidle(j)^t ∧ s1(j)^{t+1} ∧ s2(j)^{t+2} ∧ s3(j)^{t+3} ∧
    sdata(j)^{[t+4:t'−1]} ∧ sw(j)^{t'} ∧ sidle(j)^{t'+1} .

Proof. Part (1) follows directly by (A). For the proof of parts (2), (3), and
(4), recall that we are proving both lemmas together by induction on t.¹⁴ Our
induction hypothesis is stated for all cycles q ≤ t and consists of two parts:

• ∀q ≤ t : ∀t' : ¬H(i)^{q−1} ∧ H(i)^{[q:t']} ∧ ¬H(i)^{t'+1} → (2), (3), (4)    (sync)
• ∀q ≤ t : (∀i : ¬H(i)^q) → ∀j : sidle(j)^q .    (idle slaves)

¹⁴ Observe that for Lemma (sync), t is the start time of the hot phase.

The base case (t = 0) is trivial by (A). In the proof of the induction step from
t − 1 to t, the induction hypothesis trivially implies both statements in case
q < t. Hence, we only need to do the proof for the case q = t.
For the induction step of (sync), we have
 
    ¬H(i)^{t−1} ∧ H(i)^t ∧ H(i)^{[t:t']} ∧ ¬H(i)^{t'+1}

and conclude

    (wait(i)^{t−1} ∨ flush(i)^{t−1}) ∧ grant[i]^{t−1} .
By Lemma 8.14 (grant unique) it follows that

    ∀j ≠ i : ¬grant[j]^{t−1} .

By Lemma 8.17 (grant at warm) we have ¬W (j)t−1 and ¬H(j)t−1 . Hence,

∀j : ¬H(j)t−1 .

Using (idle slaves) as part of the induction hypothesis for cycle t − 1 we get

∀j : sidle(j)t−1 .

Using Lemma 8.20 (silent master), part (1) of Lemma (sync), (A), and (HW)
we then conclude for the cycles s ∈ [t − 1 : t']:

    b.mprot.Ca^s = ⋁_j mprotout(j).Ca^s = { 0    s ∈ {t − 1, t'}
                                            1    s ∈ [t : t' − 1] .

By Lemmas 8.17, 8.14 (grant at warm, grant unique) we know that the grant
signals are stable during cycles s ∈ [t − 1 : t']:

    grant[i]^s ∧ ∀j ≠ i : ¬grant[j]^s .

Parts (2), (3), and (4) follow now by construction of the slave automata and
observing that the exit conditions for states mdata and sdata are identical.
For the induction step of (idle slaves), we consider a cycle t such that
∀i : ¬H(i)t . We make the usual case distinction:
• ∀i : ¬H(i)t−1 . By (IH) we have

∀(j) : sidle(j)t−1 .

By Lemma 8.20 (silent master) we get

b.mprot.Cat−1 = 0

and the lemma follows by construction of the slave automata.



• ∃i : H(i)t−1 . By Lemma 8.18 (warm unique) this i is unique. By construc-


tion of the master automaton we conclude w(i)t−1 . This is the end of a
hot phase which started before cycle t. Therefore, we can apply parts (2),
(3), and (4) of (sync) as part of (IH) to conclude

∀(j) : sidle(j)t .



Now we are able to argue about the uniqueness of the di signal put on the
bus by the slaves.
Recall that the predicate SINV(t), introduced in Sect. 8.3.1, states that
the state invariants hold for the memory system ms(h) in all cycles t' ≤ t:

    SINV(t) ≡ ∀t' ≤ t : sinv(ms(h^{t'})) .

In the following lemma and in many other lemmas in this chapter we assume
that the predicate SIN V (t) holds. Later, we apply these lemmas in the proof
of a very important Lemma 8.64 (1 step). Then we use Lemma (1 step) as the
main argument in the induction step of Lemma 8.65 (relating hardware with
atomic protocol), where we in turn make sure that the predicate SIN V (t)
actually holds.

Lemma 8.23 (di unique).

SINV (t) ∧ sprotout(i).dit ∧ sprotout(j).dit → i = j

Proof. Proof by induction t − 1 → t. We make the usual case distinction:


• ¬dit−1 . This implies s2(i)t (from (A)). Applying Lemma 8.22 (sync) we
get that all other slaves are either in s2 or in sidle′. If a slave j is
in sidle′, it doesn't have an active di (from (A) and (HW)). If a slave is in
s2(j)^t, that means it was in s1(j)^{t−1} (from (A)). From SINV(t) we get
sinv(ms(ht−1 )). Hence, we can conclude that only one cache was in cycle
t − 1 in state O, E, or M. Since we know di(i)^t holds, by (A)

    aca(i).s^{t−1}(ca(i).badin^{t−1}) ∈ {O, E, M} .

From (HW) we also know that ca(j).badin^{t−1} = ca(i).badin^{t−1}. From
SINV(t) and (A) it follows that

    ∀j ≠ i : aca(j).s^{t−1}(ca(i).badin^{t−1}) ∉ {O, E, M} .

And thus, from construction of circuit C2 we get

    ∀j ≠ i : C2(aca(j).s^{t−1}(ca(j).badin^{t−1}), ca(j).mprotin^{t−1}).di = 0 .

• dit−1 . Trivial using (IH) and Lemma 8.22 (sync).





We can now also prove a technical lemma about the auxiliary registers
ca(i).dataouta′, ca(i).tagouta′, ca(i).souta′, and ca(i).soutb′. This lemma
guarantees that the content of these registers, in the cycles when they are used,
is the same as the corresponding output values of the edge-triggered RAMs.
Hence, we can simplify our reasoning about the data paths by simply replacing
these registers with wires.

Lemma 8.24 (auxiliary registers). The output of registers tagouta′, dataouta′,
souta′, and soutb′ in cycles when they are used is the same as the corresponding
output values of the RAMs:
1. w(i)^t ∨ localw(i)^t → dataouta′(i)^t = dataouta(i)^t ,
2. w(i)^t → tagouta′(i)^t = tagouta(i)^t ,
3. w(i)^t ∨ localw(i)^t ∨ (wait(i)^t ∧ grant[i]^t) → souta′(i)^t = souta(i)^t ,
4. sw(i)^t → soutb′(i)^t = soutb(i)^t .

Proof. By a case split on the state of master and slave automata in cycle t:
• Let w(i)t hold. By (A) we have mdata(i)t−1 , which implies for X ∈
{dataouta, tagouta, souta}

X ′(i)t = X(i)t−1 .

Applying Lemma 8.22 (sync) we know that the slave automaton of cache
i is in state sidle in cycle t − 1:

sidle(i)t−1 .

Since we don’t activate write enable signals for the RAMs in states sidle
and mdata, we know that the content of RAMs doesn’t change from t − 1
to t. By Lemma 8.16 (request at global), definition of mbusy, and stability
of processor inputs we get

pa(i)t = pa(i)t−1 .

Hence,

X ′(i)t = X(i)t−1 = X(i)t .

• Let localw(i)t hold. As in the previous case we conclude idle(i)t−1 and for
X ∈ {dataouta, souta}

X ′(i)t = X(i)t−1 .

Because transition (2) was taken in cycle t−1, we know that signals preq(i)
and mbusy(i) were high in cycle t − 1. Hence, from stability of processor
inputs we get
pa(i)t = pa(i)t−1 .

In cycle t − 1 there was no snoop conflict:

sidle(i)t−1 ∨ pa(i)t−1 .c ≠ badin(i)t−1 .c ,

which implies that ports b of RAMs were either not written, or were written
with a cache line address different from pa(i)t .c. The a ports of RAMs are
not clocked in state idle at all. Hence, for all outputs X we get

X(i)t = X(i)t−1

and conclude the proof.


• Let sw(i)t hold. This implies sdata(i)t−1 and

soutb′ (i)t = soutb(i)t−1 .

Moreover, we can conclude by (A):

sdata(i)t−2 ∨ s3(i)t−2 .

In states sdata and s3 register badin is not clocked. Hence,

badin(i)t = badin(i)t−1 = badin(i)t−2 .

Port b of the state RAM is never written in sdata. Port a is written only
in states w, flush, and localw. By Lemmas 8.22 and 8.18 (sync, warm
unique) we conclude that the master automaton of cache i cannot be in
states w or flush in cycle t − 1. In case localw(i)t−1 holds, we know that
in cycle t − 2 there was no snoop conflict on the bus:

pa(i)t−2 .c ≠ badin(i)t−2 .c ,

From stability of processor inputs we get

pa(i)t−2 = pa(i)t−1 ,

which implies
pa(i)t−1 .c ≠ badin(i)t−1 .c .
Hence, port a of the state RAM could only be written with a different
cache line address and we get

soutb′ (i)t = soutb(i)t−1 = soutb(i)t .

• Let wait(i)t ∧ grant[i]t hold. In the previous cycle the master automaton
of cache i was in state wait or idle. Hence, the register souta′ was clocked:

souta′ (i)t = souta(i)t−1 .

By (A) we know that port a of the state RAM was not clocked in cycle t−1.
We show by contradiction that port b of the RAM is also not clocked in

cycle t − 1. Let swb(i)t−1 . Then by (A) we have sw(i)t−1 and by Lemma


8.22 (sync) there exists master j ≠ i, s.t., w(j)t−1 holds. By Lemmas
8.17, 8.16, and 8.15 (grant at warm, request at global, grant stable) we
conclude15
grant[j]t .
This contradicts grant[i]t by Lemma 8.14 (grant unique). Hence, using
stability of processor inputs, we conclude

souta′ (i)t = souta(i)t−1 = souta(i)t .





8.5.4 Control of Tristate Drivers

Recall that in Sect. 3.5 we defined a discipline for the clean operation of the
tristate bus with and without the main memory. This discipline guaranteed
the absence of bus contention and the absence of glitches on the bus during
the transmission interval.
For the control of the tristate bus without the main memory we introduced
the function send(i) = j and the time intervals Ti = [ai : bi ] when unit j
is transmitting the data on the bus. We allowed unit j to transmit in two
consecutive time intervals Ti and Ti+1 where ai+1 = bi + 1 without disabling and
re-enabling the driver. If this is not the case, i.e., the driver of unit j is disabled
in cycle bi , then there must be at least one cycle between two consecutive time
intervals:
bi + 1 < ai+1 .
The sending register X is always clocked in cycle ai − 1 and must not be
clocked in the time interval [ai : bi − 1]. The flip-flop controlling the output
enable signal Xoe must be set in cycle ai − 1 (unless the same unit is sending
in the consecutive time interval, i.e., only if ai − 1 ≠ bi−1 ) and must be cleared
in cycle bi (again, only if bi + 1 ≠ ai+1 ).
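Operationally, the interval discipline for a single sender can be checked as in the following Python sketch (our own formulation, not part of the hardware construction, and assuming the back-to-back case ai+1 = bi + 1 as restated above): consecutive transmission intervals [ai : bi] are either back-to-back or leave at least one free cycle in between.

```python
# Sketch (ours) of the interval discipline of Sect. 3.5 for one sender:
# consecutive transmission intervals [a_i : b_i] either continue back-to-back
# (a_{i+1} = b_i + 1, the driver stays enabled) or leave at least one free
# cycle in between (b_i + 1 < a_{i+1}).
def intervals_ok(intervals):
    for (a0, b0), (a1, b1) in zip(intervals, intervals[1:]):
        if not (a1 == b0 + 1 or b0 + 1 < a1):
            return False
    return True

assert intervals_ok([(2, 5), (6, 8), (10, 10)])   # back-to-back, then a gap
assert not intervals_ok([(2, 5), (5, 7)])          # overlapping transmissions
```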
A tristate bus with the main memory is controlled in the same way. The
start of the time interval Ti = [ai : bi ] of the memory access is identified by an
activation of signal mmreq(j) and the end of the time interval is determined
by the memory acknowledge signal. Note that for other inputs to the main
memory the time intervals are allowed to be larger than the time interval for
mmreq.

15
Observe that this proof works only because we have a one cycle delay in the
generation of the grant signal: master j still has the request to the arbiter high in
state w(j)q . Hence, in cycle q + 1 the signal grant[j] is high, even though cache j
is already in state idle. This proof is the only place where we rely on this one cycle
delay in the generation of grant signals. With a more aggressive arbitration, i.e.,
if the request to the arbiter were lowered one cycle earlier, one would have to forward
the value written to the state RAM as an input to register souta′ in cycle q.

In this section we show that the construction of our control automata
does adhere to this control discipline. For all signals with the exception of
signal badout we always leave at least one cycle between two consecutive
transmissions. Signal badout is treated in a somewhat special manner:
• In case the automaton does not go to state flush, the signal badout is
transmitted on the bus starting from state m0 and up to state w. If later
on, a memory access starts in state mdata, then the time interval for
badout happens to start earlier than the time interval for signal mmreq
(in mdata we can have only a read access to the main memory).
• In case the automaton goes to state flush, the signal badout is transmitted
on the bus starting from the first cycle in flush and up to state w. More-
over, in the last cycle of flush, the sending register badout is clocked and
is overwritten with the new value (without disabling and then re-enabling
the driver). Hence, for badout in this case we have two time intervals
Ti = [ai : bi ] and Ti+1 = [ai+1 : bi+1 ]. The first memory access (i.e., the
one performed in state flush) is performed during interval Ti . The second
memory access, if performed in state mdata, lasts during [ai+1 + 4 : bi+1 ].
Hence, the time interval for badout simply starts earlier than the interval
for signals mmreq, mmw, and bdout.
We start with characterising for each register X(i) connected via a tristate
driver to a component b.Y of the bus the cycles t during which the driver for
register X(i) is enabled:

Cy(X, i) = {t | Xoe(i)t } .

For each of the signals X concerned, we will formulate Lemma (X) charac-
terizing the set Cy(X, i) (Lemmas 8.28, 8.29, 8.30, 8.31). Clearly, the flip-flop
controlling the output enable signal of the driver of register X is set in cycle t if

t ∉ Cy(X, i) ∧ t + 1 ∈ Cy(X, i)

and is cleared in cycle t if

t ∈ Cy(X, i) ∧ t + 1 ∉ Cy(X, i) .
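The set and clear cycles of the output enable flip-flop are thus determined by Cy(X, i) alone. A small Python helper (ours, purely illustrative) makes this explicit:

```python
# Sketch (ours): given the set Cy of cycles in which the driver of register X
# is enabled, Xoe is set in cycle t if t is not in Cy but t + 1 is, and
# cleared in cycle t if t is in Cy but t + 1 is not.
def oe_events(cy, horizon):
    set_cycles = [t for t in range(horizon) if t not in cy and t + 1 in cy]
    clr_cycles = [t for t in range(horizon) if t in cy and t + 1 not in cy]
    return set_cycles, clr_cycles

# a single transmission interval [3 : 5]
assert oe_events({3, 4, 5}, 10) == ([2], [5])
```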

To make sure that our construction satisfies the control discipline we show
the following properties:
• sets Cy(X, i) and Cy(X, j) are disjoint for i ≠ j and there is at least one
cycle in between cycles t ∈ Cy(X, i) and t′ ∈ Cy(X, j); hence, there can
be no bus contention (as guaranteed by Lemma 3.9 (tristate bus control)),
• for cycles t ∈ Cy(X, i) registers X(i) can be clocked in cycle t only if cycle
t + 1 does not belong to Cy(X, i) (with the exception of the signal badout,
as discussed above); the set-clear flip-flops controlling the drivers obey the
same rules; thus, the enabled drivers do not produce glitches on the bus,

• we show that the disabled drivers are not redundantly cleared; thus, the
disabled drivers do not produce glitches on the bus,
• one could also show that register X is always clocked in cycle t ∉ Cy(X, i)
if t + 1 ∈ Cy(X, i); yet, the absence of bus contention and the absence of
glitches does not really depend on this condition, so we do not bother.
Our next goal is to show the absence of bus contention. This involves a case
distinction. The easy case deals with signals which are active only during the
warm phases of master states, i.e., satisfying

∀t : t ∈ Cy(X, i) → W (i)t .

It will turn out that these are all signals except bdout. We deal with bus
contention for the latter signal at the end of the next section in Lemma 8.39.
For the easy case we formulate the following lemma.
Lemma 8.25 (no contention). Assume signal X satisfies

∀i, t : t ∈ Cy(X, i) → W (i)t .

Then,
i ≠ j ∧ t ∈ Cy(X, i) → {t, t + 1} ∩ Cy(X, j) = ∅ .
Proof. By contradiction. For i ≠ j, first assume t ∈ Cy(X, i) ∩ Cy(X, j). By
hypothesis we have W (i)t ∧ W (j)t . By Lemma 8.18 (warm unique) we
conclude i = j.
Now let t ∈ Cy(X, i) ∧ t + 1 ∉ Cy(X, i). Assume t + 1 ∈ Cy(X, j). The
only way for the automaton i to leave the warm phase is to go from w to idle.
Hence, we have w(i)t . By Lemmas 8.16, 8.17, and 8.15 (request at global, grant
at warm, grant stable) this implies grant[i]t+1 and gives a contradiction by
Lemmas 8.17 and 8.14 (grant at warm, grant unique). 

Showing absence of glitches for signals of the form mmreq(i), mmw(i),
bdout(i), and badout (i) involves two statements: one for enabled and one for
disabled drivers.
Lemma 8.26 (no glitches, enabled). Let

X ∈ {mmreq, mmw, bdout, badout}

and let t, t + 1 be consecutive cycles in Cy(X, i). Then in the first of these
cycles registers X and Xoe are not clocked:

t ∈ Cy(X, i) ∧ t + 1 ∈ Cy(X, i) → ¬Xce(i)t ∧ ¬Xoeclr(i)t ∧ ¬Xoeset(i)t .

The only exception is badoutce(i), which might be clocked when flush(i)t ∧
m0(i)t+1 holds.
Proof. For each of the signals X concerned, this lemma follows directly from
Lemmas 8.28, 8.29, 8.30, and 8.31 characterizing Cy(X, i) and (A). 


A glitch on the output enable signal Xoe of a disabled driver might propagate
to the output of the driver and thus on the bus.
Lemma 8.27 (no glitches, disabled). Let

X ∈ {mmreq, mmw, bdout, badout}

and let t, t + 1 be consecutive cycles not in Cy(X, i). Then the output enable
signal Xoe is not redundantly cleared:

t ∉ Cy(X, i) ∧ t + 1 ∉ Cy(X, i) → ¬Xoeclr(i)t .

Proof. For each signal concerned, the proof follows again directly from Lemmas
8.28, 8.29, 8.30, and 8.31 and automata construction. 

The next few lemmas are characterizing the sets Cy(X, i).
Lemma 8.28 (mmw). We write to the main memory only in state flush:

t ∈ Cy(mmw, i) ↔ flush(i)t .

Proof. We first show

t ∈ Cy(mmw, i) → flush(i)t .

Consider any maximal interval [t : t′] ⊂ Cy(mmw, i), i.e.,

¬mmwoe(i)t−1 ∧ ∀q ∈ [t : t′] : mmwoe(i)q ∧ ¬mmwoe(i)t′+1 .

By hardware construction we have mmwoeset(i)t−1 . By (A) we have

wait(i)t−1 ∧ (5)(i)t−1 ∧ flush(i)t .

For q ∈ [t : t′] we show by induction on q − 1 → q that flush(i)q holds. For
q > t, we have flush(i)q−1 by induction hypothesis and mmwoe(i)q . Hence,

¬b.mmack q−1 ∧ flush(i)q

by automata construction.
Finally, we conclude from

flush(i)t′ ∧ ¬mmwoe(i)t′+1

by automata construction

mmwoeclr(i)t′ ∧ m0(i)t′+1 .

This shows that t ∈ Cy(mmw, i) → flush(i)t . The inverse direction

flush(i)t → t ∈ Cy(mmw, i)

follows by automata construction with a trivial induction on t. 




The proofs of all other lemmas characterizing sets Cy(X, i) follow very similar
patterns and we, therefore, just formulate the lemmas without proofs. For
many of the following lemmas it is convenient to define for each cycle t and
cache i the cycle ez(t, i) when the master automaton did the most recent
change of states, i.e., the cycle when the master left the previous state before
entering the current state z(i)t :

z(i)t ≠ idle → ez(t, i) = max{t′ | t′ < t ∧ z(i)t′ ≠ z(i)t } .
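A direct transcription of ez(t, i) over a state trace of one master automaton may help when reading the following characterizations (Python sketch, ours; the trace below is an illustrative assumption):

```python
# Sketch (ours): ez(t) is the last cycle before t in which the master
# automaton was in a different state than in cycle t; it is only used for
# cycles t in which the automaton is not in state idle.
def ez(state_trace, t):
    return max(q for q in range(t) if state_trace[q] != state_trace[t])

trace = ["idle", "wait", "m0", "m1", "m2", "m3", "mdata", "mdata", "w"]
assert ez(trace, 7) == 5   # cycle 5 (state m3) precedes the mdata phase
```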

Lemma 8.29 (mmreq). We request a memory access when we flush or after


a miss of the master with no data intervention from any of the slaves:

t ∈ Cy(mmreq, i) ↔ mdata(i)t ∧ ¬(mprotout(i).bcez(t,i) ∨ sprotin(i).diez(t,i) )
∨ flush(i)t .

Lemma 8.30 (badout). The bus address always comes from the master dur-
ing the entire warm phase:

t ∈ Cy(badout, i) ↔ W (i)t .

Observe that while the output enable signal badoutoe for this signal stays constantly
1 during the entire warm phase, the content of the address register badout
changes after flush. The last signal bdout treated here can be activated both
by masters and by slaves.

Lemma 8.31 (bdout). Signal bdout(i) is put on the bus by the master in
state mdata in case of a global write access, by the slave in state sdata if it
intervenes after a miss, or by a master in case of a flush access:

t ∈ Cy(bdout, i) ↔ mdata(i)t ∧ mprotout(i).bcez(t,i)
∨ sdata(i)t ∧ sprotout(i).diez(t,i)
∨ flush(i)t .

We see that, with the exception of X = bdout, all signals satisfy the hypothesis
of Lemma 8.25 (no contention), thus we can summarize in
Lemma 8.32 (no contention 2).

i ≠ j ∧ t ∈ Cy(X, i) → {t, t + 1} ∩ Cy(X, j) = ∅

The corresponding result for X = bdout happens to depend on certain data


transmitted during the operation of the MOESI protocol. As these data are
not transmitted over the bus b.d, we can show the correct transmission of the
protocol data using the lemmas we already have.

8.5.5 Protocol Data Transmission


For states z ∈ {m0, m1, m2}, we identify what data are processed and trans-
mitted during state z.
Lemma 8.33 (before m0). In the cycle before entering m0 registers badout
and mprotout are loaded with the processor address and the output of circuit
C1. Let m0(i)t ∧ ¬m0(i)t−1 and let us abbreviate
ptype(i)t = (ca(i).prt , ca(i).pwt , ca(i).pcast , 0) .
Then,
badout(i)t = pa(i)t−1
mprotout(i)t = C1(aca(i).st−1 (pa(i)t−1 ), ptype(i)t−1 ) .
Proof. If in cycle t − 1 we don’t have an active phit signal, then we obviously
have
aca(i).st−1 (pa(i)t−1 ) = I .
The mux on top of circuit C1 guarantees that the invalid state is provided as
an input to the circuit. If phit is on then it holds
ca(i).st−1 (pa(i)t−1 ) = aca(i).st−1 (pa(i)t−1 )
and we use the output of the state RAM as an input to C1. The lemma now
follows directly by automata and hardware construction and from stability of
processor inputs. 
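The input selection in front of circuit C1 that the proof appeals to can be summarized by a one-line Python sketch (ours; the names are purely illustrative):

```python
# Sketch (ours): on a processor miss (no phit) circuit C1 sees the invalid
# state I, otherwise it sees the state stored for the addressed line.
def c1_state_input(phit: bool, state_ram_out: str) -> str:
    return state_ram_out if phit else "I"

assert c1_state_input(False, "S") == "I" and c1_state_input(True, "S") == "S"
```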

Lemma 8.34 (m0). During m0 the content of registers mprotout and badout
does not change and the protocol data and the bus address of the master are
broadcast. Let m0(i)t hold. Then for all j ≠ i:
mprotout(i)t+1 = mprotout(i)t
badout(i)t+1 = badout(i)t
mprotin(j)t+1 = mprotout(i)t
badin(j)t+1 = badout (i)t .
Proof. For mprotout this follows directly from automata and hardware con-
struction. For the mprotin, we have by Lemma 8.22 (sync) for all j ≠ i:
sidle(j)t ∧ (7)(j)t .
By (A), Lemma 8.18 (warm unique), and Lemma 8.20 (silent master) we
conclude:
mprotin(j)t+1 = b.mprott
= ⋁k mprotout(k)t
= mprotout(i)t .

For the bus address data b.ad, we have by (HW), (A), and Lemmas 8.30 and
8.32 (badout , no contention 2):
badin(j)t+1 = b.adt
= badout (i)t .


Lemma 8.35 (m1). During m1 the affected slaves load their answer sprotout
with the output of circuit C2. The content of registers X ∈ {badout, mprotout}
of the master and registers Y ∈ {badin, mprotin} of the slaves is unchanged.
Let m1(i)t . Then for all j, s.t., s1(j)t ∧ ¬(8)(j)t :
X(i)t+1 = X(i)t
Y (j)t+1 = Y (j)t
sprotout(j)t+1 = C2(soutb(j)t , mprotin(j)t )
= C2(aca(j).st (badin (j)t ), mprotin(j)t ) .
Proof. Proof analogous to Lemma 8.33 (before m0). 

Lemma 8.36 (m2). During m2 the protocol answer of the slaves is broadcast.
The content of registers X ∈ {badout, mprotout} of the master and registers
Y ∈ {badin, mprotin, sprotout} of the slaves is unchanged. Let m2(i)t . Then,
X(i)t+1 = X(i)t
Y (j)t+1 = Y (j)t

sprotin(i)t+1 = ⋁j sprotout(j)t .

Proof. Proof analogous to Lemma 8.34 (m0). 



Lemma 8.37 (after m2). Let m3(i)t and t′ = min{q | q > t ∧ w(i)q }, then for
all cycles q ∈ [t + 1 : t′] the content of registers X ∈ {badout, mprotout, sprotin}
of the master and registers Y ∈ {badin, mprotin, sprotout} of the slaves is
unchanged:
X(i)q = X(i)t
Y (j)q = Y (j)t .
Proof. Trivial by (A). 

With the above lemmas we can conclude a crucial lemma about the data
intervention signals.
Lemma 8.38 (no DI after BC). If the master signals a write hit during
m2(i) with mprotout(i).bc, then no slave signals data intervention. Let m2(i)t .
Then for all j:
mprotout(i).bct → ¬sprotout(j).dit .

Proof. m2(i)t implies m0(i)t−2 by automata construction. By Lemma 8.22
(sync) we know that the slave automaton of cache i remains in state sidle during
the whole hot phase and for j ≠ i we have

sidle(j)t−2 ∧ s1(j)t−1 .

By Lemma 8.19 (silent slaves) we get:

sprotout(i).dit = 0.

For all j ≠ i we can conclude:

mprotout(i).bct = mprotout(i).bct−2 (m0, m1)


= mprotin(j).bct−1 . (m0)

By Lemma 8.22 (sync), if slave j does not have a hit (i.e., ¬bhit(j)t−1 ), then
it goes to state sidle (j)t and by Lemma 8.19 (silent slaves) we have for all
such j:
sprotout(j).dit = 0.
If slave j does have a hit, then we have s1(j)t−1 ∧¬(8)(j)t−1 . From the protocol
and its correct implementation in circuit C2 we conclude using Lemma 8.35
(m1):

sprotout(j).dit = C2(soutb(j)t−1 , mprotin(j)t−1 ).di = 0 .




We are now able to show the absence of contention for bdout.

Lemma 8.39 (bdout contention). Assume SINV (t). Then,

∀q ≤ t : ∀j ≠ i : q ∈ Cy(bdout, i) → {q, q + 1} ∩ Cy(bdout, j) = ∅ .

Proof. Let q ∈ Cy(bdout, i). By Lemma 8.31 we conclude

flush(i)q ∨ mdata(i)q ∨ sdata(i)q .

Assume for some t′ ∈ {q, q + 1} we have bdoutoe(j)t′ for a different cache j. By
(A) this implies
flush(j)t′ ∨ sdata(j)t′ ∨ mdata(j)t′ .
We split cases:
• flush(i)q . By (A) we conclude

flush(i)q+1 ∨ m0(i)q+1 .

By Lemma 8.18 (warm unique) flush(j)t′ ∨ mdata(j)t′ is impossible. By
Lemmas 8.21 and 8.22 (idle slaves, sync) sdata(j)t′ is impossible too.

• mdata(i)q . By (A) we conclude

w(i)q+1 ∧ m3(i)q−1 ∧ m2(i)q−2 ∧ mprotout(i).bcq−1 .

By Lemma 8.18 (warm unique) flush(j)t′ ∨ mdata(j)t′ is impossible. The
case sdata(j)q+1 is impossible by Lemma 8.22 (sync). The case sdata(j)q
implies s3(j)q−1 and sprotout(j).diq−1 . By Lemma 8.36 (m2), we conclude

mprotout(i).bcq−2 ∧ sprotout(j).diq−2 .

and get a contradiction by Lemma 8.38 (no DI after BC).


• sdata(i)q . By (A) this implies sw(i)q+1 . The case for mdata(j)q has been
already ruled out in the proof for the previous case (with the reversed
indices). The cases

flush(j)t′ ∨ mdata(j)q+1 ∨ sdata(j)q+1

are not possible by Lemmas 8.21 and 8.22 (idle slaves, sync). Finally, the
case sdata(j)q is ruled out by Lemma 8.23 (di unique).



8.5.6 Data Transmission

Now that we know that the tristate drivers are properly controlled, it is very
easy to state the effect of data transferred via the buses.

Lemma 8.40 (f lush transfer). Assume SINV (t) and consider a maximal
time interval [s : t] when master i is in state flush:

¬flush(i)s−1 ∧ ∀q ∈ [s : t] : flush(i)q ∧ ¬flush(i)t+1 .

Then bdout(i)s is written to line badout(i)s of the main memory:

mmt+1 (badout(i)s ) = bdout(i)s ,

and all other memory cells are left unchanged:

∀a ≠ badout(i)s : mmt+1 (a) = mmt (a) .

Proof. By (A) and (HW) we have for the start cycle s of the time interval:

wait(i)s−1 ∧ (5)(i)s−1 ∧ mmreq(i)s ∧ mmw(i)s .

Let X ∈ {mmreq, mmw, badout , bdout}. Then we have by Lemma 8.26 (no
glitches, enabled):

∀q ∈ [s : t − 1] : X(i)q = X(i)q+1 .

By Lemmas 8.32 (no contention 2), 8.39 (bdout contention) and Lemmas 8.28,
8.29, 8.30, 8.31 characterising Cy(X, i), we get for the bus component b.X:

∀q ∈ [s, t] : b.X q = X(i)q .

By Lemmas 8.26 (no glitches, enabled) and 8.27 (no glitches, disabled) we
can conclude that the rules for operations with the main memory defined in
Sect. 3.5.7 are fulfilled. Hence, by Lemma 3.10 we have absence of glitches in
the main memory inputs. The lemma follows now from the specification of
the main memory16 . 


Lemma 8.41 (m3). Let m3(i)t . Then,


1. mprotout(i).bct → bdout(i)t+1 = modify (aca(i).datat (a), pdin(i)t , bw(i)t ),
2. sprotout(j).dit → bdout(j)t+1 = aca(j).datat (a).

Proof. Proof analogous to Lemma 8.34 (m0). 
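Lemma 8.41 uses the byte-write function modify. A plausible reading, given here only as an illustration (the precise definition of modify and the format of the byte-write mask bw are fixed earlier in the book), is a byte-wise multiplexer:

```python
# Sketch (ours): bytes selected by the byte-write mask bw are taken from the
# processor data pdin, all other bytes keep the old line content.  Both the
# line and pdin are modelled as equally long lists of byte values purely for
# illustration.
def modify(old_line, pdin, bw):
    return [pdin[i] if bw[i] else old_line[i] for i in range(len(old_line))]

assert modify([0x11, 0x22, 0x33, 0x44],
              [0xAA, 0xBB, 0xCC, 0xDD],
              [1, 0, 0, 1]) == [0xAA, 0x22, 0x33, 0xDD]
```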




Lemma 8.42 (mdata write hit). Assume SINV (t). Let

mprotout(i).bct−1 ∧ mdata(i)t .

Then bdout(i)t is broadcast to all slaves which are in state sdata:

∀j : sdata(j)t → bdin(j)t+1 = bdout(i)t .

Proof. By Lemma 8.31 (bdout) we have

bdoutoe(i)t .

By Lemma 8.39 (bdout contention) we conclude

b.dt = bdout(i)t .

By automata construction and hardware construction we conclude

∀j : sdata(j)t → bdin(j)t+1 = b.dt .




16
Additionally, one has to argue here that the memory write is not performed
to an address in ROM, i.e., that badout(i)s [28 : r] ≠ 029−r . Intuitively this is
true, because the software condition introduced in Sect. 8.4.1 guarantees that
processors never issue write or CAS requests to such addresses. Hence, a cache
line with a line address less than or equal to 029−r 1r cannot be in states M or O.
Formally, one can prove this by maintaining a simple invariant for such addresses,
and we leave that as an easy exercise for the reader.

Lemma 8.43 (mdata data intervention). Assume SINV (t). Let

mdata(i)t ∧ sprotout(j).dit−1 .

Then ca(j).bdoutt is transferred to the master

bdin(i)t+1 = bdout(j)t .

Proof. Proof along the lines of the previous two lemmas. 




Lemma 8.44 (mdata miss no intervention). Assume SINV (t) and con-
sider a maximal time interval [s : t] when the master is in state mdata:

¬mdata(i)s−1 ∧ mdata(i)[s:t] ∧ ¬mdata(i)t+1 .

Assume the absence of a write hit and of data intervention in cycle s − 1:

¬mprotout(i).bcs−1 ∧ ¬sprotin(i).dis−1 .

Then line mms (badout (i)s ) is sent to the master:

bdin(i)t+1 = mms (badout (i)s ) .

Proof. This lemma is proven along the lines of Lemma 8.40 (flush transfer).



8.5.7 Accesses of the Hardware Computation

In this section we construct a series of accesses acc(i, k) from a given hardware


computation and prove a number of properties for these accesses depending on
their type. We start by defining the hardware cycle e(i, k) when the hardware
access corresponding to acc(i, k) ends. A read, write, or CAS access to cache
i ends in cycle t when the processor request signal preq(i)t is on and the busy
signal mbusy(i)t is off. A flush access ends in cycle t when the master leaves
state flush or when the master writes I to the state RAM while leaving state
wait and going to m0 (this corresponds to invalidation of a clean line):

flushend (i, t) = flush(i)t ∧ ¬flush(i)t+1 ∨ wait(i)t ∧ swa(i)t
someend (i, t) = preq(i)t ∧ ¬mbusy(i)t ∨ flushend (i, t) .

The definition of the end cycles e(i, k) for cache i is obviously

e(i, k) = { min{t | someend (i, t)}                              k = 0
          { min{t | t > e(i, k − 1) ∧ someend (i, t)}            k > 0 .

Note that from (A) and stability of processor inputs it follows that

idle(i)e(i,k) ∨ wait(i)e(i,k) ∨ localw(i)e(i,k) ∨ flush(i)e(i,k) ∨ w(i)e(i,k) .
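The end cycles can be recomputed from per-cycle traces of the predicates used above; the following Python sketch (ours) mirrors the definitions of flushend, someend, and e(i, k) for a single cache:

```python
# Sketch (ours): end cycles of the accesses of one cache, recomputed from
# boolean traces of the hardware signals.  ends[k] plays the role of e(i, k);
# the traces are finite, so the last cycle is treated conservatively.
def end_cycles(preq, mbusy, flush, wait, swa):
    horizon = len(preq)
    ends = []
    for t in range(horizon):
        flush_end = ((flush[t] and t + 1 < horizon and not flush[t + 1])
                     or (wait[t] and swa[t]))
        some_end = (preq[t] and not mbusy[t]) or flush_end
        if some_end:
            ends.append(t)
    return ends
```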



Table 13. Classification of accesses

by hardware execution    by atomic execution    by operation type
local read                local read             read, negative CAS
delayed local read        local read             negative CAS
local write               local write            write, positive CAS
global access             global access          write, read, CAS
flush access              flush access           flush access

Thus far we have introduced two classifications of accesses. In Sect. 8.2.3
we distinguished accesses by the type of their operation: read accesses, write
accesses, CAS accesses, and flushes. In Sect. 8.3.4 we grouped all accesses
depending on the way they are treated in the atomic protocol. There we had
local reads, local writes, global accesses, and flushes. Now we have to introduce
a third classification of accesses depending on the way they are treated by the
hardware. This classification is very close to the one we had for the atomic
protocol and considers the following types of accesses:
• An access (i, k) is a local read if it ends in state idle:

idle(i)e(i,k) .

• An access (i, k) is a delayed local read if it ends in state wait and is not a
flush access:
wait(i)e(i,k) ∧ ¬flushend (i, e(i, k)) .
• An access (i, k) is a local write if it ends in state localw:

localw(i)e(i,k) .

• An access (i, k) is a global access if it ends in state w:

w(i)e(i,k) .

• An access (i, k) is a flush if the condition for the end of a flush access is
satisfied:
flushend(i, e(i, k)) .
A local access is either a local read, a local write, or a delayed local read. The
correspondence between the three introduced classifications of accesses is shown
in Table 13.
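The hardware classification can be phrased as a simple decision on the master state in the end cycle (Python sketch, ours):

```python
# Sketch (ours): classification of an access by the state of the master
# automaton in its end cycle, following the case list above.  The flush test
# comes first, since a flush may also end in state wait.
def classify(end_state: str, is_flush_end: bool) -> str:
    if is_flush_end:
        return "flush"
    if end_state == "idle":
        return "local read"
    if end_state == "wait":
        return "delayed local read"
    if end_state == "localw":
        return "local write"
    if end_state == "w":
        return "global access"
    raise ValueError("an access cannot end in state " + end_state)

assert classify("wait", False) == "delayed local read"
assert classify("wait", True) == "flush"
```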
Start cycles s(i, k) of accesses are defined in the following way. Local reads
start and end in the same cycle. Delayed local reads also start and end in the
same cycle. Local writes start 1 cycle before they end. Global accesses start
in the cycle when their hot phase begins. Flushes ending in state flush start
when the master enters state flush. Flushes ending in state wait start in the
same cycle.

Let t = e(i, k). Then,

s(i, k) = { t                                     idle(i)t ∨ wait(i)t
          { t − 1                                 localw(i)t
          { max{q | q < t ∧ wait(i)q } + 1        flush(i)t
          { max{q | q < t ∧ m0(i)q }              otherwise .

From (A) we conclude

idle(i)s(i,k) ∨ wait(i)s(i,k) ∨ flush(i)s(i,k) ∨ m0(i)s(i,k) .

One easily shows the following lemma.


Lemma 8.45 (local order).

∀k : s(i, k) ≤ e(i, k) < s(i, k + 1)

With the help of the end cycles e(i, k) alone we define the parameters of the
acc(i, k) of the sequential computation. We start with flush accesses ending
in state flush, i.e., with the case

flush(i)e(i,k) .

The address comes from badout at the end of the access. The rest is obvious:

acc(i, k).a = badout(i)e(i,k)
acc(i, k).f = 1
acc(i, k).r = acc(i, k).w = acc(i, k).cas = 0 .

For flush accesses ending in state wait, i.e., for the case

wait(i)e(i,k) ∧ flushend (i, e(i, k)) ,

the tag of the address is taken from the tag RAM in cycle e(i, k) while the
cache line address is copied from the processor input. Let pa = pa(i)e(i,k) ,
then:

acc(i, k).a = ca(i).tag(pa.c)e(i,k) ◦ pa.c


acc(i, k).f = 1
acc(i, k).r = acc(i, k).w = acc(i, k).cas = 0 .

For all other accesses, we construct acc(i, k) from the processor input at the
end of the access t = e(i, k) (note that the processor inputs don’t change
during the access):

acc(i, k).a = pa(i)t


acc(i, k).data = pdin(i)t
acc(i, k).cdata = pcdin(i)t
acc(i, k).bw = pbw(i)t
acc(i, k).w = pw(i)t
acc(i, k).r = pr(i)t
acc(i, k).cas = pcas(i)t
acc(i, k).f = 0 .
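Putting the three cases together, the construction of acc(i, k) can be sketched as follows (ours; addresses are modelled as tag/line-address pairs and proc stands for the stable processor inputs of cache i in cycle e(i, k)):

```python
# Sketch (ours): construction of the access parameters acc(i, k) from the
# hardware signals at the end cycle, following the three cases above.
def build_access(end_state, is_flush_end, badout, tag_ram, proc):
    if is_flush_end and end_state == "flush":
        # address taken from register badout at the end of the access
        return {"a": badout, "f": 1, "r": 0, "w": 0, "cas": 0}
    if is_flush_end and end_state == "wait":
        # invalidation of a clean line: tag from the tag RAM, line address
        # from the processor input
        _, line = proc["a"]
        return {"a": (tag_ram[line], line), "f": 1, "r": 0, "w": 0, "cas": 0}
    # all other accesses copy the (stable) processor inputs
    return {"a": proc["a"], "data": proc["din"], "cdata": proc["cdin"],
            "bw": proc["bw"], "w": proc["w"], "r": proc["r"],
            "cas": proc["cas"], "f": 0}
```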

For accesses acc(i, k) which are not flushes, we also define the last cycle d(i, k)
before or during (in case of a local operation) the access, when master i was
making a decision to go either global or local. For all CAS accesses which end
in stage w (i.e., global CAS accesses) and for those CAS accesses which end
in stage wait (i.e., delayed local reads), this will be the last cycle of wait. For
all other accesses, this is the last cycle when master automaton i was in state
idle:


d(i, k) = { max{q | q ≤ s(i, k) ∧ wait(i)q }      acc(i, k).cas ∧ (w(i)e(i,k) ∨ wait(i)e(i,k) )
          { max{q | q ≤ s(i, k) ∧ idle(i)q }      otherwise .

Note that for global CAS accesses and for delayed local reads we actually perform
the test twice: the first time when we leave state idle and the second time
when we leave state wait. For these accesses we take into consideration only
the result of the second test. However, we also partially depend on the result
of the first test, because only the first test guarantees that we do not have
a positive CAS hit in an exclusive state. In the proof of Lemma 8.46 we show
that if a positive exclusive CAS hit was not signalled at the time of the first
test, then it is also not signalled at the time of the second test.
Further we aim at lemmas stating that the conditions for local and global
accesses are stable during an access: if we made the decision based on
the cache content later during the access, we would get the same result.
We now show some crucial lemmas for these accesses. For the predicates
defined in Sect. 8.3.4, we use the following shorthands:

global(i, k, aca) = global(aca, acc(i, k), i)


local(i, k, aca) = local(aca, acc(i, k), i)
rlocal(i, k, aca) = rlocal(aca, acc(i, k), i)
wlocal(i, k, aca) = wlocal(aca, acc(i, k), i) .
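For orientation, the global test can be transcribed as follows (Python sketch, ours; this is the expansion that will be used in the proof of Lemma 8.61, with cas_test standing for the CAS test on the cached data):

```python
# Sketch (ours): an access goes global if it is not a flush and either misses
# in its own cache, or hits a shared line (S or O) with a write or a CAS
# whose test succeeds.
def is_global(acc, line_state: str, cas_test: bool) -> bool:
    if acc["f"]:
        return False
    if line_state == "I":
        return True
    return line_state in ("S", "O") and (acc["w"] or (acc["cas"] and cas_test))

assert is_global({"f": 0, "w": 1, "cas": 0}, "I", False)
assert not is_global({"f": 0, "w": 0, "cas": 1}, "M", True)
```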

Lemma 8.46 (global end cycle). Let acc(i, k) be an access. Then the global
test is successful in cycle d(i, k) iff the access ends in stage w:

global(i, k, acad(i,k)) ↔ w(i)e(i,k) .



Proof. We first prove the direction from left to right. We show w(i)e(i,k) by
contradiction. Let

idle(i)e(i,k) ∨ wait(i)e(i,k) ∨ localw(i)e(i,k) .

By definition of global we get ¬acc(i, k).f . This implies

¬flush(i)e(i,k) .

We consider two cases:


• Let (i, k) be a delayed local read:

acc(i, k).cas ∧ wait(i)e(i,k) .

Then we have s(i, k) = e(i, k) = d(i, k). By automata construction we get


(9)(i)d(i,k) , which implies rlocal(i)d(i,k) . This contradicts

global(i, k, acad(i,k)) .

• If (i, k) is not a delayed local read, we have idle(i)s(i,k) and

s(i, k) = d(i, k) ∈ {e(i, k), e(i, k) − 1} .

This implies by (A) that condition (3) for the master automaton did not
hold in cycle d(i, k), which contradicts global(i, k, acad(i,k)).
For the direction from right to left, we have ¬acc(i, k).f by definition of e(i, k).
By automata construction we find cycles t, t′, and t″, such that

t = max{q | q < e(i, k) ∧ m0(i)q }
t′ = max{q | q < t ∧ wait(i)q }
t″ = max{q | q < t′ ∧ idle(i)q } .

By definition we have t = s(i, k). We again consider two cases:


• acc(i, k).cas. Then by definition we have t′ = d(i, k). From hardware and
automata construction we get

grant[i]t′ ∧ ¬(9)(i)t′ ,

which implies
¬phit(i)t′ ∨ test(i)t′ .
In case ¬phit(i)t′ holds we obviously have global(i, k, acat′ ). For case
phit(i)t′ ∧ test(i)t′ , we still have to show aca(i).st′ (a) ∈ {O, S}. Let this
be not true, i.e.,
aca(i).st′ (a) ∈ {E, M } .
During cycles q ∈ [t″ : t′] the master automaton was in state wait or in
state idle. The tag RAM of cache i during these cycles is not updated. If

in any such cycle the state RAM of cache i gets updated with address a.c
(this can only happen at cycle q if sw(i)q holds), then we can conclude by
(HW) and the construction of circuit C2:

aca(i).sq+1 (a) ∈ {I, S, O} .

Hence, the only possible case for line a to be in one of the exclusive states
in cycle t′ is to be in the same state in cycle t″:

aca(i).st″ (a) ∈ {E, M } .

But this by (A) and (HW) contradicts the fact that we moved from state
idle to state wait in cycle t″.
• ¬acc(i, k).cas. By definition we have t″ = d(i, k). Since we have moved
to state wait in this cycle, we have (3)(i)d(i,k) and the lemma follows by



In the very same way we show a lemma for the local accesses.
Lemma 8.47 (local end cycle). Let acc(i, k) be an access. Then
1. the test for a local read is successful in cycle d(i, k) iff the access ends in
stage idle or in stage wait:

rlocal(i, k, acad(i,k) ) ↔ (idle(i)e(i,k) ∨ (wait(i)e(i,k) ∧ ¬acc(i, k).f )) ,

2. the test for a local write is successful in cycle d(i, k) iff the access ends in
stage localw:

wlocal(i, k, acad(i,k) ) ↔ localw(i)e(i,k) .

Proof. We first show the direction from right to left for both statements. By
definition of e(i, k) we have

preq(i)e(i,k) ∧ ¬mbusy(i)e(i,k) .

We do an obvious case split. If idle(i)e(i,k) , then d(i, k) = e(i, k) and by (A)


we get
rlocal(i)d(i,k) ,
which implies rlocal(i, k, acad(i,k) ). If wait(i)e(i,k) ∧ ¬acc(i, k).f , then again
d(i, k) = e(i, k) and by (A) we get

(9)(i)e(i,k) ,

which also implies rlocal(i, k, acad(i,k) ). Finally, if localw(i)e(i,k) , then d(i, k) =


e(i, k) − 1 and by (A) we have

(2)(i)d(i,k) .

This implies wlocal(i, k, acad(i,k) ).


For the direction from left to right, we have ¬acc(i, k).f from the definition
of rlocal and wlocal. From Lemma 8.46 (global end cycle) we get ¬w(i)e(i,k) .
Hence, we have

idle(i)e(i,k) ∨ wait(i)e(i,k) ∨ localw(i)e(i,k) .

Observing that rlocal(i, k, acad(i,k) ) and wlocal(i, k, acad(i,k) ) are mutually


exclusive and applying the (already proven) direction from right to left of
both statements, we conclude the proof. 

The lemmas global/local end cycle are (sometimes implicitly) applied in the
proofs of almost all the lemmas given below.
Lemma 8.48 (slave write at hot). For all X ∈ {data, tag, s}:

grant[i]t ∧ ¬H(i)t → Xwb(i)t = 0 .

Proof. By Lemma 8.14 (grant unique) we get ∀j ≠ i : ¬grant[j]t . Applying


Lemma 8.17 (grant at warm) again we get

∀j : ¬H(j)t .

Hence, by Lemma 8.21 (idle slaves) we know that all slaves are idle (including
slave i):
∀j : sidle(j)t .
By (A) we get Xwb(i)t = 0 and the lemma holds. 

Lemma 8.49 (stable master). Assume acc(i, k).f ∨ global(i, k, acad(i,k)),
i.e., acc(i, k) is a flush or a global access. Then, during the entire access,
abstract cache i does not change:

∀q ∈ [s(i, k) : e(i, k)] : aca(i)q = aca(i)s(i,k) .

Proof. For X ∈ {data, tag, s} the master automaton activates write signals
Xwa only in cycle q = e(i, k). These writes update the cache only after the
end of the access.
If access acc(i, k) is a flush, then for any cycle q under consideration we
have
wait(i)q ∧ grant[i]q ∨ W (i)q
and conclude the statement by Lemmas 8.17 and 8.48 (grant at warm, slave
write at hot). If access acc(i, k) is a global read or write, then by Lemma 8.22
(sync) for the slave automaton of cache i we have sidle(i)q . In this state the
slave automaton does not activate a write signal Xwb. 

The following lemma states that in the last cycle of wait the RAMs of a waiting
cache are not updated, unless a flush access is performed on a transition from
wait to m0.

Lemma 8.50 (last cycle of wait). Let wait(i)q ∧ ¬wait(i)q+1 . Then


1. ¬swa(i)q → aca(i)q = aca(i)q+1 ,
2. swa(i)q → aca(i).sq (pa(i)q ) = aca(i).sq+1 (pa(i)q ) = I.

Proof. For X ∈ {data, tag, s} we have Xwb(i)q = 0 by Lemma 8.48 (slave


write at hot). We do an obvious case split:
• ¬swa(i)q . Then the RAMs are not updated via port a and we conclude

aca(i)q = aca(i)q+1 .

• swa(i)q . By (A) and (HW) this implies

ca(i).tag q (pa(i)q .c) ≠ pa(i)q .t .

Hence, in the abstract cache the line addressed by pa(i) is invalid:

aca(i).sq (pa(i)q ) = I.

Since we do not update the tag RAM, the line stays invalid in cycle q + 1.



Another lemma argues that in the last cycle of f lush the state of the abstract
cache line addressed by pa is not changed. As a result, the output of circuit
C1 is the same as if it was computed one cycle later.
Lemma 8.51 (last cycle of flush). Let flush(i)t ∧ ¬flush(i)t+1 . Then

aca(i).st (pa(i)t ) = aca(i).st+1 (pa(i)t ) = I .

Proof. The idea behind the proof is simple: when we made the decision to go to
state flush, the cache line addressed by pa had a different tag. Hence, in the
abstract cache this line was invalid. Since we do not write the tag RAM in
the cycles under consideration, the line stays invalid when we leave the state
flush.
Formally, let q be the last cycle when we were in state wait before the
flush:
q = max{t′ | wait(i)t′ ∧ t′ < t} .
Since in cycle q we made the decision to go to state flush, it holds that

pa(i)q .t ≠ ca(i).tag q (pa(i)q .c) .

Lemma 8.48 (slave write at hot) guarantees that in between cycles q and t we
don’t activate signal tagwb(i). Since we also don’t write the tag RAM from
the master side in states wait and f lush, and the processor inputs are stable
we get

pa(i)q .t = pa(i)t .t
= pa(i)t+1 .t
ca(i).tag q (pa(i)q .c) = ca(i).tag t (pa(i)t .c)
= ca(i).tag t+1 (pa(i)t+1 .c) .

Hence, we conclude

aca(i).st (pa(i)t ) = aca(i).st+1 (pa(i)t ) = I .




Lemma 8.52 (overlapping accesses with global). Let acc(i, k) be a


global access with address a = acc(i, k).a ending in cycle e(i, k) = t. Let
acc(r, s) be an access with address a that is not a flush access and that is
overlapping with acc(i, k). Then acc(r, s) is a local write and the overlap is in
cycle {s(i, k), s(i, k) + 1} or acc(r, s) is a local read and the overlap is in cycle
s(i, k):

w(i)e(i,k) ∧ (i, k) ≠ (r, s) ∧ ¬acc(r, s).f


∧ acc(i, k).a = acc(r, s).a ∧ u ∈ [s(i, k) : e(i, k)] ∩ [s(r, s) : e(r, s)]
→ localw(r)e(r,s) ∧ u ∈ {s(i, k), s(i, k) + 1}
∨ idle(r)e(r,s) ∧ u = s(i, k) = s(r, s) .

Proof. Access acc(r, s) cannot be on the same master by Lemma 8.45 (local
order) and it cannot be global by Lemma 8.18 (warm unique). Thus, it is
a local access. It cannot be a delayed local read because that would imply
grant[r]s(r,s) which gives a contradiction by Lemmas 8.17 and 8.14 (grant at
warm, grant unique). Hence, the decision to go local is made in stage idle:

idle(r)d(r,s) ∧ d(r, s) = s(r, s) ∈ {e(r, s), e(r, s) − 1} .

acc(r, s) cannot start later than s(i, k) because by Lemma 8.34 (m0) in cycle
s(i, k) + 1 slave r has already clocked in address a:

badin(r)s(i,k)+1 = a .

This gives a snoop conflict for address a in cache r in cycle s(i, k) + 1 and the
access acc(r, s) cannot start. 


Lemma 8.53 (stable local). Let acc(i, k) be a local access. Then during the
access abstract cache i does not change:

local(i, k, acad(i,k)) → aca(i)s(i,k) = aca(i)s(i,k)+1 .

Proof. By Lemma 8.47 (local end cycle), by the definition of s(i, k), and by
(A) we have

idle(i)s(i,k) ∨ (wait(i)s(i,k) ∧ idle(i)s(i,k)+1 ) .

For X ∈ {data, tag, s} we have to show Xwa(i)s(i,k) = Xwb(i)s(i,k) = 0.


This is trivial for signals Xwa(i) because they are not activated in state idle
and on a transition from wait to idle. Signals Xwb(i)s(i,k) are active if slave
automaton i is in state sw, i.e., sw(i)s(i,k) . In this case by Lemma 8.22 (sync)
there is a master r ≠ i and an access acc(r, s) such that w(r)s(i,k) . Thus, the
accesses overlap while the master is in state w. But this contradicts Lemma
8.52 (overlapping accesses with global): the accesses can only overlap when
the master is in states m0 or m1. 


Lemma 8.54 (overlapping accesses with flush). Assume SINV (t). Let
acc(i, k) be a flush with address a = acc(i, k).a ending in cycle e(i, k) = t.
Let acc(r, s) be any access to address a except a local read. Then the time
intervals of the two accesses are disjoint. Thus, only local reads can overlap
with flushes:

(i, k) ≠ (r, s) ∧ acc(i, k).f ∧ e(i, k) = t


∧ ¬idle(r)e(r,s) ∧ acc(r, s).a = acc(i, k).a
→ [s(i, k) : e(i, k)] ∩ [s(r, s) : e(r, s)] = ∅ .

Proof. Proof by contradiction. Assume

(r, s) ≠ (i, k) ∧ q ∈ [s(i, k) : e(i, k)] ∩ [s(r, s) : e(r, s)] .

The case r = i is impossible by Lemma 8.45 (local order). Hence, r ≠ i. For


cycle q we have
grant[i]q .
Hence, acc(r, s) cannot be a flush, a global access, or a delayed local read
by (A) and by Lemmas 8.17 and 8.14 (grant at warm, grant unique). Thus,
access acc(r, s) is a local write access at cache r and it consists of 2 cycles:

e(r, s) = s(r, s) + 1 .

Hence, access acc(r, s) has started in cycle q or in cycle q − 1 with a cache hit
in an exclusive state:

s(r, s) ∈ {q − 1, q} ∧ aca(r).ss(r,s) (a) ∈ {E, M } .

By Lemma 8.53 (stable local) we get

aca(r).sq (a) ∈ {E, M } .

On the other hand, flush accesses like acc(i, k) are preceded by a cache hit in
state wait at the eviction address in cycle s(i, k) − 1 or s(i, k). We split cases:

• flush(i)s(i,k) . This implies

wait(i)s(i,k)−1 ∧ aca(i).ss(i,k)−1 (a) ≠ I .

By (A) and by Lemmas 8.49, 8.50 (stable master, last cycle of wait) we
get
∀u ∈ [s(i, k) − 1 : e(i, k)] : aca(i).su (a) ≠ I .
For u = q this contradicts the state invariants.
• wait(i)s(i,k) . This implies s(i, k) = e(i, k) = q and

aca(i).sq (a) ≠ I .

Again, this contradicts the state invariants.





Lemma 8.55 (overlapping accesses with local write). Assume SINV (t).
Let acc(i, k) be a local write access with address a = acc(i, k).a ending in cycle
e(i, k) = t. Let acc(r, s) be another local access with address a. Then it cannot
overlap with acc(i, k):

(i, k) ≠ (r, s) ∧ e(i, k) = t ∧ localw(i)e(i,k)


∧ acc(i, k).a = acc(r, s).a ∧ local(r, s, acad(r,s) )
→ [s(i, k) : e(i, k)] ∩ [s(r, s) : e(r, s)] = ∅ .

Proof. Assume intervals overlap in cycle q. We have i ≠ r by Lemma 8.45


(local order). For acc(i, k) we have d(i, k) = s(i, k). For acc(r, s) we have by
Lemma 8.47 (local end cycle), by the definition of s(r, s), and by (A):

idle(r)s(r,s) ∨ (wait(r)s(r,s) ∧ idle(r)s(r,s)+1 )

and
d(r, s) = s(r, s).
By (A) and by Lemmas 8.53, 8.50 (stable local, last cycle of wait) we conclude

aca(i).sd(i,k) (a) = aca(i).sq (a)


aca(r).sd(r,s) (a) = aca(r).sq (a) .

By Lemma 8.47 (local end cycle) we get

wlocal(i, k, acad(i,k) ).

Together with local(r, s, acad(r,s)), this gives us

aca(i).sd(i,k) (a) ∈ {E, M }
aca(r).sd(r,s) (a) ≠ I .

Fig. 142. Possible overlaps between accesses to the same cache address a

Thus,

aca(i).sq (a) ∈ {E, M }
aca(r).sq (a) ≠ I ,

which contradicts the state invariants. 




The last three lemmas can be summarized as follows. The only possible over-
laps between accesses to the same cache address a are (see Fig. 142):
1. a flush with local reads,
2. a global access with local reads or local writes; in this case a local access
ends at most 1 cycle after the start of the global access,
3. a local read with other local reads and with delayed local reads.
If we are interested in accesses to the same address a and ending at the same
cycle t we are left only with the first and the third cases. Formally, let

E(a, t) = {(i, k) | e(i, k) = t ∧ acc(i, k).a = a}

be the set of accesses with address a ending in cycle t. Then we have

Lemma 8.56 (simultaneously ending accesses). Assume SINV (t). For


any a and t, either set E(a, t) contains at most one element, or all accesses
in E(a, t) are local reads and delayed local reads, or one access in E(a, t) is a
flush and all other accesses are local reads:

#E(a, t) ≤ 1
∨ (∀(i, k) ∈ E(a, t) : idle(i)e(i,k) ∨ (wait(i)e(i,k) ∧ ¬acc(i, k).f ))
∨ (∃(i, k) ∈ E(a, t) : acc(i, k).f ∧ ∀(r, s) ∈ E(a, t) :
(r, s) ≠ (i, k) → idle(r)e(r,s) ) .

Proof. Trivial by using Lemmas 8.47, 8.46 (local end cycle, global end cycle)
and overlapping lemmas. 


Let predicate P (i, k, a, t) be true if access (i, k) ends in cycle t, accesses address
a, and is not a local read or a delayed local read:

P (i, k, a, t) ≡ e(i, k) = t ∧ acc(i, k).a = a


∧ ¬(idle(i)e(i,k) ∨ wait(i)e(i,k) ∧ ¬acc(i, k).f ) .

There is an obvious correspondence between P (i, k, a, t) and set E(a, t):

P (i, k, a, t) = 1 ↔ (i, k) ∈ E(a, t) ∧ ¬(idle(i)e(i,k) ∨ wait(i)e(i,k) ∧ ¬acc(i, k).f ) .

By Lemma 8.47 (local end cycle) this is equivalent to

P (i, k, a, t) = 1 ↔ (i, k) ∈ E(a, t) ∧ ¬rlocal(i, k, acad(i,k) ) .
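In this form both E(a, t) and P (i, k, a, t) are straightforward to compute from the constructed accesses (Python sketch, ours):

```python
# Sketch (ours): E(a, t) collects the accesses with address a ending in cycle
# t; P(i, k, a, t) additionally requires that the access is not rlocal, i.e.
# neither a local read nor a delayed local read.
def E(accesses, a, t):
    return {key for key, acc in accesses.items()
            if acc["e"] == t and acc["a"] == a}

def P(accesses, is_rlocal, i, k, a, t):
    return (i, k) in E(accesses, a, t) and not is_rlocal[(i, k)]
```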

We can now identify all cache lines that get modified in cycle t.
Lemma 8.57 (unchanged cache lines). Let X ∈ {s, data}, then

aca(i).X t+1 (a) ≠ aca(i).X t (a) →
∃k, j : P (i, k, a, t) ∨ j ≠ i ∧ P (j, k, a, t) ∧ w(j)t .

Proof. On cache i ports a of RAMs are updated only when

w(i)t ∨ localw(i)t ∨ ((f lush(i)t ∨ wait(i)t ) ∧ m0(i)t+1 ) .

Ports b of state and data RAMs are updated only when the slave automaton
of cache i is in state sw(i)t . Port b of the tag RAM is never written.
A write to a state or a data RAM can modify at most one line in the
abstract cache. A write to a tag RAM can modify at most 2 lines. We do the
proof by considering all possible updates to cache RAMs.
• w(i)t . In this state all the RAMs are updated via ports a with the address
pa(i)t . In the abstract cache i this update can possibly modify the lines

a = pa(i)t and b = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .

By definition we obviously have

acc(i, k).a = pa(i)t .



By Lemma 8.46 (global end cycle) we get

global(i, k, acad(i,k))

and conclude
P (i, k, aca(i, k).a, t) .
Hence, for line address a = acc(i, k).a we are done. The line b can possibly
be affected by the update only if

ca(i).tag t (a.c) ≠ a.t ,

i.e., if we are overwriting the tag with the new value. In that case the cache
line addressed by b becomes invalid:

aca(i).st+1 (b) = I .

Our goal is to show that (in this case) this line was already invalid in cycle
t and, hence, no change to the memory slice b in cache i has occurred. By
Lemma 8.49 (stable master) we have

aca(i)t = aca(i)s(i,k) .

In cycle s(i, k), the automaton of cache i was in state m0. Hence, in the
previous cycle we had:

wait(i)s(i,k)−1 ∧ (4)(i)s(i,k)−1 ∨ flush(i)s(i,k)−1 .

The tag RAMs are not updated on both of these transitions, as well as in
the time interval [s(i, k) : t − 1]. Hence,

ca(i).tag s(i,k)−1 = ca(i).tag s(i,k) = ca(i).tag t .

Thus, if the tags did not match in cycle t, then they also did not match in
cycle s(i, k) − 1:
ca(i).tag s(i,k)−1 (a.c) ≠ a.t .
Hence, independently of whether the automaton of cache i was in state
wait or in state flush in cycle s(i, k) − 1, we get by (A) and (HW):

ca(i).ss(i,k) (a.c) = I .

This, in turn, implies


aca(i).ss(i,k) (b) = I .
• sw(i)t . In this state ports b of state and data RAMs are updated with the
address badin(i)t . In the abstract cache i this update can only modify the
line
a = ca(i).tag t (badin(i)t .c) ◦ badin(i)t .c .

By Lemmas 8.22, 8.46 (sync, global end cycle) there exists cache j and
access (j, k), s.t., w(j)t and

P (j, k, pa(j)t , t) .

Hence, we only need to show that

a = pa(j)t

and we are done. For slave i we get by the data transfer lemmas:

badin(i)t = pa(j)t .

By Lemma 8.22 (sync), we know that there was a bus hit on slave i in
cycle s(j, k) + 1:
bhit(i)s(j,k)+1 .
Applying data transfer lemmas again this gives us

ca(i).tag s(j,k)+1 (pa(j)t .c) = pa(j)t .t .

Since local write accesses never overwrite tags (global and flush accesses
in the interval [s(j, k) : t] can not occur at cache j by Lemmas 8.17 and
8.14 (grant at warm, grant unique)), we conclude

ca(i).tag t (pa(j)t .c) = pa(j)t .t .

• localw(i)t . In this state ports a of data and state RAMs are updated with
the address pa(i)t . In the abstract cache i this update can only modify the
line
a = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .
By Lemma 8.47 (local end cycle) there must be an access acc(i, k) ending
in cycle t:
P (i, k, pa(i)t , t) .
Hence, we only have to show that

a = pa(i)t .

This is very easy. By Lemma 8.53 (stable local) and by (A) we get

aca(i).st (pa(i)t ) ≠ I .

Hence,
ca(i).tag t (pa(i)t .c) = pa(i)t .t .

• flush(i)t ∧ m0(i)t+1 . In this case port a of the state RAM is updated with
the address pa(i)t . In the abstract cache i this update can only modify the
line
a = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .
By definition, there is a flush access acc(i, k) ending in cycle t:

P (i, k, badout(i)t , t) .

By (A), (HW), and by Lemma 8.49 (stable master) we conclude

badout(i)t .c = pa(i)t .c
badout(i)t .t = ca(i).tag t (pa(i)t .c) .

Hence,
a = badout(i)t .
• wait(i)t ∧ m0(i)t+1 ∧ swa(i)t . Port a of the state RAM is updated with
the address pa(i)t .c. In the abstract cache i this update can only modify
the line
a = ca(i).tag t (pa(i)t .c) ◦ pa(i)t .c .
By definition, there is a flush access (i, k) to address a ending in cycle t

P (i, k, a, t) .


The next two lemmas make sure that (i) the state of a slave does not change
after it computes its response and until the global access ends and (ii) that
the slave decision to participate or not participate in the transaction stays
stable during the entire access.
Lemma 8.58 (stable slaves). Assume SINV (t). Let acc(i, k) be an access
to address a = acc(i, k).a ending in cycle t = e(i, k) in state w(i)t . Let X ∈
{s, data} and q ∈ [s(i, k) + 2 : t] then

∀j ≠ i : aca(j).X q (a) = aca(j).X t (a) .

Proof. By Lemmas 8.52 and 8.54 (overlapping accesses with global, overlap-
ping accesses with flush) we know that no local write or flush access to address
a can end in slave j in cycles q ∈ [s(i, k) + 2 : t]. Hence, we conclude the proof
by Lemma 8.57 (unchanged cache lines). 


Lemma 8.59 (stable slave decision). Assume SINV (t). Let acc(i, k) be
an access to address a = acc(i, k).a ending in cycle t = e(i, k) in state w(i)t .
Then
∀j ≠ i : bhit(j)s(i,k)+1 = bhit(j)t .

Proof. By Lemma 8.30 (badout), we know that the address is put on the bus
by the master during the entire hot phase. By (A), we know that the register
badout(i) is not clocked during the hot phase. Hence, for all q ∈ [s(i, k) : t]
b.adq = a .
By Lemma 8.22 (sync) for all slaves j ≠ i we have
s1(j)s(i,k)+1 .
By Lemma 8.58 (stable slaves) we get for cycle q ∈ [s(i, k) + 2 : t] and RAMs
X ∈ {s, data}:
aca(j).X q (a) = aca(j).X t (a) .
Now, we consider cases on whether there is a bus hit in state s(i, k) + 1 or
not:
• ¬bhit(j)s(i,k)+1 . Hence,

aca(j).ss(i,k)+1 (a) = I .

As a result, there can be no local write access to address a on cache j


ending at cycle s(i, k) + 1 (that would give a contradiction by (A) and
by Lemmas 8.47 and 8.53 (local end cycle, stable local)). Hence, with the
help of Lemma 8.57 (unchanged cache lines) we conclude

aca(j).ss(i,k)+1 (a) = aca(j).ss(i,k)+2 (a)


= aca(j).st (a)
=I .

This implies ¬bhit(j)t .


• bhit(j)s(i,k)+1 . Hence,

aca(j).ss(i,k)+1 (a) ≠ I .

If there is no access to address a on cache j ending at cycle s(i, k) + 1,


then we again apply Lemma 8.57 (unchanged cache lines) to get

aca(j).ss(i,k)+2 (a) = aca(j).ss(i,k)+1 (a) .

If there is such an access, then by (A) and (HW) we conclude

aca(j).ss(i,k)+2 (a) = M
and
aca(j).st (a) = M
≠ I .

This implies bhit(j)t .





Now we can state the crucial lemmas that guarantee stability of decisions to
go for a global or a local transaction.

Lemma 8.60 (stable local decision). Let access acc(i, k) end in cycle t =
e(i, k). Let in cycle d(i, k) the decision for a local read or a local write hold.
Then the same decision holds in cycle t:

∀x ∈ {r, w} : xlocal(i, k, acad(i,k) ) → xlocal(i, k, acat ).

Proof. Expand the definition of rlocal or wlocal and apply Lemma 8.53 (stable
local). 


Lemma 8.61 (stable global decision). Assume SINV (t). Let acc(i, k) be
a global access, i.e., in cycle d(i, k) we have global(i, k, acad(i,k)). Then we
could have reached the same decision in cycle t = e(i, k):

global(i, k, acad(i,k)) → global(i, k, acat) .

Proof. Assume global(i, k, acad(i,k)) and let a = acc(i, k).a. We expand the
definition of global and observe

global(i, k, aca) ≡ ¬acc(i, k).f ∧ (aca(i).s(a) = I ∨ aca(i).s(a) ∈ {S, O}


∧ (acc(i, k).w ∨ acc(i, k).cas ∧ test(acc(i, k), aca(i).data(a)))) .

By Lemma 8.49 (stable master) we have

aca(i)s(i,k) = aca(i)e(i,k) .

Hence, all we need to show is

global(i, k, acas(i,k) ) .

By Lemma 8.46 (global end cycle) and the definition of s(i, k) , we have

m0(i)s(i,k) ∧ w(i)e(i,k) .

If a flush access ends in cycle s(i, k) − 1:

flush(i)s(i,k)−1 ∨ wait(i)s(i,k)−1 ∧ swa(i)s(i,k)−1 ,

then we have by (A) and (HW)

aca(i).ss(i,k) (a) = I ,

which obviously implies global(i, k, acas(i,k)).


If in cycle s(i, k) − 1 no flush access ends, then we have

ca(i).tag s(i,k)−1 (a.c) = a.t .



We consider cycles q ∈ [d(i, k) : s(i, k) − 1] when the master is in state idle or


wait:
idle(i)q ∨ wait(i)q .
The tag RAM of cache i is not updated during these cycles:

ca(i).tag q (a.c) = ca(i).tag s(i,k)−1 (a.c)


= ca(i).tag s(i,k) (a.c)
= ca(i).tag d(i,k) (a.c) .

The state RAM of cache i is updated only if there is another global access
ending in q and if cache i is participating in that access, i.e., if sw(i)q holds.
We split cases on the state of line a in cycle d(i, k):
• ca(i).sd(i,k) (a.c) = I. Hence, line a is invalid in aca(i). We show by con-
tradiction that this line stays invalid until s(i, k). Let q be the first cycle
when line a gets updated:

q = min{t | t ≥ d(i, k) ∧ t < s(i, k) ∧ sw(i)t ∧ badin(i)t .c = a.c} .

Then by (A) there exists cache j with w(j)q and there exists an access
(j, k′ ) ending at cycle q. Applying data transfer lemmas and Lemma 8.30
(badout) we get

acc(j, k′ ).a.c = pa(j)q .c = a.c = b.adq .c = b.ads(j,k′)+1 .c .

By (A), slave i had a bus hit in cycle s(j, k′ ) + 1. Hence, by Lemma 8.59
(stable slave decision) it also has a bus hit in cycle q. This implies

ca(i).sq (a.c) ≠ I

and gives a contradiction. Hence, we can conclude for this case

aca(i).ss(i,k) (a) = I .

• ca(i).sd(i,k) (a.c) ∈ {S, O}. In this case, by (HW) and by the construction
of circuit C2, any update to the state RAM can only change the state of
the line to S, O, or I. Hence, we can conclude for this case

aca(i).ss(i,k) (a) ∈ {S, O, I} .

If acc(i, k) is not a CAS access, then this already gives us

global(i, k, acas(i,k) ) .

If acc(i, k) is a CAS access and we are not in flush in cycle s(i, k) − 1, then
we have
wait(i)d(i,k) ∧ s(i, k) = d(i, k) + 1 .

By Lemma 8.50 (last cycle of wait) we get

aca(i)d(i,k) = aca(i)s(i,k) ,

which implies
global(i, k, acas(i,k) ) .



An important consequence is a reformulation of predicate P (i, k, a, t).


Lemma 8.62 (reformulation of P ). Assume SINV (t). Then,

P (i, k, a, t) ≡ e(i, k) = t ∧ acc(i, k).a = a ∧ ¬rlocal(i, k, acat ) .

Proof. We have to show that

rlocal(i, k, acad(i,k) ) ↔ rlocal(i, k, acat ) .

For the implication from left to right, we apply Lemma 8.60 (stable local
decision) and get the proof.
For the other direction, we prove by contradiction. Let

rlocal(i, k, acat ) ∧ ¬rlocal(i, k, acad(i,k) ) .

Then we have

wlocal(i, k, acad(i,k) ) ∨ global(i, k, acad(i,k)) .

By Lemmas 8.60 and 8.61 (stable local decision, stable global decision) we get

wlocal(i, k, acat ) ∨ global(i, k, acat) ,

which gives a contradiction to rlocal(i, k, acat ). 



With this definition at hand, we conclude a carefully phrased technical lemma.

Lemma 8.63 (unchanged memory slices).


Let SINV (t) hold. Then the following is true:
1. Memory system slice a is only changed in cycle t if P (i, k, a, t) holds for
some (i, k):

(∀(i, k) : ¬P (i, k, a, t)) → Π(ms(ht+1 ), a) = Π(ms(ht ), a) .

2. If access (i, k) ends in t and P (i, k, a, t) does not hold, then in the atomic
protocol access acc(i, k) applied to port i does not change slice a of mem-
ory system ms(ht ):

t = e(i, k) ∧ ¬P (i, k, a, t) → Π(δ1 (ms(ht ), acc(i, k), i), a) = Π(ms(ht ), a) .



3. At most one access ending in cycle t can change slice a both in the hard-
ware computation and in the atomic protocol:

P (i, k, a, t) ∧ P (r, s, a, t) → (i, k) = (r, s) .

4. If P (i, k, a, t) holds and access (i, k) is not a global access, then the content
of abstract cache j ≠ i for address a is not changed at cycle t. Let X ∈
{data, s}, then

j ≠ i ∧ P (i, k, a, t) ∧ ¬w(i)t → aca(j).X t+1 (a) = aca(j).X t (a) .

5. If P (i, k, a, t) holds and access (i, k) does not end in state flush, then the
content of the main memory for address a is not changed at cycle t:

P (i, k, a, t) ∧ ¬flush(i)t → mmt+1 (a) = mmt (a) .

Proof. We prove the statements one by one.


1. By contradiction. Assume for some a

Π(ms(ht+1 ), a) ≠ Π(ms(ht ), a).

There are two cases possible:
• ∃i : aca(i).X t+1 (a) ≠ aca(i).X t (a). Then we get a contradiction by
Lemma 8.57 (unchanged cache lines).
• mmt+1 (a) ≠ mmt (a). By (A) and (HW) this is only possible when
there exists cache i such that

flush(i)t ∧ m0(i)t+1 .

Hence, there is a flush access ending in cycle t:

P (i, k, badout(i)t , t) .

By Lemma 8.40 (flush transfer) we conclude that the only address a


that is modified in the memory is a = badout(i)t , and get a contradic-
tion.
2. If access (i, k) ends in cycle t but the predicate P (i, k, a, t) does not hold,
then we have two options:
• access (i, k) is either a local read or a delayed local read. By Lemma
8.62 (reformulation of P ) we get

rlocal(i, k, acat ) .

By part 1 of Lemma 8.12 (properties one step), this implies

δ1 (ms(ht ), acc(i, k), i) = ms(ht ) .



• access (i, k) is performed to an address different from a:

acc(i, k).a ≠ a.

By part 2 of Lemma 8.12, we conclude

Π(δ1 (ms(ht ), acc(i, k), i), a) = Π(ms(ht ), a) .

3. The proof immediately follows by Lemma 8.56 (simultaneously ending


accesses).
4. By contradiction. Let
P (i, k, a, t) ∧ ¬w(i)t ,
and let for some j ≠ i and for some a

aca(j).X t+1 (a) ≠ aca(j).X t (a) .

By Lemma 8.57 (unchanged cache lines), a cache line can change in cycle
t only in two cases:

∃k′ : P (j, k′ , a, t) ∨ ∃r ≠ j : P (r, k′ , a, t) ∧ w(r)t . (19)

By part 3 of the lemma we are now proving, we conclude that there are
no other accesses to address a ending in cycle t:

∀r ≠ i : ∀k′ : ¬P (r, k′ , a, t) .

Hence, for cache j = i we get

∀k′ : ¬P (j, k′ , a, t) .

Since (i, k) is the only access to address a ending in cycle t and it does
not end in state w, we conclude

∀r, k′ : ¬(P (r, k′ , a, t) ∧ w(r)t ) .

and get a contradiction to (19).


5. By contradiction. Let mmt+1 (a) ≠ mmt (a). By (A) and (HW) this is only
possible when there exists cache j ≠ i such that

flush(j)t ∧ m0(j)t+1 .

Hence, there is a flush access ending in cycle t:

P (j, k  , badout(j)t , t) .

By Lemma 8.40 (flush transfer) we conclude that the only address that
is modified in the memory is badout(j)t . Hence a = badout(j)t , and we get a
contradiction by part 3 of the lemma we are now proving.

8.5.8 Relation with the Atomic Protocol

We are now ready to establish a crucial simulation result between the se-
quential computation of the atomic protocol and the hardware computation.
Essentially it states that an access acc(i, k) of the hardware computation end-
ing in cycle t has the same effect as the same access acc(i, k) applied to port
i and memory system ms(ht ) of the atomic protocol.

Lemma 8.64 (1 step). Assume SINV (t). Then



1. Π(ms(ht+1 ), a) = Π(δ1 (ms(ht ), acc(i, k), i), a)   if P (i, k, a, t)
                     Π(ms(ht ), a)                      otherwise ,
2. let acc(i, k).r ∨ acc(i, k).cas and e(i, k) = t then

pdout(i)t = pdout1(ms(ht ), acc(i, k), i) .

Proof. The second statement is trivial for local reads and delayed local reads.
We will show the second statement for other accesses together with the first
statement.
By part 1 of Lemma 8.63 (unchanged memory slices) Π(ms(ht ), a) only
changes in cycles t + 1 following cycles t when a flush, a local write access, or
a global access with address a ends. Thus, for ¬∃(i, k) : P (i, k, a, t) there is
nothing left to show.
Next, we observe by part 3 of Lemma 8.63 (unchanged memory slices) that
in any cycle t there is at most one access acc(i, k) satisfying the conditions of
the predicate P (i, k, a, t) for any given address a. Thus, the statement of the
lemma is well defined. By definition of e(i, k) and automata construction we
have
localw(i)t ∨ w(i)t ∨ ((f lush(i)t ∨ wait(i)t ) ∧ m0(i)t+1 ) .
Now we split cases on the kind of access to address a ending in cycle t:
• Access acc(i, k) ends in state localw. Hence, s(i, k)+ 1 = t and by Lemmas
8.47, 8.60 (local end cycle, stable local decision) we have

local(i, k, acat) .

By Lemma 8.53 (stable local) we have

aca(i)s(i,k) = aca(i)s(i,k)+1 .

By part 4 of Lemma 8.63 (unchanged memory slices) we get for X ∈ {s, data}:

∀j ≠ i : aca(j).X t+1 (a) = aca(j).X t (a)
and by part 5 of the same lemma we get

mmt+1 (a) = mmt (a) .



In state localw we write to cache address pa(i).c via ports a of the data
and the state RAMs. The tag RAM is not updated. The writes via port a
always have precedence over the writes via port b. Hence, even if there was
a write to pa(i).c via port b of the data or the state RAM, it would not
have any effect (below, in the proof for the global accesses, we show that
simultaneous writes via ports a and b to the same address never occur).
Hence, by (A), (HW), and the definition of the one step protocol we can
conclude:
Π(ms(ht+1 ), a) = Π(δ1 (ms(ht ), acc(i, k), i), a) .
In case of a CAS access we also have by automata and hardware construc-
tion:
pdout(i)t = aca(i).datat (a) = pdout1(ms(ht ), acc(i, k), i) .
• Access acc(i, k) ends in state w. By Lemmas 8.46 and 8.61 (global end
cycle, stable global decision) we have
global(acat, acc(i, k), i) .
By part 5 of Lemma 8.63 (unchanged memory slices) we know that the
memory content for address a is unchanged:
mmt+1 (a) = mmt (a) .
Using Lemma 8.49 (stable master) we get
∀q ∈ [s(i, k) : t] : aca(i)q = aca(i)t .
Moreover, by Lemmas 8.50 and 8.51 (last cycle of wait, last cycle of flush)
we have
aca(i).ss(i,k)−1 (a) = aca(i).st (a) .
By Lemma 8.58 (stable slaves) we get for slaves j = i, cycle q ∈ [s(i, k)+2 :
t], and RAMs X ∈ {s, data}:
aca(j).X q (a) = aca(j).X t (a) .
Using the protocol transfer lemmas and the stability of processor inputs
we get for all slaves j ≠ i:

mprotin(j)s(i,k)+1 = mprotout(i)s(i,k)
                   = C1(aca(i).s(a)s(i,k)−1 , ptype(i)s(i,k)−1 )
                   = C1(aca(i).s(a)t , ptype(i)t )
sprotout(j)s(i,k)+2 = C2(aca(j).ss(i,k)+2 (a), mprotin(j)s(i,k)+1 ).(ch, di)
                    = C2(aca(j).st (a), mprotout(i)t ).(ch, di)
sprotin(i)s(i,k)+3 = ⋁j sprotout(j)s(i,k)+2
                   = ⋁j C2(aca(j).st (a), mprotout(i)t ).(ch, di) .

By Lemma 8.22 (sync) all slaves in cycle t are either in state sidle (if they
do not participate in the transaction) or are in state sw (if they participate
in the transaction). For participating slaves we have by Lemmas 8.22 and
8.59 (sync, stable slave decision):

bhit(j)t .

Hence,
ca(j).tag t (a.c) = a.t .
For not participating slaves by the same arguments we get

¬bhit(j)t

and
aca(j).st (a) = I .
By part 3 of Lemma 8.63 (unchanged memory slices) we know that no
other global, flush or local write accesses to address a end in cycle t. If
some access to address b ≠ a, where b.c = a.c, on cache j ends in cycle t,
then this access cannot be a flush or a global access by Lemmas 8.17 and
8.14 (grant at warm, grant unique). If it is a local write, then by Lemmas
8.47 and 8.53 (local end cycle, stable local) and by (A) we get:

ca(j).tag t+1 (b.c) = ca(j).tag t (a.c)
                   = b.t
                   ≠ a.t .

Hence, such an access can occur only on a non-participating cache j, and
for such a cache it then holds:

aca(j).st+1 (a) = I .

For participating slaves j, possible updates to port a of state RAMs (ports


a of other RAMs are not clocked at all) can only be done to cache addresses
different from a.c. Hence, they do not interfere with the updates performed
through ports b by the slave automata. The statement

Π(ms(ht+1 ), a) = Π(δ1 (ms(ht ), acc(i, k), i), a)

now follows from the data transfer lemmas. If acc(i, k) is a read or a CAS
access, the statement

pdout(i)t = pdout1(ms(ht ), acc(i, k), i)

also follows from the data transfer lemmas.



• If a flush access acc(i, k) ends in cycle t = e(i, k) then we have


f lush(i)t ∨ wait(i)t .
In cycle t we write to cache address pa(i).c via port a of the state RAM.
Ports a of the data and the tag RAMs are not updated. By Lemmas 8.17
and 8.14 (grant at warm, grant unique) we know that no global access can
end in cycle t. Hence, ports b of the RAMs are also not updated in cycle t.
We now split cases on whether the access ends in state f lush or in state
wait:
– f lush(i)t . By (A) and (HW) we have
aca(i).ss(i,k)−1 (a) ∈ {M, O} ∧ aca(i).st+1 (a) = I.
By (A), by (HW), and by Lemma 8.40 (flush transfer) we conclude
mmt+1 (a) = mmt+1 (badout(i)e(i,k) )
= mmt+1 (badout(i)s(i,k) )
= bdout(i)s(i,k)
= ca(i).datas(i,k)−1 (a.c)
= aca(i).datas(i,k)−1 (a) .
By (A) and by Lemmas 8.50, 8.49 (last cycle of wait, stable master)
we conclude:
∀q ∈ [s(i, k) − 1 : e(i, k)] : aca(i)q = aca(i)t .
Hence, we have
aca(i).st (a) ∈ {M, O}
aca(i).st+1 (a) = I
mmt+1 (a) = aca(i).datat (a)
and conclude the statement
Π(ms(ht+1 ), a) = Π(δ1 (ms(ht ), acc(i, k), i), a)
by part 4 of Lemma 8.63 (unchanged memory slices).
– wait(i)t . By (A) and (HW) we have
aca(i).st (a) ∈ {E, S} ∧ aca(i).st+1 (a) = I .
The memory content for address a is unchanged by part 5 of Lemma
8.63 (unchanged memory slices):
mmt+1 (a) = mmt (a) .
Again, by part 4 of Lemma 8.63 (unchanged memory slices) we con-
clude
Π(ms(ht+1 ), a) = Π(δ1 (ms(ht ), acc(i, k), i), a) .



8.5.9 Ordering Hardware Accesses Sequentially

Recall that the set of accesses to address a ending in cycle t is denoted by

E(a, t) = {(i, k) | e(i, k) = t ∧ acc(i, k).a = a} .

The set E(t) of all accesses ending in cycle t we define as



E(t) = {(i, k) | e(i, k) = t} = ⋃a E(a, t) .

Then #E(t) is the number of accesses ending in cycle t, and the number NE (t)
of accesses that have ended before cycle t is defined by

NE (0) = 0
NE (t + 1) = NE (t) + #E(t) .

We number accesses acc(i, j) according to their end time and accesses with
the same end time arbitrarily. Thus, accesses ending before t get sequential
numbers [0 : NE (t) − 1] and accesses ending at t get numbers from set Q(t) =
[NE (t) : NE (t + 1) − 1]. Thus,

seq(E(0)) = [0 : NE (1) − 1]
seq(E(t)) = [NE (t) : NE (t + 1) − 1] .

If a flush access and one or more local reads to the same address end in cycle
t, we order the flush access last:

(i, k), (i′, k′) ∈ E(a, t) ∧ acc(i, k).f → seq(i′, k′) < seq(i, k) . (20)

The resulting sequentialized access sequence acc′ is defined by

acc′[seq(i, k)] = acc(i, k) .

The sequence is of corresponding port indices is defined by

is[seq(i, k)] = i .
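To illustrate the sequentialization, the following minimal Python sketch builds seq, acc′, and is from a table of accesses; the record layout (end cycle, address, flush flag) and the per-address flush-last tie-break are assumptions of the sketch, not part of the hardware construction.

```python
from collections import defaultdict

def sequentialize(accesses):
    """accesses maps (i, k) to a triple (end_cycle, address, is_flush).
    Returns seq, the sequentialized access list acc', and the port index list is."""
    by_cycle = defaultdict(list)
    for (i, k), (e, a, f) in accesses.items():
        by_cycle[e].append((i, k, a, f))

    seq, acc_prime, is_prime = {}, [], []
    for t in sorted(by_cycle):
        # accesses ending in the same cycle get consecutive numbers; per address,
        # a flush access is ordered after the (delayed) local reads, cf. (20)
        for (i, k, a, f) in sorted(by_cycle[t], key=lambda x: (x[2], x[3])):
            seq[(i, k)] = len(acc_prime)
            acc_prime.append(accesses[(i, k)])   # acc'[seq(i, k)] = acc(i, k)
            is_prime.append(i)                   # is[seq(i, k)] = i
    return seq, acc_prime, is_prime
```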

We can now relate the hardware computation with the computation of the
atomic protocol and show that the state invariants hold for the hardware
computation.
Lemma 8.65 (relating hardware with atomic protocol). The following
statements hold for cycle t and hardware configuration ht :
1. The first NE (t) sequential atomic accesses lead exactly to the same ab-
stract memory system configuration ms as the first t cycles of the hardware
computation:

ms(ht ) = Δ1^{NE(t)} (ms(h0 ), acc′[0 : NE (t) − 1], is[0 : NE (t) − 1]) .

2. The state invariants hold until cycle t:

SINV (t) .

3. The memory abstraction after the first t cycles equals the memory ab-
straction after NE (t) sequential atomic memory accesses:

m(ht ) = ΔM^{NE(t)} (m(h0 ), acc′[0 : NE (t) − 1]) .

Proof. By induction t → t + 1. For t = 0 the first statement is trivial. After


reset, we have for all a and i:

aca(i).s0 (a) = I

and we have sinv(ms(h0 )) by Lemma 8.7.


For the induction step, we assume that the lemma holds for t and consider
accesses ending in cycle t. For x ∈ [1 : #E(t)] we set

nx = NE (t) + x − 1 .

Then we have

seq(E(t)) = [NE (t) : NE (t + 1) − 1] = {nx | x ∈ [1 : #E(t)]} .

For x ∈ [1 : #E(t)] we define the pair (ix , kx ) of indices by

seq(ix , kx ) = nx .

Then,
acc(ix , kx ) = acc′[nx ] and ix = is[nx ] .
We also define a sequence of memory system configurations msx by

ms0 = ms(ht )
x > 0 → msx = δ1 (msx−1 , acc′[nx ], ix ) .

Using the induction hypothesis and Lemma 8.8 we get

msx = Δ1^x (ms0 , acc′[NE (t) : nx ], is[NE (t) : nx ])
    = Δ1^x (ms(ht ), acc′[NE (t) : nx ], is[NE (t) : nx ])
    = Δ1^x (Δ1^{NE(t)} (ms(h0 ), acc′[0 : NE (t) − 1], is[0 : NE (t) − 1]),
            acc′[NE (t) : nx ], is[NE (t) : nx ])
    = Δ1^{NE(t)+x} (ms(h0 ), acc′[0 : nx ], is[0 : nx ]) .                    (21)

For x = #E(t) this gives

ms#E(t) = Δ1^{NE(t+1)} (ms(h0 ), acc′[0 : NE (t + 1) − 1], is[0 : NE (t + 1) − 1]) .

By part 2 of the induction hypothesis the state invariants hold for ms0 :

SINV (t) → sinv(ms0 ) .

Using Lemma 8.9 we conclude by induction that the state invariants hold for
all memory systems msx under consideration:

∀x ∈ [1 : #E(t)] : sinv(msx ) .

We proceed to characterize the slices Π(msx , a) as a function of a and x. We


split cases:
• ∀x : ¬P (ix , kx , a, t). Then by part 1 of Lemma 8.63 (unchanged memory
slices), slice a does not change:

Π(ms#E(t) , a) = Π(msx , a) = Π(ms0 , a) .

• ∃x : P (ix , kx , a, t). By part 3 of Lemma 8.63 index x is unique:

∀y ≠ x : ¬P (iy , ky , a, t) .

By part 2 of Lemma 8.63, no other access acc′[ny ] with y ≠ x ending in


cycle t changes slice a in the atomic protocol:

Π(msx−1 , a) = Π(ms0 , a)
Π(ms#E(t) , a) = Π(msx , a) .

Using part 3 of Lemma 8.12 we conclude

Π(ms#E(t) , a) = Π(msx , a)
= Π(δ1 (msx−1 , acc′[nx ], ix ), a)
= Π(δ1 (ms0 , acc′[nx ], ix ), a) .

Using the definition of ms0 this can be summarized in



Π(ms#E(t) , a) = Π(δ1 (ms(ht ), acc′[nx ], ix ), a)   if ∃x : P (ix , kx , a, t)
                 Π(ms(ht ), a)                        otherwise .

Using the definition of acc [nx ] and part 1 of Lemma 8.64 (1 step) we conclude

Π(ms#E(t) , a) = Π(δ1 (ms(ht ), acc(ix , kx ), ix ), a)   if ∃x : P (ix , kx , a, t)
                 Π(ms(ht ), a)                            otherwise
               = Π(ms(ht+1 ), a) .

Hence,
ms#E(t) = ms(ht+1 ) .

This concludes the first and the second statements.


For the third statement, we conclude by Lemma 8.11 and by (21):
m(ht+1 ) = m(ms(ht+1 ))
         = m(ms#E(t) )
         = m(Δ1^{NE(t+1)} (ms(h0 ), acc′[0 : NE (t + 1) − 1], is))
         = ΔM^{NE(t+1)} (m(h0 ), acc′[0 : NE (t + 1) − 1]) .



8.5.10 Sequential Consistency

In Sect. 8.2.5 we claimed that a memory system is sequentially consistent if


for read or CAS accesses (i, k) we have

msdout(ms, acc, i, k) = ΔM^{seq(i,k)} (m(ms), acc′)(acc(i, k).a) .

In our construction the answer of the memory system to a read or a CAS access
is defined as

acc(i, k).r ∨ acc(i, k).cas → msdout(ms(h0 ), acc, i, k) = pdout(i)e(i,k) .

Hence, we can rewrite the definition of sequential consistency as

acc(i, k).r ∨ acc(i, k).cas →
pdout(i)e(i,k) = ΔM^{seq(i,k)} (m(ms(h0 )), acc′)(acc(i, k).a)
               = ΔM^{seq(i,k)} (m(h0 ), acc′)(acc(i, k).a) .
With the notation from the proof of Lemma 8.65 (relating hardware with
atomic protocol) this is transformed to
acc(ix , kx ).r ∨ acc(ix , kx ).cas →
pdout(ix )t = ΔM^{nx} (m(h0 ), acc′)(acc(ix , kx ).a) .
In the next lemma we show that the answer of the memory system produced
by the hardware at port ix is the content of the memory system msx−1 . We use
this result in Lemma 8.67 which asserts sequential consistency of the hardware
memory. In both lemmas we stick to the notation from Lemma 8.65 (relating
hardware with atomic protocol).
Lemma 8.66 (almost sequentially consistent). Let acc(ix , kx ) be a read
or a CAS access with address a ending in cycle t, i.e., we have acc(ix , kx ).r ∨
acc(ix , kx ).cas and acc(ix , kx ).a = a. Then the answer pdout produced by the
hardware at port ix in cycle t is the content of the memory system msx−1 at
address a:
pdout(ix )t = m(msx−1 )(a) .

Proof. By Lemma 8.65 (relating hardware with atomic protocol) we know


that the state invariants hold up to cycle t:

SINV (t) .

We consider two cases:


• #E(a, t) = 1. No other access with address a ends at t. Hence,

∀y ≠ x : ¬P (iy , ky , a, t) .

• #E(a, t) ≥ 2. By Lemma 8.56 (simultaneously ending accesses) access


acc(ix , kx ) is a local read or a delayed local read. By the ordering seq as
specified in (20) we get

∀y < x : ¬P (iy , ky , a, t) .

As in the proof of Lemma 8.65 (relating hardware with atomic protocol) we


conclude
Π(ms0 , a) = Π(msx−1 , a) .
Using part 4 of Lemma 8.12, part 2 of Lemma 8.10, and Lemma 8.64 (1 step)
we get

pdout(ix )t = pdout1(ms0 , acc(ix , kx ), ix )        (Lemma 8.64)
            = pdout1(msx−1 , acc(ix , kx ), ix )      (Lemma 8.12)
            = m(msx−1 )(a) .                          (Lemma 8.10)




Lemma 8.67 (sequential consistency). The hardware memory is sequen-


tially consistent. Let e(ix , kx ) = t and acc(ix , kx ).r ∨ acc(ix , kx ).cas. Then

pdout(ix )t = ΔM^{nx} (m(h0 ), acc′)(acc(ix , kx ).a) .

Proof. Using Lemma 8.66, (21), Lemma 8.11, and recalling the definition
m(h) = m(ms(h)) of the hardware memory we get

pdout(ix )t = m(msx−1 )(acc(ix , kx ).a)
            = m(Δ1^{NE(t)+x−1} (ms(h0 ), acc′, is))(acc(ix , kx ).a)
            = m(Δ1^{nx} (ms(h0 ), acc′, is))(acc(ix , kx ).a)
            = ΔM^{nx} (m(ms(h0 )), acc′)(acc(ix , kx ).a)
            = ΔM^{nx} (m(h0 ), acc′)(acc(ix , kx ).a) .




8.5.11 Liveness

Having fairness of the bus arbiter (Lemma 8.13) and liveness of the main
memory (Sect. 3.5.6), the liveness proof of the shared memory construction
becomes trivial. Assuming stability of processor inputs defined in Sect. 8.4.1,
we can also guarantee that signal mbusy is off when there is no processor
request. This simple but important property of the memory system is very
helpful when we show liveness of the multi-core processor in Sect. 9.3.9. We
state the desired properties in the following lemma and leave the proof of the
lemma as an easy exercise for the reader.
Lemma 8.68 (liveness of shared memory).

1. preq(i)t → ∃t′ ≥ t : ¬mbusy(i)t′ ,
2. ¬preq(i)t → ¬mbusy(i)t .
Note that the mbusy signal from cache i is guaranteed to eventually go away
even if for some cache j = i signal preq(j) always stays high. This is the case
because the master automaton of cache j, if it is in the hot phase, eventually
reaches state idle where it lowers its request to the arbiter and gives up own-
ership of the bus (Sect. 8.4.5). This property of the shared memory system
is very important for us, since in the next section we construct a processor
where the preq signal to the instruction cache can possibly stay high until the
mbusy signal for the data cache goes away. As a result, when proving liveness
of that construction (Lemma 9.18) we heavily rely on the fact that a memory
access to the data cache eventually ends (i.e., the mbusy signal from the data
cache goes away), even if the processor request to the instruction cache stays
high for the duration of the entire access to the data memory.
9
A Multi-core Processor

We finally are able to specify a multi-core MIPS machine, build it, and show
that it works. Clearly the plan is to take pipelined MIPS machines from
Chap. 7 and connect them to the shared memory system from Chap. 8. Before
we can do this, however, we have to address a small technical problem: the
pipelined machine was obtained by a transformation from a sequential refer-
ence implementation, and that machine does not have a compare-and-swap
operation. Thus, we have to add an introductory Sect. 9.1, where we augment
the sequential instruction set with a compare-and-swap instruction. This turns
out to be an instruction with 4 register addresses, where we accommodate the
fourth address in the sa field of an R-type instruction. In order to process
such instructions in a pipelined fashion we now also need in the sequential
reference implementation the ability to read three register operands and to
write one register operand in a single hardware cycle. If we would have treated
interrupts here, we would have a special purpose register file as in [12], and
we could take the third read operand from there. Here, we simply add a third
read port to the general purpose register file using technology from Chap. 4.
In Sect. 9.2 we specify the ISA of multi-core MIPS and give a reference
implementation with sequential processors. Like the hardware, ISA and ref-
erence implementation can be made completely deterministic, but in order to
hide implementation details they are modelled for the user in a nondetermin-
istic way: processors execute instructions one at a time; there is a stepping
function s specifying for each time n the processor s(n) executing an instruc-
tion at step n. We later derive this function from the implementation, but
the user does not know it; thus, programs have to work for any such stepping
function. Specifying multi-core ISA turns out to be very easy: we split the
sequential MIPS configuration into i) memory and ii) processor (everything
else). A multi-core configuration has a single (shared) memory component and
multiple processor components. In step n an ordinary sequential MIPS step
is executed with processor s(n) and the shared memory.
In the multi-core reference implementation, we hook the sequential refer-
ence processors to a memory (which now has to be able to execute compare-


and-swap operations). We do not bother to give a hardware construction for


this memory; it suffices to use the specification from Sect. 8.2. Note, however,
that for this reference implementation of multi-core MIPS we have to gener-
alize the hardware model: we have to partition hardware into portions which
are selectively clocked under the control of a stepping function s.
In Sect. 9.3 we “simply” hook pipelined implementations of the sequential
MIPS processors into the shared memory system from Sect. 8. The generation
of processor inputs to the caches and the consumption of the answers of the
memory system are completely straightforward to implement. Unifying the
correctness proofs for pipelined processors from Chap. 7 with the lemmas
about the shared memory system from Chap. 8 is not terribly difficult any
more. It does however require the development of some technical machinery
permitting to couple local scheduling functions for the pipelined processors
with local instruction numbers of the multi-core reference implementation.
Liveness is finally shown using the machinery from Sect. 7.4.

9.1 Compare-and-Swap Instruction


9.1.1 Introducing CAS to the ISA

We start with extending the MIPS ISA from Sect. 6.2 with the compare-and-
swap operation. We define it as an R-type instruction with the function bits
being all ones:
cas(c) ≡ opc(c) = 0^6 ∧ fun(c) = 1^6 .
Recall that previously we defined the effective address of load and store in-
structions as
ea(c) = c.gpr(rs(c)) +32 sxt(imm(c)) .
In R-type instructions we do not have an immediate constant. So for the
effective memory address of a CAS instruction we simply take the value from
the GPR file addressed by the rs field:

cas(c) → ea(c) = c.gpr(rs(c)) .

All CAS operations are assumed to be word accesses:

cas(c) → d(c) = 4 .

The comparison data for a CAS operation is taken from a GPR specified by
field sa of the instruction1 :

cas(c) → cdata(c) = c.gpr(sa(c)) .

1
An alternative would be to take this value from a dedicated SPR (special purpose
register file), but we do not consider SPRs in this book.

The result of the CAS data test is then defined as


castest(c) ≡ cdata(c) = c.m4 (ea(c)) .
The data to be stored in the memory is obtained from the GPR specified by rt.
Field rd is used to specify a destination register. The resulting configuration
c after executing a CAS instruction is obtained in an obvious way:
cas(c) →

c′.m(x) = byte(i, c.gpr(rt(c)))   if x = ea(c) +32 i32 ∧ i < 4 ∧ castest(c)
          c.m(x)                  otherwise

c′.gpr(x) = c.m4 (ea(c))   if x = rd(c)
            c.gpr(x)       otherwise .
The GPR write predicate is now defined as
gprw(c) = alu(c) ∨ su(c) ∨ l(c) ∨ jal(c) ∨ jalr(c) ∨ cas(c) .
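For readers who prefer executable notation, the effect of a CAS instruction on an ISA configuration can be sketched as follows. The representation of the GPR file as a list and of the memory as a dictionary of 32-bit words, as well as the helper name cas_step, are assumptions of this illustration only.

```python
def cas_step(gpr, mem, rs, rt, rd, sa):
    """One ISA step of a CAS instruction; gpr is a list of 32-bit values,
    mem maps word-aligned addresses to 32-bit values."""
    ea = gpr[rs] & 0xFFFFFFFF          # ea(c) = c.gpr(rs(c)); no immediate is added
    assert ea % 4 == 0                 # CAS operations are word accesses (d(c) = 4)
    cdata = gpr[sa]                    # comparison data from the register named by sa
    old = mem.get(ea, 0)               # c.m4(ea(c))
    castest = (cdata == old)           # result of the CAS data test
    if castest:
        mem[ea] = gpr[rt] & 0xFFFFFFFF # store c.gpr(rt(c)) only if the test succeeds
    gpr[rd] = old                      # the destination register always receives the old word
    return castest
```

A spin lock, for instance, repeatedly executes such a step until castest holds.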

9.1.2 Introducing CAS to the Sequential Processor


For implementing CAS instructions, a couple of modifications have to be ap-
plied to the construction of a sequential reference machine.
A schematic view of the reference hardware machine with support for
CAS operations is shown in Fig. 143. The 3-port GPR RAM used in previous
designs is replaced with a 4-port GPR RAM, where ports a, b, and d are used
for reading:
A(h) = h.gpr(rs(h))
B(h) = h.gpr(rt(h))
D(h) = h.gpr(sa(h))
and port c is used for writing:

gprin(h) = lres(h)   if l(h) ∨ cas(h)
           C(h)      otherwise .
The circuit computing the input for the GPR register file is shown in Fig. 144.
Construction of a 4-port GPR RAM is done in the very same manner as the
construction of a 3-port GPR RAM and we omit it here.
Figure 145 shows a simple modification which has to be done to the effec-
tive address computation. The load mask is now computed as

⎪ 16 8 8
⎨I(h)[27] I(h)[26] 1 l(h)
lmask(h) = 132 cas(h)

⎩ 32
0 otherwise
= l(h) ∧ I(h)[27]16 I(h)[26]8 18 ∨ cas(h) ∧ 132 .

[Fig. 143. Schematic view of a simple MIPS machine supporting the CAS instruction]

[Fig. 144. Computing the data input of the GPR]

[Fig. 145. Effective address computation]

For computing the memory byte write signals in case of CAS accesses, we have
to take into consideration the result of the CAS test. Hence, we first have to
read the data from the hardware memory and only then decide whether we
need to perform a write. This is possible, because the construction of a 2-port
multi-bank RAM-ROM from Sect. 4.3.3 allows reading and writing the same
address through port b in a single cycle2 .
We now split the computation of byte write signals into two parts. First,
the environment sh4s computes the byte write signals assuming that the result
of the CAS test has succeeded. The construction of the circuit stays the same
as in the previous designs shown in Fig. 110, but the initial smask signals are
now calculated as

smask(h)[3 : 0] = I(h)[27]^2 I(h)[26] 1   if s(h)
                  1^4                     if cas(h)
                  0^4                     otherwise
                = s(h) ∧ I(h)[27]^2 I(h)[26] 1 ∨ cas(h) ∧ 1^4 .

For the shifted version of the byte write signals, this gives us

¬(cas(h) ∨ s(h)) → bw(h) = 08 .


2
For practical reasons, such a construction would be inefficient. We use it here just
to construct a “reference” machine which we use for the simulation proof of the
multi-core processor with a shared memory system.

[Fig. 146. Implementation of circuit mask4cas]

Circuit mask4cas shown in Fig. 146 first computes the signal castest, where

castest(h) ≡ (D(h) = dmout(h)[63 : 32])   if ea(h)[2] = 1
             (D(h) = dmout(h)[31 : 0])    if ea(h)[2] = 0 ,
and then uses this signal to mask all active byte write signals in case the CAS
test was not successful.
These are all the modifications one has to make to the sequential reference
hardware implementation to introduce support for the CAS instruction. Ad-
ditionally, we extend the software condition on disjoint code and data regions
to handle the CAS instructions:
ls(c) ∨ cas(c) → ea(c).l ∈ DR.
Correctness of the construction is stated in
Lemma 9.1 (MIPS with CAS correct). Let alignment hold and let code
and data regions be disjoint. Then
sim(c, h) → sim(c , h ) .
Proof. For the case when we don’t execute a CAS instruction, i.e., ¬cas(c),
we observe that the signals generated by the hardware are the same as in the
sequential MIPS machine from Chap. 6. Hence, we simply use Lemma 6.8 and
we are done.
If cas(c) holds, then we consider the cases when the CAS test succeeds and
when it fails. In both cases the proof is completely analogous to the proofs
from Chap. 6 and we omit it here. 


9.2 Multi-core ISA and Reference Implementation

9.2.1 Multi-core ISA Specification

Recall that MIPS configurations c have components c.pc, c.dpc, c.gpr, and
c.m. For the purpose of defining the programmer’s view of the multi-core
MIPS machine, we collect the first three components of c into a processor
configuration:
c.p = (c.p.pc, c.p.dpc, c.p.gpr) .
We denote by Kp the set of processor configurations. A MIPS configuration
now consists of a processor configuration and memory configuration:

c = (c.p, c.m) .

The next state function c′ = δ(c) is split into a next processor component δp
and a next memory component δm :

c′ = δ(c)
   = (c′.p, c′.m)
   = (δp (c.p, c.m), δm (c.p, c.m)) .

A multi-core MIPS configuration mc with P processors consists of the follow-


ing components:
• mc.p : [0 : P − 1] → Kp . The configuration of processor q ∈ [0 : P − 1] in
configuration mc is mc.p(q).
• mc.m is the memory shared by all processors.
We introduce a step function s : N → [0 : P − 1], which maps step numbers n
of the multi-core configuration to the ID s(n) of the processor making a step
in configuration mcn . We require the step function to be fair in the sense that
every processor q is stepped infinitely often:

∀n, q : ∃m > n : s(m) = q .

Note that this step function is unknown to the programmer; we will eventually
construct it from the hardware. Programs, thus, have to perform well for all
fair step functions.
Initially, we require

mc0 .p(q).pc = 432


mc0 .p(q).dpc = 032 .

We now define the multi-core computation (mcn ) where mcn is the configu-
ration before step n:
mcn+1 .p(x) = δp (mcn .p(x), mcn .m)   if x = s(n)
              mcn .p(x)                if x ≠ s(n)
mcn+1 .m = δm (mcn .p(s(n)), mcn .m) .

An equivalent definition is given in the following lemma.


Lemma 9.2 (multi-core computation).

(mcn+1 .p(s(n)), mcn+1 .m) = δ(mcn .p(s(n)), mcn .m)


q ≠ s(n) → mcn+1 .p(q) = mcn .p(q)
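The nondeterministic semantics can be transcribed almost literally into a small simulator; the transition functions delta_p and delta_m are passed in as parameters and are not modeled here, and the function name multicore_run is an assumption of this sketch.

```python
def multicore_run(p0, m0, delta_p, delta_m, s, steps):
    """p0: list of initial processor configurations, m0: initial shared memory,
    s: stepping function mapping step numbers to processor IDs (assumed fair)."""
    p, m = list(p0), m0
    for n in range(steps):
        q = s(n)                      # processor stepped in configuration mc^n
        p_next = delta_p(p[q], m)     # next processor component of processor q
        m = delta_m(p[q], m)          # next shared memory component
        p[q] = p_next                 # all other processor components are unchanged
    return p, m
```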

9.2.2 Sequential Reference Implementation

We define a sequential multi-core reference “implementation”. It is almost


hardware and it could easily be turned into hardware, but we don’t bother3 .
Recall that a hardware configuration h of the sequential processor had com-
ponents
h = (h.pc, h.dpc, h.gpr, h.m) .
In case the reset signal is off, the hardware construction of the sequential
processor defines a hardware transition function

h′ = δH (h) .

We collect components pc, dpc, gpr into a processor component:

h.p = (h.pc, h.dpc, h.gpr) .

and rewrite the hardware transition function as

h′ = (h′.p, h′.m) = δH (h.p, h.m) .

We define the data access acc = dacc(h) in the hardware configuration h as

acc.a = ea(h).l
acc.f = 0
acc.r = l(h)

(acc.w, acc.cas) = (s(h), cas(h))   if ea(h).l[28 : r] ≠ 0^r
                   (0, 0)           otherwise
acc.data = dmin(h)
acc.cdata = D(h)
acc.bw = bw(h) .
3
Turning our construction into a real sequential implementation would require a
scheduler and a number of multiplexors connecting the shared memory to the
processors.

In case instruction I(h) is neither a load, a store, nor a CAS instruction, all
bits f ,w, r, and cas of access dacc(h) are off and we have a void access. Recall
that a void access does not update memory and does not produce an answer.
In case a write or a CAS is performed to an address in the ROM region we
also have a void access4 .
In the same way we construct instruction fetch access acc = iacc(h) as

acc.a = ima(h)
acc.r = 1
acc.w = 0
acc.cas = 0
acc.f = 0 .
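For illustration, the two access records can be assembled from the circuit signals roughly as follows. The signal accessors bundled in sig, the record class Access, and the parameter r_bits standing for the width r of the ROM address test are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Access:
    a: int
    r: bool
    w: bool
    cas: bool
    f: bool
    data: int = 0
    cdata: int = 0
    bw: int = 0

def data_access(h, sig, r_bits):
    """dacc(h): write and CAS are suppressed (void access) inside the ROM region."""
    line = sig.ea(h) >> 3                  # ea(h).l, the line address of the effective address
    in_rom = (line >> r_bits) == 0         # simplified test for ea(h).l[28 : r] = 0^r
    w, cas = (False, False) if in_rom else (sig.s(h), sig.cas(h))
    return Access(a=line, r=sig.l(h), w=w, cas=cas, f=False,
                  data=sig.dmin(h), cdata=sig.D(h), bw=sig.bw(h))

def fetch_access(h, sig):
    """iacc(h): a plain read of the instruction memory address."""
    return Access(a=sig.ima(h), r=True, w=False, cas=False, f=False)
```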

We observe that our hardware memory of the reference implementation to-


gether with the control logic matches the specification of the sequential mem-
ory introduced in Sect. 8.2.2.
Lemma 9.3 (hardware memory is sequential). Hardware memory of the
sequential reference machine follows the semantics of the sequential memory:
1. cas(h) ∨ r(h) → dmout(h) = dataout(h.m, dacc(h)),
2. imout(h) = dataout(h.m, iacc(h)),
3. h′.m = δM (h.m, dacc(h)).

Proof. The first and the second statements we simply get by unfolding the
definitions and applying the semantics of the 2-port multi-bank RAM-ROM:

dmout(h) = h.m(ea(h).l)
= dataout(h.m, dacc(h))
imout(h) = h.m(ima(h))
= dataout(h.m, iacc(h)) .

For the third statement we have by the hardware construction and the se-
mantics of the 2-port multi-bank RAM-ROM:

h′.m(b) = modify (h.m(b), dmin(h), bw′(h))   if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ castest(h))
          h.m(b)                             otherwise

4
Note that this situation never occurs if the reference hardware computation is
simulated by an ISA computation and disjointness of data and code regions holds.
The only reason why we consider it possible here, is because we want to specify
the multi-core reference hardware before we show that it is simulated by a multi-
core ISA. Hence, at that point we cannot yet assume that there are no writes to
the ROM portion of the hardware memory.


= modify (h.m(b), dmin(h), bw(h))   if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ castest(h))
  h.m(b)                            otherwise .

For the original byte write signals generated by sh4s environment, in case of
a CAS access we have

cas(h) → bw(h) = 0^4 1^4   if ea(h)[2] = 0
                 1^4 0^4   if ea(h)[2] = 1 ,

which gives us
cas(h) → bw(h)[0] = ¬ea(h)[2] .
For the result of the CAS test, we have by construction of circuit mask4cas:

castest(h) = (D(h) = dmout(h)[63 : 32])       if ea(h)[2] = 1
             (D(h) = dmout(h)[31 : 0])        if ea(h)[2] = 0
           = (D(h) = h.m(ea(h).l)[63 : 32])   if ¬bw(h)[0]
             (D(h) = h.m(ea(h).l)[31 : 0])    if bw(h)[0]
           = test(dacc(h), h.m(ea(h).l)) .

Hence,

h′.m(b) = modify (h.m(b), dmin(h), bw(h))   if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ castest(h))
          h.m(b)                            otherwise
        = modify (h.m(b), dmin(h), bw(h))   if b = ea(h).l ∧ (s(h) ∨ cas(h) ∧ test(dacc(h), h.m(ea(h).l)))
          h.m(b)                            otherwise
        = δM (h.m, dacc(h)) .




As a result of Lemma 9.3 we can rewrite the hardware transition function as

h′ = δH (h.p, h.m)
= (δhp (h.p, h.m), δhm (h.m, dacc(h)))
= (δhp (h.p, h.m), δM (h.m, dacc(h))) .

For the definition of the multi-core reference implementation, we duplicate


the processor component h.p of the hardware for every processor ID. Thus,
the new hardware has components h.m and h.p(q) for each processor ID q.

For the case the reset signal is off, the computation (hn ) of the multi-core
reference implementation is simply defined by

hn+1 .p(s(n)) = δhp (hn .p(s(n)), hn .m)


hn+1 .m = δM (hn .m, dacc((hn .p(s(n)), hn .m)))

and
hn+1 .p(q) = hn .p(q) for q ≠ s(n) .
We state an equivalent definition in the following lemma.
Lemma 9.4 (computation of the reference machine). Assume resetn =
0, then

(hn+1 .p(s(n)), hn+1 .m) = δH (hn .p(s(n)), hn .m)


q ≠ s(n) → hn+1 .p(q) = hn .p(q) .

Recall that we assume the reset signal to be on in cycle n = −1 and to be


off afterwards. For the case when the reset signal is on, we initialize processor
components the same way as in the case of a single-core implementation:

h0 .p(q).dpc = 032
h0 .p(q).pc = 432 .

9.2.3 Simulation Relation

As in Chap. 6 we assume alignment, disjointness of code and data regions,


and, hence, no self modifying code. The basic sequential simulation relation
sim(c, h) is extended to multi-core machines by

msim(mc, h) ≡

1. (∀q : mc.p(q).pc = h.p(q).pc) ∧


2. (∀q : mc.p(q).dpc = h.p(q).dpc) ∧
3. (∀q : mc.p(q).gpr = h.p(q).gpr) ∧
4. mc.m ∼CR h.m ∧
5. mc.m ∼DR h.m .
The correctness of the reference implementation is asserted in the following
lemma.
Lemma 9.5 (correctness of the reference implementation). There is
an initial multi-core ISA configuration mc0 such that

∀n : msim(mcn , hn ) .

Proof. This is a straightforward bookkeeping exercise using Lemmas 9.2, 9.4


and the correctness of the basic MIPS processor. Assuming reset to be on in
cycle n = −1 we set

mc0 .p(q).gpr = h0 .p(q).gpr

and

∀a ∈ CR ∪ DR : mc0 .m8 (a000) = h0 .m(a)

and obtain
msim(mc0 , h0 ) .
For the induction step, we conclude for processor s(n) from the induction
hypothesis

sim((mcn .p(s(n)), mcn .m), (hn .p(s(n)), hn .m)) .

By Lemma 9.1 , i.e., the correctness of the basic sequential hardware for one
step, we conclude

sim((mcn+1 .p(s(n)), mcn+1 .m), (hn+1 .p(s(n)), hn+1 .m)) .

For processors q ≠ s(n) that are not stepped, we have by induction hypoth-
esis

mcn .p(q).pc = hn .p(q).pc


mcn .p(q).dpc = hn .p(q).dpc
mcn .p(q).gpr = hn .p(q).gpr .

By the definitions of the multi-core ISA and the reference implementation


program counters, delayed program counters, and general purpose register
files do not change for the processors which are not stepped, so we have

X ∈ {pc, dpc, gpr} → mcn+1 .p(q).X = hn+1 .p(q).X .



Lemma 9.5 obviously implies that write or CAS accesses to the ROM portion
of h.m never occur. Hence, we have

cas(h) ≡ dacc(h).cas
s(h) ≡ dacc(h).w.

From now on we argue only about cycles n ≥ 0 and assume resetn = 0.



9.2.4 Local Configurations and Computations

For processor IDs q and local step numbers i, we define the step numbers
pseq(q, i) when local step i is executed on processor q:

pseq(q, 0) = min{n | s(n) = q}


pseq(q, i) = min{n | n > pseq(q, i − 1) ∧ s(n) = q} .

Configuration hpseq(q,i) is the hardware configuration directly before local step


i of processor q.
We also define a function ic(q, n) which counts how often processor q was
stepped before step n resp. the number of instructions completed on processor
q before step n by

ic(q, 0) = 0
ic(q, n + 1) = ic(q, n) + 1   if s(n) = q
               ic(q, n)       otherwise .
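Both counting functions can be evaluated directly from a stepping function; the following sketch mirrors the definitions (the finite search horizon is an assumption needed only because fairness cannot be checked mechanically).

```python
def ic(q, n, s):
    """Number of steps of processor q completed before global step n."""
    return sum(1 for j in range(n) if s(j) == q)

def pseq(q, i, s, horizon=10**6):
    """Global step number of local step i of processor q."""
    seen = 0
    for n in range(horizon):
        if s(n) == q:
            if seen == i:
                return n          # here s(n) = q and ic(q, n) = i, cf. Lemma 9.7
            seen += 1
    raise ValueError("stepping function does not step q often enough within the horizon")
```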

An easy induction on n shows the following lemma.


Lemma 9.6 (instruction count).

ic(q, n) = #{j | j < n ∧ s(j) = q}

We also establish a simple relation between functions pseq and ic.


Lemma 9.7 (instruction count and step numbers).

ic(q, n) = i ∧ s(n) = q → pseq(q, i) = n

Proof. In case i = 0 we have by definition

pseq(q, 0) = min{n | s(n) = q} .

Because ic(q, n) = 0, we conclude

∀m ∈ [0 : n − 1] : s(m) ≠ q .

Hence,
pseq(q, 0) = n .
In case i > 0, let

{j0 , . . . ji−1 } = {j | j < n ∧ s(j) = q}

and
j0 < . . . < ji−1 .
A trivial induction shows

∀x ≤ i − 1 : jx = pseq(q, x) .

Because
∀m ∈ [ji−1 + 1 : n − 1] : s(m) ≠ q ,
we conclude

n = min{m | m > pseq(q, i − 1) ∧ s(m) = q} = pseq(q, i) .



Hence, up to configuration hpseq(q,i) , processor q has already been stepped i
times. The next step to be executed in hpseq(q,i) is step number i of processor
q, which is the (i + 1)-st local step of this processor.
For processor IDs q and step numbers i, we define the local hardware
configurations hq,i of processor q before local step i. We start with the mul-
tiprocessor hardware configuration hpseq(q,i) in which processor q makes step
i; then we construct a single processor configuration hq,i by taking the pro-
cessor component of the processor that is stepped, i.e., q, and the memory
component from the shared memory:

hq,i = (hpseq(q,i) .p(q), hpseq(q,i) .m) .

The following lemma asserts, for every q that, as far as the processor com-
ponents are concerned, the local configurations hq,i behave as in an ordinary
single processor hardware computation; the shared memory of course can
change between steps i and i + 1 of the same processor.
Lemma 9.8 (local computations).

hq,0 .p = h0 .p(q)
hq,i+1 .p = δH (hq,i .p, hpseq(q,i) .m).p

Proof. By the definition of pseq(q, 0) processor q is not stepped before step


pseq(q, 0):
n < pseq(q, 0) → s(n) ≠ q .
Thus, the configuration of processor q is not changed in these steps and we
get

hq,0 .p = hpseq(q,0) .p(q)


= h0 .p(q) .

By the definition of pseq(q, i + 1) processor q is also not stepped between steps


pseq(q, i) and pseq(q, i + 1):

n ∈ [pseq(q, i) + 1 : pseq(q, i + 1) − 1] → s(n) ≠ q .

As above we conclude that the configuration of processor q does not change


in these steps:

hq,i+1 .p = hpseq(q,i+1) .p(q)


= hpseq(q,i)+1 .p(q)
= δH (hpseq(q,i) .p(q), hpseq(q,i) .m).p
= δH (hq,i .p, hpseq(q,i) .m).p .



Next we show a technical result relating the local computations with the
overall computation.
Lemma 9.9 (relating local and overall computations).

hn .p(q) = hq,ic(q,n) .p

Proof. By induction on n. For n=0 we have

h0 .p(q) = hq,0 .p = hq,ic(q,0) .p .

For the induction step assume the lemma holds for n. Let i = ic(q, n). We
distinguish two cases:
• q = s(n). Then,
ic(q, n + 1) = i + 1 .
By induction hypothesis, Lemma 9.7, and Lemma 9.8 we get

hn+1 .p(q) = δH (hn .p(q), hn .m).p


= δH (hq,i .p, hpseq(q,i) .m).p
= hq,i+1 .p
= hq,ic(q,n+1) .p .

• q ≠ s(n). Then,
ic(q, n + 1) = ic(q, n) = i
and by induction hypothesis and Lemma 9.7 we get

hn+1 .p(q) = hn .p(q)


= hq,i .p
= hq,ic(q,n+1) .p .



9.2.5 Accesses of the Reference Computation

We define the instruction fetch access iacc(q, i) and the data access dacc(q, i)
in local step i of processor q as

iacc(q, i) = iacc(hq,i )
dacc(q, i) = dacc(hq,i ) .

Lemma 9.10 (accesses of reference computation). The hardware mem-


ory of the multi-core reference machine follows semantics of the sequential
memory:
1. For fetch accesses it holds
imout(hq,i ) = dataout(hpseq(q,i) .m, iacc(q, i)) = dataout(h0 .m, iacc(q, i)) .
2. For loads or CAS accesses it holds
l(hq,i ) ∨ cas(hq,i ) → dmout(hq,i ) = dataout(hpseq(q,i) .m, dacc(q, i)) .
3. For updates of h.m it holds
hpseq(q,i)+1 .m = δM (hpseq(q,i) .m, dacc(q, i)) .
Proof. Assuming disjointness of data and code region and correctness of the
multi-core reference implementation (Lemma 9.5) we can easily show
iacc(q, i).a ∈ CR
and
∀a ∈ CR : hpseq(q,i) .m(a) = h0 .m(a).
The statement of the lemma now follows by simple unfolding of definitions
and applying Lemma 9.3. 


9.3 Shared Memory in the Multi-core System


We proceed with constructing a pipelined multi-core MIPS processor, where
a single hardware memory is replaced with caches ica and dca of the shared
memory system constructed in Chap. 8. The hardware memory is replaced
together with the circuit mask4cas. Schematic view of a single core of the
pipelined MIPS processor is shown in Fig. 147.

9.3.1 Notation

As we did in the correctness proof of the single-core pipelined processor,


we denote the multi-core sequential reference implementation by hσ and the
pipelined multi-core implementation with the shared memory system by hπ .
For registers, memories, or circuit signals X in the reference machine,
processor IDs q, and instruction numbers i, we abbreviate

Xσq,i = hσq,i .X     if X ∈ {pc, dpc, gpr, m}
        X(hσq,i )    otherwise
      = hσpseq(q,i) .p(q).X                        if X ∈ {pc, dpc, gpr, m}
        X((hσpseq(q,i) .p(q), hσpseq(q,i) .m))     otherwise .

Values of registers R and circuit signals X during cycle t in processor q of the


multi-core hardware machine we abbreviate resp. by

Rπq,t and Xπq,t .

As before, for signals or registers only occurring in the pipelined design we


drop the subscript π. If an equation holds for all cycles or for all processors
(like equations describing hardware construction) we drop the indices t and q
respectively.
We denote the instruction cache of processor q by ica = ca(2q) and the
data cache of processor q by dca = ca(2q + 1). For inputs and outputs X of a
data cache or an instruction cache of processor q we write respectively

ica.X = X(2q)
dca.X = X(2q + 1) .

9.3.2 Invisible Registers and Hazard Signals

In order to implement CAS instructions in the pipelined design we add invis-


ible registers D.2 and D.3 into the pipeline.
The used predicates from Sect. 7.2.3 are adapted for CAS operations,
which are treated as loads and stores at the same time:

∀X ∈ {C.3, C.4} : used(X, I) = gprw (I) ∧ ¬(l (I) ∨ cas (I))


∀X ∈ {ea.3, ea.4} : used(X, I) = l (I) ∨ s (I) ∨ cas (I)
∀X ∈ {dmin} : used(X, I) = s (I) ∨ cas (I)
used(dmout, I) = l (I) ∨ cas (I) .

Registers D.2 and D.3 are used only for CAS instructions:

D-used(I) = cas (I)


∀X ∈ {D.2, D.3} : used(X, I) = D-used(I) .

The forwarding engine is extended to forward the data into D.2 during the
instruction decode stage:

hitD [k] ≡ fullk ∧ Cad.k = saπ ∧ gprw.k ∧ casπ

topD [k] = hitD [k] ∧ ⋀j<k ¬hitD [j]

D.2inπ = C.3inπ     if topD [2]
         C.4inπ     if topD [3]
         gprinπ     if topD [4]
         gproutDπ   otherwise .

[Fig. 147. Schematic view of the pipelined multi-core MIPS pipeline]

CAS instructions load data from the memory just as regular loads do. Hence,
we have to update the hazard signals:

hazA = A-used ∧ ⋁k∈[2,3] (topA [k] ∧ (con.k.l ∨ con.k.cas))
hazB = B-used ∧ ⋁k∈[2,3] (topB [k] ∧ (con.k.l ∨ con.k.cas)) .

The stall engine has to additionally generate a hazard signal when the D data
can not be forwarded:
haz2 = hazA ∨ hazB ∨ hazD
hazD = D-used ∧ ⋁k∈[2,3] (topD [k] ∧ (con.k.l ∨ con.k.cas)) .
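Read as Boolean equations over the pipeline registers, the forwarding decision for D.2 and the resulting hazard can be sketched as follows; the per-stage record fields (full, Cad, gprw, con_l, con_cas) are assumptions of the sketch and only mirror the signals used above.

```python
def top_D(stages, sa, cas):
    """Index k of the topmost stage whose result is forwarded into D.2,
    or None if the operand is taken from the GPR output gproutD."""
    for k in (2, 3, 4):                                   # the closest full stage wins
        st = stages[k]
        if st.full and st.Cad == sa and st.gprw and cas:  # hitD[k]
            return k
    return None

def haz_D(stages, sa, cas, d_used):
    """hazD: D is used, but the forwarding source in stage 2 or 3 is a load or CAS,
    so the value to be forwarded is not available yet."""
    k = top_D(stages, sa, cas)
    return d_used and k in (2, 3) and (stages[k].con_l or stages[k].con_cas)
```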

9.3.3 Connecting Interfaces

Every MIPS processor in the multi-core system has an instruction cache and a
data cache. We connect the instruction cache ica = ca(2q) to MIPS processor
q in the following way:
ica.pa = imaπ
ica.pw = 0
ica.pr = 1
ica.pcas = 0
ica.preq = 1
Iinπ = ica.pdout
haz1 = ica.mbusy .
The data cache dca = ca(2q + 1) is connected to processor q in the following
way:
dca.pa = ea.3.lπ
dca.pw = con.3.sπ
dca.pr = con.3.lπ
dca.pcas = con.3.casπ
dca.bw = bw.3π
dca.pdin = dminπ
dca.pcdin = D.3π
dca.preq = f ull3 ∧ (con.3.sπ ∨ con.3.lπ ∨ con.3.casπ )
dmoutπ = dca.dout
haz4 = dca.mbusy .
Recall that the stall engine is defined as
stallk = fullk−1 ∧ (hazk ∨ stallk+1 )
uek = fullk−1 ∧ ¬stallk
fullk^{t+1} = uek^t ∨ stallk+1^t .
In stage 1 we always perform a memory access, but we clock the results of the
access to the registers only if we don’t have a stall2 signal coming from the

stage below. As a result, we might perform the same access to the instruction
cache several times, until we are actually able to update the register stage
below. In Sect. 9.3.9 we show that this kind of behaviour does not produce
any deadlocks.
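The stall engine equations are easy to exercise in simulation. The following sketch computes one cycle of stall, ue, and the next full bits for a five-stage pipeline; full[0] plays the role of the constant 1 of the fetch stage, and the dictionary-based interface is an assumption of the sketch.

```python
def stall_engine_cycle(full, haz, stages=5):
    """full[0..stages]: current full bits (full[0] = 1), haz[1..stages]: hazards.
    Returns (ue, stall, next_full), each indexed by stage 1..stages."""
    stall = {stages + 1: False}                  # no stall comes from below the last stage
    for k in range(stages, 0, -1):               # stall_k depends on stall_{k+1}
        stall[k] = full[k - 1] and (haz[k] or stall[k + 1])
    ue = {k: full[k - 1] and not stall[k] for k in range(1, stages + 1)}
    next_full = {k: ue[k] or stall[k + 1] for k in range(1, stages + 1)}
    return ue, stall, next_full
```

With haz[1] driven by ica.mbusy and haz[4] by dca.mbusy as in the connection above, the repeated instruction fetches described in this paragraph show up as cycles in which ue1 stays off while stage 1 keeps presenting the same address.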

9.3.4 Stability of Inputs of Accesses


Lemma 9.11 (stable inputs of accesses). Inputs for the data and instruc-
tion caches are stable:
• For data cache 2q + 1, if the request signal preq(2q + 1) and the memory
busy signal mbusy(2q+1) are both on, register stage 3 of processor q which
contains the inputs of the access is not updated:
preq(2q + 1)t ∧ mbusy(2q + 1)t → ue3q,t = 0 .

• For instruction cache 2q, if the request signal preq(2q) and the memory
busy signal mbusy(2q) are both on, the inputs to the instruction cache
remain stable:
preq(2q)t ∧ mbusy(2q)t → preq(2q)t+1 ∧ imaπq,t+1 = imaπq,t .

Proof. For data caches we have

haz4q,t = mbusy(2q + 1)t = 1

and

preq(2q + 1)t → full3q,t .

Hence,

stall4q,t = full3q,t ∧ haz4q,t = 1 .

Thus,

ue3q,t = full2q,t ∧ ¬stall4q,t = 0 .

For instruction caches we have

imaπq,t = pcπq,t .l    if full1q,t
          dpcπq,t .l   if ¬full1q,t
haz1q,t = mbusy(2q)t = 1
stall1q,t = full0q,t ∧ (haz1q,t ∨ stall2q,t ) = 1
ue1q,t = full0q,t ∧ ¬stall1q,t = 0
full1q,t+1 = ue1q,t ∨ stall2q,t = stall2q,t .

The instruction address is taken either from the PC or from the DPC register
depending on whether stage 1 is full or not. Hence, we split cases on values
of f ull1q,t and f ull1q,t+1 :
• if ¬f ull1q,t we have

ueq,t q,t q,t


2 = f ull1 ∧ ¬stall2
=0
stall2q,t = f ull1q,t ∧ (haz2q,t ∨ stall3q,t )
=0
f ull1q,t+1 = stall2q,t
=0
imaq,t+1
π = dpcq,t+1
π .l
q,t
= dpcπ .l
= imaq,t
π ,

• if f ull1q,t ∧ f ull1q,t+1 we have

ueq,t q,t q,t


2 = f ull1 ∧ ¬stall2
=0
imaq,t+1
π = pcq,t+1
π .l
= pcq,t
π .l
= imaq,t
π ,

• if f ull1q,t ∧ ¬f ull1q,t+1 we have

ueq,t q,t q,t


2 = f ull1 ∧ ¬stall2
=1
imaq,t+1
π = dpcq,t+1
π .l
= pcq,t
π .l
= imaq,t
π .



9.3.5 Relating Update Enable Signals and Ends of Accesses

By the definition of function someend(i, t) read, write, or CAS accesses to


cache i end in cycles t when preq(i)t ∧ ¬mbusy(i)t . In this section we relate
ends of memory accesses with the update enable signals of the processors.

Lemma 9.12 (update enable implies access end). An active update en-
able signal denotes the end of a memory access:

1. For data cache 2q + 1, if the update enable signal of stage 4 is activated


and stage 3 contains a memory request, then a read, write, or CAS access
ends:

ue4q,t ∧ preq(2q + 1)t → ∃k : e(2q + 1, k) = t ∧ ¬acc(2q + 1, k).f .

2. For instruction cache 2q, if the update enable signal of stage 1 is activated,
then a read access ends:

ue1q,t → ∃k : e(2q, k) = t ∧ acc(2q, k).r .

Proof. For data caches we have by hypothesis

ue4q,t = full3q,t ∧ ¬stall4q,t = 1 .

Also by hypothesis we have

preq(2q + 1)t = full3q,t ∧ (con.3.sπq,t ∨ con.3.lπq,t ∨ con.3.casπq,t ) = 1 .

Hence,

con.3.sπq,t ∨ con.3.lπq,t ∨ con.3.casπq,t .

Thus, the update is due to a memory access and does not come from an
instruction that does not require accessing the memory. Because ue4q,t holds,
we have

stall4q,t = full3q,t ∧ (haz4q,t ∨ stall5q,t ) = 0 .

Because stall5q,t = 0 and f ull3q,t = 1 we conclude

haz4q,t = mbusy(2q + 1)t = 0 .

Thus, we have someend(2q + 1, t). The ending access cannot be a flush access,
because by the construction of the control automata of the caches the mbusy
signal stays active during flush accesses.
For instruction caches we have by hypothesis

ue1q,t = full0q,t ∧ ¬stall1q,t = 1 .

Hence,
stall1q,t = 0 .
Because
stall1q,t = f ull0q,t ∧ (haz1q,t ∨ stall2q,t )

and f ull0q,t = 1, we conclude

haz1q,t = mbusy(2q)t = 0
preq(2q)t = 1 .

Thus, we have someend(2q, t). We argue as above that the ending access is
not a flush access. Moreover, we know that write and CAS accesses do not
occur at instruction caches and conclude the proof. 


We come to a subtle point which is crucial for the liveness of the system.
Lemma 9.13 (access end implies update enable). When a read, write,
or CAS access ends, the corresponding stage is updated, unless there is a stall
signal coming from the stage below:
1. For data cache 2q + 1, we have

¬acc(2q + 1, k).f ∧ e(2q + 1, k) = t → ue4q,t .

2. For instruction cache 2q, we have

acc(2q, k).r ∧ e(2q, k) = t ∧ ¬stall2q,t → ue1q,t .

Proof. For the data cache we have by hypothesis

preq(2q + 1)t ∧ ¬mbusy(2q + 1)t .

Because

preq(2q + 1)t = f ull3q,t ∧ (con.3.lq,t ∨ con.3.sq,t ∨ con.3.casq,t ) ,

we conclude
f ull3q,t = 1 .
Because

stall4q,t = f ull3q,t ∧ (haz4q,t ∨ stall5q,t )


= haz4q,t
= mbusy(2q + 1)t
=0,

we conclude

ue4q,t = full3q,t ∧ ¬stall4q,t = 1 .

For the instruction cache we conclude in a similar manner:



stall2q,t = 0
haz1q,t = mbusy(2q)t = 0
stall1q,t = full0q,t ∧ (haz1q,t ∨ stall2q,t ) = 0
ue1q,t = full0q,t ∧ ¬stall1q,t = 1 .




9.3.6 Scheduling Functions

The scheduling functions for a processor q of the pipelined multi-core system
are defined analogously to the single-core processor. I(q, k, t) = i means that
instruction i5 is in circuit stage k of processor q in cycle t:

I(q, k, 0) = 0
I(q, 1, t + 1) = I(q, 1, t) + 1   if ue1q,t
                 I(q, 1, t)       otherwise
I(q, k, t + 1) = I(q, k − 1, t)   if uekq,t
                 I(q, k, t)       otherwise .

For the scheduling functions of the multi-core system, we state the counterpart
of Lemma 7.14.
Lemma 9.14 (scheduling functions difference multi-core). Let k ≥ 2.
Then for all q:
I(q, k − 1, t) = I(q, k, t) + fullk−1q,t .
Proof. Completely analogous to the proof of Lemma 7.14. 


9.3.7 Stepping Function

In what follows we distinguish, as in the sequential case, between the pipelined


machine π and the sequential reference implementations σ. For every hardware
cycle t of the multi-core pipelined machine π, we define the set PS (t) of
processors stepped at cycle t by

PS (t) = {q | ueq,t
4 = 1} ,

i.e., a processor q of the multi-core reference implementation σ is stepped


whenever an instruction is clocked out of the memory stage of processor q of
5
Here, i is the local index of the instruction.

the pipelined machine. The number NS (t) of processors stepped before cycle t


is defined as

NS (0) = 0
NS (t + 1) = NS (t) + #PS (t) .

Thus, in every cycle t we step #PS (t) processors. For every t we will define
the values s(m) of the step function s for m ∈ [NS (t) : NS (t + 1) − 1] such
that
s([NS (t) : NS (t + 1) − 1]) = PS (t) .
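One admissible choice, used here purely for illustration, lists the members of PS(t) in increasing order of their processor IDs; any other order within a cycle satisfies the property as well.

```python
def build_step_function(ps_per_cycle):
    """ps_per_cycle[t] is the set PS(t); returns the list s[0 : NS(T) - 1]
    together with the prefix sums NS(0), ..., NS(T)."""
    s, ns = [], [0]
    for ps in ps_per_cycle:
        s.extend(sorted(ps))             # s([NS(t) : NS(t+1) - 1]) = PS(t)
        ns.append(ns[-1] + len(ps))      # NS(t+1) = NS(t) + #PS(t)
    return s, ns
```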
Any step function with this property would work, but we will later choose a
particular function which makes the proof (slightly) easier. For any function
with the above property the following easy lemma holds.
Lemma 9.15 (relating instruction count with scheduling functions).
For every processor q the scheduling function I(q, 4, t) of the pipelined machine
at time t counts the instructions completed ic(q, t) on the sequential reference
implementation:
ic(q, NS (t)) = I(q, 4, t) .
Proof. By induction on t. For t = 0 both sides of the equation are 0. For the
induction step we assume

ic(q, NS (t)) = I(q, 4, t)

and split cases:


• q ∈ PS (t). This implies

ic(q, NS (t + 1)) = ic(q, NS (t)) + 1.

Since processor q is stepped in cycle t, we have ue4q,t and, hence, full3q,t .
By definition of the scheduling functions and Lemma 9.14 we conclude

I(q, 4, t + 1) = I(q, 3, t)
= I(q, 4, t) + 1.

• q ∉ PS (t). This implies

ic(q, NS (t + 1)) = ic(q, NS (t)).

Since processor q is not stepped in cycle t, we have ¬ue4q,t and by definition
of the scheduling functions:

I(q, 4, t + 1) = I(q, 4, t).





For y ∈ [1 : #PS (t)] we define step numbers my of steps performed during


cycle t and processor IDs qy , identifying the processors which perform these
steps:

my = NS (t) + y − 1
qy = s(my ) .

In cycle t every processor is stepped at most once

z ≠ y → qz ≠ qy ,

and hence, processor qy cannot be stepped in between step numbers NS (t)


and my :
ic(qy , my ) = ic(qy , NS (t)) . (22)
By Lemmas 9.15, 9.7, and (22) we get

pseq(qy , I(qy , 4, t)) = pseq(qy , ic(qy , NS (t))) = my . (23)

Thus,
pseq(q1 , I(q1 , 4, t)) = m1 = NS (t)
and
pseq(q#PS (t) , I(q#PS (t) , 4, t)) = NS (t + 1) − 1 .
We define the linear data access sequence dacc′ by

dacc′[my ] = dacc(qy , I(qy , 4, t))

and conclude with part 3 of Lemma 9.10:

hσmy +1 .m = hσpseq(qy ,I(qy ,4,t))+1 .m
           = δM (hσpseq(qy ,I(qy ,4,t)) .m, dacc(qy , I(qy , 4, t)))
           = δM (hσmy .m, dacc′[my ]) .

By Lemma 8.6 we get the following relation for the hardware memory of the
reference machine.
Lemma 9.16 (hardware memory of the reference computation).

hσNS (t+1) .m = ΔM^{#PS(t)} (hσNS (t) .m, dacc′[NS (t) : NS (t + 1) − 1])

9.3.8 Correctness Proof

For the correctness result of the multi-core system, we assume as before align-
ment and the absence of self modifying code. Recall that for R ∈ reg(k) the
single-core system simulation (correctness) theorem had the form

Rπt = RσI(k,t)     if vis(R)
      RσI(k,t)−1   if fullkt ∧ ¬vis(R) ∧ used(R, IσI(k,t)−1 ) .

For the multi-core machine we aim at a theorem of the same kind. We have,
however, to couple it with an additional statement correlating the memory
abstraction m(htπ ) of the pipelined machine with the hardware memory hσ .m
of the sequential reference implementation. We correlate the memory m(htπ )
of the pipelined machine π with the memory of the sequential machine σ after
NS (t) sequential steps:

a ∈ CR ∪ DR → m(htπ )(a) = hσNS (t) .m(a) .

The main result of this book asserts the simulation of the pipelined multi-core
machine π by the sequential multi-core reference implementation σ.

Lemma 9.17 (pipelined multi-core MIPS correctness). For a ∈ CR ∪


DR there are initial values h0σ .m(a) and for every t there is a step function

s : [0 : NS (t) − 1] → [0 : P − 1] ,

such that
• for all stages k, registers R ∈ reg(k), and all processor IDs q, let

I(q, k, t) = i ,

then

Rπq,t = Rσq,i     if vis(R)
        Rσq,i−1   if fullkt ∧ ¬vis(R) ∧ used(R, Iσq,i−1 ) ,

• a ∈ CR ∪ DR → m(htπ )(a) = hσNS (t) .m(a) .

Proof. By induction on t. For t = 0 all cache lines of π are invalid, as specified


in Sect. 8.4.6. Thus, the memory abstraction of π is defined by the main
memory:
m(h0π )(a) = h0π .mm(a) .
For a ∈ CR ∪ DR we choose initial values of the memory of σ by

h0σ .m(a) = h0π .mm(a) .

A meaningful initial program can only be guaranteed if the initial code region
CR is realized in the main memory as a ROM. We choose the size of the ROM
in hσ .m(a) to be the same as the size of the read only region in hπ .mm.
Compared to the proof for a single pipelined processor, the proof of the
induction step changes only for the instruction and memory stages, i.e., for
k = 1 and k = 4, and only for the memory components and their outputs. In

what follows, we only present these parts of the proof. We first consider stage
k = 4 and consider processors q with ue4q,t = 1, resp. with q ∈ PS (t).
Using the formalism from Sect. 9.3.7, we have for y ∈ [1 : #PS (t)], step
numbers my of steps performed during cycle t, and processor IDs qy of pro-
cessors that perform these steps:

my = NS (t) + y − 1
qy = s(my ) .

Since in cycle t every processor is stepped at most once, we have

∀y1 , y2 ∈ [1 : #PS (t)] : y1 = y2 ↔ qy1 = qy2 .

The linear sequence of hardware data accesses dacc is defined as

dacc [my ] = dacc(qy , I(qy , 4, t)) .


Now, let iy = I(qy , 4, t). From ue4qy ,t we get full3qy ,t . By Lemmas 9.14 and
9.15 we get

I(qy , 3, t) = I(qy , 4, t) + 1
= iy + 1
= ic(qy , NS (t)) + 1 .

All registers R of register stage 3 are invisible. From the induction hypothesis
we get

used(R, Iσqy ,iy ) → Rπqy ,t = Rσqy ,I(qy ,3,t)−1
                            = Rσqy ,iy
                            = Rσqy ,ic(qy ,NS (t)) .

We split cases on the value preq(2qy + 1)t of the memory request signal of the
data cache of processor qy . Recall that it is defined as
preq(2qy + 1)t = (con.3.lqy ,t ∨ con.3.sqy ,t ∨ con.3.casqy ,t ) ∧ full3qy ,t
               = con.3.lqy ,t ∨ con.3.sqy ,t ∨ con.3.casqy ,t .

• If preq(2qy + 1)t = 0, instruction iy is not accessing the memory. Since the


register con.3 is always used, we can conclude that data access

dacc(qy , iy ) = dacc [my ]

is a void access.
• If preq(2qy + 1)t = 1, we conclude with part 1 of Lemma 9.12 that there
exists a number ky , such that a data access acc(2qy + 1, ky ) ends in cycle
t. Because for the input registers R of the memory stage we have shown

used(R, Iσq,iy ) → Rπqy ,t = Rσqy ,ic(qy ,NS (t)) ,

we can conclude by an easy case split on the type of instruction iy :

acc(2qy + 1, ky ) = dacc(qy , ic(qy , NS (t))) = dacc(qy , iy ) = dacc′[my ] .

Now things are easy. By induction hypothesis we have

a ∈ CR ∪ DR → m(htπ )(a) = hσNS (t) .m(a) .

What we aim at showing is

a ∈ CR ∪ DR → m(hπt+1 )(a) = hσNS (t+1) .m(a) .

Recall that in Sect. 8.5.9 we have numbered accesses acc(i, k) according to
their end time:

seq(E(0)) = [0 : NE(1) − 1]
seq(E(t)) = [NE(t) : NE(t + 1) − 1] ,

and have defined the sequentialized access sequence acc′ as

acc′[seq(i, k)] = acc(i, k) .
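The same bookkeeping yields the end-time numbering seq and the sequence acc′. The sketch below is again only an illustration; it assumes that E(t), the set of pairs (i, k) whose access ends in cycle t, and the accesses acc(i, k) are given:

def sequentialize_accesses(E, acc, T):
    seq, acc_prime, NE = {}, [], [0]
    for t in range(T):
        ending = sorted(E(t))                  # some fixed order inside cycle t
        for n, (i, k) in enumerate(ending):
            seq[(i, k)] = NE[t] + n            # seq(E(t)) = [NE(t) : NE(t+1) - 1]
            acc_prime.append(acc(i, k))        # acc'[seq(i, k)] = acc(i, k)
        NE.append(NE[t] + len(ending))         # NE(t+1) = NE(t) + #E(t)
    return seq, acc_prime, NE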

By Lemma 8.65 we get for a ∈ CR ∪ DR

m(h_π^{t+1})(a) = Δ_M^{#E(t)}(m(h_π^t), acc′[NE(t) : NE(t + 1) − 1])(a)
               = Δ_M^{#E(t)}(h_σ^{NS(t)}.m, acc′[NE(t) : NE(t + 1) − 1])(a) .

The sequence acc′[NE(t) : NE(t + 1) − 1] of memory accesses ending in cycle
t consists of read, write, CAS, and flush accesses. Let acc′′[0 : u − 1] be the
subsequence of acc′[NE(t) : NE(t + 1) − 1] consisting exactly of the write and
CAS accesses. Because reads and flushes do not change the memory abstraction,
we get

m(h_π^{t+1})(a) = Δ_M^u(h_σ^{NS(t)}.m, acc′′)(a) .
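Writing Δ_M as a fold over the access sequence makes this step explicit: only write accesses and successful CAS accesses modify the memory. A minimal sketch, assuming hypothetical access records with fields kind, a, data, and cdata:

def delta_M(m, accs):
    # apply a finite sequence of accesses to a memory (dict: address -> data)
    m = dict(m)
    for acc in accs:
        if acc.kind == "write":
            m[acc.a] = acc.data
        elif acc.kind == "cas" and m.get(acc.a) == acc.cdata:
            m[acc.a] = acc.data                # compare-and-swap succeeds
        # read, flush and void accesses leave the memory abstraction unchanged
    return m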
By Lemma 9.16 we have

h_σ^{NS(t+1)}.m = Δ_M^{#PS(t)}(h_σ^{NS(t)}.m, dacc′[NS(t) : NS(t + 1) − 1]) .

Let dacc′′[0 : v − 1] be the subsequence of the data access sequence dacc′[NS(t) :
NS(t + 1) − 1] consisting only of the write and CAS accesses. Because reads
and void accesses do not change the memory abstraction, we get

h_σ^{NS(t+1)}.m = Δ_M^v(h_σ^{NS(t)}.m, dacc′′) .

Lemmas 9.13 and 9.12 guarantee that, if a write or a CAS memory access
ends in a cache of processor q in cycle t, then it is an access to the data cache
and the memory stage of processor q is updated. Thus, we have q ∈ PS(t).
Hence, all accesses from acc′′ are included in the sequence dacc′′. On the
other hand, Lemma 9.12 also implies that all accesses from dacc′′ are included
in the sequence acc′′. Hence, sequences acc′′ and dacc′′ consist of exactly the
same (write or CAS) accesses acc(2q_y + 1, k_y) ending in cycle t.
From the definition of the end of a memory access, we know that there
can be only one access to the data cache of processor q_y ending in cycle t:

∀k, k′ : e(2q_y + 1, k) = e(2q_y + 1, k′) → k = k′ .

For every such access, we step the processor q_y exactly once. Hence, the lengths
of the sequences acc′′ and dacc′′ are the same:

u = v .
The order of accesses in these sequences might be different, but by Lemma 8.56
write and CAS accesses ending in the same cycle have different addresses.
Thus, the two access sequences have the same effect on memory. For a ∈
DR ∪ CR this gives us

Δ_M^u(h_σ^{NS(t)}.m, acc′′)(a) = Δ_M^u(h_σ^{NS(t)}.m, dacc′′)(a) ,

which implies

m(h_π^{t+1})(a) = h_σ^{NS(t+1)}.m(a) .

This shows the second statement of the lemma.
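The reordering argument can be replayed on small examples: as long as the write and CAS accesses ending in one cycle have pairwise distinct addresses (Lemma 8.56), every ordering of them yields the same memory. A self-contained sketch, treating the accesses simply as (address, data) pairs:

import itertools

def apply_writes(m0, writes):
    # writes: list of (address, data) pairs, applied from left to right
    m = dict(m0)
    for a, d in writes:
        m[a] = d
    return m

def order_independent(m0, writes):
    # precondition: pairwise distinct addresses among the accesses of one cycle
    assert len({a for a, _ in writes}) == len(writes)
    outcomes = {frozenset(apply_writes(m0, list(p)).items())
                for p in itertools.permutations(writes)}
    return len(outcomes) == 1            # one outcome for every ordering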
Next, for the data outputs of the data caches of π we consider read and CAS
accesses

acc(2q_y + 1, k_y) = dacc(q_y, I(q_y, 4, t)) = dacc(q_y, i_y) = dacc′[m_y] .

Let

a = acc(2q_y + 1, k_y).a .
By Lemma 8.56 read accesses and write accesses ending in the same cycle
have different addresses. Hence,

h_σ^{NS(t)}.m(a) = h_σ^{NS(t)+y−1}.m(a) = h_σ^{m_y}.m(a) .

By Lemma 8.64 (1 step), part 2 of Lemma 8.10, (23), and part 2 of Lemma
9.10 we get

pdout_π(2q_y + 1)^t = pdout1(ms(h_π^t), dacc′[m_y], 2q_y + 1)      (Lemma 8.64)
                    = m(h_π^t)(a)                                  (Lemma 8.10)
                    = h_σ^{NS(t)}.m(a)
                    = h_σ^{m_y}.m(a)
                    = h_σ^{pseq(q_y,i)}.m(a)                       (23)
                    = dataout(h_σ^{pseq(q_y,i)}.m, dacc(q_y, i))
                    = dmout_σ^{q_y,i} .                            (Lemma 9.10)
Finally, for the outputs of instruction caches 2q in stage k = 1 we consider
processors q with ue_1^{q,t} = 1. Then by part 2 of Lemma 9.12 a read access
acc(2q, r) ends in cycle t, i.e., (2q, r) ∈ E(t). Let

a = acc(2q, r).a and i = I(q, 1, t) .

By the same argument as for single-core pipelined processors we conclude

a = ima_π^{q,t} = ima_σ^{q,i} ∈ CR .

Thus, the access ending at the instruction cache of processor q in cycle t is a
fetch access iacc(q, i):

acc(2q, r) = iacc(q, i) .

Note that, because we keep the request signal to the instruction cache active
until the stall signal from the previous stage is removed, there may be several
accesses acc(2q, r_1), acc(2q, r_2), . . . corresponding to the access iacc(q, i).
Fortunately, all these accesses are reads, which do not modify the state of the
memory abstraction. As a result, we do not care about the exact reconstruction
of the access sequence to the instruction cache and talk only about the existence
of an access which ends in the same cycle in which we activate signal ue_1.
By Lemma 8.64 (1 step), part 2 of Lemma 8.10, and part 1 of Lemma 9.10
we get

pdout_π(2q)^t = pdout1(ms(h_π^t), acc(2q, r), 2q)          (Lemma 8.64)
              = m(h_π^t)(a)                                (Lemma 8.10)
              = h_σ^{NS(t)}.m(a)
              = h_σ^0.m(a)
              = dataout(h_σ^0.m, iacc(q, i))
              = imout_σ^{q,i} .                            (Lemma 9.10)




9.3.9 Liveness

In the liveness proof of the multi-core processor we argue that every stage
which is being stalled is eventually updated.

Lemma 9.18 (multi-core liveness).

stall_k^t → ∃t′ > t : ue_k^{t′}

Proof. The order of stages for which we prove the statement of the lemma is
important. For liveness of the upper stages, we use liveness of the lower stages.
Recall that the signals in the stall engine are defined as

stall_k = full_{k−1} ∧ (haz_k ∨ stall_{k+1})
full_k^{t+1} = ue_k^t ∨ stall_{k+1}^t
ue_k = full_{k−1} ∧ ¬stall_k .
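These recurrences are easy to animate for a single processor. The sketch below is only an illustration; the hazard signals and the full bits of cycle t are assumed to be given, and stall_{K+1} is taken to be 0:

def step_stall_engine(full_t, haz_t, K=6):
    # full_t[k], k in 0..K: full bits in cycle t; haz_t[k], k in 1..K: hazard signals
    stall, ue, full_next = {K + 1: False}, {}, {}
    for k in range(K, 0, -1):                      # evaluate from the bottom stage upwards
        stall[k] = bool(full_t[k - 1] and (haz_t[k] or stall[k + 1]))
        ue[k] = bool(full_t[k - 1] and not stall[k])
    for k in range(1, K + 1):
        full_next[k] = ue[k] or stall[k + 1]       # full_k^{t+1} = ue_k^t ∨ stall_{k+1}^t
    return stall, ue, full_next

The case distinction below then establishes, from the bottom stages upwards, that a raised stall signal cannot persist forever.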

• For k ∈ [5, 6] we always have stall_k = 0 and there is nothing to show.

• For k = 4 we have

stall_4^t = full_3^t ∧ haz_4^t
          = full_3^t ∧ dca.mbusy^t .

From Lemma 8.68 (liveness of shared memory) we know that the processor
request to the data cache must be active in cycle t if the mbusy signal is
high:

dca.mbusy^t → dca.preq^t .

Moreover, there exists a cycle t′ > t such that

¬dca.mbusy^{t′} ∧ ∀t′′ ∈ [t : t′) : dca.mbusy^{t′′}

holds. From Lemma 9.11 (stable inputs of accesses) we get that the registers
of stage 3 are not updated:

∀t′′ ∈ [t : t′) : ¬ue_3^{t′′} ,

which implies

∀t′′ ∈ [t : t′] : dca.preq^{t′′} .

Hence, there is a data access to the data cache ending in cycle t′, which
implies ue_4^{t′} by Lemma 9.13.
• For k = 3 we have

stall_3^t = full_2^t ∧ stall_4^t

and conclude the statement by applying liveness of stage 4.


• For k = 2 we assume

∀t′ > t : ¬ue_2^{t′}

and prove the lemma by contradiction. From stall_2^t we get full_1^t and, hence,

∀t′ ≥ t : full_1^{t′} ∧ stall_2^{t′} .

The stall signal for stage 2 is defined as

stall_2^t = full_1^t ∧ (haz_2^t ∨ stall_3^t) .

If we have haz_2^t, we do a case split depending on the top-most stage in
which we have a hit. If for some X ∈ {A, B, D} we have

topX[2]^t ∧ (con.2.l_π^t ∨ con.2.cas_π^t) ,


then by the liveness of stage 3 we find the smallest t′ > t such that ue_3^{t′}
holds, which implies ¬stall_3^{t′}. Since stage 2 is not updated in cycle t′, we
have

full_2^{t′+1} = (ue_2^{t′} ∨ stall_3^{t′}) = 0 ,

which means that in cycle t′ + 1 we do not have a hit in stage 2 anymore
and have a top-most hit in stage 3:

topX[3]^{t′+1} ∧ (con.3.l_π^{t′+1} ∨ con.3.cas_π^{t′+1}) .

Moreover, for all cycles t* ≥ t′ + 1 we also have

ue_3^{t*} = full_2^{t*} = 0 .

Using liveness of stage 4, we find the smallest cycle t′′ > t′ such that
¬stall_4^{t′′} holds and get

full_3^{t′′+1} = (ue_3^{t′′} ∨ stall_4^{t′′}) = 0 .

Hence, at cycle t′′ + 1 both stages 2 and 3 are not full. This implies

¬haz_2^{t′′+1} ∧ ¬stall_3^{t′′+1} .

Thus, we have ¬stall_2^{t′′+1} and get a contradiction.
The case when we have haz_2^t ∧ topX[3]^t is proven in the same way using
liveness of stage 4. For the last case we have

stall_3^t ∧ ¬haz_2^t ,

which also gives a contradiction by liveness of stage 3.


• For k = 1 we again assume

∀t′ > t : ¬ue_1^{t′}

and prove by contradiction. We have

∀t′ ≥ t : haz_1^{t′} ∨ stall_2^{t′} .

The hazard signal in stage 1 is generated by the instruction cache:

haz_1 = ica.mbusy .

We consider two cases. If

stall_2^t

holds, then we use liveness of stage 2 to find the smallest t′ > t such that

ue_2^{t′} ∧ ¬stall_2^{t′} .

Together with assumption ∀t′ > t : ¬ue_1^{t′} this implies for all cycles t* ≥ t′ + 1

¬full_1^{t*} ∧ ¬stall_2^{t*} .

If ¬haz_1^{t′}, we are done. If haz_1^{t′}, then using Lemma 8.68 (liveness of shared
memory) we find t′′ > t′ such that ¬mbusy^{t′′} holds, which implies ¬haz_1^{t′′}
and gives a contradiction.
In the second case we have

haz_1^t ∧ ¬stall_2^t .

In the proof of the first case we have already considered the same situation
for cycle t′.


References

1. Beyer, S., Jacobi, C., Kröning, D., Leinenbach, D., Paul, W.: Instantiating un-
interpreted functional units and memory system: Functional verification of the
VAMP. In: Geist, D., Tronci, E. (eds.) CHARME 2003. LNCS, vol. 2860, pp.
51–65. Springer, Heidelberg (2003)
2. Cohen, E., Paul, W., Schmaltz, S.: Theory of multi core hypervisor verification.
In: van Emde Boas, P., Groen, F.C.A., Italiano, G.F., Nawrocki, J., Sack, H. (eds.)
SOFSEM 2013: Theory and Practice of Computer Science. LNCS, vol. 7741, pp.
1–27. Springer, Heidelberg (2013)
3. Dalinger, I., Hillebrand, M.A., Paul, W.J.: On the verification of memory man-
agement mechanisms. In: Borrione, D., Paul, W. (eds.) CHARME 2005. LNCS,
vol. 3725, pp. 301–316. Springer, Heidelberg (2005)
4. Emerson, E.A., Kahlon, V.: Rapid parameterized model checking of snoopy
cache coherence protocols. In: Garavel, H., Hatcliff, J. (eds.) TACAS 2003.
LNCS, vol. 2619, pp. 144–159. Springer, Heidelberg (2003)
5. Keller, J., Paul, W.J.: Hardware Design. Teubner-Texte zur Informatik. Teub-
ner, Stuttgart (1995)
6. Kröning, D.: Formal Verification of Pipelined Microprocessors. PhD thesis, Saar-
land University (2001)
7. Lamport, L.: How to make a multiprocessor computer that correctly executes
multiprocess programs. IEEE Trans. Comput. 28(9), 690–691 (1979)
8. Maisuradze, G.: Implementing and debugging a pipelined multi-core MIPS ma-
chine. Master’s thesis, Saarland University (2014)
9. MIPS Technologies, Inc. MIPS32 Architecture For Programmers – Volume 2
(March 2001)
10. Müller, C., Paul, W.: Complete formal hardware verification of interfaces for a
FlexRay-like bus. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS,
vol. 6806, Springer, Heidelberg (2011)
11. Müller, S.M., Paul, W.J.: On the correctness of hardware scheduling mechanisms
for out-of-order execution. Journal of Circuits, Systems, and Computers 8(02),
301–314 (1998)
12. Müller, S.M., Paul, W.J.: Computer Architecture, Complexity and Correctness.
Springer, Heidelberg (2000)
13. Pong, F., Dubois, M.: A survey of verification techniques for cache coherence
protocols (1996)

14. Schmaltz, J.: A formal model of clock domain crossing and automated verifi-
cation of time-triggered hardware. In: FMCAD, pp. 223–230. IEEE Computer
Society, Los Alamitos (2007)
15. Schmaltz, S.: Towards the Pervasive Formal Verification of Multi-Core Operat-
ing Systems and Hypervisors Implemented in C. PhD thesis, Saarland Univer-
sity, Saarbrücken (2013)
16. Sweazey, P., Smith, A.J.: A class of compatible cache consistency protocols
and their support by the IEEE futurebus. SIGARCH Computer Architecture
News 14(2), 414–423 (1986)
17. Weaver, D.L.: OpenSPARC internals. Sun Microsystems (2008)
Index

abstract cache 210 Boolean expressions 20


cache coherence 211 evaluation 21
cache hit 210 translating into circuits 33
cache line states 210 Boolean values 8
configuration 210 branch condition evaluation unit 113,
implemented memory abstraction 147
210 bytes 11
shared memory abstraction 211 concatenation 129
adder 99
ALU see arithmetic logic unit cache see abstract cache
arithmetic logic unit 106, 150 cache coherence 211
arithmetic unit 101 cache consistency see cache coherence
data paths 103, 107 cache hit
negative bit 104 abstract cache 210
overflow bit 106 direct mapped cache 213
atomic MOESI protocol see MOESI fully associative cache 217
protocol k-way associative cache 215
AU see arithmetic unit CAS
MIPS ISA 312
BCE see branch condition evaluation sequential processor 313
unit compare-and-swap see CAS
binary numbers 15 congruence mod see equivalence mod
addition 17 control automata
decomposition 16 shared memory system 247
subtraction algorithm 20 transitions and control signals 249
bit-strings 11 control automaton 75
as binary numbers 15 Mealy automaton 76
as two’s complement numbers 18 implementation 80
bit-operations 12 precomputing outputs 81
bytes 11 Moore automaton 76
bits 11 implementation 76
Boolean equations 22 precomputing outputs 78
identities 22, 23 cyclic left shifter 110
solving equations 23, 25 cyclic right-left shifter 110

decision cycle of an access 282 hazard signals


decoder 37 multi-core pipelined processor 327
decomposition lemma 16 single-core pipelined processor 197
delayed PC 162
detailed hardware model 44 implementation registers see visible
circuit semantics 48 registers
parameters 44 incrementer 100
register semantics 46 instruction set architecture see MIPS
reset 46 ISA
simulation by digital model 50, 61 integers 8
stable signal 45 two’s complement representation 18
timing analysis 49 inverter 34
digital circuit 30, 41 invisible registers 161
cycle 32 multi-core pipelined processor 327
hardware computation 42 single-core pipelined processor 175
hardware configuration 42 ISA see MIPS ISA
path 32
digital gates 30 k-way associative cache 214
n-bit gates 35 abstraction 216
direct mapped cache 212 cache hit 215
abstraction 213
cache hit 213 liveness
disjunctive normal form 26 multi-core pipelined processor 341
complete 27 shared memory system 310
single-core pipelined processor 205
effective address 130, 153
end cycle of an access 281 main memory 69
equivalence mod 12 clean operation 72
equivalence relation 12 operating conditions 70
properties 13 stable inputs 70
solutions 14 timing 69
system of representatives 13 memory accesses 220
multi-port access sequences 221
finite state transducer see control processor data accesses 318, 325
automaton processor instruction fetch accesses
forwarding see pipelined processor 319, 325
full adder 33 sequential access sequences 221
full bits 170 memory embedding 135
fully associative cache 216 memory slices see memory systems
abstraction 218 memory system, shared see shared
cache hit 217 memory system
memory systems 219
general purpose register file see GPR abstraction 220
geometric sums 14 cache abstraction 223
glitch 46 hardware configurations 223
GPR 120, 134, 147, 159 memory accesses see memory
accesses
half adder 33 memory slices 220
half-decoder 38 sequential consistency 222, 308

memory, user-visible 219 memory interfaces 329


memory accesses see memory scheduling functions 334
accesses stepping function 334
sequential semantics 221 multi-core sequential processor 318
MIPS ISA 117 configuration 318
ALU-operations 124 correctness 321
branches and jumps 127 data accesses 318, 325
J-type jumps 128 instruction count 323
jump and link 128 instruction fetch accesses 319, 325
R-type jumps 127 local step numbers 323
CAS 312 simulation relation 321
configuration 120 multiplexer 34
computation 120
next configuration 120 natural numbers 8
delayed PC 162 binary representation 15
instructions number of caches 219
current instruction 121 number of processors 219
decoding 123
I-type 118, 121 open collector bus 55
immediate constant 122 open collector driver 55
instruction fields 122 overlapping accesses 288
J-type 119, 121
opcode 121 parallel prefix 39
R-type 119, 121 PC 120, 134
loads and stores 130 pipelined processor 167
effective address 130 correctness proof 178
loads 131 proof obligations 179
stores 130 correctness statement 177
memory 129 forwarding 190
multi-core see multi-core MIPS ISA correctness proof 193
shift unit operations 126 software conditions 191
summary 132 full bits 170
mod operator 14 invisible registers 175
MOESI protocol 210, 224 liveness 205
algebraic specification 230 multi-core see multi-core pipelined
invariants 224 processor
master transitions 226 scheduling functions 172
properties 234 properties 173
slave transitions 226 with forwarding 192
multi-core MIPS ISA 317 with stalling 198
computation 317 software conditions 176
configuration 317 stall engine 171, 196
multi-core pipelined processor 326 correctness 203
correctness 337 hazard signals 197
ends of memory accesses 331 update enable 171
hazard signals 327 used registers 175
invisible registers 327 visible registers 168
liveness 341 program counter see PC
memory inputs stability 330

RAM 83 shift for store 154


2-port CS RAM 97 shift unit environment 151
2-port RAM 94 simulation relation 138
cache state RAM 89 software conditions 133
GPR RAM 92 stages of instruction execution 140,
multi-bank RAM 86 165
RAM-ROM 95 visible registers 167
SPR RAM 90 writing to GPR 159
SRAM 83 set-clear flip-flops 64
random access memory see RAM shared memory system 235
read only memory see ROM cache abstraction 223
registers 54 control automata 247
relation 12 transitions and control signals 249
equivalence 12
correctness proof 261
reflexive 12
1 step 301
symmetric 12
accesses of the hardware computa-
transitive 12
tion 279
ROM 85
arbitration 261
RAM-ROM 95
automata synchronization 264
auxiliary registers 267
scheduling functions classification of accesses 280
multi-core pipelined processor 334 control of tristate drivers 269
single-core pipelined processor 172 data transmission 277
properties 173 ordering of accesses 305
with forwarding 192 overlapping accesses 288
self destructing hardware 61 protocol data transmission 274
sequences 9 relating with atomic protocol 305
concatenation 11 sequential consistency 308
subsequences 11 silent master 263
sequential consistency 222, 308 silent slaves 263
sequential processor 133
simultaneously ending accesses
ALU environment 150
290
CAS 313
stable decision 296
computation 134
data paths 240
configuration 134
data RAM 241
correctness 139
state RAM 245
delayed PC 163
effective address 153 tag RAM 243
initialization 141 hardware configuration 223
instruction decoder 143 initialization 260
instruction fetch 142 interfaces 236
jump and link 152 liveness 310
memory stage 156 master arbitration 257
multi-core see multi-core sequential fairness 259
processor memory bus 238
next PC environment 147 MOESI protocol
reading from GPR 147 global transactions 231
shift for load 158 local transactions 231

shift unit 108, 151 tree 36


implementation 112 ◦-tree 36
single-core pipelined processor see OR tree 40
pipelined processor tristate bus 57
single-core sequential processor see bus contention 59, 62
sequential processor control logic 64
SLC see cyclic left shifter operation of main memory 72
software conditions 133, 176, 191 simulation by digital model 61
spike 46 tristate driver 56
SRLC see cyclic right-left shifter n-bit 68
stall engine see pipelined processor output circuitry 59
start cycle of an access 279 two’s complement numbers 18
SU see shift unit properties 19
switching function 26
computing by Boolean expression visible registers 167, 168
26
system of representatives 13 zero tester 36
