Issues in Microprocessor and Multimicroprocessor Systems - Veljko Milutinovic
ISSUES
IN
MICROPROCESSOR
AND
MULTIMICROPROCESSOR
SYSTEMS
Veljko Milutinovic
Foreword by Michael Flynn
Table of Contents

PROLOGUE ... 7
FOREWORD ... 9
PREFACE ... 10
ACKNOWLEDGMENTS ... 15
2. Advanced Issues ... 41
3. Problems ... 46
2. Advanced Issues ... 52
   2.1. Exploitation of Temporal Locality ... 57
   2.2. Exploitation of Spatial Locality ... 58
   2.3. The Most Recent Contributions ... 60
3. Problems ... 60
INSTRUCTION-LEVEL PARALLELISM ... 62
1. Basic Issues ... 62
   1.1. Example: MIPS R10000 and R12000 ... 68
   1.2. Example: DEC Alpha 21164 and 21264 ... 70
   1.3. Example: PowerPC 620 and 750 ... 71
   1.4. Example: Precision Architecture PA-8000 and PA-8500 ... 73
   1.5. Example: AMD K-5 and K-6 ... 75
   1.6. SUN Ultra SPARC I and II ... 76
   1.7. A Summary ... 77
2. Advanced Issues ... 80
3. Problems ... 87
PREDICTION STRATEGIES ... 89
1. Branch Prediction Strategies ... 89
   1.1. Basic Issues ... 89
      1.1.1. Hardware BPS ... 90
      1.1.2. Software BPS ... 99
      1.1.3. Hybrid BPS ... 100
         1.1.3.1. Predicated Instructions ... 100
         1.1.3.2. Speculative Instructions ... 101
   1.2. Advanced Issues ... 103
PROLOGUE
Foreword
There are several different styles in technical texts and monographs. The most familiar is
the review style of the basic textbook. This style simply considers the technical literature
and re-presents the data in a more orderly or useful way. Another style appears most commonly in monographs. This reviews either a particular aspect of a technology or all technical aspects of a single complex engineering system. A third style, represented by this
book, is an integration of the first two styles, coupled with a personal reconciliation of important trends and movements in technology.
The author, Professor Milutinovic, has been one of the most productive leaders in the
computer architecture field. Few readers will not have encountered his name on an array of
publications involving the important issues of the day. His publications and books span almost all areas of computer architecture and computer engineering. It would be easy, then,
but inaccurate, to imagine this work as a restatement of his earlier ideas. This book is different, as it uniquely synthesizes Professor Milutinovic's thinking on the important issues in
computer architecture.
The issues themselves are presented quite concisely: cache, instruction level parallelism,
prediction strategies, the I/O bottleneck, multithreading, and multiprocessors. These are
currently the principal research areas in computer architecture. Each one of these topics is
presented in a crisp way, highlighting the important issues in the field together with Professor Milutinovic's special viewpoints on these issues, closing each section with a statement
about his own group's research in this area. This statement of issues is coupled with three
important case studies of fundamentally different computer systems implementations. The
case studies use details of actual engineering implementations to help synthesize the issues
presented. The result is a necessarily somewhat eclectic, personal statement by one of the
leaders of the field about the important issues that face us at this time.
This work should prove invaluable to the serious student.
Michael J. Flynn
Stanford University
Preface
Design of microprocessor and/or multimicroprocessor systems represents a continuous
struggle; success (if achieved) lasts infinitesimally long and disappears forever, unless a
new struggle (with unpredictable results) starts immediately. In other words, it is a continuous survival process, which is the main motto of this book.
This book is about survival of those who have contributed to the state of the art in the rapidly changing field of microprocessing and multimicroprocessing on a single chip, and about the concepts that have yet to find their way into the next generation of microprocessors and multimicroprocessors on a chip, in order to enable these products to stay on the competitive edge.
This book is based on the assumption that one of the ultimate goals of single-chip design is to have an entire distributed shared memory system on a single silicon die, together with numerous specialized accelerators, including the complex ones of the SIMD and/or MISD type. Such an opinion is based on the author's experiences, and on a number of important references, like [Hammond97]. Consequently, the book concentrates on the major problems to be solved on the way to this ultimate goal (distributed shared memory on a single chip), and summarizes the author's experiences which led to such a conclusion (in other words, the problem is "how to invest one billion transistors on a single chip").
The internal appendices of this book (at the end of the book) cover the details of one important DSM concept (RMS - Reflective Memory System), as well as the details of the
tools which can be used to evaluate new architectural ideas, or to characterize the applications of interest.
The external appendices of this book (on WWW) are about the microprocessor and multimicroprocessor based designs of the author himself, and about the lessons that he has learned through his own professional survival process, which has lasted for about two decades now; concepts from microprocessor and multimicroprocessor boards of the past represent potential solutions for the microprocessor and multimicroprocessor chips of the future, and (more importantly) represent the ground for the author's belief that an ultimate goal is to have an entire distributed shared memory on a single chip, together with numerous specialized accelerators.
At first, distributed shared memory on a single chip may sound like a contradiction; however, it is not. As the dimensions of chips become larger, their full utilization can be obtained only with multimicroprocessor architectures. After the number of microprocessors reaches 16, the SMP architecture is no longer a viable solution, since the bus becomes a bottleneck; consequently, designers will be forced, in a number of cases, to move to the distributed shared memory paradigm (implemented in hardware, or partially in hardware and partially in software).
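The bus-bottleneck argument can be made concrete with a back-of-the-envelope model. The sketch below is purely illustrative; the miss rate, per-processor instruction rate, line size, and bus bandwidth are assumed figures chosen for the example, not numbers taken from this book.

```python
# Back-of-the-envelope model of shared-bus saturation in an SMP.
# All parameter values are illustrative assumptions, not measurements.

def bus_utilization(n_cpus,
                    miss_rate=0.02,       # cache misses per instruction (assumed)
                    ips=100e6,            # instructions/second per CPU (assumed)
                    bytes_per_miss=32,    # one cache line transferred per miss (assumed)
                    bus_bandwidth=1e9):   # bytes/second on the shared bus (assumed)
    """Fraction of bus bandwidth consumed by the miss traffic of n_cpus processors."""
    traffic = n_cpus * miss_rate * ips * bytes_per_miss  # bytes/second offered to the bus
    return traffic / bus_bandwidth

for n in (1, 4, 8, 16, 32):
    u = bus_utilization(n)
    print(f"{n:2d} CPUs -> bus utilization {u:6.1%}{'  (saturated)' if u >= 1 else ''}")
```

With these assumed parameters the shared bus saturates between 8 and 16 processors; changing any parameter shifts the knee, but the basic effect remains: miss traffic grows linearly with the processor count while the bus bandwidth stays fixed.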
In this book, the issues of importance for current on-board microprocessor and multimicroprocessor based designs, as well as for future on-chip microprocessor and multimicroprocessor designs, have been divided into eight different topics. The first one is about the
general microprocessor architecture, and the remaining seven are about seven different
problem areas of importance for the ultimate goal: distributed shared memory on a single
chip, together with numerous specialized accelerators.
After long discussions with the most experienced colleagues (see the list in the acknowledgment section), and the most enthusiastic students (they always have excellent suggestions), the major topics have been selected, as follows:
a)
b)
c)
d)
e)
f)
g)
h)
computing are concerned, only the issues of importance for future on-chip machines have
been selected.
Efficient implementation of a sophisticated machine implies numerous efforts in domains
like process technology, system design (e.g., bus design, system integrity checking, and
fault-tolerance), and compiler design. In other words, the ultimate goal of this book cannot be reached without efforts in process technology, computer design, and system software.
However, these issues are outside the scope of this book.
This book also includes a prologue section, which explains the roots of the idea behind it: combining synergistically the general body of knowledge and the particular experiences of an individual who has survived several pioneering design efforts, some of which were relatively successful commercially.
As already indicated, this book includes both internal and external appendices with case
studies. Internal appendices are a part of this book, and will not be further discussed here.
External appendices are on the WWW, and will be briefly discussed here (only the initial
three case studies, developed before this book was published; the list of case studies in the
external appendix is expected to grow over time).
The author was deeply engaged in all designs presented in the external appendix. Each
project, in the field which is the subject of this book, includes three major activities:
a) envisioning of the strategy (project directions and milestones),
b) consulting on the tactics (product architecture and organization), and
c) engaging in the battle (design and almost exhaustive testing at all logical levels,
until the product is ready for production).
The first case study on the WWW is a multimicroprocessor implementation of a data modem receiver for high frequency (HF) radio. This design has often been quoted as the world's first multimicroprocessor based high frequency data modem. The work was done in the 1970s; however, interest in the results was revived both in the 1980s (due to technology impacts which enabled miniaturization) and in the 1990s (due to application impacts of wireless communications). The author, absolutely alone, took on all three activities (roles) defined above (one technician only helped with wire-wrapping, using the list prepared by the author), and brought the prototype to a performance success (the HF modem receiver provided better performance on a real HF medium, compared to the chosen competitor product), and to a market success (after the preparation for production was done by others: wire-wrap boards and older-date components were turned, by others, into printed-circuit boards and newer-date components) in less than two years (possible only with the enthusiasm of a novice). See the references in the reference section, as a pointer to details (these references are not the earliest ones, but the ones that convey the most information of interest for this book).
The second case study on the WWW is on a multimicroprocessor implementation of a GaAs systolic array for Gram-Schmidt orthogonalization (GSO). This design has often been quoted as the world's first GaAs systolic array. The work was done in the 1980s; interest in the results was not revived in the 1990s. The author took only the first two roles; the third one was taken by others (see the acknowledgment section), but never really completed, since the project was canceled before its full completion, due to enormous cost (a total of 8192 microprocessor nodes, each one running at a speed of 200 MHz). See the references in the reference section, as a pointer to details (these references are not the earliest ones, but the ones that convey the most information of interest for this book).
The third case study is on the implementation of a board (and the preceding research) which enables a personal computer (PC) to become a node in a distributed shared memory (DSM) multiprocessor of the reflective memory system (RMS) type. This design has often been quoted as the world's first DSM plug-in board for PC technology (some efforts with larger visibility came later; one of them, with probably the highest visibility [Gillett96], as an indirect consequence of this one). The work was done in the 1990s. The author took only the first role and was responsible for the project (details were taken care of by graduate students); fortunately, the project was completed successfully (and, what is more important for a professor, papers were published with timestamps prior to those of the competition). See the references in the reference section, as a pointer to details (these references are not the earliest ones, but the ones that convey the most information of interest for this book).
All three case studies have been presented in enough detail that interested readers (typically undergraduate students) can redesign the same product using state of the art technology. Throughout the book, the concepts/ideas and lessons/experiences are in the foreground; the technology characteristics and implementation details are in the background, and can be modified (updated) by the reader, if so desired. This book:
Milutinovic, V.,
Issues in Microprocessor and Multimicroprocessor Systems:
Lessons Learned,
is nicely complemented with other books of the same author. One of them is:
Milutinovic, V.,
Surviving the Design of a 200 MHz RISC Microprocessor:
Lessons Learned,
IEEE Computer Society Press, Los Alamitos, California, USA, 1997.
The above two books together (in various forms) have been used for about a decade now,
by the author himself, as the coursework material for two undergraduate courses that he has
taught at numerous universities worldwide. Also, there are three books on the more advanced topics, that have been used in graduate teaching on the follow-up subjects:
Protic, J., Tomasevic, M., Milutinovic, V.,
Tutorial on Distributed Shared Memory: Concepts and Systems,
IEEE Computer Society Press, Los Alamitos, California, USA, 1998.
Tartalja, I., Milutinovic, V.,
Tutorial on Cache Consistency Problem in Shared Memory Multiprocessors:
Software Solutions,
IEEE Computer Society Press, Los Alamitos, California, USA, 1997.
Tomasevic, M., Milutinovic, V.,
Tutorial on Cache Coherence Problem in Shared Memory Multiprocessors:
Hardware Solutions,
IEEE Computer Society Press, Los Alamitos, California, USA, 1993.
This book covers only the issues which are, in the opinion of the author, of strong interest for future design of microprocessors and multimicroprocessors on the chip, or the issues
which have impacted his opinion about future trends in microprocessor and multimicroprocessor design.
All presented issues have been treated selectively, with more attention paid to topics that
are believed to be of more importance. This explains the difference in the breadth and depth
of coverage throughout the book.
Also, the selected issues have been treated at various levels of detail. This was done intentionally, in order to create room for the creativity of the students. A typical homework assignment requires that the missing details be completed, and the inventiveness with which the students fulfill the requirement is sometimes unbelievable (the best student projects can be found on the author's web page). Consequently, one of the major educational goals of this book, if not the major one, is to help foster inventiveness among the students. Suggestions on how to achieve this goal more efficiently are more than welcome.
Finally, a few words on the educational approach used in this book. It is well known that "one picture is worth a thousand words." Consequently, the stress in this book has been placed on the utilization of modern presentation methodologies in general, and on figures and figure captions in particular. All necessary explanations have been put into the figures and figure captions. The main body of the text has been kept to its minimum: only the issues of interest for the global understanding of the topic and/or the thoughts on experiences gained and lessons learned. Consequently, students claim that this book is fast to read and easy to comprehend.
Important prerequisites for reading this book are [Flynn95], [Hennessy96], and [Patterson94] - the three most important textbooks in the fields of computer architecture, organization, and design.
In conclusion, this book teaches the concepts of importance for putting together a DSM
system on a single chip. It is well suited for graduate and advanced undergraduate students,
and as already mentioned, has been used as such at numerous universities worldwide. It is
also well suited for practitioners from industry (for innovation of their knowledge) and for
managers from industry (for better understanding of the future trends).
Veljko Milutinovic
[email protected]
http://galeb.etf.bg.ac.yu/~vm/
Acknowledgments
This book would not be possible without the help of numerous individuals; some of them helped the author to master the knowledge and to gather the experiences necessary to write this book; others have helped to create the structure or to improve the details. Since a book of this sort would not be possible if the author had not taken part in the three large projects defined in the preface, the acknowledgments will start from those involved in the same projects, directly or indirectly.
In relation to the first project (MISD for DFT), the author is thankful to professor Georgije Lukatela, from whom he has learned a lot, and also to his colleagues who worked on similar problems in the same or other companies (Radenko Paunovic, Slobodan Nedic, Milenko Ostojic, David Monsen, Philip Leifer, and John Harris).
In relation to the second project (SIMD for GSO), the author is thankful to professor Jose
Fortes who had an important role in the project, and also to his colleagues who were involved with the project in the same team or within the sponsor team (David Fura, Gihjung
Jung, Salim Lakhani, Ronald Andrews, Wayne Moyers, and Walter Helbig).
In relation to the third project (MIMD for RMS), the author is thankful to professor Milo
Tomasevic who has contributed significantly, and also to colleagues who were involved in
the same project, within his own team or within the sponsor team (Savo Savic, Milan Jovanovic, Aleksandra Grujic, Ziya Aral, Ilya Gertner, and Mark Natale).
The list of colleagues/professors who have helped with the overall structure and contents of the book, through formal or informal discussions, and direct or indirect advice, on one or more elements of the book, during seminars presented at their universities or during friendly chatting between conference sessions, or have influenced the author in other ways, includes but is not limited to the following individuals: Tihomir Aleksic, Vidojko
Ciric, Jack Dennis, Hank Dietz, Jovan Djordjevic, Jozo Dujmovic, Milos Ercegovac, Michael Flynn, Borko Furht, Jean-Luc Gaudiot, Anoop Gupta, Reiner Hartenstein, John Hennessy, Kai Hwang, Liviu Iftode, Emil Jovanov, Zoran Jovanovic, Borivoj Lazic, Bozidar
Levi, Kai Li, Oskar Mencer, Srdjan Mitrovic, Trevor Mudge, Vojin Oklobdzija, Milutin
Ostojic, Yale Patt, Branislava Perunicic, Antonio Prete, Bozidar Radenkovic, Jasna Ristic,
Eduardo Sanchez, Richard Schwartz, H.J. Siegel, Alan Jay Smith, Ljubisa Stankovic, Dusan Starcevic, Per Stenstrom, Daniel Tabak, Igor Tartalja, Jacques Tiberghien, Mateo Valero, Dusan Velasevic, and Dejan Zivkovic.
The list also includes numerous individuals from industry worldwide who have provided
support or have helped clarify details on a number of issues of importance: Tom Brumett,
Roger Denton, Charles Gimarc, Gordana Hadzic, Hans Hilbrink, Lee Hoevel, Petar Kocovic, Oleg Panfilov, Lazar Radicevic, Charles Rose, Djordje Rosic, Gad Sheaffer, Mark
Tremblay, Helmut Weber, and Maurice Wilkes.
Students have helped a lot to maximize the overall educational quality of the book.
Several generations of students have used the book before it went to press. Their comments
and suggestions were of extreme value. Those who deserve special credit are listed here:
Aleksandar Bakic, Jovanka Ciric, Dragana Cvetkovic, Goran Davidovic, Zoran
Dimitrijevic, Vladan Dugaric, Danko Djuric, Ilija Ekmecic, Damir Horvat, Igor
Ikodinovic, Milan Jovanovic, Predrag Knezevic, Dejan Krivokuca, Dusko Krsmanovic,
Petar Lazarevic, Davor Magdic, Darko Marinov, Gvozden Marinkovic, Boris Markovic,
Predrag Markovic, Aleksandar Milenkovic, Jelena Mirkovic, Milan Milicevic, Nenad
Nikolic, Milja Pesic, Dejan Petkovic, Zvezdan Petkovic, Milena Petrovic, Jelica Protic,
Milos Prvulovic, Bozidar Radunovic, Dejan Raskovic, Nenad Ristic, Andrej Skorc, Ivan
Sokic, Milan Trajkovic, Miljan Vuletic, Slavisa Zigic, and Srdjan Zgonjanin.
Veljko Milutinovic
[email protected]
http://galeb.etf.bg.ac.yu/~vm/
FACTS OF IMPORTANCE
As already indicated, this author believes that one efficient solution for the "one billion transistor chip" of the future is a complete distributed shared memory machine on a single chip, together with a number of specialized on-chip accelerators.
The eight sections to follow cover: (a) essential facts about the current microprocessor
architectures and (b) the seven major problem areas, to be resolved on the way to the final
goal stated above.
Microprocessor Systems
This chapter includes two sections. The section on basic issues covers the past trends in microprocessor technology and the characteristics of some contemporary microprocessor machines from the workstation market, namely Intel Pentium, Pentium MMX, Pentium Pro, and Pentium II/III, as the main driving forces of today's personal computing market. The section on advanced issues covers future trends in state of the art microprocessors.
1. Basic Issues
It is interesting to compare current Intel CISC type products (which drive the personal computer market today) with the RISC products of Intel and of other companies. At the time of this writing, the DEC Alpha family includes three representatives: 21064, 21164, and 21264.
The PowerPC family was initially devised by IBM, Motorola, and Apple, and includes a
series of microprocessors starting at PPC 601 (IBM name) or MPC 601 (Motorola name);
the follow-up projects have been referred to as 601, 602, 603, 604, 620, and 750. The SUN
Sparc family follows two lines: V.8 (32-bit machines) and V.9 (64-bit machines). The
MIPS Rx000 series started with R2000/3000, followed by R4000, R6000, R8000, R10000,
and R12000. Intel has introduced two different RISC machines: i960 and i860 (Pentium II
has a number of RISC features included at the microarchitecture level). The traditional
Motorola RISC line includes MC88100 and MC88110. The Hewlett-Packard series of
RISC machines is referred to as PA (Precision Architecture).
All comparative data, both for modern microprocessors and for microprocessors that have been sitting on our desks for years now, have been presented in the form of tables (manufacturer names and Internet URLs are given in Figure MPSU1). One has to be aware of the past before starting to look into the future.
The tables to follow include only data for microprocessors declared by their manufacturers as being (predominantly or partially) of the RISC type. Consequently, these tables do not include data on the Pentium and Pentium Pro. However, the Pentium and Pentium Pro are presented in more detail later in the text. Therefore, the tables can serve as the basis for comparison of the Pentium and Pentium Pro with other relevant products.
Company          Internet URL of microprocessor family home page
IBM              http://www.chips.ibm.com/products/powerpc/
Motorola         http://www.mot.com/SPS/PowerPC
DEC              http://www.europe.digital.com/semiconductor/alpha/alpha.htm
Sun              http://www.sun.com/microelectronics/products/microproc.html
MIPS             http://www.mips.com/products/index.html
Hewlett-Packard  http://hpcc920.external.hp.com/computing/framed/technology/micropro
AMD              http://www.amd.com/K6
Intel            http://www.intel.com/pentiumii/home.htm

Figure MPSU1: Microprocessor family home pages (source: [Prvulovic97])
Comment:
Listed URL addresses and their contents do change over time.
Figure MPSU2 compares the chip technologies. Figure MPSU3 compares selected architectural issues. Figure MPSU4 compares the instruction level parallelism and the count of the related processing units. Figure MPSU5 is related to cache memory, and Figure MPSU6 includes miscellaneous issues, like TLB (translation lookaside buffer) structures and branch prediction solutions.
The main reason for including Figure MPSU2 is to give readers more insight into the major technological aspects of the design, and their impact on the transistor count and clock speed. When the feature size goes down, the number of transistors goes up, at best quadratically, if the same or similar speed is to be maintained. That is the case with the DEC Alpha and HP Precision Architecture microprocessors. However, if the goal is to increase the speed dramatically, the transistor count increase will be smaller. That is the case with the SUN SPARC microprocessors. Until recently, pin count was one of the major bottlenecks of microprocessor technology, but over the past years the pin count problem has improved a lot. Another dramatic improvement is noticeable in the number of metal layers. The higher the number of metal layers, the larger the transistor count increase when the feature size decreases. Consequently, the most dramatic transistor count increases occur in the cases characterized (among other things) by a dramatic improvement in the number of metal layers (for example, DEC Alpha 21264 and HP Precision Architecture 8500).
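The quadratic relation between feature size and transistor count can be checked against the figures in Figure MPSU2. The helper below is a simplifying sketch: it assumes constant die area and layout efficiency, which real designs (larger dice, more metal layers) do not obey exactly.

```python
# Rough scaling rule from the text: at constant die area, the transistor
# count grows with the inverse square of the feature size. Constant area
# and layout efficiency are simplifying assumptions.

def scaled_transistor_count(old_count, old_feature_um, new_feature_um):
    """Predict the transistor count after a feature-size shrink (same die area)."""
    return old_count * (old_feature_um / new_feature_um) ** 2

# Alpha 21064: 1,680,000 transistors at 0.7 um; predict the 0.35 um generation.
predicted = scaled_transistor_count(1_680_000, 0.7, 0.35)
print(f"predicted: {predicted:,.0f}")  # 6,720,000
# The actual Alpha 21164 (0.35 um) holds 9,300,000 transistors: a larger die
# and more on-chip cache push the real count beyond the pure-shrink estimate.
```

The same sketch applied to the SPARC line shows the opposite trade-off: transistor counts there grow much more slowly than the shrink alone would allow, because the shrink was spent on clock speed instead.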
The main reason for including Figure MPSU3 is to give readers more insight into the number and type of on-chip resources for register storage and bus communications. The number of integer unit registers can be either medium (if flat organizations are used) or relatively large (if structured organizations are used). Flat organization means that all registers of the CPU are visible to the currently running program. Structured organization means that registers are organized into specific structures (oftentimes called windows), with only one structure element visible to the currently running program. All advanced microprocessors include a rename buffer. Rename buffers enable runtime resolution of data dependencies. The larger the rename buffer, the better the execution speed of programs. Consequently, the newer microprocessors feature considerable increases in the sizes of their rename buffers. The contribution of rename buffers in the floating-point execution stream is not as dramatic, so they are less often found in floating-point units. System buses (and external cache data buses, if implemented) are 64 or 128 bits wide.
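The way a rename buffer resolves data dependencies at runtime can be sketched with a toy model (a hypothetical illustration, not patterned on any specific microprocessor in Figure MPSU3): each new write to an architectural register is mapped to a fresh physical register, so two writes that reuse the same architectural name no longer conflict and can be in flight simultaneously.

```python
# Toy sketch of register renaming. A write-after-write hazard on an
# architectural register disappears once each write is mapped to a fresh
# physical register drawn from the rename buffer. A real CPU stalls when
# the buffer runs out; this toy model omits that.

class RenameBuffer:
    def __init__(self, n_arch, n_rename):
        self.free = list(range(n_arch, n_arch + n_rename))  # free rename registers
        self.map = {r: r for r in range(n_arch)}            # architectural -> physical

    def rename(self, instr):
        """instr = (dest, src1, src2), all architectural register numbers."""
        dest, src1, src2 = instr
        ps1, ps2 = self.map[src1], self.map[src2]  # sources read the current mapping
        pd = self.free.pop(0)                      # destination gets a fresh register
        self.map[dest] = pd                        # later readers of dest see pd
        return (pd, ps1, ps2)

rb = RenameBuffer(n_arch=32, n_rename=8)
# Two back-to-back writes to r1 end up in different physical registers.
print(rb.rename((1, 2, 3)))  # r1 <- r2 op r3; prints (32, 2, 3)
print(rb.rename((1, 4, 5)))  # r1 <- r4 op r5; prints (33, 4, 5)
```

The example also shows why a larger rename buffer helps: the longer the free list, the more not-yet-retired writes the processor can keep in flight before it must stall.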
Microprocessor   Company                Technology             Transistors   Frequency [MHz]   Package
PowerPC 601      IBM, Motorola          0.6 µm, 4 L, CMOS        2,800,000    80               304 PGA
PowerPC 604e     IBM, Motorola          0.35 µm, 5 L, CMOS       5,100,000   225               255 BGA
PowerPC 620*     IBM, Motorola          0.35 µm, 4 L, CMOS       7,000,000   200               625 BGA
PowerPC 750*     IBM, Motorola          0.29 µm, 5 L, CMOS       6,350,000   300               360 CBGA
Alpha 21064*     DEC                    0.7 µm, 3 L, CMOS        1,680,000   300               431 PGA
Alpha 21164*     DEC                    0.35 µm, 4 L, CMOS       9,300,000   500               499 PGA
Alpha 21264*     DEC                    0.35 µm, 6 L, CMOS      15,200,000   500               588 PGA
SuperSPARC       Sun Microelectronics   0.8 µm, 3 L, CMOS        3,100,000    60               293 PGA
UltraSPARC-I*    Sun Microelectronics   0.4 µm, 4 L, CMOS        5,200,000   200               521 BGA
UltraSPARC-II*   Sun Microelectronics   0.35 µm, 5 L, CMOS       5,400,000   250               521 BGA
R4400*           MIPS Technologies      0.6 µm, 2 L, CMOS        2,200,000   150               447 PGA
R10000*          MIPS Technologies      0.35 µm, 4 L, CMOS       6,700,000   200               599 LGA
PA7100           Hewlett-Packard        0.8 µm, 3 L, CMOS          850,000   100               504 PGA
PA8000*          Hewlett-Packard        0.35 µm, 5 L, CMOS       3,800,000   180               1085 LGA
PA8500*          Hewlett-Packard        0.25 µm, ? L, CMOS    >120,000,000   250               ??
MC88110          Motorola               0.8 µm, 3 L, CMOS        1,300,000    50               361 CBGA
AMD K6           AMD                    0.35 µm, 5 L, CMOS       8,800,000   233               321 PGA
i860 XP          Intel                  0.8 µm, 3 L, CHMOS       2,550,000    50               262 PGA
Pentium II       Intel                  0.35 µm, ? L, CMOS       7,500,000   300               242 SEC

Figure MPSU2: Microprocessor technology (sources: [Prvulovic97], [Stojanovic95])
Legend:
*    = 64-bit microprocessor (all others are 32-bit microprocessors);
x L  = x-layer metal (x = 2, 3, 4, 5, 6);
PGA  = pin grid array;
BGA  = ball grid array;
CBGA = ceramic ball grid array;
LGA  = land grid array;
SEC  = single edge contact;
DEC  = Digital Equipment Corporation;
AMD  = Advanced Micro Devices.
Comment:
Actually, this figure shows the strong and the not so strong sides of different manufacturers, as well as their
basic development strategies. Some manufacturers generate large transistor count chips which are not very
fast, and vice versa. Also, the pin count of chip packages differs, as well as the number of on-chip levels of
metal, or the minimal feature size. Note that the number of companies manufacturing general purpose microprocessors is relatively small.
The main reason for including Figure MPSU4 is to give readers more insight into the ways in which the features related to ILP (instruction-level parallelism) are implemented in popular microprocessors. The issue width goes up to 6 (the AMD K6 represents a sophisticated attempt to get the maximum out of ILP). To be able to execute all fetched operations in parallel, in all scenarios of interest, a relatively large number of execution units is needed. On average, this number is larger than the issue width. One exception is the AMD K6, with a relatively small number of execution units that are used efficiently, due to the sophisticated internal design of the AMD K6 [Shriver98].
Microprocessor | IU registers | FPU registers | VA | PA | EC Dbus | SYS Dbus
PowerPC 601 | 32×32 | 32×64 | 52 | 32 | none | 64
PowerPC 604e | 32×32+RB(12) | 32×64+RB(8) | 52 | 32 | none | 64
PowerPC 620 | 32×64+RB(8) | 32×64+RB(8) | 80 | 40 | 128 | 128
PowerPC 750 | 32×32+RB(12) | 32×64+RB(6) | 52 | 32? | 64, 128 | 64
Alpha 21064 | 32×64 | 32×64 | 43 | 34 | 128 | 128
Alpha 21164 | 32×64+RB(8) | 32×64 | 43 | 40 | 128 | 128
Alpha 21264 | 32×64+RB(48) | 32×64+RB(40) | ? | 44 | 128 | 64
SuperSPARC | 136×32 | 32×32* | 32 | 36 | none | 64
UltraSPARC-I | 136×64 | 32×64 | 44 | 36 | 128 | 128
UltraSPARC-II | 136×64 | 32×64 | 44 | 36 | 128 | 128
R4400 | 32×64 | 32×64 | 40 | 36 | 128 | 64
R10000 | 32×64+RB(32) | 32×64+RB(32) | 44 | 40 | 128 | 64
PA7100 | 32×32 | 32×64 | 64 | 32 | ? | ?
PA8000 | 32×64 | 32×64+RB(56) | 48 | 40 | 64 | 64
PA8500 | 32×64+RB(56) | 32×64 | 48 | 40 | 64 | 64
MC88110 | 32×32 | 32×80 | 32 | 32 | none | ?
AMD K6 | 8×32+RB(40) | 8×80 | 48 | 32 | 64 | 64
i860 XP | 32×32 | 32×32* | 32 | 32 | none | ?
Pentium II | ? | 8×80 | 48 | 36 | 64 | 64
Figure MPSU3: Microprocessor architecture (sources: [Prvulovic97], [Stojanovic95])
Legend:
IU - integer unit [number of registers × width in bits];
FPU - floating-point unit [number of registers × width in bits];
VA - virtual address [bits];
PA - physical address [bits];
EC Dbus - external cache data bus width [bits];
SYS Dbus - system bus width [bits];
RB - rename buffer [size expressed in the number of registers];
* Can also be used as a 16×64 register file.
Comment:
The number of integer unit registers shows the impact of the initial RISC research on the designers of a specific microprocessor. Only Sun Microsystems has opted for an extremely large register file, which is a sign of a direct or indirect impact of the Berkeley RISC research. In the other cases, smaller register files indicate preferences corresponding directly or indirectly to the Stanford MIPS research.
The main reason for including Figure MPSU5 is to give readers more insight into the cache memory design of popular microprocessors. L1 refers to the level-1 cache memory and L2 refers to the level-2 cache memory. L1 is closer to the CPU and smaller than L2, typically by one order of magnitude in capacity. Since it is smaller, it is faster, and fewer clock cycles are needed to access it. Commercial designs typically satisfy the principle of inclusion, which means that all data in L1 are also present in L2. Often, L1 is on the same chip as the CPU. In more recent microprocessors, both L1 and L2 are on the same chip as the CPU; DEC Alpha was the first microprocessor to include both L1 and L2 on the same chip as the CPU. In the case of the Pentium II, L1 and L2 are on two different chips, but in the same package. The L1 cache is typically divided into separate instruction and data caches, while L2 typically holds both instructions and data in one cache.
The main reason for including Figure MPSU6 is to give readers more insight into the design of the TLB (translation look-aside buffer) and the BPS (branch prediction strategy) in popular microprocessors. The number of TLB entries goes up to 256. Most microprocessors have separate TLBs for instructions and data; exceptions are the Intel i860, Sun SPARC, and Hewlett-Packard PA8500. As far as BPS is concerned, typically only the less sophisticated techniques are used (2-bit counter and return address stack). The only exception is the DEC Alpha 21264 (two-level multi-hybrid). Detailed explanations of these and more sophisticated techniques are given later in this book, in the part on branch prediction strategies.
Microprocessor | L1 Icache, KB | L1 Dcache, KB | L2 cache, KB
PowerPC 601 | 32, 8WSA, UNI | |
PowerPC 604e | 32, 4WSA | 32, 4WSA |
PowerPC 620 | 32, 8WSA | 32, 8WSA | *
PowerPC 750 | 32, 8WSA | 32, 8WSA | *
Alpha 21064 | 8, DIR | 8, DIR | *
Alpha 21164 | 8, DIR | 8, DIR | 96, 3WSA*
Alpha 21264 | 64, 2WSA | 64, DIR | *
SuperSPARC | 20, 5WSA | 16, 4WSA |
UltraSPARC-I | 16, 2WSA | 16, DIR | *
UltraSPARC-II | 16, 2WSA | 16, DIR | *
R4400 | 16, DIR | 16, DIR | *
R10000 | 32, 2WSA | 32, 2WSA | *
PA7100 | 0 | 0 | **
PA8000 | 0 | 0 | **
PA8500 | 512, 4WSA | 1024, 4WSA |
MC88110 | 8, 2WSA | 8, 2WSA |
AMD K6 | 32, 2WSA | 32, 2WSA | *
i860 XP | 16, 4WSA | 16, 4WSA |
Pentium II | 16, ? | 16, ? | 512, ?***
Figure MPSU5: Microprocessor cache memory (sources: [Prvulovic97], [Stojanovic95])
Legend:
Icache - on-chip instruction cache;
Dcache - on-chip data cache;
L2 cache - on-chip L2 cache;
DIR - direct mapped;
xWSA - x-way set associative (x = 2, 3, 4, 5, 8);
UNI - unified L1 instruction and data cache;
* - on-chip cache controller for external L2 cache;
** - on-chip cache controller for external L1 cache;
*** - L2 cache is in the same package, but on a different silicon die.
Comment:
It is only an illusion that the early HP microprocessors lag behind the others as far as on-chip cache support is concerned; they use the so-called on-chip assist cache, which can be treated as a zero-level cache memory, and works on slightly different principles than a traditional cache (as will be explained later in this book). On the other hand, DEC was the first to place both the level-1 and the level-2 cache on the same chip with the CPU.
Microprocessor | ITLB | DTLB | BPS
PowerPC 601 | 256, 2WSA, UNI | | *
PowerPC 604e | 128, 2WSA | 128, 2WSA | 512×2BC
PowerPC 620 | 128, 2WSA | 128, 2WSA | 2048×2BC
PowerPC 750 | 128, 2WSA | 128, 2WSA | 512×2BC
Alpha 21064 | 12 | 32 | 4096×2BC
Alpha 21164 | 48 ASSOC | 64 ASSOC | ICS×2BC
Alpha 21264 | 128 ASSOC | 128 ASSOC | 2LMH, 32×RAS
SuperSPARC | 64 ASSOC, UNI | | ?
UltraSPARC-I | 64 ASSOC | 64 ASSOC | ICS×2BC
UltraSPARC-II | 64 ASSOC | 64 ASSOC | ICS×2BC
R4400 | 48 ASSOC | 48 ASSOC |
R10000 | 64 ASSOC | 64 ASSOC | 512×2BC
PA7100 | 16 | 120 | ?
PA8000 | 4 | 96 | 256×3BSR
PA8500 | 160, UNI | | >256×2BC
MC88110 | 40 | 40 | ?
AMD K6 | 64 | 64 | 8192×2BC, 16×RAS
i860 XP | 64, UNI | | ?
Pentium II | ? | ? | ?
Figure MPSU6: Miscellaneous microprocessor features (source: [Prvulovic97])
Legend:
ITLB - translation lookaside buffer for code [entries];
DTLB - translation lookaside buffer for data [entries];
2WSA - two-way set associative;
ASSOC - fully associative;
UNI - unified TLB for code and data;
BPS - branch prediction strategy;
2BC - 2-bit counter;
3BSR - three-bit shift register;
RAS - return address stack;
2LMH - two-level multi-hybrid (gshare for the last 12 branch outcomes and pshare for the last 10 branch outcomes);
ICS - instruction cache size (a 2BC for every instruction in the instruction cache);
* - hinted instructions available for static branch prediction.
Comment:
The great variety in TLB sizing is a consequence of the fact that different manufacturers judge differently the real benefit of a TLB of a given size. Grouping of pages, so that one TLB entry covers a number of pages, has been used by DEC and viewed as a viable price/performance trade-off. Variable page size was first used by the MIPS Technologies machines.
The following sections give a closer look at the Intel Pentium, Pentium MMX, Pentium Pro, and Pentium II/III machines.
1.1. Pentium
The major highlights of the Pentium include the features that distinguish it from the i486. The processor is built out of 3.1 MTr (million transistors) using Intel's 0.8 µm BiCMOS silicon technology. It is packaged in a 273-pin PGA (Pin Grid Array) package, as indicated in Figure MPSU7.
Figure MPSU7: Pentium 273-pin PGA package, top view (the original figure shows the pin grid with its row and column coordinates).
Power and ground pins are marked in Figure MPSU7, and that is the main reason for showing this figure: to underline the fact that the departure of power and ground levels from their ideal values is a major issue. With more power and ground pins, the on-chip logic gets power and ground levels with less departure from the ideal.
Pentium pin functions are shown in Figure MPSU8. The major reason for showing the pin functions is to underline the fact that, as far as the main pin functions are concerned, there is no major difference between the Pentium and the earliest members of the x86 family.
Function | Pins
Clock | CLK
Initialization | RESET, INIT
Address Bus | A31-A3, BE7#-BE0#
Address Mask | A20M#
Data Bus | D63-D0
Address Parity | AP, APCHK#
Data Parity | DP7-DP0, PCHK#, PEN#
Internal Parity Error | IERR#
System Error | BUSCHK#
Bus Cycle Definition | M/IO#, D/C#, W/R#, CACHE#, SCYC, LOCK#
Bus Control | ADS#, BRDY#, NA#
Page Cacheability | PCD, PWT
Cache Control | KEN#, WB/WT#
Cache Snooping/Consistency | AHOLD, EADS#, HIT#, HITM#, INV
Cache Flush | FLUSH#
Write Ordering | EWBE#
Bus Arbitration | BOFF#, BREQ, HOLD, HLDA
Interrupts | INTR, NMI
Floating Point Error Reporting | FERR#, IGNNE#
System Management Mode | SMI#, SMIACT#
Functional Redundancy Checking | FRCMC# (IERR#)
TAP Port | TCK, TMS, TDI, TDO, TRST#
Breakpoint/Performance Monitoring | PM0/BP0, PM1/BP1, BP3-BP2
Execution Tracing | BT3-BT0, IU, IV, IBT
Probe Mode | R/S#, PRDY
Figure MPSU8: Pentium pin functions (source: [Intel93])
Legend:
TAP - processor boundary scan.
Comment:
Traditional pin functions are clock, initialization, addressing, data, bus control, bus arbitration, and interrupts. These or similar functions can be found all the way back to the earliest x86 machines. Pin functions like parity, error control, cache control, tracing, breakpoint, and performance monitoring can be found in the immediate predecessors, in a somewhat reduced form.
The Pentium is fully binary compatible with the previous Intel machines in the x86 family. Some of the above-mentioned enhancements are supported with new instructions. The MMU (Memory Management Unit) is fully compatible with the i486, while the FPU (Floating-Point Unit) has been redesigned for better performance.
A block diagram of the Pentium processor is shown in Figure MPSU9. The figure is important because it shows elements that are expected to be further explored in future microprocessors. For example, the caches are the kernel of the later support for SMP and DSM, the U and V pipelines are the kernel of the later support for ILP, etc.
(Diagram, Figure MPSU9: the major blocks of the Pentium are a 256-entry branch target buffer and a TLB feeding an 8-KByte code cache; prefetch buffers; instruction decode, instruction pointer, branch verification and target address logic, and a control unit with control ROM; two address-generate units and two ALUs organized as the U and V pipelines, working out of the integer register file with a barrel shifter; a page unit; a floating-point unit with add, divide, and multiply units and an 80-bit register file; an 8-KByte data cache with its TLB; and a bus unit connecting the 64-bit data bus and the 32-bit address bus.)
The core of the processor is the pipeline structure, which is shown in Figure MPSU10 side by side with the pipeline structure of the i486. A precise description of the activities in each pipeline stage can be found in [Intel93]. As can be seen from Figure MPSU10, the Pentium is a superscalar processor (which is an important departure from the i486); however, the depth of the Pentium pipeline has not changed (compared to the i486). Internal error detection is based on FRC (Functional Redundancy Checking), BIST (Built-In Self Test), and PTC (Parity Testing and Checking). Constructs for performance monitoring count occurrences of selected internal events and trace execution through the internal pipelines.
(Diagram: the Intel486 pipeline has the stages PF, D1, D2, EX, and WB, with one instruction (I1, I2, I3, ...) advancing per stage per clock. The Pentium pipeline has the same five stages, but two instructions (I1/I2, I3/I4, ...) advance per stage per clock, one in the U pipe and one in the V pipe.)
Figure MPSU10: Intel 486 pipeline versus Pentium pipeline (source: [Intel93])
Legend:
PF - prefetch;
D1/D2 - decoding 1/2;
EX - execution;
WB - writeback.
Comment:
This block diagram sheds light on the superscaling of Pentium processor, in comparison with the i486 processor. The depth of the pipeline has not changed; only the width. This is a consequence of the fact that technology has changed drastically in the sense of on-chip transistor count, and minimally in the sense of off-chip-toon-chip delay ratio.
(Diagram: generation of cacheability controls during paging. The linear address (DIRECTORY, TABLE, and OFFSET fields) is translated through CR3, the page directory, and the optional page table to the page frame, and the PCD and PWT bits are picked up at each level. Combined with the PG (paging enable) and CD (caching disable) bits of CR0, the TR12.3 CI cache inhibit, and unlocked memory reads, these bits drive the WB/WT# (cache transition-to-E-state enable), KEN# (cache line fill enable), and CACHE# (writeback cycle) pins.)
Allocation:
  prime - ecx
  k     - edx
  FALSE - al

Assembly code:

inner_loop:
    mov byte ptr flags[edx], al    ; flags[k] = FALSE
    add edx, ecx                   ; k = k + prime
    cmp edx, SIZE
    jle inner_loop                 ; loop while k <= SIZE
1.1.4. Input/Output
Organization of the interrupt mechanism is a feature which is of importance for incorporation of an off-the-shelf microprocessor into microprocessor and multimicroprocessor systems. Interrupts inform the processor or the multiprocessor of the occurrence of external
asynchronous events. External interrupt related details are specified in Figure MPSU13.
In the Pentium, the instruction boundary is at the first clock in the execution stage of the instruction pipeline.
The 8-KB data cache is reconfigurable on a line-by-line basis as a write-back or write-through cache. In the write-back mode, it fully supports the MESI cache consistency protocol. Parts of data memory can be made non-cacheable, either by software action or by external hardware. The 8-KB instruction cache is inherently write-protected, and supports the SI (Shared/Invalid) protocol. The data cache includes two state bits per line to support the MESI protocol; the instruction cache includes one state bit to support the SI protocol. Operating modes of the two caches are controlled by two bits in the register CR0: CD (Cache Disable) and NW (Not Write-through). System reset makes CD = NW = 1. The best performance is potentially obtained with CD = NW = 0. The organization of the code and data caches is shown in Figure MPSU14.
(Diagram: both caches are two-way set associative. In the data cache, each set holds an LRU bit and, per way (WAY 0 and WAY 1), a MESI state field and a TAG address. In the instruction cache, each set holds an LRU bit and, per way, a state bit (S or I) and a TAG address.)
Figure MPSU14: Organization of instruction and data caches (source: [Intel93])
Legend:
MESI - Modified/Exclusive/Shared/Invalid;
LRU - Least Recently Used.
Comment:
This figure stresses the fact that the Pentium processor uses a two-way set associative cache memory. In addition to the tag address bits, two more bits are needed to specify the MESI state (in the data cache), while one more bit is needed to specify the SI state (in the instruction cache).
A special snoop (inquire) cycle is used to determine whether a line (with a specific address) is present in the code or data cache. If the line is present and is in the M (Modified) state, the processor (rather than memory) has the most recent data and must supply it. The on-chip caches can be flushed by external hardware (input pin FLUSH# low) or by internal software (instructions INVD and WBINVD). WBINVD causes the modified lines in the internal data cache to be written back, and all lines in both caches to be marked invalid. INVD causes all lines in both the data and code caches to be invalidated, without any writeback of modified lines in the data cache.
As already indicated, each line in the Pentium processor data cache is assigned a state,
according to a set of rules defined by the MESI protocol. These states tell whether a line is
valid or not (I = Invalid), if it is available to other caches or not (E = Exclusive or
S = Shared), and if it has been modified in comparison to memory or not (M = Modified).
An explanation of the MESI protocol is given in Figure MPSU15. The data cache state
transitions on read, write, and snoop (inquire) cycles are defined in Figures MPSU16,
MPSU17, and MPSU18, respectively.
M - Modified:
An M-state line is available in ONLY one cache, and it is also MODIFIED (different from main memory). An M-state line can be accessed (read/written to) without sending a cycle out on the bus.
E - Exclusive:
An E-state line is also available in only one cache in the system, but the line is not MODIFIED (i.e., it is the same as main memory). An E-state line can be accessed (read/written to) without generating a bus cycle. A write to an E-state line will cause the line to become MODIFIED.
S - Shared:
This state indicates that the line is potentially shared with other caches (i.e., the same line may exist in more than one cache). A read to an S-state line will not generate bus activity, but a write to a SHARED line will generate a write-through cycle on the bus. The write-through cycle may invalidate this line in other caches. A write to an S-state line will update the cache.
I - Invalid:
This state indicates that the line is not available in the cache. A read to this line will be a MISS, and may cause the Pentium processor to execute a LINE FILL. A write to an INVALID line causes the Pentium processor to execute a write-through cycle on the bus.
Figure MPSU15: Definition of states for the MESI and the SI protocols (source: [Intel93])
Legend:
LINE FILL - fetching the whole line into the cache from main memory.
Comment:
This figure gives only a precise description of the MESI and the SI protocols. Detailed explanations of the rationale behind them, and more, are given later in this book (in the section on caching in shared memory multiprocessors).
Pin Activity | Next State | Description
N/A | M | Read hit; data is provided to the processor core by the cache. No bus cycle is generated.
N/A | E | Read hit; data is provided to the processor core by the cache. No bus cycle is generated.
N/A | S | Read hit; data is provided to the processor core by the cache. No bus cycle is generated.
CACHE# low AND KEN# low AND WB/WT# high AND PWT low | E | Data item does not exist in the cache (MISS). A bus cycle (read) will be generated by the Pentium processor. This state transition will happen if WB/WT# is sampled high with the first BRDY# or NA#.
CACHE# low AND KEN# low AND (WB/WT# low OR PWT high) | S | Same as the previous read miss case, except that WB/WT# is sampled low with the first BRDY# or NA#.
CACHE# high AND KEN# high | I | KEN# pin inactive; the line is not intended to be cached in the Pentium processor.
Figure MPSU16: Data cache state transitions for UNLOCKED Pentium processor initiated read cycles* (source: [Intel93])
Legend:
* Locked accesses to data cache will cause the accessed line to transition to Invalid state.
Comment:
For more details see the section on caching in shared memory multiprocessors and the reference [Tomasevic93]. In comparison with a theoretical case, this practical case includes a number of additional state transition conditions, related to pin signals.
Present State | Pin Activity | Next State | Description
M | N/A | M | Write hit; update the data cache. No bus cycle generated to update memory.
E | N/A | M | Write hit; update cache only. No bus cycle generated; the line is now MODIFIED.
S | PWT low AND WB/WT# high | E | Write hit; data cache updated with the write data item. A write-through cycle is generated on the bus to update memory and/or invalidate the contents of other caches. The state transition occurs after the write-through cycle completes on the bus (with the last BRDY#).
S | PWT low AND WB/WT# low | S | Same as the above case of a write to an S-state line, except that WB/WT# is sampled low.
S | PWT high | S | Same as the above cases of writes to S-state lines, except that this is a write hit to a line in a write-through page; the status of the WB/WT# pin is ignored.
I | N/A | I | Write MISS; a write-through cycle is generated on the bus to update external memory. No allocation done.
Figure MPSU17: Data cache state transitions for UNLOCKED Pentium processor initiated write cycles* (source: [Intel93])
Legend:
WB/WT# - writeback/write-through.
* Locked accesses to data cache will cause the accessed line to transition to Invalid state.
Comment:
For more details see the section on caching in shared memory multiprocessors and the reference [Tomasevic93]. In comparison with a theoretical case, this practical case includes a number of additional state transition conditions, related to pin signals.
Present State | Next State (INV=1) | Next State (INV=0) | Description
M | I | S | Snoop hit to a MODIFIED line, indicated by the HIT# and HITM# pins low. The Pentium processor schedules the writing back of the modified line to memory.
E | I | S | Snoop hit indicated by the HIT# pin low; no bus cycle generated.
S | I | S | Snoop hit indicated by the HIT# pin low; no bus cycle generated.
I | I | I | Address not in the cache; HIT# pin high.
Figure MPSU18: Data cache state transitions during inquire cycles (source: [Intel93])
Legend:
INV - invalidation bit (INV pin).
Comment:
For more details see the section on caching in shared memory multiprocessors and the reference [Tomasevic93]. A good exercise for the reader is to create a state diagram using the data from this table and the previous two tables.
The appearance of MMX can be treated as proof of the validity of the opinion that accelerators will play an important role in future machines. The MMX subsystem can be treated as an on-chip accelerator.
Note: AGU in the larger block should be replaced with IFU.
speeds. For the most recent speed comparison data, the interested reader is referred to Intel's WWW presentation (http://www.intel.com/).
An important element of the Pentium II and the Pentium Pro is the DIB (Dual Independent Bus) structure. As indicated in Figure MPSU21, one of the two DIB buses goes towards the L2 cache; the other one goes towards the DRAM (Dynamic RAM) main memory, the I/O, and the other parts of the system. The only difference between the Pentium Pro and the Pentium II processors, with respect to their L2 caches, is that the Pentium Pro implemented the L2 in the same package as the CPU, while the Pentium II implements its L2 on a cartridge with a single-edge connector. In fact, the Pentium Pro's L2 is accessible at the full CPU speed of 200 MHz; the Pentium II's L2 runs at only half the speed of the CPU (300 MHz divided by 2, or 150 MHz), and is therefore more of a bottleneck than in the original design. (This was done to enable the use of commodity SRAMs instead of the custom SRAMs needed by the Pentium Pro, and to obviate the need for the Pentium Pro's expensive ceramic two-die package.)
(Diagram, Figure MPSU21: SB (Pentium) versus DIB (Pentium Pro/Pentium II). In the SB structure, the CPU, the L2 cache, and the CLC chipset leading to the DRAM memory and I/O all share one bus. In the DIB structure, one bus connects the CPU to the L2 cache, while a separate bus connects the CPU through the CLC chipset to the DRAM memory and I/O.)
During Q1/99 Intel presented a new microprocessor, the Pentium III (code name "Katmai"). It operates at 450 MHz and 500 MHz. It is compatible with the previous generation of Intel Pentium II processors, and the major improvements are: (a) SSE (Streaming SIMD Extensions), and (b) the chip ID. The SSE adds 70 new instructions intended mainly for 3D graphics, digital image and sound processing, voice recognition, etc. According to Intel, due to an efficient hardware implementation, these new instructions improve the execution time of application code by more than 50%.
The main idea behind the chip ID is, according to Intel, improving e-commerce over the Internet by avoiding unnecessary manually entered passwords. Instead, the processor introduces itself with its unique chip ID on e-commerce sites. Discussion about this new feature is still going on; Intel has an option to disable this feature, unless a customer explicitly requests it.
2. Advanced Issues
One of the basic issues of importance in microprocessing is the correct prediction of future development trends. The facts presented here have been adopted from a talk by an Intel researcher [Sheaffer96], and represent a possible path for Intel in future developments after the year 2000 (the talk summarizes the views of several Intel VIPs).
Quite soon after the year 2000, the on-chip transistor count will reach about 1 GTr (one billion transistors), using a minimum lithographic dimension of less than 0.1 µm (micrometers) and a gate oxide thickness of less than 40 Å (angstroms). Consequently, microprocessors on a single chip will be faster. Their clock rate is expected to be about 4 GHz (gigahertz), to enable an average speed of about 100 BIPS (billion instructions per second). This means that the microprocessor speed, measured using standard benchmarks, may reach about 100 SPECint95 (which is about 3500 SPECint92).
In such conditions, the basic issue is how to design future microprocessors in order to maintain the existing trend of approximately doubling the performance every 18 months. Hopefully, some of the answers will be clear once the reading of this book is completed.
As far as process technology is concerned, the delay trends are given in Figure MPSS1, and the area trends in Figure MPSS2, for the case of Intel products. The position of an ongoing project can be estimated using extrapolation. Frequency of operation and transistor count represent alternative ways to express performance and complexity. Related data for the line of Intel products are given in Figure MPSS3 and Figure MPSS4.
Figure MPSS3 can be used to extrapolate the future trends at Intel, as far as the clock frequency is concerned. However, the ultimate goal is not a fast clock, but a fast application. Therefore, the speed of the clock has to be carefully examined against how much work the microprocessor can do in one clock cycle, which is an architectural issue. Note that the dots for PPro-225 and PPro-300 are located above the position that would be obtained by linear extrapolation. In other words, the trend at Intel shows elements of superlinearity. In reality, however, it is most of the time the opposite: curves like the one in Figure MPSS3 typically show saturation, as indicated in [Becker98] and [Hartenstein97].
Figure MPSS4 can be used to extrapolate the future trends at Intel, as far as the transistor count per die is concerned. Note that in some cases the microprocessor package includes two or more dies, while Figure MPSS4 refers only to the number of transistors per single die. Again, as in the case of Figure MPSS3, the trend at Intel shows elements of superlinearity.
Generally, modern microprocessors have been classified into two basic groups: (a) brainiacs (lots of work accomplished during a single clock cycle, but a slow clock), and (b) speedemons (a fast clock, but not much work accomplished during a single clock cycle). Figure MPSS5 can be treated as proof of the statement that Intel microprocessors belong in between the two extreme groups. It looks like Intel processors started as brainiacs; however, over time they have taken a course which is getting closer to the speedemons. Figure MPSS6 compares the time budget of brainiacs and speedemons.
Figure MPSS1: Process technology delay trends (source: [Sheaffer96]). Delay [ps], on a logarithmic scale from 1 to 1000, is plotted against the technology generation (1.50, 1.00, 0.80, 0.60, 0.35, 0.25, 0.18, and 0.10 µm) for the transistor and for a 2-mm Metal 2 wire.
Figure MPSS2: Area trends (source: [Sheaffer96]). Die areas of the Intel386 DX, Intel486 DX, Pentium, and Pentium Pro processors are shown across the 1.5, 1.0, 0.8, 0.6, 0.35, and 0.25 µm technology generations (the newest points estimated).
(Plot: frequency [MHz], on a logarithmic scale from 1 to 1000, against the year, 1970-2000, with data points for the 8080, 8085, 8086, 8088, 80286, 386DX-16, 386DX-33, 486DX-25, 486DX-33, 486DX-50, 486DX2-66, PP-66, PP-90, PPro-225, and PPro-300.)
Figure MPSS3: Microprocessor chip operation frequency (source: [Sheaffer96])
Legend:
PP - Pentium Processor;
PPro - Pentium Pro Processor.
Comment:
The raw speed of microprocessors increases at a steady rate of about one order of magnitude per decade. Note, however, that what is important is not the speed of the clock (raw speed), but the time to execute a useful application (semantic speed).
(Plot: transistors per die, on a logarithmic scale from 1 to 1,000,000,000, against the year, 1965-2000. The memory curve runs through the 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, and 64M generations; the microprocessor curve runs through the 4004, 8080, 8086, 80286, Intel386, Intel486, Pentium, and Pentium Pro.)
Figure MPSS4: Microprocessor and memory complexity (source: [Sheaffer96])
Comment:
The fact that the memory curve is above the processor curve tells us that highly repetitive structures pack more transistors onto the same die area. The distance between the two curves has a slightly increasing trend, which means that the higher the complexity, the more difficult it is to fit a given irregular structure onto the same die area.
(Plot: SPECint92/MHz, from 0 to about 1.5, against clock frequency, 0-400 MHz. The PowerPC line sits in the brainiac region, the Alpha line (21064, 21164, 21264) in the speedemon region, and the x86 line (Pentium, Pentium Pro) in between.)
Figure MPSS5: Microprocessor sophistication (source: [Sheaffer96])
Comment:
Brainiacs tend to have a slower clock; however, on average, they accomplish more during a single clock period. On the other hand, speedemons are characterized by an extremely fast clock and little work done during each clock period. Of course, what counts is the speed of the compiled high-level language code (semantic speed). Consequently, an architecture is superior if it has been designed wisely, not if it has been designed as a brainiac or a speedemon. Typically, the design path in between the two extremes has the highest success expectations. Maybe that explains the decision of Intel to select the medium approach in between the two extremes.
From Figure MPSS6 one can see that, in the case of brainiacs, most of the clock cycles are used to perform arithmetic, logical, or related operations. On the other hand, in the case of speedemons, a good portion of the time is used to manage the memory more efficiently.
In principle, there are two educational approaches: (a) from the general to the specific (analysis), and (b) from the specific to the general (synthesis). Both approaches have their merits. In this book, the second of the two is used: the Intel case study is presented first (in the text so far), and the future architectural developments are discussed next (in the text to follow).
Finally, a frequent question is how to select the optimal microprocessor for a given application. There are two answers to that question. One answer implies that one can select only among the existing microprocessors on the market. In such a case, non-technical reasons for selection may be more relevant than technical reasons. The other answer implies that one can design and implement a new microprocessor, in which case the question translates into "what is the best architecture to select for the new design." This book gives an answer for the case when the goal is to design a DSM on a single chip.
(Diagram: stacked time-budget bars, divided among branch misprediction, instruction fetch, execution, L1 (data), L2 (data), and memory (data); the execution share dominates for brainiacs, while the speedemon budget also includes an L2 (data) share.)
Figure MPSS6: Microprocessor time budget (source: [Sheaffer96])
Legend:
L1/L2 - first/second-level cache.
Comment:
Brainiacs use more on-chip resources to enrich the execution unit; consequently, there is often no room for an L2 cache on the CPU chip. On the other hand, speedemons were the first to include L2 on the same chip as the CPU, which is only partially a consequence of the fact that their execution unit is simpler.
* * *
The author and his associates were active in microprocessor design, and also in several designs of potential accelerators for mission-critical applications. For details see [Davidovic97, Fortes86, Helbrig89, Milutinovic78, Milutinovic79, Milutinovic80a, Milutinovic80b, Milutinovic80c, Milutinovic84, Milutinovic85a, Milutinovic85b, Milutinovic85c, Milutinovic86a, Milutinovic86b, Milutinovic86c, Milutinovic87a, Milutinovic87b, Milutinovic88a, Milutinovic95b, Raskovic95, and Vuletic97].
3. Problems
1. Explain why the higher the number of metal layers means the larger transistor count
increase when the feature size decreases. Explain the impacts.
2. Compare the transistor count invested (by various manufacturers) into the IU and
FPU registers of Figure MPSU3. Interpret the differences.
3. Calculate the ratio of the number of execution units and the issue width for all microprocessors listed in Figure MPSU4. Interpret the differences.
4. Study the details of AMD K6 and explain the essence of the features for which you
believe are the major contributors to its excellent performance. Suggest further improvements.
5. When you compare DEC and HP microprocessors, how many transistors were saved
in HP processors, by having only the assist cache, rather than both L1 and L2 caches
on the same chip as the CPU? Create a small application example (a short routine)
which shows the effectiveness of the assist cache of HP microprocessors, comparatively with DEC microprocessors.
6. Compare the L2 to L1 capacity ratio (Rc) for the DEC Alpha 21264, the Intel Pentium II,
and a few commercial board designs (that one can find on the Internet) using other microprocessors listed in Figure MPSU2. How does Rc for the DEC Alpha 21264 and the Intel Pentium II compare with Rc for the commercial boards that you have analyzed?
7. What are the pros and cons of having separate TLBs for instructions and data? Compare with unified TLBs.
8. Using Figure MPSU9, explain the possible ways to integrate the MMX accelerator. Explain the pros and cons of the different schemes.
9. Consult the open literature and describe precisely (in the time domain) all 7 external
interrupts listed in Figure MPSU13.
10. Suggest architectural add-ons that would make Pentium more efficient when used in
SMP systems. Do the same for the case of DSM systems.
ISSUES OF IMPORTANCE
The seven sections to follow cover seven problem areas of importance for future microprocessors on a single VLSI chip, keeping in mind the ultimate goal advocated in this
book: an entire DSM on a single chip, together with a number of simple or complex accelerators.
1. Basic Issues
This section gives only three examples and the elementary definitions, since the basic
cache concepts are widely known (the text to follow assumes that the basic cache concepts
are well understood, which is one of the prerequisites for reading this book).
The major issue in cache design is how to design a close-to-the-CPU memory which is
fast enough and large enough, yet cheap enough. In other words, the major problem
here is not technology but techno-economics, under conditions of constantly changing technology costs and characteristics (which some researchers sometimes forget).
The solution to the problem defined above, which includes elements of both technology and economics, is cache memory. This solution became possible after it was recognized that programs do not access their data in a completely random way: the
principle of locality is inherent to data and code access patterns. The discussion to follow
assumes the processor-memory organization of Figure CACU1 and the related access patterns.
Figure CACU1: On-chip cache (source: [Tanenbaum90])
Legend:
CAC: cache and cache hierarchy.
Comment:
This figure presents a simple processor/memory model used to explain the caching approaches in the figures
to follow. Note that some newer microprocessor system architectures depart from this simple processor/memory model (e.g., the dual independent bus structure of the Intel Pentium II). However, the difference does
not affect the essence of caching.
Three cache organizations are possible: (a) associative, as one extreme; (b) set-associative, as a combined solution; and (c) direct-mapped, as the other extreme. In all these
cases, the memory capacity is assumed to be 2^m bytes (m is a relatively large integer),
divided into consecutive blocks of b bytes (b is also a power of two).
Figure CACU2: An associative cache with 1024 lines and 4-byte blocks
(source: [Tanenbaum90])
Comment:
In a fully-associative cache, a block can never be purged from the cache merely because a
subset of its address bits matches the same subset of the address bits of a newly referenced block. Consequently, if the replacement algorithm is properly designed, the block purged is always the block
least likely to be referenced again.
In the example of Figure CACU2, memory is byte-addressable, and one block occupies
four bytes. Consequently, the Block# field is 22 bits wide, because the memory addresses
are 24 bits wide. If the Valid bit is zero, the contents of the Block# and Value fields are irrelevant. The Value field is 32 bits wide.
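The lookup of Figure CACU2 can be sketched in a few lines. The field widths below (24-bit addresses, 4-byte blocks, hence a 22-bit block number) follow the figure; the class and the replacement policy are illustrative only, not a hardware description:

```python
# Sketch of a fully-associative cache lookup, as in Figure CACU2:
# 24-bit byte addresses, 4-byte blocks, so the Block# field is the upper 22 bits.
# A dict models the parallel comparison of all valid tags; illustrative only.

class AssociativeCache:
    def __init__(self, lines=1024):
        self.lines = lines
        self.store = {}          # block number -> cached value

    def lookup(self, address):
        block = address >> 2     # drop the 2 byte-offset bits: 22-bit block number
        if block in self.store:  # in hardware, every valid entry is compared at once
            return self.store[block]
        return None              # miss

    def fill(self, address, value):
        if len(self.store) >= self.lines:
            self.store.pop(next(iter(self.store)))  # placeholder replacement policy
        self.store[address >> 2] = value

cache = AssociativeCache()
cache.fill(0x000224, 137)        # any block can occupy any line
print(cache.lookup(0x000224))    # hit: 137
print(cache.lookup(0x000228))    # miss: None
```

Because any block can occupy any line, a lookup must compare the incoming block number against every valid entry, which is what makes fully-associative tag arrays expensive in hardware.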
Figure CACU3: A set-associative cache with n entries per line (source: [Tanenbaum90])
Comment:
In a set-associative cache, the benefits of full associativity are present within a set. Since the block
count of a set is considerably smaller than the block count of the entire cache, non-optimal block replacement
is a reality that impacts performance.
Figure CACU4: A direct-mapped cache with 1024 4-byte lines and a 24-bit address [Tanenbaum90].
Comment:
In a direct-mapped cache, the benefits of associativity are non-existent. Consequently, a match of a
subset of the address bits of a block will force a block replacement, even for a block which may be reused later.
Some compiled vector-processing code may produce access patterns which periodically destroy the item
needed next. The interested reader is encouraged to develop such code, to demonstrate the issue.
Note that a relatively large number of addresses map onto the same cache line. Consequently, addresses that map onto the same line compete for it, which is a potential source of unnecessary replacement of data to be used in the near future.
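The conflict just described is easy to demonstrate. Assuming the geometry of Figure CACU4 (1024 lines of 4-byte blocks), addresses that differ by a multiple of 4096 bytes map to the same line; the simulator below is a minimal sketch under that assumption:

```python
# Direct-mapped conflict sketch for the geometry of Figure CACU4:
# 1024 lines of 4-byte blocks, so line = (address >> 2) % 1024,
# and addresses 0, 4096, 8192, ... all compete for line 0.

LINES = 1024

class DirectMapped:
    def __init__(self):
        self.tags = [None] * LINES
        self.misses = 0

    def access(self, address):
        line = (address >> 2) % LINES
        tag = address >> 12          # the remaining high-order address bits
        if self.tags[line] != tag:   # tag mismatch forces a replacement
            self.misses += 1
            self.tags[line] = tag

cache = DirectMapped()
for _ in range(10):                  # alternate between two conflicting blocks
    cache.access(0)
    cache.access(4096)
print(cache.misses)                  # every access misses: 20
```

In a fully-associative (or even two-way set-associative) cache the two blocks would coexist, and only the first two accesses would miss.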
Cache memory can be organized hierarchically. If that is the case, typically the inclusion
principle is satisfied. This means that data in lower levels (closer to the CPU) are also present
in higher levels (closer to memory). The larger the speed gap between the CPU and the
memory, the larger the optimal number of cache hierarchy levels.
According to [Peir99], techniques for achieving fast cache access, while maintaining high
hit ratios, can be broadly classified into the following four categories:
(a) Decoupled cache: Data array access and line selection are done independently of the
tag array access and comparison, which reduces the delay imbalance between the paths
through the tag and the data arrays; this improves the cache clock rate.
(b) Multiple access cache: A direct-mapped cache (which is characterized by fast access)
is accessed more than once; this results in the hit ratio typical of set-associative caches, at
an access time which is (in spite of the multiple accesses) faster than that of a set-associative cache.
(c) Augmented cache: In addition to the direct-mapped cache, a small fully-associative
cache is added to improve performance; overall, this improves the hit ratio, while the access
time slow-down is practically negligible.
(d) Multilevel cache: The lowest-level cache is the smallest and the fastest, while the upper-level caches are slower and bigger; this combines the large capacity of the highest-level cache
with the good speed of the lowest-level cache, which is the reasoning that led to the introduction of the cache memory concept in the first place.
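The reasoning behind category (d) can be quantified with the standard average-memory-access-time formula. The latencies (in cycles) and hit ratios below are invented for illustration only:

```python
# Average memory access time (AMAT) for a two-level cache hierarchy.
# A minimal sketch; the latencies and hit ratios are illustrative only.

def amat(l1_hit_time, l1_hit_ratio, l2_hit_time, l2_hit_ratio, mem_time):
    # an L1 miss pays the L2 access, and an L2 miss additionally pays memory
    l1_miss_penalty = l2_hit_time + (1 - l2_hit_ratio) * mem_time
    return l1_hit_time + (1 - l1_hit_ratio) * l1_miss_penalty

# a small fast L1 backed by a bigger, slower L2 versus the L1 alone
with_l2 = amat(1, 0.90, 10, 0.80, 100)     # 1 + 0.1 * (10 + 0.2 * 100) = 4.0
without_l2 = 1 + (1 - 0.90) * 100          # 11.0
print(with_l2, without_l2)
```

Even with modest assumed hit ratios, the second level cuts the average access time dramatically, which is exactly the speed/capacity compromise that category (d) formalizes.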
In the part on advanced issues, which follows next, several enhancements will be discussed. In each of the enhancements to be discussed, some elements of the above four categories can be recognized.
2. Advanced Issues
This section gives a short overview of selected research efforts in the field of uniprocessor cache memory. Some of the solutions presented here are also applicable in the field of
multiprocessing.
An idea that deserves attention is to use pointers to page numbers, rather than the page numbers themselves, in the cache architectures of modern microprocessors and multimicroprocessors [Seznec96]. Figures CACS1, CACS2, CACS3, and CACS4 give the basic explanation.
[Figure CACS1: the virtual page number and page offset (address fields A2, A1, A0) access a combined TLB and PN-cache, the tag array, and the cache array; the matching entry yields an indirect tag, and the indirect-tag match with the physical page number produces the hit signal and data out (source: [Seznec96])]
By using pointers to page numbers, one saves a relatively large number of transistors
in a new microprocessor design. This is because the same data can be found in various microprocessor resources (in the general case: caches, the TLB, various tables and directories, etc.).
[Figure CACS2: the virtual page number and page offset access the TLB, the tag array, the cache array, and a separate page number cache; the indirect-tag match produces the hit signal, the physical page number, and data out (source: [Seznec96])]
This idea is of special interest for object-oriented environments where the pointers to
page numbers may exhibit a higher level of locality, which is of crucial importance for
cache efficiency. Variations of this idea are expected to be seen in future machines for object-oriented environments.
Another contribution that deserves attention is the difference-bit cache [Juan96]. The difference-bit cache is a two-way set-associative cache with a smaller access time, which is
important, since the cycle time of a pipelined processor is usually determined by the cache
access time.
The smaller access time was achieved by noticing that the two tags of a set must differ
in at least one bit, and that this bit can be used to select the way in a two-way set-associative
cache. Consequently, all hits of a difference-bit cache are one cycle long.
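The way-selection trick can be illustrated in a few lines. This is only a sketch of the underlying observation, not of the [Juan96] hardware: finding one bit in which the two stored tags differ is enough to pick the candidate way (a full tag comparison still confirms the hit off the critical path):

```python
# Difference-bit idea, sketched: in a two-way set-associative cache the two
# tags stored in one set are always distinct, so remembering the index of a
# bit in which they differ lets the cache select the way by examining a
# single bit of the incoming tag. Illustrative only.

def difference_bit(tag0, tag1):
    diff = tag0 ^ tag1
    assert diff != 0, "two tags in one set are always distinct"
    return (diff & -diff).bit_length() - 1   # index of the lowest differing bit

def select_way(tag, tag0, bit):
    # compare only one bit of the incoming tag against way 0's stored bit
    return 0 if (tag >> bit) & 1 == (tag0 >> bit) & 1 else 1

tag0, tag1 = 0b1011, 0b1001              # the two tags differ in bit 1
bit = difference_bit(tag0, tag1)
print(bit)                                # 1
print(select_way(0b1011, tag0, bit))      # 0 (matches way 0's bit)
print(select_way(0b1001, tag0, bit))      # 1
```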
Recent research at Hewlett-Packard [Chan96] introduces the concept of the assist cache,
which can be treated as a zero-level cache, in systems in which only the zero-level cache is
on the microprocessor chip, while the level-1/level-2 caches (L1/L2) are off the processor chip.
The major idea behind the assist cache is to treat spatial data (elements of complex data
structures) differently from other data, such as temporal data (single variables).
Spatial data are likely to be used only once; consequently, after being used, on the
next replacement they are not placed back into the off-chip L1 cache (or the off-chip L2
cache, if present); these caches are bypassed and spatial data go directly to memory.
[Figure CACS3: the virtual page number accesses the TLB while the low-order address bits access the tag array, the cache array, and the page number cache; the indirect-tag match against the physical page number produces the hit signal and data out (source: [Seznec96])]
An important research track is related to the victim cache. The first mention of the underlying concept dates back to [DeJuan87, DeJuan88]. The concept was introduced under the
name victim cache in [Jouppi90]. An improvement referred to as selective victim caching is
described in [Stiliadis97].
This research starts from the design tradeoff between direct-mapped and set-associative
or fully-associative caches. Direct-mapped caches are faster (which means a shorter cycle
time); however, their miss ratio is higher (which means more cycles to execute a program), because semantically related data can map into the same cache block. Fully-associative caches represent the opposite extreme (longer access time, but higher hit ratio).
Set-associative caches are in between.
One question is what to do in order to obtain the access time of the direct-mapped cache and,
at the same time, the hit ratio of the set-associative approach. The victim cache is one solution. It is a
small fully-associative cache. Its place in the cache/memory hierarchy is one level above
the first-level cache (L1), which is direct-mapped. Blocks replaced from the direct-mapped
L1 cache are placed into the fully-associative victim (V) cache (a kind of level 1). If such a
block is referenced again before being replaced from the V cache, the L1 miss penalty will
be relatively small (because the victim cache is small and physically close). Since the V
cache is fully-associative, many blocks which conflict with each other in the direct-mapped
L1 cache coexist without problems in the V cache. Consequently, the speed of direct mapping
keeps L1 fast, while the major drawback of the direct-mapped cache (mapping of
two or more useful data items into the same cache block) is eliminated by the victim
cache.
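The interplay described above can be sketched with a toy simulation: two blocks that conflict in a direct-mapped L1 coexist in a small fully-associative victim cache, so only the first references go all the way to memory. The sizes and the FIFO policy are illustrative assumptions:

```python
# Toy victim-cache sketch: blocks evicted from a direct-mapped L1 go to a
# small fully-associative victim (V) cache; a V-cache hit is far cheaper
# than a memory fetch. Sizes and the FIFO policy are illustrative only.

from collections import OrderedDict

class L1WithVictim:
    def __init__(self, lines=4, victim_entries=2):
        self.lines = lines
        self.victim_entries = victim_entries
        self.l1 = {}                        # L1 line -> cached block number
        self.victim = OrderedDict()         # fully-associative V cache (FIFO)
        self.memory_fetches = 0
        self.victim_hits = 0

    def access(self, block):
        line = block % self.lines           # direct mapping
        if self.l1.get(line) == block:
            return                          # L1 hit
        if block in self.victim:
            self.victim_hits += 1           # cheap miss: block survived in V
            del self.victim[block]
        else:
            self.memory_fetches += 1        # expensive miss: go to memory
        evicted = self.l1.get(line)
        if evicted is not None:
            self.victim[evicted] = True     # the replaced block becomes a victim
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)
        self.l1[line] = block

c = L1WithVictim()
for _ in range(8):                          # blocks 0 and 4 share L1 line 0
    c.access(0)
    c.access(4)
print(c.memory_fetches, c.victim_hits)      # 2 14: after the first two
                                            # fetches, every miss hits in V
```

Without the V cache, all 16 accesses in this pattern would be misses serviced from memory, as in the direct-mapped example earlier.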
In the case of the selective victim cache, blocks incoming to L1 are placed selectively either into the L1 or into the V cache, based on their past behavior. This implies the existence of a
prediction mechanism, to determine the access likelihood of the incoming block. Blocks
that are more likely to be accessed are placed into the L1 cache; blocks that are less likely
to be accessed are placed into the V cache.
Similarly, when a miss is being serviced from the V cache, the prediction mechanism is
applied again, to determine if an interchange of the conflicting blocks is required (which is
mandatory in the case of the classical V cache). For 10 different Spec92 benchmarks, the selective victim cache provides a cache miss rate improvement of up to about 20%, and an improvement of the interchange traffic between L1 and V of up to about 70%. The selective victim
cache approach can be based on a number of different prediction mechanisms.
Another question is what is better: (a) fewer cycles which are longer (fully-associative
cache), or (b) more cycles which are shorter (direct-mapped cache). Paper [Hill88] compares the direct-mapped cache and the set-associative cache (the fully-associative cache has no practical value
in a number of applications), and concludes that the direct-mapped cache gives a shorter overall
execution time for a set of important system and application programs. Paper [Milutinovic86c] reaches the same conclusion for GaAs technology, through a side study leading to an
early GaAs RISC [Helbig89, Milutinovic97].
An important new trend in cache design involves techniques that optimize the cache
design and performance under conditions typical of superscalar microprocessors [Vajapeyam97, Nair97].
The SFP (Spatial Footprint Predictor) cache predicts which portions of a cache block
will be used before eviction [Kumar98]. In traditional cache memories, more than 50% of
cached data is evicted before use, which means that the potential of such a prediction is relatively high. Efficient prediction enables smaller cache blocks (combined with multiple-block fetch on a miss) to achieve a better ratio of used-before-evicted versus evicted-before-used data, and consequently better performance. With the predictor from [Kumar98], the cache miss rate is improved, on average, by about 18% for SPEC applications.
The miss rate improvement potential is estimated to be about 35%. The SFP approach requires no ISA modifications.
Prediction is based on two tables: the SHT (Spatial-footprint History Table) and the AST (Active
Sector Table). The SHT stores some of the previously recorded footprints. The AST
records the footprint while a sector is active in the cache. Sectors are aligned groups of adjacent cache blocks, and can be active or inactive (initially, all sectors are inactive). On sector activation, an AST entry is allocated and the predicted spatial footprint is used to fetch
new blocks. On sector deactivation, the recorded footprint is used to update the predictor.
Cache blocks within a sector that get referenced while the sector is active define the spatial
footprint of that sector. A spatial footprint is stored as a bit vector, with one bit for every
block in the sector. Consequently, only those blocks of a sector are fetched that are very
likely to be used in the near future.
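The footprint bookkeeping can be sketched as follows, in the spirit of [Kumar98] but heavily simplified: the sector geometry, function names, and table handling are illustrative assumptions:

```python
# Spatial-footprint sketch: while a sector is active, record which of its
# blocks were touched as a bit vector; on deactivation, store the footprint
# so a later activation fetches only those blocks. Illustrative only.

BLOCKS_PER_SECTOR = 4

def record_footprint(block_offsets):
    """Bit vector with one bit per block in the sector (the AST's role)."""
    footprint = 0
    for off in block_offsets:
        footprint |= 1 << off
    return footprint

def predicted_blocks(footprint):
    """Blocks to fetch on the next activation (the SHT's role)."""
    return [b for b in range(BLOCKS_PER_SECTOR) if footprint >> b & 1]

fp = record_footprint([0, 2])     # only blocks 0 and 2 were referenced
print(bin(fp))                    # 0b101
print(predicted_blocks(fp))       # [0, 2]: blocks 1 and 3 are not fetched
```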
An alternative approach, referred to as the trace cache [Peleg94, Rotenberg97], places logically contiguous instructions into physically contiguous storage. Consequently, multiple
cache blocks can be efficiently fetched on each cycle. The maximal number of fetch blocks
supplied on each cycle is limited by the capabilities of the branch predictor.
In general, instruction fetching past a taken branch is a serious problem for instruction
cache design. One possible solution to the problem is the trace cache. Like the classical
instruction cache, the trace cache is accessed using the starting address of the next block. Unlike the classical instruction cache, it stores logically consecutive instructions in physically
consecutive storage.
Actually, a trace cache line stores a segment of the dynamic instruction trace, across multiple potentially taken branches. In other words, a block contains dynamically consecutive
instructions, not statically consecutive instructions.
As a group of instructions is processed, it is latched into the fill unit. Consecutive groups
of instructions are concatenated until the block is completely filled. One possible block size is
equal to the instruction issue width. The fill unit is off the critical execution path,
so its latency does not affect the CPU clock.
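The fill-unit behavior just described can be sketched as follows; the issue width, instruction labels, and block contents are illustrative assumptions, not the design of any particular trace cache:

```python
# Fill-unit sketch: latch retired basic blocks and concatenate them into a
# trace segment until it reaches the issue width; dynamically consecutive
# instructions end up physically consecutive. Illustrative only.

ISSUE_WIDTH = 4

def build_traces(dynamic_blocks):
    traces, current = [], []
    for block in dynamic_blocks:            # each block: retired instructions
        for instr in block:
            current.append(instr)
            if len(current) == ISSUE_WIDTH:
                traces.append(tuple(current))   # one finished trace cache line
                current = []
    if current:
        traces.append(tuple(current))       # partially filled final segment
    return traces

# a dynamic stream that crosses two taken branches: blocks at 100, 200, 300
stream = [["i100", "i101"], ["i200"], ["i300", "i301", "i302"]]
print(build_traces(stream))
```

Note how the first trace line packs instructions from three statically distant basic blocks, which is exactly what lets the fetch unit supply them in a single cycle.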
One important characteristic of the trace cache is that it continues to contribute to the
overall system performance as its size grows. This is not the case with a classical instruction cache: once the instruction cache size grows beyond the working set of larger applications,
a larger instruction cache may not bring additional performance improvements. With the
growing size of the trace cache, it can be partitioned into a smaller one-cycle component
and a larger multiple-cycle component. Multiple-level hierarchies of trace caches can also
be built.
Paper [Patel98] introduces two mechanisms to improve the efficiency of the trace cache:
branch promotion (dynamically converting strongly biased branches into branches with
static predictions; statically predicted branches do not consume branch predictor bandwidth when they are fetched) and trace packing (dynamically packing trace segments with
as many instructions as possible; the fill unit will not divide a block of newly retired instructions across trace segments). Together, the two techniques improve the effective fetch rate
of the traditional trace cache by up to about 17% [Patel98].
Further enhancements of the trace cache concept can be found in [Rotenberg99, Patel99]
and [Black99]. The first paper [Rotenberg99] gives a more accurate performance analysis.
The second paper [Patel99] evaluates a number of trace cache design choices. The third paper [Black99] introduces the concept of block-based trace cache, which is aimed at using
the storage more efficiently; instead of storing blocks of traces, it stores pointers to blocks
of traces.
Several research efforts try to improve the cache subsystem performance by treating spatial and temporal data in different ways, with different mechanisms. The major approaches
will now be briefly summarized from [Prvulovic99a].
The major drawback of this design is the assumption that most data are either predominantly temporal or predominantly spatial. In fact, most data exhibit both types of locality.
Two issues result from this. The first is that a secondary cache would be useful in the spatial part, since it caches the data that exhibit both types of locality as well as spatial-only
data. The other issue is the counter-based heuristic. If data is accessed so that one word is
used several times before the next word is accessed, the heuristic may mark the block as
temporal if the Y counter saturates while only one half of the block has been used. Thereafter, this
block is always cached in the temporal part and causes four misses (one per word used) instead of only the one it would cause in the spatial subcache.
An interesting piece of research is presented in [Sahuquillo99]. It is an extension of the
split temporal/spatial cache to shared-memory multiprocessor systems.
The dual data cache was introduced in [Gonzalez95]. In this design, the primary data
cache is split into a smaller (33% of the total cache capacity) temporal subcache and a larger
(66% of the total cache capacity) spatial subcache. The temporal subcache has a block size of
eight bytes, while the spatial subcache has a larger block size (16 or 32 bytes). The decision
which subcache to use is made at runtime, using a variant of a stride-directed prefetch predictor. The predictor, in essence, tries to detect if a particular load/store instruction accesses
memory at addresses that differ by a constant stride. Using this information, at each cache
miss the cache controller decides if the data should be cached in the temporal subcache, in the spatial
subcache, or not cached at all. Prefetching the next block on a cache miss is also done in
this design, but only in the spatial subcache. The most important drawback of this design is
that its locality detection is based on instructions rather than on data. This means that if the CPU
accesses a data block in a spatial manner, but the words of that block are accessed by different
instructions, the detection logic could still decide that the block is non-spatial. Moreover, one
instruction could see the block as non-spatial while another may see it as spatial. This
means that a particular word may be cached in both the spatial and the non-spatial subcache.
While the authors have shown that this does not lead to inconsistencies if properly implemented, the performance may still suffer.
A substantially modified design of the dual data cache is proposed in [Sanchez97]. The
temporal subcache in this new design is only a small fully-associative buffer (up to 16 single-word 8-byte entries), while the spatial subcache remains large, with a block size of
32 bytes.
The decision which subcache to use for each data access is made at compile time, by performing data locality analysis. Moreover, the compiler can mark a particular load/store instruction as non-cached (i.e., bypass) if no locality is expected for that particular data access. The main drawback of this approach is that changes to the instruction set are
required to implement the different flavors of load/store instructions. Also, the locality information
is still based on instructions rather than on data, and the locality of many data accesses is difficult to determine at compile time.
Research on the array cache [Tomasko97] splits the primary data cache into a smaller
(25% of the total cache capacity) scalar subcache and a larger (75% of the total cache capacity) array subcache. The block size in the scalar subcache is 32 bytes, while in the array subcache it is
larger (experimentally varied from 64 to 512 bytes). The decision which subcache to use
for a particular data access is made at compile time, by marking scalar variable accesses
to use the scalar subcache, while array accesses go to the array subcache. The main drawback of this approach is that changes to the instruction set are required to implement
the different flavors of load/store instructions. In addition, the scalar/array heuristic may be
wildly inaccurate. Many arrays are accessed in a random manner that exhibits almost no
spatial locality, while the scalars used in a particular part of the program are usually
stored in neighboring memory words.
3. Problems
1. For the case of a direct-mapped cache, develop a piece of code that destroys the data
item needed next. Show what happens, for the same example, in the case of a set-associative or a fully-associative cache.
2. Compute the transistor count saving, for one of the popular microprocessors, if pointers
to page numbers are used rather than page numbers, as in [Seznec96]. What is the impact
on the clock cycle time?
3. Develop the difference-bit approach [Juan96] for the case of a 4-way set-associative
cache memory. What is the problem with such a solution?
4. Create a piece of code which demonstrates the benefits of the assist cache approach from
[Chan96]. Create another piece of code, which includes data with much more spatial locality, and show how many clock cycles have been saved.
5. Compare victim cache from [Jouppi90] and the selective victim cache from [Stiliadis97]
for a piece of code of your choice. What are the further improvements that you would suggest?
6. For a piece of code of your choice, show which of the two cases is better: (a) fewer cycles
which are longer, or (b) more cycles which are shorter. Create also an example which
shows the opposite, and explain the code characteristics which make one or the
other case better.
7. Develop a detailed block scheme of the spatial footprint predictor (SFP) cache from
[Kumar98]. Propose an improvement that takes into consideration the temporal locality, as
well.
8. Explain why trace cache continues to contribute to the overall system performance when
the size of trace cache grows. Why is that not the case with the conventional instruction
cache?
9. Explain the essence of branch promotion and trace packing. Compare performance and
complexity, for these two improvements of the trace cache approach.
10. Propose alternative ways to measure the level of spatial and temporal locality of data in
a given program. Propose ways to visualize the changes, during the execution of a program.
Instruction-Level Parallelism
This chapter includes two sections. The first one is oriented to basic issues (background).
The second one is oriented to advanced issues (state-of-the-art).
1. Basic Issues
Instruction-level parallelism (ILP) and its efficient exploitation are crucial for maximizing the speed of compiled high-level language (HLL) code. Methods to exploit ILP include superscalar (SS), superpipelined (SP), and very long instruction word (VLIW)
processing. In principle, ILP can be viewed as a measure of the average number of instructions that an appropriate SS, SP, or VLIW processor might be able to execute at the same
time.
Instruction timing of a RISC processor and of an SS processor is given in Figure ILPU1
and Figure ILPU2, respectively. The ILP of an instruction is a function of the number of its dependencies
with respect to other instructions. Unfortunately, an architecture is not always able to support all
available ILP. Consequently, another measure, machine-level parallelism (MLP), is often
used to measure the ability of an SS processor to take advantage of the available ILP.
For the best performance/complexity ratio, the MLP and the ILP have to be balanced.
[Figure ILPU1: instruction timing of a RISC processor (instructions i0 through i5)]
This is a typical RISC-style instruction execution pipeline with three stages. It represents the baseline for explanation of other cases to follow.
[Figure ILPU2: instruction timing of a superscalar processor (instructions i0 through i5)]
Fundamental performance limitations of an SS processor are illustrated in Figures ILPU3, ILPU4, and ILPU5. Figure ILPU3 shows the case of true data dependencies in a superscalar environment (one cycle is lost). Figure ILPU4 shows the case of
procedural dependencies in the same superscalar environment (the number of lost cycles is
much larger). Figure ILPU5 shows the case of resource conflicts in the same superscalar
environment (the effects are the same as in Figure ILPU3, but the cause is different). Finally, Figure ILPU6 shows the impact of resource conflicts for the case of a long operation, in two
different scenarios: when the execution unit is internally not pipelined (the worse case, from
the performance point of view) and when it is internally fully pipelined (the better case, from the performance point of view).
[Figure ILPU3: instruction timing of instructions i0 and i1, without and with a true data dependency]
[Figure ILPU4: instruction timing with a procedural dependency (i1 is a branch; instructions i2 through i5 are delayed)]
[Figure ILPU5: instruction timing of instructions i0 and i1, without and with a resource conflict]
Figure ILPU6: Resource conflicts: the effect of resource conflicts on instruction timing (source: [Johnson91])
Comment:
The same instruction pipeline from Figure ILPU5 is presented, under the condition that the execution takes
several pipeline stages to complete. Under such a condition, the two alternative solutions represent a
price/performance tradeoff: the one which wastes fewer cycles is more expensive to implement.
Issue (fetch and decode), execution (in the more general sense), and completion (write-back, which changes the state of the finite state machine called the microprocessor) are the major elements of instruction processing. In the least sophisticated microprocessors these elements
proceed in order. In the most sophisticated microprocessors they proceed out of order, which
means better performance at the cost of more VLSI complexity. The essence will be explained using
an example from [Johnson91].
In this example, a microprocessor which can issue two instructions at a time, execute
three at a time, and complete two at a time is assumed. Furthermore, the following program-related characteristics are assumed: (a) instruction I1 requires two cycles to execute,
(b) instructions I3 and I4 conflict for a functional unit, (c) instruction I5 depends on the datum generated by I4, and (d) instructions I5 and I6 conflict for a functional unit. The case
of in-order issue and in-order completion is given in Figure ILPU7. The case of in-order
issue and out-of-order completion is given in Figure ILPU8. The case of out-of-order issue
and out-of-order completion is given in Figure ILPU9.
Figure ILPU7: Example II (source: [Johnson91])
Legend: II: in-order issue, in-order completion.
Comment:
Instructions I1 and I2 are issued together to the execution unit and have to be written back (i.e., completed)
together. The same applies for the pair I3 and I4, as well as the pair I5 and I6. Total execution time is eight
cycles.
[Figure ILPU8: Example IO (in-order issue, out-of-order completion); total execution time is seven cycles (source: [Johnson91])]
[Figure ILPU9: Example OO (out-of-order issue, out-of-order completion), using an instruction window; total execution time is six cycles (source: [Johnson91])]
In this particular example, the simplest case (II, in-order issue and in-order completion)
takes 8 cycles to execute. The medium case (IO, in-order issue and out-of-order completion) takes 7 cycles. The most sophisticated case (OO, out-of-order issue and
out-of-order completion) takes 6 cycles. These values were obtained under
the following machine-related assumptions: (a) an instruction remains in the
decoding unit until its execution starts, and (b) each instruction is executed in the appropriate execution unit.
In the simplest case (II), completion can be done only after both paired instructions
have fully executed; both complete together. This approach (II) is typical of scalar microprocessors and is rarely used in superscalar microprocessors.
In the medium case (IO), execution of an instruction can start as soon as the related resource is available; also, completion is done as soon as execution finishes (I2
completes out of order). There are three basic cases in which the issue of an instruction has to
be stalled: (a) when the issue could generate a functional unit conflict; (b) when the instruction to be issued depends on instruction(s) not yet completed; and (c) when the result of the issued instruction could be overwritten by an older instruction which takes longer to execute, or by a following instruction not yet executed. Special-purpose control hardware is responsible for stalling in all three cases. This approach (IO) was first used in scalar microprocessors; however, its major use is in superscalar microprocessors.
In the most sophisticated approach (OO), the processor is able to look ahead beyond an
instruction which has been stalled (due to one of the reasons listed in the previous paragraph on the IO approach), which is not possible with the IO approach. Fetching and decoding beyond the stalled instruction is made possible by inserting a resource called the instruction window between the decode stage and the execute stage. Decoded instructions are
placed into the instruction window (if there is enough space), and examined for resource conflicts and possible dependencies. The term instruction window (or window of execution) refers
to the full set of instructions that may be simultaneously considered for parallel execution,
subject to data dependencies and resource conflicts. As soon as an executable instruction is detected (like I6 in Figure ILPU9), it is scheduled for execution regardless of the program
order, i.e., out of order (the only condition is that the program semantics are preserved). This
approach (OO) introduces one additional type of hazard: instruction N + 1 may destroy
the input of instruction N (this case has to be watched for by the control hardware). So far,
this approach has been used only in superscalar microprocessors.
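The benefit of looking past a stalled instruction can be sketched with a toy single-issue scheduler. This is not the [Johnson91] machine model; the dependencies, latencies, and instruction names are invented, and completion ordering and functional-unit conflicts are ignored:

```python
# Minimal issue-policy sketch: given instructions with data dependences and
# latencies (in cycles), compute when each starts under in-order versus
# out-of-order single-issue scheduling. Illustrative model only.

def schedule(deps, latency, in_order):
    start, done = {}, {}
    cycle, pending = 0, list(deps)
    while pending:
        # in-order: only the oldest pending instruction may issue;
        # out-of-order: any ready instruction in the window may issue
        for i in (pending[:1] if in_order else pending):
            if all(done.get(d, 10**9) <= cycle for d in deps[i]):
                start[i], done[i] = cycle, cycle + latency[i]
                pending.remove(i)
                break
        cycle += 1
    return start

deps = {"I1": [], "I2": ["I1"], "I3": []}   # I2 needs I1; I3 is independent
latency = {"I1": 3, "I2": 1, "I3": 1}
print(schedule(deps, latency, in_order=True))
print(schedule(deps, latency, in_order=False))
```

Under in-order issue, the independent I3 cannot start until the stalled I2 has issued (cycle 4); under out-of-order issue, I3 bypasses the stall and starts at cycle 1, which is precisely the gain the instruction window provides.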
In this context, important roles are played by BHT (branch history table) and BTB
(branch target buffer). The BHT helps determine the branch outcome. The BTB helps compute the branch target address. These issues will be discussed in more detail later on.
A related approach, VLIW (Very Long Instruction Word), is shown in Figure ILPU10.
A single instruction specifies a larger number of operations to be executed concurrently. Consequently, the number of run-time instructions is smaller, but the amount of compile-time
activities is larger (and the code size may increase if the compiler is not adequate for the
given architecture/application). The approach is well suited for special purpose architectures/applications and not well suited for making new VLIW machines which are binary
compatible with existing machines (binary compatibility is the ability to execute a machine
program written for an architecture of an earlier generation).
[Figure ILPU10: the VLIW approach. Each instruction (i0, i1, i2) specifies multiple operations (e) executed concurrently, shown against time.]
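The compile-time activity mentioned above can be illustrated with a toy "packing" pass (a sketch under simplifying assumptions: three slots per long word, and only read-after-write dependencies checked; real VLIW compilers also handle write conflicts and functional-unit assignment):

```python
# Toy VLIW packing: greedily bundle operations into long-instruction words of
# `slots` slots, starting a new word whenever an operation reads a register
# written earlier in the same word.

def pack(ops, slots=3):
    """ops: list of (dst, srcs) in program order. Returns a list of bundles."""
    bundles, current, written = [], [], set()
    for dst, srcs in ops:
        if any(s in written for s in srcs) or len(current) == slots:
            bundles.append(current)          # close the current long word
            current, written = [], set()
        current.append((dst, srcs))
        written.add(dst)
    if current:
        bundles.append(current)
    return bundles

ops = [("r1", ("r0",)), ("r2", ("r0",)),     # independent: packed together
       ("r3", ("r1",)), ("r4", ("r2",))]     # depend on word 1: next word
print(len(pack(ops)))  # 2 long words
```

Four run-time operations collapse into two long instructions here; with a poor match between compiler and architecture, many slots stay empty, which is the code-size growth noted above.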
The related superpipelined approach is shown in Figure ILPU11. Stages are divided into
substages; the more substages per stage, the deeper the superpipelining. Figure ILPU11
refers to the case with a pipelining depth of two. Superpipelining (SP) takes
longer to generate a result (1.5 clock cycles for two instructions), compared with superscalar execution (SS), which takes less (1 clock cycle for two instructions). On the other hand, SP
takes less time for simpler operations (e.g., 0.5 clock cycles), compared with SS, which takes
longer (e.g., 1 clock cycle, if and when no clock with finer resolution is available). Latency
is shorter with SP (shorter basic clock), and consequently the outcome of a simple branch
test is known sooner, but the clock period is shorter with SS (no extra latches, which are necessary for the SP approach). Resources are less duplicated with the SP approach, but its
major problem is clock skew. Whether SS or SP is better depends on the technology and
the application.
[Figure ILPU11: the superpipelined approach, depth two. Fetch (f), decode (d), and execute (e) stages of instructions i0 through i5 are divided into substages and overlapped in time.]
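The latency comparison above can be captured in a toy timing model (illustrative only; the formulas and the latch-overhead parameter are simplifying assumptions, not a real design): SS issues w independent single-cycle operations per base clock T, while SP of depth d issues one operation per substage clock.

```python
# Toy SS vs. SP timing for n independent single-cycle operations.
import math

def ss_time(n, w=2, T=1.0):
    """Superscalar: w operations issue and complete per base clock."""
    return math.ceil(n / w) * T

def sp_time(n, d=2, T=1.0, latch=0.0):
    """Superpipelined: substage clock is T/d plus latch overhead;
    issue one operation per substage, then drain the d substages."""
    sub = T / d + latch
    return (n + d - 1) * sub

print(ss_time(2))   # 1.0 : SS finishes two instructions in one base clock
print(sp_time(2))   # 1.5 : SP needs 1.5 clocks for two, matching the text
```

Raising the latch parameter shows why deep superpipelining eventually stops paying off: the extra latches (and clock skew) lengthen every substage.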
Of course, the best performance/complexity ratio can be achieved using hybrid techniques that combine SS, SP, and VLIW. Theoretical limits on ILP exploitation are determined by conditional branches, which affect the clock count, and by the extra complexity, which
grows quadratically and affects the clock speed. According to several authors, the practical limits
of ILP exploitation are 8-way issue (if only hardware is employed) and N-operation VLIW (if
both hardware and compiler are employed), with N > 8.
Issues discussed so far will be briefly revisited through examples of different
superscalar microprocessors. Each microprocessor will be discussed according to the same
template: superscalar features, block diagram, major highlight, branch prediction, execution
issues, functional units, and cache memory. For more examples and a more detailed treatment, see [Stallings96] or the original manufacturer literature.
[Block diagram: PreDecode, Instruction Cache, Instruction Buffer, Decode/Rename & Dispatch; FP Register File and FP Instruction Buffers feeding the Functional Units; Integer + Address Instruction Buffers and Integer Register File feeding the Functional Units + Data Cache; Memory Interface; Re-Order And Commit.]
The branch prediction unit is placed next to the I cache and shares some features of the
cache in order to speed up the branch prediction. The I cache has 512 lines and the branch
prediction table (BPT) has 512 entries. The branch prediction table is based on 2-bit counters,
as explained later on in this book. If a branch is predicted taken, instruction fetching is redirected after a cycle. During that cycle, instructions on the not-taken path continue to be
fetched; they are placed into a cache called the resume cache, to be ready if the prediction
happens to be incorrect. The resume cache is large enough for instructions of the not-taken paths related to four consecutive branches.
The major execution phases are instruction fetch, instruction decoding, operand renaming
(based on 64 physical and 32 logical registers), and dispatching to appropriate instruction
queues (up to four instructions concurrently dispatched into three queues, for later execution in five different functional units).
During the register renaming process, the destination register is assigned a physical register which is listed as unused in the so-called free list. At the same time, another list is updated to reflect the new logical-to-physical mapping. If needed, an operand is accessed
through the list of logical-to-physical mappings.
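The renaming step above can be sketched as follows; the 32-logical/64-physical split follows the text, while the register naming scheme (r0..r31, p0..p63) is an illustrative assumption:

```python
# Minimal register-renaming sketch: a free list of unused physical registers
# plus a table of logical-to-physical mappings, as described in the text.

class Renamer:
    def __init__(self, n_logical=32, n_physical=64):
        self.table = {f"r{i}": f"p{i}" for i in range(n_logical)}    # current mappings
        self.free = [f"p{i}" for i in range(n_logical, n_physical)]  # unused physicals

    def rename(self, dst, srcs):
        """Rename one decoded instruction; returns (physical dst, physical srcs)."""
        phys_srcs = [self.table[s] for s in srcs]  # operands via the mapping list
        phys_dst = self.free.pop(0)                # allocate from the free list
        self.table[dst] = phys_dst                 # record the new mapping
        return phys_dst, phys_srcs

r = Renamer()
print(r.rename("r1", ["r2", "r3"]))  # ('p32', ['p2', 'p3'])
print(r.rename("r1", ["r1"]))        # ('p33', ['p32']): reads the renamed value
```

Note how the second write to r1 gets a fresh physical register while its source still reads the previous one; this is what removes write-after-read and write-after-write hazards.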
During the instruction dispatching process, each of the queues can accept up to four instructions. Also, the reservation bit for each physical result register is set to busy. Queues act like
reservation stations, holding instructions and physical register designators that act as pointers to data. The global register reservation bits for source operands are constantly tested
for availability. As soon as all source operands become available, the instruction is considered free and can be issued.
A reorder buffer is used to maintain a precise state at the time of an exception. Exception
conditions for noncommitted instructions are held in the reorder buffer. An interrupt occurs
whenever an instruction with an exception is ready to be committed.
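The commit discipline just described can be sketched as follows (the entry fields and single exception flag are illustrative assumptions):

```python
# Sketch of in-order commit from a reorder buffer: instructions may complete
# out of order, but results commit in program order, and the interrupt is
# taken only when the excepting instruction reaches the head of the buffer.

def commit(rob):
    """rob: entries in program order, each {'name', 'done', 'exception'}.
    Returns the names committed; raises when an excepting entry is ready to commit."""
    committed = []
    for entry in rob:
        if not entry["done"]:
            break                                  # head not finished: stop committing
        if entry["exception"]:
            raise RuntimeError("precise exception at " + entry["name"])
        committed.append(entry["name"])
    return committed

rob = [{"name": "I1", "done": True,  "exception": False},
       {"name": "I2", "done": True,  "exception": False},
       {"name": "I3", "done": False, "exception": False}]  # still executing
print(commit(rob))  # ['I1', 'I2']
```

Because nothing past an unfinished or excepting entry commits, the architectural state seen by the exception handler is exactly the state at the faulting instruction, which is the precise-interrupt property.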
The five functional units are: address adder, two integer ALUs (one with a shifter and the
other with an integer multiplier, in addition to basic adder and logic unit), floating point
adder, and floating point multioperation unit (multiplication, division, and square root).
The on-chip primary cache is 32 KB, 2-way set-associative, with 32-byte
lines. The off-chip secondary cache is typically based on the inclusion principle (if something is in the primary cache, it will also be in the secondary cache).
The architecture of the R12000 is basically the same as that of the R10000, except
that some features are considerably improved (address space, on-chip cache memory,
etc.). Also, the implementation technology reflects the state of the art at the time of its introduction.
[Block diagram: Instruction Cache, Instruction Buffer, Instruction Decode & Issue; FP Register File and FP Functional Units; Integer Register File and Functional Units + Data Cache; Level 2 Data Cache & Memory Interface.]
Instructions are fetched and placed into one of two instruction buffers (each one is four
instructions deep). Instructions are issued (from an instruction buffer) in program order
(not bypassing each other). One instruction buffer is used until it is emptied, and then issuing from the next instruction buffer starts (a solution which makes the control logic much
less complex).
Again, the branch prediction table is associated with the instruction cache. A 2-bit branch
history counter is associated with each cache entry. Only one branch at a time can be in the
predicted-but-not-yet-resolved state. Therefore, issue is stalled on a second
branch if the first one is not yet resolved.
After decoding, instructions are arranged according to the type of functional unit that
they are to use. Once the operand data are ready, instructions are issued to the units where they
are to be executed. Instructions are not allowed to bypass each other.
In order to make the handling of precise interrupts easier, instruction issue is in order.
The final pipeline stage in the integer units updates the destination registers, also in order.
Bypass registers are included in the data path structure, so that data can be used before
they are written into their destination registers. The final bypass stages of the floating-point
units update registers out of order. Consequently, not all floating-point exceptions
result in precise interrupts.
This microprocessor includes four functional units: two integer ALUs (one for basic
ALU operations plus shift and multiplication; the other for basic ALU operations and evaluation of branches), one floating point adder, and one floating point multiplier.
Two levels of cache memory reside on the CPU chip. The first-level cache, direct-mapped for fast one-clock access, is split into two parts: instruction cache and data cache.
Both the instruction cache and the data cache are 8 KB. The second-level cache (three-way
set-associative) is shared (joint instruction and data cache). Its capacity is 96 KB.
The primary cache handles up to six outstanding misses. For that purpose, a six-entry MAF
(miss address file) is included. Each entry includes the missed memory address and the target register address of the instruction which exhibits the miss. If the MAF contains two entries
with the same memory address, the two entries merge into one.
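The MAF behavior described above can be sketched as follows (the six-entry size follows the text; addresses, register names, and the stall-on-full return value are illustrative assumptions):

```python
# Sketch of a miss address file (MAF): each entry holds a missed address and
# the target registers waiting on it; a second miss to the same address merges
# into the existing entry instead of consuming a new one.

class MAF:
    def __init__(self, n_entries=6):
        self.n_entries = n_entries
        self.entries = {}                # miss address -> list of target registers

    def record_miss(self, addr, target_reg):
        """Returns False when the MAF is full and the miss must stall."""
        if addr in self.entries:
            self.entries[addr].append(target_reg)   # merge with outstanding miss
            return True
        if len(self.entries) == self.n_entries:
            return False
        self.entries[addr] = [target_reg]
        return True

maf = MAF(n_entries=6)
maf.record_miss(0x1000, "r4")
maf.record_miss(0x1000, "r5")     # same address: merged, still one entry
print(len(maf.entries))           # 1
```

Merging means one memory fetch can satisfy several waiting instructions, which is why the merged entry keeps a list of target registers.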
The more recent microprocessor from the same family [Gwennap97], the DEC Alpha
21264, has the following characteristics: (a) 4-way issue superscalar, (b) out-of-order execution architecture, (c) speculative execution, and (d) multi-hybrid branch prediction.
The clock frequency is 600 MHz, which makes it deliver an estimated 40 SPECint95 and 60
SPECfp95 performance. This was made possible because both the L1 and L2 caches are on the
processor chip. Also, the path to memory enables a data transfer rate of over 3 GB/s.
[Block diagram: IC (32 KB), IU, L2/BI (128 b), IR, LSU, FPR, FP-ALU, DC (32 KB).]
Each execution unit includes two or more reservation stations that store dispatched
instructions until the results of other instructions are known. The PowerPC 620 can
speculatively execute up to four unresolved branch instructions (the PowerPC 601 can speculatively execute only one unresolved branch instruction).
The on-chip L1 cache memory is split into two parts: instruction cache memory and data
cache memory. Each of the two parts is 32 KB in capacity and is implemented as an 8-way set-associative memory.
The PowerPC 750 microprocessor is also designed for desktop and portable markets. It
includes an on-chip L2 cache controller supporting back-side L2 caches of up to 1 MB.
It has an improved dynamic branch prediction unit, as well as an additional floating-point
unit. The block scheme of the PowerPC 750 is similar to that of the PowerPC 620
[Kennedy97].
The PowerPC 750 can achieve the SPECint95 performance of over 14 and the SPECfp95
performance of over 10. It is offered as a single-chip package, or as a small daughtercard
integrating the processor and the L2 cache SRAMs [Pyron98].
[Block diagram: IC, IF, BIU, IUs, SMUs, IRB-A, IRB-M, FMAUs, LSU, ARB, DSRUs, DC, RUs, RR, RF.]
[Block diagram: PD, IC, BQ, ROP, RS, RF, FUs+DC, MI, ROB.]
Since the Intel x86 instruction set uses variable-length instructions, the fetched instructions
are sequentially predecoded before being placed into the instruction cache. Instructions in the
byte queue wait to be dispatched.
The branch prediction logic is integrated with the instruction cache. There is one prediction
entry per cache line. A prediction entry includes a bit that reflects the direction taken by the previous execution of the branch. Each prediction entry also contains a pointer to the target
instruction, so it is known where in the cache it can be found.
Decoding takes two cycles (because the x86 instruction set is complex) and creates ROPs
(RISC operations). More complex x86 instructions are converted into a sequence of ROPs
(essentially a microroutine). After the first decode cycle, instructions are dispatched to reservation stations. Data come to the reservation stations from the register file or the re-order
buffer.
Once the data become available, instructions and data enter the functional units. There are
six functional units: two integer ALUs, one FP unit, two load/store units, and a branch unit.
Result data are kept in the re-order buffer before being placed into the register file
[Smith95].
The core of the AMD K-6 processor is the RISC86 microarchitecture, similar to that of the
AMD K-5. Many resources are larger in size, and new resources are added. For example, in comparison with the AMD K-5, the AMD K-6 architecture also includes a multimedia unit [Fetherston98].
[Block diagram: IC, I-MMU, PD, I-TLB, IB, DU, L/S, FP/GU, IU, BU.]
The core instruction set is extended to provide support for graphics and multimedia (2D
image processing, 2D and 3D graphics, and image compression). It is especially fast for
MPEG-2. The prefetch unit can fetch instructions from all levels of the memory hierarchy. The next
instruction is fetched based on the result of branch prediction. Instruction execution is
based on a 9-stage pipeline. The L1 cache is on the same chip as the CPU and is divided
into two parts: instruction cache (16 KB) and data cache (also 16 KB).
The SUN UltraSPARC II has the same number of integer-unit and floating-point-unit
registers and the same width of system buses as its predecessor. It is basically the same architecture; the major differences are technological, and they result in better speed.
1.7. A Summary
An innovative point of view on modern microprocessor-based computers is given in
[Tredennick96]. It includes an interesting diagram that shows differences between the first
five generations of microprocessors, as shown in Figure ILPU18.
Paper [Tredennick96] also includes a number of interesting statements. The higher the
chip complexity, the larger the design teams, the higher the design costs, and the longer the
idea-to-market time. Consequently, current leading-edge design efforts exceed three years,
while new product developments are required about every 18 months to remain competitive. In other words, 18 months after a company introduces a new product, the competing
company introduces a new product with twice the performance. Therefore, overlapping
design teams have to be used in order to stay on the competitive edge. Also, development
costs per microprocessor rise at the rate of 25% per year. This increased development cost
can be absorbed only with a larger number of units sold, as indicated in Figure ILPU19.
[Figure ILPU18: Differences between the first five generations of microprocessors, shown as instruction timing diagrams: generation 1 executes instructions strictly sequentially; successive generations overlap fetch (F), decode (D), and execute (E) stages of more and more instructions, up to the generation-5 dataflow model in which many execution stages (E1 through E8) proceed concurrently, with write result (W) following after some latency. Stages referenced: fetch instruction, decode, address calculation, read operands, execute, write result.]
Figure ILPU19: Amortized development cost versus the number of microprocessor chips sold
Legend: FMC: Fixed manufacturing cost per microprocessor chip
Comment: At 2M units (microprocessor chips) sold, a $500M total development cost amortizes at approximately $300 per unit.
If the curves of Figure ILPU19 are examined more carefully, it can be noted that Intel's
products are to the right of the X axis (where the development costs are easy to amortize),
which makes it difficult to compete against Intel. Also, since development costs rise at
25% per year, unless the unit volume of an Intel competitor grows at better than 25% per
year, its cost per chip will grow with each successive generation, while the cost of Intel's chips is
almost constant from generation to generation, because being to the right of the X axis
makes Intel operate in the low-slope region of the curves (now it is clearer why Intel's
products are stressed in this book).
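The arithmetic behind Figure ILPU19 is simply the fixed manufacturing cost plus the development cost amortized over units sold; the $300-per-unit example from the figure's comment can be checked directly:

```python
# Per-chip cost model underlying Figure ILPU19: fixed manufacturing cost (FMC)
# plus total development cost spread over the number of units sold.

def cost_per_chip(dev_cost, units, fmc=50.0):
    return fmc + dev_cost / units

# The figure's example: a $500M development cost over 2M units, with FMC = $50:
print(cost_per_chip(500e6, 2e6))  # 300.0 dollars per chip
```

The low-slope region discussed above is visible in the formula: once units is large, the dev_cost/units term changes little, so high-volume producers see a nearly constant per-chip cost.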
Paper [Tredennick96] also discusses ways in which competition can be successful
against Intel: (a) by increasing unit volumes (e.g., by being faster in penetrating a new
high-volume application), and (b) by rendering the importance of the curve moot (e.g., by
making the CPU a low-margin component in systems).
2. Advanced Issues
This part contains the author's selection of research activities which, in his opinion, have
made an important contribution to the field in recent times, and are compatible with the
overall profile of this book.
Paper [Jourdan95] describes an effort to determine the minimal number of functional
units (FUs) for the maximal speed-up of modern superscalar microprocessors. The analysis includes the MC 88110, SUN UltraSPARC, and DEC Alpha 21164 (IO, in-order superscalars), plus
the IBM 604 and MIPS R10000 (OO, out-of-order superscalars). Basic characteristics of modern superscalar processors are given in Figure ILPS1.
Processor   DateShip  Degree  IntegerUnit  ShiftUnit  DivideUnit  MultiplyUnit  FPAddUnit  ConvertUnit  FPDivideUnit  FPMultiplyUnit  DataCachePort
PPC 604         94       4        2            2          1            1             1           1             1              1               1
MC 88110        92       2        2            1          1            1             1           1             1              1               1

Figure ILPS1: Basic characteristics of modern superscalar processors (source: [Jourdan95]).
The major conclusion of the study is that the number of FUs needed in modern superscalar microprocessor architectures to exploit the existing ILP is from 5 to 9, depending on the
application. If the degree of superscaling is four or more, the number of data cache ports
becomes the bottleneck and has to be increased accordingly.
The conditions of the analysis assume applications corresponding to the SPEC92 benchmark suite, a lookahead window of degree 2 to 8, and an OO execution architecture in all
considered cases (where not included in the original architecture, the OO capability is simulated). Effects of cache misses have not been studied (a large enough cache is implied).
For details, the interested reader is referred to the original paper [Jourdan95].
Paper [Wilson96] proposes and analyzes approaches for increasing the cache port efficiency of superscalar processors. The major problem is that, on one hand, cache ports are necessary for better exploitation of ILP, and on the other hand, they are expensive. The
load/store unit data path of a modern superscalar microprocessor is given in Figure ILPS2.
[Block structure: Register File, Store Forwarding, Cache Access Buffer (CAB), Load Issue, Address Calculation, Cache Port.]
Figure ILPS2: The load/store unit data path in modern SS microprocessors (source: [Wilson96])
Legend:
CAB: Cache Address Buffer.
Comment:
This structure supports four different enhancements: (a) load all: if two or more loads from the same address
are present in the buffer, the data which return from memory will satisfy all of them; this will not cause program errors, because the memory disambiguation unit has already removed all loads which depend on not-yet-completed store instructions; (b) load all wide: it is easy to widen the cache port (which means that load all
can be applied to a larger data structure), so that the entire cache line (rather than a single word) can be returned from the cache at a time; this is obtained by increasing only the number of interface sense amplifiers,
which is not expensive; (c) keep tags: if a special tag buffer is included (which holds tags of all outstanding
data cache misses), newly arrived cache accesses which are sure to miss (because their addresses match some
of the addresses in the tag buffer) can be removed from the cache access buffer, thus leaving room for more
instructions which are potentially successful (i.e., likely to hit); (d) line buffer: data returned from the cache
can be buffered in a kind of L0 cache, which is fully associative, multi-ported, and based on the FIFO
replacement policy; if a line buffer contains the data requested by a load, the data will be supplied from the
line buffer (good for data with a high level of temporal locality). Note that the four enhancements can be superimposed, to maximize performance.
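Enhancement (d) can be sketched as a tiny fully associative buffer with FIFO replacement (sizes and the line-index scheme here are illustrative assumptions, not the values used in [Wilson96]):

```python
# Sketch of a "line buffer" (L0 cache): fully associative, FIFO replacement.
from collections import OrderedDict

class LineBuffer:
    def __init__(self, n_lines=4, line_size=32):
        self.n_lines, self.line_size = n_lines, line_size
        self.lines = OrderedDict()           # insertion order doubles as FIFO order

    def lookup(self, addr):
        """True if the line holding addr is present (hit served without a port)."""
        return (addr // self.line_size) in self.lines

    def fill(self, addr, data):
        """Capture a line returned from the cache, evicting the oldest if full."""
        if len(self.lines) == self.n_lines:
            self.lines.popitem(last=False)   # FIFO: evict the oldest line
        self.lines[addr // self.line_size] = data

lb = LineBuffer(n_lines=2)
lb.fill(0, "line A")
lb.fill(32, "line B")
lb.fill(64, "line C")                # buffer full: line A (oldest) is evicted
print(lb.lookup(4), lb.lookup(40))   # False True
```

A hit in the line buffer is served without touching the cache port, which is exactly how this enhancement raises effective port bandwidth for data with high temporal locality.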
The solution offered by this research is to increase the bandwidth of a single port by using
additional buffering in the processor and wider ports in the caches. The authors have proposed four
different enhancements.
The conditions of this research assume split primary caches (two-way set-associative, external, 8 KB each), a unified secondary cache (two-way set-associative, 2 MB),
and a main memory. Four ongoing cache misses are enabled by four special-purpose registers called MSHRs (miss status handling registers). Concrete numerical performance results
were obtained with the SimOS simulation environment and the SPEC95 benchmark suite.
Research performed at the Polytechnical University of Barcelona [Gonzalez96] reports
some very dramatic results. The authors assumed the architecture of Figure ILPS3 and
tried to measure the real ILP in selected SPEC95 benchmarks. They varied
parameters such as: (a) reorder buffer size (1024, 2048, 4096, 8192, 16384, and infinite),
(b) memory ordering (in-order load, out-of-order load), (c) number of memory ports (1, 2,
4, no restrictions), and (d) register pressure (64, 128, 256, 512, infinite register count). The
width of the fetch engine was chosen to be 64, which is much more than in contemporary microprocessors.
[Figure ILPS3: the assumed architecture; blocks: IC (instruction cache), F&D (fetch and decode), RR (register rename), ROB (reorder buffer), RF (register file), EU (execution units), DC (data cache).]
Program     RB=1024  RB=2048  RB=4096  RB=8192  RB=16384  RB=inf
Go            30.63    41.00    47.69    49.53     50.13   50.13
Compress      31.59    43.19        -        -         -   43.19
Vortex        23.86    24.70    25.84    27.04     28.80   28.80
Li            17.97    18.11        -        -         -   18.11
Fpppp         16.04    18.63    21.93    25.53     28.40   28.40
Applu         29.85    31.43        -        -         -   31.43
Wave5         18.56    18.56        -        -         -   18.56
Swim          16.71    16.71        -        -         -   16.71
Turb3d         9.02     9.03        -        -         -    9.03
Figure ILPS4: The IPC for various reorder buffer sizes and no memory dependencies
(source: [Gonzalez96]).
Legend:
RB: Reorder Buffer.
Comment:
A dash is used to indicate that the value did not change, i.e. that the saturation point has been reached (saturation point is here defined as the value obtained by the infinite reorder buffer). Different benchmarks saturate
at different reorder buffer sizes.
Program     In order   Out-of-order load   No dependencies
Go              4.32        6.18               50.13
Compress       10.56       13.87               43.19
Vortex          5.17        5.76               28.80
Li              4.14        5.75               18.11
Fpppp           8.50        8.57               28.40
Applu           6.40        6.47               31.43
Wave5          13.81       18.56               18.56
Swim            4.16        4.83               16.71
Turb3d          9.03        9.03                9.03
Figure ILPS5: The IPC for in-order and out-of-order load
in the presence of memory dependencies (source: [Gonzalez96]).
Comment:
The column without memory dependencies has been repeated from the previous figure, so that the results can
be compared. With most of the benchmarks, the drop in IPC is substantial.
Program     1 Port   2 Ports   4 Ports   No dependencies
Go            2.27      3.86      5.30       6.18
Compress      5.84      9.06     11.50      13.87
Vortex        2.12      3.63      4.94       5.76
Li            2.08      3.56      5.04       5.75
Fpppp         2.26      4.16      6.55       8.57
Applu         2.51      3.79      4.80       6.47
Wave5         5.22      9.82     18.55      18.56
Swim          2.93      3.84      4.54       4.83
Turb3d        4.46      8.94      9.01       9.03
Figure ILPS6: The IPC with 1, 2, 4, or infinitely many memory ports
(source: [Gonzalez96]).
Comment:
For comparison purposes, the out-of-order load column has been repeated from the previous figure, as the
case without any restrictions on the number of memory ports. No benchmark saturates at 2 memory ports,
which is used in most contemporary microprocessors.
Program     RB=32   RB=64   RB=128   RB=256   No limit
Go           4.29    5.43     6.02     6.17     6.18
Compress     4.11    5.37     6.92    10.12    13.87
Vortex       4.78    5.65     5.74     5.74     5.76
Li           4.39    5.42     5.75     5.75     5.75
Fpppp        5.83    7.17     7.45     7.61     8.57
Applu        4.37    5.25     5.51     5.57     6.47
Wave5        4.52    5.90     6.77     7.58    18.56
Swim         3.46    4.40     4.82     4.83     4.83
Turb3d       7.11    8.11     8.39     8.39     9.03
Figure ILPS7: The IPC as a function of a limited size reorder buffer (source: [Gonzalez96]).
Comment:
Very few benchmarks saturate at 128 or 256 entry reorder buffer.
Program     64 regs   128 regs   256 regs   512 regs   No limit
Go            4.74       5.97       6.17       6.18       6.18
Compress      5.45       8.81      12.55      13.87      13.87
Vortex        5.37       5.74       5.74       5.74       5.76
Li            4.98       5.75       5.75       5.75       5.75
Fpppp         6.46       7.43       7.61       7.75       8.57
Applu         4.93       5.53       5.58       5.63       6.47
Wave5         5.86       7.35       7.60       7.62      18.56
Swim          4.20       4.83       4.83       4.83       4.83
Turb3d        8.20       8.39       8.41       8.41       9.03
Figure ILPS8: The IPC as a function of a limited physical register count
(source: [Gonzalez96]).
Comment:
Register utilization is a function of the optimizing compiler characteristics.
The overall conclusion of the paper is that the available ILP is much larger than what
current microprocessors can exploit. The major obstacles on the way to considerably better microprocessor performance are memory dependencies.
One of the major problems of superscalar microprocessors in particular, and microprocessors in general, is their lack of flexibility. Execution units are fixed in their design: they execute a given set of operations on a given set of data types. Recent advances in FPGA
(Field Programmable Gate Array) technology enable several reconfigurable execution units
to be implemented in a superscalar microprocessor. Of course, the full benefits of reconfigurability can be achieved only if appropriate compilers are developed.
A survey of reconfigurable computing can be found in [Villasenor97]. The first attempt
at reconfigurable computing dates back to the late 60s proposal by Gerald Estrin of UCLA,
which was highly constrained by the technology capabilities of those days. The latest attempts
include a number of designs oriented to FPGAs with over 100,000 logic elements, including the effort of John Wawrzynek of UC Berkeley. An earlier research effort of Eduardo
Sanchez and associates at the EPFL in Lausanne, Switzerland, starts from the concept of
reconfiguration at program or program-segment boundaries [Iseli95] and progresses toward
reconfiguration at the single-instruction boundary level. Such a development is made possible by the latest technology trends, like those by André DeHon and Thomas Knight at MIT,
which imply FPGAs storing multiple configurations and switching among different configurations in a single cycle (on the order of only tens of nanoseconds).
An important new trend in instruction-level parallelism research involves techniques that optimize system design and performance by turning to complexity reduction
and dynamic speculation [Palacharla97, Moshovos97].
Paper [Hank95] describes a research effort referred to as region-based compilation. The
analysis is oriented to superscalar, superpipelined, and VLIW architectures. Traditionally,
compilers have been built assuming functions as units of optimization. Consequently, function boundaries are not changed, and numerous optimization opportunities remain hidden. An example of an undesirable function-based code partition is given in Figure ILPS9.
[Figure ILPS9: code blocks 1, 2, 3, 5, 7, and 8 partitioned between Function A and Function B.]
The solution offered by this research is to allow the compiler to repartition the code into
more desirable units, in order to achieve more efficient code optimization. The block structure
of a region-based compiler is given in Figure ILPS10.
[Block structure: Program, Selection, Classifier, Control, Router, Optimization, Scheduling, Regalloc.]
Figure ILPS10: Example of a region-based compiler, inlining and repartition into regions (source: [Hank95])
Comment: The essential elements that distinguish this compiler from traditional ones are the classifier unit
and the router unit.
The conditions of this research imply that compilation units (CUs) are selected by the compiler,
not by the software designer. An optimal CU size enables better utilization of the employed transformations. Profiler support is used to determine the regions of code which enable
better optimization. All experiments have been based on the Multiflow technology for trace
scheduling and register allocation.
Another approach of interest is to create architectures which provide features that
facilitate compiler enhancement of ILP in all programs. One of the architecture types
which follow this approach is EPIC (Explicitly Parallel Instruction Computing). The term
was coined by Hewlett-Packard and Intel in their joint announcement of the IA-64 instruction set [Gwennap97]. The EPIC architectures require the compiler to express the ILP (of
the program) directly to the hardware.
Techniques have been developed to represent control speculation, data dependence speculation, and predication. It has been shown that these techniques can provide a considerable
performance improvement [e.g., August98].
The rationale behind the EPIC architectures is as follows. Processors introduced before the
year 1990 (for the most part) were able to execute at most one instruction per cycle. Processors introduced before the year 1995 (for the most part) were able to execute at most four
instructions per cycle. Processors introduced after the year 2000 (for the most part) will be
able to execute sixteen or more instructions per cycle. All this implies an enormous pressure on compiler technology. One way to help is to migrate some of the load from the compiler domain into the architecture domain.
Many believe that control-flow misprediction penalties are the major source of performance loss in wide-issue superscalar processors. One approach to coping with the problem is
to incorporate the SEE (selective eager execution) mechanism into the underlying processor
architecture. One such effort is described in [Klauser98]. The essence of the SEE approach is to
execute both execution paths when a diffident branch occurs. A diffident branch is a branch
which cannot be predicted with a high level of confidence. Obviously, executing both execution paths all the time is too complex, and (from the cost/performance point of view) not
necessary if a branch can be predicted with a high level of confidence. Simulation results of
[Klauser98] show that the SEE approach has a performance improvement potential of up
to about 50% (maximum about 35% and average about 15% for selected benchmarks) for
an 8-way superscalar architecture with an 8-stage-deep pipeline. The wider the superscalar and the deeper the pipeline, the higher the potential of the SEE approach.
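A back-of-the-envelope model conveys the trade-off (the threshold, penalty, and the zero-cost assumption for a forked branch are all illustrative simplifications, not figures from [Klauser98]):

```python
# Expected misprediction cost for one branch, with and without SEE.
# A branch predicted with confidence below `threshold` is "diffident";
# SEE executes both its paths, trading execution bandwidth for a zero
# redirection penalty on that branch.

def expected_penalty(p_correct, penalty, see=False, threshold=0.9):
    if see and p_correct < threshold:
        return 0.0                     # both paths run: no redirection penalty
    return (1 - p_correct) * penalty   # pay the penalty on a misprediction

print(expected_penalty(0.75, 10))            # 2.5 cycles without SEE
print(expected_penalty(0.75, 10, see=True))  # 0.0 cycles: diffident branch forked
```

Confident branches are left to the predictor, so the extra execution bandwidth is spent only where the predictor is likely to be wrong; this is the "selective" in selective eager execution.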
The success of wide-issue superscalar processors is highly dependent on the existence of
adequate data bandwidth. One solution to the problem is to implement a larger number of
ports to the data cache. Another solution is to use the so-called DDA (data decoupled architecture), as in [Cho99].
With support from the compiler and/or hardware, early in the processor pipeline (before entering the reservation stations), the DDA approach partitions the memory stream into two
independent streams and feeds each stream into a separate memory unit (access queue and
cache). This has two advantages: (a) the cost and complexity of building a large cache with
many ports are avoided, and (b) the splitting of the streams enables each stream to be
treated with different specialized optimization techniques, which potentially results in
higher efficiency. In the case of [Cho99], one stream consists of the local-variable accesses,
and the other stream includes the rest. The first stream is fed into a specialized LVC (local variable cache), and optimizations like fast data forwarding or access combining can be
utilized.
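The splitting step can be sketched as a simple address-range test (the stack-region boundary used here is an arbitrary assumption for illustration; a real DDA implementation identifies local-variable accesses with compiler and/or hardware support):

```python
# Sketch of a DDA-style split of the memory stream: accesses to the stack
# (local variables) go to one stream/cache, everything else to the other.

STACK_BASE = 0x7000_0000   # assumed start of the stack region (illustrative)

def partition(addresses):
    local_vars, rest = [], []
    for addr in addresses:
        (local_vars if addr >= STACK_BASE else rest).append(addr)
    return local_vars, rest

local_vars, rest = partition([0x7000_0010, 0x0000_1000, 0x7000_0020])
print(len(local_vars), len(rest))  # 2 1
```

Each resulting stream then feeds its own access queue and cache, so neither cache needs the full port count of a unified design.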
Studies show that considerable ILP is present among instructions that are dynamically
far away from each other. With this in mind, paper [Vajapeyam99] proposes a hardware
mechanism called DV (dynamic vectorization), which quickly builds a large logical instruction window. DV converts repetitive dynamic instruction sequences into a vector form,
thus enabling concurrent processing of instructions in the current program loop and those
from far beyond the current program loop. The DV mechanism helps in cases when the
static control flow is too complex and compile-time vectorization cannot be very successful. Evaluations using SPECint92 have shown that a relatively large portion of dynamic instructions can be captured into a vector form, yielding a speedup of two or more. Traditional instruction windows range in length from a few tens to a couple of hundred instructions. With the help of DV, an instruction window of up to a thousand instructions
or more could be fully utilized.
The essence of DV is as follows: it detects repetitive control flow at run time and captures the corresponding dynamic loop body in a vector form. Consequently, multiple loop
iterations are issued from a single copy of the loop body in the instruction window. This
eliminates the need to refetch the loop body for each loop iteration, which frees the fetch
stage to fetch post-loop code (instead of the loop code, as in traditional solutions). The end
result is that a much larger instruction window can be built.
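The detection step can be sketched as a scan of the dynamic instruction-address trace for a body that repeats back-to-back (a deliberately naive illustration; real detectors in the DV proposal are considerably more elaborate):

```python
# Sketch of run-time loop detection for dynamic vectorization: find the first
# instruction-address sequence that immediately repeats, as a candidate loop
# body to capture in vector form.

def detect_loop(trace):
    """Returns (start, end) indices of the first repeating body, or None."""
    last_seen = {}
    for i, addr in enumerate(trace):
        if addr in last_seen and i - last_seen[addr] > 1:
            body = trace[last_seen[addr]:i]
            if trace[i:i + len(body)] == body:   # body repeats immediately
                return last_seen[addr], i
        last_seen[addr] = i
    return None

trace = [10, 20, 30, 40, 20, 30, 40, 20, 30, 40, 50]  # loop body: 20, 30, 40
print(detect_loop(trace))  # (1, 4)
```

Once a body like this is captured, subsequent iterations issue from the single stored copy, freeing the fetch stage for the post-loop code as described above.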
From the discussion presented so far, one could conclude that all new microprocessors
follow the described trend of maximal exploitation of ILP. However, that is not the
case. The IBM S/390 single-chip mainframe represents an interesting new experiment with
a processor that is not superscalar [Slegel99]. Instead of focusing on the execution of more than
one instruction per cycle, it focuses on minimizing the number of cycles needed to
execute instructions of the ESA/390 architecture; this architecture stresses complex instructions needing tens, hundreds, or even thousands of clock cycles to execute.
* * *
The author and his associates were not very active in the field of ILP, except for side activities on related projects. For details, see [Helbig89, Milutinovic96d].
3. Problems
1. What type of code includes more ILP: (a) scientific code, (b) artificial intelligence
oriented code, or (c) data processing code? Explain why.
2. Design a control unit (CU) for an N-issue superscalar microprocessor. Derive the formula that gives the CU transistor count as a function of N.
3. Design a control unit (CU) for an N-deep superpipelined microprocessor. Derive the
formula that gives the CU transistor count as a function of N.
4. Some years ago, many believed that MLP = 4 (machine-level parallelism equal to 4) is the maximum that makes sense to implement. These days researchers talk about MLP = {16, 32}. Check the open literature and try to find out the asymptotic value of MLP. Explain the different opinions.
5. Explain the differences between superpipelined, superscalar, and VLIW code, using a simple code example. Which of the three approaches is potentially the best one, if the available compiler technology is not very sophisticated?
Prediction Strategies
This chapter includes two parts, one on branch prediction strategies and one on data prediction strategies. Chronologically, branch prediction comes first, and data prediction
represents a newer development.
The DLX machine by Hennessy and Patterson uses the same solution. It includes a return
buffer for the nesting depth of up to 16. This number was obtained from statistical analysis
of SPEC89, aimed at the optimal complexity/performance ratio.
The simplest predictor is referred to as the 2-bit predictor, introduced by Jim Smith. It is shown in Figure BPSU1. The 2-bit predictor yields considerably better performance than the one-bit predictor. This is because the 2-bit predictor mispredicts only at one of the two branches forming a loop, while the one-bit predictor mispredicts at both branches. A three-bit predictor yields performance that is only slightly better, at a 50% larger cost. Consequently, only 2-bit predictors make sense, both in the case of the simplest predictor of Figure BPSU1 and in the more complex predictors to be discussed later.
[State diagram: four states, 11 (PredictTaken1), 10 (PredictTaken2), 01 (PredictNotTaken1), and 00 (PredictNotTaken2); Taken outcomes move the machine toward state 11, NotTaken outcomes move it toward state 00.]
Figure BPSU1: States of the 2-bit predictor;
avoiding the misprediction on the first iteration of the repeated loop (source: [Hennessy96])
Legend:
Nodes: states of the scheme,
Arcs: state changes due to branches.
Comment:
State 11 means branch very likely. State 10 means branch likely. State 01 means branch unlikely. State 00 means branch very unlikely. Note that two (rather than three) mispredictions move the state machine from 11 to 00, and vice versa. The best initial state is branch likely, because a randomly chosen branch instruction is more likely to branch than not, but not very likely to branch. A condition-controlled loop is typically executed several times. The last execution of the loop condition statement must result in a misprediction (that misprediction cannot be avoided, unless a special loop count estimation algorithm is incorporated, like the one in [Chang95]). If a one-bit predictor is used, the predictor bit gets inverted on the misprediction, and the first next execution of the loop condition statement will also result in a misprediction (in spite of the fact that loops are very unlikely to execute only once). However, if a 2-bit predictor is used, an inertia is incorporated into the system, and the predictor needs two mispredictions before it switches from 11 to 00. If the average loop count is N, the 2-bit predictor mispredicts at the rate of 1/N, compared to 2/N for the one-bit predictor, an improvement of 2/N - 1/N = 1/N (a number which justifies the exclusive use of 2-bit predictors). On the other hand, a three-bit predictor brings a negligible performance improvement and a 50% complexity increase, compared to the 2-bit predictor (another fact which fully justifies the exclusive use of the 2-bit predictor).
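The inertia argument above can be checked with a small simulation. The following sketch (illustrative Python, not from the book; the class and function names are ours) drives a one-bit and a 2-bit predictor over two consecutive runs of the same loop, whose branch is taken eight times and then falls through.

```python
class OneBitPredictor:
    """Remembers only the last outcome."""
    def __init__(self, taken=True):
        self.bit = taken
    def predict(self):
        return self.bit
    def update(self, taken):
        self.bit = taken

class TwoBitPredictor:
    """Saturating counter: states 3 and 2 predict taken, 1 and 0 predict not taken."""
    def __init__(self, state=2):          # start in "branch likely" (state 10)
        self.state = state
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

def mispredictions(predictor, outcomes):
    """Count mispredictions of a predictor over a sequence of branch outcomes."""
    miss = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            miss += 1
        predictor.update(taken)
    return miss

# Two back-to-back runs of a 9-iteration loop: taken 8 times, then not taken.
trace = ([True] * 8 + [False]) * 2
```

On this trace the 2-bit predictor misses only on the final iteration of each run (2 misses), while the one-bit predictor also misses on the first iteration of the second run (3 misses), matching the 2/N versus 1/N argument above.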
Prediction accuracy of a BTB depends on its size, as indicated in Figure BPSU2. This figure shows that before the saturation point is reached, the logic design strategy of the BTB may help, since the set-associative approach works somewhat better than the direct-mapped approach. However, after the saturation point is reached, the logic design strategy does not matter any more. The figure also shows that the saturation point is reached at about 2K entries in the BTB. Several other studies claim that a 4K-entry BTB is about as good as an infinite BTB. Note that it is not sufficient that the branch instruction is located in the BTB; the prediction must also be correct. In other words, in real designs, the cost of a misprediction is typically the same as the cost of a BTB miss!
[Graph: percentage of correctly predicted branches (CPB, up to 90%) versus number of BTB entries (NE, from 16 to 2048), for the direct-mapped (DM) and 4-way set-associative (4W SA) organizations.]
Figure BPSU2: Average branch-target buffer prediction accuracy;
a BTB with 2K entries is about the same in performance as an infinite BTB
(source: [Johnson91])
Legend:
CPB: Percentage of Correctly Predicted Branches,
4WSA: 4-Way Set-Associative,
DM: Direct Mapped,
NE: Number of Entries.
Comment:
It is important to notice that the set-associative approach is better than the direct-mapped approach only while the BTB is not large enough.
Another solution is presented in Figure BPSU3, where prediction related information is included in the instruction cache, in addition to the standard information (the addressing information and the code). The prediction related information includes two fields: (a) the successor index field, and (b) the branch entry index field.
[Figure: a cache entry holding fetch information and eight instruction slots, I0 through I7.]
Figure BPSU3: Instruction Cache Entry for Branch Prediction (source: [Johnson91])
Legend:
I: instruction slot within the cache entry.
Comment:
Efficiency of the approach depends on the number of instructions in each cache entry. Other approaches to the incorporation of prediction into cache memory are possible, too. Their elaboration is left as an exercise for the students.
The successor index field contains two subfields: (a) the address of the next cache entry predicted to be fetched, and (b) the address of the first instruction (in that entry) which is predicted to be executed. The length of the successor index field depends on the cache size (the number of instructions that fit into the cache). The relative size of the two subfields depends on the number of instructions per cache entry. For example, a 1MB direct-mapped cache for a machine with 64-bit instructions, and 8 instructions per cache entry, requires a 17-bit successor index field (N1 = 14 bits to address the cache entry and N2 = 3 bits to address an instruction within the entry). This is so because a 1MB cache holds 128K 8-byte (64-bit) instructions, and these instructions are organized into 16K cache entries (8 instructions per cache entry). To address one of the 16K cache entries, one needs 14 bits. To address one instruction within an entry, one needs 3 bits.
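The field-width arithmetic above can be captured in a few lines (an illustrative helper, not from the book; the function name is ours):

```python
import math

def successor_index_bits(cache_bytes, instr_bytes, instrs_per_entry):
    """Return (entry bits N1, within-entry bits N2, total successor index bits)."""
    entries = cache_bytes // (instr_bytes * instrs_per_entry)  # cache entries
    n1 = int(math.log2(entries))            # bits to select a cache entry
    n2 = int(math.log2(instrs_per_entry))   # bits to select an instruction within it
    return n1, n2, n1 + n2

# 1 MB cache, 8-byte (64-bit) instructions, 8 instructions per entry:
# 16K entries -> N1 = 14, N2 = 3, 17 bits in total.
```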
The branch entry index field specifies the location (within the current cache entry) of the branch instruction that is predicted to be taken. Consequently, instructions beyond the branch point are predicted not to be executed.
This organization represents one way to incorporate prediction related information into the cache. The successor index field specifies where to start the prefetching from (instructions before the one pointed to by the successor index field are not needed). The branch entry index field specifies the branch instruction that the prediction information refers to (instructions beyond the one pointed to by the branch entry index field are likely not to be executed).
An improvement of 2-bit predictors is referred to as the two-level predictor (a more accurate name would be two-level 2-bit predictor, since the 2-bit predictor is an element of a two-level predictor). The scheme was introduced by Yeh and Patt. Two-level predictors use information on the behavior of other branches (at other addresses) in order to make a prediction about the currently executing branch instruction (at the current address). Practically always (although theoretically not always), two-level predictors show better performance than 2-bit predictors. A two-level 2-bit (2,2) branch predictor is shown in Figure BPSU4.
The example (2,2) branch predictor from Figure BPSU4 includes four vectors of 2-bit predictors. The first one is for the case when the most recent two branches (at any two addresses) were (0,0), the second one for the case when they were (0,1), etc. In this context, (0,0) means that the most recent two branches (at any addresses) were not taken, etc. Information about the outcome of the most recent branches (at any addresses) is kept in a register called the GBH (global branch history). Therefore, for a given branch, the prediction depends both on the address of the branch (horizontal entry into the matrix of 2-bit predictors) and on the contents of the GBH register (vertical entry into the matrix of 2-bit predictors).
The described example belongs to the category of global predictors, because there is only one GBH, and its contents refer to the most recent branches at any addresses. It will be seen later that one can also talk about per-address predictors, where the number of GBH registers is equal to the number of entries in each vertical vector of 2-bit predictors.
[Figure: a matrix of 2-bit predictors; the 2-bit GBH selects one of the four columns, and the branch address selects the row; the selected counter gives the prediction outcome PO = XX.]
Figure BPSU4: A (2,2) Branch Predictor using a 2-bit global history to select one of the four 2-bit predictors
(source: [Hennessy96])
Legend:
GBH: Global Branch History,
PO: Prediction Outcome.
Comment:
A global branch history register of 9 or 10 bits, and a single (vertical) branch predictor vector of 2048 or 4096 entries, seems to be a good price/performance compromise.
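A minimal sketch of such a two-level global predictor follows (illustrative Python, not from the book; the class name, table size, and initial state are our assumptions):

```python
class TwoLevelGlobalPredictor:
    """(m, 2) global predictor: an m-bit global branch history (GBH) selects a
    column, and low-order branch address bits select a row of 2-bit counters."""
    def __init__(self, m=2, rows=16):
        self.m = m
        self.rows = rows
        self.gbh = 0                                        # m-bit global history
        self.table = [[2] * (1 << m) for _ in range(rows)]  # init: branch likely

    def predict(self, addr):
        return self.table[addr % self.rows][self.gbh] >= 2

    def update(self, addr, taken):
        row = self.table[addr % self.rows]
        row[self.gbh] = min(3, row[self.gbh] + 1) if taken else max(0, row[self.gbh] - 1)
        # shift the actual outcome into the global history
        self.gbh = ((self.gbh << 1) | int(taken)) & ((1 << self.m) - 1)
```

Each (row, history) pair has its own 2-bit counter, so the same branch can be predicted differently depending on how the most recent m branches behaved.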
The basic rationale behind global predictors is explained using the code example from Figure BPSU5a and the explanation in Figure BPSU5b. Figure BPSU5a includes a section of HLL (high-level language) code and its MLL (machine-level language) equivalent. From the explanation in Figure BPSU5b one can see that if the previous branch (B1) is not taken, the next branch (B2) will also be not taken. Branches B1 and B2 are related through the semantics of the code, and that is what global predictors rely on.
HLL:
if (d == 0)
    d = 1;
if (d == 1)
    ...
MLL (d assigned to r1):
        bnez    r1, l1       ; branch B1 (d != 0)
        addi    r1, r0, #1   ; d == 0, so set d = 1 (note: r0 = 0)
l1:     subi    r3, r1, #1
        bnez    r3, l2       ; branch B2 (d != 1)
l2:     ...
Figure BPSU5a: Example code (source: [Hennessy96])
Legend:
HLL: High-Level Language,
MLL: Machine-Level Language.
Comment:
This example implies that register r0 is hardwired to zero, which is typical for a number of RISC microprocessors.
Dinit   D == 0?   B1          D before B2   D == 1?   B2
0       Yes       Not taken   1             Yes       Not taken
1       No        Taken       1             Yes       Not taken
2       No        Taken       2             No        Taken
Figure BPSU5b: Example explanation; illustration of the advantage of a two-level predictor with one-bit history;
if B1 is not_taken, then B2 will also be not_taken, which can be utilized to achieve better prediction (source: [Hennessy96])
Legend:
Dinit: Initial D,
DbeforeB2: Value of D before B2.
Comment:
It is important to underline that a number of other program segments would result in the same pattern of taken/not-taken relationship, which is a consequence of programming discipline and compiler design. In other words, parts of code which are semantically unrelated can be strongly related as far as prediction is concerned.
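The taken/not-taken pattern of Figure BPSU5b can be reproduced directly from the code semantics (illustrative Python; the function name is ours):

```python
def branch_outcomes(d):
    """Outcomes of B1 (bnez: d != 0) and B2 (bnez: d != 1) for the code
    'if (d == 0) d = 1; if (d == 1) ...' of Figure BPSU5a."""
    b1_taken = (d != 0)   # B1 branches around the assignment when d != 0
    if not b1_taken:
        d = 1             # the first 'then' body: d = 1
    b2_taken = (d != 1)   # B2 branches around the second 'then' body
    return b1_taken, b2_taken
```

For every value of d, B1 not taken forces B2 not taken, which is exactly the correlation a global predictor can learn.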
In general, one can talk about an (M,N) predictor; it uses the behavior of the last M branches to select one of the 2^M predictor vectors of length L, each consisting of N-bit predictors. One study claims that it is not unreasonable for M to go all the way up to M = 9. The previous discussion argues that it does not make sense to go beyond N = 2.
Note that each 2-bit predictor corresponds to a number of different branches, because only a subset of address bits is used to access different 2-bit predictors in one vertical vector
of 2-bit predictors. Consequently, the prediction may not correspond to the branch currently
being executed, but to another one with the same set of low-order address bits. However,
most of the time, the prediction will correspond to the branch currently being executed, because of the locality principle. In some cases, the prediction will not correspond to the
branch currently being executed, but the scheme will still work, because the same programming discipline results in a similar pattern of branchings/nonbranchings.
Figure BPSU6a represents another viewpoint of looking at the above mentioned types of
branch predictors. Different viewpoints enable the reader to create a better understanding of
the issue.
[Figure BPSU6a: block diagrams of the 2bC, GAs, and PAs schemes. In 2bC, n low-order branch address bits index a single array of 2^n 2-bit counters. In GAs, the n address bits and the single m-bit BHSR together index a 2^n-by-2^m array of 2-bit counters; a horizontal vector of this array is a PHT. In PAs, there is a separate m-bit BHSR for each of the 2^n address indices.]
The basic scheme is now referred to as 2bC, as it is named in a number of research papers. Only a subset of n branch address bits is used to access the single vector of 2^n 2-bit registers (n = log2 L). The cost (complexity) of the scheme is given by:
C(2bC) = 2*2^n bits.
The global scheme is now referred to as GAs, as it is named in a number of research papers. Again, only a subset of the lower n address bits is used to access one of the 2^m vertical branch prediction vectors (m = M). The one to be accessed is determined by the contents of the BHSR (branch history shift register). The terms BHSR and GBH refer to the same thing. One horizontal vector of 2-bit predictors is referred to as a PHT (pattern history table). The cost of the scheme is given by:
C(GAs) = m + 2^(m+n+1) bits.
The per-address scheme is here referred to as PAs, as it is named in a number of research papers. Everything is the same, except that each PHT is associated with a different BHSR. Consequently, the cost of the scheme is given by:
C(PAs) = m*2^n + 2^(m+n+1) bits,
assuming the same number of local branch history registers and rows in the PHT.
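The three cost formulas can be compared numerically (an illustrative sketch; the function names are ours):

```python
def cost_2bc(n):
    """2^n counters of 2 bits each."""
    return 2 * (1 << n)

def cost_gas(m, n):
    """One m-bit BHSR plus a 2^n-by-2^m array of 2-bit counters."""
    return m + (1 << (m + n + 1))

def cost_pas(m, n):
    """2^n per-address m-bit BHSRs plus the same counter array as GAs."""
    return m * (1 << n) + (1 << (m + n + 1))
```

The difference between PAs and GAs is exactly the extra (2^n - 1) history registers of m bits each.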
Figure BPSU6b covers four different schemes of a much smaller cost, and (most of the time) only a slightly lower performance. The simplification is based either on an interesting run-time property or on an efficient compile-time effort.
In the case of gshare (introduced by McFarling; for details see [McFarling95]), the lowest n = m bits of the branch address are exclusive-ORed with the contents of the single global m-bit BHSR. The obtained value points to one of the 2^m 2-bit predictors. The exclusive-OR operation enables each vertical vector from the previous schemes to be substituted with a single 2-bit predictor. Consequently, the complexity of the scheme drops considerably. Fortunately, empirical studies show that the performance of the scheme drops only slightly, due to the above discussed code locality and programming culture issues. The cost of this scheme is given by:
C(gshare) = m + 2^(m+1) bits.
In the case of pshare, the lowest n = m bits of the branch address are exclusive-ORed with the per-address BHSR corresponding to the address of the branch instruction currently being executed. Basically, everything is the same, except that the global treatment is substituted by the per-address treatment. The cost of this scheme is given by:
C(pshare) = m*2^m + 2^(m+1) bits.
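The index computation that distinguishes gshare from the GAs family is a single XOR (illustrative Python; the function names are ours):

```python
def gshare_index(branch_addr, gbh, m):
    """gshare: XOR the lowest m branch address bits with the m-bit global
    history to pick one of the 2^m 2-bit counters."""
    mask = (1 << m) - 1
    return (branch_addr & mask) ^ (gbh & mask)

def cost_gshare(m):
    return m + (1 << (m + 1))             # one BHSR + 2^m 2-bit counters

def cost_pshare(m):
    return m * (1 << m) + (1 << (m + 1))  # 2^m per-address BHSRs + counters
```

Two branches with the same low-order address bits but different recent histories usually map to different counters, which is how the XOR recovers most of the accuracy of the two-dimensional table.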
[Figure: block diagrams of gshare (the lowest m branch address bits, XORed with the global m-bit BHSR, index 2^m 2-bit counters), pshare (address bits XORed with a per-address BHSR), GSg (a single m-bit global BHSR indexes 2^m one-bit static predictors; cost = m + 2^m bits), and PSg, attributed to Patt, Sechrest, and Lee/Smith (2^n per-address m-bit BHSRs index an array of 2^m one-bit predictors).]
Figure BPSU6b: Schemes gshare, pshare, GSg, and PSg (source: [Evers96])
Legend:
BHSR: Branch History Shift Register.
Comment:
A common characteristic of all four schemes is that the stress is on decreased complexity, rather than increased performance.
In the case of GSg, predictions are based on compile-time prediction (rather than run-time prediction). Consequently, predictors are one bit wide (rather than 2 bits wide). Scheme GSg is a global scheme. This means that the address of the current branch is not relevant. What is relevant is the contents of the m-bit BHSR, which points to one of the 2^m one-bit predictors. This scheme has only a theoretical value. It does not have any practical value, in spite of its extremely low cost function:
C(GSg) = m + 2^m bits.
Finally, in the case of PSg (introduced, in various forms, independently by Lee and Smith, by Sechrest, and by Patt), the lowest n bits of the branch address are used to select one of the 2^n m-bit BHSRs of the per-address type. The selected m-bit BHSR is used to point to one of the one-bit predictors, inside the array of one-bit predictors; the length of this array is 2^m. This scheme has some practical value, and is used in low-cost and hybrid schemes. Its cost is given by:
C(PSg) = m*2^n + 2^m bits.
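For completeness, the two static-hint cost formulas in the same style (an illustrative sketch; the function names are ours):

```python
def cost_gsg(m):
    """One m-bit global BHSR plus 2^m one-bit compile-time hints."""
    return m + (1 << m)

def cost_psg(m, n):
    """2^n per-address m-bit BHSRs plus 2^m one-bit hints."""
    return m * (1 << n) + (1 << m)
```

With m = 8 and n = 11 (the PSg(m) configuration that appears later in Figure BPSS3), cost_psg gives 8*2048 + 256 = 16640 bits, well below the dynamic schemes of similar history length.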
Note that the warm-up time of the schemes based on one-bit predictors is equal to zero; this characteristic has been utilized in some hybrid branch predictors to be discussed later.
Software scheduling implies rearrangement of code at compile time, with hints from the programmer or the profiler. The motion is either local (within basic block boundaries) or global (across basic block boundaries). A basic block starts at one of three points: (a) at the very beginning of the code, (b) at the instruction following a branch, or (c) at a labeled instruction (a labeled instruction is a possible target of a branch). A basic block ends at one of the following three points: (a) at the very end of the code, (b) at a branch instruction (the branch instruction itself is treated as part of the basic block it terminates), or (c) at the instruction immediately before a labeled instruction. Each code rearrangement must be accompanied by appropriate code compensations, so that the semantic structure of the code is not violated, no matter which way the execution proceeds.
Trace scheduling is the principal technique used in conjunction with VLIW (very long instruction word) architectures. Here, global code motion is enhanced with techniques to detect parallelism across conditional branches, assuming a special type of architecture (VLIW).
Loop unrolling is a technique used on any type of architecture to increase the amount of sequentially executable code, i.e., to increase the size of basic blocks. Although the technique can be used in conjunction with any type of architecture, it gives the best results in conjunction with VLIW architecture and trace scheduling.
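A minimal illustration of unrolling by four (Python is used here only for readability; a compiler would perform this transformation on the machine-code loop):

```python
def saxpy(a, x, y):
    """Original loop: one loop-closing branch test per element."""
    for i in range(len(x)):
        y[i] += a * x[i]

def saxpy_unrolled(a, x, y):
    """Unrolled by 4: one branch test per four elements, and a basic block
    four times larger for the scheduler to work with."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        y[i]     += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
        i += 4
    while i < n:              # cleanup loop for the leftover iterations
        y[i] += a * x[i]
        i += 1
```

The cleanup loop is the compensation code mentioned above: it preserves the semantics when the trip count is not a multiple of the unrolling factor.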
Software pipelining (symbolic loop unrolling) is a technique to pipeline operations from different loop iterations. Each iteration of a software-pipelined loop includes instructions from different iterations of the original loop.
Note that many of the software techniques can be implemented in hardware, directly or
indirectly. For example, the Tomasulo algorithm is a hardware equivalent of the software
pipelining algorithm.
Software BPS will not be further elaborated here. For more information on these issues,
the interested reader is referred to specialized literature.
The most common predicated instruction in modern microprocessors is the predicated register-to-register move. However, predicated instructions are not a new invention. A form of predicated instruction can be found even in the first microprocessor of the x86 series, the Intel 8086. The Intel 8086 instruction set includes the conditional REP prefix, which can be treated as a primitive form of predication.
Predicated instructions help to eliminate branches in some contexts: (a) where one has an if-then with a minimal body, or (b) where the code can be rearranged to create an if-then with a minimal body. These contexts occur in numerical code (for example, when an absolute value is to be computed) and in symbolic code (for example, when a repetitive search is to be applied).
The usefulness of predicated instructions is limited in a number of cases: (a) when moving a predicated instruction across a branch slows the code down, because a cancelled instruction still takes execution cycles; (b) when the condition is evaluated too late (the sooner the condition is evaluated, the better the speed of the code); (c) when the clock count of predicated instructions is too large, which happens easily in some architectures; and (d) when exception handling may become a problem due to the presence of a predicated instruction.
Still, as indicated above, it is the rule rather than the exception that modern microprocessor architectures include predicated instructions. The above mentioned predicated register-to-register move is included in architectures like DEC Alpha, SGI MIPS, IBM PowerPC, and SUN SPARC. The PA (Precision Architecture) approach of HP is that any register-to-register instruction can be predicated (not only the move instruction).
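The absolute-value case mentioned above shows the basic pattern: compute both outcomes and select one under a predicate, instead of branching (illustrative Python over integers; in hardware this role is played by a conditional move):

```python
def abs_branching(x):
    if x < 0:          # a conditional branch the predictor must guess
        x = -x
    return x

def abs_predicated(x):
    """Branch-free form: the comparison produces a mask (the predicate),
    and the negation is applied conditionally by arithmetic alone."""
    m = -(x < 0)       # m == -1 (all ones) when x < 0, else 0
    return (x ^ m) - m # conditionally negate: (~x) + 1 when m == -1
```

Both values of the computation exist before the selection, so no control-flow change is needed; this is exactly what makes the code friendly to wide-issue pipelines.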
1.1.3.2 Speculative Instructions
A speculative instruction is executed before the processor knows whether it should execute, i.e., before it is known whether the prior branch instruction is taken. The control unit and the optimizing compiler in microprocessors which implement speculative execution support the following scenario: (a) first, the branch is predicted; (b) second, the next instruction (either the target-address instruction or the next-address instruction, depending on the outcome of the prediction) is made speculative and is executed; (c) third, some run-time scheduling is done, in order to optimize the code, i.e., to increase the efficiency of speculation; and (d) fourth, if the prediction was a miss, the recovery action is invoked and completed.
Some of the actions defined above are a hardware responsibility, others are a compiler (or, in principle, even operating system) responsibility; still others are either a combined responsibility or can be treated one way (hardware) or the other (compiler). That is the reason why this book classifies speculation into the hybrid group.
There are two basic approaches to speculation: (a) the compiler schedules a speculative instruction and hardware helps recover if it turns out that the speculation was wrong (here, speculation is done at compile time), and (b) the compiler does straightforward code generation and the branch prediction hardware is responsible for speculation (here, speculation is done at run time).
This book assumes that the reader is fully introduced to the details of Thornton's and Tomasulo's algorithms. If not, he/she is referred either to the original papers [Thornton64, Tomasulo67] or to a well-known textbook [Hennessy96].
As indicated in Figure BPSS1, it is based on an array of 2-bit up-down predictor selection counters (PSC), similar to those from McFarling's branch selection mechanism, except that McFarling includes two 2-bit selection counters per branch and Evers/Chang/Patt include N 2-bit selection counters per branch; each BTB entry is extended with one PSC.
[Figure BPSS1: each BTB entry is extended with predictor selection counters (PSCs); based on the PSC values, a MUX selects the final prediction outcome (PO) from the component predictions P1 through PN.]
The initial value of all PSC entries in Figure BPSS1 is 3, and priority logic is used if several predictors' counters are equal (see the order of component predictors in Figure BPSS2). If, among the predictors for which the contents of the PSC was 3, at least one was correct, the PSCs of all incorrect predictors are decremented. If none of the predictors with PSC = 3 was correct, the PSCs of all correct predictors are incremented. This algorithm guarantees that at least one PSC is equal to 3. The complexity of the described multi-hybrid selection mechanism is 2*C*L bits, where L is the number of entries in the BTB, and C is the number of component predictors.
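The update rule can be stated compactly (a sketch of our reading of the algorithm, not the authors' code; the function name is ours):

```python
def update_pscs(pscs, correct):
    """pscs[i] is the 2-bit selection counter of component predictor i;
    correct[i] tells whether that component predicted this branch correctly.
    The rule keeps at least one counter saturated at 3."""
    top_was_correct = any(c for p, c in zip(pscs, correct) if p == 3)
    if top_was_correct:
        # at least one top-priority counter was right: punish the wrong ones
        return [p if c else max(0, p - 1) for p, c in zip(pscs, correct)]
    # no counter at 3 was right: reward every correct component
    return [min(3, p + 1) if c else p for p, c in zip(pscs, correct)]
```

In the first case the correct saturated counter stays at 3, and in the second case the saturated counters are left untouched, so the invariant holds either way.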
Figure BPSS2 defines two things: (a) the priority order for the included component predictors (2bC has the highest priority and AlwaysTaken has the lowest priority), and (b) the component cost for each component predictor included in a given version of the MH-BPS. Different versions of MH-BPS differ in the overall bit count. The simplest version of MH-BPS from Figure BPSS2 includes about 11 KB. The most complex version of MH-BPS includes about 116 KB. If one bit takes four transistors, the overall transistor count is in the range of about 350 kTr to about 3.5 MTr, which is more than the entire Intel Pentium.
HybPredSiz [kB]:   ~11     ~18     ~33     ~64     ~116
CompCost [kB]:
SelectMech          2       2.5     3       3       3
2bC                 0.5     0.5     0.5     0.5     0.5
GAs                 -       2       2       4       8
Gshare              4       8       16      32      64
Pshare              4       5.25    7.5     20      36.25
loop                -       -       4       4       4
AlwaysTaken         0       0       0       0       0

Figure BPSS2: Multi-Hybrid Configurations and Sub-Optimal Priority Ordering 95.22/95.65 (source: [Evers96])
Legend:
HybPredSiz: Hybrid Predictor Size,
CompCost: Component Cost,
SelectMech: Selection Mechanism.
Comment:
The research of [Evers96] demonstrates that the optimal priority encoding algorithm provides a hit ratio (95.65%) which is only slightly larger than the hit ratio provided by the priority encoding algorithm selected in [Evers96] (95.22%).
The quoted numbers set the stage for a small digression on the overall complexity of the MH-BPS. As already indicated, a complexity of 116 KB means more than the transistor count of most single-chip microprocessors from the '80s. However, the transistor count of VLSI chips keeps growing, and a use for those transistors has to be found. If the performance improvement provided by MH-BPS is higher than the performance improvement offered by the alternatives, then the MH-BPS may become a standard element of many future microprocessors on chips with more than 10 million transistors. Therefore, the question boils down to the exact performance benefit of the MH-BPS. The answer can be found in [Evers96]. For a predictor size of approximately 64 KB, the MH-BPS achieves a prediction accuracy of 96.22%, compared to 95.26% for the best two-component BPS of the same cost. At first glance, this difference does not look spectacular (only 0.96%, or less than 1%). However, what matters is the difference in misprediction percentage, which drops from 4.74% to 3.78%, a relative reduction of over 20%. It is the misprediction which is costly and takes away cycles of execution time. Therefore, on second glance, the difference does look spectacular, and places this research among the most exciting ones.
Note that the above data were obtained for SPECint92 (which is, according to many, not a large enough application suite for this type of research, and includes user code only), and for systems in which the contents of the prediction tables are flushed after periodic context switches (which is, according to many, not the optimal way to treat prediction tables across context switches).
The table in Figure BPSS2 shows that the multi-hybrid predictor of ~11 KB includes, in addition to the selection mechanism of about 2 KB, the following component predictors: 2bC, Gshare, Pshare, and AlwaysTaken. The multi-hybrid predictor of ~116 KB includes a somewhat more costly selection mechanism of about 3 KB, plus the following component predictors: 2bC, GAs, Gshare, Pshare, loop, and AlwaysTaken (loop is a simple scheme which predicts the number of iterations for each loop branch; for details see [Chang95]).
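A sketch of such a loop predictor follows (our illustrative reading of the idea in [Chang95], not its actual design; all names are ours):

```python
class LoopPredictor:
    """Remember the trip count of the previous run of each loop branch and
    predict 'taken' until that count is about to be reached."""
    def __init__(self):
        self.last_count = {}   # branch address -> trip count of the previous run
        self.seen = {}         # branch address -> executions in the current run

    def predict(self, addr):
        last = self.last_count.get(addr)
        if last is None:
            return True                           # no history yet: loops usually repeat
        return self.seen.get(addr, 0) + 1 < last  # not-taken on the final iteration

    def update(self, addr, taken):
        self.seen[addr] = self.seen.get(addr, 0) + 1
        if not taken:                             # loop exit: record this run's count
            self.last_count[addr] = self.seen[addr]
            self.seen[addr] = 0

def run(predictor, addr, outcomes):
    """Count mispredictions over one run of the loop branch."""
    miss = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            miss += 1
        predictor.update(addr, taken)
    return miss
```

After one warm-up run, a loop with a stable trip count is predicted perfectly, including the exit that defeats ordinary 2-bit counters.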
The authors claim that the selection of component predictors was guided by the following rationales: (a) large dynamic predictors have better accuracy at steady state, but a longer warm-up time after a context switch; (b) smaller dynamic predictors have worse accuracy at steady state, but a shorter warm-up time after a context switch, which means better accuracy during the initial period after a context switch; (c) static predictors have zero warm-up time, which means the best accuracy for a very short period immediately after a context switch; and (d) a price/performance analysis eliminated the predictors not included in the table of Figure BPSS2, due to their marginal price/performance.
All predictors taken into consideration in this study are summarized in Figure BPSS3, from a point of view which is different from the previous presentation of the same facts (more points of view bring much deeper understanding of the issues).
Predictor         Algorithm                                                        Cost (bits)
2bC               A 2-bit counter predictor consisting of a 2K-entry array         2^12
                  of two-bit counters.
GAs(m, n)         A global variation of the Two-Level Adaptive Branch Predictor    m + 2^(m+n+1)
                  consisting of a single m-bit global branch history
                  and 2^n pattern history tables.
PSg(m)            A modified version of the per-address variation                  2^11*m + 2^m
                  of the Two-Level Adaptive Branch Predictor
                  consisting of 2K m-bit branch history registers
                  and a single pattern history table
                  (each PHT entry uses one statically determined hint bit
                  instead of a 2bC).
                  The version of PSg used in this study is PSg(algo).
gshare(m)         A modified version of the global variation                       m + 2^(m+1)
                  of the Two-Level Adaptive Branch Predictor
                  consisting of a single m-bit global branch history
                  and a single pattern history table.
pshare(m)         A modified version of the per-address variation                  2^11*m + 2^(m+1)
                  of the Two-Level Adaptive Branch Predictor
                  consisting of 2K m-bit branch history registers
                  and one pattern history table.
                  As in the gshare scheme, the branch history is XORed
                  with the branch address to select the appropriate PHT entry.
loop(m)           An AVG predictor where the prediction of a loop's exit           2^12*m
                  is based on the iteration count of the previous run of
                  this loop. A 2K-entry array of two m-bit counters is used
                  to keep the iteration counts of loops. In this study, m = 8.
Always Taken                                                                       0
Always Not Taken                                                                   0
Figure BPSS3: Single Scheme Predictors: Algorithms and Complexities (source: [Evers96])
Legend:
AlwaysTaken: Branch Always Taken (MIPS 10000),
AlwaysNotTaken: Branch Always Not Taken (Motorola 88110).
Comment:
This figure includes alternative descriptions of the branch prediction algorithms covered in this book. Receiving information from different sources is an important prerequisite for a better understanding of essential issues.
The study in [Gloy96] is based on the IBS traces, which include both system and user code (as indicated before, SPECint92 includes only user code) and a larger number of static branches (which means a more realistic environment). It also analyzes systems in which the prediction tables are not flushed after periodic context switches (which many believe is a better way to go, since different contexts are coded by programmers using the same software design methodologies, and consequently produce code with similar run-time prediction characteristics).
The study by Gloy, Young, Chen, and Smith from Harvard University implies the model of BPS shown in Figure BPSS4 (another way of representing the same model), and it includes the component predictors shown in Figure BPSS5. The impact of zero warm-up time and short warm-up time predictors is smaller if no flushing is involved. This fact had an impact on the selection of component predictors shown in Figure BPSS5.
[Figure BPSS4: a divider splits the execution stream (e.g., branches b4 b5 b4 with outcomes 1 1 0) into substreams; each substream is handled by its own predictor, and the per-substream predictions are recombined into the prediction stream (1 1 0).]
The major conclusions of the study are twofold: (a) first, better prediction accuracy is obtained if prediction tables are not flushed after periodic context switches; (b) second, results for traces which include both user code and system code differ from results based only on user code.
The study in [Sechrest96] is based on extremely long custom-made traces (both user and system code), and it claims that even the longest standard application suites give traces which are not nearly long enough to measure precisely the real accuracy of various branch predictors. The study assumes the model of BPS shown in Figure BPSS6 (still another way to represent the same basic model), and the authors have used the benchmarks presented in Figure BPSS7.
[Figure BPSS6 residue: the branch address (i, j, and k bits) and the BHSRs feed F(addr, history), which drives the RowSelectionBox; column aliasing and row merging then map the result onto the PredictorTable.]
Figure BPSS6: Yet another model of BPS (source: [Sechrest96])
Comment:
This figure includes yet another alternative way of modeling the branch prediction system covered in this book. Receiving information from more sources is an important prerequisite for better understanding of essential issues. The reader should make an effort to understand the real reasons for using different notations.
B            DI              DCB (%[TI])           SCB    N
compress        83,947,354    11,739,532 (14.0%)     236    13
eqntott      1,395,165,044   342,595,193 (24.6%)     494     5
espresso       521,130,798    76,466,489 (14.7%)    1784   110
gcc            142,359,130    21,579,307 (15.2%)    9531  2020
xlisp        1,307,000,716   147,425,333 (11.3%)     489    48
sc             889,057,008   150,381,340 (16.9%)    1269   157
groff          104,943,750    11,901,481 (11.3%)    6333   459
gs             118,090,975    16,308,247 (13.8%)   12852  1160
mpeg_play       99,430,055     9,566,290  (9.6%)    5598   532
nroff          130,249,374    22,574,884 (17.3%)    5249   228
real_gcc       107,374,368    14,309,867 (13.3%)   17361  3214
sdet            42,051,812     5,514,439 (13.1%)    5310   508
verilog         47,055,243     6,212,381 (13.2%)    4636   850
video_play      52,508,059     5,759,231 (11.0%)    4606   757
Figure BPSS7: Benchmarks - SPEC versus IBS (source: [Sechrest96])
Legend:
B - Benchmarks,
DI - Dynamic Instructions,
DCB - Dynamic Conditional Branches (percentage of total instructions),
TI - Total Instructions,
SCB - Static Conditional Branches,
N - Number of Static Branches Constituting 90% of Total DCB.
Comment:
Note the minor difference in the number of dynamic instructions between the benchmarks in the upper part of the table (SPEC) and the benchmarks in the lower part of the table (IBS). Also, note that the number of static branches constituting 90% of total dynamic conditional branches is extremely small in the case of SPEC, and much larger (on average) in the case of IBS. This means that in IBS a much larger percentage of the branch population has an impact on the overall happenings in the system. The authors of [Sechrest96] believe that the latter size difference is crucial for correct understanding of the essential issues (like the impact of aliasing, the impact of warm-up, etc.). In other words, what matters (when it comes to proper performance evaluation) is the size of the branch population, not the size of the code.
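The last column of the table (N, the number of static branches covering 90% of all dynamic conditional branches) is easy to compute from any branch trace. A minimal Python sketch (the function name and trace format are ours, not from [Sechrest96]):

```python
from collections import Counter

def branches_covering(trace, fraction=0.9):
    """Return how many static branches account for `fraction` of all
    dynamic conditional branch executions in `trace` (a sequence of
    static branch addresses, one entry per dynamic execution)."""
    counts = Counter(trace)
    target = fraction * len(trace)
    covered = 0
    # Walk static branches from hottest to coldest until coverage is reached.
    for rank, (_, c) in enumerate(sorted(counts.items(), key=lambda kv: -kv[1]), 1):
        covered += c
        if covered >= target:
            return rank
    return len(counts)

# A skewed trace, SPEC-like: one hot branch dominates the dynamic count.
trace = [0x400] * 90 + [0x404] * 5 + [0x408] * 5
print(branches_covering(trace))  # → 1
```

On such a skewed trace a single static branch already covers 90% of the dynamic executions, which is exactly the SPEC behavior the comment above warns about.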
Sechrest, Lee, and Mudge claim that control of aliasing (when many branches map into
the same entry of the branch history table, due to the lower number of address bits used)
and interbranch correlation (when various branches impact each other) are crucial for the
prediction success of a scheme.
An important new trend in branch prediction research implies techniques which reduce
negative branch history interference and improve target prediction for indirect branches
[Sprangle97, Chang97].
In object oriented (OO) code, the relative amount of indirect branches is higher; consequently, as the impact of OO programming grows, it becomes more and more important to predict indirect branches as accurately as possible. Paper [Driesen98] investigates a number of different two-level predictors tuned to indirect branching. The authors start with predictors which use full-precision addresses and unlimited-size tables, and gradually come to limited-precision addresses and limited-size tables of acceptable performance and complexity. For indirect branches, their two-level predictor achieves a misprediction rate of about 10% (with a 1K-entry table); their hybrid predictor (for the same table size) achieves a misprediction rate of about 9% (in real microprocessors of the mid-to-late 90s, this rate is from 20% to 25%).
Paper [Evers98] tries to answer the question of which characteristics of branch behavior make predictors perform well. The authors quantify the reasons for predictability and show that not all predictability is captured by two-level adaptive branch predictors, which means that there is still a lot of room for new advances in the field. They also show that only very few previous branches are needed for a correlation based predictor to be accurate, and that these branches are typically very close to the branch being predicted (in most cases, two or three); this means that new predictors can be devised which are not only better performance-wise, but also less complex.
Along similar lines is the research of [Juan98], which proposes a third level of adaptivity for branch prediction. Traditional two-level predictors combine a part of the branch address with a fixed amount of global history. However, the optimal history length (from the performance point of view) depends on code type, input data characteristics, and frequency of context switches. This means that fixed predictors will never be as efficient as those that dynamically determine the optimal history length and adapt to it. The authors propose the DHLF (dynamic history length fitting) method, which is applicable to any predictor type based on global branch history; it uses a BHR whose size equals the maximal history length of interest. If the DHLF method is combined with gshare, one obtains the dhlf-gshare predictor, which is able to XOR any number of history bits with the PC bits of the branch instruction to be predicted. The major question is how to determine (dynamically) the optimal number of history bits. This is done by periodically testing all possible history lengths and choosing the one which provides the minimal number of mispredictions. Effects of history length on the misprediction rate are shown in Figure BPSS8, for two different SPECint95 benchmarks (gcc and go).
Figure BPSS8: Effect of history length on the misprediction rate for selected SPECint95 benchmarks using a gshare predictor.
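The dhlf-gshare indexing described above can be sketched as follows; the BHR is kept at the maximal length of interest, and only the number of history bits actually XORed with the PC varies. The names and table sizes here are illustrative assumptions, not the exact design from [Juan98]:

```python
def gshare_index(pc, bhr, history_bits, table_bits=12):
    """XOR the chosen number of recent branch outcomes (the low bits
    of the BHR) with the branch PC to index the prediction table."""
    history = bhr & ((1 << history_bits) - 1)
    return (pc ^ history) & ((1 << table_bits) - 1)

def best_history_length(mispredictions_by_length):
    """The DHLF step: after a test interval, adopt the history length
    that produced the fewest mispredictions in that interval."""
    return min(mispredictions_by_length, key=mispredictions_by_length.get)

print(gshare_index(0b1010, 0b0110, history_bits=2))  # → 8 (0b1010 XOR 0b10)
print(best_history_length({0: 50, 4: 30, 8: 41}))    # → 4
```

Note that changing `history_bits` costs nothing in hardware terms beyond masking, which is why a third level of adaptivity is cheap to add on top of an existing gshare table.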
The section on basic issues presents details of some important data value prediction techniques, from landmark references in the field [Sodani97] and [Wang97]:
(a) Dynamic instruction reuse [Sodani97]
(b) Last outcome predictor [Lipasti97]
(c) Stride based predictor [Wang97]
(d) Two level predictor [Wang97]
(e) Hybrid predictors [Wang97]
Once again, it is very dangerous to apply analogies with branch prediction when reasoning about data value prediction. In some aspects it is possible; in others, it can lead to fatal misunderstandings.
[Figure DPSU2 residue: the SKGU forms a search key from the IR and the operands Q1/Q2; the key probes the FABuf; on a hit, the stored result R bypasses the EU/ALU through the MUX, and on a miss, the EU executes the instruction; the HitIndicator steers the MUX.]
Figure DPSU2: Block scheme of the dynamic instruction reuse predictor
Legend:
ALU - Arithmetic and Logic Unit
SKGU - Search Key Generator Unit
EU - Execution Unit
FABuf - Fully Associative Buffer
HitIndicator - Indicator of the Hit in the Fully Associative Buffer
Qi - Operand #i (i=1,2)
R - Result
Comment:
Performance of the scheme depends on the size of the FABuf. Unfortunately, the larger the size of the FABuf,
the larger the complexity of the search mechanism.
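The operation of the scheme can be sketched as follows; the dictionary-based buffer and FIFO replacement are our simplifications of the fully associative FABuf:

```python
class ReuseBuffer:
    """Dynamic instruction reuse, sketched: results of earlier executions
    are cached under a key built from the operation and operand values;
    a hit skips re-execution. The capacity bound models the FABuf size."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.buf = {}  # dict insertion order gives a simple FIFO replacement

    def execute(self, op, q1, q2, alu):
        key = (op, q1, q2)
        if key in self.buf:            # hit: reuse the stored result
            return self.buf[key], True
        result = alu(op, q1, q2)       # miss: execute and remember
        if len(self.buf) >= self.capacity:
            self.buf.pop(next(iter(self.buf)))
        self.buf[key] = result
        return result, False

alu = lambda op, a, b: a + b if op == "add" else a - b
rb = ReuseBuffer()
print(rb.execute("add", 2, 3, alu))  # → (5, False): first execution is a miss
print(rb.execute("add", 2, 3, alu))  # → (5, True): same operands, reused
```

The trade-off from the comment above is visible here: a larger `capacity` raises the hit rate, but in hardware the key must be matched associatively against every entry, so the search cost grows with the buffer.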
[Figure residue: block scheme of the last outcome predictor; the instruction address (IA) is hashed (HF), a decoder selects the table entry holding the last Value, a comparator (Comp) validates the entry, and the outputs are PData and PValid.]
Interestingly enough, this scheme, which is extremely simple, gives surprisingly good results. For the PowerPC architecture and for SPEC'92 applications, the prediction accuracy is 49%. If the last 4 different values are stored, and if the scheme were always able to pick the correct value (whenever one of the four was correct), then the prediction accuracy would be 61%.
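Measuring last-outcome predictability over a value trace is straightforward; a sketch (the trace format and function name are ours):

```python
def last_outcome_accuracy(values_by_instr):
    """Last-value prediction accuracy: for each static instruction,
    predict that the next produced value equals the previous one.
    `values_by_instr` maps an instruction address to the sequence of
    values it produced."""
    correct = total = 0
    for seq in values_by_instr.values():
        for prev, cur in zip(seq, seq[1:]):
            total += 1
            correct += (prev == cur)
    return correct / total if total else 0.0

# A loop counter (never repeats) versus a mostly-constant flag.
print(last_outcome_accuracy({0x10: [1, 2, 3, 4], 0x14: [7, 7, 7, 7]}))  # → 0.5
```

The example makes the limitation concrete: value-stepping instructions defeat the last outcome predictor entirely, which is exactly the behavior the stride based predictor (below) is meant to capture.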
The probability of one of the last 16 being reused is pretty high. However, the storage
needed to keep all that history is prohibitively high. Consequently, the next question is how
many of the last 16 produced values are unique.
The answer to that question is given in Figure DPSU5. For most programs, statistically speaking, the number of unique values is relatively small. For example, in the case of m88ksim, about 65% of value producing instructions (i.e., about 65% of eligible instructions) create only four different values in the last 16 value producing activities.
In conclusion, statistical data support the hypothesis that the storage which covers the
last 16 value productions can be considerably reduced. One reasonable compromise seems
to be having only four storage registers for each data value producing instruction.
2.1.2.4. The Stride Based Predictor
If results vary by a constant stride, then it is easy to predict the result of the next instance
of the same static instruction. If a relatively large number of instructions follows this pattern, benefits from a stride based predictor could be relatively high.
The stride based approach was used successfully for data prefetching. It works well for
data prediction, too, because a relatively large percentage of instructions include: (a) loop
controlling variables, and/or (b) array stepping variables.
[Figure DPSU5: Cumulative percentage of value producing instructions (vertical axis, 10% to 90%) versus the number of unique values among the last 16 produced values (horizontal axis, up to 15), plotted for go, compress95, m88ksim, li, and gcc (cc1).]
Block scheme of the stride based predictor is given in Figure DPSU6. Stride value predictors work well only while in the steady state, which necessitates the incorporation of a
special bit called STATE (into the VHT). The state transition diagram of the stride value
predictor is given in Figure DPSU7.
With the help of figures DPSU6 and DPSU7, it is relatively straightforward to understand the operation of the stride based predictor. The entire algorithm with necessary explanations is given in Figure DPSU8.
It is obvious that the last outcome predictor and the stride predictor cover different data
producing behaviors. Consequently, the clear next step is to combine the two approaches.
This issue and related problems are covered in a later section.
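The Init/Transient/Steady behavior of figures DPSU6 and DPSU7 can be sketched for a single VHT entry as follows (the field and state names follow the figures, but the class itself is our simplification; table lookup and tag checking are omitted):

```python
INIT, TRANSIENT, STEADY = "init", "transient", "steady"

class StridePredictor:
    """One VHT entry of a stride value predictor: VALUE, STRIDE, and
    the STATE bit that gates prediction to the steady state only."""
    def __init__(self):
        self.state, self.value, self.stride = INIT, None, None

    def predict(self):
        # Predict only in the steady state, as the diagram prescribes.
        return self.value + self.stride if self.state == STEADY else None

    def update(self, outcome):
        if self.value is None:            # first outcome: just record it
            self.value = outcome
            return
        stride = outcome - self.value
        if self.state == INIT:
            self.state, self.stride = TRANSIENT, stride
        elif stride == self.stride:       # same stride seen again: steady
            self.state = STEADY
        else:                             # stride broke: fall back, relearn
            self.state, self.stride = TRANSIENT, stride
        self.value = outcome

sp = StridePredictor()
for v in [10, 14, 18]:    # a loop variable stepping by a constant 4
    sp.update(v)
print(sp.predict())       # → 22
```

Two confirming outcomes with the same stride are needed before the entry predicts, and a single stride break silences it again; this matches the role of the STATE bit described in the text.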
[Figure DPSU6 residue: the instruction address (IA) is hashed (HF) into a decoder that selects a VHT entry holding the last value and the Stride; the predicted value (PData, width W) and PValid are formed from them.]
[Figure DPSU7 residue: state transitions of the stride value predictor:
Init [don't predict]: VHT miss / update value; any stride / update value and stride;
Transient [don't predict]: same stride / update value; different stride / update value and stride;
Steady [predict]: same stride / update value; different stride / update value and stride.]
[Figure DPSU9 residue: the instruction address (IA) is hashed (HF) into the Value History Table; the Value History Pattern (2p bits) indexes the Pattern History Table through a decoder; a 4:2 CODER and a 4:1 MUX over the four Data Values produce PData (width W) and PValid.]
Figure DPSU9: Block scheme of a two level value predictor
As in the branch predictor, when a value prediction is to be made for an instruction, the following steps are taken:
(a) The appropriate VHT entry is selected;
(b) The TAG field is checked, to see if the entry corresponds to the current instruction (recall that a number of different instructions may map to the same entry);
(c) If yes, the VHP value is used to select the PHT entry;
(d) The maximum of the four counter values is selected, and the corresponding value is declared the "predicted value." If there is a tie, either the value related to the last outcome is selected, or one of the values is selected at random.
Note that the prediction is made only if the maximum is above a threshold. If it is below
the threshold, no prediction is made (because it is assumed that the prediction quality would
be low).
Updating of the VHT entry is done in two steps:
(a) Contents get shifted by two bits;
(b) New outcome is entered into the bits left vacant.
Updating of the PHT entry is done also in two steps:
(a) Selected counter (corresponding to the correct outcome) gets incremented by three
(or less, if three results in saturation);
(b) All other counters get decremented by one (unless they are already at zero).
Note that the updating parameters (three for incrementation and one for decrementation)
are obtained empirically (after experimenting with various values).
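The prediction and update rules above can be sketched for a single VHT entry and its PHT rows. The class below is our simplification (tag checking and table indexing are omitted, and the outcome is assumed to be already encoded as one of the four tracked value slots):

```python
class TwoLevelEntry:
    """One VHT entry plus its PHT rows for a two-level value predictor:
    the history shifts by two bits per outcome, and the counters go +3
    on the correct slot and -1 on the rest, saturating at 0 and at a
    ceiling, exactly as the update rules in the text prescribe."""
    def __init__(self, history_bits=12, threshold=6, ceiling=6):
        self.history = 0
        self.history_bits = history_bits
        self.threshold, self.ceiling = threshold, ceiling
        self.values = [None] * 4   # the four data values tracked by the entry
        self.pht = {}              # history pattern -> list of 4 counters

    def predict(self):
        counters = self.pht.get(self.history, [0] * 4)
        best = max(range(4), key=lambda i: counters[i])
        if counters[best] >= self.threshold and self.values[best] is not None:
            return self.values[best]   # maximum cleared the threshold
        return None                    # below threshold: don't predict

    def update(self, outcome_slot):
        counters = self.pht.setdefault(self.history, [0] * 4)
        for i in range(4):
            if i == outcome_slot:
                counters[i] = min(counters[i] + 3, self.ceiling)
            else:
                counters[i] = max(counters[i] - 1, 0)
        # Shift the history by two bits and enter the new outcome.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 2) | outcome_slot) & mask

entry = TwoLevelEntry(history_bits=2)  # tiny history, for illustration only
entry.values[2] = 42
for _ in range(3):
    entry.update(2)                    # slot 2 keeps recurring
print(entry.predict())                 # → 42
```

With the tiny two-bit history the pattern stabilizes after one update, so three recurrences of the same slot are enough to push its counter to the threshold; a realistic 12-bit history needs correspondingly longer warm-up.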
2.1.2.6. The Hybrid Predictor
No single scheme provides good prediction for each and every application. As in the case of branch prediction, the solution is to create a hybrid predictor. In theory, there are a number of possible ways to create a hybrid predictor, with two or more components.
One possible way to create a hybrid predictor is to combine a two-level predictor and a
stride based predictor. In that case, VHT has to be expanded with two additional fields:
STATE and STRIDE. Block scheme of such a hybrid predictor is given in Figure DPSU11.
In the case of the scheme from Figure DPSU11, the prediction algorithm includes the
following steps:
(a) The appropriate VHT entry is selected and the TAG field is checked;
(b) In parallel, the VHP field (of the "two level" part) and the STATE field (of the "stride
based" part) are read out;
(c) If the selected PHT entry has its maximum count value above the threshold, then the two level predictor is responsible for the prediction; otherwise, the stride based predictor makes the prediction, unless the value of STATE differs from Steady, in which case no prediction is made (that is a sign of low prediction quality).
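The selection rule of step (c) reduces to a small function; in the sketch below, the two-level component is represented by its prediction, which is None whenever its best counter fell below the threshold:

```python
def hybrid_predict(two_level_pred, stride_state, stride_pred):
    """Selection rule from the text: the two-level predictor wins when
    its best counter cleared the threshold (it then returned a value);
    otherwise the stride predictor answers, but only from the Steady
    state; otherwise no prediction is made."""
    if two_level_pred is not None:
        return two_level_pred
    if stride_state == "steady":
        return stride_pred
    return None

print(hybrid_predict(None, "steady", 22))    # → 22 (stride part answers)
print(hybrid_predict(42, "transient", 22))   # → 42 (two-level part wins)
```

The rule is deliberately asymmetric: the two-level component, being the more history-aware of the two, is trusted first, and the cheap stride component serves as the fallback.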
The complexity of the hybrid value predictor is fairly large, so the question is how much performance improvement is enabled by all this additional complexity.
[Figure DPSU11 residue: each Value History Table (VHT) entry holds Tag, LRU Info, Data Values, and the Value History Pattern, plus the STRIDE and STATE fields of the stride based part; the instruction address (IA) is hashed (HF) into the table; decoders and the Pattern History Table (patterns 000000000000 through 111111111111, 2p rows) feed 4:1 MUXes, and a final 2:1 MUX selects PData/PValid between the two component predictors.]
[Figure DPSU12: Percentage of DVP-eligible instructions correctly predicted, incorrectly predicted, and not predicted (vertical axis, 10 to 100), for the last outcome, stride based, 2-level, and hybrid predictors, on li, compress95, m88ksim, go, and gcc.]
Figure DPSU12 shows the results of a simulation study comparing the performance of different value predictor types. The study assumed the MIPS-I instruction set architecture and the integer part of the SPEC'95 application suite. It implied a VHT with 4K direct-mapped entries and a PHT also with 4K entries (counters saturate at 6, and the threshold was equal to 6). The simulation was run for 100 million instructions or until program completion, whichever came first.
Figure DPSU12 gives results for the following metrics (relative to the total number of eligible instructions): (a) the percentage of instructions correctly predicted, (b) the percentage of instructions mispredicted, and (c) the percentage of instructions not predicted. Obviously, these three items sum up to 100%. The major conclusion is that the hybrid predictor can offer a prediction rate of up to 98% (m88ksim). Unfortunately, the average is much lower, while the overall complexity is significant. These facts lead some researchers to regard value prediction with limited optimism.
[Figure residue (trace based predictor): the trace address is hashed (HF) through a decoder into the Value History Table (VHT), which holds the History of Live Values; a History Processing Unit produces the n predicted values (width nW).]
Block scheme of a stride oriented trace based predictor is given in Figure DPSS2. Note
that different live registers are associated with different sections of the predictor. Combined
stride and last value (of a live register) represent the prediction of the next value.
[Figure DPSS2 residue: the Value History Table is partitioned into per-live-register sections (Live Register 1, Live Register 2, ...), each with its Tag, producing the LV(0) and LV(1) predictions.]
Performance data for various forms of trace based predictors are given in Figure DPSS3. The performance improvements are not always dramatic, so the complexity reduction seems to be the major benefit.
[Figure residue: detailed organization of the live register sections of the Value History Table; each section (Live Register 0, Live Register 1, and additional live registers) holds Tag, LRU Info, Data Values, and a VHP field, and a decoder with a 4:1 MUX per section produces the LV[0] and LV[1] predictions.]
A major question in data value prediction is understanding its limitations in realistic machines. Results from [Gabbay98] show that instruction fetch bandwidth and issue rate have a very significant impact on the efficiency of value prediction. A hardware solution is proposed which speeds up value prediction by exploiting low-level parallelization opportunities.
Another promising approach is selective value prediction [Calder99]. Earlier research efforts did not take into consideration the impact of limited predictor capacities and realistic misprediction penalties. Paper [Calder99] filters out certain instruction types (values produced by those instructions are not taken into consideration) and gives priority to other instruction types (those within the longest data dependence path in the processor's active instruction window). For example, filtering away all instructions except load instructions is potentially useful, because load instructions are responsible for most program latencies.
For more advanced topics, see the special issue of Microprocessors and Microsystems [Tabak98], as well as the newest papers at major computer architecture conferences, like [Yoaz99, Bekerman99, Calder99, Tullsen99].
* * *
The author and his associates have not been very active in the field of prediction strategies, except for side activities on related projects. For more details, see [Helbig89, Milutinovic96d].
3. Problems
1. Give a block diagram of branch predictors for all microprocessors mentioned in this
book. If some relevant data are missing from this book or the open literature, assume reasonable values, and continue with the work. Calculate the approximate transistor count for
each scheme.
2. Explain the strengths and the weaknesses of global and per-address predictors. Compare
both performance and complexity.
3. Introduce an alternative predictor selection mechanism, and compare its advantages/drawbacks relative to the mechanism from Figure BPSS1. What is the potential performance improvement, and what is the added transistor count?
4. Calculate the exact transistor count for each predictor type mentioned in Figure BPSS2.
Include the control circuitry into the calculation, too.
5. Design a detailed control unit for the multihybrid predictor. What is the transistor count
of the control unit?
6. Figure out the percentage of value producing instructions in a program of your choice.
What type of code results in a higher percentage of value producing instructions?
7. Construct two pieces of code and compare their suitability for the dynamic reuse data
predictor. Explain which feature causes one of the two pieces of code to work better.
8. Design a detailed control unit for the last outcome data predictor. What is the transistor
count?
9. Design a detailed control unit for the two level data predictor. What is the transistor
count?
10. Design a detailed control unit for the hybrid data predictor. What is the transistor count?
1. Basic Issues
The overall system performance is frequently limited by I/O devices. The slowdown arises for several reasons, two of which are the most important: (a) monitoring of the I/O process consumes processor cycles, and (b) if I/O supplies input data, the processing has to wait until the data are ready.
Data presentation devices can successfully be made autonomous in their operation, and typically require minimal processor interaction. This is less true of data transport devices. Most of the processor workload comes from data storage devices. Consequently, most of the attention in the text to follow is dedicated to data storage devices (for both uniprocessor and multiprocessor environments).
Technological progress of data storage devices is especially fast. For example, at the time of writing of this book, an EIDE disk had an average access time of 11.5 ms, supported a 16.6 MB/s data transfer rate, and had a capacity of 4 GB. Since technology data change so quickly, the interested reader should check the WWW presentations of major disk technology vendors if state of the art information is needed (for example: http://www.wdc.com/products/drivers/drive_specs/AC34000.html).
Also, at the time of writing of this book, tape technology was characterized by an average access time on the order of 0.1 s, an effective data transfer rate of 1.2 MB/s, and a capacity of 20 GB. For state of the art information, see the WWW presentations of major tape vendors (for example, http://www.interface_data.com/idsp2140.html).
The CD-ROM technology has become very popular. At the time of writing this book, it was characterized by an average access time of 150 ms, a data transfer rate of 1800 KB/s to 2400 KB/s (speeds 12x and 16x), and a capacity of 650 MB (standardized). For state of the art information, see the WWW presentations of major vendors (for example, http://www.teac.com/dsp/cd/cd_516.html).
The new DVD-ROM standard implies a data rate of 9.4 MB/s and a capacity of up to 17 GB (4.7 GB per single-sided, single-layer disc). For details, see the WWW presentations of major manufacturers (e.g., http://www.toshiba.com/taisdpd/dvdrom.htm).
[Figure IOBU4 residue: the three candidate locations for the disk cache buffer are in the disk itself (CD), in the I/O processor (CIOP), and in main memory (CU), on the path between the processor (P), the memory, and the I/O processors (IOP).]
Figure IOBU4: Three possible locations for disk cache buffers (source: [Flynn95])
Legend:
P - Processor,
IOP - Input/Output Processor.
Comment: In disk caches, the spatial locality component is much more dominant than the temporal locality
component. In processor caches, both types of locality are present, spatial more with complex data structures,
and temporal more with single variables, like loop control, process synchronization, and semaphore variables.
[Figure IOBU5 residue: MissRate (vertical axis, 0 to 7) versus BufferCapacity in MB (horizontal axis: 1, 16, 32, 64) for the three cache locations: CD (disk), CIOP (storage controller), and CU (cache in memory).]
Figure IOBU5: Miss ratios for three different locations of disk cache buffers (source: [Flynn95])
Legend:
CD - Disk,
CIOP - Storage controller,
CU - Cache in memory.
Comment:
If disk access prediction algorithms are used, each of the three different locations implies a different algorithm
type, which enlarges the performance differences of the three different approaches (predictors in the memory
have access to more information related to prediction, compared to those located in the I/O processor, and
especially to those located in the disk itself).
Disk access implies two different activities, read and write. Different methods have been
used to improve the read and the write access.
Disk arrays represent a method to improve disk reads. The bytes of each block are distributed across all disks. Consequently, the time to read a file and the time to do the buffer-to-processor transfer are both reduced. In many cases of interest, this speedup is about linear: if N disks are used, the speedup is approximately N.
System structure and access structure typical of disk arrays are given in Figure IOBU6.
Numerics which demonstrate the speedup derived from a disk array organization are given
in Figure IOBU7.
[Figure IOBU6 residue: the system structure connects the disks through a buffer to the processor P, at s = 1.5 MB/s per disk; the access structure decomposes the service time into Tlatency, n reads of Tread/s each, Ts, and Ttransfer.]
Disk logs represent a method to improve disk writes. Data are first collected in a log buffer until their size becomes equal to the size of a disk access unit. Consequently, disk access is characterized by the minimal ratio of overhead time to transfer time.
For (1, s) configurations, n = E(f), and:
    Tservice = Tlatency + (E(f)/s) * Tread
    Ttransfer = (E(f)/s) * Tread
For (1, 16):
    Tservice = 17.5 + (3.4/16)(2.6) = 17.5 + 0.55 = 18.1 ms
    Ttransfer = 0.55 ms
For (1, 8), we would have:
    Tservice = 17.5 + (3.4/8)(2.6) = 18.6 ms
    Ttransfer = 1.1 ms
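These numerics are easy to reproduce; a short sketch in Python (the parameter names are ours):

```python
def service_time(t_latency, t_read, expected_file, disks):
    """Service and transfer time for a (1, s) disk array: the read of a
    file of E(f) access units is spread across s disks, so only
    E(f)/s of the per-unit read time Tread is paid in the transfer.
    The values below reproduce the (1, 16) configuration numbers."""
    t_transfer = (expected_file / disks) * t_read
    return t_latency + t_transfer, t_transfer

service, transfer = service_time(17.5, 2.6, 3.4, 16)
print(round(service, 1), round(transfer, 2))  # → 18.1 0.55
```

Doubling the number of disks from 8 to 16 halves the transfer component (1.1 ms down to 0.55 ms), but the latency term of 17.5 ms dominates the total, which is why the service time barely moves; the linear speedup claimed above applies to the transfer component, not to latency.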
[Figure IOBU8 residue: flattened table of I/O requirements. Applications listed: computational physics; particle algorithms in cosmology and astrophysics; radio synthesis imaging; computational biology; computational quantum materials; computational fluid and combustion dynamics. Recoverable figures: 4 GB of data per 4 h; 40 MB/s to 2 GB/s to disk; 50-100 MB/s disk to 3-inch storage (comparable to HiPPI/Ultra); 1 TB archival storage; 0.5 GB/s to disk; 4-5 MB/s to disk for visualization. Requirement classes S, A, and IOB are defined in the legend.]
Figure IOBU8: I/O requirements of Grand Challenge applications (source: [Patt94])
Legend:
S - Secondary,
A - Archival,
IOB - I/O Bandwidth.
Comment:
All Grand Challenge applications belong to the domain of scientific computing. With the recent innovations in Internet technology, applications from the business computing domain are becoming even more challenging.
Different companies employ different disk interface models in order to solve the I/O bottleneck in multiprocessor and multicomputer systems. The starting point for reading on these issues is [Patt94]. It covers the traditional approaches, most of which can be generalized using the architecture from Figure IOBU9.
The Intel Touchstone Delta model implies a 2D array (16 by 32) of processing element
nodes with 16 I/O nodes on the sides of the array.
The Intel Paragon model uses inexpensive I/O nodes which can be placed anywhere
within a mesh.
[Figure IOBU9 residue: processors (P) with their caches (CiP) connect through memory modules (M) and I/O channels (CI/O) over a parallel I/O connection to the I/O nodes, each driving drives 1..n.]
The Thinking Machines CM-5 model is based on fewer I/O processors, each characterized by a relatively high I/O bandwidth.
The best sources of information on the state of the art in the field are conferences which
include sessions on I/O and manuals on the latest products for the Grand Challenge applications.
At the time of the writing of this book, the Encore Infinity SP model, based on an internal architecture of the reflective memory type, is believed by many to be the best I/O pump on the planet (the reflective memory model and the author's involvement are discussed in a follow-up section).
Network interface technologies are considered crucial, especially for high demand applications like Grand Challenge or multimedia. Figure IOBU10 sheds some light on the capacities and characteristics of some traditional approaches.
Type           Bandwidth               Distance    Technology
Fiber Channel  100-1,000 MB/s          LAN, WAN    Fiber optics
HiPPI          800 MB/s or 1.6 GB/s    25 m        Copper cables (32 or 64 lines)
Serial-HiPPI   800 MB/s or 1.6 GB/s    10 km       Fiber-optics channel
SCI            8 GB/s                  LAN         Copper cables
Sonet/ATM      55 MB/s-4.8 GB/s        LAN, WAN    Fiber optics
N-ISDN         64 KB/s, 1.5 MB/s       WAN         Copper cables
B-ISDN         622 MB/s                WAN         Copper cables
Figure IOBU10: Network capacities and characteristics (source: [Patt94])
Legend:
LAN - Local Area Networks (up to several kilometers),
WAN - Wide Area Networks (spanning much larger distances).
Comment:
Quoted numbers are subject to fast change due to technological advances; for state of the art numbers, the reader is referred to the related WWW pages. Also, new standards emerge (for example, IEEE FireWire).
Once again, the issues covered so far are only those which, according to this author, represent the major problems to be solved by the designers of future microprocessors and multimicroprocessors on a single chip. The latest developments on these issues are covered in the next section.
2. Advanced Issues
This part contains the author's selection of research activities which, in his opinion, have made an important contribution to the field in recent years, and are compatible with the overall profile and mission of this book.
Paper [Hu96] describes an effort to optimize the I/O write performance. Their solution employs a small log disk, used as a secondary disk cache, to build a disk hierarchy. A small RAM buffer collects write requests and passes them to the log disk when it is idle. In this way, one obtains performance close to that of a RAM of the same size, for the cost of a disk. The conditions of their analysis imply that the temporal locality of the data is relatively high; the higher the temporal locality, the higher the performance of this approach.
Paper [Maquelin96] describes an effort to improve the efficiency of handling incoming messages in message passing environments. Their solution employs a hardware extension that limits the generation of interrupts to the cases where polling fails to provide a quick enough response. The conditions of their research imply that the message arrival frequency is the criterion for selecting interrupts versus polling. This solution is promising in environments typical of future distributed shared memory multimicroprocessors on a chip.
An important new trend in I/O research involves minimization of the negative effects of deadlocks, as well as minimization of the negative effects of multiple failures in RAID architectures [Pinkston97, Alvarez97].
In the case of disk array architectures, declustered organizations are used to achieve fast reconstruction of a failed disk's content. A crucial design issue is the data layout. The most common data layout is the well-known striped organization. The six desirable properties of declustered organizations are [Holland92]:
1. Single failure correcting: No two units of the same stripe are mapped to the same
disk, to make recovery possible from a single disk crash.
2. Distributed parity: All disks have the same number of check units mapped to them,
to balance the accesses to check units during writes or when a failure has occurred.
3. Distributed reconstruction: There is a constant such that, for every pair of disks, that
constant number of stripes have units mapped to both disks (to ensure that the accesses to surviving disks during on-line reconstruction are spread evenly).
4. Large write optimization: Each stripe contains a contiguous interval of the client's data, to process a write of all (k-1) data units without pre-reading the prior contents of any disk.
The multimedia accelerator AMD 3DNow! is a part of the AMD-K6-2 effort, which introduces 21 new instructions focusing on major bottlenecks in multimedia and floating-point-intensive applications [Oberman99]. This enables faster frame rates on high-resolution scenes, more accurate physical modeling of real-world phenomena, etc. The 3DNow! architecture operates on multiple operands in parallel (SIMD), and represents an extension of the x86 MMX architecture, which included the initial 57 instructions for multimedia applications. In addition to the speedup that comes from new instructions, there is also a speedup that comes from organizational innovations, like the introduction of new register structures and prefetching. In other words, the traditional SIMD architecture, enhanced with a more sophisticated (application tuned) register memory and a properly designed prefetching mechanism (with elements of pipelining), results in much better price/performance.
* * *
The author and his associates have not been very active in the field of I/O bottlenecks, except for side activities on related projects [Ekmecic96, Raskovic95].
3. Problems
1. Consult the open literature, and create a table with data rates of modern presentation devices, for the time of your reading this book. What is the approximate unit cost for each device in your table?
2. Consult the open literature, and create a table with data rates of modern transport devices, for the time of your reading this book. What is the approximate unit cost for each device in your table?
3. Consult the open literature, and create a table with data rates of modern storage devices,
for the time of your reading this book. What is the approximate unit cost for each device in
your table?
4. Create a table with URLs for major manufacturers of presentation, transport, and storage devices. What are the trends that you have noticed after a careful examination of each offer?
5. Disk access happens in bursts. In between the data transfer bursts, there are relatively
long silence periods, which can be used for prediction of what data may be needed (from
the disk) next. That data can be prefetched into the disk cache. Try to create a prediction
mechanism for prefetch (from the disk) into the disk cache. Develop a schematic which implements the proposed solution.
6. Create a block diagram of the disk log approach for improving the disk write. Develop a
detailed schematic.
7. Consult the reference [Ekmecic96] and discuss the suitability of each presented concept
for efficient large-scale I/O. Repeat the same for various applications of interest (e.g., for
applications from Figure IOBU8).
8. Multimedia applications are especially demanding when it comes to I/O. Consult the open literature and make a survey of I/O for multimedia applications.
9. A two-node reflective memory system can be used as an I/O "pump" with fault tolerance. All I/O data can be reflected at both nodes, in case one node fails. Consult the available literature on reflective memory and create a block diagram of such a solution. Discuss pros and cons.
10. Create a detailed block diagram of the architecture presented in [Alvarez98]. Explain
how it satisfies each one of the six requirements for an "ideal" declustered architecture.
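As a starting point for problem 5 above, the mechanism can be sketched in software (a hypothetical illustration, not a solution from the book; the cache size, the run-length threshold, and the FIFO replacement policy are all assumptions):

```python
# Hypothetical sketch of a sequential-run predictor for disk-cache
# prefetching: during the silence periods between bursts, the block
# that extends the current sequential run is prefetched.

class PrefetchPredictor:
    def __init__(self, cache_size=8):
        self.cache_size = cache_size
        self.cache = []          # blocks currently held in the disk cache
        self.last_block = None   # last block number actually accessed
        self.run_length = 0      # length of the current sequential run

    def access(self, block):
        """Record a disk access; return the block predicted to be next."""
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_block = block
        # Predict sequential continuation only once a run is established.
        return block + 1 if self.run_length >= 2 else None

    def prefetch(self, block):
        """Bring a predicted block into the disk cache (FIFO replacement)."""
        if block not in self.cache:
            self.cache.append(block)
            if len(self.cache) > self.cache_size:
                self.cache.pop(0)

p = PrefetchPredictor()
hits = 0
for blk in [10, 11, 12, 13, 40, 41, 42]:
    if blk in p.cache:
        hits += 1               # the access is served from the disk cache
    predicted = p.access(blk)
    if predicted is not None:
        p.prefetch(predicted)   # performed during the idle period
```

In this trace, blocks 12, 13, and 42 are found in the disk cache, because each of them continued an already detected sequential run.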
Multithreaded Processing
This chapter includes two sections. The first one is oriented to basic issues (background).
The second one is oriented to advanced issues (state-of-the-art).
1. Basic Issues
This chapter gives an introduction to multithreaded processing; mainly the elements
which are, in the author's opinion, of importance for future microprocessors on the chip.
There are two major types of multithreaded machines: (a) coarse grained, based on task
level multithreading, and (b) fine grained, based on the instruction level multithreading.
Task level multithreading implies that switching to a new thread is done on context
switch. Instruction level multithreading implies that switching to a new thread is done on
every cycle.
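The difference between the two switching policies can be illustrated with a small sketch (illustrative only; the thread names and the switch period are assumed):

```python
# Illustrative sketch contrasting the two switching policies:
# coarse grained multithreading switches threads only on a context
# switch event, while fine grained multithreading switches on
# every cycle.

def fine_grained_schedule(threads, cycles):
    """Issue from a different thread every cycle (round robin)."""
    trace = []
    for cycle in range(cycles):
        trace.append(threads[cycle % len(threads)])
    return trace

def coarse_grained_schedule(threads, cycles, switch_every):
    """Stay on one thread until a (simulated) context switch event."""
    trace = []
    for cycle in range(cycles):
        trace.append(threads[(cycle // switch_every) % len(threads)])
    return trace

fine = fine_grained_schedule(["T0", "T1", "T2"], 6)
coarse = coarse_grained_schedule(["T0", "T1", "T2"], 6, switch_every=3)
# fine   -> T0 T1 T2 T0 T1 T2  (a new thread every cycle)
# coarse -> T0 T0 T0 T1 T1 T1  (a new thread only on a switch event)
```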
Principal components of a multithreaded machine are: (a) multiple activity specifiers,
like multiple program counters, multiple stack pointers, etc., (b) multiple register contexts,
(c) thread synchronization mechanisms, like memory-access tags, two-way joins, etc., and
(d) mechanisms for fast switching between threads.
A common question is: what is the difference between a thread and a process? An important difference is that each process has its own virtual address space, while multiple threads run in the same address space; consequently, thread switching is faster than process switching. According to another definition, threads are mostly supported at the architecture level, and processes are mostly supported at the operating system level. For example, major multithreading constructs, like the start, suspension, and continuation of a thread, are usually supported at the ISA (instruction set architecture) level. Major processing constructs, like the start, suspension, and continuation of a process, are usually supported at the OS (operating system) level.
Projects oriented to multithreading include, but are not limited to, Tera (Smith at Tera
Computers), Monsoon and T* (Arvind at MIT in cooperation with Motorola), Super Actor
Machine (Gao at McGill University), EM-4 (Sakai, Yamaguci, and Kodama at ETL in Japan), MASA (Halstead and Fujita at Multilisp), J-Machine (Dally at MIT), and Alewife
(Agarwal at MIT). Many of these machines include elements of other concepts, like dataflow, message passing multicomputing, distributed shared memory multiprocessing, etc.
Consequently, all above mentioned efforts can be classified in a number of different ways.
Again, for details see the original papers or [Iannucci94]. For an interesting discussion
about relationships between dataflow and multithreading see [Silc98].
[Figure residue: block diagram of the HEP system, with PEM, DMM, PSU, ICN, IOC, and LAP units]
Internal structure of a PEM is given in Figure MTPU2. Each PEM can run up to eight
user threads and up to eight system threads. Details can be found in the original papers or in
[Iannucci94].
[Figure labels: Instruction Scheduler; Register and Constant Memory; Thread Control FU; Process Queue (PSWs); FP Add FU; FP Multiply FU; Integer FU; Execute Operation; Instruction Fetch; Effective Address; Operand Fetch; Result Store; Data Memory Controller; Program Memory; Task Register File; Local Data Memory; Interprocessor Network]
Figure MTPU2: Structure of the HEP processing element module (source: [Iannucci94])
Legend:
FU - Functional Unit,
FP - Floating-Point.
Comment:
This structure can be conditionally treated as a superscalar with a special thread control functional unit.
Coarse grained multithreading is important as a method of combining existing microprocessors into larger and more powerful systems. However, fine grained multithreading (to be discussed in the next section) is important for exploiting the available instruction level parallelism, in order to make single-chip machines faster.
Figure MTPU3 describes the essence of SMT. The term horizontal waste refers to unused slots within one cycle of a superscalar machine. Horizontal waste is a consequence of
the fact that, often, one thread does not include enough instruction level parallelism. The
term vertical waste refers to the cases when an entire cycle is wasted, because various hazards have to be avoided at run time. In Figure MTPU3, which describes traditional fine
grained multithreading, horizontal waste is equal to 9 slots, and vertical waste is equal to 12
slots; consequently, the total waste is equal to 21 slots. If SMT were used, other threads could fill in the unused slots, which, in turn, may bring the total waste down to zero.
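The two kinds of waste can be made concrete with a small sketch (the issue-slot grid below is an assumed example, constructed to reproduce the totals of Figure MTPU3):

```python
# Sketch computing horizontal and vertical waste from an issue-slot
# grid: each row is one cycle of a 4-issue machine, 1 marks a full
# slot and 0 an empty slot.

def waste(grid):
    horizontal = 0   # empty slots in cycles that issued something
    vertical = 0     # all slots of completely empty cycles
    for cycle in grid:
        empty = cycle.count(0)
        if empty == len(cycle):
            vertical += empty
        else:
            horizontal += empty
    return horizontal, vertical

# Assumed grid reproducing the totals of Figure MTPU3: 9 slots of
# horizontal waste, 12 slots (3 entirely empty cycles) of vertical
# waste, 21 wasted slots in total.
grid = [
    [1, 1, 0, 0],   # 2 slots of horizontal waste
    [0, 0, 0, 0],   # a fully wasted (vertical) cycle
    [1, 0, 0, 0],   # 3 slots of horizontal waste
    [0, 0, 0, 0],   # a fully wasted (vertical) cycle
    [1, 1, 1, 0],   # 1 slot of horizontal waste
    [0, 0, 0, 0],   # a fully wasted (vertical) cycle
    [1, 0, 0, 0],   # 3 slots of horizontal waste
]
h, v = waste(grid)
```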
[Figure labels: Cycles vs. IssueSlots grid; FullIssueSlot; EmptyIssueSlot; HorizontalWaste = 9 slots; VerticalWaste = 12 slots]
Figure MTPU3: Empty issue slots: horizontal waste and vertical waste (source: [Tullsen95])
Comment:
Potential efficiency of filling the empty slots (see Figure MTPU5) and minimal architectural differences in
comparison with traditional superscalars (see reference [Eickmeyer96]), make this approach a prime candidate for incorporation into the next generation microprocessor architectures.
In conclusion, superscaling is not efficient for vertical waste, traditional fine grained
multithreading is not efficient for horizontal waste, and SMT is efficient in both cases.
Figure MTPU4 lists the major sources of wasted issue slots, and possible latency hiding
and/or latency reducing techniques which can be used to cure the problems. It is essential to
understand that all listed latency hiding and latency reducing techniques have to be used
properly before one turns to SMT for further improvement of the run time performance.
Consequently, one has to be careful about the interpretation of the benefits of SMT: it may appear more efficient than it realistically is, unless all techniques from Figure MTPU4 have been used properly. The research from [Tullsen95] recognizes the problem and provides a realistic analysis which sheds an important light on SMT.
A performance comparison of SMT and various other multithreaded multiprocessor approaches is given in Figure MTPU5. The results are fairly optimistic in favor of SMT, in
spite of the fact that the study is based on a somewhat idealized case of SMT. The study
concludes that SMT paves the way to 8-issue and 16-issue superscalars, while the techniques used in microprocessors of mid to late 90s, applied to superscalar machines, do not
enable designers to go beyond the 4-issue superscalars.
Common Elements                          Specific Configuration           T
Test A: FUs = 32,                        SM: 8 thread, 8-issue            6.64
        IssueBw = 8, RegSets = 8         MP: 8 1-issue                    5.13
Test B: FUs = 16,                        SM: 4 thread, 4-issue            3.40
        IssueBw = 4, RegSets = 4         MP: 4 1-issue                    2.77
Test C: FUs = 16,                        SM: 4 thread, 8-issue            4.15
        IssueBw = 8, RegSets = 4         MP: 4 2-issue                    3.44
Unlimited FUs (Test A, but limit SM to 10 FUs):
Test D: IssueBw = 8, RegSets = 8         SM: 8 thread, 8-issue, 10 FU     6.36
                                         MP: 8 1-issue procs, 32 FU       5.13
Unequal issue BW (MP has up to four times the total issue bandwidth):
Test E: FUs = 32, RegSets = 8            SM: 8 thread, 8-issue            6.64
                                         MP: 8 4-issue                    6.35
Test F: FUs = 16, RegSets = 4            SM: 4 thread, 8-issue            4.15
                                         MP: 4 4-issue                    3.72
FU utilization (equal FUs, equal issue bw, unequal reg sets):
Test G: FUs = 8, IssueBw = 8             SM: 8 thread, 8-issue            5.30
                                         MP: 2 4-issue                    1.94
Figure MTPU5: Comparison of various (multithreading) multiprocessors
and an SMT processor (source: [Tullsen95])
Legend:
T - Throughput (instructions/cycle);
FU - Functional Unit.
Comment: Note that the number of instructions per cycle scales up almost linearly with the increase of the
issue width.
2. Advanced Issues
This part contains the author's selection of research activities which, in his opinion, have
made an important contribution to the field recently, and are compatible with the overall
profile of this book.
Paper [Eickmeyer96] describes an effort at IBM to use an off-the-shelf microprocessor
architecture in a form of multithreading. The effort is motivated by the fact that memory
accesses are starting to dominate the execution time of uniprocessor machines.
Major conclusion of the study is that multithreading is an important avenue on the way to
more efficient multiprocessor/multicomputer environments. For details, see [Eickmeyer96].
The analysis in [Eickmeyer96] assumes object oriented programming for on-line transaction processing. This does not mean that the approach is not efficiently applicable to other environments of interest.
Paper [Tullsen96] describes a follow up effort on the SMT project, aimed at better characterization of the SMT design environment, so that more realistic performance figures are
obtained. The major goal of the project is to pave the way for implementation which extends the conventional wide-issue superscalar architecture using the principles of SMT.
Major principles of their study are: (a) minimizing the changes to the conventional superscalar architectures, which makes SMT more appealing for incorporation into future
versions of existing superscalar machines, (b) making any single thread only slightly suboptimal, which means that the performance of a single thread can be sacrificed for at most
about 2%, and (c) achieving the maximal improvement over the existing superscalar machines when multiple threads are executed. In other words, [Tullsen96] insists that performance of multiple threads should not be targeted at the cost of slowing a single thread for
more than about 2%.
Their analysis assumes a modified Multiflow compiler, adapted to work with eight threads, and a specific superscalar architecture along the lines of the MIPS R10000. This study
takes into account the fact that the increased number of multiple activity specifiers (register
files, etc.) either slows down the microprocessor clock or causes some operations which are
of the one-cycle type in traditional superscalars to become of the two-cycle type in SMT. In
spite of the relatively pessimistic timing conditions of the study, SMT still demonstrates a
relatively optimistic throughput of 5.4, versus the traditional superscalar throughput of 2.5,
in conditions when all architectural details are the same, except that SMT runs multiple
threads, and traditional superscalar runs a single thread.
An important new trend in high-performance microprocessors implies the combination of
simultaneous multithreading on one side and multithreaded vector architectures and/or multiscalar processor architectures on the other side [Espasa97, Jacobson97].
An issue of special importance, which helps a wider acceptance of a new concept, is to
show performance in a variety of scenarios of importance. Paper [Lo98] shows the performance analysis results for SMT in the case of database workload. Results promise that
SMT has a future in this important application field.
Multithreading and cache designs are highly correlated. However, the problem is not
widely studied in the open literature. Paper [Kwak99] represents one such effort, and proposes a novel multithreaded virtual processor model, which enables an easier evaluation
of the interaction between multithreading and cache performance.
In general, multithreading is especially useful if programs are, prior to execution, partitioned into threads. If that is not the case, the benefits of multithreading are limited. With this in mind, the SSMT (simultaneous subordinate microthreading) approach [Chappell99] proposes to use subordinate microthreads to enhance the performance of the primary thread, which proves to be useful for the overall performance.
[Figure labels: TC; IC; BP; FU; IB; DU; Integer RS; Floating-point RS; Media RS; Memory RS; ROB; SQ; DC; instruction flow; memory dataflow; register dataflow; execute; commit]
Paper [Smith97] argues that the best solution for a 1BTr on-chip uniprocessor is the
trace processor. Instruction fetch hardware unwinds programs into traces. Traces are
placed into a trace cache. A special trace fetch unit reads traces from the trace cache and
forwards them to appropriate processing elements. With appropriate control mechanisms
and processing elements, a peak throughput of one trace per clock cycle can be achieved.
Traces are hardware generated (not compiler generated as in the multiscalar processor of [Sohi95]). Multiple superscalar pipelines are used.
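The trace-cache idea can be sketched as follows (a hypothetical illustration; the indexing by starting address plus branch outcomes follows the general trace-cache literature, not a specific design from [Smith97]):

```python
# Hypothetical sketch of a trace cache: traces are identified by
# their starting address plus the outcomes of the branches inside
# them, are built by hardware at run time, and a later hit delivers
# the whole trace in a single fetch.

class TraceCache:
    def __init__(self):
        self.traces = {}   # (start_pc, branch_outcomes) -> instructions

    def fill(self, start_pc, branch_outcomes, instructions):
        """Hardware fill side: record a trace as instructions retire."""
        self.traces[(start_pc, tuple(branch_outcomes))] = list(instructions)

    def fetch(self, start_pc, predicted_outcomes):
        """Fetch side: a hit returns the whole trace at once."""
        return self.traces.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
tc.fill(0x100, [True, False], ["ld", "add", "beq", "st", "bne", "sub"])

hit = tc.fetch(0x100, [True, False])     # same path: whole trace at once
miss = tc.fetch(0x100, [False, False])   # different path: trace miss
```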
[Figure labels: LR; IB; FU; TP; BP; CT; IP; TC; DU; DP; PE 1; PE 2; PE 3]
[Figure labels: 5MB 2L TC; 1MB MHBP; IC; 8MB L2 DC; RS; F1 ...; 8x32KB DC]
Paper [Kozyrakis97] argues that the best way to go for a 1BTr uniprocessor on a single chip is to place the entire main memory on the same chip, together with a regular CPU like the R10000 or similar. This assumes that DRAM technology is used. It is argued that such a uniprocessor will be perfectly scalable - one only adds more main memory - a resource that is always needed and never large enough. Such an approach is generally referred to as IRAM, or Intelligent RAM.
Architecture of an IRAM-based 1BTr CPU is given in Figure MTPS4. It is simple. It includes a RISC processor (e.g., one of the year 2000), plus main memory. An important argument of [Kozyrakis97] is that DRAM can hold up to 50 times more data than the same
area devoted to the caches. Another advantage of IRAM is that it can be combined with a
number of different CPU architecture types for better tuning to a given application.
[Figure MTPS4 labels: IRAM; CPU-2000]
* * *
The author and his associates were not very active in the field of multithreading, except for side activities on related projects. For details, see [Helbig89, Milutinovic88b].
3. Problems
1. Create a piece of symbolic graphics which illustrates the essential differences between a thread and a process. What are the major issues to underline?
2. Create a table with the major characteristics (and URLs) of the coarse grained multithreading projects mentioned in this book. Add more examples not included in this book.
1. Basic Issues
The programming model of SMP machines is relatively simple and straightforward. It resembles the programming model of single instruction single data (SISD) machines. Therefore, the existing code can be easily restructured for reuse, and new code can be easily developed using well-known and widely accepted programming techniques.
Unfortunately, the SMP systems are not scalable beyond some relatively low number of
processors. This number changes over time and depends on the speed of the components
involved (processors, bus, memory). For the current technology of components involved,
the performance/price ratio starts dropping after the number of processors reaches 16; it starts dropping sharply after the number of processors crosses 32. Consequently, SMP systems with more than 16 nodes are rarely considered.
One of the major problems in implementing the SMP is cache consistency maintenance.
This problem arises if two or more processors bring the same shared data item into their
local private caches. While this data item is only being read from the local private caches of the processors-sharers, consistency of the system is maintained. However, if one of the processors-sharers executes a write and changes the value of the shared data item, all subsequent reads
(of that same data item) may result in a program error. Maintenance of cache consistency is
done using appropriate cache consistency maintenance protocols. They can be implemented
in hardware, software, or using hybrid techniques. Here we will cover only the basic notions of hardware protocols. For advanced aspects of hardware protocols, the interested reader is referred to [Tomasevic93]. Software protocols will not be elaborated in this book, due to their marginal use in modern computer systems. For detailed information on software protocols, interested readers are referred to [Tartalja96]. Hybrid protocols try to combine the best of the two extreme approaches (fully hardware approach versus fully software
approach). They represent a promising new avenue for the on-going research.
[Figure residue: caches C1, C2, ..., CN with valid (V) bits and copies of X/X', illustrating READ X and WRITE X transactions under the write-invalidate and write-update policies]
Write-update protocols are not so cheap to implement, since additional bus lines are
needed. They generate more bus traffic - control signals as well as data and addresses are
being transferred across the bus in the case of the example of Figure SMPU2. However, as
indicated earlier, other sharers of the same new value will have that value readily available
when needed, and will not have to run through the potentially expensive read miss cycle.
On the other hand, the updating may bring a data item that will never be used by a remote
processor and may force the purge of a data item that may be needed later.
In theory, with 3 state bits one can create 8 states. However, some states make no sense (exclusiveness or ownership of an invalid data item). Consequently, only 5 states make sense. These 5 states are:
(a) Exclusive owned
(b) Shareable owned
(c) Exclusive un-owned
(d) Shareable un-owned
(e) Invalid
It is mentioned above that ownership helps distinguish a modified copy from the copy in main memory. Consequently, the term modified can be used instead of the term owned, so the above defined 5 states can be labeled as:
(a) Exclusive modified, or Modified
(b) Shareable modified, or Owned
(c) Exclusive un-modified, or Exclusive
(d) Shareable un-modified, or Shareable
(e) Invalid
These 5 states define the acronym MOESI. Figure SMPU3 illustrates the way in which V,
E, and O bits combine to create 5 states of the MOESI model. Note that states M, O, E, and
S are actually state pairs (only I is not a state pair, but a single state). Also, note the difference between the O bit and the O state.
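The combination of the three bits can be sketched as follows (the bit-to-state mapping below is an illustration based on the state definitions above; the figure defines the actual encoding):

```python
# Sketch of how the V (validity), E (exclusiveness), and O (ownership)
# bits combine into the five MOESI states; the invalid-but-owned or
# invalid-but-exclusive combinations make no sense and never occur.

def moesi_state(v, e, o):
    if not v:
        return "I"    # Invalid (E and O bits are then meaningless)
    if e and o:
        return "M"    # Exclusive modified, i.e. Modified
    if not e and o:
        return "O"    # Shareable modified, i.e. Owned
    if e and not o:
        return "E"    # Exclusive un-modified, i.e. Exclusive
    return "S"        # Shareable un-modified, i.e. Shareable

# All 8 bit combinations collapse into exactly 5 distinct states.
states = {moesi_state(v, e, o)
          for v in (0, 1) for e in (0, 1) for o in (0, 1)}
```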
The MOESI model represents an extrapolation and concatenation of models found in the
preceding research. Before the MOESI model was established, researchers used different
terms for the state pairs M, O, E, and S. For example, M is known as modified, exclusive
modified, or exclusive owned. O is known as owned, shareable modified, shareable owned,
or shareable responsible. E is known as exclusive, exclusive unmodified, or exclusive unowned. S is known as shareable, shareable unmodified, shareable unowned, or shared. Note that the S state does not imply that main memory is valid.
One needs 6 signal lines in order to implement the MOESI protocol on the backplane of
Futurebus (or its descendants). Three lines are used by the master for the transactions, to
indicate the intentions of the master; the other three are used for other units on the bus, to
assert either status or control.
In reality, one more signal line is needed, called BS or busy, to abort a transaction. It is needed in order to implement versions of other existing protocols, introduced before MOESI. Therefore, the total number of needed lines is 7.
Earlier protocols include Write-Once, Illinois, Berkeley (in the write-invalidate domain),
and Firefly or Dragon (in the write-update domain). For details, see [Tomasevic93]. All
these protocols are based on 3 or 4 states, and conditionally speaking, each one can be
treated as a subset of the MOESI protocol. However, note that, in the general case, with the above defined 7 control lines, one can implement only versions of many existing protocols - not their exact algorithms.
For example, the states of the Berkeley protocol are MOSI, and the states of the Dragon protocol are MOSE. In principle, Berkeley is treated as the most mature protocol in the write-invalidate group, and Dragon is treated as the most mature protocol in the write-update group. Of course, in each of the two groups there exist newer protocols with better performance, but they have been derived either from Berkeley, or Dragon, or from a combination of the two.
[Figure SMPU3 labels: OWNERSHIP, VALIDITY, EXCLUSIVENESS]
The MESI protocol can be treated as a first step towards a future goal of having an entire
SMP system on a single VLSI chip. This goal is believed to be implementable as soon as
the on-chip transistor count crosses the 10 million threshold. Of course, with simpler nodes,
the goal can be achieved sooner.
A partial list of machines implementing the MESI protocol or its supersets (for example,
MOESI) includes, but is not limited to: (a) AMD K5 and K6, (b) Cyrix 6x86, (c) the DEC
Alpha series, (d) the HP Precision Architecture series, (e) the IBM PowerPC series, (f) Intel
Pentium, Pentium Pro, Pentium II, and Merced, (g) SGI series, and (h) SUN UltraSparc series.
1.1.5. SI Protocol
For instruction cache memory, there is no need to maintain M and E states, because these
caches support only cache read. Consequently, the protocol built into the instruction cache
controllers is SI. The controller for the SI protocol is relatively simple to design. As such, it is a
good example for student homework assignments.
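A minimal sketch of such an SI controller (the event names are assumed for illustration):

```python
# Sketch of the two-state SI controller suggested above as a homework
# exercise: an instruction-cache line is either Shareable or Invalid,
# any CPU read leaves the line in S (a miss fills it, a hit keeps it),
# and any snooped bus write invalidates it. There is no M or E state,
# because the instruction cache is never written by its own CPU.

S, I = "S", "I"

def si_next_state(state, event):
    if event == "cpu_read":
        return S        # read hit stays S; read miss fills the line to S
    if event == "bus_write":
        return I        # another master wrote the line: invalidate it
    return state        # all other events leave the state unchanged

line = I
line = si_next_state(line, "cpu_read")    # fill on read miss
line = si_next_state(line, "cpu_read")    # read hit
line = si_next_state(line, "bus_write")   # snooped write invalidates
```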
In this context, bus implies parallel transfer on an open loop topology, ring implies parallel transfer on a closed loop topology, and LAN implies serial transfer on any topology (open, closed, star, etc.). Furthermore, grid, mesh, and crossbar imply the same topology of
lines (a number of horizontal and a number of vertical lines); however, processing nodes
are positioned differently. In the case of a grid, processing nodes are located next to line
crossings (messages go by the processing nodes). In the case of a mesh, processing nodes
are located on the line crossings (messages go through the processing nodes). In the case of
a crossbar, processing nodes are located on horizontal and/or vertical line terminations.
Information stored in the directory can be organized in a number of different ways. This
organization determines the type and the characteristics of directory protocols. The three
major types of directory protocols are: (a) full-map directory, (b) limited directory, and
(c) chained directory. All three of them will be briefly elaborated in the text to follow.
1.2.1. Full-Map Directory Protocols
The essence of the full-map directory protocols is explained using Figure SMPU5. Each directory entry (related to a cache block) includes N presence bits corresponding to the N processors in the system, plus a single bit (dirty or D) which denotes whether the memory is updated, or
not (D = 0 means that the memory is updated, i.e. not dirty). This means that each directory
entry includes N + 1 bits.
This type of protocol is denoted as Dir(N)NB, which means that each entry includes N
presence related fields, and no broadcast operation is ever used.
Some of the consistency maintenance bits are kept in the local private cache memories;
in the case of full-map protocols, two more bits are added per each cache block. Each cache
block includes one validity bit (V) telling if the cached copy is valid (V = 1), or not, and
one modified bit (M) telling if the value in the cache compared to the value in the memory
is different/modified (M = 1), or not.
In Figure SMPU5 (left hand side) processor K reads a value, which is shared by processors 1 and K. The sharing status is reflected through the fact that two out of the N bits in the
centralized directory entry are set to one. Also, since the value in the memory and the values in the caches are the same, memory bit D is set to zero (D = 0), and cache bits M are set
to zero (M = 0). Of course, the V bit for all sharers is set to one (V = 1).
In Figure SMPU5 (right hand side) processor K writes a new value into its own cache.
Consequently, all but one of the bits in the presence vector are set to zero. Since a write-back approach is assumed in the example of Figure SMPU5, the dirty bit in the corresponding memory entry is set to one (D = 1), telling that the memory is not up to date any more.
Since a write-invalidate approach is assumed in the example of Figure SMPU5, the V bits
of other caches in the system get set to zero; in such a condition, the M bits in other caches become irrelevant (typically, designs are such that V = 0 causes M = 0, as well).
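The read and write transitions described above can be sketched as follows (an illustrative software model, not a hardware description):

```python
# Sketch of one full-map directory entry, Dir(N)NB: N presence bits
# plus one dirty bit, maintained under the write-back and
# write-invalidate policies of the Figure SMPU5 walkthrough.

class FullMapEntry:
    def __init__(self, n):
        self.presence = [0] * n   # one presence bit per processor
        self.dirty = 0            # D = 1: the memory copy is stale

    def read(self, proc):
        """A read (miss) adds the reader to the sharing set."""
        self.presence[proc] = 1

    def write(self, proc):
        """A write invalidates all other sharers and sets the dirty bit."""
        invalidated = [p for p, bit in enumerate(self.presence)
                       if bit and p != proc]
        self.presence = [0] * len(self.presence)
        self.presence[proc] = 1
        self.dirty = 1            # write-back: memory not updated yet
        return invalidated        # caches whose V bits get cleared

entry = FullMapEntry(4)
entry.read(1)                     # processors 1 and 3 share the block
entry.read(3)
victims = entry.write(3)          # processor 3 writes the block
```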
The full-map protocol is characterized by the best performance, compared to other directory protocols. This is because its coherence traffic is the smallest among the directory protocols. However, this protocol is characterized by a number of drawbacks.
First, the size of the directory storage is O(MN), where M refers to the number of blocks
in the memory, and N refers to the number of processors in the system. This is essentially
an O(N2) complexity, which means that the protocol is not scalable. It is not able to support
very large systems, due to a large cost which grows as O(N2).
Second, adding a new node requires system changes, like the widening of the centralized
controller, etc. Consequently, the protocol is not flexible for expansion, i.e. the cost per
added node is considerably larger than the cost of the node alone.
Third, the centralized controller is a potential fault-tolerance bottleneck, as well as a performance-degradation factor, especially if more nodes are present in the system.
The protocols to follow eliminate some or all of the drawbacks of full-map protocols, at
the expense of performance reduction, due to an increased consistency maintenance related
traffic.
1.2.2. Limited Directory Protocols
The essence of the limited directory protocols is explained using Figure SMPU6, which includes a straightforward modification of a full-map protocol, called Dir(i)NB. Each directory entry (related to a cache block) includes the same single dirty bit (D) plus only i (i < N) presence fields, each presence field being log2N bits long. In total, this is 1 + i*log2N bits (remember that the full-map protocol entries include 1 + N bits). In reality, such an approach means fewer bits per entry. If N is large enough, and i is small enough, the size of one
directory entry is less, compared to the full-map protocol. For example, if N = 16 and i = 2,
a Dir(i)NB limited directory protocol includes 9 bits, and a full-map directory protocol includes 17 bits. In other words, if N = 16, the Dir(i)NB limited directory protocols are less
costly if i < 4.
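The bit-count comparison can be written out directly (illustration only):

```python
# The directory-entry sizes from the text: a full-map entry needs
# 1 + N bits, a Dir(i)NB entry needs 1 + i*log2(N) bits, so for
# N = 16 the limited scheme is smaller exactly when i < 4.

from math import log2

def full_map_bits(n):
    return 1 + n

def limited_bits(n, i):
    return 1 + i * int(log2(n))

# N = 16, i = 2: 9 bits versus 17 bits, as in the text.
nine = limited_bits(16, 2)
seventeen = full_map_bits(16)
```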
The described approach is possible, because several studies have shown that typically a
block is shared only by a relatively small number of processors. In many cases, it is only a
producer and a consumer. The question is what happens if a need arises for a cache block to
be shared by more than i processors, i.e., what type of mechanism is used to handle the
presence vector overflow (which happens when the number of sharers becomes larger than
i).
Actually, there are two types of limited directory protocols. The one described so far, as
indicated, is denoted as Dir(i)NB. Another one is denoted as Dir(i)B, which means that
each entry includes i presence related fields, and that the broadcast operation is used in the
protocol.
In both schemes, some of the consistency maintenance bits are kept in the local private
cache memories. As in the case of full-map protocols, two more bits are added per each
cache block. Same as before, each cache block includes one validity bit (V) telling if the
cached copy is valid (V = 1), or not, and one modified bit (M) telling if the value in the
cache compared to the value in the memory is different/modified (M = 1), or not.
In both cases, the protocols are scalable as far as the directory storage overhead is concerned, which is O(MlogN), or essentially O(NlogN). In both cases, the protocols are also less inflexible, because adding a new node does not always require changes in the central directory (a new node may not change the prevalent pattern of sharing). However, in both cases, performance is affected by increased sharing. Also, the centralized directory continues to be a bottleneck of many kinds.
1.2.2.1. The Dir(i)NB Protocol
In Figure SMPU6, which explains Dir(i)NB, there is a restriction for reading (which is a
consequence of the fact that i < N), meaning that the number of copies for simultaneous
read is also limited to i. For example (left hand side of the figure), processor N reads a value, which is shared by processors 1 and K. After the read miss, the value will be brought
into cache N (right hand side of the figure), and the corresponding pointer field will be updated (K is substituted by N), but the copy in another cache (K in this example) has to be
invalidated. In other words, the sharing status had to be changed, because two out of the
i = 2 fields (in the centralized directory entry) have already been in use. Also, since the value in the memory and the values in the active caches continue to be the same, memory bit
D continues to be set to zero (D = 0), and cache bits M continue to be set to zero (M = 0).
Of course, the V bit for all active sharers is set to one (V = 1).
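The overflow behavior of Dir(i)NB can be sketched as follows (the choice of which sharer to invalidate is a policy detail; evicting the oldest pointer below is an assumption):

```python
# Sketch of the Dir(i)NB pointer-overflow rule from the Figure SMPU6
# walkthrough: when an (i+1)-th reader appears, one existing sharer
# is invalidated so that its pointer field can be reused.

class LimitedEntry:
    def __init__(self, i):
        self.i = i
        self.pointers = []   # up to i sharer IDs (log2 N bits each)

    def read(self, proc):
        """Record a new reader; return the sharer invalidated, if any."""
        if proc in self.pointers:
            return None
        victim = None
        if len(self.pointers) == self.i:
            # Pointer overflow: evict one sharer (here, the oldest).
            victim = self.pointers.pop(0)
        self.pointers.append(proc)
        return victim

entry = LimitedEntry(i=2)
entry.read(1)            # sharers: {1}
entry.read(5)            # sharers: {1, 5} -- the entry is now full
victim = entry.read(7)   # overflow: one existing copy is invalidated
```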
In this context, there are no restrictions on the number of simultaneous readers. However,
if a write happens in conditions of a pointer overflow (B = 1), an invalidation broadcast
signal will be generated. Of course, the need for invalidation broadcast increases the write
latency; in addition, some of the invalidation broadcasts are not necessary, which wastes a fraction of the communications bandwidth.
All protocols described so far share some common drawbacks: (a) too much is centralized, which is a bottleneck, (b) adding a new node requires centralized changes, which is
inflexible, and (c) the size of the central directory is still relatively large. The chained directory protocols, to be described next, remove these problems, at a price which is negligible in a number of common applications.
This means that the three above mentioned problems have been cured. First, almost nothing is centralized (only the vector of linked list heads), and there is no central bottleneck. Next, the system is absolutely flexible for expansion, and no changes in the centralized directory are needed if a new node has to be added. Finally, the size of the centralized directory is down to the minimum, i.e. the storage cost is defined by O(logN).
Figure SMPU8 gives an example based on the singly-linked lists. After a new reader
shows up (processor 1), it is added to the linked list, as the first node after the head of the
list. All searches start from the head of the list; consequently, due to the locality principle,
the most recently added read sharer is the most likely one to be reading in the near future.
The end of the linked list is denoted with a terminator symbol (CT).
Performance can be improved at a minimal cost increase, if a doubly-linked list is
used. Research at MIT has shown that such an approach leads to performance which is
close to the one provided by full-map directory protocols, at the cost which continues to be
O(logN).
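The singly-linked variant can be sketched as follows (an illustrative model; the CT terminator is represented by a null link):

```python
# Sketch of a singly-linked chained directory: the centralized part
# holds only the list head, new readers are inserted at the head,
# and a write walks the chain to invalidate every sharer.

CT = None   # chain terminator

class ChainedDirectory:
    def __init__(self):
        self.head = CT   # centralized state: O(logN) bits per block
        self.next = {}   # per-cache forward pointers

    def add_reader(self, proc):
        """A new read sharer becomes the new head of the list."""
        self.next[proc] = self.head
        self.head = proc

    def invalidate_all(self):
        """A write traverses the list, invalidating each sharer."""
        invalidated, node = [], self.head
        while node is not CT:
            invalidated.append(node)
            node = self.next.pop(node)
        self.head = CT
        return invalidated

d = ChainedDirectory()
for proc in [4, 2, 1]:        # readers arrive in this order
    d.add_reader(proc)
victims = d.invalidate_all()  # the most recent reader is reached first
```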
2. Advanced Issues
This part contains the author's selection of research activities which, in his opinion, have made an important contribution to the field recently, and are compatible with the
overall profile of this book. Here we concentrate only on extended pointer schemes and
protocols for efficient elimination of negative effects of false and passive sharing.
The approach of [Simoni90, Simoni91, Simoni92] is described in Figure SMPS2. It describes an effort to improve the characteristics of chained directory protocols, by using dynamic pointer allocation. Each directory entry includes a short and a long part. The Short
Directory includes only a small directory header, one for each shared cache block. This
header includes two fields: (a) dirty state bit, and (b) head link field. The Long Directory
includes two types of structures: (a) one linked list of free entries of the so called pointer/link store (referred to as the Free List Link), and (b) numerous linked lists for shared
cache blocks, with two fields per link list element (processor pointer and forward link). On
read, the new sharer is added at the head position (the space for the new sharer is obtained
by reallocating an entry from the Free List Link). On write, the linked list is traversed for
invalidation purposes (and the space is freed up by returning the entries into the Free List
Link). Essentially, this protocol represents a hardware implementation of the LimitLESS
protocol.
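A minimal sketch of the dynamic pointer allocation idea follows. It is not taken from [Simoni90]; the field and class names are illustrative. It shows the short directory header (dirty bit plus head link) and a pointer/link store whose unused entries form the Free List Link.

```python
# Hypothetical sketch of dynamic pointer allocation:
# a per-block header plus a shared pointer/link store with a free list.

FREE = -1  # link terminator

class PointerStore:
    def __init__(self, size):
        self.proc = [None] * size                  # processor pointer field
        self.link = list(range(1, size)) + [FREE]  # forward link field
        self.free_head = 0                         # head of the Free List Link

    def alloc(self, pid, next_link):
        # On a read: take one entry off the free list for the new sharer.
        i = self.free_head
        if i == FREE:
            raise MemoryError("pointer/link store exhausted")
        self.free_head = self.link[i]
        self.proc[i], self.link[i] = pid, next_link
        return i

    def free_all(self, head):
        # On a write: traverse the sharing list, returning entries to the free list.
        pids = []
        while head != FREE:
            nxt = self.link[head]
            pids.append(self.proc[head])
            self.link[head] = self.free_head
            self.free_head = head
            head = nxt
        return pids

class DirEntry:                   # the "short directory" header
    def __init__(self):
        self.dirty = False
        self.head = FREE

store = PointerStore(8)
entry = DirEntry()
for pid in (4, 7, 2):             # three read sharers arrive
    entry.head = store.alloc(pid, entry.head)
invalidated = store.free_all(entry.head)   # a write traverses and frees
entry.head, entry.dirty = FREE, True
print(invalidated)                # [2, 7, 4] -- most recent sharer first
```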
related to misses on remotely cached blocks will be decreased. It has been shown that Cosmos can achieve prediction accuracies of about 62% to about 93%. The bottom line is
that a coherence protocol can execute faster if future actions can be predicted and executed
speculatively.
As indicated in the preface of this book, the main assumption of this book is that one of
the major goals of future on-chip processor designs is to have an entire shared memory
multiprocessor on the same silicon die. This assumption has been confirmed by several studies. Here, results from the simulation study of [Hammond97] will be presented in detail.
The three architectures compared by researchers from Stanford University are: (a) superscalar, (b) simultaneous multithreading, and (c) shared memory multiprocessor. These three
architectures are shown in Figure SMPS4. They are characterized by approximately the
same number of transistors (one billion). Characteristics of the three compared architectures
are given in Figure SMPS5. The simulated performance is shown in Figure SMPS6. On average, the shared memory multiprocessor architecture performs the best, and is the easiest
to implement.
Results of the described simulation study can be easily explained. When the minimum
feature size decreases, the transistor gate delay decreases linearly, while the wire delay
stays nearly constant or increases. Consequently, as the feature size improves, on-chip wire
delays improve two to four times more slowly than the gate delay. This means that resources
on a processor chip that must communicate with each other must be physically close to
each other. For this reason, the simulation study has shown (as expected) that
the shared memory multiprocessor architecture exhibits, on average, the best performance
for the same overall transistor count. In addition, due to its modular structure, the complexity
of the shared memory multiprocessor architecture is much easier to manage at design time
and test time. Due to the scalability of multiprocessor architectures with logically
shared memory address spaces, the same conclusion applies to a wider range of architectures with a logically shared memory address space.
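The scaling argument can be made concrete with a toy model. The numbers below are illustrative assumptions, not data from the study: gate delay is taken to shrink linearly with feature size, while the delay of a fixed-length global wire stays roughly constant.

```python
# Toy model (assumed numbers): as the feature size shrinks, gate delay
# falls linearly while a fixed-length wire's delay stays constant, so the
# wire/gate delay ratio grows -- favoring architectures whose communicating
# resources are physically close, like a chip multiprocessor.

def relative_delays(feature_nm, base_nm=250.0):
    gate = feature_nm / base_nm   # gate delay, normalized to the base process
    wire = 1.0                    # global wire delay, assumed constant
    return gate, wire / gate      # (gate delay, wire/gate ratio)

for f in (250, 180, 130):
    gate, ratio = relative_delays(f)
    print(f"{f} nm: gate delay {gate:.2f}, wire/gate ratio {ratio:.2f}")
```

At half the base feature size the wire/gate ratio doubles, which is the effect the study attributes the chip multiprocessor's advantage to.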
[Diagrams not reproduced: (a) a single superscalar CPU with 128-Kbyte I and D caches; (b) a simultaneous multithreading CPU with eight program counters and eight register sets, also with 128-Kbyte I and D caches; (c) eight dual-issue CPUs, each with its own program counter, registers, dual-issue logic, and 16K I and 16K D caches; in all three cases the chip connects to RDRAM over a system bus.]
Figure SMPS4: Comparing (a) superscalar, (b) simultaneous multithreading, and (c) chip multiprocessor
architectures [Hammond97]. These three architectures, each one of the same transistor count (one billion), are
compared in [Hammond97].
Legend: I Instruction
D Data
PC Program counter
Comment: All three architectures are implementable with approximately the same number of transistors
(one billion). Paper [Hammond97] concludes that the third architecture (multiprocessor) is the best one to
implement.
Characteristic                                | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs                                | 1           | 1             | 8
CPU issue width                               | 12          | 12            | 2 per CPU
Number of threads                             | 1           | 8             | 1 per CPU
Architecture registers (integer and FP)       | 32          | 32 per thread | 32 per CPU
Physical registers (integer and FP)           | 32+256      | 256+256       | 32+32 per CPU
Instruction window size                       | 256         | 256           | 32 per CPU
Branch predictor table size (entries)         | 32,768      | 32,768        | 8 x 4,096
Return stack size                             | 64 entries  | 64 entries    | 8 x 8 entries
I and D cache organization                    | 1 x 8 banks | 1 x 8 banks   | 1 bank
I and D cache size                            | 128 Kbytes  | 128 Kbytes    | 16 Kbytes per CPU
I and D cache associativities                 | 4-way       | 4-way         | -
I and D line sizes (bytes)                    | 32          | 32            | -
I and D cache access times (cycles)           | -           | -             | -
Secondary cache organization                  | 1 x 8 banks | 1 x 8 banks   | 1 x 8 banks
Secondary cache size (Mbytes)                 | 8           | 8             | 8
Secondary cache associativity                 | 4-way       | 4-way         | 4-way
Secondary cache line size (bytes)             | 32          | 32            | 32
Secondary cache access time (cycles)          | -           | -             | -
Secondary cache occupancy per access (cycles) | -           | -             | -
Memory organization (number of banks)         | -           | -             | -
Memory access time (cycles)                   | 50          | 50            | 50
Memory occupancy per access (cycles)          | 13          | 13            | 13
(A dash marks a cell whose value was not recoverable from the original table.)
Figure SMPS5: Characteristics of the compared superscalar, simultaneous multithreading, and chip multiprocessor architectures.
[Bar chart not reproduced: relative performance (scale 0 to 8) of SS, SMT, and SMP on the compress, mpeg, tomcatv, and multiprogram workloads.]
Figure SMPS6: Relative performance of the simulated on-chip processor architectures: (a) superscalar, (b)
simultaneous multithreading, and (c) shared memory multiprocessor.
Legend:
SS Superscalar
SMT Simultaneous multithreading
SMP Shared memory multiprocessor
Comment: On average, SMP performs the best, and is the easiest one to implement.
* * *
The author and his associates were active in the field of shared-memory multiprocessing,
and especially in the field of cache consistency for shared-memory multiprocessors. For
more details, see [Ekmecic95, Tartalja97, Tomasevic92a, Tomasevic92b, and Tomasevic93].
3. Problems
1. Design a control unit for a write-invalidate snoopy protocol of your choice. What is the
transistor count, and how does it depend on the cache size?
2. Design a control unit for a write-update snoopy protocol of your choice. What is the transistor count, and how does it depend on the cache size?
3. If one word in a block is dirty, the whole cache block is considered dirty. However, an
improvement of the MOESI protocol can be devised (called, for example, MOESI-Advanced,
or MOESIA), in which a cache block is declared dirty only after the second word becomes
dirty. Develop the details of such a cache consistency maintenance protocol. Explain for
what applications such a protocol is potentially useful.
4. Design a control unit for a version of the full-map directory protocol. What is the transistor count?
5. Design a control unit for a version of Dir(i)NB limited-directory protocol. What is the
transistor count?
6. Design a control unit for a version of Dir(i)B limited-directory protocol. What is the
transistor count?
7. Compare singly-linked-list and doubly-linked-list chained directory protocols, from the
performance and complexity points of view. What are the pros and cons of each approach?
8. Consult open literature (e.g. Internet) and find out about the commercial SMP products
using SCI (scalable coherent interface). Make a table that compares the major features of
these products.
9. What cache consistency maintenance protocols are most efficient for elimination of false
sharing? Why?
10. Compare different passive sharing oriented cache consistency maintenance protocols,
and explain pros and cons of each one. What are the applications in which passive sharing
becomes a problem?
1. Basic Issues
The basic structure and organization of a DSM system are given in Figure DSMU1. The
backbone of the system is a system-level interconnection network (ICN) of any type. This
means that some systems include a minimal ICN of the BRL (bus, ring, or LAN) type. Other systems include a maximal ICN of the GMC (grid, mesh, or crossbar) type, or some reduced ICN which is in between, in terms of cost and performance.
[Diagram not reproduced: clusters 1 through N attach to the ICN; each cluster contains processors, caches, a DSM memory module, a directory, and an interconnection controller; the DSM modules of all clusters together form the shared address space.]
Figure DSMU1: Structure and organization of a DSM system (source: [Protic96a]).
Legend:
ICN Interconnection network.
Comment:
Note that each cluster (node) can be a uniprocessor system, or a relatively sophisticated multiprocessor system. In a number of designs, both clusters and interconnection network are off-the-shelf systems. Only the
DSM mechanisms are custom made, in hardware, software, or a combination of the two.
Nodes of a DSM system are referred to as clusters in Figure DSMU1. In reality, a cluster
can include a single microprocessor or an entire SMP system. Systems can be uniform (all
clusters implemented as a single microprocessor, or all clusters implemented as an SMP) or
heterogeneous (different clusters based on different architectures).
The basic elements of a cluster are: (a) one or more processors with their level 1 (L1) and
level 2 (L2) caches, (b) a part of memory with the memory consistency maintenance directory, and (c) an interconnection network interface, which is more or less sophisticated.
Finally, the logical DSM memory address space is obtained by combining the memory
address spaces of physical modules placed in different processing nodes. Address spaces of
different physical modules can logically overlap (which is the case in systems with replication), they can be logically non-overlapping (which is the case in systems with migration), or the overlap can be partial (as indicated in the bottom left corner of Figure DSMU1).
The MRSW algorithm implies that several nodes are allowed to read during a given period of time, but only one node is allowed to write during a given period of time. In the case
of reading, no migration of memory consistency units is needed. Migration is needed only
in the case of writing. This approach is compatible with typical application scenarios,
where multiple consumers use data generated by one producer. Consequently, this type of
algorithm is used in most of the traditional DSM machines.
The MRMW algorithm implies that all or most nodes are allowed both to read and to
write during a given period of time. In systems with replication, in write-through update-all
systems (like RMS or MMM), whenever a node writes to a shared variable, the write is
also sent to all other nodes which potentially need that value, and the remote nodes get updated after a network delay. If the network serializes all writes in the system, data consistency will be preserved, and each node can read shared variables from its own local physical memory. In systems without replication, in write-back systems, and in systems in which
the interconnection network does not serialize the writes, additional action (on the software
or hardware side) is needed in order to preserve data consistency. In systems with migration, appropriate mechanisms have to be applied (in software or in hardware) to help preserve data consistency.
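The MRSW discipline can be sketched in a few lines. This is an invented, minimal model (names are illustrative): reads replicate the memory consistency unit, while a write migrates ownership and invalidates the other copies.

```python
# Minimal MRSW sketch for one memory consistency unit:
# many readers replicate the unit; the single writer invalidates
# all other copies and takes ownership (migration on write).

class MRSWUnit:
    def __init__(self, home, value=0):
        self.owner = home          # the single permitted writer
        self.copies = {home}       # nodes holding a read replica
        self.value = value

    def read(self, node):
        self.copies.add(node)      # replication: no migration needed on read
        return self.value

    def write(self, node, value):
        self.copies = {node}       # invalidate every other copy
        self.owner = node          # ownership migrates to the writer
        self.value = value

u = MRSWUnit(home=0, value=10)
assert u.read(1) == 10 and u.read(2) == 10   # multiple readers coexist
u.write(2, 99)                                # a write collapses the copy set
print(sorted(u.copies), u.owner)              # [2] 2
```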
The processor MCM implies that all nodes see each particular processor's data access stream in the
same order; however, each node may see the global data access stream in a different order. Processor MCM is typically supported on systems which use BRL-type networks and include ICN access
buffers that forward data to the ICN using different speeds and priorities. For example,
in the same scenario as in the previous paragraph, the possible orders of writes that different
processors in the system can see are: ABCD, ACBD, ACDB, CDAB, CADB, and CABD.
Note that in each case A is before B, and C is before D. This means that each processor in
the system sees the same order of writes of processor P1 (A before B), and the same
order of writes of processor P2 (C before D).
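The example above can be checked mechanically. The following sketch enumerates all interleavings of the four writes and keeps only those that preserve each processor's own order; exactly the six orders listed remain.

```python
# Check of the processor-consistency example: the observable global
# orders are exactly the interleavings that keep A before B (P1's
# program order) and C before D (P2's program order).
from itertools import permutations

valid = sorted(
    "".join(p) for p in permutations("ABCD")
    if p.index("A") < p.index("B") and p.index("C") < p.index("D")
)
print(valid)
# ['ABCD', 'ACBD', 'ACDB', 'CABD', 'CADB', 'CDAB'] -- the six orders in the text
```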
The weak MCM (as well as all other MCMs to follow) implies that special synchronization primitives are incorporated into the program code. In between the synchronization
points, memory consistency is not maintained, and programmers have to be aware of
that fact, to avoid semantic problems in the code. Synchronization points are referred to as
either (a) acquire, at the entry point of a critical section, or (b) release, at the exit point of
the critical section. In the weak MCM, consistency is maintained at both synchronization
points, and the code is not allowed to proceed past a synchronization point unless
memory consistency has been established. Synchronization points themselves have to
follow the sequential consistency rule (all nodes seeing the same global order of accesses).
The release MCM implies that memory consistency has to be established only at release
points, and the release points by themselves have to follow the processor consistency rule.
At each release point, all other processors/processes in the system get updated with the
changes to variables made prior to the release point of the critical code section just completed. This generates some potentially unnecessary ICN traffic, which may have a negative
impact on the speed of the application code.
The lazy release MCM implies that memory consistency has to be established only at the
next acquire point, which means that the ICN traffic will include only the updated variables
that might be used by the critical code section which is just about to start. This means less
ICN traffic, but more buffering space at each processor, to keep the updated variables until
the start of all critical code sections to follow. The better execution speed of the application
code is paid for with more data buffering at all nodes. Note that some of the updated variables
will not be used in the critical code section to follow. Nevertheless, such variables will contribute to the ICN traffic, because they might be used, and consequently have to be passed
over to the next critical code section, which is about to start running on another processor.
The entry MCM implies that each shared variable (or each group of shared variables) is
protected by a synchronization variable (a critical section is bounded by a pair of synchronization accesses to the synchronization variable). Consequently, updated variables will be
passed over the network if and only when absolutely needed by the follow-up critical code
sections. Such an approach leads to potentially the best performance. However, concrete performance depends a lot on the details of the practical implementation. The
first implementation of entry consistency (the Midway project at CMU) requires the programmer to be the one to protect the specific shared variables (or the groups of shared variables).
If such an implementation of entry MCM is compared with lazy release MCM, the differences are
small; for some of the SPLASH-2 applications [Woo95] entry MCM works better, and for
other SPLASH-2 applications lazy release MCM works better [Adve96]. This author believes that, with an appropriate implementation, entry MCM is able to provide better performance for most SPLASH-2 applications.
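The binding of shared variables to synchronization variables can be sketched as follows. This is a hypothetical API, not the Midway interface: acquiring a lock makes consistent only the variables that lock guards.

```python
# Sketch of the entry-consistency idea: each shared variable is bound
# to a synchronization object, and an acquire brings over ONLY the
# variables guarded by that object.

class Lock:
    def __init__(self, guarded):
        self.guarded = set(guarded)   # variables this lock protects

class EntryConsistentNode:
    def __init__(self):
        self.view = {}                # this node's (possibly stale) copies

    def acquire(self, lock, shared):
        # transfer only the variables bound to the acquired lock
        for name in lock.guarded:
            self.view[name] = shared[name]

s1 = Lock(guarded=["x"])
shared = {"x": 1, "y": 2}             # up-to-date global state
node = EntryConsistentNode()
node.acquire(s1, shared)
print(node.view)                      # {'x': 1} -- y is not transferred
```

Only `x` crosses the network; under release or lazy release consistency, `y` might have been shipped as well, which is the traffic saving the text describes.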
being used by the processor/process being updated. Page fragmentation happens when a
processor/process needs only a part of the page, but no finer granularity is available. Both problems are alleviated through the use of relaxed MCMs, which permit the system to delay the coherence activity from the time when the modification happens to the
time of the next synchronization point of interest (release, acquire, or both). Postponing
the coherence activity means that some of the ICN traffic will not happen. On the other
hand, hardware-implemented DSM typically uses cache blocks as memory coherence units,
which means that the impacts of false sharing and fragmentation are not so severe, and that
there is no urgent need to implement relaxed MCMs in hardware, except for uniformity reasons
when everything is done in hardware (like the relaxed MCM in DASH), or when the speedup
of a software solution is needed (like AURC and SCOPE in SHRIMP).
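False sharing at page granularity can be made concrete with a small, invented example: a node is invalidated because another node wrote a *different* variable that merely lives on the same page.

```python
# Illustrative sketch (invented scenario): with page-sized coherence
# units, a write to one variable invalidates copies held by nodes that
# only use OTHER variables on the same page -- false sharing.

PAGE_SIZE = 4096

def page_of(addr):
    return addr // PAGE_SIZE

def count_false_invalidations(writes, interest):
    """writes: list of (writer, addr); interest: {node: set of addrs it uses}."""
    false_inv = 0
    for writer, addr in writes:
        for node, addrs in interest.items():
            if node == writer:
                continue
            same_page = any(page_of(a) == page_of(addr) for a in addrs)
            needed = addr in addrs
            if same_page and not needed:
                false_inv += 1     # invalidated although the data is not needed
    return false_inv

# node 1 only reads address 8; node 2 writes address 16 -- same page
print(count_false_invalidations([(2, 16)] * 3, {1: {8}, 2: {16}}))  # 3
```

With a relaxed MCM, these invalidations could be postponed to the next synchronization point, and some of them would never be sent at all.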
The major MCMs will now be revisited from another point of view, using pictures and
specific examples (definitions of all mnemonics used in the pictures have been given in figure captions).
Munin:
REL release,
ACQ acquire.
Comment:
Note that Dash (hardware implementation) and Munin (software implementation) create different interconnection network traffic patterns. Dash accesses the network after each write, which means more uniform
traffic. Munin accumulates the information from all writes within the critical section, and accesses the network at the release point, which means bursty traffic. This difference is due to inherent characteristics of
hardware (low overhead for single-data broadcast, multicast, or unicast) and software (high overhead for single-data broadcast, multicast, or unicast). Note that release consistency in general, no matter whether implemented
in hardware or software, updates/invalidates more processors than needed, which is a drawback to be overcome by lazy release consistency, at the cost of extra buffering in the system. In theory, release consistency can be based on either the update or the invalidate approach.
The Rice Munin project includes a predominantly software implementation of release
consistency, which is depicted in the middle part of Figure DSMU3. All updated values are
collected in a special-purpose buffer, and the update is done only at the release point,
which is more convenient in software implementations. Obviously, the software implementations create bursty ICN traffic.
The example in the lower part of Figure DSMU3 underlines the fact that release consistency broadcasts the updated values to all potentially interested processors, in spite of the
fact that many of them will not need the data. This picture is provided as a contrast to the picture explaining lazy release consistency, which follows next.
Message traffic:
[Timeline not reproduced: P1 performs W(x) and REL(S1); P2 performs ACQ(S1), W(x), and REL(S1), with x bound to S1; P3 performs ACQ(S2), W(y), and REL, with y bound to S2; no message traffic for x is generated at P3's synchronization.]
It is immediately seen that (in this idealized example) entry consistency generates even
less ICN traffic than lazy release consistency. As indicated earlier, this potential performance improvement is paid for (in the Midway implementation) with extra effort on the programmer's side.
The EC (entry consistency) implementation of Midway is a single-writer protocol using
updating (other implementations are possible as well). Entry consistency guarantees that
shared data become consistent only when the processor acquires a synchronization object.
Designers of the Midway system were aware of the fact that the advantages of entry consistency may be difficult to obtain if only limited programming effort is possible; consequently, in addition to entry consistency, Midway also supports release and processor consistency.
[Figure DSMU7 timelines not reproduced: read-fault (RdFt) handling of R(x) after another node's ACQ, W(x), REL sequence; the Copyset-2 AURC case, in which an automatic update (AuUp) and a Flush keep the second copy current; and the Copyset-N AURC case, with an owner node P0 and copies Copy1 through CopyN-1.]
As indicated in Figure DSMU7, the Copyset-2 mechanism implies automatic hardware-based update of the tables in the home processor, as soon as a value is updated. The Copyset-N
mechanism is a logical extension of the Copyset-2 mechanism (this means that the Copyset-2 mechanism is directly supported in hardware, and Copyset-N is synthesized out of the appropriate number of actions of the Copyset-2 mechanism). Details of AURC are given in Figure DSMU7, using one version of AURC (the AURC idea has undergone several modifications). Details not covered in Figure DSMU7 are left to the interested reader, to be developed as homework.
AURC
1.
2.
3. On the acquire point, the acquiring node sends a message to the last owner of the lock (indirect through the home of the lock, which is statically assigned), as in LRC.
4. From the last owner, along with the lock, the acquiring node gets the "write notices" which indicate all pages that were updated in the past of that acquire, according to the happens-before partial ordering (as in LRC).
5. Pages indicated in the write notices are invalidated, unless the versions of the local copies are posterior to the versions indicated in the write notices. Versions are represented as a timestamp vector and can be constructed from the write notices received.
6. At the faulting time, the process sends the request for the page to the home, indicating the version.
7. The home keeps the copy set (list of sharers) and the updated version vector telling which locks were writing to a particular version of each page, in the set of pages for which it is the home. Each node keeps just one element of the version vector: the vector element related to the pages replicated in that specific node. The home replies if it has at least the required version; otherwise it waits until it reaches that version. Along with the page, the home also sends the current version of the page, which can be greater than the one that was requested. This acts as a prefetch, which could save future invalidations.
8. Before the page fault is resumed, the faulting node write-protects the page in order to detect the next update.
9.
10. Each write is performed into the local copy and propagated as an automatic update to the home, and also performed into the home's copy. Other nodes will be updated after they acquire the lock and after they contact the home to obtain the last update of the page. This decreases the ICN traffic, in comparison with some other reflective (write-through-update) schemes, which update ALL sharers using a hardware reflection mechanism (RM and/or MC). The major issue here is to minimize the negative effects of false sharing, in cases when page sizes are relatively large and several processes (locks) share the same page.
11.
12. A new interval starts. All pages from the update list are write-protected and the update list is emptied.
13. The above implements the lazy release consistency model through a joint effort of hardware and software (unlike TreadMarks, which does the entire work in software).
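The invalidation rule of step 5 (invalidate unless the local copy's version is posterior to the one named in the write notice) can be sketched as follows. The data structures and names are invented for illustration, not taken from the AURC implementation.

```python
# Sketch of step 5: at acquire time, invalidate each page named in a
# write notice unless the local timestamp vector already covers the
# required version.

def is_posterior(local, required):
    """True if the local timestamp vector covers the required one."""
    return all(l >= r for l, r in zip(local, required))

def apply_write_notices(pages, notices):
    """pages: {page: (valid, version)}; notices: {page: required version}."""
    for page, required in notices.items():
        valid, version = pages.get(page, (False, None))
        if valid and not is_posterior(version, required):
            pages[page] = (False, version)   # invalidate the stale copy
    return pages

pages = {7: (True, [2, 0, 1]), 9: (True, [5, 5, 5])}
notices = {7: [2, 1, 0], 9: [4, 0, 0]}       # received from the last owner
print(apply_write_notices(pages, notices))
# page 7 is invalidated ([2,0,1] does not cover [2,1,0]); page 9 is kept
```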
The AURC scheme has been implemented in the SHRIMP-2 project at Princeton.
AURC is a special case of a more general class of LRC protocols which some call
home-based LRC protocols (HLRC). Researchers at Princeton have implemented an
HLRC on a 64-node Intel Paragon. Later, they implemented an HLRC on Typhoon-0.
The basic difference between HLRC and the canonical LRC is how the updates are collected. In LRC (which some call "homeless," distributed, or standard LRC) the diffs are
kept everywhere, and are fetched on demand and applied to the local copy in order to bring
it up to date.
In HLRC, the updates are propagated to the home using diffs (instead of automatic update) at release time, and applied to the home's copy. Later, at page fault time, the
whole page is fetched from the home, instead of fetching diffs potentially from multiple
nodes.
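The diff mechanism referred to here can be sketched in a few lines (an illustrative model, not the TreadMarks or SHRIMP code): before the first local write, the page is copied to a "twin"; later, the byte-wise difference between the page and its twin is produced, shipped, and applied to the home's copy.

```python
# Sketch of the twin/diff mechanism used by LRC-style protocols:
# only the bytes that changed since the twin was taken are propagated.

def make_twin(page):
    return bytearray(page)                 # copy taken on the first write

def make_diff(page, twin):
    # map of changed byte positions -> new byte values
    return {i: b for i, (b, t) in enumerate(zip(page, twin)) if b != t}

def apply_diff(home_copy, diff):
    for i, b in diff.items():
        home_copy[i] = b
    return home_copy

page = bytearray(b"hello world")
twin = make_twin(page)
page[0:5] = b"HELLO"                       # local writes after the twin
diff = make_diff(page, twin)               # 5 bytes, not the whole page
home = apply_diff(bytearray(b"hello world"), diff)
print(len(diff), bytes(home))              # 5 b'HELLO world'
```

In HLRC the diff is sent to the home at release time and then discarded, which is why the diff-storage overhead of standard LRC disappears.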
The advantage of HLRC is simplicity and scalability. The memory overhead (which can
be huge in standard LRC due to diff storage, and may require garbage collection) is negligible in HLRC, since diffs are discarded after they are sent home. The disadvantages of
HLRC are:
(a) it may sometimes send more data (pages instead of diffs,
but diffs can also accumulate, and bandwidth is not a problem now), and
(b) its home choice can sometimes create hot spots
(or twice as many messages as LRC for migratory data patterns).
However, running HLRC and LRC on up to 64 nodes confirmed that HLRC scales better
than the standard LRC. Moreover, AURC can be viewed now as an HLRC protocol, which
uses the AU mechanisms to propagate the updates.
SCOPE
1. The SCOPE is the same as the AURC, except for the following differences.
2. Write notices are kept per lock rather than per processor.
3. The version update vector at the home includes one more field, telling which variables (addresses) were written into by a given lock; this is repeated for each version of the page. At the acquire time, the acquiring node receives from the last releaser only the write notices corresponding to that lock.
4.
5.
6. However, unlike with the ENTRY, the activity upon invoking a variable is related to the entire page, i.e., the entire page is brought over, if necessary.
7. In ScC, a node keeps a list of update lists, one for each lock (scope) which is open. This means page-based maintenance and more traffic, rather than object-based maintenance and less traffic (as in Midway). In principle, Midway could also do page-based maintenance, but it does not.
8. At the release point, the processor increments its local timestamp and the epoch number of the lock. The epoch number is used at acquire time to determine which write notices to send to the next acquirer.
9. At the barrier time, all write notices since the last barrier (global scope) are performed. This ensures that, when the barrier is reached, the entire address space is made coherent and the local scopes can be redefined.
14. The SCOPE research brings up a number of new ideas: merge, diff-avoid.
Figure DSMU8: SCOPE Revisited.
Comment:
The SCOPE idea went through a number of different versions, and the one selected for presentation here
may not correspond to some of the implementations of ScC.
t1 < t2 < t3 < t4 < t5 (synchronization points):
P1: BARRIER; A := 1; acquire S1; B := 2; release S1; C := 3; BARRIER
P2: BARRIER; acquire S1; b := B; release S1; BARRIER
P3: BARRIER; acquire S2; a := A; release S2; BARRIER
Assume that variables A, B, and C are shared and the others are private; assume further
that all reads and writes not shown are local. Symbol ti denotes a point of time at which a
synchronization operation (acquire, release, barrier) occurs. Different models treat this code
in different ways, and they are ordered here by the amount of extra programming effort required:
entry consistency: As in the previous case, B = 2 is guaranteed to be visible to P2 immediately after t3 (provided that the lock S1 guards B), but no further assumptions are
made, not even about what will be visible after the epoch. In order to ensure visibility of
shared variables updated outside of critical sections, the programmer must either (a) use
read-only locks to read the shared but not propagated values when they are needed,
(b) explicitly bind them to the barrier, or (c) bind the entire address space to the
barrier (the latter two when visibility in the next epoch is the goal).
cache) until it realizes that an updated value is ready to be retrieved from the producer's
cache into the consumer's cache. At that moment, the synchronization delay is over. Finally, during the read delay interval, the consumer's cache gets updated.
Producer-initiated communication (for the same scenario) works as follows. As the data
are produced in the producer's cache, they are forwarded directly to the consumer's cache. As
before, the write is completed, and the write delay is over, after the last write is acknowledged. The write to the synchronization variable also gets forwarded from producer to consumer (the consumer reads, rather than fetches), and the synchronization delay is shorter. After
the synchronization delay is over, the data are already in the consumer's cache, and the consumer
reads them from its cache. Because the consumer reads data from the cache (no cache miss involved), the read delay is much shorter.
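The contrast can be put into a back-of-envelope cost model. The latencies below are invented assumptions, chosen only to show the shape of the argument: in the consumer-initiated case the read misses and pays a remote round trip; in the producer-initiated case the data were forwarded ahead of time and the read hits locally.

```python
# Toy cost model (assumed latencies) for the two scenarios above.

ROUND_TRIP = 100   # assumed remote-miss latency, in cycles
CACHE_HIT = 2      # assumed local-hit latency, in cycles

def read_delay(producer_initiated):
    # consumer-initiated: fetch from the producer's cache on demand;
    # producer-initiated: data were already forwarded into the consumer's cache
    return CACHE_HIT if producer_initiated else ROUND_TRIP

print("consumer-initiated read delay:", read_delay(False))  # 100
print("producer-initiated read delay:", read_delay(True))   # 2
```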
These two simple examples demonstrate the potential. However, the question is how
much of this potential would materialize for a real application, and for different implementations of producer-initiated mechanisms. One classification of DSM communication mechanisms is given in Figure DSMU11 [Byrd99].
[Timeline diagrams not reproduced: in the consumer-initiated case, READ and INV messages pass between producer and consumer, giving a write delay, a synchronization delay, and a read delay; in the producer-initiated case, data are forwarded (FWD) from producer to consumer, so the write delay stays the same while the synchronization and read delays shrink.]
Legend:
FWD Forward
Comment: Write delay stays the same. Synchronization delay and read delay become shorter.
Simulation was done for concrete values of the major design parameters, as specified in Figure DSMU12. The simulation study of [Byrd99] has shown that producer-initiated mechanisms typically improve performance, but consumer-initiated mechanisms are less expensive
to implement.
[Figure DSMU11, not reproduced in full: classification of DSM communication mechanisms [Byrd99]. Consumer-initiated mechanisms (invalidate, prefetch) and producer-initiated mechanisms (data forward, message passing, update, lock-selective update) are compared by transfer targets, transfer initiation (implicit or explicit), transfer granularity (line, block, word, or message), synchronization granularity, and relative write, synchronization, and read delays.]
Processors              | 64
L1 cache size           | 8 Kbytes
L1 write buffer         | 8 lines
L2 cache size           | unlimited
L2 outstanding accesses | 16
L2 line size            | 8 words (64 bytes)
L2 access               | 3 cycles
Network latency         | 4 cycles/hop
Network bandwidth       | 4 bytes/cycle
Network topology        | hypercube
Memory access           | 24 cycles
Memory bandwidth        | 1 word/cycle
Figure DSMU12: Base system parameters used in the simulation study (source: [Byrd99]).
Legend:
L1,2 Level 1,2 cache memory
Comment: Cache and network parameters have been chosen to be consistent with the current generation
technology. The release consistency memory model was chosen.
Name and Reference      | Type of Implementation                                              | Type of Algorithm                | Consistency Model         | Granularity Unit        | Coherence Policy
-                       | user-level library + OS modification                                | MRSW                             | sequential                | 1 Kbyte                 | invalidate
-                       | user-level library + OS modifications                               | MRSW                             | sequential                | 1 Kbyte, 8 Kbytes       | invalidate
Munin                   | runtime system + linker + library + preprocessor + OS modifications | type-specific (SRSW, MRSW, MRMW) | release                   | variable-size objects   | -
Midway [Bershad93]      | runtime system + compiler                                           | MRMW                             | entry, release, processor | 4 Kbytes                | update
TreadMarks [Keleher94]  | user-level                                                          | MRMW                             | lazy release              | 4 Kbytes                | update, invalidate
Blizzard [Schoinas94]   | user-level + OS kernel modification                                 | MRSW                             | sequential                | 32-128 bytes            | invalidate
Mirage [Fleisch89]      | OS kernel                                                           | MRSW                             | sequential                | 512 bytes               | invalidate
Clouds [Ramachandran91] | OS, out of kernel                                                   | MRSW                             | inconsistent, sequential  | 8 Kbytes                | discard segment when unlocked
Linda [Ahuja86]         | language                                                            | MRSW                             | sequential                | variable (tuple size)   | implementation dependent
Orca [Bal88]            | language                                                            | MRSW                             | synchronization dependent | shared data object size | update
(A dash marks an entry not recoverable from the original table; the names and references of the first two rows were lost.)
Figure DSMU13: A summary of software-implemented DSM (source: [Protic96a]).
Legend: OS Operating System.
Comment:
Note that the bulk of software DSM research starts in the mid to late 80s. This author especially likes the selective approach of Munin, which uses different approaches for different data types in almost all major aspects
of DSM, and introduces the release memory consistency model. TreadMarks introduces the lazy release
memory consistency model. Midway introduces the entry memory consistency model.
Name and Reference | Cluster Configuration                        | Type of Network | Type of Algorithm | Consistency Model | Granularity Unit | Coherence Policy
Memnet [Delp91]    | single processor, Memnet device              | token ring      | MRSW              | sequential        | 32 bytes         | invalidate
Dash [Lenoski92]   | SGI 4D/340 (4 PEs, 2-L caches), local memory | mesh            | MRSW              | release           | 16 bytes         | invalidate
SCI [James94]      | arbitrary                                    | arbitrary       | MRSW              | sequential        | 16 bytes         | invalidate
Figure DSMU14: A summary of hardware-implemented DSM (source: [Protic96a]).
In software implementations, the majority of newer systems support OS (operating system) level and OC (optimizing compiler) level approaches. Most of the systems use
MRSW algorithms; however, new systems typically explore MRMW algorithms. Sequential MCM is still widely represented, in spite of its lower performance; newer approaches
are typically oriented toward more sophisticated MCMs. The granularity of the memory consistency
unit tends to be larger (pages of size below 1 Kbyte are rarely used). Invalidate consistency
maintenance protocols seem to be more frequently used than update protocols. Of
course, the conclusions from tables of this sort will change as time goes by, and new
research ideas find their way into research prototypes and/or industrial products.
In hardware implementations, the majority of systems support clusters with multiple processors (typically SMP). As for the type of ICN, the variety is large. The MRSW algorithm is more frequently used than the MRMW algorithm. Again, sequential consistency
prevails. The granularity unit is at most 128 bytes. The consistency maintenance protocols
are usually of the invalidate type. Again, the conclusions from tables of this sort will
change as time goes by.
Name and Reference          | Cluster Configuration + Network                    | Type of Algorithm | Consistency Model | Granularity Unit | Coherence Policy
PLUS [Bisani90]             | M88000, 32K cache, 8-32M local memory, mesh        | MRMW              | processor         | 4 Kbytes         | update
Galactica Net [Wilson94]    | 4 M88110s, 2-L caches, 256M local memory, mesh     | MRMW              | multiple          | 8 Kbytes         | update/invalidate
Alewife [Chaiken94]         | Sparcle PE, 64K cache, 4M local mem, CMMU, mesh    | MRSW              | sequential        | 16 bytes         | invalidate
FLASH [Kuskin94]            | MIPS T5, I+D caches, MAGIC controller, mesh        | MRSW              | release           | 128 bytes        | invalidate
Typhoon [Reinhardt94]       | SuperSPARC, 2-L caches, NP controller, custom      | MRSW              | custom            | 32 bytes         | custom invalidate
Hybrid DSM [Chandra93]      | FLASH-like                                         | MRSW              | release           | variable         | invalidate
SHRIMP [Iftode96a]          | 16 Pentium PC nodes, Intel Paragon routing network | MRMW              | AURC, scope       | 4 Kbytes         | update/invalidate
Figure DSMU15: A summary of hybrid hardware/software-implemented DSM
(source: [Protic96a]).
Legend:
CMMU - Cache Memory Management Unit;
NP - Network Protocol.
Comment:
Note that the bulk of hybrid DSM research starts in the early to mid 90s. The Princeton SHRIMP-2 project introduces the AURC and SCOPE memory consistency models.
In hybrid implementations, the majority of systems are based on SMP multiprocessor clusters, using off-the-shelf microprocessors. Since these machines are of a more recent date, the MRMW and MRSW algorithms are used about equally often. For the same reason, there is a variety of MCMs in use. The granularity units vary from as low as 16 bytes to as high as 8KB. Update and invalidate consistency protocols are used about equally often.
These three tables assume the classification of DSM systems according to the type of implementation (hardware, software, or hybrid) as the major classification criterion. Other classifications are possible, too. One widely used classification is based on the access pattern characteristics as the major classification criterion. In that classification, the major classes are: (a) UMA or uniform memory access, (b) NUMA or non-uniform memory access, and (c) COMA or cache-only memory access. Each class can be further subdivided into a number of subclasses: (a) F-COMA or Flat COMA versus H-COMA or Hierarchical COMA, (b) CC-NUMA or cache-coherent NUMA versus NCC-NUMA or non-cache-coherent NUMA, etc.
[Figure: Organization of an RM/MC node. The TMI and HPI boards, with their TX/RX windows and TX/RX buffers, connect the node (host bus, local bus, and memory) to the RM/MC bus.]
The RM/MC system combines the potential for high bandwidth with low latency. This system has some advantages because it is bus-based. A broadcast mechanism is easy to implement. Update messages are small, consisting basically of a virtual address and data (there are no special headers and trailers, etc.). The propagation time for update messages is small, because the interconnection medium spends only one bus cycle (in the case of RMS, where the bus is non-multiplexed) to propagate an update message to all nodes. The system tolerates node failures without service disruption.
The main disadvantage is the low system scalability. Also, caching of RM is disabled, because the HPI board, as a slave on the system bus, cannot initiate the update/invalidate messages needed to keep the cache memory consistent upon receiving an update from the RM/MC bus. Short (word) messages and long (block) messages share the same FIFO buffers, so the short messages always have to wait behind the long ones.
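The reflective memory idea underlying this discussion can be sketched in a few lines: a write to the shared region is broadcast on the bus as a small (address, data) update message and applied to every node's local copy, so later reads are purely local. This is our own minimal model, not Encore's actual interface.

```python
# Minimal reflective-memory sketch (our illustration; the Node class
# and rm_write function are hypothetical, not Encore RM/MC code).

class Node:
    def __init__(self, size):
        self.memory = [0] * size   # local copy of the reflective region

def rm_write(nodes, address, data):
    """A write is reflected as a small (address, data) update message;
    one broadcast on the bus updates every node's local copy."""
    for node in nodes:
        node.memory[address] = data

nodes = [Node(16) for _ in range(4)]
rm_write(nodes, address=5, data=42)
assert all(n.memory[5] == 42 for n in nodes)   # every copy updated
assert nodes[2].memory[5] == 42                # reads are now local
```

Note how the message carries no headers or trailers beyond the address and the data, which is exactly why the propagation cost stays at about one bus cycle per word.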
Differences between various reflective memory systems are summarized in Figure DSMU18 [Jovanovic99], while their strengths/weaknesses are compared in Figure DSMU19 [Jovanovic99].
System | Academic/Industrial | University/Company | Year Introduced | Interconnection Network | URL
RMS/Multimax | I | Encore | 1991 | bus | http://www.encore.com
RMS for PC | A+I | Belgrade/Encore | 1993 | bus | http://galeb.etf.bg.ac.yu/~dsm
RM/MC | I | Encore | 1993 | bus | http://www.encore.com
RM/MC++ | A+I | Belgrade/Encore | 1994 | bus | http://galeb.etf.bg.ac.yu/~dsm
LAM | I+A | Florida/Encore | 1996 | bus | http://www.encore.com
MMM | I | Modcomp | 1993 | bus | http://www.modcomp.com
RSM | A | Tokyo | 1994 | bus | http://www.sail.t.u-tokyo.ac.yp/~oguchi
MC | I | DEC | 1996 | crossbar | http://www.digital.com
NSM | I | ATC | 1994 | ring | http://www.atcorp.com
SCRAMNet | I | Systran | 1991 | ring | http://www.systran.com
VMIC | I | VMIC | 1995 | ring | http://www.vmic.com
Sesame | A | Stony Brook | 1991 | ring | ftp://ftp.cd.sunyb.edu/pub/techreport/wittie
SHRIMP | A | Princeton | 1994 | ring | http://www.cs.princeton.edu/shrimp
System | Processor Count | Sharing Granularity | RM Mapping | Update Granularity | MCM Protocol | Caching Included | Main Applications
Encore RMS | 8 | page | dynamic | word | PC | no | real-time
RMS for PC | 16, 32, more | page | dynamic | word | PC | yes | real-time
Encore RM/MC | 8 | page | dynamic | word & block | PC | no | OLTP
RM/MC++ | 8 | page | dynamic | word & block | PC | no | OLTP
Encore LAM | 8 | segment | dynamic | word & block | EC | no | OLTP
Modcomp MMM | 8 | page | dynamic | word | SC | no | real-time
RSM | 8 | segment | static | word | RC | no | real-time
DEC MC | 16 | page | dynamic | word & block | PC | yes | client-server
NSM | 60 | segment | static | word | SC | no | S&E
SCRAMNet+ | 256 | segment | static | word | PC | no | real-time
VMIC Network | 256 | segment | static | word | PC | no | real-time
Sesame | >1000 | page | dynamic | word | PC | no | S&E
SHRIMP | >1000 | page | dynamic | word & block | AURC | yes | client-server
Figure DSMU18: Differences between the presented reflective memory systems (source: [Jovanovic99]).
System
Advantages
Disadvantages
low scalability,
high cable complexity,
limited number of nodes
Encore RMS
RMS for PC
better scalability,
data filtering
Encore RM/MC
widely applicable
low scalability,
high cable complexity,
limited number of nodes
RM/MC++
Encore LAM
Modcomp
MMM
RSM
widely applicable,
prioritization,
better control of word and
block streams
widely applicable,
entry consistency,
shared data
are organized as segments
hard and soft real-time
applications
release MCM,
lazy updates
extremely high bandwidth,
caching,
update acknowledgements,
remote read primitive,
sender could bypass
local memory
low scalability,
compiler
has to be modified
caching,
hierarchy of RM buses
caching,
multiple memory pools,
prioritization,
hierarchy of RM buses
poor scalability,
dual function of the
VME bus
RM concept in software,
only word updates
bus hierarchy,
system bus
dynamic RM mapping,
block update granularity
homogeneous nodes,
low number of nodes
heterogeneous computing,
relaxed MCM,
inclusion
of some kind of hierarchy,
segment sharing granularity
Hardware support
for synchronization
no overlapping
computation
with communication,
each node keeps a copy
of the entire RM space
SCRAMNet+
data filtering,
modular media interface,
heterogeneous computing,
merging word updates
RM is limited to 8MB,
each node keeps a copy
of the entire RM space
dynamic RM mapping,
block update granularity
VMIC Network
on-board RM is SRAM,
heterogeneous computing
dynamic RM mapping,
block update granularity,
data filtering
no broadcast mechanism
DEC MC
NSM
Sesame
SHRIMP
high scalability,
heterogeneous computing,
merging of word updates,
distributed synchronization,
dynamic sharing
high scalability, caching,
sophisticated relaxed MCMs,
deliberate updates,
word and block updates,
merging of word updates
Figure DSMU19: Strengths and weaknesses of the presented reflective memory systems.
Legend:
MCM - Memory Consistency Model.
Comment: See [Jovanovic99] for a discussion of possible improvements of the RM approach.
Access delays in DASH were as follows: (a) local fill, 29 processor clocks; (b) fill from the home node, 101 processor clocks; and (c) fill from a remote node, 132 processor clocks. The length of the processor clock was 30ns, which reflects the technology capabilities at the time of the DASH implementation.
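A quick check converts the delays quoted above into nanoseconds with the 30ns processor clock of the prototype:

```python
# Convert the DASH access delays (in processor clocks) to nanoseconds,
# using the 30 ns clock period quoted in the text.
CLOCK_NS = 30
delays_clocks = {"local fill": 29, "home-node fill": 101, "remote fill": 132}
delays_ns = {kind: clocks * CLOCK_NS for kind, clocks in delays_clocks.items()}
assert delays_ns["local fill"] == 870        # 29 * 30 ns
assert delays_ns["home-node fill"] == 3030   # 101 * 30 ns
assert delays_ns["remote fill"] == 3960      # 132 * 30 ns
```

The roughly 4.5x gap between a local fill and a remote fill is what motivates the latency hiding techniques discussed next.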
The DASH prototype was used to experiment with three different latency hiding techniques: (a) memory consistency models, (b) prefetching, and (c) the remote access cache (RAC).
DASH supports the relaxed memory consistency models. See the details in the section on memory consistency models earlier in this book. DASH supports software prefetch. It allows the processor to specify the address of a data item to be fetched before it is actually needed.
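The benefit of software prefetch can be captured in a one-line model (our own illustration, not DASH code): issuing the fetch some number of cycles before the use overlaps the memory latency with computation, so only the uncovered remainder stalls the processor.

```python
# Minimal prefetch model (hypothetical numbers except the 132-clock
# remote-fill latency quoted for DASH above).

def stall_cycles(latency, lead):
    """Remaining stall when the data is used `lead` cycles after the
    prefetch was issued; anything beyond the latency is fully hidden."""
    return max(0, latency - lead)

assert stall_cycles(latency=132, lead=0) == 132    # no prefetch: full stall
assert stall_cycles(latency=132, lead=100) == 32   # most latency hidden
assert stall_cycles(latency=132, lead=150) == 0    # latency fully hidden
```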
DASH introduced the remote access cache approach, to allow remote accesses to be combined and buffered within the individual nodes. It stores remote data that were accessed recently. If a remote data item is requested, and it is included in the RAC, it comes from the RAC. The RAC is useful when two different processors of the same 4-processor node use the same data, or when data is not "competitive" enough to be captured in the regular cache, but is "competitive" enough to be captured in the RAC.
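The first of these two cases can be sketched as follows (a hedged illustration of the idea, not the DASH implementation): all processors of a node consult a shared per-node cache before going remote, so a line fetched by one processor can later satisfy its neighbor without a second remote access.

```python
# Sketch of the remote access cache (RAC) idea; the class and the
# data placeholder strings are our own, not DASH internals.

class NodeWithRAC:
    def __init__(self):
        self.rac = {}           # per-node cache shared by all processors
        self.remote_fetches = 0

    def access(self, address):
        if address in self.rac:     # hit: served within the node
            return self.rac[address]
        self.remote_fetches += 1    # miss: costly remote access
        self.rac[address] = f"data@{address}"
        return self.rac[address]

node = NodeWithRAC()
node.access(0x100)        # processor A of the node fetches remotely
node.access(0x100)        # processor B hits in the shared RAC
assert node.remote_fetches == 1   # only one remote access was needed
```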
A reincarnation of the concept appeared in the SGI Origin, except that each node includes 2 processors, and many improvements were made, to reflect the lessons learned from the DASH research and prototype implementation. For the lessons learned from the DASH project and for new challenges inspired by the DASH project, see [Hennessy99].
Multiprocessors like the HP Exemplar, Sequent NUMA-Q, DG NUMA-Liine, and HAL S-1 can be conditionally treated as reincarnations of DASH, with more or less departure from the baseline. Finally, the FLASH project of Stanford can be treated as an effort to move important features from the hardware layer (as in DASH) to the software layer (to achieve a more efficient hardware/software co-design, for a better price/performance ratio).
Figure DSMU21 lists a number of modern post-DASH, DASH-like DSM systems. Figure DSMU22 compares the systems presented in the previous figure. Figure DSMU23 discusses the strengths and weaknesses of these systems.
System | Academic/Industrial | University/Company | Year Introduced | Interconnection Network | URL
SGI Origin | I | Silicon Graphics | 1996 | CrayLink/Spider | http://www.sgi.com
HP Exemplar | I | Hewlett Packard | 1996 | SCI-based dual rings | http://www.hp.com
Sequent NUMA-Q | I | Sequent | 1997 | SCI-based single ring | http://www.sequent.com
DG NUMA-Liine | I | Data General | 1997 | SCI-based dual ring | http://www.dg.com/numaliine
Figure DSMU21: A summary of modern post-DASH DSM systems.
System | Processor Count | Sharing Granularity | Mapping | Update Granularity | MCM Protocol | Caching Included | Main Applications
SGI Origin | 1024 (512 x 2) | page | dynamic/static | block | SC/RC | yes | scientific/commercial
HP Exemplar | 512 (8 x 4 torus) | page | dynamic/static | block | SC/RC | yes | scientific/commercial
Sequent NUMA-Q | 252 (63 x 4) | page | dynamic/static | block | SC/RC | yes | commercial
DG NUMA-Liine | logically 1024 | page | dynamic/static | block | SC/RC | yes | commercial
Figure DSMU22: A comparison of the systems from Figure DSMU21.
System | Advantages | Disadvantages | Possible Improvement Avenues
SGI Origin | | | software support
HP Exemplar | | | software support
Sequent NUMA-Q | | |
DG NUMA-Liine | | |
Figure DSMU23: Strengths and weaknesses of the presented systems.
2. Advanced Issues
This part contains the author's selection of research activities which, in his opinion, have made an important contribution to the field in recent times, and are compatible with the overall mission of this book.
Papers [Nowatzyk93] and [Saulsbury96] describe efforts at SUN Microsystems to come up with a DSM architecture which represents a good candidate for future porting to a DSM-on-a-single-chip environment. The major conclusion of their study is that a siliconless motherboard, as the first next step towards the final goal, is achievable once the feature size drops down to 0.25 micrometers. The major highlights are summarized in Figure DSMS1.
S3.mp and beyond
Origin and Environment
Nowatzyk, et al.
SUN Microsystems
Major Highlights
- Going towards the DSM based workstation (S3.mp)
- Going towards the siliconless motherboard (LEGO)
- Using many less powerful CPUs, rather than a few brainiacs,
since the performance is limited by the memory wall
- Simulation studies oriented to 0.25 µm, 256-Mbit DRAM
Figure DSMS1: The S3.mp and beyond.
Comment: The LEGO project from Sun Microsystems can be treated as one of the earliest efforts towards a DSM on a chip, with a number of on-chip accelerators.
Paper [Lovett96] describes an effort at Sequent Corporation to come up with a commercially successful CC-NUMA machine based on the SCI standard chips. The major conclusion of their study is that once Intel becomes able to place four P6 machines on a single die (quad Pentium Pro), it will be possible to have a much more efficient implementation of their STiNG architecture (the small letter i comes from "iNTEL inside"). The major highlights are summarized in Figure DSMS2.
Paper [Savic95] describes an effort at Encore Computer Systems to come up with a board which one can plug into a PC (personal computer), in order to enable it to become a node in DSM systems of the RMS type. The major conclusion of their study is that the board (implemented using FPGA VLSI chips and fully operational) can be ported into a single standard-cell VLSI chip, which means that the RMS approach might be the first one to fit within the single-chip boundaries. The major highlights are summarized in Figure DSMS3.
The Sequent STiNG
Origin and environment
Lovett + Clapp
Sequent Computer Systems, Beaverton, Oregon, USA
A CC-NUMA for the commercial market (1996)
Major highlights:
- Combines 4 quads using SCI
- Quad is based on Intel P6
- Quad includes up to 4GB of system memory, 2 PCI buses for I/O,
and a LYNX board for SCI interface and system-wide cache consistency
- Architecture similar to Encore Mongoose (1995)
- Processor consistency MCM
- Application: OLTP
Paper [Gillett96] describes an effort at DEC to come up with a support product for their client-server systems, using the principles of RMS (this effort can be treated as a follow-up to [Savic95], done on top of a contract with Encore). The major conclusion of their study is that RMS still represents a successful way to go, in spite of the fact that the concept has been around for such a long time (as long as appropriate innovations are incorporated, like those selected by DEC). The major highlights are summarized in Figure DSMS4.
Paper [Milutinovic96] describes another effort at Encore Computer Systems to come up with further improvements of the RMS concept, for better exploitation in the I/O environment (in order for their Infinity SP I/O pump to continue to be, in their words, "the fastest I/O pump on the planet"). The major conclusion of the study is that the RMS can be further improved if it is combined with more sophisticated MCMs, and if an appropriate layer is added, which can be viewed as distributed shared I/O on top of distributed shared memory. The major highlights are summarized in Figure DSMS5.
The DEC MC for NOWs
Origin and Environment
Gillett
A follow-up on the Digital/Encore MC team (1994/95)
Major highlights
(a) A PCI version of the IFACT RM/MC board
(b) Digital UNIX cluster team: Better advantage of MC
(c) Digital HPC team: Optimized application interfaces (including PVM)
(d) Reason for adoption:
- Performance potentials over 1000 times the conventional NOW
- No compromise in cost per added node
- Computer architecture for availability
- Error handling at no cost to the applications
Figure DSMS4: The DEC Memory Channel Architecture.
Legend:
NOW - Network Of Workstations;
HPC - High Performance Computing;
PVM - Parallel Virtual Machine.
Comment:
The DEC Memory Channel product is treated as one of the most successful market-oriented products based on the reflective memory approach, done as a follow-up effort after a contract with Encore.
The IFACT RM/MC for Infinity SP
Origin and Environment
Milutinovic + Protic + Milenkovic + Rakovic + Jovanovic + Denton + Aral
Supported by Encore, on a contract for IBM
Major highlights:
(a) Basic research in 1996
(b) Goal: Continuing to be the highest performance I/O processor on the planet
(c) Five different ideas introduced for higher performance:
- Separation of temporal and spatial data in DSM
- Direct cache injection mechanisms in DSM
- Distributed shared I/O on top of DSM
- Moving to more sophisticated memory consistency models
Figure DSMS5: The IFACT RM/MC for Infinity SP.
Comment: The major goal of the IFACT RM/MC for Infinity SP project at Encore is to make the reflective
memory approach more competitive in the performance race with other approaches acquired by industry.
An important new trend in DSM research involves the building of commercially successful machines, as well as the usage of the IRAM approach to achieve energy-efficient architectures, or replication/migration tradeoffs to achieve performance-efficient architectures [Fromm97, Laudon97, Soundararajan98].
One of the companies most active lately in the DSM arena is Convex (the Exemplar family). The initial SPP1000 was introduced in 1994, and the SPP2000 in 1997. The latter uses a superscalar processor with out-of-order execution and non-blocking caches; in addition, it includes more nodes, a richer interconnection topology, and a better optimized protocol for improved memory latency and lower bandwidth requirements.
Both SPP1000 and SPP2000 connect their nodes using multiple rings. An SPP1000 node
has 4 pairs of processors connected by a crossbar. An SPP2000 node has 8 processor pairs
also connected by a crossbar. Each processor pair has an agent that connects it to a crossbar
port. The SPP1000 uses the HP PA 7100 (a two-way superscalar), while the SPP2000 uses
the HP PA 8000 (a four-way superscalar). Figure DSMS7 shows the nodes and Figure
DSMS8 shows the internode communication topologies of the two machines [Abandah98].
The most recent advances in DSM and related problems can be found in some of the papers in the IEEE Transactions on Computers Special Issue on DSM, February 1999. For example, [Heinrich99] gives an excellent quantitative analysis of the scalability of DSM cache coherence protocols, and [Zhang99] discusses the novel Excel-NUMA approach. The first paper [Heinrich99] stresses the importance of the impact of cache coherence protocols on the overall performance of DSM machines, and illustrates the point for the case of the Stanford FLASH project. The second paper [Zhang99] stresses the importance of adequate data layout for dynamic locality, and observes that after a memory line is written and cached, the storage that kept that line in memory remains unutilized, so that it can be used to hold remote data displaced from local caches (the approach referred to as Excel-NUMA).
Other papers of special importance are [Luk99, Lai99, Bilir99, and Jiang99]. The first paper [Luk99] elaborates on the concept of data forwarding, or remote write, where the goal is to forward data closer to the user before the data are actually needed, which minimizes the latencies related to data fetching. The second paper [Lai99] extends the prediction concept into the message prediction domain; if one can predict the concrete messages used to maintain coherence in DSM, the performance of DSM systems can be maximized, because much of the remote access latency can be efficiently hidden. The third paper [Bilir99] introduces a hybrid approach which combines broadcast snooping and directory protocols; it includes the so-called multicast mask that minimizes the amount of unnecessary bandwidth utilization, and consequently increases the performance. The fourth paper [Jiang99] examines the scalability of hardware DSM and concludes that application restructuring can help considerably in achieving better scalability.
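The data forwarding idea elaborated in [Luk99] can be sketched in a few lines (our own hedged illustration; the names below are hypothetical, not from the paper): the producer pushes a value into the consumer's cache ahead of use, so the consumer's later read hits locally instead of paying the remote-fetch latency.

```python
# Illustrative data-forwarding (remote-write) sketch; Cache and
# forward() are our own names, not an API from [Luk99].

class Cache(dict):
    """A node's local cache, modeled as an address -> value map."""

def forward(value, address, consumer_cache):
    # The producer pushes the value before the consumer ever asks for it.
    consumer_cache[address] = value

producer_value = 7
consumer = Cache()
forward(producer_value, 0x40, consumer)
# Later, the consumer's read is a local hit: no remote-fetch latency.
assert consumer.get(0x40) == 7
```

Compare this with prefetching: there the consumer pulls the data early, while here the producer pushes it, which works even when the consumer cannot predict the address in time.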
Figure DSMS7: Processing nodes of the SPP machines.
Legend:
RI - Ring Interface
Figure DSMS8: Interconnection topologies of the SPP machines.
Description:
In SPP1000, 4 8-processor nodes are connected in one dimension, using 4 rings.
In SPP2000, 4 16-processor nodes are connected in two dimensions, using 32 rings.
An interesting homework for the reader is to estimate the needed transistor count for all DSM systems mentioned so far, having in mind both the transistor count of the actual components used and the transistor count if only the needed resources are used. Such an exercise can bring a better understanding of the future referred to as DSM-on-a-VLSI-chip.
Another type of exercise for students, used in the classes of this author, for the above mentioned systems, and for new ones to come, is to prepare one-academic-hour lectures, to explain the details to their colleagues, using the presentation strategy explained in [Milutinovic95b] and [Milutinovic96c]: (a) the problem being attacked by the chosen research project, (b) the essence of the existing solutions, and what is to be criticized about them, from the point of view defined in the problem statement, (c) the essence of the proposed approach, and why it is expected to be better in the conditions of interest, which are defined as a part of the problem statement, (d) details which deserve attention, (e) performance evaluation results, (f) complexity evaluation results, and (g) the final conclusion.
* * *
The author and his associates were active in DSM architecture research, and especially in the field of reflective memory. For more details, see [Grujic96, Jovanovic99, Jovanovic2000, Milenkovic96a, Milenkovic96b, Milutinovic92, Milutinovic95c, Milutinovic96c, Protic96a, Protic96b, Protic97, Protic98, and Savic95].
3. Problems
1. Compare the DSM systems of the 70s, 80s, and 90s. What are the issues which stayed essentially the same? What changes have been driven by advances in technology and applications?
2. Compare the modern DSM systems as far as the type of interconnection network used on
various levels of the system. Which type prevails?
3. Discuss pros and cons of various granularity levels for maintenance of memory consistency in DSM systems. Do that both from the software and the hardware implementation
points of view.
4. The MRMW algorithm is potentially the most efficient one; however, write conflicts are possible if multiple writers have to write to the same location. Check how this problem is solved in the DSM systems in Figures DSMU13/14/15, and propose improved solutions.
5. Using a tool like Limes (see the appendix of this book), compare two alternative COMA approaches of the same complexity.
6. Using a tool like Limes (see the appendix of this book), compare two alternative RSM approaches of the same complexity.
7. Using a tool like Limes (see the appendix of this book), compare two alternative CC-NUMA approaches of the same complexity.
8. Compare the hardware and the software implementation of the release and the lazy-release memory consistency models, from the implementation point of view. Show a block scheme of one hardware implementation and a flow chart of one software implementation.
9. Develop the details of AURC and SCOPE, and show the differences on the example of one short program of your choice. Discuss the complexity of the specific mechanisms of SCOPE, which make it superior to AURC.
10. Develop the relevant details for one consumer-initiated and one producer-initiated scheme, and compare the time delays: write delay, sync delay, and read delay. What is the ratio of write, sync, and read delays?
REFERENCES
[Abandah98]
[Adve96]
Adve, S.V., Cox, A.L., Dwarkadas, S., Rajamony, R., Zwaenepoel, W.,
A Comparison of Entry Consistency
and Lazy Release Consistency Implementations,
Proceedings of the IEEE HPCA-96, San Jose, California, USA,
February 1996, pp. 26-37.
See http://www-ece.rice.edu/~sarita/publications.html for related work.
[Agarwal90]
[Agarwal91]
[Ahuja86]
[Alvarez97]
[Alvarez98]
[AMD97]
[August98]
August, D.I., Connors, D.A., Mahlke, S.A., Sias, J.W., Crozier, K.M., Cheng, B.-C.,
Eaton, P.R., Olaniran, Q.B., Hwu, W.-M.,
"Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture,"
Proceedings of the ISCA-98, Barcelona, Catalonia, Spain, June 27 - July 1, 1998,
pp. 227-237.
[Bal88]
[Becker98]
[Bekerman99]
[Bell99]
Bell, G., van Ingen, K., "DSM Perspective: Another Point of View,"
Proceedings of the IEEE, Vol. 87, No. 3, March 1999, pp. 412-417.
[Bershad93]
[Bierman77]
Bierman, G. J.,
Factorization Methods for Discrete Sequential Estimation,
Academic Press, New York, New York, USA, 1977.
[Bilir99]
[Bisani90]
[Black99]
[Blumrich94]
Blumrich, M.A., Li, K., Alpert, R., Dubnicki, C., Felten, E. W., Sandberg, J.,
The Virtual Memory Mapped Network Interface for the Shrimp Multicomputer,
Proceedings of the 21st Annual International Symposium on Computer Architecture,
Chicago, Illinois, USA, April 1994, pp. 142-153.
[Blumrich98]
Blumrich, M., Alpert, R., Chen, Y., Clark, D., Damianakis, S., Dubnicki, C., Felten, E.,
Iftode, L., Li, K., Martonosi, M., Shillner, R.,
"Design Choices in the SHRIMP System: An Empirical Study,"
Proceedings of the ISCA-98, Barcelona, Catalonia, Spain, June 27 - July 1, 1998, pp.
330-341.
[Brauch98]
Brauch, J., Fleischman, J., "Design of Cache Test Hardware on the HP8500," IEEE
Design & Test of Computers, Vol. 15, No. 3, July-September 1998, pp. 58-63.
[Burger97]
Burger, D., Goodman, J.R., "Billion-Transistor Architectures," IEEE Computer,
Vol. 30, No. 9, September 1997, pp. 46-48.
[Byrd99]
[Calder99]
[Carter91]
[Chaiken94]
[Chan96]
Chan, K. K., Hay, C. C., Keller, J. R., Kurpanek, G. P., Schumacher, F. X., Zheng, J.,
Design of the HP PA 7200 CPU,
Hewlett-Packard Journal, February 1996, pp. 1-12.
[Chandra93]
[Chang95]
[Chang97]
[Chappell99]
[Cho99]
[Corsini87]
[Davidovic97]
Davidovic, G., Ciric, J., Ristic-Djurovic, J., Milutinovic, V., Flynn, M.,
A Comparative Study of Adders: Wave Pipelining vs. Classical Design,
IEEE TCCA Newsletter, June 1997, pp. 64-71.
[DeJuan87]
[DeJuan88]
[Delp91]
[Digital96a]
[Digital96b]
(http://www.digital.com/info/semiconductor/a264up1/index.html),
Digital Equipment Corporation, Maynard, Massachusetts, USA, 1996.
[Digital97a]
[Digital97b]
[Driesen98]
[Eickmeyer96]
Eickmeyer, R. J., Johnson, R. E., Kunkel, S. R., Liu, S., Squillante, M. S.,
Evaluation of Multithreaded Uniprocessors
for Commercial Application Environments,
Proceedings of the ISCA-96, Philadelphia, Pennsylvania, May 1996,
pp. 203-212.
[Ekmecic95]
[Ekmecic96]
[Espasa97]
[Evers96]
[Evers98]
[Fetherston98]
Fetherston, R.S., Shaik, I.P., Ma, S.C., Testability Features of the AMD-K6 Microprocessor, IEEE Design & Test of Computers, Vol. 15, No. 3, July-September 1998,
pp. 64-69.
[Fleisch89]
[Flynn95]
Flynn, M. J.,
Computer Architecture: Pipelined and Parallel Processor Design,
Jones and Bartlett Publishers,
Boston, Massachusetts, 1995.
[Forman94]
[Fortes86]
Fortes, J., Milutinovi, V., Dock, R., Helbig, W., Moyers, W.,
A High-Level Systolic Architecture for GaAs,
Proceedings of the HICSS-86, Honolulu, Hawaii, January 1986,
pp. 253-258.
[Frank93]
[Fromm97]
Fromm, R., Perissakis, S., Cardwell, N., Kozyrakis, C., McGaughy, B., Patterson, D.,
Anderson, T., Yelick, K.,
The Energy Efficiency of IRAM Architectures,
Proceedings of the ISCA-24, Denver, Colorado, USA, June 1997, pp. 327-337.
[Gabbay98]
[Geppert97]
Geppert, L.,
Technology 1997 Analysis and Forecast: Solid State,
IEEE Spectrum, Vol. 34, No. 1, January 1997, pp. 55-59.
[Gillett96]
Gillett, R. B.,
Memory Network for PCI,
IEEE MICRO, February 1996, pp. 12-18.
[Gloy96]
[Gonzalez95]
Gonzalez, A., Aliagas, C., Valero, M., A Data Cache with Multiple Caching Strategies
Tuned to Different Types of Locality, Proceedings of the International Conference on
Supercomputing (ICS 95), Barcelona, Spain, July 1995, pp. 338-347.
[Gonzalez96]
[Gonzalez97]
[Gould81]
Gould, Inc.,
Reflective Memory System,
Gould, Inc., Fort Lauderdale, Florida, USA, December 1981.
[Grujic96]
[Gwennap97]
Gwennap, L.,
Digital 21264 Sets New Standard,
(http://www.chipanalyst.com/report/articles/21264/21264.html),
Micro Design Resources, Sebastopol, California, USA, 1997.
[Gwennap97]
Gwennap, L.,
"Intel and HP Make EPIC Disclosure,"
Microprocessor Report, 11(14), October 1997, pp. 1-9.
[Hagersten92]
[Hagersten99]
[Hammond97]
Hammond, L., Nayfeh, B., Olokotun, K., "A Single Chip Multiprocessor,"
IEEE Computer, Vol. 30, No. 9, September 1997, pp. 79-85.
[Hammond97]
[Hank95]
[Hartenstien97]
Hartenstein, R.,
The Microprocessor Is No More General Purpose:
Why Future Reconfigurable Platforms Will Win,
Proceedings of the ISIS97
(International Conference on Innovative Systems in Silicon 97),
Austin, Texas, USA, October 8-10, 1997, pp.2-12.
[Heinrich99]
[Helbig89]
[Helstrom68]
Helstrom, C. W.,
Statistical Theory of Signal Detection,
Pergamon Press, Oxford, England, 1968.
[Hennessy96]
[Hennessy99]
[Hill88]
Hill, M.,
A Case for Direct-Mapped Caches,
IEEE Computer, Vol. 21, No. 12, December 1988, pp. 25-40.
[Holland92]
[Hu96]
[Hunt97]
Hunt D.,
Advanced Performance Features of the 64-bit PA-8000,
(http://hpcc920.external.hp.com/computing/framed/technology/micropro/pa8000/docs/advperf.html),
Hewlett-Packard Company, Fort Collins, Colorado, USA, 1997.
[Hwu95]
[Iannucci94]
[IBM93]
[IBM96a]
[IBM96b]
[IEEE93]
[Iftode96a]
[Iftode96b]
[Iftode96c]
[Intel93]
[Intel96]
(http://www.intel.com/procs/p6/p6white/index.html),
Intel, Santa Clara, California, USA, 1996.
[Intel97a]
[Intel97b]
[Iseli95]
[Jacobson97]
[James90]
[James94]
James, D. V.,
The Scalable Coherent Interface: Scaling to High-Performance Systems,
COMPCON 94: Digest of Papers, March 1994, pp. 64-71.
[Jiang99]
[Johnson91]
Johnson, M.,
Superscalar Microprocessor Design,
Prentice-Hall, Englewood Cliffs, New Jersey, 1991.
[Johnson97a]
[Johnson97b]
[Jouppi90]
Jouppi, N.,
Improving Direct-Mapped Cache Performance
by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,
Proceedings of the ISCA-90, May 1990, pp. 364-373.
[Jourdan96]
[Jovanovic99]
[Jovanovic2000]
Jovanovic, M.,
Advanced RMS for PC Environments,
Ph.D. Thesis, School of Electrical Engineering, University of Belgrade,
Belgrade, Serbia, Yugoslavia, 2000.
[Juan96]
[Juan98]
[Kalamatianos99]
[Kandemir99]
[Keeton98]
Keeton, K., Patterson, D.A., He, Y.-Q., Raphael, R.C., Baker, W.E.,
"Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads,"
Proceedings of the ISCA-98, Barcelona, Catalonia, Spain, June 27 - July 1, 1998,
pp. 15-26.
[Keleher94]
[Kennedy97]
[Kim96]
[Klauser98]
[Kozyakis97]
Kozyrakis, C.E., Patterson, D.A., Anderson, T., Asanovic, K., Cardwell, N., Fromm, R.,
Golbus, J., Gribstad, B., Keeton, K., Thomas, R., Treuhaft, N., Yelick, K., IRAM,
IEEE Computer, Vol. 30, No. 9, September 1997, pp. 75-78.
[Kumar97]
Kumar, A., The HP PA-8000 RISC CPU, IEEE Micro, Vol. 17, No. 2, March/April
1997, pp. 27-32.
[Kumar98]
[Kung88]
Kung, S. Y.,
VLSI Array Processors,
Prentice-Hall, Englewood Cliffs, New Jersey, 1988.
[Kuskin94]
Kuskin, J., Ofelt, D., Heinrich, M., Heinlein, J., Simoni, R., Gharachorloo, K., Chapin,
J., Nakahira, D., Baxter, J., Horowitz, M., Gupta, A., Rosenblum, M., Hennessy, J.,
The Stanford FLASH Multiprocessor,
Proceedings of the 21st Annual International Symposium
on Computer Architecture, April 1994, pp. 302-313.
[Kwak99]
Kwak, H., Lee, B., Hurson, A.R., Yoon, S.-H., Hahn, W.-J.,
"Effects of Multithreading on Cache Performance,"
IEEE Transactions on Computers, Vol. 48, No. 2, February 1999,
pp. 176 - 184.
[Lai99]
[Laudon97]
[Lenoski92]
[Lesartre97]
[Li88]
Li, K.,
IVY: A Shared Virtual Memory System for Parallel Computing,
Proceedings of the 1988 International Conference on Parallel Processing,
August 1988, pp. 94-101.
[Lipasti96]
[Lipasti97]
[Lo98]
[Lovett96]
[Lucci95]
[Luk99]
[Manzingo80]
[Maples90]
[Maquelin96]
Maquelin, O., Gao, G. R., Hum, H. H. J., Theobald, K., Tian, X.,
Polling Watchdog:
Combining Polling & Interrupts for Efficient Message Handling,
Proceedings of the ISCA-96, Philadelphia, Pennsylvania, May 1996,
pp. 12-21.
[Martin97]
[McCormack99]
[McFarling95]
McFarling, S.,
Technical Report on gshare,
(http://www.research.digital.com/wrl), 1996.
[Milenkovic96a]
[Milenkovic96b]
[Milutinovic78]
Milutinovic, V.,
Microprocessor-Based Modem Design,
Product Documentation, Michael Pupin Institute, Belgrade, Serbia, Yugoslavia,
December 1978.
[Milutinovic79]
Milutinovic, V.,
"A MOS Microprocessor-Based Medium-Speed Data Modem,"
Microprocessing and Microprogramming, March 1979, pp. 100-103.
[Milutinovic80a]
Milutinovic, V.,
One Approach to Multimicroprocessor Implementation of Modem
for Data Transmission over HF Radio,
Proceedings of EUROMICRO-80,
London, England, September 1980, pp. 107-111.
[Milutinovic80b]
Milutinovic, V.,
Suboptimum Detection Procedure Based on the Weighting of Partial Decisions,
IEE Electronic Letters, Vol. 16, No. 6, 13th March 1980, pp. 237-238.
[Milutinovic80c]
Milutinovic, V.,
Comparison of Three Suboptimum Detection Procedures,
IEE Electronic Letters, Vol. 16, No. 17, 14th August 1980, pp. 683-685.
[Milutinovic84]
Milutinovic, V.,
Performance Comparison of Three Suboptimum Detection Procedures
in Real Environment,
IEE Proceedings Part F, Vol. 131, No. 4, July 1984, pp. 341-344.
[Milutinovic85a]
Milutinovic, V.,
A 4800 bit/s Microprocessor-Based CCITT Compatible Data Modem,
Microprocessing and Microprogramming, February 1985, pp. 57-74.
[Milutinovic85b]
Milutinovic, V.,
Generalized W.P.D. Procedure for Microprocessor Based Signal Detection,
IEE Proceedings Part F, Vol. 132, No. 1, February 1985, pp. 27-35.
[Milutinovic85c]
Milutinovic, V.,
Avenues to Explore
in GaAs Multimicroprocessor Research and Development,
RCA Internal Report (Solicited Expert Opinion),
RCA, Moorestown, New Jersey, USA, August 1985.
[Milutinovic86a]
[Milutinovic86b]
[Milutinovic86c]
Milutinovic, V., Silbey, A., Fura, D., Keirn, K., Bettinger, M., Helbig, W., Heagerty,
W., Zeiger, R., Schellack, B., Curtice, W.,
Issues of Importance in Designing GaAs Microcomputer Systems,
IEEE Computer, Vol. 19, No. 10, October 1986, pp. 45-59.
[Milutinovic87a]
[Milutinovic87b]
Milutinovic, V.,
A Simulation Study of the Vertical-Migration
Microprocessor Architecture,
IEEE Transactions on Software Engineering, December 1987,
pp. 1265-1277.
[Milutinovic88a]
Milutinovic, V.,
A Comparison of Suboptimal Detection Algorithms
Applied to the Additive Mix of Orthogonal Sinusoidal Signals,
IEEE Transactions on Communications, Vol. COM-36, No. 5,
May 1988, pp. 538-543.
[Milutinovic88b]
[Milutinovic92]
Milutinovic, V.,
Avenues to Explore in PC-Oriented DSM Based on RM,
ENCORE Internal Report (Solicited Expert Opinion),
ENCORE, Fort Lauderdale, Florida, USA, December 1992.
[Milutinovic95a]
Milutinovic, V.,
A New Cache Architecture Concept:
The Split Temporal/Spatial Cache Memory, Technical Report,
(UBG-ETF-TR-95-035), School of Electrical Engineering, University of Belgrade,
Belgrade, Serbia, Yugoslavia, January 1995.
[Milutinovic95b]
[Milutinovic95c]
Milutinovic, V.,
New Ideas for SMP/DSM, Technical Report,
School of Electrical Engineering, University of Belgrade,
Belgrade, Serbia, Yugoslavia, 1995.
[Milutinovic96a]
[Milutinovic96b]
[Milutinovic96c]
Milutinovic, V.,
Some Solutions for Critical Problems in Distributed Shared Memory,
IEEE TCCA Newsletter, September 1996.
[Milutinovic96d]
Milutinovic, V.,
The Best Method for Presentation of Research Results,
IEEE TCCA Newsletter, September 1996.
[Milutinovic99]
[MIPS96]
[Modcomp83]
Modcomp, Inc.,
Mirror Memory System,
Internal Report, Modcomp, Fort Lauderdale, Florida, USA,
December 1983.
[Montoye90]
[Moshovos97]
[Mukherjee98]
[Nair97]
[Nowatzyk93]
Nowatzyk, A., Monger, M., Parkin, M., Kelly, E., Browne, M., Aybay, G.,
Lee, D.,
S3.mp: A Multiprocessor in a Matchbox,
Proceedings of the PASA, 1993.
[Oberman99]
[Palacharla97]
[Papworth96]
Papworth, D. B.,
Tuning the Pentium Pro Microarchitecture,
IEEE Micro, April 1996, pp. 8-16.
[Patel98]
[Patel99]
[Patt94]
Patt, Y. N.,
The I/O Subsystem: A Candidate for Improvement,
IEEE Computer, Vol. 27, No. 3, March 1994 (special issue).
[Patt97]
Patt, Y.N., Patel, S.J., Evers, M., Friendly, D.H., Stark, J., One Billion Transistors,
One Uniprocessor, One Chip, IEEE Computer, Vol. 30, No. 9, September 1997,
pp. 51-57.
[Patterson94]
Patterson, D.A., Hennessy, J.L., Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Mateo, California, USA, 1994.
[Peir99]
[Peleg94]
Peleg, A., Weiser, U., Dynamic Flow Instruction Cache Memory Organized Around
Trace Segments Independent of Virtual Address Line, US Patent (5,381,533),
Washington, D.C., USA, 1994.
[Petterson96]
[Pinkston97]
[Prete91]
Prete, C. A.,
RST: Cache Memory Design for a Tightly Coupled Multiprocessor System,
IEEE Micro, April 1991, pp. 16-19, 40-52.
[Prete95]
[Prete97]
[Protic85]
Protic, J.,
System LOLA-85,
Technical Report (in Serbian), Lola Industry, Belgrade, Serbia, Yugoslavia,
December 1985. (email: [email protected]).
[Protic96a]
[Protic96b]
[Protic97]
[Protic98]
Protic, J.,
A New Hybrid Adaptive Memory Consistency Model,
Ph.D. Thesis, School of Electrical Engineering, University of Belgrade,
Belgrade, Serbia, Yugoslavia, 1998.
[Prvulovic97]
Prvulovic, M.,
Microarchitecture Features of Modern RISC Microprocessors: An Overview,
Proceedings of the SinfoN97, Zlatibor, Serbia, Yugoslavia, November 1997
([email protected]).
[Prvulovic99a]
Prvulovic, M., Marinov, D., Dimitrijevic, Z., Milutinovic, V., "Split Temporal/Spatial
Cache: A Survey and Reevaluation of Performance," IEEE TCCA Newsletter, 1999.
[Prvulovic99b]
Prvulovic, M., Marinov, D., Dimitrijevic, Z., Milutinovic, V., "Split Temporal/Spatial
Cache: A Performance and Complexity Evaluation," IEEE TCCA Newsletter, 1999.
[Pyron98]
Pyron, C., Prado, J., Golab, J., "Test Strategy for the PowerPC 750 Microprocessor,"
IEEE Design and Test of Computers, July-September 1998, pp. 90-97.
[Ramachandran91]
[Raskovic95]
[Reinhardt94]
[Reinhardt96]
[Rexford96]
[Rivers96]
[Rotenberg97]
[Rotenberg99]
[Savell99]
Savell, T.C.,
"The EMU10K1 Digital Audio Processor,"
IEEE Micro, Vol. 19, No. 2, March/April 1999, pp. 49-57.
[Sahuquillo99]
Sahuquillo, J., Pont, A., The Split Data Cache in Multiprocessor Systems: An Initial
Hit Ratio Analysis, Proceedings of the 7th Euromicro Workshop on Parallel and Distributed Processing, Madeira, Portugal, February 1999.
[Sanchez97]
[Saulsbury96]
[Savic95]
Savic, S., Tomasevic, M., Milutinovic, V., Gupta, A., Natale, M.,
Gertner, I.,
Improved RMS for the PC Environment,
Microprocessors and Microsystems,
Vol. 19, No. 10, December 1995, pp. 609-619.
[Schoinas94]
Schoinas, I., Falsafi, B., Lebeck, A. R., Reinhardt, S. K., Larus, J. R., Wood, D. A.,
Fine-grain Access Control for Distributed Shared Memory,
Proceedings of the 6th International Conference on Architectural Support
for Programming Languages and Operating Systems, November 1994,
pp. 297-306.
[Sechrest96]
[Seznec96]
Seznec, A.,
Don't use the page number, but a pointer to it,
Proceedings of the ISCA-96, Philadelphia, Pennsylvania, USA,
May 1996.
[Sheaffer96]
Sheaffer, G.,
Trends in Microprocessing,
Keynote Address, YU-INFO-96, Brezovica, Serbia, Yugoslavia,
April 1996.
[Shriver98]
[Silc98]
Silc, J., Robic, B., Ungerer, T., Asynchrony in Parallel Computing: From Dataflow to
Multithreading, Parallel and Distributed Computing Practices, 1(1), 1998, pp. 3-30
(http://goethe.ira.uka.de/people/ungerer/JPDCPdataflow.pdf).
[Simha96]
[Simoni90]
Simoni, R.,
Implementing a Directory-Based Cache Coherence Protocol,
Stanford University, Technical Report, (CSL-TR-90-423), Palo Alto, California, USA,
March 1990.
[Simoni91]
[Simoni92]
Simoni, R.,
Cache Coherence Directories for Scalable Multiprocessors, Ph.D. Thesis,
Stanford University, Palo Alto, California, USA, 1992.
[Slegel99]
[Smith95]
[Smith97]
Smith, J.E., Vajapeyam, S., Trace Processors, IEEE Computer, September 1997,
Vol. 30, No. 9, pp. 68-73.
[Sodani97]
[Sohi95]
[Soundararajan98]
Soundararajan, V., Heinrich, M., Verghese, B., Gharachorloo, K., Gupta, A.,
Hennessy, J.,
"Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM
Multiprocessors," Proceedings of the ISCA-98,
Barcelona, Catalonia, Spain, June 27 - July 1, 1998, pp. 342-355.
[Sprangle97]
[Stallings96]
[Stenstrom88]
Stenstrom, P.,
Reducing Contention in Shared-Memory Multiprocessors,
IEEE Computer, November 1988, pp. 26-37.
[Stiliadis97]
[Stojanovic95]
Stojanovic, M.,
Advanced RISC Microprocessors,
Technical Report, School of Electrical Engineering, University of Belgrade,
Belgrade, Serbia, Yugoslavia, December 1995.
[Sun95]
[Sun96]
UltraSPARC-I High Performance, 167 & 200 MHz, 64-bit RISC Microprocessor Data Sheet,
(http://www.sun.com/sparc/stp1030a/datasheets/stp1030a.pdf),
Sun Microelectronics, Mountain View, California, USA, 1996.
[Sun97]
UltraSPARC-II High Performance, 250 MHz, 64-bit RISC Processor Data Sheet,
(http://www.sun.com/sparc/stp1031/datasheets/stp1031lga.pdf),
Sun Microelectronics, Mountain View, California, USA, 1997.
[Sweazey86]
[Tabak98]
[Tanenbaum90]
Tanenbaum, A. S.,
Structured Computer Organization,
Prentice-Hall, Englewood Cliffs, New Jersey, USA, 1990.
[Tartalja97]
Tartalja, I.,
The Balkan Schemes for Software Based Maintenance
of Cache Consistency in Shared Memory Multiprocessors,
Ph.D. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, 1997.
[Temam99]
Temam, O.,
"An Algorithm for Optimally Exploiting Spatial
and Temporal Locality In Upper Memory Levels,"
IEEE Transactions on Computers, Vol. 48, No. 2, February 1999,
pp. 150-158.
[Teodosiu97]
Teodosiu, D., Baxter, J., Govil, K., Chapin, J., Rosenblum, M., Horowitz, M.,
Hardware Fault Containment in Scalable Shared-Memory Multiprocessors,
Proceedings of the ISCA-24, Denver, Colorado, USA, June 1997, pp. 73-84.
[Thompson94]
Thompson, T., Ryan, B., PowerPC 620 Soars, Byte, November 1994.
[Thornton64]
Thornton, J. E.,
Parallel Operation in the Control Data 6600,
Proceedings of the Fall Joint Computer Conference, October 1964, pp. 33-40.
[Tomasevic92a]
[Tomasevic92b]
Tomasevic, M.,
A New Snoopy Cache Coherence Protocol,
Ph.D. Thesis, School of Electrical Engineering, University of Belgrade,
Belgrade, Serbia, Yugoslavia, 1992.
[Tomasevic93]
[Tomasko97]
[Tomasulo67]
Tomasulo, R. M.,
An Efficient Algorithm for Exploiting Multiple Arithmetic Units,
IBM Journal of Research and Development, January 1967, pp. 25-33.
[Tredennick86]
[Tremblay96]
[Tse98]
[Tullsen95]
[Tullsen96]
Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L.,
Stamm, R. L.,
Exploiting Choice: Instruction Fetch and Issue
on an Implementable Simultaneous Multithreading Processor,
Proceedings of the ISCA-96, Philadelphia, Pennsylvania, May 1996,
pp. 191-202.
[Tullsen99]
[Tyson95]
[Vajapeyam97]
[Vajapeyam99]
[Villasenor97]
[Vuletic97]
Vuletic, M., Ristic-Djurovic, J., Aleksic, M., Milutinovic, V., Flynn, M.,
Per Window Switching of Window Characteristics:
Wave Pipelining vs. Classical Design,
IEEE TCCA Newsletter, September 1997, pp. 1-6.
[Wang97]
[Wilson94]
[Wilson96]
[Woo95]
Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., Gupta, A.,
The SPLASH-2 Programs: Characterization and Methodological Considerations,
Proceedings of the ISCA-95, Santa Margherita Ligure, Italy, June 1995, pp. 24-36.
[Yoaz99]
[Zhang99]
[Zhou90]
INTERNAL APPENDICES
EXTERNAL APPENDICES