(Mark Gordon Arnold) Verilog Digital Computer Design
ISBN 0-13-639253-9
Verilog Digital
Computer Design
Algorithms into Hardware
Editorial/Production Supervision: Craig Little
Acquisitions Editor: Bernard M. Goodwin
Manufacturing Manager: Alan Fischer
Marketing Manager: Miles Williams
Cover Design Director: Jerry Votta
Cover Design: Talar Agasyan
© 1999 by Prentice Hall PTR
Prentice-Hall, Inc.
A Simon & Schuster Company
Upper Saddle River, NJ 07458
All product names mentioned herein are the trademarks of their respective owners.
Prentice Hall books are widely used by corporations and government agencies for training, marketing, and resale.
The publisher offers discounts on this book when ordered in bulk quantities.
For more information, contact the Corporate Sales Department at 800-382-3419, fax: 201-236-7141,
email: [email protected] or write
Corporate Sales Department
Prentice Hall PTR
One Lake Street
Upper Saddle River, NJ 07458
Table of Contents
Page
Preface.............................................................................................................. xxvii
Acknowledgements ......................................... xxix
2.2.5 Saving the quotient .................... 33
2.2.6 Variations within the loop .................... 34
2.2.7 Eliminate state ZEROR3 .................... 38
2.3 Mixed examples .................... 40
2.3.1 First example .................... 41
2.3.2 Second example .................... 45
2.3.3 Third example .................... 46
2.3.4 Methodical versus central ALU architectures .................... 48
2.4 Pure structural example .................... 49
2.4.1 First example .................... 49
2.4.2 Second example .................... 51
2.5 Hierarchical design .................... 52
2.5.1 How pure is "pure"? .................... 56
2.5.2 Hierarchy in the first example .................... 57
2.6 Conclusion .................... 59
2.7 Further Reading .................... 59
2.8 Exercises .................... 60
3.7.1 # time control .................... 83
3.7.1.1 Using # in test code .................... 84
3.7.1.2 Modeling combinational logic with # .................... 85
3.7.1.3 Generating the system clock with # for simulation .................... 87
3.7.1.4 Ordering processes without advancing $time .................... 87
3.7.2 @ time control .................... 88
3.7.2.1 Efficient behavioral modeling of combinational logic with @ .................... 89
3.7.2.2 Modeling synchronous registers .................... 90
3.7.2.3 Modeling synchronous logic controllers .................... 91
3.7.2.4 @ for debugging display .................... 92
3.7.3 wait .................... 93
3.8 Assignment with time control .................... 94
3.8.1 Blocking procedural assignment .................... 95
3.8.2 Non-blocking procedural assignment .................... 95
3.8.2.1 Problem with <= for RTN for simulation .................... 96
3.8.2.2 Proper use of <= for RTN in simulation .................... 98
3.8.2.3 Translating goto-less ASMs to behavioral Verilog .................... 99
3.8.2.3.1 Implicit versus explicit style .................... 99
3.8.2.3.2 Identifying the infinite loop .................... 100
3.8.2.3.3 Recognizing if else .................... 101
3.8.2.3.4 Recognizing a single alternative .................... 103
3.8.2.3.5 Recognizing while loops .................... 104
3.8.2.3.6 Recognizing forever .................... 106
3.8.2.3.7 Translating into an if at the bottom of forever .................... 108
3.9 Tasks and functions .................... 109
3.9.1 Tasks .................... 109
3.9.1.1 Example task .................... 110
3.9.1.2 enter_new_state task .................... 112
3.9.2 Functions .................... 114
3.9.2.1 Real function example .................... 115
3.9.2.2 Using a function to model combinational logic .................... 115
3.10 Structural Verilog, modules and ports .................... 117
3.10.1 input ports .................... 118
3.10.2 output ports .................... 119
3.10.3 inout ports .................... 119
3.10.4 Historical analogy: pins versus ports .................... 119
3.10.5 Example of a module defined with a behavioral instance .................... 121
3.10.6 Example of a module defined with a structural instance .................... 123
3.10.7 More examples of behavioral and structural instances .................... 123
3.10.8 Hierarchical names .................... 125
3.10.9 Data structures .................... 126
3.10.10 Parameters .................... 128
3.11 Conclusion .................... 129
3.12 Further reading .................... 130
3.13 Exercises .................... 131
4. THREE STAGES FOR VERILOG DESIGN .................... 134
4.1 Pure behavioral examples .................... 134
4.1.1 Four-state division machine .................... 134
4.1.1.1 Overview of the source code .................... 135
4.1.1.2 Details on slow_division_system .................... 137
4.1.2 Verilog catches the error .................... 140
4.1.3 Importance of test code .................... 141
4.1.4 Additional pure behavioral examples .................... 143
4.1.5 Pure behavioral stage of the two-state division machine .................... 148
4.2 Mixed stage of the two-state division machine .................... 150
4.2.1 Building block devices .................... 150
4.2.1.1 enabled_register portlist .................... 151
4.2.1.2 counter_register portlist .................... 151
4.2.1.3 alu181 portlist .................... 152
4.2.1.4 comparator portlist .................... 153
4.2.1.5 mux2 portlist .................... 153
4.2.2 Mixed stage .................... 154
4.2.3 Architecture for the division machine .................... 154
4.2.4 Controller for the division machine .................... 157
4.3 Pure structural stage of the two state division machine .................... 161
4.3.1 The pure structural controller .................... 162
4.3.2 next_state_logic module .................... 162
4.3.3 state_gen function .................... 163
4.3.4 Testing state_gen .................... 165
4.3.5 It seems to work .................... 166
4.4 Hierarchical refinement of the controller .................... 167
4.4.1 A logic equation approach .................... 167
4.4.2 At last: a netlist .................... 168
4.4.3 Post-synthesis simulation .................... 170
4.4.4 Resetting the present state .................... 172
4.5 Conclusion .................... 176
4.6 Exercises .................... 176
5. ADVANCED ASM TECHNIQUES .................... 177
5.1 Moore versus Mealy .................... 177
5.1.1 Silly example of behavioral Mealy machine .................... 178
5.1.2 Silly example of mixed Mealy machine .................... 179
5.1.3 Silly example of structural Mealy machine .................... 180
5.2 Mealy version of the division machine .................... 181
5.2.1 Eliminating state INIT again .................... 181
5.2.2 Merging states COMPUTE1 and COMPUTE2 .................... 183
5.2.3 Conditionally loading r2 .................... 184
5.2.4 Asserting READY early .................... 185
5.3 Translating Mealy ASMs into behavioral Verilog .................... 186
5.4 Translating complex (goto) ASMs into behavioral Verilog .................... 188
5.4.1 Bottom testing loop .................... 189
5.4.2 Time control within a decision .................... 191
5.4.3 Arbitrary gotos .................... 194
5.5 Translating conditional command signals into Verilog .................... 194
5.6 Single-state Mealy ASMs .................... 196
5.7 Conclusion .................... 197
6. DESIGNING FOR SPEED AND COST .................... 198
6.1 Propagation delay .................... 199
6.2 Factors that determine clock frequency .................... 199
6.3 Example of netlist propagation delay .................... 200
6.3.1 A priori worst case timing analysis .................... 202
6.3.2 Simulation timing analysis .................... 204
6.3.3 Hazards .................... 205
6.3.4 Advanced gate-level modeling .................... 207
6.4 Abstracting propagation delay .................... 209
6.4.1 Inadequate models for propagation delay .................... 209
6.4.2 Event variables in Verilog .................... 212
6.4.3 The disable statement .................... 213
6.4.4 A clock with a PERIOD parameter .................... 215
6.4.5 Propagation delay in the division machine .................... 215
6.5 Single cycle, multi-cycle and pipeline .................... 217
6.5.1 Quadratic polynomial evaluator example .................... 218
6.5.2 Behavioral single cycle .................... 219
6.5.3 Behavioral multi-cycle .................... 224
6.5.4 First attempt at pipelining .................... 226
6.5.5 Pipelining the ma .................... 229
6.5.6 Flushing the pipeline .................... 231
6.5.7 Filling the pipeline .................... 231
6.5.8 Architectures for the quadratic evaluator .................... 235
6.5.8.1 Single-cycle architecture .................... 235
6.5.8.2 Multi-cycle architecture .................... 238
6.5.8.3 Pipelined architecture .................... 241
6.6 Conclusion .................... 245
6.7 Further reading .................... 247
9.3 Data dependencies .............................................. 359
9.4 Data forwarding .............................................. 360
9.5 Control dependencies: implementing JMP .............................................. 362
9.6 Skip instructions in a pipeline .............................................. 365
9.7 Our old friend: division .............................................. 368
9.8 Multi-port memory .............................................. 372
9.9 Pipelined PDP-8 architecture .............................................. 374
9.10 Conclusion .............................................. 375
9.11 Further reading .............................................. 375
9.12 Exercises .............................................. 375
10.9.1 Multiple-port register file .................... 403
10.9.2 Interleaved memory .................... 403
10.9.3 Examples of dependencies .................... 404
10.9.4 Speculative execution .................... 406
10.9.5 Register renaming .................... 406
10.9.5.1 First special-purpose renaming example .................... 407
10.9.5.2 Second special-purpose renaming example .................... 409
10.9.6 ASM for the superscalar implementation .................... 412
10.9.7 Three parallel activities .................... 412
10.9.7.1 Pipeline, parallel and speculative execution .................... 412
10.9.7.2 Dealing with register renaming .................... 415
10.9.8 Verilog for the superscalar ARM .................... 416
10.9.8.1 The depend function .................... 416
10.9.8.2 Translating the ASM to Verilog .................... 418
10.9.8.3 Code coverage .................... 418
10.9.8.4 Using `ifdef for the cover task .................... 419
10.9.9 Test programs .................... 421
10.9.9.1 A test of R15 .................... 421
10.9.9.2 Our old friend: division .................... 423
10.9.9.3 Faster childish division .................... 426
10.9.9.4 Childish division with conditional instructions .................... 427
10.10 Comparison of childish division implementations .................... 430
10.11 Conclusions .................... 433
10.12 Further reading .................... 434
10.13 Exercises .................... 434
11. SYNTHESIS .................... 438
11.1 Overview of synthesis .................... 438
11.1.1 Design flow .................... 439
11.1.2 Testing approaches .................... 441
11.1.3 Tools used in this chapter .................... 442
11.1.4 The M4-128/64 CPLD .................... 442
11.2 Verilog synthesis styles .................... 444
11.2.1 Behavioral synthesis of registers .................... 444
11.2.2 Behavioral synthesis of combinational logic .................... 444
11.2.3 Behavioral synthesis of implicit style state machines .................... 445
11.2.4 Behavioral synthesis of explicit style state machines .................... 445
11.2.5 Structural synthesis .................... 445
11.3 Synthesizing enabled_register .................... 445
11.3.1 Instantiation by name .................... 447
11.3.2 Modules supplied by PLSynthesizer .................... 447
11.3.3 Technology specific mapping with PLDesigner .................... 448
11.3.4 Modules supplied by PLDesigner .................... 450
11.3.5 The synthesized design .................... 451
11.3.6 Mapping to specific pins .................... 453
11.4 Synthesizing a combinational adder .................... 454
11.4.1 Test code .................... 455
11.4.2 Alternate coding with case .................... 457
11.5 Synthesizing an implicit style bit serial adder .................... 460
11.5.1 First attempt at a bit serial adder .................... 461
11.5.2 Macros needed for implicit style synthesis .................... 462
11.5.3 Using a shift register approach .................... 462
11.5.4 Using a loop .................... 463
11.5.5 Test code .................... 464
11.5.6 Synthesizing .................... 466
11.6 Switch debouncing and single pulsing .................... 469
11.7 Explicit style switch debouncer .................... 472
11.8 Putting it all together: structural synthesis .................... 474
11.9 A bit serial PDP-8 .................... 475
11.9.1 Verilog for the bit serial CPU .................... 476
11.9.2 Test code .................... 479
11.9.3 Our old friend: division .................... 481
11.9.4 Synthesizing and fabricating the PDP-8 .................... 482
11.10 Conclusions .................... 483
11.11 Further reading .................... 484
11.12 Exercises .................... 484
Appendices
B. PDP-8 COMMANDS .................... 487
Memory reference instructions .................... 487
Non-memory reference instructions .................... 489
Group 1 microinstructions .................... 489
Group 2 microinstructions .................... 490
D. SEQUENTIAL LOGIC BUILDING BLOCKS . ................................ 525
D.1 System clock ........................................... 525
D.2 Timing Diagrams ........................................... 526
D.3 Synchronous Logic ........................................... 527
D.4 Bus timing diagrams ........................................... 528
D.5 The D-type register ........................................... 529
D.6 Enabled D-type register ........................................... 531
D.7 Up counter register ........................................... 533
D.8 Up/down counter ........................................... 535
D.9 Shift register ........................................... 536
D.10 Unused inputs ........................................... 538
D.11 Highly specialized registers ............ ............................... 540
D.12 Further Reading ........................................... 541
D.13 Exercises ........................................... 541
E. TRI-STATE DEVICES ............................................ 543
E.1 Switches .................... 543
E.1.1 Use of switches in non-tri-state gates .................... 544
E.1.2 Use of switches in tri-state gates .................... 545
E.2 Single bit tri-state gate in structural Verilog .................... 545
E.3 Bus drivers .................... 547
E.4 Uses of tri-state .................... 548
E.4.1 Tri-state buffers as a mux replacement .................... 548
E.4.1.1 How Verilog processes four-valued logic .................... 549
E.4.1.2 The tri declaration .................... 550
E.4.2 Bidirectional buses .................... 551
E.4.2.1 The inout declaration .................... 551
E.4.2.2 A read/write register .................... 552
E.5 Further Reading .................... 554
E.6 Exercises .................... 554
F. TOOLS AND RESOURCES .................... 556
F.1 Prentice Hall .................... 556
F.2 VeriWell Simulator .................... 556
F.3 M4-128/64 demoboard .................... 557
F.4 Wirewrap supplies .................... 557
F.5 VerilogEASY .................... 557
F.6 PLDesigner .................... 558
F.7 VITO .................... 558
F.8 Open Verilog International (OVI) .................... 559
F.9 Other Verilog and programmable logic vendors .................... 560
F.10 PDP-8 .................... 560
F.11 ARM .................... 560
G. ARM INSTRUCTIONS .................... 561
1. Efficient instruction set .................... 561
2. Instruction set summary .................... 561
Register Model .................... 562
H. ANOTHER VIEW ON NON-BLOCKING ASSIGNMENT .................... 564
H.1 Sequential logic .................... 564
H.2 $strobe .................... 566
H.3 Inertial versus transport delay .................... 566
H.4 Sequence preservation .................... 567
H.5 Further reading .................... 567
I. GLOSSARY
J. LIMITATIONS ON MEALY WITH IMPLICIT STYLE .................... 580
J.1 Further Reading .................... 582
List of Figures
Page
2. DESIGNING ASMs
Figure 2-1. ASM with three states .................... 8
Figure 2-2. ASM with command outputs .................... 9
Figure 2-3. Equivalent to figure 2-2 .................... 10
Figure 2-4. ASM with multi-bit output .................... 11
Figure 2-5. ASM with register output .................... 12
Figure 2-6. ASM with decision .................... 13
Figure 2-7. ASMs with external status .................... 15
Figure 2-8. Block diagram .................... 16
Figure 2-9. Two ways to test multi-bit input .................... 17
Figure 2-10. Pure behavioral block diagram .................... 20
Figure 2-11. Mixed block diagram .................... 21
Figure 2-12. ASM for friendly user interface .................... 25
Figure 2-13. Block diagram .................... 25
Figure 2-14. ASM for software paradigm (COMPUTE1 at top) .................... 26
Figure 2-15. ASM for software paradigm (COMPUTE1 at bottom) .................... 27
Figure 2-16. Incorrect four-state division machine .................... 29
Figure 2-17. Correct four-state division machine .................... 30
Figure 2-18. Incorrect user interface (throws quotient away) .................... 32
Figure 2-19. Saving quotient in r3 .................... 33
Figure 2-20. Handling quotient of zero .................... 34
Figure 2-21. Incorrect rearrangement of states .................... 35
Figure 2-22. Incorrect parallelization attempt .................... 36
Figure 2-23. Correct parallelization .................... 37
Figure 2-24. Goto-less two-state childish division ASM .................... 39
Figure 2-25. Equivalent to figure 2-24 .................... 40
Figure 2-26. Architecture using subtractor .................... 42
Figure 2-27. Architecture using ALU .................... 43
Figure 2-28. Methodical architecture .................... 44
Figure 2-29. Mixed ASM corresponding to figures 2-24 and 2-28 .................... 44
Figure 2-30. System diagram .................... 45
Figure 2-31. Mixed ASM corresponding to figures 2-14 and 2-28 .................... 46
Figure 2-32. Central ALU architecture .................... 47
Figure 2-33. Mixed ASM corresponding to figures 2-14 and 2-32 .................... 47
Figure 2-34. Controller .................... 50
Figure 2-35. Block diagram and behavioral ASM for adder .................... 53
Figure 2-36. Flattened circuit diagram for adder .................... 53
Figure 2-37. Definition of the adder module .................... 55
Figure 2-38. Definition of the full-adder module .................... 55
Figure 2-39. Definition of the half-adder module .................... 55
Figure 2-40. Hierarchical instantiation of modules .............................................. 56 ri
Figure 2-41. "Pure" behavioral block diagram .................................................... 57 Fi
Figure 2-42. Mixed block diagram ....................................... .................... 57 Fi
Figure 2-43. "Pure" structural block diagram ...................................................... 58 Fi
Fi
3. VERILOG HARDWARE DESCRIPTION LANGUAGE
Figure 3-1. Exclusive or built with ANDs, OR and inverters
Figure 3-2. Every ASM has an infinite loop
Figure 3-3. ASM corresponding to if else
Figure 3-4. ASM without else
Figure 3-5. ASM with while
Figure 3-6. Equivalent to figure 3-5
Figure 3-7. ASM needing forever
Figure 3-8. Two ways to draw if at the bottom of forever
4. THREE STAGES FOR VERILOG DESIGN
Figure 4-1. Architecture with names used for Verilog coding
Figure 4-2. Netlist for the childish division controller
5. ADVANCED ASM TECHNIQUES
Figure 5-1. Behavioral Mealy ASM
Figure 5-2. Mixed Mealy ASM
Figure 5-3. Mealy division machine with two states in loop
Figure 5-4. Incorrect Mealy division ASM
Figure 5-5. Mealy division ASM with conditional load
Figure 5-6. Mealy division ASM with conditional READY
Figure 5-7. ASM for combinational logic (decoder)
Figure 6-11. Correct pipelined ASM that fills and flushes
Figure 6-12. Single-cycle architecture
Figure 6-13. Timing diagram for single-cycle ASM
Figure 6-14. Multi-cycle architecture
Figure 6-15. Timing diagram for multi-cycle
Figure 6-16. Pipelined architecture
Figure 6-17. Timing diagram for pipelined ASM
7. ONE HOT DESIGNS
Figure 7-1. Moore ASMs and corresponding components of one hot controllers
Figure 7-2. One hot controller for ASMs of sections 2.2.2 and 2.3.3
Figure 7-3. Power-on device for one hot controllers
Figure 7-4. One hot controller for ASMs of sections 2.2.7 and 2.3.1
Figure 7-5. Architecture generated from implicit Verilog of sections 7.2.2.1 and 3.8.2.3.3
Figure 7-6. Controller generated from implicit Verilog of sections 7.2.2.1 and 3.8.2.3.3
Figure 7-7. Controller generated from implicit Verilog of sections 7.2.2.2 and 3.8.2.3.4
Figure 7-8. Controller generated from implicit Verilog of sections 7.2.2.3 and 3.8.2.3.5
Figure 7-9. Example with Moore command signal
Figure 7-10. Current command approach suitable for Moore or Mealy controller
Figure 7-11. Next command approach suitable only for Moore controller
Figure 7-12. Behavioral ASM with <- for next command approach
Figure 7-13. Example bottom testing loop
8. GENERAL-PURPOSE COMPUTERS
Figure 8-1. Block diagram of typical general-purpose computer
Figure 8-2. Symbol for memory with unidirectional data buses
Figure 8-3. Symbol for memory with bidirectional data bus
Figure 8-4. Symbol for synchronous memory
Figure 8-5. Implementation of synchronous memory
Figure 8-6. Symbol for asynchronous memory
Figure 8-7. ASM implementing four instructions of PDP-8
Figure 8-8. ASM implementing more instructions of the PDP-8
Figure 8-9. Block diagram for the PDP-8 system
Figure 8-10. System composed of processor (controller and architecture) with memory as a separate actor
APPENDIX C COMBINATIONAL LOGIC BUILDING BLOCKS
Figure C-1. Symbol for a four-bit unidirectional bus
Figure C-2. Implementation of a four-bit bus
Figure C-3. Transmitting 15 on a four-bit bus
Figure C-4. Transmitting 7 on a four-bit bus
Figure C-5. One possible routing of a four-bit bus
Figure C-6. Unnecessary device
Figure C-7. Transmitting on one bus to multiple destinations for free
Figure C-8. Implementation of figure C-7
Figure C-9. Using the same name at every node
Figure C-10. Combinational device to divide by two (three-bit output)
Figure C-11. Implementation of figure C-10
Figure C-12. Combinational device to divide by two (four-bit output)
Figure C-13. Implementation of figure C-12
Figure C-14. Combinational device to add two n-bit values (n+1 bit output)
Figure C-15. Treating the high-order bit as carry out
Figure C-16. Alternate symbol for figure C-15
Figure C-17. Adder without carry out
Figure C-18. Symbol for multiplexer
Figure C-19. Symbol for incrementor
Figure C-20. Inefficient implementation of incrementor
Figure C-21. Symbol for ones complementor
Figure C-22. Symbol for twos complementor
Figure C-23. Possible implementation of twos complementor
Figure C-24. Symbol for subtractor
Figure C-25. Symbol for shifter
Figure C-26. Symbol for shifter with shift input
Figure C-27. Symbol for barrel shifter with shift count input
Figure C-28. Possible implementation of barrel shifter
Figure C-29. Symbol for multiplier
Figure C-30. Symbol for Arithmetic Logic Unit (ALU)
Figure C-31. Possible implementation of ALU
Figure C-32. Symbol for comparator
Figure C-33. Symbol for equality comparator
Figure C-34. Symbol for demux
Figure C-35. Misuse of demux
Figure C-36. Proper design omits demux
Figure C-37. Symbol for binary to unary decoder
Figure C-38. Possible implementation of decoder
Figure C-39. Alternate implementation of decoder
Figure C-40. Symbol for priority encoder
Figure C-41. Symbol for a Read Only Memory (ROM)
Figure C-42. Possible implementation of a ROM
APPENDIX D SEQUENTIAL LOGIC BUILDING BLOCKS
Figure D-1. Universal connection to system clock signal shown
Figure D-2. Universal connection to system clock signal assumed
Figure D-3. An analog waveform for the system clock signal
Figure D-4. A digital abstraction of the system clock signal
Figure D-5. The system clock divides time into cycles
Figure D-6. An ideal synchronous timing diagram
Figure D-7. A realistic synchronous timing diagram with propagation delay
Figure D-8. An asynchronous timing diagram
Figure D-9. Timing diagram showing individual bits of a bus
Figure D-10. Timing diagram showing numeric values on a bus in decimal
Figure D-11. Notation used for bus timing diagrams
Figure D-12. Symbol for D-type register
Figure D-13. Example timing diagram for D-type register
Figure D-14. Another timing diagram for D-type register
Figure D-15. Symbol for enabled D-type register
Figure D-16. Example timing diagram for enabled D-type register
Figure D-17. Implementation of enabled D-type register using simple D-type and mux
Figure D-18. Symbol for up counter register
Figure D-19. Implementation of up counter register using simple D-type register and combinational logic
Figure D-20. Symbol for up/down counter register
Figure D-21. Implementation of up/down counter register
Figure D-22. Symbol for shift register
Figure D-23. Implementation of shift register
Figure D-24. Symbols for other registers
Figure D-25. Implementations for these registers using a loadable clearable up counter
Figure D-26. Symbol for a non-clearable up counter
Figure D-27. Possible implementation using a clearable up counter
Figure D-28. Symbol for a non-clearable shift register
Figure D-29. Possible implementation using a clearable shift register
Figure E-5. A gate producing z as output
Figure E-6. A tri-state gate
Figure E-7. Effect of tri-state gate when enable is 1
Figure E-8. Effect of tri-state gate when enable is 0
Figure E-9. Tri-state bus driver
Figure E-10. Using tri-state bus drivers to form a mux
Figure E-11. One bidirectional bus
Figure E-12. Two unidirectional buses
Figure E-13. A read/write register with a bidirectional bus
Figure E-14. Implementation of figure E-13
Figure E-15. Instantiation of two read/write registers
Preface
When I started teaching Verilog to electrical engineering and computer science seniors at
the University of Wyoming, there were only two books and a handful of papers on the
subject, in contrast to the overwhelming body of academic literature written about VHDL.
Previously, VHDL had been unsuccessful in this course. For all its linguistic merits, VHDL
is too complex for the first-time user. Verilog, on the other hand, is much more
straightforward and allows the first-time user to focus on the design rather than on language
details. Yet Verilog is powerful enough to describe very exotic designs, as illustrated in
chapters 8-11.
As its subtitle indicates, this book emphasizes the algorithmic nature of digital computer
design. This book uses the manual notation of Algorithmic State Machine (ASM) charts
(chapter 2) as the master plan for designs. This book uses a top-down approach, which is
based on the designer's faith that details can be ignored at the beginning of the design
process, so that the designer's total effort can be to develop a correct algorithm.
Chapters 2-11 use the same elementary algorithm, referred to as the childish division
algorithm, for many hardware and software examples. Because this algorithm is so simple, it
allows the reader to focus on the Verilog and computer design topics being covered by each
chapter. This book is unique in showing the correspondence of ASM charts to implicit style
Verilog (chapters 3, 5 and 7). All chapters emphasize a feature of Verilog, known as
non-blocking assignment or Register Transfer Notation (RTN), which is the main distinction
between software and synchronous hardware. Except for chapter 6, this book ignores
(abstracts away) propagation delay. Instead, the emphasis here is toward designs that are
accurate on a clock cycle by clock cycle basis with non-blocking assignment. (Many existing
Verilog books either provide too much propagation delay information or are so abstract
as to be inaccurate on a clock cycle basis. Appendices C and D motivate the abstraction
level used here.)
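As a minimal sketch of why non-blocking assignment separates synchronous hardware from
software, consider the following Verilog fragment. It is not one of the designs in this
book: the module name swap_demo, the registers a and b, and the hand-driven clk are
hypothetical names chosen only for illustration. Both right-hand sides are sampled at the
rising clock edge before either register changes, so the two registers exchange values.
The same two statements written with blocking assignment (=), executed one after the other
as software would, would leave both registers holding 5.

```verilog
// Illustrative sketch only; all names here are assumptions, not from the book's designs.
module swap_demo;
  reg clk;
  reg [7:0] a, b;

  // Non-blocking assignment: the old values of a and b are read at the
  // clock edge, and both registers are updated together afterwards.
  always @(posedge clk) begin
    a <= b;
    b <= a;
  end

  initial begin
    clk = 0;
    a = 8'd3;
    b = 8'd5;
    #1 clk = 1;                        // produce a single rising edge
    #1 $display("a=%0d b=%0d", a, b);  // prints: a=5 b=3
    $finish;
  end
endmodule
```

The exchange needs no temporary variable because, on a clock cycle by clock cycle basis,
every register sees the values that existed before the edge.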
Chapter 4 gives a novel three-stage design process (behavioral, mixed, structural), which
exercises the reader's understanding of many elementary features of Verilog. Chapter 7
explains an automated one hot preprocessor, known as VITO, that eliminates the need to go
through this manual three-stage process.
This book defers the introduction of Mealy machines until chapter 5 because my experience
has been that the complex interactions of decisions and non-blocking assignments in a
Mealy machine are confusing to the first-time designer. Understanding chapter 5 is only
necessary to understand chapters 9 and 10, appendix J and sections 7.4 and 11.6.
The goal is to emphasize a few enduring concepts of computer design, such as pipelined
(chapters 6 and 9) and superscalar (chapter 10) approaches, and show that these concepts
are a natural outgrowth of the non-blocking assignment. Chapter 6 uses ASM charts and
implicit Verilog to describe pipelining of a special-purpose machine with only the material
of chapter 4. Chapters 8, 9 and 11 use the classic PDP-8 as an illustration of the basic
principles of a stored program computer and cache memory. Chapter 8 depends only on the
ASM material of chapter 2. Chapter 9 requires an understanding of all preceding chapters,
except chapter 7. The capstone of this book, chapter 10 (which depends on chapter 9), uses
the elegant ARM instruction set to explore the RISC approach, again with the unique
combination of ASMs, implicit Verilog and non-blocking assignment.
Chapters 3-6, 9 and 10 emphasize Verilog simulation as a tool for uncovering bugs in a
design prior to fabrication. Test code (sometimes called a testbench) that simulates the
operating environment for the machine is given with most designs. Chapter 10 introduces
the concept of Verilog code coverage. Chapters 7 and 11, which are partially accessible to
a reader who understands chapter 3, use specific synthesis tools for programmable logic to
illustrate general techniques that apply to most vendors' tools. Even in synthesis,
simulation is an important part of the design flow. Chapter 11 will be much more meaningful
after the reader has grasped chapters 1-9. The designs in chapter 11 have been tested and
downloaded (www.phptr.com) into Vantis CPLDs using a tool available to readers of this
book (appendix F), but these designs should also be usable with minor modifications for
other chips, such as FPGAs.
Appendices A, B and G give background on the machine language examples used in
chapters 8-11. Appendices C and D give the block diagram notation used in all chapters for
combinational logic and sequential logic, respectively. Chapters 1-11 do not use tri-state
bidirectional buses, but appendix E explains the Verilog coding of such buses.
This book touches upon several different areas, such as "computer design," "state machine
design," "assembly language programming," "computer organization," "computer arithmetic,"
"computer architecture," "register transfer logic design," "hardware/software trade-offs,"
"VLSI design," "parallel processing" and "programmable logic design." I would ask the
reader not to try to place this book into the pigeon hole of some narrow academic category.
Rather, I would hope the reader will appreciate in all these digital and computer design
topics the common thread which the ASM and Verilog notations highlight. This book just
scratches the surface of computer design and of Verilog. Space limitations prevented
inclusion of material on interfacing (other than section 11.6) and on multiprocessing. The
examples of childish division, PDP-8 and ARM algorithms were chosen for their simplicity.
Sections labeled "Further reading" at the end of most chapters indicate where an interested
reader can find more advanced concepts and algorithms, as well as more sophisticated
features of Verilog. Appendix F indicates postal and Web addresses for obtaining additional
tools and resources. It is hoped that the simple examples of Verilog and ASMs in this book
will enable the reader to proceed to these more advanced computer design concepts.
In places, this book states my opinions rather boldly. I respect readers who have differing
interpretations and methodologies, but I would ask such readers to look past these
distinctions to the unique and valuable approaches in this book that are not found elsewhere. I
have sprinkled (somewhat biased) historical tidbits, primarily from the first quarter century
of electronic computer design, to illustrate how enduring algorithms are, and how transient
technology is. Languages are more algorithmic than they are technological. Just look at the
endurance of the COBOL language for business software. Hardware description languages
will no doubt change as the twenty-first century unfolds, but I suspect whatever they
become, they will include something very much like contemporary implicit style Verilog.
dedicated to the memory of my father,
Gordon William Arnold,
whose encouragement and sense of humor
in the last year of his life stimulated
the writing of this book.
He was a wonderful dad,
and I cherished him.
1. WHY VERILOG COMPUTER DESIGN?
1.1 What is computer design?
A computer is a machine that processes information. A machine, of course, is some
tangible device (i.e., hardware) built by hooking together physical components, such
as transistors, in an appropriate arrangement. Processing occurs when the machine
follows the steps of a mathematical algorithm. Information is represented in the machine
by bits, each of which is either 0 or 1. This book only considers digital information
(i.e., bits) and does not consider analog information. Analog information can be
approximated by digital information by using a sufficient number of bits.
Computer design is the thought process that arrives at how to construct the tangible
hardware so that it implements the desired algorithm. The goal is to turn an algorithm
into hardware. Computer designers have two ways to look at the machines they build:
the way they act (known as the behavioral viewpoint, which is closely related to
algorithms), and the way they are built (known as the structural viewpoint, which is like a
"blueprint" for building the machine).
... computer for solving simultaneous equations. During World War II, several computers
were built, including the Colossus (in Great Britain), which was used to break coded German
messages. In 1945, the mathematician John von Neumann popularized the idea of a
general-purpose computer, and his name is often synonymous with a machine that implements
the fetch/execute algorithm. The first operational general-purpose computer was the
Manchester Mark I, which was a vacuum tube machine built in England that ran its first
program in 1948.
In the 1950s, general-purpose vacuum tube computers cost millions of dollars, and
only large corporations and governments owned them. The next major technological
advance came with the invention of the transistor, which can do the same thing that a
vacuum tube can do faster and more economically. Transistors also have the
advantages that they run cooler and have a longer life than vacuum tubes.
This, of course, lowered the cost of general-purpose computers so that smaller
corporations could own them, but it also made the application of digital design practical.
Digital designs are special-purpose computers built using electronic circuits that process
bits. Devices like digital watches, digital microwave oven timers, digital thermostats,
hand-held calculators, etc. are all controlled by special-purpose computers that became
economical with the invention of the transistor and related digital electronics.
In the 1960s, it became possible to manufacture hundreds or thousands of transistors
on a chip of semiconductor material, known as an integrated circuit, at very low cost.
Integrated circuits made it possible to mass-produce general-purpose computers, as
well as digital electronic chips. Special- and general-purpose computers are now so
powerful and affordable that they are part of almost every complex device built, from
children's toys to the space shuttle.
Since the 1960s, there have been continual improvements in semiconductor
technologies. It is now possible to get millions of transistors on a single chip. Of course,
today's chips cost a fraction of the price of, and run faster and cooler than, their
predecessors. But the algorithms that these chips implement are similar to the algorithms
implemented with earlier technologies.
1.3 Translating algorithms into hardware
In the beginning, hardware designers were programmers and vice versa. The world of
hardware design and software design fragmented into separate camps during the 1950s
and 1960s as advancing technology made software programming easier. The industry
needs many more programmers than hardware designers and programmers require far
less knowledge of the physical machine than hardware designers. Despite this, the role
of software designers and hardware designers is essentially the same: solve a problem.
1.4 Hardware description languages
Unfortunately, hardware designers were inundated with the overwhelming technological
changes that occurred with semiconductor electronics. Many hardware designers
lost track of the advances in design methodology that occurred in software. Around
1980, as semiconductor technology advanced, it got more and more difficult to design
hardware. Up to that time, most hardware design was done manually. Designers
realized that the ever-increasing power of general-purpose computers could be harnessed
to aid them in designing the next generation of chips. The goal of using the current
generation of general-purpose computers to help design the next generation of special-
and general-purpose computers required bringing the worlds of hardware and
software back together again.
Out of this union was born the concept of the Hardware Description Language (HDL).
Being a computer language, an HDL allows use of many of the timesaving software
methodologies that hardware designers had been lacking. But as a hardware language,
the HDL allows the expression of concepts that previously could only be expressed by
manual notations, such as the ASM notation and circuit diagrams.
As technology advances, the details about HDLs will undoubtedly change in the future,
but studying an HDL instills fundamental concepts that will endure. These ideas,
originally thought of as hardware concepts, are becoming more important in software
due to the increased importance of software parallel processing and object-oriented
programming. There is a deep theoretic similarity between the concepts in software
fields (such as operating systems and data structures) and the concepts in computer
design. The growing popularity of HDLs attests to this fact: hardware is becoming
more like software, and vice versa.
Chapter 3 discusses a popular HDL, known as Verilog, which is easy to learn because
it has a syntax similar to C and Pascal. Verilog was developed in the early 1980s by
Philip Moorby as a proprietary HDL for a company that was later acquired by
Cadence Design Systems, which put the Verilog standard into the public domain. It is now
1.5 Typography
Fonts are used in this book to distinguish between different kinds of text, as explained
below:
Times            is used in the bulk of the text for discussion.
Bold Times       is used to emphasize important or surprising concepts.
Italic           is used for the definition of an important term or phrase.
Courier          is used for Verilog text, exactly as it is typed into the file, and
                 for similar notations taken from ASM charts, hardware diagrams,
                 and simulation results. This font is also used for parts of other
                 high level languages, such as C.
Bold Courier     is used for parts of Verilog text that are important in the
                 discussion that precedes or follows them.
Italic Courier   is used to describe parts of Verilog syntax, such as a statement,
                 which can be replaced with some particular symbol, such as while.
                 Also, it is used to highlight complex simulation results.
1.6 Assumed background
It is assumed that the reader has a reasonable amount of experience programming in a
conventional high-level language, such as C, C++, Java or Pascal. Programming experience
in assembly language (appendices A, B and G) is very helpful. It is assumed that
the reader can understand binary, octal and hexadecimal notations, can convert these to
and from decimal and can perform arithmetic in these bases. It is also assumed that the
reader is familiar with the common combinational logic gates (AND, OR, NOT, etc.),
and that the reader knows about the common digital building blocks used in digital
design (appendices C, D and E).
1.7 Conclusion
The few computers built in the nineteenth century were based on classical mechanics
(cams and gears visible to the naked eye). Almost all the computers built in the twenti-
eth century have been based on electronics. It is hard to say what technologies will be
prevalent for computers in the twenty-first century.
Conventional semiconductor technology will someday reach its limit (based on the
minimum size of a transistor and the speed of light). Technologies based on recombinant
DNA, photonics, quantum mechanics, superconductivity and nanomechanics (cams
and gears built of individual atoms) are all contenders to be the computer technology of
the twenty-first century. The point is that it does not matter: technology changes every
day, but concepts endure. The intellectual journey you travel by turning an algorithm
into hardware illustrates these enduring concepts. I hope you enjoy the journey!
of the more advanced concepts in chapter 7 and chapters 9 and above require the reader
to understand Mealy notation. At this time, we will ignore the use of ovals and
concentrate only on ASM charts for Moore machines.
Each rectangle is said to describe a state. A label, such as a number or preferably a
meaningful name, can be written on the outside of the rectangle. The term present state
refers to which rectangle of the ASM chart is active during a particular clock period.
The term next state indicates which rectangle of the ASM chart will be active during
the next clock period. The ASM chart indicates how to determine the next state (given
the present state) by an arrow that points from the rectangle of the present state to the
rectangle of the next state. Each arrow eventually arrives at one of the rectangles in the
ASM chart. Since it has a finite number of rectangles, there is at least one loop in an
ASM chart. An ASM chart is said to describe a particular finite state machine. Unlike
software, there is no way to stop or halt a finite state machine (unless you pull the
plug).
There is a relationship between the ASM chart and its behavior. For example, consider
the following ASM chart with three states:
[ASM chart: three states, GREEN, YELLOW and RED, with RED leading back to GREEN]
Assuming that we start in state GREEN, and that the clock has a period of 0.5 seconds,
the ASM chart will make the following state transitions forever:
       present   next
time   state     state
0.0    GREEN     YELLOW
0.5    YELLOW    RED
1.0    RED       GREEN
1.5    GREEN     YELLOW
2.0    YELLOW    RED
2.5    RED       GREEN
For example, between 0.5 and 1.0, the ASM is in state YELLOW. It is again in state
YELLOW between 2.0 and 2.5.
The following sections explain the commands that can occur in rectangles of an ASM
chart, the decisions that can occur in diamonds of an ASM chart, the input and output
connections to a machine described by an ASM chart and issues of ASM chart style.
Although the examples in the following sections vaguely resemble a traffic light
controller, they are not intended to solve such a practical problem. They are instead
intended solely to illustrate ASM chart notation and style.
2.1.1 ASM chart commands
Normally, the rectangle for a state is not empty. There are three command notations that
a designer can choose to put inside the rectangle, which are described in the following
sections.
[ASM chart: states GREEN, YELLOW and RED; STOP appears inside the rectangles for
YELLOW and RED]
STOP will be 1 when the ASM is in state RED or state YELLOW. STOP will be 0 when
the ASM is in state GREEN. The following illustrates this situation:
present next
time state state
0.0 GREEN YELLOW STOP=0
0.5 YELLOW RED STOP=1
1.0 RED GREEN STOP=1
1.5 GREEN YELLOW STOP=0
2.0 YELLOW RED STOP=1
2.5 RED GREEN STOP=1
... ... ... ...
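The table above can be reproduced with a few lines of software. The following is a minimal Python sketch, not part of the original design flow; the names NEXT_STATE, STOP_STATES and simulate are hypothetical and chosen only for illustration. It assumes the three-state chart with STOP mentioned in YELLOW and RED and a default value of 0 elsewhere.

```python
# Hypothetical names; a sketch of the three-state Moore machine described above.
NEXT_STATE = {"GREEN": "YELLOW", "YELLOW": "RED", "RED": "GREEN"}
STOP_STATES = {"YELLOW", "RED"}  # rectangles that mention STOP


def simulate(cycles, start="GREEN", period=0.5):
    """Return one (time, present state, next state, STOP) row per clock period."""
    rows = []
    state = start
    for n in range(cycles):
        # A command not mentioned in the present state takes its default (0).
        stop = 1 if state in STOP_STATES else 0
        rows.append((n * period, state, NEXT_STATE[state], stop))
        state = NEXT_STATE[state]
    return rows


if __name__ == "__main__":
    for time, present, nxt, stop in simulate(6):
        print(f"{time:3.1f}  {present:7} {nxt:7} STOP={stop}")
```

Because STOP depends only on the present state, the sketch computes it before moving to the next state, which is the Moore-machine behavior the text describes.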
2.1.1.2 Outputting a multi-bit command
When the name of a signal is on the left of an equal sign (=) inside a rectangle, that
signal takes on the value specified on the right of the equal sign during the state
corresponding to the rectangle in question. In other state rectangles, where that signal
is not mentioned, that signal takes on its default value.
The following two diagrams show ASM charts that use =. The first of these ASMs is
equivalent to the ASM given in section 2.1.1. The second example introduces a two-bit
bus SPEED whose default value is 00.
[ASM chart: states GREEN, YELLOW and RED]
Figure 2-3. Equivalent to figure 2-2.
2.1.1.3 Register transfer
The last two notations are simply a way of indicating how state names translate into
physical signals, such as STOP and SPEED. Although we will eventually find many
uses for these two notations, they are by themselves not the most convenient way to
describe an algorithm.
Most algorithms manipulate variables that change their values during the course of the
computation. It is necessary to have a place to store such values. Eventually, the de-
signer will choose some kind of synchronous hardware register (appendix D) to hold
such temporary values. In ASM chart notation, it is not necessary to make this design
decision in order to describe an algorithm. Register Transfer Notation (RTN) (denoted
by an arrow inside a rectangle) tells what happens to the register on the left of the arrow
at the beginning of the next clock cycle. If a particular register is not mentioned on the
left of an arrow in a state, the value of that register will remain the same in the next
clock cycle. For example,
[ASM chart: GREEN (SPEED=3); YELLOW (STOP, SPEED=1, COUNT ← COUNT+1);
RED (STOP, COUNT ← COUNT+2)]
Figure 2-5. ASM with register output.
2.1.2.1 Relations
Relational operators (==, <, >, <=, >=, !=) as well as logical operators (&&, ||, !) can
occur inside a diamond. It is also permissible to use the shorter bitwise operators (&, |, ^, ~)
inside a diamond when all of the operands are only one-bit wide. When the relation in
the diamond involves registers also used in the rectangle pointing to that diamond, the
action taken is often different than would occur in software. Because the decision in the
diamond occurs at the same time as the operations described in the rectangle, you
ignore whatever register transfer occurs inside the rectangle to decide what the next
state will be. The register transfer is an independent issue, which will only take effect at
the beginning of the next clock cycle. As an illustration of such a decision, consider:
[ASM chart: GREEN (SPEED=3); YELLOW (STOP, SPEED=1, COUNT ← COUNT+1) followed by a
diamond that tests COUNT]
       present   next
time   state     state
0.0    GREEN     YELLOW   STOP=0   SPEED=11   COUNT=000
0.5    YELLOW    RED      STOP=1   SPEED=01   COUNT=000
1.0    RED       GREEN    STOP=1   SPEED=00   COUNT=001
1.5    GREEN     YELLOW   STOP=0   SPEED=11   COUNT=011
2.0    YELLOW    YELLOW   STOP=1   SPEED=01   COUNT=011
2.5    YELLOW    YELLOW   STOP=1   SPEED=01   COUNT=100
3.0    YELLOW    YELLOW   STOP=1   SPEED=01   COUNT=101
3.5    YELLOW    YELLOW   STOP=1   SPEED=01   COUNT=110
4.0    YELLOW    YELLOW   STOP=1   SPEED=01   COUNT=111
4.5    YELLOW    RED      STOP=1   SPEED=01   COUNT=000
5.0    RED       GREEN    STOP=1   SPEED=00   COUNT=001
...    ...       ...      ...      ...        ...
The highlighted line shows the last time the ASM is in state YELLOW. The next state
is RED because COUNT is 000.
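The interplay between a register transfer and a decision in the same state can be checked in software. The following Python sketch is an illustration only: it assumes COUNT is a three-bit register, that RED adds 2 to COUNT as in figure 2-5, and that the diamond after YELLOW loops back to YELLOW while COUNT is nonzero (an assumption inferred from the table, since the chart itself is not reproduced here). The function name simulate is hypothetical.

```python
def simulate(cycles, period=0.5):
    """Return rows (time, present, next, STOP, SPEED, COUNT) for the ASM sketch."""
    state, count = "GREEN", 0  # COUNT is assumed to be 3 bits wide
    rows = []
    for n in range(cycles):
        stop, speed = 0, 0      # command outputs revert to their defaults
        next_count = count      # a register holds its value unless assigned
        if state == "GREEN":
            speed = 3
            next_state = "YELLOW"
        elif state == "YELLOW":
            stop, speed = 1, 1
            next_count = (count + 1) % 8
            # The decision sees the present COUNT, not the scheduled next_count.
            next_state = "YELLOW" if count != 0 else "RED"
        else:  # RED
            stop = 1
            next_count = (count + 2) % 8
            next_state = "GREEN"
        rows.append((n * period, state, next_state, stop, speed, count))
        state, count = next_state, next_count  # clock edge: both change together
    return rows


if __name__ == "__main__":
    for t, present, nxt, stop, speed, count in simulate(11):
        print(f"{t:3.1f} {present:7} {nxt:7} STOP={stop} SPEED={speed:02b} COUNT={count:03b}")
```

Keeping next_count separate from count until the end of the loop body is what models "takes effect at the beginning of the next clock cycle".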
2.1.2.2 External status
Many hardware systems are composed of independent actors working cooperatively
but in parallel to each other. We use actor as an ambiguous term that incorporates other
digital hardware (i.e., special- and general-purpose computers) as well as non-digital
hardware and people who communicate with the machine described by an ASM chart.
From a designer's standpoint, the details of the other actors are normally unimportant.
These actors need to send information to the machine described by the ASM chart.
When such external information can be represented in only one bit, it is known as
external status. (Multi-bit signals can be broken down into several single-bit status
signals if desired.) External status signals have names that are simply labels for
physical wires connecting the machine that implements the ASM chart to the outside world.
By convention, the name of a status signal can occur by itself inside a diamond. The
meaning of such a diamond is the same as testing if the status signal is equal to one. For
example,
arrows pointing into the black box, and outputs as specified by arrows pointing out of
the black box. As is standard notation in all hardware structure diagrams, when the
input or output ports are more than one bit wide, the width is specified by a slash.
The following is a block diagram of the machine described by the ASM chart in 2.1.2.2:
[Block diagram: EXAMPLE ASM MACHINE with status input WALKBUTTON and outputs STOP,
SPEED (2 bits) and COUNT (3 bits)]
Figure 2-9. Two ways to test multi-bit input.
It is natural for designers to treat yes/no information as status. In most other cases, it is
easier for the designer to consider something as a data input than to consider it as a
status input, as the above ASMs illustrate. Inputs used on the right of register transfers
must be treated as data inputs.
In the block diagram above in figure 2-8, there are no external data inputs.
2.1.3.2.1 External command outputs
External command outputs are generated as described in sections 2.1.1.1 and 2.1.1.2.
They are a function only of the present state of the ASM (assuming, as we have been so
far, that ovals are not present in the ASM so that it represents a Moore machine).
Command outputs do not retain their value when changing from one state to the next. If a
particular command output is not mentioned in the next state, it reverts to its default
value. In the block diagram given in section 2.1.3, the external command outputs are
STOP and SPEED.
Arbitrary next states in a flowchart are just like gotos in software. Therefore, we will
mostly use "goto-less" style ASM charts. Such ASMs are limited to decisions that are
analogous to the high-level language style that is nowadays standard practice in
software. In other words, we will try to make our decisions act like high-level language if
statements, and our loops act like high-level language while statements. Although it
is technically possible to make an ASM chart look like a plate of spaghetti, the goal of
the goto-less style is to avoid such a mess. On rare occasions, there may be a
compelling need to use an ASM chart which violates the goto-less style.
the ASM chart describes the passage of time relating to the hardware system clock (as
explained in sections 2.1.1 and 2.1.2) and the machine described by the ASM chart
connects to the external world via hardware ports (section 2.1.3). A practical example
of taking a simple problem and exploring various solutions using pure behavioral ASM
charts is given in section 2.2. The only kind of structure that exists in the pure
behavioral stage consists of the input and output ports, as illustrated by the
following:
[Block diagram: MACHINE with EXTERNAL STATUS INPUTS and EXTERNAL DATA INPUTS entering,
and EXTERNAL COMMAND OUTPUTS and EXTERNAL DATA OUTPUTS leaving]
Figure 2-10. Pure behavioral block diagram.
2.1.5.2 Mixed
A pure behavioral ASM chart is merely the statement of an algorithm with precise
timing information and includes an indication of which operations occur in parallel. It
does not describe precisely what hardware components implement the computation.
The goal of computer design is to arrive at a "blueprint" of a physical machine. The
pure behavioral ASM chart is merely a description of what the designer wants the
machine to do. It does not tell how to connect the physical components together.
Software people wonder why the problem is not done upon completing the behavioral ASM.
After all, we do have a solution (an algorithm). Hardware people wonder why we
spend so much time with ASM charts. After all, we do not yet have a solution (physical
hardware). The answer to both groups is: have patience. The pure behavioral stage is
important because it enhances the likelihood the designer will produce a correct
solution. The next stage, which is known as the mixed stage, accomplishes part of the
transformation from the algorithm into a physical hardware structure.
The mixed stage of the top-down design process partitions the problem into two
separate but interdependent actors: the controller and the architecture. The architecture
(sometimes called the datapath) is the place where physical hardware registers will
implement the register transfers originally conceived in the pure behavioral stage. The
architecture also contains combinational logic circuits that perform computations
required by the algorithm. What the architecture cannot do by itself is sequence events
according to the master plan given in the behavioral ASM. This is why the controller
exists as an independent actor. The controller tells the architecture what to do during
each clock cycle so that the master plan is carried out. Although it may seem the
concepts of controller and architecture make things more complicated, in fact working in
this fashion simplifies the thought process. In theory, it is possible to design a machine
in an extreme way that either has no architecture or has no controller. Such extreme
designs are as unnatural to think about as software without variable declarations.
The controller issues commands (as explained in sections 2.1.1.1 and 2.1.1.2) instead
of RTN. The architecture receives and acts upon those commands and responds by
outputting status. The controller makes decisions based on such status signals received
from the architecture (as explained in section 2.1.2.2) instead of relational decisions. It
is still possible to draw an ASM chart at this stage of the design, but the ASM chart only
describes the independent action of the controller (in terms of commands and status),
rather than the complete behavior of the system. This is what top-down design is all
about: moving from one master plan (the behavioral ASM) to greater detail on how to
carry out the master plan (the mixed ASM). The hardware structure in the mixed stage
now has more detail. From the standpoint of the outside world, the mixed stage is
identical to the pure behavioral stage, but internally we now see the interconnection of
the controller and the architecture.
[Block diagram: MACHINE containing the interconnected controller and architecture]
Figure 2-11. Mixed block diagram.
Although, in theory, the architecture could be described by ASM chart(s), it is usually
more effective to use a hardware structure diagram. This is because a single ASM chart
for the architecture could easily have billions of states (corresponding to all the
combinations of values that all the registers in the system could have). Therefore, at the
mixed level of abstraction, we use an ASM chart to describe the controller (which still
has the same number of states) but use a hardware block diagram to describe the
architecture. This stage of the design is known as mixed because it is a mixture of behavior
and structure. Examples of translating some of the pure behavioral solutions of section
2.2 into mixed behavioral controller/structural architecture solutions are given in
section 2.3.
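To see why "billions of states" is no exaggeration, it helps to count the combinations of register values. The Python sketch below is only an illustration; the register names and the 12-bit widths are assumptions chosen for the example rather than a description of any particular architecture.

```python
# Assumed register widths in bits; names and sizes are illustrative only.
REGISTER_WIDTHS = {"r1": 12, "r2": 12, "r3": 12}


def architecture_state_count(widths):
    """Number of distinct value combinations the registers can hold."""
    total_bits = sum(widths.values())
    return 2 ** total_bits


if __name__ == "__main__":
    print(architecture_state_count(REGISTER_WIDTHS))
```

With three 12-bit registers there are 36 bits in total, so an ASM chart for such an architecture would need 2 to the 36th power rectangles, which is tens of billions.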
2.1.5.3 Pure structure
The final stage of the design process is to implement the ASM chart for the controller
as a hardware structure. This translation from the mixed stage to the pure structural
stage is quite mechanical, and in fact software tools exist that create controller
hardware automatically. One can simply describe the controller as a table that says, given
the present state and status inputs, what the next state and command outputs are.
Various techniques exist to turn such a table into a hardware structure. For example,
such a table can be burned into a Read Only Memory (ROM). The only other hardware
required for the controller besides the ROM is a register to hold the present state.
Examples of translating some of the mixed ASM charts of section 2.3 into pure structural
solutions are given in section 2.4.
Figure 2-14. ASM for software paradigm (COMPUTE1 at top).
Although in the following example there is a diamond in state IDLE involving an external
status signal, the original software algorithm does not mention this status signal (pb),
and so the software paradigm is preserved in this example.
The only difference between these two ASMs is whether state COMPUTE1 is at the
top of the loop or at the bottom of the loop. Since these ASMs exactly model the way
software executes one statement at a time (one software statement per ASM rectangle),
whether r1 or r2 gets a value assigned first is irrelevant, because this was also
irrelevant in software.
The value of x is assigned to the register r1 in state IDLE. Although this could have
been done in an additional state, since we have assumed (see section 2.2.1) that the user
waits at least two clock cycles when READY is asserted before pushing pb, the
initialization of r1 can occur here. The value of x will not be loaded into r1 until the second of
these two clock cycles. If pb is true, the ASM proceeds to state INIT, which will
eventually cause r2 to change. If pb is false, as would be the case most of the time, state
IDLE simply loops to itself. Since state IDLE leaves r2 alone and r2 typically
contains the last quotient, this user interface allows the user as much time as required to
view the quotient. The user interface, not the division algorithm, requires that r2 be
assigned after the pb test.
State INIT makes sure that r2 is 0 at the time the ASM enters state TEST. State TEST
checks if r1>=y, just as the while statement does in software. States COMPUTE1
and COMPUTE2 implement each software assignment statement as RTN commands
in separate clock cycles.
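The state-by-state behavior just described can be sketched in Python. This is an illustrative model only: it assumes 12-bit registers, starts in the IDLE cycle in which pb is 1 (so r1 already holds x), and uses the hypothetical names MASK and divide_asm. The unknown value of r2 is represented by None.

```python
MASK = 0xFFF  # assumed 12-bit registers


def divide_asm(x, y):
    """Trace the software-paradigm ASM (COMPUTE1 at top) until it returns to IDLE."""
    if y <= 0:
        raise ValueError("y must be positive or the loop never ends")
    state, r1, r2 = "IDLE", x & MASK, None  # IDLE cycle in which pb is pushed
    trace = []
    while True:
        trace.append((state, r1, r2))
        if state == "IDLE":
            if len(trace) > 1:       # back in IDLE: the quotient is in r2
                return trace
            state = "INIT"
        elif state == "INIT":
            r2, state = 0, "TEST"
        elif state == "TEST":        # like the software while test
            state = "COMPUTE1" if r1 >= y else "IDLE"
        elif state == "COMPUTE1":
            r1, state = (r1 - y) & MASK, "COMPUTE2"
        else:                        # COMPUTE2
            r2, state = (r2 + 1) & MASK, "TEST"


if __name__ == "__main__":
    for row in divide_asm(14, 7):
        print(row)
```

Because each rectangle holds a single transfer and the decision sits in its own state, updating the variables one at a time is faithful here; that stops being true once states are merged.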
Both of these ASMs work when x < y. For example, the following shows how the
ASMs proceed when x is 5 and y is 7 (all values are shown in decimal for ease of
understanding):
IDLE      r1=   ?  r2=   ?  pb=0  ready=1
IDLE      r1=   5  r2=   ?  pb=1  ready=1
INIT      r1=   5  r2=   ?  pb=0  ready=0
TEST      r1=   5  r2=   0  pb=0  ready=0
IDLE      r1=   5  r2=   0  pb=0  ready=1
The way each of the above ASMs operates is slightly different when x >= y. The
following shows how the ASM with COMPUTE1 at the top of the loop proceeds when
x is 14 and y is 7:
IDLE      r1=   ?  r2=   ?  pb=0  ready=1
IDLE      r1=  14  r2=   ?  pb=1  ready=1
INIT      r1=  14  r2=   ?  pb=0  ready=0
TEST      r1=  14  r2=   0  pb=0  ready=0
COMPUTE1  r1=  14  r2=   0  pb=0  ready=0
COMPUTE2  r1=   7  r2=   0  pb=0  ready=0
TEST      r1=   7  r2=   1  pb=0  ready=0
COMPUTE1  r1=   7  r2=   1  pb=0  ready=0
COMPUTE2  r1=   0  r2=   1  pb=0  ready=0
TEST      r1=   0  r2=   2  pb=0  ready=0
IDLE      r1=   0  r2=   2  pb=0  ready=1
IDLE      r1=   ?  r2=   2  pb=0  ready=1
The time to compute the quotient with this ASM includes at least two clock periods in
state IDLE, a clock period in state INIT, and the time for the loop. The number of times
through the loop is the same as the final quotient (r2). Since there are three states in the
loop, the total time to compute the quotient is at least 3+3*quotient.
Here is what happens with the ASM that has COMPUTE2 at the top of the loop:
IDLE      r1=   ?  r2=   ?  pb=0  ready=1
IDLE      r1=  14  r2=   ?  pb=1  ready=1
INIT      r1=  14  r2=   ?  pb=0  ready=0
TEST      r1=  14  r2=   0  pb=0  ready=0
COMPUTE2  r1=  14  r2=   0  pb=0  ready=0
COMPUTE1  r1=  14  r2=   1  pb=0  ready=0
TEST      r1=   7  r2=   1  pb=0  ready=0
COMPUTE2  r1=   7  r2=   1  pb=0  ready=0
COMPUTE1  r1=   7  r2=   2  pb=0  ready=0
TEST      r1=   0  r2=   2  pb=0  ready=0
IDLE      r1=   0  r2=   2  pb=0  ready=1
IDLE      r1=   ?  r2=   2  pb=0  ready=1
IDLE      r1=   ?  r2=   ?  pb=0  ready=1
IDLE      r1=  14  r2=   ?  pb=1  ready=1
INIT      r1=  14  r2=   ?  pb=0  ready=0
COMPUTE2  r1=  14  r2=   0  pb=0  ready=0
COMPUTE1  r1=  14  r2=   1  pb=0  ready=0
COMPUTE2  r1=   7  r2=   1  pb=0  ready=0
COMPUTE1  r1=   7  r2=   2  pb=0  ready=0
COMPUTE2  r1=   0  r2=   2  pb=0  ready=0
COMPUTE1  r1=   0  r2=   3  pb=0  ready=0
IDLE      r1=4089  r2=   3  pb=0  ready=1
IDLE      r1=   ?  r2=   3  pb=0  ready=1
The decision r1>=y actually occurs separately in two states: INIT and COMPUTE1.
In state INIT, the only computation involves r2, and so the decision (14 is >= 7)
proceeds correctly. The problem exists in state COMPUTE1 because the computation
changes r1, and the decision is based on r1. The second time in state COMPUTE1,
r1 is still 7, although it is scheduled to become 0 at the beginning of the next clock
cycle. The decision is based on the current value (7), and so the loop executes one more
time than it should and the incorrect value of r2 (3) results. The mysterious decimal
4089 is the side effect of 12-bit underflow (4089+7=2^12).
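The failure can be reproduced with a short Python sketch. It is an illustration under stated assumptions: 12-bit registers, a start in state INIT with r1 already holding x, and the hypothetical names MASK and faulty_divide. The key line evaluates the decision before r1 is updated, which mirrors the hardware behavior described above.

```python
MASK = 0xFFF  # assumed 12-bit registers


def faulty_divide(x, y):
    """Trace the ASM with TEST removed and COMPUTE2 at the top of the loop."""
    if y <= 0:
        raise ValueError("y must be positive")
    state, r1, r2 = "INIT", x & MASK, None
    trace = []
    while state != "IDLE":
        trace.append((state, r1, r2))
        in_loop = r1 >= y                 # decision sees the present r1
        if state == "INIT":
            r2 = 0
            state = "COMPUTE2" if in_loop else "IDLE"
        elif state == "COMPUTE2":
            r2 = (r2 + 1) & MASK
            state = "COMPUTE1"
        else:                             # COMPUTE1
            r1 = (r1 - y) & MASK          # takes effect only in the next cycle
            state = "COMPUTE2" if in_loop else "IDLE"
    trace.append((state, r1, r2))
    return trace


if __name__ == "__main__":
    for row in faulty_divide(14, 7):
        print(row)
```

For x of 14 and y of 7 the final row shows r1 wrapped around to 4089 and r2 equal to 3, one more than the true quotient.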
Although it is incorrect to remove state TEST in the last example, what about removing
state TEST in the other ASM (figure 2-14, with COMPUTE1 at the top of the loop)?
This ASM has the decision r1>=y happening in two different states: INIT and
COMPUTE2. The difference here is that the decision is not dependent on the result of the
computation in state COMPUTE2. Therefore, this ASM is correct. As an illustration,
consider when x is 14 and y is 7:
IDLE      r1=   ?  r2=   ?  pb=0  ready=1
IDLE      r1=  14  r2=   ?  pb=1  ready=1
INIT      r1=  14  r2=   ?  pb=0  ready=0
COMPUTE1  r1=  14  r2=   0  pb=0  ready=0
COMPUTE2  r1=   7  r2=   0  pb=0  ready=0
COMPUTE1  r1=   7  r2=   1  pb=0  ready=0
COMPUTE2  r1=   0  r2=   1  pb=0  ready=0
IDLE      r1=   0  r2=   2  pb=0  ready=1
IDLE      r1=   ?  r2=   2  pb=0  ready=1
The second time in state COMPUTE1 schedules the assignment that changes r1 from
7 to 0. This takes effect at the beginning of the clock cycle when the ASM enters state
COMPUTE2 for the second time. The decision, which is now part of COMPUTE2, is
based on the correct value (0). This means the loop goes through the correct number of
times and the quotient in r2 is correct. As was the case with the earlier ASMs, r2 will
remain unchanged until pb is pushed again.
Although the ASMs in section 2.2.2 are also correct, this ASM has the advantage that it
executes faster as it requires only 3+2*quotient clock cycles.
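A Python sketch of this faster ASM makes the cycle count easy to check. As before, this is only an illustration: 12-bit registers are assumed, the trace starts in state INIT, and the names MASK, divide_no_test and cycles are hypothetical. The two IDLE cycles that precede INIT are added to the count separately.

```python
MASK = 0xFFF  # assumed 12-bit registers


def divide_no_test(x, y):
    """Trace the ASM with TEST removed and COMPUTE1 at the top of the loop."""
    if y <= 0:
        raise ValueError("y must be positive")
    state, r1, r2 = "INIT", x & MASK, None
    trace = []
    while state != "IDLE":
        trace.append((state, r1, r2))
        in_loop = r1 >= y                 # decision sees the present r1
        if state == "INIT":
            r2 = 0
            state = "COMPUTE1" if in_loop else "IDLE"
        elif state == "COMPUTE1":
            r1 = (r1 - y) & MASK
            state = "COMPUTE2"
        else:                             # COMPUTE2: r1 is already up to date
            r2 = (r2 + 1) & MASK
            state = "COMPUTE1" if in_loop else "IDLE"
    trace.append((state, r1, r2))
    return trace


def cycles(x, y):
    """Two IDLE cycles plus every state visited before returning to IDLE."""
    return 2 + len(divide_no_test(x, y)) - 1


if __name__ == "__main__":
    print(divide_no_test(14, 7)[-1], cycles(14, 7))
```

The decision in COMPUTE2 reads an r1 that was updated one cycle earlier, so the loop count and the quotient both come out right.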
2.2.4 Eliminating state INIT
In addition to being able to describe a decision and a computation that occur in parallel,
the ASM chart notation can describe multiple computations that occur in parallel.
Consider eliminating state INIT by merging the assignment of zero to r2 into the
rectangle for state IDLE:
[ASM chart: IDLE (r1 ← x, r2 ← 0, READY); COMPUTE1 (r1 ← r1 - y);
COMPUTE2 (r2 ← r2 + 1)]
You may have as many RTN assignments occurring in parallel within a state as you
want as long as each left-hand side within that state is unique. In this instance, r1 and
r2 are scheduled to have new values assigned at the beginning of the next clock cycle.
Since we have assumed that the user will ensure that the ASM stays in state IDLE
while x remains constant for at least two clock cycles, r1 and r2 will be properly
initialized before entering the loop. This ASM will correctly compute the quotient and
leave the loop after the proper number of times for the same reason. To illustrate what
this ASM does, consider the same example as the other ASMs (when x is 14 and y is
7):
IDLE      r1=   ?  r2=   0  pb=0  ready=1
IDLE      r1=  14  r2=   0  pb=1  ready=1
COMPUTE1  r1=  14  r2=   0  pb=0  ready=0
COMPUTE2  r1=   7  r2=   0  pb=0  ready=0
COMPUTE1  r1=   7  r2=   1  pb=0  ready=0
COMPUTE2  r1=   0  r2=   1  pb=0  ready=0
IDLE      r1=   0  r2=   2  pb=0  ready=1
IDLE      r1=   ?  r2=   0  pb=0  ready=1
There is a new problem with this ASM that we have not seen before: the quotient (2)
exists in r2 for only one clock cycle. This ASM throws it away because the assignment
of 0 to r2 is in state IDLE. From a mathematical standpoint, this ASM is correct, but
from a user interface standpoint, it is unacceptable.
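The vanishing quotient can be observed in a Python sketch that also models the IDLE cycles. This is an illustration under assumptions: 12-bit registers, x held constant throughout, pb pulsed for one chosen cycle, and a decision in IDLE that enters the loop only when pb is 1 and r1>=y (an assumption, since the chart is not reproduced here). The names MASK and run are hypothetical, and the unknown initial r1 is shown as None.

```python
MASK = 0xFFF  # assumed 12-bit registers


def run(x, y, pb_cycle=1, cycles=8):
    """Trace the ASM whose IDLE state performs r1 <- x and r2 <- 0 in parallel."""
    state, r1, r2 = "IDLE", None, 0
    trace = []
    for n in range(cycles):
        pb = 1 if n == pb_cycle else 0
        trace.append((state, r1, r2))
        if state == "IDLE":
            enter = bool(pb) and r1 is not None and r1 >= y
            next_state = "COMPUTE1" if enter else "IDLE"
            r1, r2 = x & MASK, 0          # two transfers in the same state
        elif state == "COMPUTE1":
            next_state = "COMPUTE2"
            r1 = (r1 - y) & MASK
        else:                             # COMPUTE2
            next_state = "COMPUTE1" if r1 >= y else "IDLE"
            r2 = (r2 + 1) & MASK
        state = next_state
    return trace


if __name__ == "__main__":
    for row in run(14, 7):
        print(row)
```

In the printed trace the value 2 appears in r2 for exactly one IDLE cycle before the transfer r2 ← 0 in IDLE overwrites it.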
Unfortunately, there is a subtle error in the above ASM: when the answer is supposed
to be zero, r3 is left unchanged instead of being cleared. This occurs because the only
assignment to r3 is inside the loop, but the loop never executes when the quotient is
zero. One way to overcome this problem is to include an extra decision in the ASM to
test for the special case that x<y (which can be done by testing if r1>=y is false):
[ASM chart: includes states IDLE (READY) and ZEROR3]
Of course, this has the disadvantage of taking longer (2+3*quotient clock cycles),
but sometimes a designer must consider a slower solution to eventually discover a
faster solution.
The value in r3 is one less than it should be since it was assigned too early.
Another thing to try (which unfortunately will also fail for similar reasons) is to merge
states COMPUTE2 and COMPUTE3 into a single state COMPUTE23:
This ASM is correct, as illustrated by the example used before (when x is 14 and
y is 7):
IDLE      r1=   ?  r2=   0  r3=   ?  pb=0  ready=1
IDLE      r1=  14  r2=   0  r3=   ?  pb=1  ready=1
COMPUTE   r1=  14  r2=   0  r3=   ?  pb=0  ready=0
COMPUTE   r1=   7  r2=   1  r3=   0  pb=0  ready=0
COMPUTE   r1=   0  r2=   2  r3=   1  pb=0  ready=0
IDLE      r1=4089  r2=   3  r3=   2  pb=0  ready=1
IDLE      r1=   ?  r2=   0  r3=   2  pb=0  ready=1
The decision involving r1>=y is part of state COMPUTE (as well as being part of
state IDLE), and r1 is affected in state COMPUTE. Also there is the interdependence
of r2 and r3 observed earlier. The reason why state COMPUTE works here is that all
of these things occur at the same time in parallel. We have now totally left the
sequential software paradigm of section 2.2.1 (one statement at a time; no dependency within
a state). We are now using the dependency in the algorithm with parallelism to get the
correct result much faster.
Although in this ASM r3 still serves as the place where the user can observe the
quotient when the ASM returns to state IDLE, r3 accomplishes something even more
important. It compensates for the fact that the loop in state COMPUTE executes one
more time than the software loop would. Even though r2 becomes one more than the
correct quotient, r3 is loaded with the old value of r2 each time through the loop. On
the last time through the loop, r3 is scheduled to be loaded with the correct quotient.
The loop in state COMPUTE is interesting because it has a property that software
loops seldom have: it either does not execute or it executes at least two times. This is
because the decision rl>=y is part of both states IDLE and COMPUTE. To illustrate
this, consider when x is 7 and y is 7:
IDLE      r1=   ?  r2=   0  r3=   ?  pb=0  ready=1
IDLE      r1=   7  r2=   0  r3=   ?  pb=1  ready=1
COMPUTE   r1=   7  r2=   0  r3=   ?  pb=0  ready=0
COMPUTE   r1=   0  r2=   1  r3=   0  pb=0  ready=0
IDLE      r1=4089  r2=   2  r3=   1  pb=0  ready=1
IDLE      r1=   ?  r2=   0  r3=   1  pb=0  ready=1
You can see that r1 is 7 in state IDLE, and so the ASM proceeds to state COMPUTE.
In state COMPUTE, r1 is scheduled to change, but it remains 7 the first time in state
COMPUTE; thus the next state is state COMPUTE (it loops back to itself). Only on the
second time through state COMPUTE has the scheduled change to r1 taken place;
thus the next state finally becomes IDLE.
As with earlier ASMs, this ASM works for x<y only because of state ZEROR3. For
example, consider when x is 5 and y is 7:
IDLE      r1=   ?  r2=   0  r3=   ?  pb=0  ready=1
IDLE      r1=   5  r2=   0  r3=   ?  pb=1  ready=1
ZEROR3    r1=   5  r2=   0  r3=   ?  pb=0  ready=0
IDLE      r1=   5  r2=   0  r3=   0  pb=0  ready=1
IDLE      r1=   ?  r2=   0  r3=   0  pb=0  ready=1
The time required for this ASM is 3+quotient clock cycles.
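The single-state loop is a natural fit for Python's tuple assignment, which evaluates every right-hand side before storing any result, much as parallel register transfers do. The sketch below is illustrative only: it assumes 12-bit registers, starts in the IDLE cycle in which pb is 1 (r1 holds x and r2 holds 0), and uses the hypothetical names MASK and divide_parallel. The unknown value of r3 is shown as None.

```python
MASK = 0xFFF  # assumed 12-bit registers


def divide_parallel(x, y):
    """Trace the ASM with one COMPUTE state plus ZEROR3 for the x<y case."""
    if y <= 0:
        raise ValueError("y must be positive")
    state, r1, r2, r3 = "IDLE", x & MASK, 0, None
    trace = [(state, r1, r2, r3)]         # IDLE cycle in which pb is pushed
    state = "COMPUTE" if r1 >= y else "ZEROR3"
    while state != "IDLE":
        trace.append((state, r1, r2, r3))
        if state == "COMPUTE":
            stay = r1 >= y                # decision sees the present r1
            # All three transfers use the old values of r1 and r2.
            r1, r2, r3 = (r1 - y) & MASK, (r2 + 1) & MASK, r2
            state = "COMPUTE" if stay else "IDLE"
        else:                             # ZEROR3
            r3 = 0
            state = "IDLE"
    trace.append((state, r1, r2, r3))
    return trace


if __name__ == "__main__":
    for row in divide_parallel(14, 7):
        print(row)
```

In the output r2 overshoots by one and r1 wraps around, yet r3 ends up holding the quotient because it always receives the value r2 had one cycle earlier.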
2.2.7 Eliminate state ZEROR3
If the loop in state COMPUTE could execute one or more (rather than two or more)
times, it would be possible to eliminate state ZEROR3. This would work because r2 is
already 0, and the assignment of r2 to r3 would achieve the desired effect of clearing
r3.
One way to describe this in ASM chart notation is to note that pb is true when making
the transition from state IDLE to state COMPUTE (the first time into the loop), but pb
remains false until the quotient is computed (by our original assumption about a friendly
user). Let's change the decision so that it ORs the status signal pb together with the
result of the r1>=y:
[...] coming out of state IDLE. Such ASMs with redundant tests (in the same clock cycle)
can be simplified into shorter equivalent ASM notation. Although this equivalent ASM
is truly identical and would be implemented with the same hardware, it does not follow
a style that can be thought of in terms of ifs and whiles:

Figure 2-25. Equivalent to figure 2-24.

Also, in the above, the order of the statements within state COMPUTE was rearranged
for ease of understanding. As mentioned earlier, changing the order within a
rectangle does not change the meaning. Which way you draw the ASM is both a matter
of personal taste and also a matter of how you intend to use it. We will see examples
where both forms of this ASM prove useful. At this stage it is important for you to be
comfortable that these two ASMs mean exactly the same thing because under all possible
circumstances they cause the same state transitions and computations to occur.
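The two-state ASM discussed above can also be written as a small behavioral model. The following is only a sketch in Verilog (the notation introduced in the next chapter), not the book's own code: the module name divide_asm, the test bench, and the 12-bit widths are assumptions made for illustration, and pb is assumed to be true for exactly one clock cycle.

```verilog
// Hypothetical behavioral sketch of the two-state childish division ASM.
module divide_asm(
  input  wire        clk,
  input  wire        pb,
  input  wire [11:0] x,
  input  wire [11:0] y,
  output reg  [11:0] r3,      // quotient
  output wire        ready
);
  localparam IDLE = 1'b0, COMPUTE = 1'b1;
  reg        ps;              // present state
  reg [11:0] r1, r2;

  initial ps = IDLE;
  assign ready = (ps == IDLE);

  always @(posedge clk)
    case (ps)
      IDLE: begin
        r1 <= x;              // r1 <- x
        r2 <= 0;              // r2 <- 0
        if (pb) ps <= COMPUTE;
      end
      COMPUTE: begin
        r1 <= r1 - y;         // all three transfers are scheduled in parallel
        r2 <= r2 + 1;
        r3 <= r2;             // r3 gets the old value of r2
        if (!((r1 >= y) | pb)) ps <= IDLE;
      end
    endcase
endmodule

module divide_asm_tb;
  reg         clk = 0, pb = 0;
  reg  [11:0] x = 14, y = 7;
  wire [11:0] quotient;
  wire        ready;

  divide_asm dut(clk, pb, x, y, quotient, ready);

  always #5 clk = ~clk;

  initial begin
    @(negedge clk); @(negedge clk);   // let state IDLE load r1 with x
    pb = 1; @(negedge clk); pb = 0;   // push the button for one cycle
    @(posedge ready);                 // wait for the return to IDLE
    @(negedge clk);
    $display("quotient = %0d", quotient);   // prints: quotient = 2
    $finish;
  end
endmodule
```

Because the decision in state COMPUTE uses the value of r1 before the scheduled subtraction takes effect, the loop runs one more time than the quotient, and loading r3 from the old r2 compensates for that.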
2.3 Mixed examples
The three stages of the top-down design process were discussed in section 2.1.5. Section
2.2 gives several alternative ways to describe the childish division algorithm in the
first stage of the top-down design process (as a pure behavioral ASM). This section
continues this same example into the second stage of the top-down design process. In
the second stage, we partition the division machine into a controller and an architecture.

2.3.1 First example
[...] combinational device (such as a subtractor). Such a combinational device is always
computing the difference between r1 and y (even though that difference is loaded into
r1 only when the controller is in state COMPUTE).
Loading r3 with the old value of r2 is easy. The output port of the r2 counter register
is simply connected via a bus to the input port of the r3 enabled register. If the only
state to mention r1, r2 or r3 were state COMPUTE, we would have the following
architecture:

Figure 2-26. Architecture using subtractor.

but the above architecture fails to implement the RTN of state IDLE. The above
architecture provides no way for r1 to be loaded with x.
One approach that often allows an architecture to deal with different kinds of RTN in
different states is to use an Arithmetic Logic Unit (ALU), which is capable of many
different operations, instead of a dedicated combinational device (such as a subtractor).
Also, there is often a need for one or more muxes so that the proper information can be
routed to the ALU. In this particular ASM, there are only two different results that
might be loaded into r1: either the difference of r1 and y or passing through x
unchanged. This means the ALU must be capable of at least two different operations:
computing the difference of the ALU's two data inputs and passing through the ALU's
second input unchanged. The ALU is commanded to do these operations by particular
bit patterns on the six-bit aluctrl bus. Symbolically, we will refer to these bit
patterns as `DIFFERENCE and `PASS. The grave accent (`), which is also known as
backquote or tick, indicates a symbol that is replaced by a particular bit pattern.

On one hand, the ALU should be able to subtract y; on the other hand, the ALU should
be able to pass x. To accomplish this requires a mux which can select either x or y. The
output of this mux is connected as the second input of the ALU. Input 0 of this mux is
connected to the external bus x. Input 1 of this mux is connected to the external bus y.
Figure 2-27. Architecture using ALU.

Although the above architecture implements all the RTN of the ASM chart in section
2.2.7, it does not consider the relational decision r1>=y. The simplest way to translate
relational decisions into the mixed stage is to dedicate a combinational device (usually
a comparator) for calculating an internal status signal that indicates the result of the
relational comparison. In the mixed ASM, this internal status signal will be tested
instead of referring to the relational decision. This is why ultimately the ASM uses only
status signals. In this particular instance, we will use a comparator whose inputs are r1
and y.

There are three outputs of a comparator: the strictly <, the exactly == and the strictly >.
There is no >= output, but we can obtain that output, since it is the complement of the
strictly < output. We will use the strictly < output as the input of an inverter, whose
output is the internal status signal r1gey.

At last, we have an architecture which can correctly implement all the RTN and
relational decisions of the ASM chart in section 2.2.7. Now it will be a mechanical matter
to translate the pure behavioral ASM chart of section 2.2.7 into an equivalent mixed
ASM chart. The purpose of this translation will be to use command signals (such as
incr2) instead of RTN, and to use status signals (such as r1gey) instead of relational
decisions. The ← in RTN translates to a command signal (such as ldr1, clrr2,
incr2 or ldr3) corresponding to the register on the left of the arrow. The computation
on the right of the arrow may or may not require additional commands directed to
the combinational logic units, such as the ALU and mux.
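The architecture just described (mux, ALU, comparator with inverter, and the three registers) can be sketched in Verilog as follows. This is an illustrative sketch under stated assumptions, not the book's code: the module name division_arch and the internal names muxbus, alubus and r1lty are hypothetical, the buses are assumed to be 12 bits wide, and the two aluctrl bit patterns are the ones that appear in the controller table of section 2.4.1.

```verilog
// Assumed bit patterns, taken from the controller table in section 2.4.1.
`define DIFFERENCE 6'b011001
`define PASS       6'b101010

// Hypothetical sketch of the ALU-based architecture.
module division_arch(
  input  wire        clk,
  input  wire        ldr1, clrr2, incr2, ldr3, muxctrl,
  input  wire [5:0]  aluctrl,
  input  wire [11:0] x, y,
  output wire        r1gey,   // internal status signal sent to the controller
  output reg  [11:0] r3
);
  reg  [11:0] r1, r2;

  // Mux: input 0 is x, input 1 is y; its output is the ALU's second input.
  wire [11:0] muxbus = muxctrl ? y : x;

  // ALU: only the two operations this ASM needs are modeled.
  reg [11:0] alubus;
  always @(*)
    case (aluctrl)
      `DIFFERENCE: alubus = r1 - muxbus;
      `PASS:       alubus = muxbus;
      default:     alubus = 12'bx;
    endcase

  // Comparator's strictly < output followed by an inverter gives r1gey.
  wire r1lty = (r1 < y);
  assign r1gey = ~r1lty;

  always @(posedge clk) begin
    if (ldr1) r1 <= alubus;               // enabled register r1
    if (clrr2) r2 <= 0;                   // counter register r2
    else if (incr2) r2 <= r2 + 1;
    if (ldr3) r3 <= r2;                   // enabled register r3
  end
endmodule
```

Each command signal corresponds to one register transfer: ldr1 to a ← whose left side is r1, clrr2 and incr2 to r2, and ldr3 to r3, while muxctrl and aluctrl steer the combinational logic on the right of the arrow.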
This translation from pure behavioral ASM to mixed ASM always relates to a particular
architecture that the designer has in mind. Although many architectures might have
been chosen for one pure behavioral ASM, each architecture will have a distinct mixed
ASM. The following shows the particular architecture we have just developed and the
corresponding translation of the pure behavioral ASM of section 2.2.7 into the particular
mixed ASM required by this architecture. Finally, we give a system block diagram
showing the interconnection of this particular controller (as described by the mixed
ASM) and this particular architecture:

Figure 2-28.

Figure 2-30. System diagram.

The external command READY and the external status pb are not affected by the
translation to the mixed stage.
[Mixed ASM chart with states IDLE, TEST, COMPUTE1 and COMPUTE2. Recoverable labels:
in IDLE, ldr1, aluctrl = `PASSB, muxctrl = 0 and READY; in COMPUTE1, ldr1,
aluctrl = `DIFFERENCE and muxctrl = 1; in COMPUTE2, incr2; decisions on pb and r1gey.]
2.3.4 Methodical versus central ALU architectures
There is a spectrum of possible ways to choose an architecture at the [...] design
process. At one extreme is the central ALU approach, illustrated by section 2.3.3, where
one ALU does all the computation required by the entire [...] other extreme are
methodical approaches, illustrated by sections 2.3.[...] each computation in the
algorithm is done by a different hardware [...] ALU approach typically uses less
hardware but only works with certain [...] For example, the ASM that works with the
methodical approach in se[...] work with the central ALU approach because that ASM
performs mo[...]putation per clock cycle. The following table highlights the
differences [...] two approaches:

                                Central ALU      Methodical
What does computation?          one ALU          [...]
What ALU output connects to?    every register   only [...]
What kind of register?          enabled          all [...]
Number of ← per clock cycle     one              many
Speed                           slower           faster
Cost                            lower            higher
Example                         2.3.3            2.3.[...]
Figures                         2-32, 2-33       2-28 [...]

In the methodical architecture of section 2.3.2, the output of the ALU [...] r1, but in
the central ALU architecture of section 2.3.3, the output c[...]nects to both r1 and r2.
In the ASM implemented with the methodical [...]

2 Including customized registers (other than those described in appendix D) built using
[...] combinational logic that are tailored to the specific algorithm. See section
7.2.2.1 for an example.

2.4 Pure structural example
[...]

2.4.1 First example
[...]
Figure 2-34. Controller.
We can describe the next state logic with a table. For the ASM chart of section 2.3.1,
the corresponding table is:

inputs          outputs
ps  pb  r1gey   ns  ldr1  clrr2  incr2  ldr3  muxctrl  aluctrl  ready
0   0   0       0   1     1      0      0     0        101010   1
0   0   1       0   1     1      0      0     0        101010   1
0   1   0       1   1     1      0      0     0        101010   1
0   1   1       1   1     1      0      0     0        101010   1
1   0   0       0   1     0      1      1     1        011001   0
1   0   1       1   1     0      1      1     1        011001   0
1   1   0       1   1     0      1      1     1        011001   0
1   1   1       1   1     0      1      1     1        011001   0

In the above, ps stands for the representation of the present state, and ns stands for the
representation of the next state.

One possible hardware implementation of this table is a ROM. The above table can be
used as is to "burn" the ROM. Since there are three bits of address input to the ROM
(one bit for the present state, and two bits for the status), there are eight words (each 13
bits wide) stored in the ROM for this controller.
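As an illustration of the ROM approach, the eight 13-bit words of the table can be placed in a small memory that is addressed by the present state and the two status bits. The following Verilog is a sketch under stated assumptions, not the book's implementation: the module name controller_rom and the signal names rom and word are hypothetical, and the present state register is assumed to power up in state 0 (IDLE).

```verilog
// Hypothetical ROM-based controller built directly from the table.
module controller_rom(
  input  wire       clk,
  input  wire       pb, r1gey,
  output wire       ldr1, clrr2, incr2, ldr3, muxctrl,
  output wire [5:0] aluctrl,
  output wire       ready
);
  reg         ps;             // present state: 0 is IDLE, 1 is COMPUTE
  reg  [12:0] rom [0:7];      // eight words, each 13 bits wide
  wire        ns;             // next state

  // The address is {ps, pb, r1gey}; the word is split into the output columns.
  wire [12:0] word = rom[{ps, pb, r1gey}];
  assign {ns, ldr1, clrr2, incr2, ldr3, muxctrl, aluctrl, ready} = word;

  initial begin
    //               ns ldr1 clrr2 incr2 ldr3 muxctrl aluctrl ready
    rom[0] = 13'b0_1_1_0_0_0_101010_1;
    rom[1] = 13'b0_1_1_0_0_0_101010_1;
    rom[2] = 13'b1_1_1_0_0_0_101010_1;
    rom[3] = 13'b1_1_1_0_0_0_101010_1;
    rom[4] = 13'b0_1_0_1_1_1_011001_0;
    rom[5] = 13'b1_1_0_1_1_1_011001_0;
    rom[6] = 13'b1_1_0_1_1_1_011001_0;
    rom[7] = 13'b1_1_0_1_1_1_011001_0;
    ps = 1'b0;                // assumed power-up state
  end

  // Present state register.
  always @(posedge clk) ps <= ns;
endmodule
```

The ROM contents are a literal copy of the table rows, which is why no logic minimization is needed with this approach.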
Another approach would be to use the above table to derive minimized logic equations
for each bit of output and then use the logic equations to arrive at a structure composed
of AND/OR gates. For example, the following logic equations are equivalent to the
above table:

[...]

inputs          outputs
ps  pb  r1gey   ns  ldr1  clrr2  incr2  ldr3  muxctrl  aluctrl  ready
0   0   -       0   1     1      0      0     0        101010   1
0   1   -       1   1     1      0      0     0        101010   1
1   0   0       0   1     0      1      1     1        011001   0
1   1   0       1   1     0      1      1     1        011001   0
1   -   1       1   1     0      1      1     1        011001   0

[...] shows a "don't care" as a hyphen for those status inputs that do not affect a particular
state transition. This table means exactly the same thing as the longer form of the table
given earlier.
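One set of logic equations that is consistent with the table can be checked mechanically. The equations below were derived here by inspecting the table and are not necessarily the minimized equations the book printed; the module name controller_eq_check and the names eq_word and table_word are hypothetical. The sketch compares the equations against the "don't care" form of the table for all eight input combinations.

```verilog
// Hypothetical self-checking sketch: equations versus the don't-care table.
module controller_eq_check;
  reg ps, pb, r1gey;

  // Equations derived by inspection of the table (an assumption, see lead-in).
  wire       ns      = pb | (ps & r1gey);
  wire       ldr1    = 1'b1;
  wire       clrr2   = ~ps;
  wire       incr2   = ps;
  wire       ldr3    = ps;
  wire       muxctrl = ps;
  wire [5:0] aluctrl = {~ps, ps, 1'b1, 1'b0, ~ps, ps};
  wire       ready   = ~ps;
  wire [12:0] eq_word =
    {ns, ldr1, clrr2, incr2, ldr3, muxctrl, aluctrl, ready};

  // The abbreviated table; ? plays the role of the hyphen.
  reg [12:0] table_word;
  always @(*)
    casez ({ps, pb, r1gey})
      3'b00?:  table_word = 13'b0_1_1_0_0_0_101010_1;
      3'b01?:  table_word = 13'b1_1_1_0_0_0_101010_1;
      3'b100:  table_word = 13'b0_1_0_1_1_1_011001_0;
      3'b110:  table_word = 13'b1_1_0_1_1_1_011001_0;
      3'b1?1:  table_word = 13'b1_1_0_1_1_1_011001_0;
      default: table_word = 13'bx;
    endcase

  integer i, errors;
  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {ps, pb, r1gey} = i[2:0];
      #1;                                   // let the combinational logic settle
      if (eq_word !== table_word) errors = errors + 1;
    end
    $display("mismatches = %0d", errors);   // prints: mismatches = 0
  end
endmodule
```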
2.4.2 Second example
Assuming that the five states in the ASM chart of section 2.3.2 are represented as
follows:

IDLE      000
INIT      001
TEST      010
COMPUTE1  011
COMPUTE2  100
Using the "don't care" form of the table is useful because otherwise the table would be
32 lines long. The values of muxctrl and aluctrl in states INIT, TEST and COMPUTE2
are arbitrary. There are three extra state encodings (101, 110 and 111) that
should never occur in the proper operation of the machine. On power up, the physical
hardware might find itself in one of these states. To avoid problems of this kind, these
states go to state IDLE but otherwise do nothing.
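The treatment of the three unused encodings can be sketched as a small guard placed in front of the present state register. This is only an illustrative fragment: the module name unused_state_guard and the input normal_ns (standing for whatever next state the ASM's own transitions would produce, which is not modeled here) are hypothetical.

```verilog
// Hypothetical sketch: force the unused encodings 101, 110 and 111 to IDLE.
module unused_state_guard(
  input  wire [2:0] ps,         // present state
  input  wire [2:0] normal_ns,  // assumed: next state from the ASM transitions
  output wire [2:0] ns
);
  // State encodings from the text.
  localparam IDLE     = 3'b000,
             INIT     = 3'b001,
             TEST     = 3'b010,
             COMPUTE1 = 3'b011,
             COMPUTE2 = 3'b100;

  // Any encoding above COMPUTE2 should never occur in proper operation.
  wire unused = (ps > COMPUTE2);

  assign ns = unused ? IDLE : normal_ns;
endmodule
```

With this guard, a machine that powers up in one of the extra encodings reaches IDLE after one clock cycle.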
Figure 2-35. Block diagram and behavioral ASM for adder.

Figure 2-36. Flattened circuit diagram for adder.
[...] one-bit wires (nets) that each, [...] can be submitted to a silicon foundry to
fabricate an [...] be used manually as directions (to a low-skilled w[...] copper wire to
form the desired machine. Although a f[...] equivalent netlist) is ultimately what is
used to build a r[...] of the thought process the designer uses to arrive at t[...]

The term hierarchical design r[...] final design. Hierarchical desig[...] behavioral ASMs.
In hierarchical de[...] circuit diagram. Instead of just [...] defines modules. For
example, th[...]tion of a module occurs by ins[...] composed of a full-adder and a [...]
and also instantiates a half-add[...]

The designer has to define a se[...] the full-adder is composed of two half-adders and
an [...] that the designer instantiates in the definition of the fu[...] boring suburb.
Such houses are built from identical [...] the same. But different people live in each
house. So it [...] compose the full-adder. They are instantiated from an id[...] of the
half-adder will be discussed in the next paragra[...] by each half-adder.

In the final circuit, the half-add[...] each of the three half-adders is in turn
composed of a[...] gate. The following diagrams show the adder modul[...] definition,
half-adder module definition and a circuit c[...] the instantiation of these definitions
inside each oth[...]
Figure 2-40. Hierarchical instantiation of modules.

Although hierarchical design can be used with either bottom-up or top-down design, it
is most important with top-down design. In top-down design, upon completion of the
third stage, the designer may apply the same three stages over again on any components
(actors) that are not standard building blocks. Building blocks such as adders are
well understood, and the designer is not normally concerned about their internal
structure. Other problem-specific building blocks (such as the push button in the division
machine) would be dealt with at the end of the third stage for the top-level system.
2.5.2 Hi[...]
[...]

[Figure 2-41: the DIVISION MACHINE as a single box (described by behavioral ASM chart),
with 12-bit inputs x and y, input pb, and outputs x/y and READY.]

Figure 2-42. Mixed block diagram.
[Block diagram of the DIVISION MACHINE showing the CONTROLLER (NEXT STATE LOGIC and
PRESENT STATE REGISTER, with ps and ns) and the ARCHITECTURE (including r2 and r3),
connected by r1gey, aluctrl, ldr3, incr2 and clrr2, with external signals x, y, pb,
x/y and READY.]
The first diagram (figure 2-41), which illustrates the "pure" behavioral stage, has a
single black box. This box represents the complete division machine. The machine has
a unified behavior which can be described by a single ASM chart (see section 2.2.7).
The only structure in the first stage is the port structure that allows the machine to
communicate with the outside world.

The second diagram (figure 2-42) shows the "mixed" stage. Instantiated inside the
division machine are the controller and the architecture. In the mixed stage, the
structure of the controller remains a mystery (i.e., a black box) which is described only in
behavioral terms by the ASM chart of section 2.3.1. This ASM chart refers only to
status and command signals. The architecture, on the other hand, is visible; however
architectural devices (such as the ALU) remain as black boxes, known only by their
behavior.
2.6 Conclusion
This chapter illustrates two manual graphical notations: the ASM chart (to describe
behavior) and the block diagram (to describe structure). There are three stages in the
top-down design process to turn an algorithm (behavior) into hardware (structure):
pure behavioral, mixed and pure structural. The mixed and pure structural stages
partition the machine into a controller and an architecture.

In addition, this chapter describes instantiating structural modules (hierarchical
design). This allows the pure structural stage to be described in an understandable way,
without having to descend to the extreme gate-level detail of a netlist.

The next chapter introduces an automated textual notation that allows us to express
behavioral, structural and hierarchical design in a unified notation.

2.7 Further reading
PROSSER, FRANKLIN P. and DAVID E. WINKEL, The Art of Digital Design: An Introduction
to Top Down Design, 2nd ed., PTR Prentice Hall, Englewood Cliffs, NJ, 1987.
Chapters 5-8 give several examples of ASM charts using RTN. This book uses the term
architecture the way it is used here.
2.8 Exercises
2-1. Give a pure behavioral ASM for a factorial machine. The factorial of n is the
product of the numbers from 1 to n. This machine has the data input, n, a push button
input, pb, a data output, prod, and an external status output, READY. READY and
pb are similar to those described in section 2.2.1. Until the user pushes the button,
READY is asserted and prod continues to be whatever it was. When the user pushes
the button, READY is no longer asserted and the machine computes the factorial by
doing no more than one multiplication per clock cycle. For example, when n==5, after
an appropriate number of clock cycles prod becomes 120 == 1*1*2*3*4*5 ==
1*5*4*3*2*1 and READY is asserted again.
Use a linear time algorithm in the input n, which means the exact number of clock
cycles that this machine takes to compute n! for a particular value of n can be
expressed as some constant times n plus another constant. (All of the childish division
examples in this chapter are linear time algorithms in the quotient.) For example, a
machine that takes 57*n+17 clock cycles to compute n! would be acceptable, but
you can probably do better than that.
2-2. Design an architecture block diagram and corresponding mixed ASM that is able
to implement the algorithm of problem 2-1 assuming the architecture is composed of
the following building blocks: up/down counter registers, multiplier, comparator and
muxes. Give a system diagram that shows how the architecture and controller fit
together, labeled with appropriate signal names.
2-3. Give a table that describes the structural controller for problem 2-2.
2-4. Give a pure behavioral ASM similar to problem 2-1, but use repeated addition to
perform multiplication. For example, 13*14 == 0 + 13 + 13 + 13 + 13 + 13 + 13
+ 13 + 13 + 13 + 13 + 13 + 13 + 13 + 13. Direct multiplication in a single cycle is
not allowed. The algorithm should be suitable for implementation with the central ALU
approach. This will be a quadratic time algorithm in n because of nested loops.
2-6. Give a table that describes the structural controller for problem 2-5.
2-7. Give a pure behavioral ASM similar to problem 2-4, but use a shift and add
algorithm to perform multiplication. Direct multiplication in a single cycle is not allowed.
Here is an example of multiplying 14 by 13 using the shift and add algorithm with 4-bit
input representations, and an 8-bit product.

      1110
    * 1101
  00001110   do (1) add 14 in this cycle
             don't (0) add 28 in this cycle
  00111000   do (1) add 56 in this cycle
+ 01110000   do (1) add 112 in this cycle
  10110110   product is 182

The number of cycles to perform a single multiplication by n is proportional to the
number of bits used to represent n, which is roughly the logarithm of n. But you have
to perform n such multiplications, and so this factorial algorithm is what is called an n
log n time algorithm, which takes more clock cycles than a linear time algorithm but
fewer clock cycles than a quadratic time algorithm when n is large. (Note that unlike
the linear time algorithm of problem 2-1, this approach does not require an expensive
multiplier.) You should use a methodical approach that exploits maximal parallelism.
2-8. Design an architecture block diagram and corresponding mixed ASM that is able
to implement problem 2-7 assuming the following building blocks: enabled registers,
counter registers, shift registers, muxes, adder, comparator. Give a system diagram that
shows how the architecture and controller fit together, labeled with appropriate signal
names.
2-9. Give a table that describes the structural controller for problem 2-8. (See section
D.9 for details about controlling a shift register.)
2-10. For each of the following ASMs, draw a timing diagram. x, y and z are 8-bit
registers, whose values should be shown in decimal.
[ASM charts for problem 2-10, parts a) through f).]
2-12. Like 2-11, except use: one 8-bit counter register, one 8-bit enabled register, one
8-bit ALU and any number of any kind of mux. Label the a and b inputs of the ALU.
2-13. Design an architecture block diagram and corresponding mixed ASM that is able
to implement problem 2-10, part b assuming the following building blocks: two 8-bit
enabled registers, one 8-bit adder, one 8-bit incrementor and any number of any kind of
mux. Give a system diagram.
2-15. Design an architecture block diagram and corresponding mixed ASM that is able
to implement problem 2-10, part d assuming the following building blocks: two 8-bit
enabled registers, one 8-bit ALU (see section C.6) and one 8-bit two-input mux. Give a
system diagram. Label the a and b inputs of the ALU.
2-16. Design an architecture block diagram and corresponding mixed ASM that is able
to implement problem 2-10, part e assuming the following building blocks: two 8-bit
counter registers, one 8-bit enabled register, one 8-bit ALU (section C.6) and any
number of any kind of mux. Give a system diagram. Label the a and b inputs of the ALU.
2-17. Like 2-16, except use: one 8-bit incrementor, three 8-bit enabled registers, one
8-bit adder, and any number of any kind of mux.
2-18. Design an architecture block diagram and corresponding mixed ASM that is able
to implement problem 2-10, part f assuming the following building blocks: one 8-bit
shift register (section D.9), one 8-bit counter register, one 8-bit up/down counter
register (section D.8) and one 8-bit adder. You may not use any muxes. Give a system
diagram.
2-19. Like 2-18, except use: one 8-bit incrementor, one 8-bit decrementor, three 8-bit
enabled registers, one 8-bit adder, one 8-bit combinational shifter, and any number of
any kind of mux.
3. VERILOG HARDWARE DESCRIPTION LANGUAGE

The previous chapter describes how a designer may manually use ASM charts (to
describe behavior) and block diagrams (to describe structure) in top-down hardware
design. The previous chapter also describes how a designer may think hierarchically,
where one module's internal structure is defined in terms of the instantiation of other
modules. This chapter explains how a designer can express all of these ideas in a
special hardware description language known as Verilog. It also explains how Verilog can
test whether the design meets certain specifications.

3.1 Simulation versus synthesis
Although the techniques given in chapter 2 work wonderfully to design small machines
by hand, for larger designs it is desirable to automate much of this process. To
automate hardware design requires a Hardware Description Language (HDL), a different
notation than what we used in chapter 2 which is suitable for processing on a
general-purpose computer. There are two major kinds of HDL processing that can occur:
simulation and synthesis.

Simulation is the interpretation of the HDL statements for the purpose of producing
human readable output, such as a timing diagram, that predicts approximately how the
hardware will behave before it is actually fabricated. As such, HDL simulation is quite
similar to running a program in a conventional high-level language, such as JavaScript,
LISP or BASIC, that is interpreted. Simulation is useful to a designer because it allows
detection of functional errors in a design without having to fabricate the actual
hardware. When a designer catches an error with simulation, the error can be corrected with
a few keystrokes. If the error is not caught until the hardware is fabricated, correcting
the problem is much more costly and complicated.

Synthesis is the compilation of high-level behavioral and structural HDL statements
into a flattened gate-level netlist, which then can be used directly either to lay out a
printed circuit board, to fabricate a custom integrated circuit or to program a
programmable logic device (such as a ROM, PLA, PLD, FPGA, CPLD, etc.). As such, synthesis
is quite similar to compiling a program in a conventional high-level language, such
as C. The difference is that, instead of producing object code that runs on the same
computer, synthesis produces a physical piece of hardware that implements the
computation described by the HDL code. For the designer, producing the netlist is a simple
step [...]

3.2 Veri[...]
[...]

3.3 Ro[...]
[...]
3.4 Behavioral features of Verilog
Verilog is composed of modules (which play an important role in the structural aspects
of the language, as will be described in section 3.10). All the definitions and declara-
tions in Verilog occur inside a module.
Underbars are permitted in Verilog identifiers. Verilog is case sensitive, and so
Rain_fall and rain_fall are distinct variables. The declarations integer and
real are intended only for use in test code. Verilog provides other data types, such as
reg and wire, used in the actual description of hardware. The difference between
these two hardware-oriented declarations primarily has to do with whether the variable
is given its value by behavioral (reg) or structural (wire) Verilog code. Both of these
declarations are treated like unsigned in C. By default, regs and wires are only
one bit wide. To specify a wider reg or wire, the left and right bit positions are
defined in square brackets, separated by a colon. For example:

    reg [3:0] nibble,four_bits;

declares two variables, each of which can contain numbers between 0 and 15. The most
significant bit of nibble is declared to be nibble[3], and the least significant bit
is declared to be nibble[0]. This approach is known as little endian notation. Verilog
also supports the opposite approach, known as big endian notation:

    reg [0:3] bigend_nibble;

where now bigend_nibble[3] is the least significant bit.

If you store a signed value¹ in a reg, the bits are treated as though they are unsigned.
For example, the following:

    four_bits = -5;

is the same as:

    four_bits = 11;

¹ In order to simplify dealing with twos complement values, many implementations allow integers with an
arbitrary width. Such declarations are like regs, except they are signed.
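The declarations above can be tried in a small test module. This is a minimal sketch, assuming a simulator that accepts standard Verilog; the module name reg_width_demo is hypothetical, while the variable names follow the text.

```verilog
// Hypothetical demonstration of reg widths, bit numbering and unsigned storage.
module reg_width_demo;
  reg [3:0] nibble, four_bits;
  reg [0:3] bigend_nibble;

  initial begin
    nibble        = 4'b1000;   // only the most significant bit is set
    bigend_nibble = 4'b1000;   // same bit pattern, opposite bit numbering

    // With [3:0], bit 3 is the most significant bit;
    // with [0:3], bit 3 is the least significant bit.
    $display("nibble[3] = %b", nibble[3]);                 // prints: nibble[3] = 1
    $display("bigend_nibble[3] = %b", bigend_nibble[3]);   // prints: bigend_nibble[3] = 0

    // Storing -5 in a four-bit reg keeps the bits 1011, read back as unsigned.
    four_bits = -5;
    $display("four_bits = %0d", four_bits);                // prints: four_bits = 11
  end
endmodule
```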
    real monthly_precip[11:0];

Each of the twelve elements of the array (from monthly_precip[0] to
monthly_precip[11]) is a unique real number. Verilog also allows arrays of wires
and regs to be defined. For example,

    reg [3:0] reg_arr[999:0];
    wire[3:0] wirarr[999:0];

Here, reg_arr[0] is a four-bit variable that can be assigned any number between 0
and 15 by behavioral code, but wirarr[0] is a four-bit value that cannot be assigned
its value from behavioral code. There are one thousand elements, each four bits
wide, in each of these two arrays. Although the [] means bit select for scalar values,
such as nibble[3], the [] means element select with arrays. It is illegal to combine
these two uses of [] into one, as in if (reg_arr[0][3]). To accomplish this
operation requires two statements:
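The original pair of statements is not legible in this copy. One hedged way to write them is to copy the array element into an ordinary reg and then select the bit from that reg; the name temp and the module name array_bit_demo below are hypothetical.

```verilog
// Hypothetical sketch: element select first, then bit select on a scalar reg.
module array_bit_demo;
  reg [3:0] reg_arr [999:0];
  reg [3:0] temp;              // assumed helper variable

  initial begin
    reg_arr[0] = 4'b1010;

    temp = reg_arr[0];         // first statement: element select
    if (temp[3])               // second statement: bit select
      $display("bit 3 of reg_arr[0] is set");   // this line is printed
  end
endmodule
```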
[...]

    if (condition)
      statement
    else
      statement

    for (var=[...]
      statement

    forever
      statement

    case (exp[...]
      constant[...]
      ...
      default[...]
    endcase

where the italic statement, var, expression, condition and constant are
replaced with appropriate Verilog syntax for those parts of the language. A statement
is one of the above statements or a series of the above statements terminated by
semicolons inside begin and end. A var is a variable declared as integer,
real, reg or a concatenation of regs. A var cannot be declared as wire.

2 There are other, more advanced statements that are legal. Some of these are described in chapters 6
and 7.
loHardware
g HardwareDescriptionLanguage 69
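To make the statement forms concrete, here is a small sketch that nests an if else inside a case inside a for loop. The module and variable names are hypothetical and the printed strings are only for illustration.

```verilog
module statement_sketch;
  integer i;                  // a var declared as integer
  reg [1:0] sel;              // a var declared as reg

  initial
    begin
      for (i = 0; i <= 3; i = i + 1)
        begin                 // begin and end group a series of statements
          sel = i;
          case (sel)
            0: $display("zero");
            1: $display("one");
            default:
              if (sel == 2)
                $display("two");
              else
                $display("three");
          endcase
        end
    end
endmodule
```

Each place where the summary says statement accepts either a single statement or a begin ... end group, which is what allows the nesting shown here.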
        name                     example   16-bit unsigned result
+       addition                 10+3      13
-       subtraction              10-3      7
-       negation                 -10       65526
*       multiplication           10*3      30
/       division                 10/3      3
%       remainder                10%3      1
<<      shift left               10<<3     80
>>      shift right              10>>3     1
&       bitwise AND              10&3      2
|       bitwise OR               10|3      11
^       bitwise exclusive OR     10^3      9
~       bitwise NOT              ~10       65525
?:      conditional operator     0?10:3    3
                                 1?10:3    10
!       logical NOT              !10       0
&&      logical AND              10&&3     1
||      logical OR               10||3     1
<       less than                10<3      0
==      equal to                 10==3     0
<=      less than or equal to    10<=3     0
>=      greater than or equal    10>=3     1
!=      not equal                10!=3     1
>       greater than             10>3      1
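A few rows of the operator table can be checked with a short sketch like the one below. The module name and the reg r are hypothetical; r is declared sixteen bits wide so that each result is truncated and read back as a 16-bit unsigned value, as in the table.

```verilog
module operator_sketch;
  reg [15:0] r;               // 16-bit unsigned holder for each result

  initial
    begin
      r = -10;      $display("-10   gives %d", r);   // 65526
      r = ~10;      $display("~10   gives %d", r);   // 65525
      r = 10 >> 3;  $display("10>>3 gives %d", r);   // 1
      r = 10 & 3;   $display("10&3  gives %d", r);   // 2
      r = 10 | 3;   $display("10|3  gives %d", r);   // 11
      r = 10 ^ 3;   $display("10^3  gives %d", r);   // 9
      r = 10 >= 3;  $display("10>=3 gives %d", r);   // 1
    end
endmodule
```

The two large values arise because negation and bitwise NOT set the upper bits, and a reg treats those bits as part of an unsigned number.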
3.4.4 Blocks
All procedural statements occur in what are called blocks that are defined inside modules, after the type declarations. There are two kinds of procedural blocks: the initial block and the always block. For the moment, let us consider only the initial block. An initial block is like conventional software. It starts execution and eventually (assuming there is not an infinite loop inside the initial block) it stops execution. The simplest form for a single Verilog initial block is:

    ...
      declarations
      initial
        begin
          statement
          ...
          statement
        end
    endmodule
The loop involving x could have been written as a for loop also but was shown above as a while for illustration. Note that Verilog does not have the ++ found in C, and so it is necessary to say something like y = y + 1. This assignment statement is just like ...

Since infinite loops are useful in hardware, Verilog provides the syntax forever, which means the same thing as while (1). In addition, the always block mentioned above can be described as an initial block containing only a forever loop. For simulation purposes, the following mean the same:

    initial                 initial
      begin                   begin
        while (1)               forever
          begin                   begin
            ...                     ...
          end                     end
      end                     end

For synthesis, one should use the always block form. The statement forever is not a block and cannot stand by itself. Like other procedural statements, forever must be inside an initial or always block.
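The equivalence between an always block and an initial block that contains only a forever loop can be sketched as below. The names tick1 and tick2 are hypothetical, and the # delays borrow the time control described in section 3.7.1, because a loop with no time control would never let the simulator advance.

```verilog
module forever_sketch;
  reg tick1, tick2;

  // always form
  initial tick1 = 0;
  always
    #5 tick1 = ~tick1;

  // equivalent form: an initial block whose last statement is a forever loop
  initial
    begin
      tick2 = 0;
      forever
        #5 tick2 = ~tick2;
    end

  // stop the simulation, since neither loop ends by itself
  initial
    #22 begin
      $display("tick1=%b tick2=%b at $time=%0d", tick1, tick2, $time);
      $finish;
    end
endmodule
```

Both regs toggle at the same values of $time, so they hold the same value whenever they are sampled between toggles.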
Later in the code, a reference to these macros (preceded by a backquote) is the same as substituting the associated value. The following ifs mean the same:

    if (aluctrl == `DIFFERENCE)        if (aluctrl == 6'b011001)
      $display("subtracting");           $display("subtracting");

Note the syntax difference between variables (such as aluctrl), macros (such as `DIFFERENCE), and constants (such as 6'b011001). Variables are not preceded by anything. Macros are preceded by backquote. Constants may include one forward single quote.

You can determine whether a macro is defined using `ifdef and `endif. This preprocessing feature should not be confused with if. For example, the following:

    `ifdef DIFFERENCE
      $display("defined");
    `endif

prints the message regardless of the value of DIFFERENCE, as long as that macro is defined. The message is not printed only when there is not a `define for `DIFFERENCE.

Verilog allows you to separate your source code into more than one file (just like #include in C and {$I} in Pascal). To use code contained in another file, you say:

    `include "filename.v"
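The macro features above can be combined in one small sketch. The `define line is an assumption here (the definition referred to by the text is not shown in this part of the chapter), as are the module name and the six-bit width of aluctrl.

```verilog
`define DIFFERENCE 6'b011001   // assumed definition of the macro

module macro_sketch;
  reg [5:0] aluctrl;

  initial
    begin
      aluctrl = `DIFFERENCE;           // macro reference: note the backquote
      if (aluctrl == `DIFFERENCE)      // same as comparing with 6'b011001
        $display("subtracting");
`ifdef DIFFERENCE                      // true whenever the macro is defined
      $display("defined");
`endif
    end
endmodule
```

Removing the `define line would make the two references to `DIFFERENCE errors, while the `ifdef section would simply be skipped.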
There are two forms of comments in Verilog, which are the same as the two forms found in C++. A comment that extends only for the rest of the current line can occur after //. A comment that extends for several lines begins with /* and ends with */.

For example:

    module easy_xor;
      reg a,b;
      wire c;
      xor x1(c,a,b);
    endmodule

    module hard_xor;
      reg a,b;
      wire c;
      wire t1,t2,not_a,not_b;
      not i1(not_a,a);
      not i2(not_b,b);
      and a1(t1,not_a,b);
      and a2(t2,a,not_b);
      or o1(c,t1,t2);
    endmodule

The order in which gates are instantiated in structural Verilog code does not matter, and so the following:

    module scrambled_xor;
      reg a,b;
      wire c;
      wire t1,t2,not_a,not_b;
      or o1(c,t1,t2);
      and a1(t1,not_a,b);
      and a2(t2,a,not_b);
      not i1(not_a,a);
      not i2(not_b,b);
    endmodule

means the same thing, because they both represent the interconnection in the following circuit diagram:
3.5.2 Comparison with behavioral code
Structural Verilog code does not describe the order in which computations implemented by such a structure are carried out by the Verilog simulator. This is in sharp contrast to behavioral Verilog code, such as the following:

    module behavioral_xor;
      reg a,b;
      reg c;
      reg t1,t2,not_a,not_b;

      always ...
        begin
          not_a = ~a;
          not_b = ~b;
          t1 = not_a&b;
          t2 = a&not_b;
          c = t1|t2;
        end
    endmodule

which is a correct behavioral rendition of the same idea. (The ellipses must be replaced by a Verilog feature described later.) Also, c, t1, t2, not_a and not_b must be declared as regs because this behavioral (rather than structural) code assigns values to them.

To rearrange the order of behavioral assignment statements is incorrect:

    module bad_xor;
      reg a,b;
      reg c;
      reg t1,t2,not_a,not_b;

      always ...
        begin
          c = t1|t2;
          t1 = not_a&b;
          t2 = a&not_b;
          not_a = ~a;
          not_b = ~b;
        end
    endmodule

because not_a must be computed before t1 by the Verilog simulator.
3.5.3 Interconnection errors: four-valued logic
In software, a bit is either a 0 or a 1. In properly functioning hardware, this is usually the case also, but it is possible for gates to be wired together incorrectly in ways that produce electronic signals that are neither 0 nor 1. To more accurately model such physical possibilities,⁴ each bit in Verilog can be one of four things: 1'b0, 1'b1, 1'bz or 1'bx.

Obviously, 1'b0 and 1'b1 correspond to the logical 0 and logical 1 that we would normally expect to find in a computer. For most technologies, these two possibilities are represented by a voltage on a wire. For example, active high TTL logic would represent 1'b0 as zero volts and 1'b1 as five volts. Active low TTL logic would represent 1'b0 as five volts and 1'b1 as zero volts. Other kinds of logic families, such as CMOS, use different voltages. ECL logic uses current, rather than voltage, to represent information, but the concept is the same.

    module forget_or_that_outputs_c;
      reg a,b;
      wire c;
      wire t1,t2,not_a,not_b;
      not i1(not_a,a);
      not i2(not_b,b);
      and a1(t1,not_a,b);
      and a2(t2,a,not_b);
    endmodule

⁴ Verilog also allows each bit to have a strength, which is an electronic concept (below gate level) beyond the scope of this book.
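What a simulator reports for the wire c in forget_or_that_outputs_c can be seen with a sketch like the following. The initial block, the chosen input values and the module name are assumptions added only to observe the wire.

```verilog
module forget_or_sketch;
  reg a,b;
  wire c;                     // no gate output is connected to c
  wire t1,t2,not_a,not_b;
  not i1(not_a,a);
  not i2(not_b,b);
  and a1(t1,not_a,b);
  and a2(t2,a,not_b);

  initial
    begin
      a = 0;
      b = 1;
      #1 $display("a=%b b=%b t1=%b t2=%b c=%b", a, b, t1, t2, c);
      // c is displayed as z: a wire that nothing drives is high impedance
    end
endmodule
```

The and gates still compute t1 and t2 correctly. Only c is affected, because the or gate that should drive it was left out.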
Arithmetic and relational operators (including == and !=) produce their usual results only when both operands are composed of 1'b0s and 1'b1s. In any other case, the result is 1'bx. This relates to the fact the corresponding combinational logic required to implement such operations in hardware would not produce a reliable result under such circumstances. For example:

    if (a == 1'bx)
      $display("a is unknown");

will never display the message, even when a is 1'bx, because the result of the == operation is always 1'bx. 1'bx is not the same as 1'b1, and so the $display never executes.

There are two special comparison operators (=== and !==) that overcome this limitation. === and !== cannot be implemented in hardware, but they are useful in writing intelligent simulations. For example:

    if (a === 1'bx)
      $display("a is unknown");

will display the message if and only if a is 1'bx.

To help understand the last examples, you should realize that the following two if statements are equivalent:

    if (expression)            if ((expression) === 1'b1)
      statement;                 statement;

The following table summarizes how the four-valued logic works with common operators:
    initial
      begin
        val[0] = 1'b0;
        val[1] = 1'b1;
        val[2] = 1'bx;
        val[3] = 1'bz;
        $display("a b a==b a===b a!=b a!==b a ...
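Since the table itself does not survive here, one way to regenerate it is a sketch along the lines of the fragment above. The loop structure, the variable names ia and ib, the declaration of val and the exact set of columns are assumptions; the original appears to have printed more operators than this.

```verilog
module four_valued_table;
  reg val[3:0];               // the four values a single bit can take
  reg a, b;
  integer ia, ib;

  initial
    begin
      val[0] = 1'b0;
      val[1] = 1'b1;
      val[2] = 1'bx;
      val[3] = 1'bz;
      $display("a b a==b a===b a!=b a!==b a&b a|b a^b");
      for (ia = 0; ia <= 3; ia = ia + 1)
        for (ib = 0; ib <= 3; ib = ib + 1)
          begin
            a = val[ia];
            b = val[ib];
            // one row per combination of a and b
            $display("%b %b  %b    %b     %b    %b    %b   %b   %b",
                     a, b, a==b, a===b, a!=b, a!==b, a&b, a|b, a^b);
          end
    end
endmodule
```

Reading the sixteen rows shows the pattern described in the text: == and != give x whenever either operand is x or z, while === and !== always give 0 or 1.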
    module two_blocks;
      integer a,b;

      initial
        begin
          a=1;
          $display("a is one");
        end

      initial
        begin
          b=2;
          $display("b is two");
        end
    endmodule

The above simulates a system in which a and b are simultaneously assigned their respective values. This means, from a simulation standpoint, $time is the same when a is assigned one as when b is assigned two. (Since both assignments occur in initial blocks, $time is 0.) Note that this does not imply the sequence in which these assignments (or the corresponding $display statements) occur.
3.6.3 Scheduling
3.7 Time control
Behavioral Verilog may include time control statements, whose purpose is to release control back to the Verilog scheduler so that other processes may execute and also tell the Verilog simulator at what $time the current process would like to be restarted. There are three forms of time control that have different ways of telling the simulator when to restart the current process: #, @ and wait.

3.7.1 # time control
When a statement is preceded by # followed by a number, the scheduler will not execute the statement until the specified number of $time units have passed. Any other process that desires to execute earlier than the $time specified by the # will execute before the current process resumes. If we modify the first example from section 3.6:

    module two_blocks_time_control;
      integer a,b;

      initial
        begin
          #4
          a=1;
          $display("a is one at $time=%d",$time);
        end

      initial
        begin
          #3
          b=2;
          $display("b is two at $time=%d",$time);
        end
    endmodule

the above will assign first to b (at $time=3) and then to a one unit of $time later. The order in which these statements execute is unambiguous because the # places them at a certain point in $time.

There can be more than one # in a block. The following nonsense module illustrates how the # works:
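A short sketch of several # delays in one block is given below. It is not the module from the text, which does not survive here; the module name, the variable x and the delay values are assumptions chosen only to show that successive delays accumulate.

```verilog
module delay_sketch;
  integer x;

  initial
    begin
      #2 x = 10;
      $display("x=%0d at $time=%0d", x, $time);   // $time is 2
      #3 x = 20;
      $display("x=%0d at $time=%0d", x, $time);   // $time is 2+3 = 5
      #4 x = 30;
      $display("x=%0d at $time=%0d", x, $time);   // $time is 5+4 = 9
    end
endmodule
```

Each # is relative to the $time at which the block reached it, not to $time 0, so the three assignments occur at 2, 5 and 9.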
      ...
      xor x1(c,a,b);

      initial
        begin
          for (ia=0; ia<=1; ia = ia+1)
            begin
              a = ia;
              for (ib=0; ib<=1; ib = ib + 1)
                begin
                  b = ib;
                  #10 $display("a=%d b=%d c=%d",a,b,c);
                end
            end
        end
    endmodule

The first time through, a and b are initialized to be 0 at $time 0. When #10 executes at $time 0, the initial block relinquishes control, and x1 is given the opportunity to compute a new value (0^0=0) on the wire c. Having completed everything scheduled at $time 0, the simulator advances $time. The next thing scheduled to execute is the $display statement at $time 10. (The simulator does not waste real time computing anything for $time 1 through 9 since nothing changes during this $time.) The simulator prints out that "a=0 b=0 c=0" at $time 10 and then goes through the inner loop once again. While $time is still 10, b becomes 1. The #10 relinquishes control, x1 computes that c is now 1 and $time advances. The $display prints out that "a=0 b=1 c=1" at $time 20. The last two lines of the truth table are printed out in a similar fashion at $times 30 and 40.
The behavioral exclusive OR example in section 3.6.3 deadlocks the simulator because it does not have any time control. If we put some time control in this always block (say a propagation delay of #1), the simulator will have an opportunity to schedule the test code instead of deadlocking inside the always block:

    module top;
      integer ia,ib;
      reg a,b;
      reg c;

      always #1
        c = a^b;

      initial
        begin
          for (ia=0; ia<=1; ia = ia+1)
            begin
              a = ia;
              for (ib=0; ib<=1; ib = ib + 1)
                begin
                  b = ib;
                  #10 $display("a=%d b=%d c=%d",a,b,c);
                end
            end
          $finish;
        end
    endmodule

As in the last example, a and b are initialized to be 0 at $time 0. When #10 executes at $time 0, the initial block relinquishes control, which gives the always loop an opportunity to execute. The first thing that the always block does is to execute #1, which relinquishes control until $time 1. Since no other block wants to execute at $time 1, execution of the always block resumes at $time 1, and it computes a new value (0^0=0) for the reg c. Because this is an always block, it loops back to the #1. Since no other block wants to execute at $time 2, execution of the always block resumes at $time 2, and it recomputes the same value for the reg c that it just computed at $time 1. The always block continues to waste real time by unnecessarily recomputing the same value all the way up to $time 9.

Finally, the $display statement executes at $time 10. The test code prints out "a=0 b=0 c=0" and goes through its inner loop once again. While $time is still 10, b becomes 1. The #10 relinquishes control, and the always block will have another ten chances to compute that c is now 1. The remaining lines of the truth table are printed out in a similar fashion.
    reg sysclk;

    initial
      sysclk = 0;

    always #50
      sysclk = ~sysclk;
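Wrapped in a module, the clock generator above can be observed with a sketch such as the following. The module name, the display block and the stopping time of 300 are assumptions added for illustration; the posedge form of @ used here is the one described in section 3.7.2.

```verilog
module clock_sketch;
  reg sysclk;

  initial
    sysclk = 0;               // without this, sysclk would stay 1'bx

  always #50
    sysclk = ~sysclk;         // toggles every 50 units: period of 100

  always @(posedge sysclk)
    $display("rising edge at $time=%0d", $time);   // 50, 150, 250

  initial
    #300 $finish;             // the always blocks would otherwise run forever
endmodule
```

The first rising edge occurs at $time 50 because the clock starts at 0 and the first toggle happens half a period later.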
    @(expression)
    @(expression or expression or ...)
    @(posedge onebit)
    @(negedge onebit)
    @ event

When there is a single expression in parentheses, the @ waits until one or more bit(s) in the result of the expression change. As long as the result of the expression stays the same, the block in which the @ occurs will remain suspended. When multiple expressions are separated by or, the @ waits until one or more bit(s) in the result of any of the expressions change. The word or is not the same as the operator |.

In the above, onebit is a single-bit wire or reg (declared without the square bracket). When posedge occurs in the parentheses, the @ waits until onebit changes from a 0 to a 1. When negedge occurs in the parentheses, the @ waits until onebit changes from a 1 to a 0. The following mean the same thing:

    ...
    module top;
      integer ia,ib;
      reg a,b;
      reg c;

      always @(a or b)
        c = a^b;

      initial
        begin
          for (ia=0; ia<=1; ia = ia+1)
            begin
              a = ia;
              for (ib=0; ib<=1; ib = ib + 1)
                begin
                  b = ib;
                  #10 $display("a=%d b=%d c=%d",a,b,c);
                end
            end
          $finish;
        end
    endmodule

... that a or b can change anymore at $time 0, the simulator advances $time. The next thing scheduled to execute is the $display statement at $time 10. (Like the example in section 3.7.1.1, but unlike the example in section 3.7.1.2, the simulator does not waste real time computing anything for $time 1 through 9 since nothing changes during that $time.) The simulator prints out that "a=0 b=0 c=0" at $time 10, and then goes through the inner loop once again. While $time is still 10, b becomes 1. The #10 relinquishes control, and the always block has an opportunity to do something. Since b just changed (though a did not change), the @ does not suspend, and c is now 1. After $time advances, the $display prints out that "a=0 b=1 c=1" at $time 20. The last two lines of the truth table are printed out in a similar fashion at $times 30 and 40.

Since this is a model of combinational logic, it is very important that every input to the logic be listed after the @. We refer to this list of inputs to the physical gate as the sensitivity list.
3.7.2.2 Modeling synchronous registers
Most synchronous registers that we deal with use rising edge clocks. Using @ with posedge is the easiest way to model such devices. For example, consider an enabled register whose input (of any bus width) is din and whose output (of similar width as din) is dout. At the rising edge of the clock, when ld is 1, the value presented on din will be loaded. Otherwise dout remains the same. Assuming din, dout, ld and sysclk are taken care of properly elsewhere in the module, the behavioral code to model such an enabled register is:

Similar Verilog code can be written for a counter register that has clr, ld, and cnt signals:
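The code for these two registers does not survive in this part of the text, so the following is only a sketch of what such models might look like. The 12-bit width, the separate count reg, the priority of clr over ld over cnt, and the test stimulus are all assumptions.

```verilog
module register_sketch;
  reg sysclk, ld, clr, cnt;
  reg [11:0] din;
  reg [11:0] dout;            // output of the enabled register
  reg [11:0] count;           // output of the counter register

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;

  // enabled register: load din on a rising edge only when ld is 1
  always @(posedge sysclk)
    if (ld)
      dout = din;

  // counter register: assumed priority is clr, then ld, then cnt
  always @(posedge sysclk)
    begin
      if (clr)
        count = 0;
      else if (ld)
        count = din;
      else if (cnt)
        count = count + 1;
    end

  // hypothetical stimulus: clear, then load 7, then count
  initial
    begin
      ld = 0; clr = 1; cnt = 0; din = 12'd7;
      #100 clr = 0; ld = 1;
      #100 ld = 0; cnt = 1;
      #200 $display("dout=%0d count=%0d", dout, count);
      $finish;
    end
endmodule
```

In both blocks the absence of an else branch for the idle case is what makes the register hold its old value when no command is active.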
There are several things to note about the above code. First, the indentation is used only to promote readability. Assuming the code for generating sysclk given in section ...

The second thing to note is that the = in Verilog is just a software ... ment. (The variable is modified at the $time the statement executes ... will retain the new value until modified again.) This is different ... ASM chart notation. (The command signal is a function of the p... mand signal does not retain the new value after the rising edge o... instead returns to its default value.) Another way of saying th... default values in standard Verilog variables as there are for A...

Despite the distinction between Verilog and ASM chart notation ... ASM chart in Verilog by fully specifying every command output ... those states where a command is not mentioned in an ASM chart ... Verilog assignment statement that stores the default value into ... corresponding to the missing ASM chart command. The stop=0 ... ments above were not shown in the original ASM chart but are r... code to model what the hardware would actually do.

The third thing is the names of the states are not yet included in ... comments are of course ignored by Verilog.) Eventually, we will ... ing meaningful state names in the actual code.

The fourth thing is that this ASM chart does not have any RTN ... stage). We will need an additional Verilog notation to model ASM ... This notation is discussed in section 3.8.
3.7.2.4 @ for debugging display
@ can also be used for causing the Verilog simulator to print debugging output that shows what happens as actions unfold in the simulation. For example,

    always @(a or b or c)
      $display("a=%b b=%b c=%b at $time=%d",a,b,c,$time);

The above block would eliminate the need for the designer to worry about putting $display statements in the test code or in the code for the machine being tested.

    initial
      begin
        pb = 0;
        x = 0;
        y = 0;
        #250;
        @(posedge sysclk);
        while (x<=4095)
          begin
            for (y=1; y<=4095; y = y+1)
              begin
                @(posedge sysclk);
                pb = 1;

The ellipsis shows where the code for the actual division machine was omitted in the above. The quotient is produced by this machine which is not shown here. The design of this code will be discussed in the next chapter.
3.8 Assignment with time control
The # and @ time control, discussed in sections 3.7.1 and 3.7.2, precede a statement. These forms of time control delay execution of the following statement until the specified $time. There are two special kinds of assignment statements⁵ that have time control inside the assignment statement. These two forms are known as blocking and non-blocking procedural assignment.

⁵ Assignment with time control is not accepted by some commercial synthesis tools but is accepted by all Verilog simulators. Since there are problems with intra-assignment delay (section 3.8.2.1), some authors recommend against its use, but when used as recommended later in this chapter (section 3.8.2.2), it becomes a powerful tool. Chapter 7 explains a preprocessor that allows all synthesis tools to accept the use proposed in this book.
Other variations are also legal. What distinguishes this from a normal instantaneous assignment is that the expression is evaluated at the $time the statement first executes, but the variable does not change until after the specified delay. For example, assuming temp is a reg that is not used elsewhere in the code and that temp is declared to be the same width as a and b, the following two fragments of code mean the same thing:

    initial                            initial
      begin                              begin
        ...                                ...
        a = @(posedge sysclk) b;           temp = b;
      end                                  @(posedge sysclk) a = temp;
                                         end

Blocking procedural assignment is almost what we need to model an ASM chart with RTN. The one problem with it, as its name implies, is that it blocks the current process from continuing to execute additional statements at the same $time. We will not use blocking procedural assignment for this reason.
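The point that the right-hand side is sampled when the statement starts, not when the variable finally changes, can be sketched as follows. The module name, the four-bit width, the clock and the second initial block that disturbs b are assumptions added for illustration.

```verilog
module intra_assign_sketch;
  reg sysclk;
  reg [3:0] a, b;

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;

  initial
    begin
      b = 3;
      #10 a = @(posedge sysclk) b;   // b is sampled at $time 10
      // execution reaches here only after the rising edge at $time 50
      $display("a=%0d b=%0d at $time=%0d", a, b, $time);
      $finish;
    end

  initial
    #20 b = 9;                       // b changes after it was sampled
endmodule
```

Here a receives the value that b held at $time 10 even though b has changed by the time of the rising edge, and the $display is held back until the assignment completes, which is the blocking behavior the text describes.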
However, when one runs this code on a Verilog simulator, the following incorrect result is produced (assuming the debugging always block shown in section 3.7.2.4):

The above Verilog starts to execute the statements for state YELLOW at $time 150. The last of these statements evaluates count+1 at $time 150 and schedules the storage of the result. Since count is still 3'b000 at $time 150, the result scheduled to be stored at the end of $time 250 is 3'b001. The @(posedge sysclk) that starts state RED causes the always block to suspend until $time 250. The problem shown above occurs at $time 250 because the assignment initiated by the <= at $time 150 is the last thing that occurs at $time 250. Prior to the assignment, the process will resume and execute the three statements, including count <= @(posedge sysclk) count + 2. Since count is still 3'b000, this <= schedules 3'b010 to be stored at $time 350, which is not what happens in an ASM chart. As soon as the assignment of 3'b001 has been scheduled at $time 250, 3'b001 will be stored into count (as a result of the first <=).
      always
        begin
          @(posedge sysclk) #1          //this models state GREEN
            stop = 0;
            speed = 3;
          @(posedge sysclk) #1          //this models state YELLOW
            stop = 1;
            speed = 1;
            count <= @(posedge sysclk) count + 1;
          @(posedge sysclk) #1          //this models state RED
            stop = 1;
            speed = 0;
            count <= @(posedge sysclk) count + 2;
        end

      always @(posedge sysclk)
        #20 $display("stop=%b speed=%b count=%b at $time=%d",
                     stop,speed,count,$time);

      initial
        begin
          count = 0;
          #600 $finish;
        end
    endmodule
Let's analyze the reason why each block is required in this module. The first initial block is required to give sysclk a value other than 1'bx at $time 0. The next block toggles sysclk so that the clock period is 100. If sysclk were not initialized at $time 0, it would stay 1'bx forever (~1'bx is 1'bx).

The only new thing in the always block that models the ASM chart is the addition of #1 after each @(posedge sysclk). The always block that follows it displays stop, speed and count during each state.

The test code in the final initial block simply initializes count to be 3'b000. (In a real machine, this would occur in a state of the ASM, but instead here it is part of the test code for the purposes of illustration only.) The test code schedules a $finish system task to be called at $time 600. This is required because the always blocks would otherwise tell the simulator to go on forever.

With the #1 after each @, the Verilog simulator produces the following correct output:

    stop=0 speed=11 count=000 at $time=  70
    stop=1 speed=01 count=000 at $time= 170
    stop=1 speed=00 count=001 at $time= 270
    stop=0 speed=11 count=011 at $time= 370
    stop=1 speed=01 count=011 at $time= 470
    stop=1 speed=00 count=100 at $time= 570
Experienced hardware designers who are new to Verilog may find the implicit style approach confusing because it requires thinking about a state machine in a different way. The implicit style is much more like software concepts, such as the distinction between if and while. On the other hand, experienced software designers may also find this approach difficult at first because the timing relationship between <= and decisions in Verilog is different than in conventional software languages. The following sections go through a series of examples that illustrate some typical kinds of ASM constructs and how they translate into implicit style Verilog.

3.8.2.3.2 Identifying the infinite loop
Unlike software, all ASMs have at least one infinite loop. Implicit style behavioral Verilog is defined by an always block. Many times this always block can also serve to implement the infinite loop of the ASM. In the following ASM, the transitions from states FIRST, SECOND, THIRD and FOURTH are implicit. The designer does not have to say anything about their next states. The transition from FIFTH to FIRST occurs because of the always:

[ASM chart: states FIRST (a ← 1), SECOND (b ← a), THIRD (a ← b), FOURTH (b ← 4), FIFTH (a ← 5)]
    module top;
      //Following are actual hardware registers of ASM
      reg [11:0] a,b;

      //Following is NOT a hardware register
      reg sysclk;

      //The following always block models actual hardware
      ...

      initial
        begin
          sysclk = 0;
          #1400 $stop;
        end
    endmodule

The above is slightly more primitive than what will be used in later chapters, but the emphasis of this example is to show how an ASM translates into Verilog. In the above, there are three always blocks, but only the first one corresponds to hardware. The other two always blocks and the initial block are necessary for simulation (in later chapters these other blocks will be moved to other modules).
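A complete, runnable version of this module can be sketched by following the pattern of the later examples in this section. The body of the hardware always block is inferred from the five states of the ASM chart, and the clock period of 100, the #20 display offset and the $display format are assumptions rather than the book's exact code.

```verilog
module top;
  reg [11:0] a,b;             // actual hardware registers of the ASM
  reg sysclk;                 // NOT a hardware register

  always                      // this block models the actual hardware
    begin
      @(posedge sysclk) #1;              // state FIRST
        a <= @(posedge sysclk) 1;
      @(posedge sysclk) #1;              // state SECOND
        b <= @(posedge sysclk) a;
      @(posedge sysclk) #1;              // state THIRD
        a <= @(posedge sysclk) b;
      @(posedge sysclk) #1;              // state FOURTH
        b <= @(posedge sysclk) 4;
      @(posedge sysclk) #1;              // state FIFTH
        a <= @(posedge sysclk) 5;
    end                                  // always returns to state FIRST

  always #50 sysclk = ~sysclk;           // simulation only: assumed period of 100

  always @(posedge sysclk)               // simulation only: assumed debug display
    #20 $display("a=%d b=%d at $time=%d", a, b, $time);

  initial
    begin
      sysclk = 0;
      #1400 $stop;
    end
endmodule
```

No state names a next state explicitly. Falling through to the next @(posedge sysclk) is what moves the machine forward, and reaching the end of the always block is what returns it to FIRST.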
[ASM chart: states FIRST, SECOND, THIRD, FOURTH and FIFTH]
For brevity, only the always block that corresponds to the actual hardware is shown:

    always
      begin
        @(posedge sysclk) #1;              // state FIRST
          a <= @(posedge sysclk) 1;
        @(posedge sysclk) #1;              // state SECOND
          b <= @(posedge sysclk) a;
        if (a == 1)
          begin
            @(posedge sysclk) #1;          // state THIRD
              a <= @(posedge sysclk) b;
          end
        else
          begin
            @(posedge sysclk) #1;          // state FOURTH
              b <= @(posedge sysclk) 4;
          end
        @(posedge sysclk) #1;              // state FIFTH
          a <= @(posedge sysclk) 5;
      end

The if else is appropriate here because only one of the states (THIRD or FOURTH) will execute. Because a is one in state SECOND, state THIRD will execute. In the following very similar Verilog, state FOURTH rather than state THIRD will execute:

Figure 3-4. ASM without else.
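    always
      begin
        @(posedge sysclk) #1;              // state FIRST
          a <= @(posedge sysclk) 1;
        @(posedge sysclk) #1;              // state SECOND
          b <= @(posedge sysclk) a;
        if (a == 1)
          begin
            @(posedge sysclk) #1;          // state THIRD
              a <= @(posedge sysclk) b;
            @(posedge sysclk) #1;          // state FOURTH
              b <= @(posedge sysclk) 4;
          end
        @(posedge sysclk) #1;              // state FIFTH
          a <= @(posedge sysclk) 5;
      end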
3.8.2.3.5 Recognizing while
... determines whether to go to state THIRD or state FIFTH. The second of the following two ASMs is a much less desirable way to describe the identical hardware. It is undesirable because the a==1 test is duplicated; however, its meaning is exactly the same as the first of the following two ASMs:
always
begin
@(posedge sysclk) #1; // state FIRST
a <= @(posedge sysclk) 1;
@(posedge sysclk) #1; // state SECOND
b <= @(posedge sysclk) a;
while (a == 1)
begin
@(posedge sysclk) #1; // state THIRD
a <= @(posedge sysclk) b;
@(posedge sysclk) #1; // state FOURTH
b <= (posedge sysclk) 4;
end
@(posedge sysclk) #1; // state FIFTH
a <= (posedge sysclk) 5;
end
In fact, the only syntactic difference between the above Verilog and the Verilog in
section 3.8.2.3.4 is that the word if has been changed to while. The advantage of
looking at this particular ASM as a while loop is that the decision a==1 is shared by
both state SECOND and state FOURTH. With the while loop, the designer does not
have to worry that the decision is actually part of two states. Many practical algorithms
that produce useful results (as illustrated in chapter 2) demand a loop of this style. The
while in Verilog makes this easy.
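To watch the while loop above step through its states, the same statements can be wrapped in a small test module. The following is only a sketch under stated assumptions: the module name top, the 12-bit width chosen for a and b, the clock period of 100 time units and the $finish time are all illustrative choices, not taken from the original example.

```verilog
// Sketch: simulation wrapper for the while-loop machine.
// Assumptions: module name, register widths, clock period, run length.
module top;
  reg sysclk;
  reg [11:0] a, b;              // width is an assumption

  // clock with a period of 100 time units
  initial sysclk = 0;
  always #50 sysclk = ~sysclk;

  // the machine: same statements as the while example in the text
  always
    begin
      @(posedge sysclk) #1;       // state FIRST
       a <= @(posedge sysclk) 1;
      @(posedge sysclk) #1;       // state SECOND
       b <= @(posedge sysclk) a;
      while (a == 1)
        begin
          @(posedge sysclk) #1;   // state THIRD
           a <= @(posedge sysclk) b;
          @(posedge sysclk) #1;   // state FOURTH
           b <= @(posedge sysclk) 4;
        end
      @(posedge sysclk) #1;       // state FIFTH
       a <= @(posedge sysclk) 5;
    end

  // print a and b shortly after each rising edge
  always @(posedge sysclk) #20
    $display("%d a=%d b=%d", $time, a, b);

  initial #1500 $finish;
endmodule
```

Changing the word while to if in this sketch gives the machine of section 3.8.2.3.4, so the two behaviors can be compared from their printed values.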
3.8.2.3.6 Recognizing forever
Sometimes machines need initialization states that execute only once. Since synthesis
tools only accept behavioral Verilog defined with always blocks, such ASMs still
begin with the keyword always. However, the looping action of the always is not
pertinent. (If the designer only wanted to simulate the machine, initial would work
just as well as always, but ultimately the synthesis tool will demand always.)
In order to describe the infinite loop that exists beyond the initialization states, the
designer must use forever. For example, consider the following ASM:
Figure 3-8. Two ways to draw if at the bottom of forever.
The ASM on the right tests if a != 1 to see whether to leave the loop involving only
state THIRD and proceed to state FIFTH. The reason the ASM on the right is preferred
is that its translation into Verilog is obvious:
always
 begin
  @(posedge sysclk) #1; // state FIRST
   a <= @(posedge sysclk) 1;
  @(posedge sysclk) #1; // state FOURTH
   b <= @(posedge sysclk) 4;
  forever
   begin
    @(posedge sysclk) #1; // state THIRD
     a <= @(posedge sysclk) b;
    if (a != 1)
     begin
      @(posedge sysclk) #1; // state FIFTH
       a <= @(posedge sysclk) 5;
     end
   end
 end
... if, but the combination of an if at the bottom of forever or always has the effect
of nesting a non-infinite loop inside an infinite loop. It is the forever or always
that forms the looping action, not the if. This example illustrates a kind of implicit
behavioral Verilog that sometimes causes novice Verilog designers confusion. It is suggested
that the reader should fully appreciate this example before proceeding to later
chapters. Designers need to be careful not to confuse if with while.
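3.9.1 Tasks
The syntax for a task is:
task name;
 input arguments;
 output arguments;
 inout arguments;
 declarations;
 begin
  statement;
 end
endtask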
This task definition must occur inside a module. The task is usually intended to be
called only by initial blocks, always blocks and other tasks within that module.
Tasks may have any behavioral statements, including time control.
Verilog lets the designer choose the order in which the input, output and inout
definitions are given. (The order shown above is just one possibility.) The order in
which input, output and inout definitions occur is based on the calling sequence
desired by the designer. The sequence in which the formal arguments are listed in some
combination of input, output and/or inout definitions determines how the actual
arguments are bound to the formal definitions when the task is called.
The purpose of an input argument is to send information from the calling code into
the task by value. An input argument may include a width (which is equivalent to a
wire of that width) or it may be given a type of integer or real in a separate
declaration. An input argument may not be declared as a reg.
The purpose of an output argument is to send a result from the task to the calling
code by reference. An output argument must be declared as a reg, integer or
real in a separate declaration.
An inout definition combines the roles of input and output. An inout argument
must be declared as a reg, integer or real in a separate declaration.
 integer count_arg,numb_arg,sum_arg,prod_arg;
 begin
  sum_arg = sum_arg + count_arg;
  prod_arg = sum_arg * count_arg;
  count_arg = count_arg + numb_arg;
  $display(sum_arg,prod_arg);
 end
endtask
Because the formal inout sum_arg is defined first, it corresponds to the actual sum
in the initial block. Similarly, the formal output prod_arg corresponds to
prod, and the formal inout count_arg corresponds to count. In order to pass
different numbers each time to example, the formal numb_arg is defined to be
input. The order in which the arguments are declared (in this case with the integer
type) is irrelevant. The $display statements produce the following:
 1 1
 4 12
 10 60
 21 231
 21 231 18
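The binding of actual to formal arguments described above can be tried with a complete module. This is a sketch rather than the original listing: the order of the input, output and inout definitions follows the description in the text, while the initial values of sum and count and the numbers 2, 3, 5 and 7 passed for numb_arg are assumptions chosen so that the printed values match the ones listed above.

```verilog
// Sketch: a task with inout, output and input arguments.
// Assumptions: starting values of sum and count, and the constants
// passed as the fourth argument on each call.
module top;
  integer sum, prod, count;

  task example;
    inout  sum_arg;      // bound to the first actual argument
    output prod_arg;     // bound to the second actual argument
    inout  count_arg;    // bound to the third actual argument
    input  numb_arg;     // bound to the fourth actual argument
    integer count_arg, numb_arg, sum_arg, prod_arg;
    begin
      sum_arg   = sum_arg + count_arg;
      prod_arg  = sum_arg * count_arg;
      count_arg = count_arg + numb_arg;
      $display(sum_arg, prod_arg);
    end
  endtask

  initial
    begin
      sum = 0;
      count = 1;
      example(sum, prod, count, 2);   // sum=1,  prod=1
      example(sum, prod, count, 3);   // sum=4,  prod=12
      example(sum, prod, count, 5);   // sum=10, prod=60
      example(sum, prod, count, 7);   // sum=21, prod=231
      $display(sum, prod, count);     // values after the last call
    end
endmodule
```

Because sum and count are passed as inout, each call sees the values left by the previous call, which is why the running totals accumulate.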
3.9.1.2 enter_new_state task
The translation of the ASM chart from section 2.1.1.3 into Verilog given in section
3.8.2.2 is correct but could be improved in two ways. First, this translation did not
include state names as part of the Verilog code (they were only in the comments).
Second, this translation did not automatically provide default values for states where
command signals were not mentioned, as occurs in ASM chart notation.
To overcome both of these limitations, we will define a task, which is arbitrarily given
the name enter_new_state. The purpose of this task is to do things that occur
whenever the machine enters any state. This includes storing into present_state a
representation of a state (which is passed as an input argument, this_state), doing
the #1 (which is legal in a task) to allow the <= to work properly and giving default
...
The always block that implements the ASM chart is similar to the one given in section
3.8.2.2:
always
 begin
  @(posedge sysclk) enter_new_state(`GREEN);
   speed = 3;
  @(posedge sysclk) enter_new_state(`YELLOW);
   stop = 1;
   speed = 1;
   count <= @(posedge sysclk) count + 1;
  ...
The only differences are that the state names are passed as arguments to
enter_new_state, and default values do not have to be mentioned. For example,
state GREEN uses the default value 0 for stop, and state RED uses the default value
0 for speed.
The task that accomplishes these things for this particular ASM is:
Even though default values are assigned for every state, since no time control occurs in
this task after the assignment of default values, those states where non-default values
are assigned work correctly. For example, assume the machine enters state GREEN at
$time 50. At that $time, present_state will be assigned 2'b00. At $time 51,
stop and speed will be assigned their defaults of 0, but since there is no more time
control, the always block which called on the task is not interruptible. At the same
$time 51 speed changes to 3. Any other module concerned about speed at $time
51 would only observe a change to a value of 3. To understand this, we need to distinguish
between sequence and $time. Because the task was called, two changes occurred
to speed in sequence, but since they happened at the same $time, the outside
world can only observe the last change. This creates exactly the effect we want. We are
now ready to model ASM charts that do practical things with behavioral Verilog. Examples
of translating ASM charts into Verilog using tasks like this are given in chapter
4.
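The interplay between sequence and $time described above can be observed with a small simulation. The following is a minimal sketch, not the book's listing: the encoding 2'b00 for GREEN and the default value 0 for stop and speed come from the text, but the encodings of YELLOW and RED, the action stop = 1 in state RED, the register widths, the clock period and the omission of count are all assumptions made to keep the example short.

```verilog
// Sketch: a task that records the state and supplies default outputs.
// Assumptions: YELLOW/RED encodings, RED's actions, widths, clock period.
`define NUM_STATE_BITS 2
`define GREEN  2'b00
`define YELLOW 2'b01
`define RED    2'b10

module top;
  reg sysclk;
  reg stop;
  reg [1:0] speed;
  reg [`NUM_STATE_BITS-1:0] present_state;

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;

  task enter_new_state;
    input [`NUM_STATE_BITS-1:0] this_state;
    begin
      present_state = this_state;
      #1 stop = 0;          // defaults; no time control follows, so a
         speed = 0;         // caller can override them at the same $time
    end
  endtask

  always
    begin
      @(posedge sysclk) enter_new_state(`GREEN);
       speed = 3;           // stop keeps its default of 0
      @(posedge sysclk) enter_new_state(`YELLOW);
       stop = 1;
       speed = 1;
      @(posedge sysclk) enter_new_state(`RED);
       stop = 1;            // assumed; speed keeps its default of 0
    end

  // sample the outputs well after the defaults and overrides settle
  always @(posedge sysclk) #20
    $display("%d state=%b stop=%b speed=%d",
             $time, present_state, stop, speed);

  initial #1000 $finish;
endmodule
```

Since the $display samples 20 time units after each edge, it only ever sees the final value assigned in each state, never the intermediate default.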
3.9.2 Functions
The syntax for a function is similar to a task:
function type name;
 input arguments;
 declarations;
 begin
  statement;
  ...
  name = expression;
 end
endfunction
except only input arguments are allowed. In the function definition, type is either
integer, real or a bit width defined in square brackets. The statement(s) in a
function never include any time control. The name of the function must be assigned
inputs  output
a b     c s
0 0     0 0
0 1     0 1
1 0     0 1
1 1     1 0
The actual argument A in the always block is bound to the formal a in half_add,
and the actual argument B is bound to the formal b. The locals c and s are concatenated
to form a two-bit result (hence the [1:0] declaration for the function). This two-bit
result is stored in the two-bit concatenation {C,S}.
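A function with the behavior described above might look like the following. This is a hedged sketch: the names half_add, a, b, c, s, A, B, C and S come from the text, but the body of the function (an exclusive OR for the sum and an AND for the carry, which matches the truth table of a half-adder), the module name top and the test loop are assumptions.

```verilog
// Sketch: a function that models a half-adder as combinational logic.
// Assumptions: the function body, module name and test loop.
module top;
  reg A, B;
  reg C, S;
  integer i;

  function [1:0] half_add;
    input a, b;
    reg c, s;
    begin
      s = a ^ b;              // sum bit
      c = a & b;              // carry bit
      half_add = {c, s};      // two-bit result
    end
  endfunction

  // re-evaluate whenever either actual argument changes
  always @(A or B)
    {C, S} = half_add(A, B);

  initial
    begin
      #1;                     // let the always block reach its @ first
      for (i = 0; i <= 3; i = i + 1)
        begin
          {A, B} = i;
          #1 $display("A=%b B=%b C=%b S=%b", A, B, C, S);
        end
    end
endmodule
```

The function contains no time control, so all of the timing is decided by the always block that calls it.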
Verilog code is composed of one or more modules. Each module is either a top-level
module or an instantiated module. A top-level module is one (like all the earlier ex-
amples in this chapter) which is not instantiated elsewhere in the source code. There is
only one copy of a top-level module. The definition of a top-level module is the same
as the code that executes. The regs and wires in a top-level module are unique.
An instantiated module, on the other hand, is a unique executable copy of the defini-
tion. There may be many such copies. The definition is a "blueprint" for each of these
instances. For example, section 2.5 illustrates an adder that needs three instances of a
half-adder. It is only necessary to define the half-adder once. It can be instantiated as
many times as required. Each instance of an instantiated module has its own copy of
the regs and wires specified by the designer. For example, the value stored in a
particular reg in one instance of a module need not be the same as the value stored in
the reg of the same name in another instance of that module.
Instantiated modules should have ports that allow outside connections with each instance.
It is this interconnection (i.e., structure) with the system external to the instance
that gives each instance its unique role in the total system. Normally, each instance is
internally identical to other instances derived from the same module definition, and
how an instance is connected within the system gives that instance its characteristics.
... which is often incorrect. There are two ways to declare the size: either as a wire of
some size (regardless of whether the module uses a behavioral instance or a structural
instance) or with the input definition.7
There is another analogy for ports: ports are like the pins on an integrated circuit.
Some pins are inputs and some pins are outputs. This is a very good analogy, but it is a
little dangerous because when a large design is fabricated by a modern silicon foundry,
most of the ports in the design do not correspond to a physical pin on the final integrated
circuit.
7 Some synthesis tools require that the input definition have the size.
To understand this pin analogy, let's digress for a moment and look at the history of
hierarchical design and integrated circuit technology. Before the mid-1960s, all digital
computers were built using discrete electronic devices (such as relays, vacuum tubes or
transistors). It takes several such devices, wired together by hand in a certain structure,
to make a gate, and of course, as we have seen in section 2.5, it takes many such gates
to make anything remotely useful. In the early 1960s, photographic technologies became
practical to mass-produce entire circuits composed of several devices on a wafer
of semiconductor material (typically silicon). The wafer is sliced into "chips," which
are mounted in epoxy (or similar material) with metal pins connecting the circuitry on
the chip to the outside. There are several standard sizes for the number and placement
of pins. For example, one of the oldest and smallest configurations is the 16-pin Dual
Inline Package (DIP). It is a rectangle with seven data pins on each side, and no pins on
the top or bottom. (Two pins are reserved for power and ground.) A notch or dot at the
top of the chip indicates where pin one is.
Designers in the 1960s and 1970s were limited by the number of devices that fit onto
the chip and also by the number of pins allowed in these standard sizes. Realizing the
power of hierarchical design, these designers built chips that contain standard building
blocks that fit within the number of pins available. An example is a four-bit counter in
one chip, TTL part number 74xx163, which is still widely used. Whenever designers
needed a four-bit counter, they could simply specify a 74xx163, without worrying about
its internal details. This, of course, is hierarchical design and provides the same mental
simplification as instantiating a module. Physically, the pins of the 74xx163 chip would
be soldered into the final circuit.
The relationship between these early integrated circuits and hierarchical design is not
perfect, hence the danger of saying ports are like pins. If a design needs one 13-bit
counter, a designer in the 1970s would have to specify that four 74xx163s be soldered
into the final circuit to act as a single counter. There is an interconnection between
these four chips so that they collectively count properly. From a hierarchical standpoint,
we want to see only one black box, with a 13-bit bus, but this counter is fabricated
as four 74xx163s wired together. Some of the physical pins (connected to another
one of the 74xx163s) have nothing to do with the ports of a 13-bit counter.
The widths shown on input and output definitions are optional for simulation purposes.8
To exhaustively test this small adder, test code similar to section 3.7.2.1 enumerates all
possible combinations of a and b:
module top;
 integer ia,ib;
 reg [1:0] a,b;
 wire [2:0] sum;
It is the position within the parentheses, and not the names, that matters9 when the
module is instantiated in the test code.
module adder(sum,a,b);
 input a,b;
 output sum;
 wire [1:0] a,b;
 wire [2:0] sum;
 wire c;
 half_adder ha1(c,sum[0],a[0],b[0]);
 full_adder fa1(sum[2],sum[1],a[1],b[1],c);
endmodule
Since the adder is defined with two structural instances (named ha1 and
fa1), all of the ports, including the output port, sum, are wires. The local wire c
sends the carry from the half-adder to the full-adder. Of course, we need identical test
code as in the last example, and we also need module definitions for full_adder
and half_adder.
9 Verilog provides an alternative syntax, described in chapter 11, that allows the name, rather than the position,
to determine how the module is instantiated.
At this point, we have reduced the problem down to Verilog primitive gates (and,
or, xor) whose behavior is built into Verilog.
$display(adder1.c);
The following statement allows the designer to observe cout2 from the test code:
$display(adder1.fa1.cout2);
The parts of a hierarchical name are separated by periods. Every part of a hierarchical
name, except the last, is the name of an instance of a module. The names of the corresponding
module definitions (adder, full_adder and half_adder in the above
example) never appear in a hierarchical name.
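The hierarchical names discussed above can be exercised in one self-contained file. What follows is a sketch under assumptions: the port orders of half_adder and full_adder are inferred from how they are instantiated in adder, while the internal instance names (x1, a1, ha2, ha3, o1), the wire cout1, and the wiring inside full_adder are illustrative guesses. Unlike the listing in the text, this sketch places the widths on the input and output definitions, since some tools insist on that.

```verilog
// Sketch: structural two-bit adder observed through hierarchical names.
// Assumptions: internals of half_adder and full_adder, instance names.
module half_adder(c,s,a,b);
  input a,b;
  output c,s;
  xor x1(s,a,b);
  and a1(c,a,b);
endmodule

module full_adder(cout,s,a,b,cin);
  input a,b,cin;
  output cout,s;
  wire cout1,cout2,temp;
  half_adder ha2(cout1,temp,a,b);
  half_adder ha3(cout2,s,cin,temp);
  or o1(cout,cout1,cout2);
endmodule

module adder(sum,a,b);
  input [1:0] a,b;
  output [2:0] sum;
  wire c;
  half_adder ha1(c,sum[0],a[0],b[0]);
  full_adder fa1(sum[2],sum[1],a[1],b[1],c);
endmodule

module top;
  integer ia,ib;
  reg [1:0] a,b;
  wire [2:0] sum;

  adder adder1(sum,a,b);

  initial
    for (ia=0; ia<=3; ia=ia+1)
      for (ib=0; ib<=3; ib=ib+1)
        begin
          a = ia;
          b = ib;
          // instance names, not module names, form the path
          #1 $display("a=%d b=%d sum=%d c=%b cout2=%b",
                      a, b, sum, adder1.c, adder1.fa1.cout2);
          if (sum !== a + b) $display("error");
        end
endmodule
```

Note that adder1, fa1 and ha1 appear in the paths, whereas adder, full_adder and half_adder do not.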
module top;
 payroll joe();
 payroll jane();
 initial
  begin
   joe.id=254;
   joe.hours=40;
   joe.rate=14;
   joe.display_pay;
   jane.id=255;
   jane.hours=63;
   jane.rate=15;
   jane.display_pay;
  end
endmodule
By convention, we use capital letters for parameters, but this is not a requirement. Note
that parameters do not have a backquote preceding them.
If you instantiate this module without specifying a constant, the default given in the
parameter statement (in this example, 1) will be used as the WIDTH, and so the
instance R1 will be one bit wide:
wire ldR1,sysclk;
wire R1dout,R1din;
enabled_register R1(R1dout,R1din,ldR1,sysclk);
Since there is only one constant in the parentheses above, it is legal to omit the parentheses:
enabled_register #12 R12(R12dout,R12din,ldR12,sysclk);
Sometimes, you need more than one constant in the definition of a module. For example,
a combinational multiplier has two input buses, whose widths need not be the
same:
module multiplier(prod,a,b);
 parameter WIDTHA=1,WIDTHB=1;
 output prod;
 input a,b;
 reg [WIDTHA+WIDTHB-1:0] prod;
 wire [WIDTHA-1:0] a;
 wire [WIDTHB-1:0] b;
 always @(a or b)
  prod = a*b;
endmodule
multiplier #(6,4) m1(pay,hours,rate);
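To see the two parameters take effect, the instance above can be placed in a small test module. This is a sketch, not the original test code: the widths of hours, rate and pay follow from the constants 6 and 4, the stimulus values are borrowed from the payroll example for illustration, and the sketch writes the widths on the input and output definitions instead of in separate wire declarations.

```verilog
// Sketch: instantiating a module with two parameter overrides.
// Assumptions: test module, stimulus values, widths on port definitions.
module multiplier(prod,a,b);
  parameter WIDTHA=1,WIDTHB=1;
  output [WIDTHA+WIDTHB-1:0] prod;
  input  [WIDTHA-1:0] a;
  input  [WIDTHB-1:0] b;
  reg    [WIDTHA+WIDTHB-1:0] prod;

  always @(a or b)
    prod = a*b;
endmodule

module top;
  reg  [5:0] hours;   // WIDTHA = 6
  reg  [3:0] rate;    // WIDTHB = 4
  wire [9:0] pay;     // WIDTHA + WIDTHB = 10

  multiplier #(6,4) m1(pay,hours,rate);

  initial
    begin
      #1 hours = 40; rate = 14;       // 40*14 = 560
      #1 $display("hours=%d rate=%d pay=%d", hours, rate, pay);
      #1 hours = 63; rate = 15;       // 63*15 = 945
      #1 $display("hours=%d rate=%d pay=%d", hours, rate, pay);
    end
endmodule
```

The first constant in the list overrides WIDTHA and the second overrides WIDTHB, in the order the parameters were defined.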
block(s) or with a structural instance (built-in gates or instantiation of other
designer-provided modules). Behavioral and structural instances may be mixed in the
same module.
Variables produced by behavioral code, including outputs from the module, are declared
to be regs. Behavioral modules have the usual high-level statements, such as
if and while, as well as time control (#, @ and wait) that indicate when the process
can be suspended and resumed. The $time variable simulates the passage of time in
the fabricated hardware. Verilog makes a distinction between algorithmic sequence
and the passage of $time. The most important forms of time control are # followed by
a constant, which is used for generating the clock and test vectors; @(posedge
sysclk), which is used to model controllers and registers; and @ followed by a
sensitivity list, which is used for combinational logic. Verilog provides the non-blocking
assignment statement, which is ideal for translating ASM charts that use RTN into
behavioral Verilog. Verilog also provides tasks and functions, which like similar features
in conventional high-level languages, simplify coding.
Structural modules have a simple syntax. They may instantiate other designer-provided
modules to achieve hierarchical design. They may also instantiate built-in gates.
The syntax for both kinds of instantiation is identical. All variables in a structural module,
including outputs, are wires.
3.13 Exercises
3-1. Design behavioral Verilog for a two-input 3-bit wide mux using the technique
described in section 3.7.2.1. The port list for this module should be:
module mux2(i0, i1, sel, out);
3-2. Design a structural Verilog module (mux2) equivalent to problem 3-1 using only
instances of and, or, not and buf.
3-3. Modify the solution to problem 3-1 to use a parameter named SIZE that allows
instantiation of an arbitrary width for i0, i1 and out as explained in section 3.10.10.
For example, the following instance of this device would be useful in the architecture
drawn in section 2.3.1:
wire muxctrl;
wire [11:0] x,y,muxbus;
mux2 #12 mx(x,y,muxctrl,muxbus);
3-4. Given the instance (mx) of the module (mux2) shown in problem 3-3, what hierarchical
names are equivalent to x, y, muxctrl and muxbus?
3-5. Design behavioral Verilog for combinational incrementor and decrementor modules
using the technique described in section 3.7.2.1. Use a parameter named SIZE
that allows instantiation of an arbitrary width for the ports as explained in section 3.10.10.
module updown_register(din,dout,ld,up,count,clk);
3-7. Modify the solutions to problem 3-6 to use a parameter named SIZE that allows
instantiation of an arbitrary width for the ports as explained in section 3.10.10.
3-8. Design behavioral Verilog for a simple D-type register (section D.5) using the
technique described in section 3.7.2.2. Use a parameter named SIZE that allows
instantiation of an arbitrary width for the ports as explained in section 3.10.10. The port
list for this module should be:
module simple_d_register(din,dout,clk);
3-9. Design a structural Verilog module (updown_register) equivalent to problem
3-7 using only instances of the modules defined in problems 3-3, 3-5 and 3-8.
3-10. For each of the ASM charts given in problem 2-10, translate to implicit style
Verilog using non-blocking assignment for ← and @(posedge sysclk) #1 for
each rectangle, as explained in section 3.8.2.3.1. As in that example, there should be
one always that models the hardware, one always for the $display and an
always and initial for sysclk. Compare the result of simulation with the manually
produced timing diagram of problem 2-10.
3-11. Without using a Verilog simulator, give a timing diagram for the machine described
by the ASM chart of section 3.8.2.3.3. Show the values of a and b in the first
twelve clock cycles, and label each clock cycle to indicate which state the machine is
in. Next, run the original implicit style Verilog code equivalent to the ASM and make
a printout of the .log file. On this printout, write the name of the state that the machine
is in during each clock cycle. The manually created timing diagram should agree with
the Verilog .log file. Finally, modify the following:
Run the modified Verilog code and make a printout of its .log file. On this printout,
circle the differences, if any, that exist between the correct timing diagram and the .log
file for the modified Verilog. In no more than three sentences, explain why there are or
are not any differences between = and <=.
3-12. Without using a Verilog simulator, give a timing diagram for the machine described
by the ASM of section 3.8.2.3.4. Show the values of a and b in the first twelve
clock cycles, and label each clock cycle to indicate which state the machine is in. Next,
run the original implicit style Verilog code equivalent to the ASM and make a printout
of the .log file. On this printout, write the name of the state that the machine is in during
each clock cycle. The manually created timing diagram should agree with the Verilog
.log file. Finally, modify the code to change the if to a while. Run the modified
Verilog code and make a printout of its .log file. On this printout, circle the differences,
if any, that exist between the correct timing diagram and the .log file for the modified
Verilog. In no more than three sentences, explain why there are or are not any differences
between if and while.
3-13. Without using a Verilog simulator, give a timing diagram for the machine described
by the ASM of section 3.8.2.3.5. Show the values of a and b in the first twelve
clock cycles, and label each clock cycle to indicate which state the machine is in. Next,
run the original implicit style Verilog code equivalent to the ASM and make a printout
of the .log file. On this printout, write the name of the state that the machine is in during
each clock cycle. The manually created timing diagram should agree with the Verilog
.log file. Finally, modify the code to eliminate all #s. Run the modified Verilog code
and make a printout of its .log file. On this printout, circle the differences, if any, that
exist between the correct timing diagram and the .log file for the modified Verilog. In
no more than three sentences, explain why there are or are not any differences between
using and omitting #s.
These definitions occur outside any modules. Next, we need to include the definition
of a module that generates the clock in a fashion similar to 3.7.1.3, except the clock is
output as a port of the module:
module cl(clk);
 parameter TIMELIMIT = 110000;
 output clk;
 reg clk;
 initial
  clk = 0;
 always
  #50 clk = ~clk;
endmodule
These declarations were constrained by the portlist, how the module was instantiated
in the test code and by the description of the problem given in chapter 2.
task enter_new_state;
 input [`NUM_STATE_BITS-1:0] this_state;
 begin
  present_state = this_state;
  #1 ready=0;
 end
endtask
The definition of this task will be nearly identical for every pure behavioral ASM. The
only distinction from one problem to another is the list of external command outputs
(see section 2.1.3.2.1) specific to the particular machine. In this case, the only external
command output is ready. It has a default value of 0; thus this task must initialize it at
the beginning of every clock cycle.
Having defined the above task within the slow_division_system module, it is
possible to translate the ASM from section 2.2.3 into Verilog:
always
 begin
  @(posedge sysclk) enter_new_state(`IDLE);
   r1 <= @(posedge sysclk) x;
   ready = 1;
  if (pb)
   begin
    @(posedge sysclk) enter_new_state(`INIT);
     r2 <= @(posedge sysclk) 0;
    while (r1 >= y)
     begin
      @(posedge sysclk) enter_new_state(`COMPUTE1);
       r1 <= @(posedge sysclk) r1 - y;
      @(posedge sysclk) enter_new_state(`COMPUTE2);
       r2 <= @(posedge sysclk) r2 + 1;
     end
   end
 end
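To try the always block above in a simulator, it has to be surrounded by declarations, a clock and some test code. The following is a compact sketch rather than the book's files: the state encodings, the module name slow_div_system, the inline clock, and the single test case (14 divided by 7, with pb held for one clock cycle) are all assumptions made for illustration.

```verilog
// Sketch: pure behavioral division machine with a one-case test.
// Assumptions: state encodings, module names, clock, test stimulus.
`define NUM_STATE_BITS 2
`define IDLE     2'b00
`define INIT     2'b01
`define COMPUTE1 2'b10
`define COMPUTE2 2'b11

module slow_div_system(pb,ready,x,y,r2,sysclk);
  input pb,sysclk;
  input [11:0] x,y;
  output ready;
  output [11:0] r2;
  reg ready;
  reg [11:0] r1,r2;
  reg [`NUM_STATE_BITS-1:0] present_state;

  task enter_new_state;
    input [`NUM_STATE_BITS-1:0] this_state;
    begin
      present_state = this_state;
      #1 ready = 0;                 // default for the command output
    end
  endtask

  always
    begin
      @(posedge sysclk) enter_new_state(`IDLE);
       r1 <= @(posedge sysclk) x;
       ready = 1;
      if (pb)
        begin
          @(posedge sysclk) enter_new_state(`INIT);
           r2 <= @(posedge sysclk) 0;
          while (r1 >= y)
            begin
              @(posedge sysclk) enter_new_state(`COMPUTE1);
               r1 <= @(posedge sysclk) r1 - y;
              @(posedge sysclk) enter_new_state(`COMPUTE2);
               r2 <= @(posedge sysclk) r2 + 1;
            end
        end
    end
endmodule

module top;
  reg sysclk,pb;
  reg [11:0] x,y;
  wire [11:0] quotient;
  wire ready;

  slow_div_system dut(pb,ready,x,y,quotient,sysclk);

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;      // rising edges at 50, 150, 250, ...

  initial
    begin
      pb = 0; x = 14; y = 7;
      #250 pb = 1;                  // seen by the if (pb) in state IDLE
      #100 pb = 0;                  // held for exactly one clock cycle
      @(posedge ready);             // machine has returned to IDLE
      #1 if (quotient === x/y) $display("ok quotient=%d", quotient);
         else $display("error quotient=%d", quotient);
      $finish;
    end
endmodule
```

The test waits for a rising edge of ready instead of its level because ready is still 1 from state IDLE at the moment pb is released.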
The only other thing that would be desirable to put in this module is a debugging
display, as described in section 3.7.2.4:
The net effect of the other modules defined in section 4.1.1.1 and all the details inside
the module given above is to produce the following simulation output from Verilog:
The regs r1 and r2 are not initialized at $time 0, and so the value 12'bx is printed
by the $display simply as x (not to be confused with the variable x). Each time the
machine returns to state IDLE (ready=1), the outputs of the machine from r2 (which
is the same as quotient) are highlighted above. These are the values that the test
code uses to determine that everything is "ok" each time.
always
 begin
  ...
    while (r1 >= y)
     begin
      @(posedge sysclk) enter_new_state(`COMPUTE2);
       r2 <= @(posedge sysclk) r2 + 1;
      @(posedge sysclk) enter_new_state(`COMPUTE1);
       r1 <= @(posedge sysclk) r1 - y;
     end
   end
 end
The output from the Verilog simulator makes the problem obvious:
 2670 r1= 5 r2= 0 pb=0 ready=1
 2770 r1= 6 r2= 0 pb=1 ready=1
 2870 r1= 6 r2= 0 pb=0 ready=0
 2970 r1= 6 r2= 0 pb=0 ready=1
ok
 3070 r1= 6 r2= 0 pb=0 ready=1
 3170 r1= 7 r2= 0 pb=1 ready=1
 3270 r1= 7 r2= 0 pb=0 ready=0
 3370 r1= 7 r2= 0 pb=0 ready=0
 3470 r1= 7 r2= 1 pb=0 ready=0
 3570 r1= 0 r2= 1 pb=0 ready=0
 3670 r1= 0 r2= 2 pb=0 ready=0
 3770 r1=4089 r2= 2 pb=0 ready=1
error x= 7 y= 7 x/y= 1 quotient= 2
Because of the error (which causes the loop to execute an extra time), the time to
complete the test is longer. The wait statement in the test code compensates for this;
thus the test code is checking r2 via quotient at the proper time, but when y>=7,
quotient is just plain wrong.
Rather than spending thousands of dollars actually fabricating a faulty computer, the
designer can observe the problem simply from the behavioral Verilog code.
The first ASM chart of 2.2.5 can be translated into Verilog as:
always
 begin
  @(posedge sysclk) enter_new_state(`IDLE);
   r1 <= @(posedge sysclk) x;
   r2 <= @(posedge sysclk) 0;
   ready = 1;
  if (pb)
   begin
    while (r1 >= y)
     begin
      @(posedge sysclk) enter_new_state(`COMPUTE1);
       r1 <= @(posedge sysclk) r1 - y;
      @(posedge sysclk) enter_new_state(`COMPUTE2);
       r2 <= @(posedge sysclk) r2 + 1;
      @(posedge sysclk) enter_new_state(`COMPUTE3);
       r3 <= @(posedge sysclk) r2;
     end
   end
 end
1 See J. Cooley, Integrated System Design, July 1995, pp. 56-60 for a description of a Verilog contest where
the test code provided to the contestants was erroneous. The "winning design" would not actually work
correctly because the test code could not detect a flaw in the design.
and the portlist now has r3 rather than r2 as the output:
module slow_div_system(pb,ready,x,y,r3,sysclk);
 input pb,x,y,sysclk;
 output ready,r3;
 wire pb;
 wire [11:0] x,y;
 reg ready;
 reg [11:0] r1,r2,r3;
 reg [`NUM_STATE_BITS-1:0] present_state;
always
 begin
  @(posedge sysclk) enter_new_state(`IDLE);
   r1 <= @(posedge sysclk) x;
   r2 <= @(posedge sysclk) 0;
   ready = 1;
  if (pb)
   begin
    if (r1 >= y)
     while (r1 >= y)
      begin
       @(posedge sysclk) enter_new_state(`COMPUTE1);
        r1 <= @(posedge sysclk) r1 - y;
       @(posedge sysclk) enter_new_state(`COMPUTE2);
        r2 <= @(posedge sysclk) r2 + 1;
       @(posedge sysclk) enter_new_state(`COMPUTE3);
        r3 <= @(posedge sysclk) r2;
      end
    else
     begin
      @(posedge sysclk) enter_new_state(`ZEROR3);
       r3 <= @(posedge sysclk) 0;
     end
   end
 end
The above correctly models the design error due to inappropriate use of parallelism in
state COMPUTE23 that causes r3 to be assigned a value too early:
The corrected ASM chart of 2.2.6 that has all three computations happening in parallel
in state COMPUTE123 can be translated into Verilog as:
  always
    begin
      @(posedge sysclk) enter_new_state(`IDLE);
      r1 <= @(posedge sysclk) x;
      r2 <= @(posedge sysclk) 0;
      ready = 1;
      if (pb)
        begin
          if (r1 >= y)
            while (r1 >= y)
              begin
                @(posedge sysclk) enter_new_state(`COMPUTE123);
                r1 <= @(posedge sysclk) r1 - y;
                r2 <= @(posedge sysclk) r2 + 1;
                r3 <= @(posedge sysclk) r2;
              end
          else
            begin
              @(posedge sysclk) enter_new_state(`ZEROR3);
              r3 <= @(posedge sysclk) 0;
            end
        end
    end
4.1.5 Pure behavioral stage of the two-state division machine
The best correct design proposed in chapter 2 for the division machine is described by
the ASM chart in section 2.2.7. It has the advantage that it takes only one clock cycle
each time it goes through the loop, and it only needs an ASM with two states. Here is
how this ASM chart can be translated into a pure behavioral module, similar to the
earlier examples:
  task enter_new_state;
    input [`NUM_STATE_BITS-1:0] this_state;
    begin
      present_state = this_state;
      #1 ready=0;
    end
  endtask

  always @(posedge sysclk) #20
    $display("%d r1=%d r2=%d r3=%d pb=%b ready=%b",
             $time, r1,r2,r3, pb, ready);
endmodule
For brevity, the cl module has been placed in the "clock.v" file. Here is the Verilog
simulation that shows it working:
from pure behavioral Verilog into mixed Verilog (section 4.2) and into pure structural
Verilog (section 4.3). This example is also used to illustrate the hierarchical refinement
of the controller to become a netlist (section 4.4).
wirE
endmodule
.er(di,do,enable,clk);
1;
cik;
show how to translate
md into pure structural o;
Hierarchical refinement di;
. ..
endmodule
machine
-o hardware eventually behavioral definition of the register (see sections 3.7.2.2
The mixed stage is the -was inspired by the 74xx377 (eight-bit enabled regis-
registers and combina- d register) TTL chips, which have an active low enable
.The only constraint is xle where the physical 74xx377 is scheduled to change
commands used in the te 74x377 will have zero volts representing the 1 bi
ble. Other than this minor detail, any architecture in-
in be constructed from these chips.
  wire load,count,clr;
  wire clk;
  ...
endmodule
This module was inspired by the 74xx163 (4-bit up counter), which has active low clr
and load signals (see the discussion in section 4.2.1.1). Also, this chip has two inputs
that must both simultaneously be one to cause counting. The reason for having two
inputs, rather than just the one count shown above, is to simplify the connections
required to cascade the four-bit chip to form larger counters. Since at this stage of the
design we are not at all concerned with such physical details, the above module was
simplified to have a single count signal.
4.2.1.3 alu181 portlist
This module models a combinational ALU inspired by the 74xx181. (See section C.6
for a description of the hardware being modeled by this module.) It has two data
inputs, a and b, and a similar sized data output bus, f. It also has status outputs, cout
(1 when addition and similar operations produce a carry) and zero (1 when f is zero).
It is controlled by the commands s, m and cin:
module alu181(a,b,s,m,cin,cout,f,zero);
  parameter SIZE = 1;
  input a,b,s,m,cin;
  output cout,f,zero;
  wire [SIZE-1:0] a,b;
  wire m,cin;
  wire [3:0] s;
  reg [SIZE-1:0] f;
  reg cout,zero;
  ...
endmodule
In chapter 2, the ALU was considered to have a six-bit command input, aluctrl.
When this module is instantiated, this input should be subdivided in the following
fashion:
  alu181 #size instancename(a,b,aluctrl[5:2],
                            aluctrl[1],aluctrl[0],cout,f,zero);
This module models a comparator inspired by the 74xx85. (See section C.7 for a
description of the hardware being modeled by this module.) It has two data inputs, a and
b, and three status outputs, a_lt_b, a_eq_b and a_gt_b. At any time, only one of
the outputs will be 1, depending on a and b:
module comparator(a_lt_b, a_eq_b, a_gt_b, a, b);
  parameter SIZE = 1;
  output a_lt_b, a_eq_b, a_gt_b;
  input a, b;
  wire [SIZE-1:0] a,b;
  reg a_lt_b, a_eq_b, a_gt_b;
endmodule
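The behavior described for the comparator can be sketched as follows. This is only an illustration, not the book's definition: the module name comparator_sketch, the test module and its instance name are hypothetical, the body is an assumption about what the elided behavioral code might do, and the port ranges are written on the input declaration for the benefit of tools that require it.

```verilog
// Sketch of a behavioral comparator body (an assumption, not the original).
module comparator_sketch(a_lt_b, a_eq_b, a_gt_b, a, b);
  parameter SIZE = 1;
  output a_lt_b, a_eq_b, a_gt_b;
  input [SIZE-1:0] a, b;
  reg a_lt_b, a_eq_b, a_gt_b;

  // Combinational: exactly one output is 1 for known values of a and b.
  always @(a or b)
    begin
      a_lt_b = (a <  b);
      a_eq_b = (a == b);
      a_gt_b = (a >  b);
    end
endmodule

// Hypothetical test code: a 12-bit instance, as used for r1 and y.
module comparator_sketch_test;
  reg [11:0] a, b;
  wire lt, eq, gt;

  comparator_sketch #(12) c1(lt, eq, gt, a, b);

  initial
    begin
      // The leading #1 lets the always block reach its @(a or b) first.
      #1 b = 7; a = 14;
      #1 $display("a=%d b=%d lt=%b eq=%b gt=%b", a, b, lt, eq, gt);
      a = 7;
      #1 $display("a=%d b=%d lt=%b eq=%b gt=%b", a, b, lt, eq, gt);
      a = 0;
      #1 $display("a=%d b=%d lt=%b eq=%b gt=%b", a, b, lt, eq, gt);
    end
endmodule
```

A status signal such as r1gey could then be formed as the or of the a_gt_b and a_eq_b outputs of such an instance.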
4.2.2 Mixed stage
As discussed in chapter 2, the system is no longer described simply in terms of its
behavior. Instead, in the mixed stage, there is a specific structure that interconnects the
controller and the architecture:
module slow_div_system(pb,ready,x,y,r3,sysclk);
  input pb,x,y,sysclk;
  output ready,r3;
  wire pb;
  wire [11:0] x,y;
  wire ready;
  wire [11:0] r3;
  wire sysclk;
This version of slow_div_system replaces the behavioral version of this module
discussed in section 4.1. The test code that instantiates slow_div_system should
not notice any difference between this mixed stage module and the earlier pure
behavioral stage. Note that all ports and locals in this module are now declared to be wire
since this module is composed simply of two structural instances, and there are no
behavioral assignment statements.
  always @(posedge sysclk) #20
    begin
      $display("%d r1=%d r2=%d r3=%d pb=%b ready=%b", $time,
               r1bus,r2bus,r3bus,
               slow_div_machine.pb,slow_div_machine.ready);
      $write("     %b %b %b ",
             ldr1,{clrr2,incr2},ldr3);
      $display(" muxbus=%d alubus=%d",muxbus,alubus);
      $write("     ");
      $display(" muxctrl=%b aluctrl=%b",
               muxctrl,aluctrl);
      $write("     ");
      $display(" x=%d r1gey=%b",x,r1gey);
    end
endmodule
The portlist for this module includes the commands that are input to this architecture
(that were output from the controller). These commands include the six-bit aluctrl
as well as muxctrl, ldr1, clrr2, incr2 and ldr3. Also, the portlist has the
status output r1gey. The portlist has the twelve-bit data inputs x and y and the 12-bit
data output r3bus. Of course, since there are clocked registers in the architecture,
they must be supplied with sysclk. The order in this portlist matches the order where
this is instantiated in section 4.2.2.
The first three structural instances (r1, mx and alu) define the portion of the block
diagram that relates to register r1. This name is no longer a reg as it was in the pure
behavioral stage, but is instead the instance name for an enabled_register, whose
portlist is defined in section 4.2.1.1. This instance is for a twelve-bit-wide register
(because the parameter is instantiated with 12). The input to this enabled_register
comes from alubus, which is described below. The output from this
enabled_register is known as r1bus. Of course both alubus and r1bus are
wires since this module is defined only with structure. The load signal for r1 is ldr1,
and as is necessary in synchronous design, r1 is connected to sysclk.
There is an instance (named mx) of mux2 (see section 4.2.1.5) instantiated to be 12 bits
wide. It selects the data input x when muxctrl is 0, and y when muxctrl is 1. Its
output is muxbus. All of these buses are of course 12 bits wide. The instance of alu181
(see section 4.2.1.3) named alu takes its inputs from r1bus and muxbus. The
aluctrl is provided to the appropriate ports. The cout and zero ports of alu181
are left disconnected. The f output connects to the alubus (mentioned in the last
paragraph) that provides the input to r1.
Here is the module that is instantiated in section 4.2.2 and corresponds to pure
behavioral Verilog of section 4.1.5, and that is equivalent to the mixed ASM chart of section
2.3.1:
module slow_div_ctrl(pb,ready,aluctrl,muxctrl,ldr1,
                     clrr2,incr2,ldr3,r1gey,sysclk);
  input pb,r1gey,sysclk;
  output ready,aluctrl,muxctrl,ldr1,clrr2,incr2,ldr3;
The boldface above shows the editing done to transform the pure behavioral Verilog
into this mixed stage. Of some interest is the fact that the pure behavioral while,
which has the condition ((r1>=y)|pb), is translated above into (r1gey|pb).
Use of single bit & and | (or perhaps more clearly && and ||) is permitted inside a
mixed controller. This notation is not a data computation that must occur in the
architecture (although the designer could have chosen to put a single or gate in the
architecture to accomplish this). It is important to distinguish this decision-making use of |
from a data manipulation use of |, such as r1|y, which should be performed by
combinational logic (such as the ALU) in the architecture. In the case of ((r1>=y)|pb),
there are two reasonable ways to translate this into the mixed stage: the way that was
shown above, and the way that requires introducing an extra signal in the architecture
to represent the or of r1gey and pb. (pb would then be classified as an external data
input to the architecture, in addition to being an external status input to the controller.)
Since we would like to minimize the number of wires that interconnect the controller to
the architecture, we chose the former approach where pb is simply an external status
signal.
Of course, the bit patterns for controlling the ALU must be defined outside this
module:
`define DIFFERENCE 6'b011001
`define PASSB      6'b101010
The test code is the same as the pure behavioral system. Here is the output from the
completed mixed stage:
  70 r1=   x r2=   x r3=   x pb=0 ready=1
     1 10 0 muxbus=   0 alubus=   0
     muxctrl=0 aluctrl=101010
     x=   0 r1gey=x
 170 r1=   0 r2=   0 r3=   x pb=0 ready=1
     1 10 0 muxbus=   0 alubus=   0
     muxctrl=0 aluctrl=101010
     x=   0 r1gey=0
 270 r1=   0 r2=   0 r3=   x pb=0 ready=1
     1 10 0 muxbus=   0 alubus=   0
     muxctrl=0 aluctrl=101010
Pure structural stage of the two state division machine
Translating from the mixed stage to the "pure" structural stage is an easy and mechanical
(although somewhat tedious) process. All modules except the controller remain the
same. As explained in section 2.4.1, the controller module becomes a structure composed
of a present state register (which is an instance of an actual register module, and
not a reg) and the next state logic.
In the "pure" structural stage, the definition of the next state logic may remain as
behavioral code (a function) that is a transformation of the code inside the always block
of the mixed stage. In section 4.4, we will see how the next state logic could also be
defined in terms of built-in gates, using hierarchical design. Fortunately, it is not normally
necessary to worry about the details given later in section 4.4, because synthesis
tools exist that can automatically transform the behavioral next state function described
in this section into a netlist. For this reason, we consider this section to be the final step
that a designer has to be involved with. Section 4.4 is presented later only to motivate
the kind of transformations that synthesis tools are capable of.
module next_state_logic(next_state,
                        ldr1,incr2,clrr2,ldr3,
                        muxctrl,aluctrl,ready,
                        present_state,r1gey,pb);
  output next_state,ldr1,incr2,clrr2,ldr3,muxctrl,
         aluctrl,ready;
  input present_state,r1gey,pb;
  reg [`NUM_STATE_BITS-1:0] next_state;
  reg ldr1,incr2,clrr2,ldr3,muxctrl,ready;
  reg [5:0] aluctrl;
  wire [`NUM_STATE_BITS-1:0] present_state;
  wire r1gey,pb;

  `include "divbookf.v"

  always @(present_state or r1gey or pb)
    {next_state,ldr1,clrr2,incr2,ldr3,muxctrl,aluctrl,
     ready} = state_gen(present_state, pb, r1gey);
endmodule
function [`NUM_STATE_BITS-1+12:0] state_gen;
  input [`NUM_STATE_BITS-1:0] ps;
  input pb,r1gey;
  reg ready;
  reg [5:0] aluctrl;
  reg muxctrl,ldr1,clrr2,incr2,ldr3;
  reg [`NUM_STATE_BITS-1:0] ns;
  begin
    {ns,ready,aluctrl,muxctrl,ldr1,clrr2,incr2,ldr3}=0;
    case (ps)
      `IDLE: begin
               //r1 <= @(posedge sysclk) x;
               //r2 <= @(posedge sysclk) 0;
               ready = 1;
               aluctrl = `PASSB;
               muxctrl = 0;
               ldr1 = 1;
               clrr2 = 1;
               if (pb)
Here boldface shows some changes that were made to make this work as a function.
`define NUM_STATE_BITS 1
`define IDLE    1'b0
`define COMPUTE 1'b1
module next_state_logic(next_state,
                        ldr1,incr2,clrr2,ldr3,
                        muxctrl,aluctrl,ready,
                        present_state,r1gey,pb);
  output next_state,ldr1,incr2,clrr2,ldr3,muxctrl,
         aluctrl,ready;
  input present_state,r1gey,pb;
  reg [`NUM_STATE_BITS-1:0] next_state;
  reg ldr1,incr2,clrr2,ldr3,muxctrl,ready;
  reg [5:0] aluctrl;
  wire [`NUM_STATE_BITS-1:0] present_state;
  wire r1gey,pb;
Of course, many other possible solutions exist that produce the same truth table. The
built-in Verilog gate buf (non-inverting buffer) passes through its last port unchanged
to all the other ports, which are outputs. The only difference between buf and not is
that the latter inverts its outputs.
We will ignore discussing the netlist for the architectural devices, such as mux2, since
this is a trivial but tedious task. The synthesis tool would do this identically to the way
the controller was synthesized.
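As a small illustration of the buf and not gates just described, here is a minimal sketch (not taken from the netlist of the division machine; the module and wire names are hypothetical) in which each gate drives two outputs from a single input, which is the last port:

```verilog
// Sketch: multi-output buf and not primitives. The last port is the input;
// every other port is an output.
module buf_not_demo;
  reg  in;
  wire same1, same2;   // outputs of buf: copies of in
  wire inv1, inv2;     // outputs of not: complements of in

  buf b1(same1, same2, in);
  not n1(inv1, inv2, in);

  initial
    begin
      in = 0;
      #1 $display("in=%b buf=%b%b not=%b%b", in, same1, same2, inv1, inv2);
      in = 1;
      #1 $display("in=%b buf=%b%b not=%b%b", in, same1, same2, inv1, inv2);
      in = 1'bx;
      #1 $display("in=%b buf=%b%b not=%b%b", in, same1, same2, inv1, inv2);
    end
endmodule
```

When in is 0 the buf outputs are both 0 and the not outputs are both 1; when in is 1'bx all four outputs are 1'bx, since neither gate can produce a known value from an unknown input.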
that continues like this for as long as you are willing to let the simulator run. This is the
physical problem alluded to earlier: as the controller is currently interconnected, the
gate level netlist does not seem to work. The simulator's output is splattered with 1'bxs
and 1'bzs.
Although you might think something is wrong with the logic equations (given in
section 4.4.1) or the equivalent netlist (given in section 4.4.2), there is not. The logic
equations and equivalent netlist are correct. What's the problem?
To understand the problem, you need to remember the intent behind having the
four-valued logic system. When 1'bxs or 1'bzs appear where you were expecting a 1 or a
0, this is an indication of some flaw in the design. Although major interconnection
errors can cause this (see section 3.5.3), more subtle problems can cause this as well.
Since everything is interconnected properly in this netlist, we need to understand what
the 1'bxs and 1'bzs are trying to tell us here.
At $time 0, all regs start as 'bxs and all wires start as 'bzs. If the simulation does
not change these values, that is how they will stay. The ps_reg of the controller has
an internal reg that holds the present state. At $time 0, it is 1'bx. The next state that
the machine computes from ps_reg is also unknown. A Boolean function of 'bx is
usually 'bx. Therefore, the ps_reg is reloaded with 1'bx, rather than the proper
sequence of states. The four-valued logic of the simulation has detected a potential
flaw in the design: we do not know what state the controller starts out in, so we cannot
predict what happens next.
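The effect can be reproduced in miniature. The following sketch is not the division machine's controller: it assumes a hypothetical one-bit machine whose next state is simply the complement of its present state, and compares a state register that is never initialized with one that has an asynchronous reset. All names are made up for the illustration.

```verilog
// Sketch only: why an uninitialized present state register stays at 1'bx.
module x_state_demo;
  reg clk, reset;
  reg ps_plain;    // present state register with no reset
  reg ps_reset;    // present state register with an asynchronous reset

  // clock with a period of 100 units of $time; rising edges at 50, 150, ...
  initial clk = 0;
  always #50 clk = ~clk;

  // next state = ~present state; ~1'bx is 1'bx, so ps_plain never recovers
  always @(posedge clk)
    ps_plain <= ~ps_plain;

  // the same next state logic, but the register can be forced to a known state
  always @(posedge clk or posedge reset)
    if (reset) ps_reset <= 0;
    else       ps_reset <= ~ps_reset;

  // reset pulse lasting 30 units of $time, before the first rising edge
  initial
    begin
      reset = 0;
      #10 reset = 1;
      #30 reset = 0;
    end

  always @(posedge clk) #20
    $display("%d ps_plain=%b ps_reset=%b", $time, ps_plain, ps_reset);

  initial #400 $finish;
endmodule
```

In this sketch ps_plain prints as x on every line, while ps_reset alternates between 1 and 0 because the reset pulse gave it a known starting state.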
Why didn't the pure structural version (see section 4.3.3) detect this problem? The
reason is found in the definition of the state_gen function. The first statement of
this function initializes ns (which is what becomes next_state) to be 1'b0:
    {ns,ready,aluctrl,muxctrl,ldr1,clrr2,incr2,ldr3}=0;

The above is patterned after the 74xx175 (six-bit resettable D-type register), except as
is typical with TTL logic, the reset signal on the 74xx175 is active low.
This is the first and only time that we will admit an asynchronous signal into our
design. Asynchronous means that a change happens in a register at a $time other than
the rising edge of sysclk. Notice the difference between the clr signal used in the
synchronous counter_register (described in sections 3.7.2.2 and 4.2.1.2) and
the reset signal described here. Although both signals cause the register to become
zero at some point in $time, the clr signal simply schedules the change to happen at
the next rising edge, but the reset signal causes the clearing to happen instantly. The
register is continually rezeroed for as long as reset is asserted because of the if,
even should a rising edge of the clock occur. Without the posedge reset, the
register would be a synchronous, clearable D-type register.
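The distinction between the two kinds of clearing can be seen in a short sketch. This is an illustration under assumed names (q_async, q_sync, di and the module name are hypothetical), not the register definitions used elsewhere in this chapter:

```verilog
// Sketch: asynchronous reset versus synchronous clear.
module reset_vs_clr_demo;
  reg clk, reset, clr;
  reg [3:0] di;
  reg [3:0] q_async;   // zeroed as soon as reset rises
  reg [3:0] q_sync;    // zeroed only at a rising clock edge when clr is 1

  // rising edges of clk at 50, 150, ...
  initial clk = 0;
  always #50 clk = ~clk;

  // posedge reset in the event list makes the clearing asynchronous
  always @(posedge clk or posedge reset)
    if (reset) q_async <= 0;
    else       q_async <= di;

  // without posedge reset, clearing waits for the clock
  always @(posedge clk)
    if (clr) q_sync <= 0;
    else     q_sync <= di;

  initial
    begin
      di = 5; reset = 0; clr = 0;
      #70 reset = 1; clr = 1;     // asserted between rising edges
      #1  $display("%d q_async=%d q_sync=%d", $time, q_async, q_sync);
      #100 $display("%d q_async=%d q_sync=%d", $time, q_async, q_sync);
      $finish;
    end
endmodule
```

Both registers load 5 at the first rising edge. At $time 71, just after the two signals are asserted, q_async is already 0 while q_sync still holds 5; q_sync becomes 0 only after the rising edge at $time 150.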
module slow_div_ctrl(pb,ready,aluctrl,muxctrl,ldr1,
                     clrr2,incr2,ldr3,r1gey,reset,sysclk);
  input pb,r1gey,sysclk,reset;
  output ready,aluctrl,muxctrl,ldr1,clrr2,incr2,ldr3;
  wire [`NUM_STATE_BITS-1:0] present_state;
  wire pb;
  wire ready;
  wire [5:0] aluctrl;
  wire muxctrl,ldr1,clrr2,incr2,ldr3;
  wire r1gey,sysclk,reset;

  next_state_logic nsl(next_state,
                       ldr1,incr2,clrr2,ldr3,
                       muxctrl,aluctrl,ready,
                       present_state,r1gey,pb);
rA.*tWmh1 - s ... .( 'UMSTATE-BITS) preg(nextstate, slC
                       present_state,reset,sysclk);
endmodule
and of the system that instantiates the controller:
module slow_div_system(pb,ready,x,y,r3,reset,sysclk);
  input pb,x,y,sysclk,reset;
  output ready,r3;
  wire pb;
  wire [11:0] x,y;
  wire ready;
  wire [11:0] r3;
  wire sysclk,reset;
  wire [5:0] aluctrl;
  wire muxctrl,ldr1,clrr2,incr2,ldr3,r1gey;

  slow_div_arch a(aluctrl,muxctrl,ldr1,clrr2,
                  incr2,ldr3,r1gey,x,y,r3,sysclk);
  slow_div_ctrl c(pb,ready,aluctrl,muxctrl,ldr1,
                  clrr2,incr2,ldr3,r1gey,reset,sysclk);
cr2, ldr3;
op;
1:0] xy;
1:0] quotient;
!ady;
s;
ysclk;
Ir3, set;
ady,
ey, pb); 000 clock(sysclk);
ig(nextstate, .v_system
t,sysclk);
_machine(pb,ready,x,y,quotient,reset,sysclk);
O;
0;
7;
,sysclk); let 0;
I reset = 1;
I reset = 0;
.0;
issues a reset pulse that lasts for 30 units of $time, which causes the
: become zero. When the netlist for the controller (section 4.4.2) is re-
hthe above, it produces the same correct answers we obtained for the
Figure 5-1. Behavioral Mealy ASM.
Between 0.5 and 1.0, the machine is in state YELLOW, but because COUNT is zero,
the decision does not go on the path that loops back to state YELLOW but instead goes
on the path where the next state is state RED. On this path is an oval that asserts the
LEAVE signal. This conditional signal is asserted during the entire clock cycle, just as
the unconditional signal STOP is asserted during the same time. The signal STAY,
which is on a different path, is not asserted during this clock cycle.
Between 2.0 and 2.5, the machine again is in state YELLOW, but because COUNT is
non-zero, the decision goes on the path that loops back to the same state. On this path
is the oval that asserts the STAY signal. The signal LEAVE is not asserted in this clock
cycle.
Because COUNT is three bits, COUNT is zero again between 4.5 and 5.0, and so this is
the last clock cycle that the machine loops in state YELLOW. This means that STAY is
not asserted but that LEAVE is asserted.
Figure 5-2. Mixed Mealy ASM.
5.1.3 Silly example of structural Mealy machine
The generic diagram of the pure structural controller given in section 2.4.1 applies to
any machine, whether it is a Mealy or Moore machine. The next state combinational
logic will be a little different when the machine is a Mealy machine than when it is a
Moore machine. With a Moore machine, only the next state bits (and not the command
bits) are a function of both the present state and the status inputs. With a Moore
machine, the commands are a function of the present state only. In other words, for a
Moore machine, every line of the truth table where ps is the same has the same
command outputs.
SPEED, STOP, INC and LD are unconditional commands that are a function of ps
only. LEAVE and STAY are a function of both ps and COUNTEQ0. The conditional
signals LEAVE and STAY are the only things here that make this a Mealy machine.
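The difference between the two kinds of command can be sketched as combinational logic. The fragment below is an assumption-laden illustration rather than the controller of this section: it models only the behavior of state YELLOW described for Figure 5-1, borrows the state encodings listed in section 5.4.1, and uses hypothetical module, port and instance names.

```verilog
// Sketch: Moore (unconditional) versus Mealy (conditional) command outputs.
`define GREEN  2'b00
`define YELLOW 2'b01
`define RED    2'b10

module mealy_outputs_sketch(ps, count_eq_0, stop, stay, leave);
  input [1:0] ps;
  input count_eq_0;
  output stop, stay, leave;
  reg stop, stay, leave;

  always @(ps or count_eq_0)
    begin
      {stop, stay, leave} = 0;          // default values
      if (ps == `YELLOW)
        begin
          stop = 1;                     // depends on ps only
          if (count_eq_0) leave = 1;    // depends on ps and a status input
          else            stay  = 1;
        end
    end
endmodule

// Hypothetical test code for the sketch above.
module mealy_outputs_test;
  reg [1:0] ps;
  reg count_eq_0;
  wire stop, stay, leave;

  mealy_outputs_sketch m(ps, count_eq_0, stop, stay, leave);

  initial
    begin
      // The leading #1 lets the always block reach its event control first.
      #1 ps = `YELLOW; count_eq_0 = 0;
      #1 $display("ps=%b eq0=%b stop=%b stay=%b leave=%b",
                  ps, count_eq_0, stop, stay, leave);
      count_eq_0 = 1;
      #1 $display("ps=%b eq0=%b stop=%b stay=%b leave=%b",
                  ps, count_eq_0, stop, stay, leave);
    end
endmodule
```

With ps held at YELLOW, stop is 1 on both lines, but stay and leave swap values when count_eq_0 changes, so two truth table lines with the same ps have different command outputs.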
Here is an example that shows the machine works when x=14 and y=7:
IDLE      r1=   ? r2=   ? pb=0 ready=1
IDLE      r1=   ? r2=   ? pb=0 ready=1
IDLE      r1=  14 r2=   ? pb=1 ready=1
COMPUTE1  r1=  14 r2=   0 pb=0 ready=0
COMPUTE2  r1=   7 r2=   0 pb=0 ready=0
COMPUTE1  r1=   7 r2=   1 pb=0 ready=0
COMPUTE2  r1=   0 r2=   1 pb=0 ready=0
IDLE      r1=   0 r2=   2 pb=0 ready=1
IDLE      r1=   ? r2=   2 pb=0 ready=1
The highlighted line shows where the conditional command to clear r2 occurs. This
takes effect at the next rising edge of the clock, which is when the machine enters state
COMPUTE1 (r2 = 0 on the next line is also highlighted to illustrate this).
Based on the assumptions used throughout all of the chapter 2 examples, the above
ASM executes in 2+2*quotient clock cycles, which is one clock cycle faster than
the correct ASM of section 2.2.3.
5.2.2 Merging states COMPUTE1 and COMPUTE2
The above ASM requires about twice as long as the best solution discussed in chapter
2. To achieve the same kind of speed up with the Mealy ASM, we need to do the same
thing we did in chapter 2: the operations in the loop need to occur in parallel. Consider
the following incorrect ASM:
Figure 5-4. Incorrect Mealy division ASM.
To illustrate how this ASM fails, consider when x=14, and y=7:
IDLE      r1=   ? r2=   ? pb=0 ready=1
IDLE      r1=   ? r2=   ? pb=0 ready=1
IDLE      r1=  14 r2=   ? pb=1 ready=1
COMPUTE12 r1=  14 r2=   0 pb=0 ready=0
COMPUTE12 r1=   7 r2=   1 pb=0 ready=0
COMPUTE12 r1=   0 r2=   2 pb=0 ready=0
IDLE      r1=4089 r2=   3 pb=0 ready=1
IDLE      r1=   ? r2=   3 pb=0 ready=1
By the point when the machine returns to state IDLE, r2 has been incremented one
time too many. In section 2.2.5, this problem was solved by using the r3 register to
save the correct quotient. However, since we are striving for a faster and cheaper
solution here, it would be better to avoid introducing the r3 register in this design.
5.2.3 Conditionally loading r2
To solve the bug illustrated in section 5.2.2, we need to load r2 only when the machine
stays in the loop, and to keep the old value of r2 when the machine leaves the loop to
return to state IDLE. This of course requires another oval in the ASM:
To illustrate that this ASM works correctly, consider the case we looked at in the last
section:
This machine can achieve the correct result in 3+quotient clock cycles using only
two (instead of three) registers. Therefore, it is as fast as the fastest Moore machine in
chapter 2 using fewer registers.
Figure 5-6. Mealy division ASM with conditional READY.
This machine can achieve the correct result in 2+quotient clock cycles using only
two (instead of three) registers. Therefore, this Mealy machine is cheaper and faster
than any of the Moore machines given in chapter 2.
The condition (r1 >= y) always produces identical results in the if and in the while
because no $time passes from when it is evaluated by the if and when it is later
reevaluated by the while.
As a final example, consider translating the Mealy ASM of section 5.2.4:
  always
    begin
      @(posedge sysclk) enter_new_state(`IDLE);
      r1 <= @(posedge sysclk) x;
      ready = 1;
      if (pb)
        begin
          r2 <= @(posedge sysclk) 0;
          while (r1 >= y)
            begin
              @(posedge sysclk) enter_new_state(`COMPUTE);
              r1 <= @(posedge sysclk) r1 - y;
              if (r1 >= y)
                r2 <= @(posedge sysclk) r2 + 1;
              else
                ready = 1;
            end
        end
    end
The conditional command simply translates into an else.
5.4 Translating complex (goto) ASMs into behavioral
Verilog
Section 2.1.4 discusses the goto-less style for ASM charts, where every decision is
described in terms of high-level while, if and case constructs. Since Verilog has
statements that correspond to these constructs, it is usually straightforward to translate
such an ASM chart into behavioral Verilog, regardless of whether it is a Moore or
Mealy ASM.
On the other hand, because Verilog does not provide a goto statement, there are three
situations when translating an ASM chart into Verilog is more difficult. First,
translation is difficult when an ASM chart uses a bottom testing loop construct, similar to the
repeat ... until of Pascal or do ... while() of C. Second, translation is
difficult when an ASM chart has intervening time control before the loop exit decision
(as in the ASM of section 2.2.2). Third, translation is difficult when the decision can
only be described with gotos.
The general solution to these difficulties involves using the present_state
variable inside ifs and whiles. In the behavioral Verilog model of an ASM, the
present_state variable indicates which algorithmic state the ASM is currently
performing. By testing the present_state inside ifs and whiles with the !==
operator, it is possible to implement arbitrary (goto-like) decisions without needing a
goto statement. Such tests are not part of what the hardware does. Mentioning
present_state in an ASM chart is unnecessary since an ASM chart allows
arbitrary gotos to any state. Such decisions are required only to overcome a limitation of
Verilog, and so using !== (rather than !=) is appropriate. The need for using !==
5.4.1 Bottom testing loop
A bottom testing loop is, technically speaking, "goto-less," but such a loop is still
difficult to translate because Verilog does not provide a bottom testing loop construct
in the language. In essence, since such a construct does not exist, the decision at the
bottom of the loop has to be thought of as a conditional goto that branches to the top
of the loop. (In the pure structural stage, this is how the loop would be implemented by
the computation of the next state in the state_gen function.) In the pure behavioral
stage, since Verilog lacks a goto statement, the only choice is to describe such a loop
using a while.
As an illustration, consider the nonsense ASM chart from section 2.1.2.1. Suppose the
states are assigned the following representations:
`define NUM_STATE_BITS 2
`define GREEN  2'b00
`define YELLOW 2'b01
`define RED    2'b10
It is incorrect to translate the loop involving state YELLOW using just a while:
module slow_div_system(pb,ready,x,y,r2,sysclk);
  input pb,x,y,sysclk;
  output ready,r2;
  wire pb;
  wire [11:0] x,y;
  reg ready;
  reg [11:0] r1,r2;
  reg [`NUM_STATE_BITS-1:0] present_state;

  always
    begin
      @(posedge sysclk) enter_new_state(`IDLE);
      r1 <= @(posedge sysclk) x;
      ready = 1;
      if (pb)
        begin
          @(posedge sysclk) enter_new_state(`INIT);
          r2 <= @(posedge sysclk) 0;
          while ((r1 >= y)|present_state !== `TEST)
            begin
              @(posedge sysclk) enter_new_state(`TEST);
              if (r1 >= y)
                begin
                  @(posedge sysclk) enter_new_state(`COMPUTE1);
                  r1 <= @(posedge sysclk) r1 - y;
                  @(posedge sysclk) enter_new_state(`COMPUTE2);
                  r2 <= @(posedge sysclk) r2 + 1;
                end
            end
        end
    end

  task enter_new_state;
    input [`NUM_STATE_BITS-1:0] this_state;
    begin
      present_state = this_state;
      #1 ready=0;
    end
  endtask

`define NUM_STATE_BITS 3
`define IDLE     3'b000
`define INIT     3'b001
`define TEST     3'b010
`define COMPUTE1 3'b011
`define COMPUTE2 3'b100
The troublesome state here is state TEST. There is a Verilog while loop whose body
includes state TEST and an if statement that includes the other states of the ASM
loop. Three situations can occur with the Verilog while loop: It is possible that the
while loop is being entered for the first time from state INIT, it is possible that the
while loop is to be reexecuted from state COMPUTE2, and it is possible that the
while loop is to exit from state TEST. In each of these three situations, the condition
inside the Verilog while loop is evaluated. The only one of these three situations in
which the Verilog loop body does not proceed to execute is when the ASM loop exits
from state TEST. The other two situations (from state INIT and state COMPUTE2) are
guaranteed to stay inside the Verilog while loop. Therefore, the present_state
!== `TEST condition makes sure that the next thing to execute in both of those two
situations will be the algorithmic top of the Verilog while loop (state TEST).
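The reason for preferring !== over != in such a test can be seen in a short sketch. The module name below is hypothetical, and the encoding of TEST is the one listed above; the point is only how the two operators treat a present_state that is still 'bx:

```verilog
// Sketch: != versus !== when present_state has not yet been assigned.
`define TEST 3'b010

module case_inequality_demo;
  reg [2:0] present_state;

  initial
    begin
      // No assignment yet, so present_state is 3'bxxx here.
      $display("unknown state: !=  yields %b", present_state !=  `TEST);
      $display("unknown state: !== yields %b", present_state !== `TEST);

      present_state = `TEST;
      $display("state TEST:    !=  yields %b", present_state !=  `TEST);
      $display("state TEST:    !== yields %b", present_state !== `TEST);
    end
endmodule
```

While present_state is unknown, != yields x (which a while or if treats as false), whereas !== yields a definite 1. Once present_state is TEST, both operators yield 0.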
In order to allow the Verilog while loop to exit at the identical $time that the ASM
loop exits, there is a nested if inside the Verilog while loop, after state TEST. This
if uses the same ASM condition (r1 >= y) that was also mentioned in the Verilog
while loop. In the situation when this condition is false, no $time has elapsed
before the Verilog while condition ((r1 >= y)|present_state !== `TEST) is
re-evaluated. Since the present state is state TEST and ASM condition (r1 >= y)
remains false since no $time has elapsed, the Verilog while loop exits properly. Here
is a simulation for x=14 and y=7:
Initializing such conditional command signals is important because in many situations
the Mealy command is explicitly mentioned only on certain paths through the ASM.
By describing the default values for all outputs (whether they are Mealy or Moore) in
enter_new_state, the behavioral Verilog will be a one-to-one mapping of the
corresponding ASM chart.

The diamond and oval inside the loop simply translate into an if statement followed
by the stay = 1 statement, with no intervening time control. Therefore, there is no
time control between the return from enter_new_state and the execution of the
if (and the possible consequent execution of stay=1). Suppose that count is
nonzero, which means stay becomes one at the same $time that speed and count
become one. Since leave is not mentioned inside the loop, it retains its default value
of zero.

On the other hand, suppose count is zero inside the loop. This means stay=1 does
not execute, and so stay retains its default value (of zero) given to it by
enter_new_state. No $time passes at the point where the while retests whether
count != 0. Since count is zero, the while is guaranteed to exit, but still no $time
has elapsed. This means that the leave=1 statement executes at the same $time as
the final call to enter_new_state(`YELLOW) returns back to the loop body.
Therefore the last cycle in which the machine is in state YELLOW will output leave as
one, but stay as zero.

Since this is a correct translation of a bottom testing loop, the only way the machine
exits from the while loop is from state YELLOW. (It is not possible to get directly
from state GREEN to the exit of the while because of the present_state !==
`YELLOW.) Therefore, this Verilog is a one to one mapping of the ASM.
In the above Verilog, the states are represented as:

  reg stop;
  reg [1:0] speed;
  reg [2:0] count;
  reg [`NUM_STATE_BITS-1:0] present_state;
  reg stay,leave;

The only difference between the ASM and the Verilog is that the Verilog needs the
proper time control for combinational logic, which is @ followed by a sensitivity list,
rather than any mention of the system clock.
5.7 Conclusion
Moore machines have commands that occur when the machine is in a particular state.
Mealy machines allow commands in a particular state to occur based on status. This
chapter shows how Mealy ASMs allow a designer to express faster and better
algorithms. Like Moore ASMs, Mealy ASMs have unconditional commands in rectangles.
Unlike Moore ASMs, Mealy ASMs have conditional commands in ovals that follow
diamonds. The conditional commands in the ovals happen at the same time as the
unconditional commands in the rectangle and the decisions in the diamonds.

Translating a Mealy ASM into behavioral Verilog is usually simple, typically involving
an if statement with no intervening time control. When the ASM involves command
signals (rather than RTN), as would be the case at the mixed stage, the
enter_new_state task must initialize the conditional commands. Some ASMs (both
Moore and Mealy) cannot be expressed in the goto-less style with simple whiles
and ifs. Such ASMs need to be translated into Verilog using !== tests of
present_state. A common example of an ASM that must be translated into Verilog
with a present_state !== test is a bottom testing loop. These techniques work
only for simulation. See chapter 11 for synthesis techniques.
Single-state Mealy ASMs are a general notation to describe combinational logic in a
behavioral fashion. As such, they are closely related to the behavioral Verilog descrip-
tion of combinational logic.
  wire c;
  halfadder ha1(c,sum[0],a[0],b[0]);
  fulladder fa1(sum[2],sum[1],a[1],b[1],c);
endmodule
Figure 6-1. Adder with names used in structural Verilog.
  a[0]                 0
  adder1.a[0]          0
  adder1.ha1.a         0
  adder1.ha1.a1        1
  adder1.ha1.c         1
  adder1.c             1
  adder1.fa1.a         1
  adder1.fa1.ha2.a     1
  adder1.fa1.ha2.a1    2
  adder1.fa1.ha2.c     2
  adder1.fa1.cout1     2
  adder1.fa1.o1        3
  adder1.fa1.c         3
  adder1.sum[2]        3
  sum[2]               3
There is also a dependency of sum[1] on a[0]. The following shows where the
change in a[0] has to propagate, and how much $time is required:

  a[0]                 0
  adder1.a[0]          0
  adder1.ha1.a         0
  adder1.ha1.a1        1
  adder1.ha1.c         1
  adder1.c             1
  adder1.fa1.a         1
  adder1.fa1.ha2.a     1
  adder1.fa1.ha2.x1    3
  adder1.fa1.ha2.s     3
  adder1.fa1.stemp     3
  adder1.fa1.ha3.a     3
  adder1.fa1.ha3.x1    5
  adder1.fa1.ha3.s     5
  adder1.sum[1]        5
  sum[1]               5
There are other similar delay paths, but none of them are longer than five units of
$time. Therefore, whatever code instantiates adder must wait more than five units
of $time after changing a and b before using sum.
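The waiting rule can be sketched in test code. The following is only an illustration
under stated assumptions: adder_standin is a hypothetical stand-in that replaces the
gate-level adder with a single lumped delay of 5 (the longest path found above), and
the module names wait_demo and adder_standin are not from the original design.

```verilog
// Hypothetical stand-in for the gate-level adder: one lumped delay of 5
// (the longest path traced above) instead of individual gate delays.
module adder_standin(sum,a,b);
  output [2:0] sum;
  input  [1:0] a,b;
  assign #5 sum = a + b;
endmodule

module wait_demo;
  integer ia,ib;
  reg  [1:0] a,b;
  wire [2:0] sum;

  adder_standin adder1(sum,a,b);

  initial
    begin
      for (ia=0; ia<=3; ia=ia+1)
        for (ib=0; ib<=3; ib=ib+1)
          begin
            a = ia;
            b = ib;
            #6;  // wait more than the worst case of five units
            if (sum !== ia+ib)
              $display("a=%d b=%d sum=%d UNSTABLE $time=%d",a,b,sum,$time);
          end
      $display("all cases checked at $time=%0d",$time);
      $finish;
    end
endmodule
```

Because every check happens six units after the inputs change, no UNSTABLE line should
appear for this stand-in. Shortening the wait to less than five units would sample sum
before it has settled.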
  integer ia,ib;
  reg [1:0] a,b;
  wire [2:0] sum;
  reg [2:0] oldsum;

  adder adder1(sum,a,b);

  always #1
    begin
      #0 if (a+b==sum)
           $display("a=%d b=%d sum=%d CORRECT $time=%d",
                     a,b,sum,$time);
         else
           if (sum==oldsum)
             $display("a=%d b=%d sum=%d WRONG LAG $time=%d",
                       a,b,sum,$time);
           else
             $display("a=%d b=%d sum=%d WRONG GLITCH $time=%d",
                       a,b,sum,$time);
      oldsum = sum;
    end

  initial
6.3.3 Hazards
In addition to the initial block in the test code of section 6.3.2, it is helpful to have
an always block that monitors the change in sum at every unit of $time. At the end
(#0) of each unit of $time, the always block checks if the current output of the
adder (sum) is equal to a+b. If it is, it prints out the "CORRECT" message. If it is not,
it prints out a message explaining the reason why. There are two possible explanations
for a "WRONG" value of sum. The first is that the current value of sum is the same as
what sum used to be at the previous unit of $time. In other words, sum is lagging
behind the change in a or b. The other possible error is that sum has changed (due to
the change in a or b) to an incorrect value. Such a momentary incorrect value from
combinational logic with propagation delay is known as a hazard (also known as a
glitch). Hazards occur when combinational logic internally has different path delays.
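A hazard caused by unequal path delays can be seen in a circuit much smaller than the
adder. The sketch below is a standard textbook static hazard, not the adder of this
section. The module name hazard_demo and the delay values are assumptions chosen
only to make the two paths to the and gate differ.

```verilog
// y = a & ~a is logically always 0, but the two paths from a to the
// and gate have different delays, so a momentary 1 can appear.
module hazard_demo;
  reg  a;
  wire na, y;

  not #2 n1(na, a);      // slow path: a reaches the and gate after 2 units
  and #1 a1(y, a, na);   // fast path: a reaches the and gate directly

  initial
    begin
      $monitor("$time=%0d a=%b na=%b y=%b", $time, a, na, y);
      a = 0;
      #10 a = 1;         // y rises at $time 11 and falls again at $time 13
      #10 $finish;
    end
endmodule
```

When a rises at $time 10, the and gate sees the new a together with the stale na for
two units, so y is 1 from $time 11 until $time 13 before settling back to 0.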
In the adder simulation above, for cases such as a=0 b=1, the output of sum simply
retains its old value until sum makes a single change to the correct value. In essence,
in these cases, it is like describing the adder with the following behavioral block:

  module adder(sum,a,b);
    parameter DELAY=1;
    output sum;
    input a,b;
    reg [2:0] sum;
    wire [1:0] a,b;
    always @(a or b)
      # DELAY sum=a+b;
  endmodule

where DELAY is an integer propagation delay. Although the above is an attractive way
of viewing propagation delay, it does not describe the more complex behavior that
occurs in other cases. For example, in the simulation of the adder given in section 6.3,
for cases such as a=2 b=3, at first (like the other cases) the output makes no change
(since the input change has not yet propagated to the output). Later, the output changes
to an incorrect result that is different from the earlier value of sum. Finally, the output
stabilizes on the correct result.
Although the a priori analysis using the circuit diagram indicates that more than #5
would always be safe, we could use simulation to see if #4 would be enough. Here is a
partial output of this simulation:
Not all simulators support specify blocks. For more information, check the
documentation for your simulator, or see the book by Palnitkar mentioned at the end of
this chapter.
6.4 Abstracting propagation delay
As the previous sections illustrate, once a design has been synthesized down to the gate
level, Verilog can provide a fairly accurate model of propagation delay. A problem
arises if one wishes to estimate propagation delay before synthesis. For a given
technology, manufacturers usually publish a priori estimates of worst case propagation
delays for bus-width building blocks (such as adders). We would like to be able to use
such worst case estimates to simulate the propagation delay of an architecture when it
is still at the mixed stage (block diagram). The problem is that the propagation delay of
a physical bus-width device exhibits itself only as specific hazards (like those
illustrated in section 6.3.3) that require a synthesized netlist to be simulated.

This section illustrates how Verilog can be used to model abstractly the propagation
delay of a bus-width device. The correct Verilog code for doing this uses some
relatively advanced features of Verilog. To motivate the need for these features, we will
first consider some incorrect attempts at modeling propagation delay.
6.4.1 Inadequate models for propagation delay
The simplest Verilog code for a bus-width device that includes some notion of
propagation delay is similar to the code given in 6.3.3, except that the port sizes are
defined by the first parameter, and the propagation delay is defined by the second
parameter:
module adder(s,a,b);
parameter SIZE = 1, DELAY = 0;
output s;
input a,b;
reg [SIZE-1:0] s;
wire [SIZE-1:0] a,b;
always @(a or b)
# DELAY s=a+b;
endmodule
As explained in section 6.3.3, this code is deficient because it does not model cases
where there is a hazard but instead always models the error as a lag.

How should a hazard be represented abstractly? The specific value that presents itself
when a hazard occurs can only be predicted from the synthesized netlist. Instead, at the
abstract level, we will use 'bx to represent the hazard.

  module adder(s,a,b);
    parameter SIZE = 1, DELAY = 0;
    output s;
    input a,b;
    reg [SIZE-1:0] s;
    wire [SIZE-1:0] a,b;
    always @(a or b)
      begin
        s = 'bx;
        # DELAY s=a+b;
      end
  endmodule
Although this is an improvement, the above still has a flaw. To see why it is deficient,
consider the following design which instantiates the above adder twice:

(Block diagram: adder a1, with an 8 ns delay, feeds adder a2, with a 10 ns delay, over
12-bit buses; the worst case through both is 18 ns.)

  $time=   0 t=x
  $time=   0 s=x
  $time=   8 t=   0
  $time=  10 s=   0
  $time=  30 t=x
  $time=  30 s=x
  $time=  38 t= 120
  $time=  40 s= 123
The test code instantiates two adders. The first instance, a1, has a propagation delay of
8, and the second instance, a2, has a delay of 10. At $time 30, the test code causes a,
b and c to change. A worst case analysis indicates that it should take 18 additional
$time units ($time=48) to produce the sum of 100+20+3; however the simulation
shows the correct sum in only 10 $time units ($time=40).
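The early result can be reproduced in isolation. An always block that is in the middle
of a # delay is not waiting at its @ control, so a change that arrives during the delay
does not start a new delay. The sketch below is a minimal illustration. The module
name missed_change_demo and the signals src and dst are hypothetical, and the
always block has the same shape as the adder's # DELAY s=a+b.

```verilog
module missed_change_demo;
  reg [7:0] src, dst;

  // Same shape as the adder model: wait for a change, delay, then copy.
  always @(src)
    #10 dst = src;

  always @(dst)
    $display("$time=%0d dst=%0d", $time, dst);

  initial
    begin
      #1 src = 1;   // $time 1: noticed, the block starts its #10
      #8 src = 2;   // $time 9: the block is inside #10, no new delay starts
      #30 $finish;
    end
endmodule
```

The only line printed is for $time 11, where dst is already 2. The second change
reaches the output after 2 units rather than 10, which is the same effect that lets
s=123 appear at $time 40 instead of $time 48.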
This flaw exists because the always block for a2 is still delaying (#10) when the
change in t (also known as a2.a) occurs at $time=38. Rather than delaying an
additional 10 units of $time from $time=38, Verilog simply returns to the time
  event e;

and, second, in time control:

  always @ e

Note: There are no parentheses around the variable in the @ time control. The ->
triggers the corresponding @ to be scheduled. For example, the following prints "10"
and "30":
  event e;
  initial
    begin
      #10;
      -> e;
      #20
      -> e;
    end
  always @ e
    $display($time);
Here is an example of how an event could be used to model the adder:

  module adder(s,a,b);
    parameter SIZE = 1, DELAY = 0;
    output s;
    input a,b;
    reg [SIZE-1:0] s;
    wire [SIZE-1:0] a,b;
    event change;
    always @(a or b)
      start_change;
    always @change
      # DELAY s=a+b;
    task start_change;
      begin
        s = 'bx;
        -> change;
      end
    endtask
  endmodule

Unfortunately, the above produces the same incorrect model of the adder (the correct
result 123 is available too soon) for the same reasons discussed in the previous section.
  module adder(s,a,b);
    parameter SIZE = 1, DELAY = 0;
    output s;
    input a,b;
    reg [SIZE-1:0] s;
    wire [SIZE-1:0] a,b;
    event change;
    always @(a or b)
      start_change;

    always @change
      begin : change_block
        # DELAY s=a+b;
      end

    task start_change;
      begin
        s = 'bx;
        disable change_block;
        #0;
        -> change;
      end
    endtask
  endmodule
The task start_change can be called many times from the first always block
without $time advancing. This way, every change in the inputs will be noticed by the
Verilog scheduler. The only # control in start_change is #0. This is required so the
disable statement can take effect. After change_block has been disabled, the
change event is retriggered. This, in turn, causes the full # DELAY before the output
changes from 'bx.

The # DELAY is in the block (change_block) which can be disabled. There is no
way that changes that occur in the middle of a # DELAY will be missed. Therefore,
instantiating a series of these adders will produce a correct model of the propagation
delay. For example, here is the simulation using the same test code as section 6.4.1:
  $time=   0 t=x
  $time=   0 s=x
  $time=   8 t=   0
  $time=  18 s=   0
  $time=  30 t=x
  $time=  30 s=x
  $time=  38 t= 120
  $time=  48 s= 123

Note that s is 'bx from $time 30 until $time 48, as is predicted by worst case
timing analysis.
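For readers who want to try the cascade themselves, the following is a minimal sketch
of a test bench that chains two of these adders. It is an assumption-laden
reconstruction, not the original test code: the module name test, the instance names
a1 and a2, the stimulus times and the display format are all chosen for illustration.
The adder is repeated so the example is self-contained, with ranges written on the
port declarations so that stricter simulators accept it.

```verilog
module adder(s,a,b);
  parameter SIZE = 1, DELAY = 0;
  output [SIZE-1:0] s;
  input  [SIZE-1:0] a,b;
  reg    [SIZE-1:0] s;
  event change;

  always @(a or b)
    start_change;

  always @change
    begin : change_block
      # DELAY s = a+b;
    end

  task start_change;
    begin
      s = 'bx;               // output is unknown while the change propagates
      disable change_block;  // abandon any delay already in progress
      #0;                    // let the disabled block return to @change
      -> change;             // restart the full delay
    end
  endtask
endmodule

module test;
  reg  [11:0] a,b,c;
  wire [11:0] t,s;

  adder #(12,8)  a1(t,a,b);   // delay of 8
  adder #(12,10) a2(s,t,c);   // delay of 10

  initial
    begin
      #1  a = 0;   b = 0;  c = 0;   // first values, applied after $time 0
      #29 a = 100; b = 20; c = 3;   // all inputs change at $time 30
      #40 $finish;
    end

  always @(t) $display("$time=%d t=%d",$time,t);
  always @(s) $display("$time=%d s=%d",$time,s);
endmodule
```

If the model behaves as described above, t should become 120 at $time 38 and s should
stay 'bx until it becomes 123 at $time 48. The first values are applied at $time 1
rather than $time 0 only to avoid depending on the order in which a simulator starts
its always and initial blocks.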
Designingfor Speed and Cost 215
    counter_register #12 r2(,r2bus,,1'b0,incr2,clrr2,sysclk);
    enabled_register #12 r3(r2bus, r3bus,ldr3,sysclk);
  endmodule

When this is simulated with a clock period of 100, it works:

  cl #(20000,90) clock(sysclk);
  slow_div_system slow_div_machine(pb,ready,
                                   x,y,quotient,reset,sysclk);
because the ALU will not have had a chance to stabilize by the time of the rising edge
of the clock.
If cost were not a constraint, problems with totally independent data values could be
solved by building one combinational logic machine for each data value to be
processed. Each such machine could compute its answer in parallel to all the other
machines. Although this kind of massively parallel approach is sometimes used, it is not
practical in many situations due to cost constraints.

Because practical problems with perfectly independent data are commonplace where
cost is as important as or more important than speed, three standard techniques have
been developed that allow the designer to choose the trade-off between speed and cost.
These three techniques are known as the single-cycle, pipelined, and multi-cycle
approaches. What these three techniques share in common is that no more than one
complete result is produced per clock cycle.
In between the single-cycle approach and the multi-cycle approach is the pipelined
approach. The pipelined approach usually requires more hardware than the other
approaches but often is the fastest and most efficient. In order to understand the
pipelined approach, it is necessary to investigate the two other approaches first.

As discussed earlier in this chapter, the total time required by a machine is the number
of clock cycles multiplied by the clock period. The three approaches discussed in this
section differ both in terms of the number of clock cycles required and the clock period.
We can understand the algorithmic distinctions among these three approaches at the
behavioral stage and even predict the number of clock cycles required at the behavioral
stage; however, we cannot predict which approach will be fastest at the behavioral
stage. This is because the clock period is determined by the propagation delay in the
architecture, which we cannot predict until the mixed stage, or when the hardware has
been synthesized.
6.5.1 Quadratic polynomial evaluator example
The quadratic polynomial a*x*x + b*x + c is a simple example of a formula that
a machine might evaluate many times with different values of x, but the same values of
a, b and c (which remain unchanged for a suitable period before, during and after the
quadratic evaluations). For each unique x value, the computation of the quadratic
formula is independent of the computation for other values of x. Although the formulae
used for practical problems, such as computer graphics, are more complex than this
familiar old quadratic, the nature of the formulae in such practical problems is very
similar to this quadratic.

Although a practical machine would probably store the x values in a synchronous
memory, for the sake of simplicity in this example, assume the values of x are
contained in a ROM. The goal of the machine is to evaluate the quadratic polynomial for
each of these x values and store the corresponding y values into a synchronous memory
Figure 6-5. Behavioral single cycle ASM with only ←.
Figure 6-6. Equivalent single cycle ASM with = for combinational logic.
Note the use of = rather than ← for the intermediate results (x1, x2, bx, ax2 and
bxc). As discussed in chapter 2, the = means that combinational logic computes all of
these values in one clock cycle. Note that x2 and bx are dependent on x1; ax2 is
dependent on x2; and bxc is dependent on bx. This means that the minimum clock
period for the single-cycle approach must allow enough time for the computations of
all of these intermediate results to stabilize. The amount of time it takes for the
combinational logic to finish computing these intermediate values is not something we
can predict at the behavioral stage. Up until this chapter, we have neglected such
propagation delays, but later in this chapter, we will estimate what these delays will be.
Since we do not know how fast the machine can be clocked, let us assume that the clock
period is 100 units of Verilog $time for the purpose of the following and later
simulations. Also, for reasons to be explained later, we will assume that each word in y is
initialized to 'bz prior to execution of this ASM. In the following partial simulation
output, the $time and registers are printed on one line with the contents of y on the
following line:

   349 ps=000 ma= 0 x1=x x2=x bx=x ax2=x bxc=x
     z   z   z   z   z   z   z   z
   449 ps=001 ma= 0 x1= 7 x2=49 bx=14 ax2=49 bxc=17
     z   z   z   z   z   z   z   z
   549 ps=001 ma= 1 x1= 6 x2=36 bx=12 ax2=36 bxc=15
    66   z   z   z   z   z   z   z
In state COMPUTE1 at $time 449, the machine obtains the value (7) of x1 from the
ROM. After getting this from the ROM, but during the same clock cycle, the machine
computes the square (49), and the product (49) of a and the square. Also in this clock
cycle, the machine computes the product (bx=14) of b and x1. After computing bx,
but before the end of the clock cycle, the machine computes the sum (bxc=17) and the
sum (ax2+bxc). This final result (66) is scheduled to be stored in the memory. This
value appears at the correct place (y[0]) by $time 549.

The number of clock periods required for this single cycle ASM to complete is
MAXMA+1, because one result is produced each clock cycle.
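The arithmetic in this walk-through can be checked with a few blocking assignments.
This is only a sanity check of the numbers, not a model of the machine: the module
name quad_check is hypothetical, and the coefficient values a=1, b=2 and c=3 are
inferred from the simulation output above (ax2 equals x2, bx is twice x1, and bxc is bx
plus 3).

```verilog
module quad_check;
  reg [11:0] a,b,c;
  reg [11:0] x1,x2,bx,ax2,bxc;

  initial
    begin
      a = 1; b = 2; c = 3;   // inferred from the simulation output
      x1  = 7;               // first word read from the ROM
      x2  = x1*x1;           // 49
      bx  = b*x1;            // 14
      ax2 = a*x2;            // 49
      bxc = bx + c;          // 17
      $display("x1=%0d x2=%0d bx=%0d ax2=%0d bxc=%0d y=%0d",
               x1,x2,bx,ax2,bxc,ax2+bxc);
    end
endmodule
```

The line printed is x1=7 x2=49 bx=14 ax2=49 bxc=17 y=66, which agrees with the
register values at $time 449 and the 66 that appears in y[0].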
Here is the behavioral Verilog code used to produce the above simulation:

  `define NUM_STATE_BITS 3
  `define IDLE     3'b000
  `define COMPUTE1 3'b001
  module poly_system(pb,a,b,c,ready,sysclk);
    input pb,a,b,c,sysclk;
    output ready;
    wire pb,sysclk;
    wire [11:0] a,b,c;
    reg [11:0] x[`MAXMA:0],y[`MAXMA:0];
    reg ready;
    reg [11:0] ma;
    reg [11:0] x1,x2,bx,ax2,bxc;
    reg [`NUM_STATE_BITS-1:0] present_state;
    integer i;
    initial
      begin
        for (i=0;i<=`MAXMA;i=i+1)
          begin
            x[i]=`MAXMA-i;
            y[i]='bz;
          end
      end
    always
      begin
        @(posedge sysclk) enter_new_state(`IDLE);
        ma <= @(posedge sysclk) 0;
        ready = 1;
        if (pb)
          begin
            while (ma != `MAXMA)
              begin
                @(posedge sysclk) enter_new_state(`COMPUTE1);
                ma <= @(posedge sysclk) ma + 1;
                x1 = x[ma];
                x2 = x1*x1;
                bx = b*x1;
                ax2 = a*x2;
                bxc = bx + c;
                y[ma] <= @(negedge sysclk) ax2 + bxc;
              end
          end
      end
  endmodule
Note that the order of the intermediate computations (=) matters in Verilog.
The non-blocking assignment to the memory location, y[ma], uses @(negedge
sysclk) rather than the @(posedge sysclk) typical for non-blocking assignment
to ordinary registers (see section 3.8.2). The problem here arises because new
values are stored into distinct elements of y during every clock cycle. Some simulators
will do the proper thing in a situation like this even if you were to use:
Figure 6-7. Behavioral multi-cycle ASM.
This machine has six registers (ma, x1, x2, bx, ax2 and bxc), and has six
states inside the loop. Here is a partial simulation, again assuming a clock period of 100
(which may be much longer than is actually required):

   649 ps=011 ma= 0 x1= 7 x2=49 bx=x ax2=x bxc=x
     z   z   z   z   z   z   z   z
   749 ps=100 ma= 0 x1= 7 x2=49 bx=14 ax2=x bxc=x
     z   z   z   z   z   z   z   z
   849 ps=101 ma= 0 x1= 7 x2=49 bx=14 ax2=49 bxc=x
     z   z   z   z   z   z   z   z
   949 ps=110 ma= 0 x1= 7 x2=49 bx=14 ax2=49 bxc=17
     z   z   z   z   z   z   z   z
  1049 ps=001 ma= 1 x1= 7 x2=49 bx=14 ax2=49 bxc=17
    66   z   z   z   z   z   z   z
  `define NUM_STATE_BITS 3
  `define IDLE     3'b000
  `define COMPUTE1 3'b001
  `define COMPUTE2 3'b010
  `define COMPUTE3 3'b011
  `define COMPUTE4 3'b100
  `define COMPUTE5 3'b101
  `define COMPUTE6 3'b110

    always
      begin
        @(posedge sysclk) enter_new_state(`IDLE);
        ma <= @(posedge sysclk) 0;
        ready = 1;
        if (pb)
          begin
            while (ma != `MAXMA)
              begin
                @(posedge sysclk) enter_new_state(`COMPUTE1);
                x1 <= @(posedge sysclk) x[ma];
                @(posedge sysclk) enter_new_state(`COMPUTE2);
                x2 <= @(posedge sysclk) x1*x1;
                @(posedge sysclk) enter_new_state(`COMPUTE3);
                bx <= @(posedge sysclk) b*x1;
                @(posedge sysclk) enter_new_state(`COMPUTE4);
                ax2 <= @(posedge sysclk) a*x2;
                @(posedge sysclk) enter_new_state(`COMPUTE5);
                bxc <= @(posedge sysclk) bx + c;
                @(posedge sysclk) enter_new_state(`COMPUTE6);
                ma <= @(posedge sysclk) ma + 1;
                y[ma] <= @(negedge sysclk) ax2 + bxc;
              end
          end
      end
6.5.4 First attempt at pipelining
The single-cycle approach puts all the computation steps into one clock cycle but uses
= (corresponding only to combinational logic) for the intermediate results. The
multi-cycle approach spreads the computation steps across separate clock cycles, but
uses ← (corresponding to registers) for the intermediate results. The pipelined
approach is halfway between these two approaches.

Worker #1 might tighten a bolt, worker #2 might weld a seam and worker #3 might
paint the item. Each worker acts in parallel to the other workers. In the above picture,
worker #1 is tightening the bolt on item #3 at the same time that worker #2 is welding
item #2 (which already has its bolt tightened) and that worker #3 is painting item #1
(which has its bolt tightened and which has been welded). Each item has experienced
the correct sequence in order (tightening, welding and painting), but the tightening,
welding and painting that happens at any given instant occurs to independent items.

With this analogy in mind, we can understand that each of the intermediate
computations (←) in the pipelined quadratic evaluator produces an intermediate result
derived from a different x value. Here is a first (somewhat flawed) attempt to describe
the factory-like operation of this pipelined system:
Another, less obvious, flaw is that garbage values ('bx, 'bx and 2) are stored into the
memory during the first clock cycles (449, 549 and 649). Initializing y to 'bz rather
than 'bx highlights this flaw.
6.5.5 Pipelining the ma
The major problem with the ASM chart of section 6.5.4 is that the memory address
used to store into y does not correspond to the value being stored. To overcome this
problem, we can introduce three additional registers, ma1, ma2 and ma3, that will
save the memory addresses from the previous three clock cycles. In a given clock cycle,
ma1 is the value of ma one clock cycle ago, ma2 is the value of ma two clock cycles
ago, and ma3 is the value of ma three clock cycles ago.
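The way ma1, ma2 and ma3 trail behind ma can be seen in a small stand-alone sketch.
It is an illustration only: the module name address_pipe_demo is hypothetical, the
registers are initialized to zero for convenience, and it uses the common always
@(posedge sysclk) style with non-blocking assignment rather than the enter_new_state
style of this chapter.

```verilog
module address_pipe_demo;
  reg sysclk;
  reg [11:0] ma, ma1, ma2, ma3;

  initial
    begin
      sysclk = 0;
      ma = 0; ma1 = 0; ma2 = 0; ma3 = 0;
      $monitor("$time=%0d ma=%0d ma1=%0d ma2=%0d ma3=%0d",
               $time, ma, ma1, ma2, ma3);
      #400 $finish;
    end

  always #50 sysclk = ~sysclk;   // clock period of 100

  // Every right-hand side is read before any register changes, so each
  // register receives the value its neighbor held in the previous cycle.
  always @(posedge sysclk)
    begin
      ma  <= ma + 1;
      ma1 <= ma;
      ma2 <= ma1;
      ma3 <= ma2;
    end
endmodule
```

After the rising edges at $time 50, 150, 250 and 350 the registers read 1 0 0 0, then
2 1 0 0, then 3 2 1 0, then 4 3 2 1, so ma3 always holds the address that ma held three
clock cycles earlier.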
Figure 6-10. Pipelined ASM with multiple addresses but without flush.
Here is a simulation that shows how addresses flow through the ma pipeline:

   349 ps=000 ma= 0 x1=x x2=x bx=x
       ma1= 0 ax2=x bxc=x ma2= 0 ma3= 0
     z   z   z   z   z   z   z   z
   449 ps=001 ma= 0 x1=x x2=x bx=x
       ma1= 0 ax2=x bxc=x ma2= 0 ma3= 0
     z   z   z   z   z   z   z   z
   549 ps=001 ma= 1 x1= 7 x2=x bx=4094
       ma1= 0 ax2=4095 bxc=x ma2= 0 ma3= 0
     x   z   z   z   z   z   z   z
   649 ps=001 ma= 2 x1= 6 x2= 49 bx= 14
       ma1= 1 ax2= 1 bxc= 1 ma2= 0 ma3= 0
     x   z   z   z   z   z   z   z
   749 ps=001 ma= 3 x1= 5 x2= 36 bx= 12
       ma1= 2 ax2= 49 bxc= 17 ma2= 1 ma3= 0
     2   z   z   z   z   z   z   z
   849 ps=001 ma= 4 x1= 4 x2= 25 bx= 10
       ma1= 3 ax2= 36 bxc= 15 ma2= 2 ma3= 1
    66   z   z   z   z   z   z   z
   949 ps=001 ma= 5 x1= 3 x2= 16 bx= 8
       ma1= 4 ax2= 25 bxc= 13 ma2= 3 ma3= 2
    66  51   z   z   z   z   z   z
Although the addition of ma1, ma2 and ma3 solves the addressing problem, the
revised ASM still does not finish the complete job (storing 11, 6, and 3). As was the case
in the ASM of section 6.5.4, the intermediate results needed to produce 11, 6 and 3 are
left frozen in the pipeline when the machine returns to state IDLE. Also, garbage values
('bx, 'bx and 2) are still stored into the memory during the first clock cycles,
although now they are stored in y[0] each time.
6.5.6 Flushing the pipeline
In order to prevent the final values from being frozen in the pipeline, there need to be
some additional clock cycles spent "flushing" those values out of the pipeline.
Returning to the factory analogy, when the factory is about to cease production of a
particular model item, worker #1 can stop work earliest, but the other workers must
finish their tasks on the last item worker #1 tightened. Similarly, worker #2 can stop
before worker #3. So it is with flushing the pipeline.

The ASM needs three states, FLUSH1, FLUSH2 and FLUSH3, that perform the
required computations on the valid data in the pipeline. For those registers that have
valid data, the computations are identical to those in state COMPUTE1. At each
successive flushing state, there are fewer registers in the pipeline that contain valid
data; thus each successive state has fewer computations to perform.
6.5.7 Filling the pipeline
The reason that garbage values have been stored by all the previous pipeline attempts is
because of the assignment to y[ma3] in state COMPUTE1. During the first clock
cycles when state COMPUTE1 executes, ma3, ax2 and bxc do not have legitimate
values. Therefore, to store ax2+bxc at address ma3 is illegitimate. The situation is the

Figure 6-11. Correct pipelined ASM that fills and flushes.
Here is a simulation that shows the proper values filling the pipeline:
The rest of the simulation is similar to the previous example. Here is the behavioral
Verilog code used to produce the simulation of the correct pipelined machine:
  `define NUM_STATE_BITS 3
  `define IDLE     3'b000
  `define COMPUTE1 3'b001
  `define FLUSH1   3'b101
  `define FLUSH2   3'b110
  `define FLUSH3   3'b111
  `define FILL1    3'b010
  `define FILL2    3'b011
  `define FILL3    3'b100

    always
      begin
        @(posedge sysclk) enter_new_state(`IDLE);
        ma <= @(posedge sysclk) 0;
        ready = 1;
        if (pb)
          begin
            @(posedge sysclk) enter_new_state(`FILL1);
            ma <= @(posedge sysclk) ma + 1;
            ma1 <= @(posedge sysclk) ma;
            x1 <= @(posedge sysclk) x[ma];
            @(posedge sysclk) enter_new_state(`FILL2);
            ma <= @(posedge sysclk) ma + 1;
            ma1 <= @(posedge sysclk) ma;
            ma2 <= @(posedge sysclk) ma1;
            x1 <= @(posedge sysclk) x[ma];
            x2 <= @(posedge sysclk) x1*x1;
            bx <= @(posedge sysclk) b*x1;
            @(posedge sysclk) enter_new_state(`FILL3);
            ma <= @(posedge sysclk) ma + 1;
            ma1 <= @(posedge sysclk) ma;
            ma2 <= @(posedge sysclk) ma1;
            ma3 <= @(posedge sysclk) ma2;
            x1 <= @(posedge sysclk) x[ma];
            x2 <= @(posedge sysclk) x1*x1;
            bx <= @(posedge sysclk) b*x1;
            ax2 <= @(posedge sysclk) a*x2;
            bxc <= @(posedge sysclk) bx + c;
            while (ma != `MAXMA)
              begin
                @(posedge sysclk) enter_new_state(`COMPUTE1);
                ma <= @(posedge sysclk) ma + 1;
                ma1 <= @(posedge sysclk) ma;
                ma2 <= @(posedge sysclk) ma1;
                ma3 <= @(posedge sysclk) ma2;
                x1 <= @(posedge sysclk) x[ma];
                x2 <= @(posedge sysclk) x1*x1;
                bx <= @(posedge sysclk) b*x1;
                ax2 <= @(posedge sysclk) a*x2;
                bxc <= @(posedge sysclk) bx + c;
                y[ma3] <= @(negedge sysclk) ax2 + bxc;
              end
            @(posedge sysclk) enter_new_state(`FLUSH1);
            ma2 <= @(posedge sysclk) ma1;
            ma3 <= @(posedge sysclk) ma2;
            x2 <= @(posedge sysclk) x1*x1;
            bx <= @(posedge sysclk) b*x1;
            ax2 <= @(posedge sysclk) a*x2;
            bxc <= @(posedge sysclk) bx + c;
            y[ma3] <= @(negedge sysclk) ax2 + bxc;
            @(posedge sysclk) enter_new_state(`FLUSH2);
            ma3 <= @(posedge sysclk) ma2;
            ax2 <= @(posedge sysclk) a*x2;
            bxc <= @(posedge sysclk) bx + c;
            y[ma3] <= @(negedge sysclk) ax2 + bxc;
            @(posedge sysclk) enter_new_state(`FLUSH3);
            y[ma3] <= @(negedge sysclk) ax2 + bxc;
          end
      end
[Figure 6-12: block diagram of the architecture described in the following paragraphs.]
The ma register provides the same address to x and y. The mabus is also compared
against MAXMA to produce the status signal maeqmax.
There are three multipliers in the architecture. The ma register selects a particular word
from the ROM. This value is fed to the first two multipliers. One of these multipliers
produces the square, and the other multiplies this value by b. There is a third multiplier
that multiplies the square by a.
module polyarch(maeqmax,ldy,incma,clrma,a,b,c,sysclk);
  output maeqmax;
  input ldy,incma,clrma,a,b,c,sysclk;
  wire maeqmax,ldy,incma,clrma,sysclk;
  wire [11:0] a,b,c;
  wire [11:0] x2,bx,ax2,bxc,xbus,ydibus,mabus;
In the above, it is assumed that the propagation delays of the ROM, multipliers and
adders are 23, 24 and 25 units of $time, respectively. `CLOCK_PERIOD is 100,
which is just barely long enough for all the combinational logic to stabilize before a
result is clocked into y, as illustrated by the following timing diagram produced by a
Verilog simulator:
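signal        earlier value   later value
mabus[11:0]   2               3
xbus[11:0]    5               4
x2[11:0]      25              16
bx[11:0]      10              8
ax2[11:0]     25              16
bxc[11:0]     13              11
s[11:0]       38              27
(Every bus except mabus passes through the unknown value x while the logic settles
between the two values. The sysclk waveform is not reproduced.)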
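The claim that a period of 100 is "just barely long enough" can be checked by adding the delays along the two paths that feed the final adder. The following is a minimal sketch, not code from the book: the module name and macro names are hypothetical, the delay values are the ones assumed in the text, and register clock-to-output delay is ignored.

```verilog
// Sketch only: delays are those assumed in the text (ROM 23,
// multiplier 24, adder 25). Module and macro names are hypothetical.
`define ROM_DELAY  23
`define MULT_DELAY 24
`define ADD_DELAY  25

module critical_path_sketch;
  integer path_ax2, path_bxc;
  initial
    begin
      // ma -> ROM -> x*x -> a*x2 -> final adder
      path_ax2 = `ROM_DELAY + `MULT_DELAY + `MULT_DELAY + `ADD_DELAY;
      // ma -> ROM -> b*x -> bx+c -> final adder
      path_bxc = `ROM_DELAY + `MULT_DELAY + `ADD_DELAY + `ADD_DELAY;
      $display("path through ax2 = %0d", path_ax2);  // 96
      $display("path through bxc = %0d", path_bxc);  // 97
    end
endmodule
```

Under these assumptions the longer path needs 97 units of $time, which leaves only 3 units of slack in a period of 100.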
Figure 6-14. Multi-cycle architecture.
The only difference between this architecture and the previous architecture is the
insertion of registers for x1, x2, bx, ax2 and bxc. As indicated by the ASM chart, it
takes six clock cycles for each computation to travel through this architecture. In the
first cycle, only ldx1 is asserted. In the second cycle, only ldx2 is asserted. In the
third cycle, only ldbx is asserted. In the fourth cycle, only ldax2 is asserted. In the
fifth cycle, only ldbxc is asserted. Finally, in the sixth cycle, ldy and incma are
asserted.
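The six-cycle command sequence just described can be sketched as a small free-running controller. This is an illustration under stated assumptions rather than the book's mixed controller: the module name, the reset input and the step counter are hypothetical, and the IDLE state, pb and maeqmax handling are left out so that only the one-command-per-cycle pattern is visible.

```verilog
// Hypothetical sketch: asserts exactly one load command per clock
// cycle, in the order given in the text. Not the book's controller.
module multi_cycle_ctrl_sketch(ldx1,ldx2,ldbx,ldax2,ldbxc,ldy,incma,
                               sysclk,reset);
  output ldx1,ldx2,ldbx,ldax2,ldbxc,ldy,incma;
  input sysclk,reset;
  reg [2:0] step;              // which of the six cycles is active (0..5)

  always @(posedge sysclk)
    if (reset)
      step <= 0;
    else if (step == 5)
      step <= 0;               // start the next computation
    else
      step <= step + 1;

  assign ldx1  = (step == 0);  // first cycle
  assign ldx2  = (step == 1);  // second cycle
  assign ldbx  = (step == 2);  // third cycle
  assign ldax2 = (step == 3);  // fourth cycle
  assign ldbxc = (step == 4);  // fifth cycle
  assign ldy   = (step == 5);  // sixth cycle: store the result ...
  assign incma = (step == 5);  // ... and advance the address
endmodule
```

Because only one command is hot in any cycle, only one of the six registers loads at a time, which is why most of the datapath is idle in each cycle.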
As mentioned above, this is not a particularly efficient architecture for the multi-cycle
approach because in any given clock cycle, five-sixths of the architecture is not
performing any useful computation. Nevertheless, the insertion of the registers allows this
architecture to be clocked considerably faster than the architecture in section 6.5.8.1.

The following Verilog code shows the definition of this multi-cycle architecture, along
with the corresponding mixed controller:
  enabled_register #12 bxc(bxcdibus,bxcdobus,ldbxc,sysclk);
  adder #(12,25) a2(ydibus,ax2dobus,bxcdobus);
  ram #12 y(ldy,mabus,ydibus,,sysclk);
In the above, it is assumed that the propagation delays are the same as in the
single-cycle approach of section 6.5.8.1. With the multi-cycle approach, `CLOCK_PERIOD
can now be 26 in this example, which makes the clock nearly four times faster than in the
single-cycle approach. The faster clock is possible because there is less logic that has
to stabilize before each intermediate result is clocked into one of the registers. The
following timing diagram illustrates this:
[Timing diagram of the multi-cycle architecture, covering $time 870 through 1020. Signals
shown: present_state[2:0] (stepping through 001, 010, 011, 100, 101 and 110 before
returning to 001), xbus, x1bus, x2dibus, x2dobus, ax2dibus, ax2dobus, bxdibus, bxdobus,
bxcdibus, bxcdobus, ydibus, mabus and sysclk.]
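A faster clock does not by itself mean faster results. As a rough worked comparison, the sketch below multiplies the number of cycles per result by the clock period for the two architectures discussed so far. The module name is hypothetical; the numbers (one cycle at a period of 100, six cycles at a period of 26) are the ones given in the text, and time spent in any idle state is ignored.

```verilog
// Sketch only: time per result = cycles per result * clock period,
// using the periods quoted in the text.
module time_per_result_sketch;
  integer single_cycle, multi_cycle;
  initial
    begin
      single_cycle = 1 * 100;  // one cycle per result, period 100
      multi_cycle  = 6 * 26;   // six cycles per result, period 26
      $display("single-cycle: %0d units of $time per result", single_cycle); // 100
      $display("multi-cycle:  %0d units of $time per result", multi_cycle);  // 156
    end
endmodule
```

Under these assumptions the multi-cycle machine takes longer per result than the single-cycle machine, even though its clock is nearly four times faster.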
6.5.8.3 Pipelined architecture
The correct behavioral ASM for the pipelined method given in section 6.5.7 requires
three additional registers: ma1, ma2 and ma3. In a pipelined design, all of the
registers inserted in the single-cycle architecture to make it a pipelined architecture are
referred to as pipeline registers. In this architecture, every register, except ma, is a
pipeline register.
always
  begin
    @(posedge sysclk) enter_new_state(`IDLE);
     //ma <= @(posedge sysclk) 0;
     ready = 1;
     clrma = 1;
     if (pb)
       begin
         @(posedge sysclk) enter_new_state(`FILL1);
          //ma  <= @(posedge sysclk) ma + 1;
          //ma1 <= @(posedge sysclk) ma;
          //x1  <= @(posedge sysclk) x[ma];
          ldx1 = 1;
          incma = 1;
          ldma1 = 1;
         @(posedge sysclk) enter_new_state(`FILL2);
          //ma  <= @(posedge sysclk) ma + 1;
          //ma1 <= @(posedge sysclk) ma;
          //ma2 <= @(posedge sysclk) ma1;
          //x1  <= @(posedge sysclk) x[ma];
          //x2  <= @(posedge sysclk) x1*x1;
          //bx  <= @(posedge sysclk) b*x1;
          ldx1 = 1;
          ldx2 = 1;
          ldbx = 1;
          incma = 1;
          ldma1 = 1;
          ldma2 = 1;
         @(posedge sysclk) enter_new_state(`FILL3);
          //ma  <= @(posedge sysclk) ma + 1;
          //ma1 <= @(posedge sysclk) ma;
          //ma2 <= @(posedge sysclk) ma1;
          //ma3 <= @(posedge sysclk) ma2;
          //x1  <= @(posedge sysclk) x[ma];
          //x2  <= @(posedge sysclk) x1*x1;
          //bx  <= @(posedge sysclk) b*x1;
          //ax2 <= @(posedge sysclk) a*x2;
          //bxc <= @(posedge sysclk) bx + c;
          ldx1 = 1;
          ldx2 = 1;
          ldbx = 1;
          ldax2 = 1;
          ldbxc = 1;
          incma = 1;
          ldma1 = 1;
          ldma2 = 1;
          ldma3 = 1;
[Timing diagram of the pipelined architecture. Signals shown: present_state[2:0], xbus,
x2dibus, x2dobus, ax2dibus, ax2dobus, bxdibus, bxdobus, bxcdibus, bxcdobus, ydibus,
mabus, ma1bus, ma2bus, ma3bus and sysclk. In successive clock cycles ma3bus takes the
values 0, 1, 2, 3, 4 and 5, while ma2bus, ma1bus and mabus run one, two and three
addresses ahead of it.]
6.6 Conclusion
The first duty of a designer is to produce a correct design. The top-down design process
explained in earlier chapters helps organize a designer's thinking to achieve this goal.
Often, in addition to being algorithmically correct, a practical design must meet the
criteria of speed and cost. This chapter explains how Verilog can help a designer
determine if a design meets its speed goals. This chapter also explains different design
alternatives that allow a designer to trade off speed and cost.

The speed of an algorithm implemented in hardware depends on two factors: the clock
period and the number of clock cycles. The algorithm itself determines how many
clock cycles are required, but the limiting factor on how short the clock period can be is
gate-level propagation delay. Synthesizable Verilog cannot have propagation delay, but
once a design is synthesized, it is easy to annotate the built-in gates of the netlist with
propagation delays. (Some synthesis tools automatically backannotate the netlist for
post-synthesis simulation.) Gates with delays create the possibility of spurious wrong
outputs, known as hazards.

One does not have to be a genius like Babbage to understand pipelining today,
because modern tools like Verilog simulators make these intricate $time-related
concepts much easier to understand.
6.7 Further reading
GAJSKI, DANIEL D., Principles of Digital Design, Prentice Hall, Upper Saddle River,
NJ, 1997. Chapter 8 discusses how pipelining can be applied to both the controller and
the architecture (datapath).

PALNITKAR, S., Verilog HDL: A Guide to Digital Design and Synthesis, Prentice Hall
PTR, Upper Saddle River, NJ, 1996. Chapters 5 and 10 explain sophisticated
gate-level delay modeling in Verilog.

PATTERSON, DAVID A. and JOHN L. HENNESSY, Computer Organization and Design: The
Hardware/Software Interface, Morgan Kaufmann, San Mateo, CA, 1994. Chapters 5
and 6 discuss the trade-offs of the single-cycle, multi-cycle and pipelined approaches.
6.8 Exercises
6-1. A complex number, X, can be represented inside a machine as two integers: the
real part, xr, and the imaginary part, xi. Mathematicians say that X = xr+i*xi,
where i is the square root of minus one. (Some electrical engineers use the symbol j
instead of i.) To add two complex numbers, X and Y, simply requires adding the real
and imaginary parts separately. To multiply two complex numbers, X and Y, requires
computing xr*yr-xi*yi and xr*yi+xi*yr. Suppose that a machine has four
ROMs: xr[ma], xi[ma], yr[ma] and yi[ma]. Design a multi-cycle behavioral
ASM suitable for a central ALU architecture that computes the sum of the products of
the complex values in the X and Y ROMs. This computation has many practical
applications in the field of digital signal processing, such as filtering out unwanted noise
in a telephone conversation. Note that there is no need for a memory in this problem
because the desired answer is a single complex sum composed of sumr and sumi. You
may assume the ALU can do either an integer addition, subtraction or multiplication in
a single cycle.

6-2. Implement a pure behavioral Verilog simulation and test code that verifies your
design in problem 6-1.
7.1.2 Arrow/wire
An arrow in an ASM (or the implicit flow of control in Verilog) corresponds to a physi-
cal wire in a one hot controller. When that wire is hot it means that the corresponding
statement in Verilog is active during the current clock cycle. Several statements (that
execute in parallel) might be hot in a particular clock cycle, but only one flip flop
(corresponding to a state) is hot in that clock cycle.
7.1.4 Decision/demux
A decision (diamond in an ASM or an equivalent if or while in Verilog) translates
into a one bit wide demux. Recall from appendix C that the combinational logic for a
demux is very different from that of a mux. The following truth table describes the
outputs (out0 and out1) of the demux, given its two inputs (in and cond):

  in  cond  out1  out0
  0   0     0     0
  0   1     0     0
  1   0     0     1
  1   1     1     0

Notice when in is cold (i.e., 0), both outputs are cold. When in is hot (i.e., 1), only
one of its two outputs is hot; hence it preserves the one hot property.
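The demux behavior described above can be written down directly as two AND terms. The following is a minimal sketch with hypothetical module names, assuming that out1 is the output selected when cond is 1 and out0 the output selected when cond is 0; the small test module simply prints all four input combinations.

```verilog
// Sketch of a one bit wide demux: "in" is steered to out1 when cond
// is 1 and to out0 when cond is 0. Module names are hypothetical.
module demux_sketch(out1,out0,in,cond);
  output out1,out0;
  input in,cond;
  assign out1 = in & cond;
  assign out0 = in & ~cond;
endmodule

module demux_sketch_test;
  reg in,cond;
  wire out1,out0;
  integer i;
  demux_sketch d(out1,out0,in,cond);
  initial
    for (i = 0; i < 4; i = i + 1)
      begin
        {in,cond} = i;   // step through the four input combinations
        #1 $display("in=%b cond=%b out1=%b out0=%b", in, cond, out1, out0);
      end
endmodule
```

Since both outputs are ANDed with in, neither can be hot unless in is hot, and the complementary use of cond means at most one of them is hot at a time.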
The one hot method uses more flip flops than the method shown in chapter 4. The
number of flip flops of the present state register in chapter 4 is approximately the base
two logarithm of the number of states. The following table shows how many flip flops
are required in each method:
  states   binary encoded (chapter 4)   one hot   one hot with power on
  2        1                            2         3
  3        2                            3         4
  4        2                            4         5
  5        3                            5         6
  6        3                            6         7
  7        3                            7         8
  8        3                            8         9
The need for the extra flip flop for "power on" will be explained in the next section.
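The flip flop counts in the table follow a simple rule: the binary encoded method needs the base two logarithm of the number of states, rounded up, while the one hot method needs one flip flop per state plus one more for power on. The sketch below regenerates those counts; the module name and variable names are hypothetical.

```verilog
// Sketch only: recompute the flip flop counts for 2 through 8 states.
module ff_count_sketch;
  integer n, bits;
  initial
    for (n = 2; n <= 8; n = n + 1)
      begin
        bits = 0;
        while ((1 << bits) < n)   // smallest bits with 2**bits >= n
          bits = bits + 1;
        $display("%0d states: binary %0d, one hot %0d, with power on %0d",
                 n, bits, n, n + 1);
      end
endmodule
```

For eight states, for example, this gives 3 flip flops for the binary encoded method and 9 for the one hot method with its power-on flip flop.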
Figure 7-2. One hot controller for ASMs of sections 2.2.2 and 2.3.3.
In order to guarantee that the one hot property holds at $time 0, all of the flip flops are
connected to an asynchronous reset signal. In physical hardware, this reset signal ceases
to be active shortly after $time 0. It is never used again.

By itself, just resetting the flip flops that correspond to the states in a particular ASM
would have the effect of making all those flip flops cold at $time 0. Exactly one of
these flip flops (the one for state IDLE in this example) needs to become hot at the first
rising edge of the clock. To accomplish this, we need to OR an additional wire on the
path to the flip flop for state IDLE.

This extra wire will be the output of a power-on device that will be hot only from
$time 0 until the first rising edge of the clock. After the first rising edge of the clock,
this power-on device will be cold thereafter.
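One way to picture the power-on device is as a flip flop that the asynchronous reset sets hot (rather than cold) and whose data input is tied to 0, so that it goes cold at the first rising clock edge and stays cold. The following is a hedged sketch, not the book's circuit: the module and signal names are hypothetical, reset is assumed to be active high and to be released before the first rising edge of sysclk, and next_idle stands for whatever other wires lead to state IDLE.

```verilog
// Hypothetical sketch of a power-on device feeding the IDLE flip flop.
module power_on_sketch(ff_idle, next_idle, reset, sysclk);
  output ff_idle;
  input next_idle, reset, sysclk;
  reg power_on, ff_idle;

  // Power-on device: made hot by reset, cold after the first clock edge.
  always @(posedge sysclk or posedge reset)
    if (reset)
      power_on <= 1;
    else
      power_on <= 0;

  // IDLE flip flop: made cold by reset like every other state flip
  // flop, but the extra wire ORed onto its input makes it hot at the
  // first rising edge of the clock.
  always @(posedge sysclk or posedge reset)
    if (reset)
      ff_idle <= 0;
    else
      ff_idle <= next_idle | power_on;
endmodule
```

After the first rising edge exactly one flip flop (ff_idle) is hot and power_on contributes nothing further, which is the behavior the text asks for.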
There also is an additional shorthand for continuous assignment that allows the wire
declaration to occur on the same line. For example, the above is equivalent to:
Note that continuous assign is not a behavioral statement. The left-hand side is a wire,
not a reg. Clearly, using continuous assign shortens the code considerably compared
to the "hidden-module" shown below:
module test;
  reg ff_1,ff_2;
  ... //code that deals with ff_1 and ff_2
  wire s_3;
  hiddenmodule h1(s_3, ff_1, ff_2);
endmodule
This "hiddenmodule" defines combinational logic in the usual way with a reg
for the output port and a sensitivity list for the input wires:
module hiddenmodule(s_3, ff_1, ff_2);
  output s_3;
  input ff_1, ff_2;
  reg s_3;
  wire ff_1, ff_2;
  always @(ff_1 or ff_2)
    s_3 = ff_1 | ff_2;
endmodule
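For comparison, the continuous assignment that this hidden module stands for can be sketched as follows. This is a reconstruction from the names used in the hidden module example (ff_1, ff_2 and s_3), not a listing taken from the book; the module names are hypothetical. The first module uses a separate wire declaration and assign, and the second uses the one-line shorthand in which the wire declaration carries the assignment.

```verilog
// Sketch: continuous assignment with a separate wire declaration.
module test_assign_sketch;
  reg ff_1,ff_2;
  wire s_3;
  assign s_3 = ff_1 | ff_2;
endmodule

// Sketch: the equivalent shorthand, declaring and assigning the
// wire on the same line.
module test_shorthand_sketch;
  reg ff_1,ff_2;
  wire s_3 = ff_1 | ff_2;
endmodule
```

Either form replaces the whole hiddenmodule definition and its instance with a single line.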
The computation, ff_1 | ff_2, is the same as given in the continuous assign. The
power of the continuous assign is that it allows arbitrarily complicated expressions (of
arbitrary width) to be evaluated. For example, the following:
module test;
  reg [11:0] a,b;
  ... //code that deals with a and b
  wire [11:0] sum;
  assign sum = a + b;
endmodule
is equivalent to instantiating a hiddenadder:
module test;
  reg [11:0] a,b;
  ... //code that deals with a and b
An advantage of the continuous assignment is that you do not have to specify the widths
of sum, a and b multiple times; their previous declarations are sufficient. In contrast,
with the hidden adder approach you have to duplicate the declaration of their widths
inside the declaration of hiddenadder.

Also, continuous assignment allows the use of the conditional operator (?:). For
example, the following:

module test;
  reg [11:0] a,b;
  reg sel;
  ... //code that deals with a,b,sel
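As a minimal sketch of how the conditional operator in a continuous assignment describes a mux, consider the following. The module name and the wire name muxout are hypothetical, and the initial block is only there to exercise the two settings of sel.

```verilog
// Sketch: a 12-bit two-input mux written as a continuous assignment.
module mux_sketch;
  reg [11:0] a,b;
  reg sel;
  wire [11:0] muxout;
  assign muxout = sel ? a : b;   // a when sel is 1, b when sel is 0

  initial
    begin
      a = 12; b = 34;
      sel = 1;
      #1 $display("sel=%b muxout=%0d", sel, muxout);  // 12
      sel = 0;
      #1 $display("sel=%b muxout=%0d", sel, muxout);  // 34
    end
endmodule
```

The same line would describe a mux of any width, because the width comes from the declarations of a, b and muxout rather than from the assignment itself.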
7.2.2 One hot using continuous assignment
The wires that implement the combinational logic of a one hot circuit can be described
with continuous assignment. This is done as a notational convenience because
continuous assignments are equivalent to structural instances but are much more concise.
For example, the adder and mux in the last section could have been of any width, but the
syntax of the actual continuous assignment would have been the same. Synthesis tools
available from many different vendors are able to translate continuous assignments
into the structural instances (netlist) required to fabricate hardware. Each flip flop
required by the one hot circuit will be described by a separate one-bit reg. Such regs
are also synthesizable. The names of these wires and regs will relate to the statement
numbers of the lines in the Verilog always block from which they derive.
7.2.2.1 One hot with if else
The following example Verilog (taken from section 3.8.2.3.3) illustrates implicit style
behavioral Verilog with an if else statement. In this example @(posedge sysclk)
#1 occurs on lines 3, 5, 9, 14 and 17, so the names of the five flip flops for the one hot
controller will be ff_3, ff_5, ff_9, ff_14 and ff_17:
 1: always
 2:  begin
 3:   @(posedge sysclk) #1;      //FIRST is ff_3
 4:    a <= @(posedge sysclk) 1;
 5:   @(posedge sysclk) #1;      //SECOND is ff_5
 6:    b <= @(posedge sysclk) a;
 7:    if (a == 1)
 8:     begin
 9:      @(posedge sysclk) #1;   //THIRD is ff_9
10:       a <= @(posedge sysclk) b;
11:     end
12:    else
13:     begin
14:      @(posedge sysclk) #1;   //FOURTH is ff_14
15:       b <= @(posedge sysclk) 4;
16:     end
17:   @(posedge sysclk) #1;      //FIFTH is ff_17
18:    a <= @(posedge sysclk) 5;
19:  end
It is easier to give each flip flop a name that relates to what statement number the
@(posedge sysclk) #1 occurs on than to use the name from the original ASM.
The reason that we do not use the names FIRST, SECOND, THIRD, FOURTH and
FIFTH for the flip flops is that those names were inside comments, which are ignored
by Verilog. (The reason we do not use the enter_new_state task (section 3.9.1.2
and chapter 4) to give each state a name is that the VITO preprocessor does not support
tasks.) The example in section 7.1 of translating from an ASM to a one hot circuit used
the state names given in the ASMs as the names of the flip flops only for the purpose of
illustrating the nature of the one hot method. Since this translation will now be
automated, the names do not matter. In the automated process, the designer will seldom
notice what name is given to each wire.

Every statement also has a wire associated with it that is active when the corresponding
statement executes. (In the earlier example, these names were also the original
ASM state names. In general this need not be the case, and so separate names are
appropriate for an automated tool.) In this example, there are nineteen wires (s_1
through s_19) that correspond to statements in the original implicit style Verilog code.
s implicit style Of these wires, five act as command signals to the architecture:
Ige sysclk)
for the one hot
wire action in architecture when wire is active
s_4 a <= @(posedge sysclk) 1;
s_6 b <= @(posedge sysclk) a;
s_10 a <= @(posedge sysclk) b;
s_15 b <= @(posedge sysclk) 4;
s_18 a <= @(posedge sysclk) 5;
The other wires (s_1, s_2, s_3, s_5, s_7, s_8, s_9, s_11, s_12, s_13, s_14,
s_16, s_17 and s_19) are used to define the rest of the one hot controller. Some of
those wires are synonymous with each other. For example, s_11 (an end statement)
is synonymous with the s_10 wire that precedes it.

Although there are nineteen wire names in the one-hot controller, only the above five
are sent to the architecture. Using the methodical approach (such as in sections 2.3.1
and 8.4.1) for designing the architecture, we sort the above list according to the
left-hand side of the <=, and separate them according to these destinations:

  s_4     a <= @(posedge sysclk) 1;
  s_10    a <= @(posedge sysclk) b;
  s_18    a <= @(posedge sysclk) 5;

  s_6     b <= @(posedge sysclk) a;
  s_15    b <= @(posedge sysclk) 4;
All of the actions normally encapsulated inside a register building block (of the kind
described in appendix D) now have to be given with the combinational logic that
computes new_a and new_b. From the sorted list above, one approach would be to use
three muxes for computing new_a and two muxes for computing new_b:

  assign new_a =
      (s_4)  ? 1 :
      (s_10) ? b :
      (s_18) ? 5 :
               a;
  assign new_b =
      (s_6)  ? a :
      (s_15) ? 4 :
               b;
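To see how these mux expressions fit together with the registers they feed, the following sketch wraps them in a small architecture module. It is an illustration only: the module name is hypothetical, the 12-bit width is an assumption borrowed from the other examples in this chapter, and clocking a and b directly from new_a and new_b is one plausible arrangement rather than the preprocessor's actual output.

```verilog
// Hypothetical sketch: registers a and b loaded every clock cycle
// from the mux chains selected by the one hot command wires.
module arch_sketch(a,b,s_4,s_6,s_10,s_15,s_18,sysclk);
  output [11:0] a,b;
  input s_4,s_6,s_10,s_15,s_18,sysclk;
  reg [11:0] a,b;

  wire [11:0] new_a = (s_4)  ? 1 :
                      (s_10) ? b :
                      (s_18) ? 5 :
                               a;    // no command hot: a holds its value
  wire [11:0] new_b = (s_6)  ? a :
                      (s_15) ? 4 :
                               b;    // no command hot: b holds its value

  always @(posedge sysclk)
    begin
      a <= new_a;
      b <= new_b;
    end
endmodule
```

The final arm of each chain feeds the register its own output, which plays the role of the disabled load input on an enabled register.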
Notice that the architecture was created by a textual transformation of the original
Verilog. The block diagram given above was shown only as an aid to understand how
the Verilog continuous assignment works. The preprocessor produces similar Verilog
by just rearranging the original text of the Verilog.

Having defined the architecture, all that remains is to define the controller. The
following circuit diagram shows how each implicit style behavioral statement translates
into hardware:
 1: always
 2:  begin
 3:   @(posedge sysclk) #1;      //FIRST is ff_3
 4:    a <= @(posedge sysclk) 1;
 5:   @(posedge sysclk) #1;      //SECOND is ff_5
 6:    b <= @(posedge sysclk) a;
 7:    if (a == 1)
 8:     begin
 9:      @(posedge sysclk) #1;   //THIRD is ff_9
10:       a <= @(posedge sysclk) b;
11:      @(posedge sysclk) #1;   //FOURTH is ff_11
12:       b <= @(posedge sysclk) 4;
13:     end
14:   @(posedge sysclk) #1;      //FIFTH is ff_14
15:    a <= @(posedge sysclk) 5;
16:  end
The if statement translates to a demux whose input, qual_7, comes from the
comparator for statement 7 that implements the condition a == 1. The 1 output, sT_7,
corresponds to when this condition is true at the $time the if executes. The 0 output,
s_7, corresponds to when this condition is false at the $time the if executes.

The preprocessor generates the following one hot controller. In the following, only
some of the wire names are shown:
 1: always
 2:  begin
 3:   @(posedge sysclk) #1;      //FIRST is ff_3
 4:    a <= @(posedge sysclk) 1;
 5:   @(posedge sysclk) #1;      //SECOND is ff_5
 6:    b <= @(posedge sysclk) a;
 7:    while (a == 1)
 8:     begin
 9:      @(posedge sysclk) #1;   //THIRD is ff_9
7.4.1 Example to illustrate the technique
As an example to explain this technique, consider the following machine that asserts an
external command signal, comm, when the machine is in state BOT:

As explained in section 3.9.1.2, the comm = 0 statement can be hidden inside the
enter_new_state task so that 0 is the default value for comm:
always
  begin
    @(posedge sysclk) enter_new_state(`TOP);
     if (pb)
       begin
         @(posedge sysclk) enter_new_state(`BOT);
          comm = 1;
       end
  end

task enter_new_state;
  input [`NUM_STATE_BITS-1:0] this_state;
  begin
    present_state = this_state;
    #1 comm = 0;
  end
endtask
Since the VITO preprocessor only allows <=, we need to describe a machine without
using = whose behavior will be identical to the above after the first clock cycle.

One of the essential ideas used throughout this entire book is the meaning of the
non-blocking assignment. It computes a value now but assigns that value to a register at
the next rising edge of the clock. Since the above Verilog is a Moore machine, the
command is synonymous with the machine being in a particular state. As described in
sections 2.4 and 4.4.5, such Moore commands can be generated by combinational logic
that is part of the next state logic:
Although Mealy machines must be defined using the above, we can look at a Moore
machine such as this example in a different way. We know what the next state is going
to be, and we know that there is a command synonymous with being in that next state.
Instead of using combinational logic for the current command as we have done in
previous chapters, we can instead use a register that will contain the next command:
Figure 7-11. Next command approach suitable only for Moore controller.
Likewise, the machine only schedules the assignment of 0 to comm when the machine
is already on the path where the next state will be state TOP. The next command will be
0 only when we know that the next state will be state TOP. There are of course two
ways that we could know that the next state is state TOP: conditionally in state TOP
when pb == 0, and unconditionally in state BOT. The non-blocking assignment,
comm <= 0, only has to be described once because it was given after the if statement.

By rearranging the comm <= @(posedge sysclk) 0 to the top of the always
loop, the following is identical to the original Moore ASM, including the first clock
cycle (see footnote 2):
always
  begin
     comm <= @(posedge sysclk) 0;
    @(posedge sysclk) #1;        //TOP
     if (pb)
       begin
          comm <= @(posedge sysclk) 1;
         @(posedge sysclk) #1;   //BOT
       end
  end
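To watch the next command register at work, the always block above can be dropped into a small simulation. The following is a sketch under stated assumptions: the module name, the clock period of 100 and the pb stimulus times are all hypothetical, and $monitor is used simply to print comm and pb whenever either changes.

```verilog
// Hypothetical test sketch for the next command approach.
module next_command_sketch;
  reg sysclk, pb, comm;

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;     // assumed clock period of 100

  always
    begin
       comm <= @(posedge sysclk) 0;
      @(posedge sysclk) #1;        //TOP
       if (pb)
         begin
            comm <= @(posedge sysclk) 1;
           @(posedge sysclk) #1;   //BOT
         end
    end

  initial
    begin
      pb = 0;
      $monitor("%0d comm=%b pb=%b", $time, comm, pb);
      #220 pb = 1;                 // press the button while in state TOP
      #100 pb = 0;                 // release it one clock period later
      #300 $finish;
    end
endmodule
```

Each non-blocking assignment is scheduled in the state before the one in which comm should have that value, so comm changes at the same rising edge at which the machine enters the corresponding state.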
2 Some synthesis tools might not produce a reliable circuit under these circumstances, and so the former
method (assigning 0 to comm at the bottom of the always) might be preferred. The latter Verilog code is
logically correct, but its physical implementation may be unreliable, depending on the clock frequency and
the physical properties of the technology.
3 Actually, the signal that clears the register is s_3, which is the OR of s_999 and the wire from the
bottom of the always loop.
module slowdivctrl(pb,ready,aluctrl,muxctrl,ldr1,
                   clrr2,incr2,ldr3,r1gey,sysclk);
  input pb,r1gey,sysclk;
  output ready,aluctrl,muxctrl,ldr1,clrr2,incr2,ldr3;
Using the simulation technique of section 5.4.1, this could be translated to Verilog as
follows:
always
  begin
    @(posedge sysclk) enter_new_state(`OUTSIDE);
     count <= @(posedge sysclk) 0;
     //the loop is entered unconditionally the first time, because
     //present_state is not yet `BOT when the while is first tested
     while (count != 4 | present_state !== `BOT)
       begin
         @(posedge sysclk) enter_new_state(`TOP);
         @(posedge sysclk) enter_new_state(`BOT);
          count <= @(posedge sysclk) count + 1;
       end
  end
The above simulates correctly. However, the above cannot be synthesized into a one
hot machine using the VITO preprocessor.

An alternative way to implement a bottom testing loop is to use a forever statement
with a disable statement inside. Using a disable statement requires an extra block
that has a label to surround the forever. The forever by itself would never exit,
and so the disable statement causes a goto the end that matches the labeled
begin. For example, the above ASM could be translated into:
always
  begin
    @(posedge sysclk) #1;               //OUTSIDE
     count <= @(posedge sysclk) 0;
     begin : looplab
       forever
         begin
           @(posedge sysclk) #1;        //TOP
           @(posedge sysclk) #1;        //BOT
            count <= @(posedge sysclk) count + 1;
            if (count==4)
              begin
                disable looplab;
              end
         end
     end
  end
The above works correctly with the VITO preprocessor. However, this will not
simulate properly on most Verilog simulators because the disable statement will also
disable the non-blocking assignment. Putting #1 in front of the disable may help on
some simulators, but on many simulators there seems to be no way to use disable in
this way properly. Therefore, the Verilog you choose for a bottom testing loop will be
quite different if you want to simulate than if you want to synthesize. This book has
avoided using bottom testing loops in most examples in order that simulation may
agree with synthesis, but there are situations where hardware designers prefer bottom
testing loops.
7.6 Conclusion
One hot encoding provides a more natural way of translating complex algorithms into
hardware than the binary encoded approach described in earlier chapters. Because of
this, a preprocessor tool is available that directly translates an algorithm written in
implicit style behavioral Verilog into a one hot circuit. There is a one to one mapping
between the Verilog (or the equivalent ASM) and the one hot controller.

There are graphical software tools that can automatically translate an ASM chart into
Verilog, but the use of such tools is often ill advised. The use of such tools locks the
designer into a proprietary file format. Although manually drawn ASM charts are
useful to a designer in the early stages of design, they lack the expressive power of Verilog
to hide the details of a design with good notation. Instead, this book uses graphical
ASM charts only as the master plan for the design. The details of the design occur in
textual form as implicit style behavioral Verilog. With one of several commercial
synthesis tools and perhaps the VITO synthesis preprocessor described in appendix F
(that uses the one-hot techniques given in this chapter), implicit style Verilog alone is
often enough to create operational hardware.
The central concept of this book is that algorithms can be described using pure behav-
ioral ASM charts (with RTN) or the equivalent pure behavioral Verilog (with implicit
style whiles and ifs together with the non-blocking assignment). This approach is
different than traditional software because of the potential for parallel processing and
because of the idea of the system clock. This approach is different than traditional
hardware because of the emphasis on algorithms and behavior. Such implicit style
behavioral Verilog algorithms (or their equivalent ASM charts) describe in an abstract
fashion the operations carried out by some specific synchronous architecture. Chapters
4 and 5 show how you can manually design such architectures using Verilog, but the
Verilog to one hot preprocessor (explained in this chapter) eliminates the need for such
manual translation.
7.8 Exercises
7-1. Draw a circuit diagram for a one hot controller corresponding to the Moore ASM
given in section 2.2.4. Label the output of each flip flop with the name of the state.
Assume the command and status signals of the architecture are the same as in sections
2.3.1 and 4.2.3.
7-2. Draw a circuit diagram for a one hot controller corresponding to the Mealy ASM
given in section 5.2.4. Label the output of each flip flop with the name of the state.
Assume the command and status signals of the architecture are the same as in sections
2.3.1 and 4.2.3.
7-3. Draw a block diagram using muxes and combinational logic which is equivalent to
the following continuous assignment (assume that the 12-bit a and b are defined elsewhere):
    wire [11:0] new_a;
    assign new_a =
        (s_10) ? a+b :
        (s_20) ? 2*a-b : a;
7-4. Rewrite the pure behavioral Verilog of section 4.1.3 into the implicit style form
suitable for the VITO preprocessor. (Eliminate the enter_new_state task and convert
ready to <= as described in section 7.4.2.) Use the preprocessor to produce the
continuous assignments that are equivalent to the one hot design. Draw a circuit diagram
for the one hot controller labeled with the names used in the output of the preprocessor.
Also draw a block diagram for the architecture constructed only with combinational
logic, muxes and simple (non-enabled) D-type registers corresponding to the ?:
in the output of the preprocessor.
7-5. Rewrite the pure behavioral Verilog of section 5.4.2 into the implicit style form
suitable for the VITO preprocessor. (Eliminate the enter_new_state task and convert
ready to <= as described in section 7.4.2. Also, use the disable statement in a
different way than was described in this chapter.) Use the preprocessor to produce the
continuous assignments that are equivalent to the one hot design. Draw a circuit diagram
for the one hot controller labeled with the names used in the output of the preprocessor.
Also draw a block diagram for the architecture constructed only with combinational
logic, muxes and simple (non-enabled) D-type registers corresponding to the ?:
in the output of the preprocessor.
General-PurposeComputers 277
Governments on both sides during World War II focused more resources on the design
of computers than had ever occurred before. Although at first many such machines
were justified because they solved some important special-purpose problem, such as
ballistics, the huge expense required to build and maintain such machines motivated
several independent groups to design machines that could be reused to solve different
problems. These wartime machines were not fully general-purpose in the modern sense,
but were programmable via punched tape. The tape moved in one direction past a
reader, and the holes told the machine what to do. Although on most such machines
looping was not possible (because the tape moved in only one direction) and self-modification
was not possible (because once a tape was punched, it could not be
repunched), such machines made it easy to change the program by changing the tape.
Konrad Zuse filed a patent in Germany in 1936 on such a tape-controlled machine and
built several versions of this machine, the first of which became operational in 1941.
Colossus, in fact, was tape-controlled due to the flexibility required by British mathematicians
(including Turing) who sought to break ever-changing German codes. George
R. Stibitz and others at Bell Labs built several tape-controlled relay computers, some
of which remained in use for over a decade after the war. In 1943 Howard Aiken and
others at Harvard, with the help of engineers from IBM, built the tape-controlled Harvard
Mark I, which was used by U.S. Navy personnel (including the later-to-become-famous
Admiral Grace Hopper). Near the end of the war, IBM started to build the SSEC, which
was unique among the tape-controlled computers of the war because it had some limited
ability for the type of self-modification alluded to by Babbage and Turing (and was
therefore almost a true general-purpose computer).
John P. Eckert, John W. Mauchly and others in the Moore School at the University of
Pennsylvania built ENIAC from 1943 to 1945 for ballistic computations required by
the U.S. Army. It was the largest computer built during the war, constructed with an
order of magnitude more vacuum tubes (nearly 20,000) than any of the other machines.
Unlike other machines of the era, it was not programmed via a tape, but instead it had
to be rewired (via a plugboard) to solve a different problem. (Designing a "program"
for the ENIAC was similar to designing the controller and architecture as illustrated in
chapter 2.) This made ENIAC far more specialized and inconvenient than the tape-controlled
machines. Recognizing this inconvenience, people at the Moore School (notably
John von Neumann) proposed building EDVAC, which would represent programs
in the same memory as data, rather than on tape or with a plugboard.
Although EDVAC was not the first general-purpose computer to become operational,
von Neumann's 1945 proposal was profoundly influential. To this day, his name is
synonymous with general-purpose computers that store their programs in the same
memory as their data and that use what we now call the fetch/execute algorithm. The
3 The memory we are talking about can be used both to store and retrieve bits. It is referred to as "RAM" by
some people.
means the address is 0123₈ == 000001010011₂ == 83 and the contents is 4567₈ ==
100101110111₂ == 2423, or more succinctly in array notation, memory[83] ==
2423. We will sometimes abbreviate even further to say m[83] == 2423.
There are five independent issues that can be used to categorize memory: unidirectional
versus bidirectional (section 8.2.2.1), deterministic versus non-deterministic access
time (section 8.2.2.2), synchronous versus asynchronous (section 8.2.2.3), static
versus dynamic (section 8.2.2.4) and volatile versus non-volatile (section 8.2.2.5).4
Figure 8-2. Symbol for memory with unidirectional data buses (inputs addr, din and other input(s); output dout).
A bidirectional bus is one that is used to send information two ways. In the following
diagram, a bidirectional bus is indicated by an arrow that points both ways. In this case,
bidirectionality allows combining the din and dout buses into a single data bus as
illustrated in the following:
(Figure: symbol for memory with addr, other input(s) and a single bidirectional data bus.)
4 A different issue related to memory (not discussed in this chapter) is how many ports the memory has.
Multi-ported memory is discussed in section 9.8, but in this chapter all memory is assumed to have only one
read port and one write port, as illustrated in the following sections.
The advantage of bidirectionality is that there are fewer wires connecting the memory
device to the rest of the system; however, interfacing to such a device is more complicated
because it requires the use of tri-state buffers. Except for such tri-state buffers, the
internal structure of this memory is identical to that of a memory with separate din
and dout buses. We will not consider memory devices with bidirectional buses in this
chapter.
Figure 8-4. Symbol for synchronous memory.
There are two things that the device does. Given enough time, the output of the device
reflects the contents of the memory at the word indicated by the address bus. In other
words, neglecting the propagation delay (i.e., neglecting the access time),
    dout = memory[addr]
The second thing that the memory can do is based on the ldmem command signal. On
the next rising edge after ldmem becomes true, the word in memory indicated by the
address bus changes to become the value of the data input bus,
    memory[addr] ← din
Note that almost instantly after this change takes effect, dout will also change. At
most one word in memory can be changed in one clock cycle when a memory is single-ported.5
If ldmem is not true, memory remains unchanged.
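The two behaviors just described can be sketched in behavioral Verilog. This is only an illustration, not the book's own model: the module name sync_mem and the parameters D and A (with assumed defaults of 12 bits each) are hypothetical, while sysclk, ldmem, addr, din and dout follow the names used in the text.

```verilog
// Behavioral sketch of a single-ported synchronous memory.
// The name sync_mem and the default sizes are assumptions for illustration.
module sync_mem(sysclk, ldmem, addr, din, dout);
  parameter D = 12;              // bits per word (assumed)
  parameter A = 12;              // address bits (assumed)
  input          sysclk, ldmem;
  input  [A-1:0] addr;
  input  [D-1:0] din;
  output [D-1:0] dout;

  reg [D-1:0] memory [0:(1<<A)-1];

  // dout = memory[addr], neglecting the access time
  assign dout = memory[addr];

  // memory[addr] <- din at a rising edge of sysclk when ldmem is true;
  // when ldmem is not true, memory remains unchanged
  always @(posedge sysclk)
    if (ldmem)
      memory[addr] <= din;
endmodule
```

Because dout is a continuous assignment, it follows the newly written word as soon as the write takes effect, which matches the remark that dout changes almost instantly after the write.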
A synchronous memory device can be thought of as a bank of registers. Although it is
not usually the most efficient way to build a memory, the following diagram shows a
structure that achieves this goal:
5 Chapters 9 and 10 describe multi-ported memories that allow more than one memory operation per clock
cycle.
Figure 8-6.
Here the write signal combines the roles of the ldmem signal and the sysclk signal.
Asynchronous memory may also have a bidirectional data bus instead of two unidirectional
buses.
In general, asynchronous design is highly unsafe, and should only be attempted by
expert designers. Proper asynchronous design involves consideration of much lower-level
(electronic) details than is the case for synchronous design. With the introduction of
HDLs, the vast majority of design (such as CPUs) in industry is synchronous because
synchronous designs are much more likely to be synthesized correctly. Asynchronous
design is beyond the scope of this text, and so we will not consider the internal structure
that implements this memory (although it is similar in concept to the diagram in
8.2.2.3.1).
Fortunately, since memory is such an important commodity, electronic experts have
hidden most of the asynchronous ugliness inside commonly available memory chips.
To cope safely with such devices, there are only three extra things that the designer has
to do:
1. Choose a conservative clock speed for the rest of the system relative to the access
time of the memory. In other words, the access time of the memory should be a
small fraction of the clock period. Some memories have a different time for read
and write, and so you should choose the larger of these.
2. Hold addr and din constant for at least one clock cycle before and during the
cycle in which write is active. This means both addr and din should come from
registers in the architecture that are not changed during this time.
As the ABC, Colossus and ENIAC illustrate, vacuum tube technology was available by
the start of World War II to implement CPUs and peripherals. The technological problem
from the late 1930s until the early 1950s was memory. Although the pioneers were
aware of techniques like those of section 8.2.2.1.1 and used small memories of this sort for data
access, the cost of storing programs in such memory was prohibitive. Currently, memories
of this kind are commonly used, but not as the primary memory for stored program
computers. It takes about six switching devices (relays, vacuum tubes, transistors or
whatever technology is in vogue) to construct an enabled flip flop, so for a memory with
a address bits and d data bits per word it would take
6*d*2^a switching devices to build the registers. It takes approximately a*2^a switching
devices to construct the demux, and d*a*2^a switching devices for the mux. This
makes the total about ((a+6)*d + a)*2^a switching devices to construct a working
memory unit along the lines shown in section 8.2.2.3.1.
For Williams and Kilburn's 32 word memory (which, even in 1948, was considered too
small for practical programming), this would require ((5+6)*32 + 5)*32 = 11,424 vacuum
tubes, which is more than an order of magnitude more tubes than was required for their
entire CPU. (The ENIAC used about 20,000 vacuum tubes because it stored all its data
in vacuum tubes. Also, storing a program in vacuum tubes would have been unrealistic.)
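The device count above can be checked with a short simulation-only sketch. The module and function names (device_count, devices) are hypothetical; the formula is the one given in the text, with a address bits and d data bits per word, and the values a = 5 and d = 32 are the ones used in the Williams and Kilburn example.

```verilog
// Sketch: evaluate ((a+6)*d + a) * 2^a for a register-bank memory.
// Names are illustrative assumptions, not taken from the book.
module device_count;
  function integer devices;
    input integer a, d;
    begin
      // registers: 6*d*2^a   demux: a*2^a   mux: d*a*2^a
      devices = ((a + 6) * d + a) * (1 << a);
    end
  endfunction

  initial
    // a = 5 address bits (32 words), d = 32 bits per word
    $display("a=5, d=32: %0d switching devices", devices(5, 32));
endmodule
```

With these arguments the sketch should report 11424, the same figure as the hand calculation.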
In order to build their machines, the pioneers had to invent technologies for memory
that were more efficient and reliable than simple vacuum tubes. Zuse invented a binary
mechanical memory. Atanasoff invented a rotating drum using capacitors (which is
conceptually similar to the dynamic memory chips in widespread use today). Although
creative, neither of these technologies would have been reasonable for a general-purpose
computer in the 1940s.
The breakthrough came when Williams and Kilburn invented a TV-like tube for storing
bits in the Mark I. Using the terminology defined above, the Williams tube was bidirec-
tional, asynchronous, dynamic and volatile. Most importantly, the Williams tube was
the first affordable technology that had the same kind of deterministic access time
6 For example, modern Verilog simulators and synthesis tools are possible only because of large and fast
general-purpose computers.
What is certain is that the cost, speed and capacity of integrated circuit memory have
improved radically in the last quarter century. It is likely these improvements will continue
well into the 21st century. That these technological factors have improved exponentially
is in large part responsible for the success of the general-purpose computer,
which needs to store both its programs and its data in memory.
8.3 Behavioral fetch/execute
The three components of a general-purpose computer described in section 8.2 (CPU,
peripherals and memory) act together as a unified system that implements the fetch/execute
algorithm. This section describes how to model the behavior of this unified
system with an ASM chart, without regard to the structural interconnection of the hardware.
This section explains what is referred to in chapter 2 as the "pure" behavioral
stage of the top-down design process. Later, in section 8.4, the "mixed" stage of the
top-down design process shows some of the structural interconnections for the CPU
and memory.
This section focuses on the algorithm that makes the general-purpose computer possible:
fetch/execute. Although the details of the fetch/execute algorithm vary widely among the
thousands of general-purpose machines designed and built since 1948, the fundamental
operations of the fetch/execute algorithm have remained essentially the same:
1. Fetch the current instruction from memory
2. If needed, fetch data from memory
3. Prepare for fetching the next instruction
4. Decode and execute the current instruction
   a) Interpret what the current instruction means
   b) Carry out the operation asked for by the current instruction, possibly
      modifying memory
Steps 2 and 4 have details that are machine specific. It may be possible to rearrange the
order in which steps 2, 3 and 4 occur, depending on these machine-specific details.
A general-purpose computer can modify its instructions without programmer intervention
because it uses the same memory to store instructions as it uses to store data. In
other words, it can treat instructions as though they were data. This characteristic of
universal machines, known as self-modification, is difficult for programmers to use
effectively. However, this capability is the key to the success of the general-purpose
computer. The ability for self-modification allows software (known as compilers and
assemblers) to translate programs automatically from an easy to understand high-level
language (C, Java, Pascal, Verilog, etc.) to the much more tedious machine language
that is specific to the hardware.
For readers without intimate experience with low-level programming, appendix A gives
a short introduction to machine and assembly language (and how they relate to high-level
language) using an example of adding three numbers. This example will also be
used in later sections of this chapter.
Some of the registers are specified by the specific instruction set. The details of these
registers are machine dependent. In the case of the PDP-8, the 12-bit accumulator, ac,
is the primary register that the machine language programmer uses. (There are a few
other registers, such as the link, that are specific to the PDP-8. As was done in appendix
A, we will ignore these for the moment in order to keep this example simple.) Other
machines, such as the Pentium, have different registers that the programmer can manipulate.
We refer to the registers that are visible to the programmer as the programmer's
model. Some people refer to these as the computer architecture; however, we do not use
this term since the registers in the programmer's model are not everything contained in
the internal architecture of the CPU.
In addition to the registers required to implement a specific machine language, the
fetch/execute algorithm requires the hardware to have two registers: the program counter,
pc, and the instruction register, ir. Typically, the pc contains the address in memory
of the next instruction to execute, and the ir contains the current instruction which is
about to execute. If the machine did not have an HLT instruction, the machine would
simply loop forever doing the four phases of the algorithm:
1. fetch the instruction from m[pc] into ir
2. calculate the effective address
3. increment the pc (prepare for fetch of next instruction)
4. decode and execute the instruction in the ir
where m refers to the memory array. Most machines, including the PDP-8, have some form
of HLT instruction. In order to keep track of whether the machine has halted or not,
there needs to be an additional one-bit register, halt. When the machine has not executed
an HLT instruction, halt is 0. When the machine has just executed an HLT
instruction, halt becomes 1. The fetch/execute algorithm proceeds only when halt
is 0.
The machine needs a register to hold the effective address of data in memory to be
manipulated by an instruction. This register, which may be used for other purposes at
different times, is known as the memory address register, ma. It will be convenient to
have an additional register, known as the memory buffer register, mb, to contain the
data that was in memory at the effective address prior to the execution of the instruction.
In later stages of the top-down design process, it will be convenient to have ma as the
sole source providing the addr input to the memory device. At the "pure" behavioral
stage, we can ensure this is possible by restricting the use of the memory array. All
references to memory must be m[ma], rather than the somewhat more natural references,
m[pc]. Also it will be convenient to have mb as the sole source providing the
din input to the memory device. In the restricted behavioral ASM, the only way to
store something into memory is by saying m[ma] ← mb. This will require that the
behavioral ASM have states that initialize mb properly.
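As a rough illustration of this restriction, the following simulation-only sketch stores a value in two clock cycles: one state copies the value into mb, and the next state writes m[ma] from mb. The module name, the clock generator, the 4096-word memory size, and the address and data values (0111 and 0510 octal) are assumptions chosen for the example; only the names ma, mb, ac, m and sysclk come from the text.

```verilog
// Sketch: a store that obeys the restriction "memory is written only as m[ma] <- mb".
// Testbench details and the constants below are illustrative assumptions.
module restricted_store_sketch;
  reg        sysclk;
  reg [11:0] ma, mb, ac;
  reg [11:0] m [0:4095];            // 4096 twelve-bit words (assumed size)

  always #5 sysclk = ~sysclk;       // free-running clock for simulation only

  initial begin
    sysclk = 0;
    ma = 12'o0111;                  // hypothetical destination address
    ac = 12'o0510;                  // hypothetical value to be stored
    @(posedge sysclk) mb <= ac;     // first cycle:  mb <- ac
    @(posedge sysclk) m[ma] <= mb;  // second cycle: m[ma] <- mb
    @(posedge sysclk) $display("m[%o] = %o", ma, m[ma]);
    $finish;
  end
endmodule
```

The extra cycle spent loading mb is the price of having a single register drive the din input of the memory.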
The next state after state F2 is state F3A. State F3A fetches the instruction stored in
memory pointed to by the original program counter, which is now in the memory address
register. In the behavioral ASM, we use the same register transfer notation for
dealing with memory, ir ← m[ma], as is used for dealing with other registers. In later
stages of the design, the timing of memory may be somewhat more difficult than that
of the internal CPU registers, but at this early stage, we will ignore these details.
The next state after state F3A is state F3B. State F3B calculates the effective address
using the information in the instruction register. This calculation is denoted by a function
referred to as ea(ir). For example, if the ir is 1107₈, ea(ir) is 0107₈. If the
ir is 3111₈, ea(ir) is 0111₈. In the later stages of the top-down design process, the
ea function will be realized using combinational logic. Appendix A assumes there is
an additional register for the effective address, but this is not necessary here since state
F3B uses the existing ma register to hold the effective address. (To implement the
complete instruction set of the PDP-8 described in appendix B, more complicated combinational
logic is required for ea. This is because the ea function for some of the
addressing modes not implemented in this chapter needs an additional argument.)
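A minimal sketch of ea as combinational logic follows. It assumes that, for the subset used in this chapter, ea keeps only the low-order seven bits of ir and zero-extends them to 12 bits; that assumption is consistent with the two examples in the text (1107 gives 0107 and 3111 gives 0111, in octal) but it is not the full PDP-8 addressing logic. The module name is hypothetical.

```verilog
// Sketch of the ea function for the chapter's subset.
// Assumption: ea(ir) is the low-order seven bits of ir, zero-extended.
module ea_sketch;
  function [11:0] ea;
    input [11:0] ir;
    ea = {5'b00000, ir[6:0]};
  endfunction

  initial begin
    $display("ea(1107) = %o", ea(12'o1107));   // text gives 0107
    $display("ea(3111) = %o", ea(12'o3111));   // text gives 0111
  end
endmodule
```

Since the function has no state, it corresponds to wiring and zero constants rather than to any register, which is why state F3B can load ma from it in one clock cycle.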
State F3B has a decision to determine what state occurs next. If the instruction is a
Memory Reference Instruction (MRI), the next state after F3B will be F4A. If the
instruction is not an MRI, the next state after F3B will be one that implements the
operation requested by the instruction ("E"xecute it). Even though in this section the
only MRIs in our PDP-8 subset start with 1 and 3, we will describe how to test for any
MRI. In the complete PDP-8 instruction set (given in appendix B), an instruction is an
MRI if the high order octal digit (three bits) of the instruction is between 0 and 5
inclusive. There are several ways one could write this test. It could be written as
ir < 6000₈; however, this does not emphasize that only the high-order three bits of the
instruction register determine the outcome. The test could be written as ir/1000₈ < 6
or ir>>9 < 6 to emphasize that the outcome is based on the high-order three bits, but
neither one of these tests is the most succinct way to express this. We need a notation
that clearly says "just look at these bits." Although the material in this section does not
depend on any knowledge of Verilog, Verilog does indeed have such a bit selection
notation: ir[11:9] says form a three-bit value using bits 11 through 9 of ir, which
is roughly equivalent to (ir>>9) & 7, which in this case is equivalent to ir>>9
since ir is 12 bits. For example, if ir is 1107₈, ir[11:9] is 1. If ir is 3111₈,
ir[11:9] is 3. We will use this aspect of Verilog notation in our ASM because it
clearly documents what the hardware will do, which of course is the goal of a behavioral
ASM.
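The equivalence of these ways of writing the MRI test can be tried out with a small simulation-only sketch. The module and task names are hypothetical; the instruction values 1107, 3111 and 7200 (octal) are ones that appear in this chapter.

```verilog
// Sketch: four ways of writing the MRI test, evaluated side by side.
// Module and task names are illustrative assumptions.
module mri_test_sketch;
  reg [11:0] ir;

  task show;
    input [11:0] instr;
    begin
      ir = instr;
      $display("ir=%o  ir<6000:%b  ir/1000<6:%b  ir>>9<6:%b  ir[11:9]<6:%b",
               ir, ir < 12'o6000, (ir / 12'o1000) < 6, (ir >> 9) < 6, ir[11:9] < 6);
      // The bit select should agree with shift-and-mask for a 12-bit ir.
      if (ir[11:9] !== ((ir >> 9) & 7))
        $display("mismatch between bit select and shift-and-mask");
    end
  endtask

  initial begin
    show(12'o1107);   // high-order digit 1: an MRI
    show(12'o3111);   // high-order digit 3: an MRI
    show(12'o7200);   // high-order digit 7: not an MRI
  end
endmodule
```

All four tests give the same answer for each instruction; the bit select simply states most directly which bits the hardware examines.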
If the instruction in the ir is an MRI, the machine proceeds to state F4A. State F4A loads
mb with the data that the machine may need to use to execute the memory reference
instruction. For example, in the program of appendix A, after the memory reference
instruction 1107 has been fetched by state F3A, the machine (in state F3B) schedules 0107
to be loaded into ma at the next rising edge of the clock. Since ir[11:9] is 1 (which is
less than 6), at that next clock edge the machine proceeds to state F4A. In state F4A, ma
has just become 0107, and so the contents of memory at that effective address, m[ma], can be
loaded into mb. In this example, one clock cycle later (when the machine is in state
F4B), mb becomes 0152.
State F4B is not necessary. It is included here as a placeholder for operations needed in
the ASM to implement the complete instruction set of the PDP-8, including features
not described yet.
The bottom of the ASM has a series of decisions that determines which instruction is
currently in the instruction register.
y here since state This series of decisions is known as the decoding portion of the fetch/execute algo-
D implement the rithm. For MRI, the decoding decisions happen in state F4B. For non-MRI, the decod-
omplicated com- ing decisions happen in state F3B. Since we are implementing only four instructions in
for some of the this instruction subset, there are only four decisions required to decode these instruc-
d argument.) tions. The more complex the instruction set, the more difficult it is to do decoding.
Most machines, including the complete PDP-8 have a long string of decisions to imple-
^instruction is a ment decoding. Notice, from a high-level view, decoding occurs as a series of i f ...
I be F4A. If the else if ... else if ... style decisions, since each instruction has a unique
implements the machine language code.
rithis section the
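The if ... else if ... shape of decoding can be sketched as follows. This is a simplification under stated assumptions: it folds the MRI decisions (made in state F4B) and the non-MRI decisions (made in state F3B) into a single chain, the task name decode is hypothetical, and the HLT encoding 7402 is the standard PDP-8 code rather than a value quoted in this passage. The CLA encoding 7200 and the state names are the ones used in this chapter.

```verilog
// Sketch: decoding the four-instruction subset as one if / else if chain.
// In the ASM the MRI tests occur in state F4B and the non-MRI tests in F3B;
// they are merged here only to show the overall shape.
module decode_sketch;
  task decode;
    input [11:0] ir;
    begin
      if (ir[11:9] == 1)        $display("%o: TAD, go to E0TAD", ir);
      else if (ir[11:9] == 3)   $display("%o: DCA, go to E0DCA", ir);
      else if (ir == 12'o7200)  $display("%o: CLA, go to E0CLA", ir);
      else if (ir == 12'o7402)  $display("%o: HLT, go to E0HLT", ir);  // standard PDP-8 HLT code
      else                      $display("%o: not in the four-instruction subset", ir);
    end
  endtask

  initial begin
    decode(12'o1106);
    decode(12'o3111);
    decode(12'o7200);
    decode(12'o7402);
  end
endmodule
```

Each added instruction adds another branch to the chain, which is why decoding grows more difficult as the instruction set grows.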
The remaining states of the machine perform certain actions required during the execution
of each specific instruction.
If ir[11:9] is 1 in state F4B, the instruction is what the programmer calls a "Twos
complement ADd," and so the machine proceeds to state E0TAD. In this state, the
machine adds the data from memory at the effective address to the accumulator. In a
complete implementation of the PDP-8, other operations are involved with a TAD, but
we will ignore those details for the moment. After performing the addition in state
E0TAD, the machine has completely executed the TAD instruction and is ready to
fetch another instruction. Therefore, the next state after state E0TAD is state F1.
If ir[11:9] is 3 in state F4B, the instruction is what the programmer calls a "Deposit
and Clear Accumulator" (DCA) and so the machine proceeds to state E0DCA. Although
TAD and DCA are both MRIs, the operations involved for the DCA are more
complex because the DCA instruction stores the accumulator in memory and then clears
the accumulator. It takes three clock cycles to accomplish all the operations required by
the DCA instruction. State E0DCA occurs during the first of these three clock cycles.
The restrictions on the use of memory described at the end of section 8.3.1.4 require
anything that is to be stored in memory be placed in the memory buffer register. State
E0DCA schedules that the memory buffer register be assigned a copy of the value in
the accumulator at the next rising edge of the clock. This is done in preparation for the
next state, which is state E1ADCA.
IDLE   ma=0100 mb=???? pc=0100 ir=???? halt=0 ac=????
F1     ma=0100 mb=???? pc=0100 ir=???? halt=0 ac=????
F2     ma=0100 mb=???? pc=0100 ir=???? halt=0 ac=????
F3A    ma=0100 mb=???? pc=0101 ir=???? halt=0 ac=????
F3B    ma=0100 mb=???? pc=0101 ir=7200 halt=0 ac=????
E0CLA  ma=0000 mb=???? pc=0101 ir=7200 halt=0 ac=????
F1     ma=0000 mb=???? pc=0101 ir=7200 halt=0 ac=0000
In state F1, the program counter (0100) is saved in the memory address register. In state
F2, the program counter is scheduled to be incremented to become 0101 (as can be
seen in state F3A) in preparation for fetching the next instruction four clock cycles
later. In state F3A, the instruction register is scheduled to be loaded from memory
address 0100. In state F3B, this instruction (7200) becomes available in the instruction
register, and since ir[11:9] >= 6, the instruction decoding takes place. State
E0CLA schedules zero to be loaded into the accumulator. Having completed the fetching
and execution of the CLA instruction, the machine performs similar operations to
fetch the second instruction. This time, the program counter is 0101 in state F1. The
following shows how the fetching and execution of the second instruction proceeds:
7 In other chapters, a similar idea is denoted with the "x" value in Verilog.
F2     ma=0101 mb=???? pc=0101 ir=7200 halt=0 ac=0000
F3A    ma=0101 mb=???? pc=0102 ir=7200 halt=0 ac=0000
F3B    ma=0101 mb=???? pc=0102 ir=1106 halt=0 ac=0000
F4A    ma=0106 mb=???? pc=0102 ir=1106 halt=0 ac=0000
F4B    ma=0106 mb=0112 pc=0102 ir=1106 halt=0 ac=0000
E0TAD  ma=0106 mb=0112 pc=0102 ir=1106 halt=0 ac=0000
F1     ma=0106 mb=0112 pc=0102 ir=1106 halt=0 ac=0112
In state F2 the saved program counter (0101) is visible in the memory address register
at the same time the program counter is scheduled to be incremented to become 0102
(as can be seen in state F3A). In state F3A, the instruction register is scheduled to be
loaded from memory address 0101. In state F3B, this instruction (1106) becomes available
in the instruction register, but unlike the above non-MRI, instruction decoding
does not take place in state F3B. Instead, state F3B schedules the memory address
register to be loaded with the effective address (0106), derived from the instruction
register. Since ir[11:9] < 6 in state F3B, the machine proceeds to state F4A,
where the memory buffer register is scheduled to be loaded with the contents of memory
(0112) at that effective address, as can be seen in state F4B. In state F4B, instruction
decoding takes place. Since ir[11:9] == 1, the machine proceeds to state E0TAD,
where the 0112 in the memory buffer register is added to the zero in the accumulator.
The remaining two TAD instructions execute in a similar fashion:
F2      ma=0102 mb=0112 pc=0102 ir=1106 halt=0 ac=0112
F3A     ma=0102 mb=0112 pc=0103 ir=1106 halt=0 ac=0112
F3B     ma=0102 mb=0112 pc=0103 ir=1107 halt=0 ac=0112
F4A     ma=0107 mb=0112 pc=0103 ir=1107 halt=0 ac=0112
F4B     ma=0107 mb=0152 pc=0103 ir=1107 halt=0 ac=0112
E0TAD   ma=0107 mb=0152 pc=0103 ir=1107 halt=0 ac=0112
F1      ma=0107 mb=0152 pc=0103 ir=1107 halt=0 ac=0264
F2      ma=0103 mb=0152 pc=0103 ir=1107 halt=0 ac=0264
F3A     ma=0103 mb=0152 pc=0104 ir=1107 halt=0 ac=0264
F3B     ma=0103 mb=0152 pc=0104 ir=1110 halt=0 ac=0264
F4A     ma=0110 mb=0152 pc=0104 ir=1110 halt=0 ac=0264
F4B     ma=0110 mb=0224 pc=0104 ir=1110 halt=0 ac=0264
E0TAD   ma=0110 mb=0224 pc=0104 ir=1110 halt=0 ac=0264
F1      ma=0110 mb=0224 pc=0104 ir=1110 halt=0 ac=0510
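The three TAD traces can be cross-checked with a small Python sketch. Python is used
here only for illustration; it is not the Verilog of this book. The sketch assumes direct
page zero addressing, so the effective address is taken to be ir[6:0], and it models only
the 12-bit accumulator (the link is ignored). The function names and the dictionary that
stands in for memory are hypothetical.

```python
# Minimal sketch of the F1..E0TAD sequence for consecutive TAD instructions.
# Assumptions: direct page zero addressing (effective address = ir[6:0]),
# the link register is not modelled, and memory is a plain dictionary.

def effective_address(ir):
    """Return the page zero effective address held in ir[6:0]."""
    return ir & 0o177

def run_tads(memory, pc, count):
    """Execute `count` consecutive TAD instructions, starting with ac == 0."""
    ac = 0
    for _ in range(count):
        ma = pc                        # F1: save the program counter in ma
        pc = (pc + 1) & 0o7777         # F2: increment the program counter
        ir = memory[ma]                # F3A: load ir from memory
        if ir >> 9 != 1:               # ir[11:9] == 1 identifies TAD
            raise ValueError("only TAD is handled in this sketch")
        ma = effective_address(ir)     # F3B: effective address into ma
        mb = memory[ma]                # F4A: operand into mb
        ac = (ac + mb) & 0o7777        # E0TAD: add mb to the accumulator
    return ac, pc

# Program words and data taken from the traces above.
memory = {0o101: 0o1106, 0o102: 0o1107, 0o103: 0o1110,
          0o106: 0o112, 0o107: 0o152, 0o110: 0o224}
ac, pc = run_tads(memory, 0o101, 3)
print(f"ac={ac:04o} pc={pc:04o}")
```

Under these assumptions the final accumulator and program counter agree with the last
F1 line of the trace.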
The accumulator now contains the sum of the three numbers (0510). The following
shows the execution of the DCA instruction:
The value in the memory address register calculated by state F3B (this value, 0002,
becomes visible in state E0HLT) is irrelevant. In hardware, unnecessarily doing a harmless
computation sometimes is more efficient than having a decision avoid the computation
when it is unwanted.9 It does not slow the machine, and it is simpler always to
load these bits from the instruction register into the memory address register, even, as
in this case, when they are not needed. State E0HLT schedules the halt flag to become
zero, which causes the machine to go to F1 and then back to IDLE, where the
machine will stay (unless cont is pressed again).

8 Having the ASM proceed through F4A and F4B was unnecessary in this case since state E0DCA does not
use the value loaded into memory buffer register by state F4A in the same way E0TAD does. This is harmless
but slower than required and was done to simplify the explanation of state F3B.
9 As the last footnote indicates, whether it is efficient depends on whether extra states, like F4A, are involved
or not. Here there are no extra states involved.

8.3.2 Including more in the instruction set
The machine described by the ASM in section 8.3.1.5 is rather useless. It was presented
only to introduce the essential aspects of the fetch/execute algorithm. Rather than implement
a useless subset of instructions in hardware, let's include more of the PDP-8's
instructions in our instruction set. For the extended example in this section, we will
implement a machine that executes the following PDP-8 instructions:
mnemonic  octal      what the mnemonic stands for
          machine
          language

AND       0xxx       AND memory with accumulator
TAD       1xxx       add memory to accumulator (Two's Complement Add)
DCA       3xxx       Deposit accumulator in memory and Clear Accumulator
JMP       5xxx       goto a new instruction (JuMP)

CLA       7200       CLear Accumulator
CLL       7100       CLear Link
CMA       7040       CoMplement Accumulator
CML       7020       CoMplement Link
IAC       7001       Increment ACcumulator
RAL       7004       Rotate Accumulator and link Left
RAR       7010       Rotate Accumulator and link Right
CLACLL    7300       CLear Accumulator and CLear Link

SZA       7440       Skip next instruction if Zero is in Accumulator
SNA       7450       Skip next instruction if Non-zero value is in Accumulator
SMA       7500       Skip next instruction if Minus (negative) value is in Accumulator
SPA       7510       Skip next instruction if Positive (non-negative) value is in Accumulator
SZL       7430       Skip next instruction if Zero is in Link
SNL       7420       Skip next instruction if one (Non-zero) is in Link

HLT       7402       HaLT
OSR       7404       Or Switch "Register" with accumulator

These instructions are explained more fully in appendix B. The first four instructions
(AND, TAD, DCA and JMP) are memory reference instructions. As in section 8.3.1,
we will only consider direct page zero addressing.
The next eight mnemonics (CLA, CLL, CMA, CML, IAC, RAL, RAR, CLACLL)
describe instructions that manipulate the accumulator and the link registers without
referencing memory. (The link register was not considered in section 8.3.1.5 because it
requires some details that are discussed in section 8.3.2.2.) Although the full PDP-8
instruction set allows for 256 combinations of these operations, we will only consider
the eight listed here.
The skip instructions allow conditional execution of the following instruction. If the
condition is met, the following instruction does not execute. If the condition is not met,
the skip acts like a NOP.
The HLT instruction causes the machine to stop executing a program and instead proceed
to states that allow the machine to interface with its programmer. Unlike the example
in section 8.3.1.5, the ASM in this section will include interface states after the
HLT instruction that allow an arbitrary program to be loaded at an arbitrary address any
time the programmer wishes. The programmer communicates with the halted machine
using an external 12-bit input, sr. In the original PDP-8 documentation, the sr is
known as the switch "register"; however, sr is not a register. sr is an external input
bus, very much like the buses x and y in the division machine of chapter 2. In the
physical realization of the PDP-8, the sr is simply a set of twelve switches (one for
each bit).
The OSR instruction is an unusual kind of input instruction unique to the PDP-8, which
is ideal for our purposes in this section. The OSR instruction ORs input coming from
the external sr bus with the contents of the accumulator. This allows a discussion here
of software input without the need for machine language instructions 6xxx.
Even though the PDP-8 is one of the simplest instruction sets ever designed, and we
have still chosen to implement only about half of it, you may have a feeling of panic
about whether you will ever be able to design such a machine. Have faith: top-down
design will come to the rescue.
8.3.2.1 ASM implementing additional instructions
Here is the ASM for the improved machine that implements the instructions listed
above:
Figure 8-8. ASM implementing more instructions of the PDP-8.
       before                  after
link   ac     mb        link   ac     mb
0      0040   4001      0      4041   4001
0      4040   4002      1      0042   4002
1      0040   4003      1      4043   4003
1      4040   4004      0      0044   4004
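The before and after values in this table are consistent with treating state E0TAD as a
single 13-bit addition on the concatenation of the link and the accumulator. The following
Python sketch (illustrative only; the function name tad is hypothetical, and Python
stands in for the register transfer notation) reproduces each row under that assumption.

```python
# Sketch of {link,ac} <- {link,ac} + mb as a 13-bit addition.
# Assumption: link is 1 bit, ac and mb are 12 bits, and the sum wraps at 13 bits.

def tad(link, ac, mb):
    """Return (new_link, new_ac) after adding mb to the 13-bit {link,ac}."""
    total = (((link << 12) | ac) + mb) & 0o17777   # keep 13 bits
    return total >> 12, total & 0o7777

rows = [(0, 0o0040, 0o4001), (0, 0o4040, 0o4002),
        (1, 0o0040, 0o4003), (1, 0o4040, 0o4004)]
for link, ac, mb in rows:
    new_link, new_ac = tad(link, ac, mb)
    print(f"{link} {ac:04o} {mb:04o}  ->  {new_link} {new_ac:04o} {mb:04o}")
```

A carry out of the accumulator complements the link, which is why the second and
fourth rows change the link while the first and third do not.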
Since

{ac,link} == {ac[11],ac[10],ac[9],ac[8],ac[7],ac[6],ac[5],ac[4],ac[3],ac[2],ac[1],ac[0],link}

the register transfer {link,ac} <- {ac,link} of state E0RAL is a more succinct way to
describe 13 separate register transfers, each one bit wide:

link   <- ac[11]
ac[11] <- ac[10]
ac[10] <- ac[9]
ac[9]  <- ac[8]
  ...
ac[2]  <- ac[1]
ac[1]  <- ac[0]
ac[0]  <- link

Observe that, except for the link, the bits are shifted over one place to the left. The old
value of the link "rotates" around to the least significant bit of the accumulator. The
following table illustrates examples of what is in the link and the accumulator before and
after state E0RAL:

   before          after
link   ac      link   ac
0      1001    0      2002
0      2002    0      4004
0      4004    1      0010
1      1001    0      2003
1      2002    0      4005
1      4004    1      0011

Although the software uses of the previous instructions were fairly obvious, the RAL
instruction may seem a bit strange. In fact, RAL has two uses: arithmetic and logical.
The first three lines above illustrate the arithmetic use: if the programmer has previously
cleared the link, RAL is like multiplication by two (with overflow in the link).
The remaining examples above illustrate the logical use: to rearrange bits without
losing any information.
RAR is the inverse of RAL, and the concatenation notation makes this clear:

{ac,link} <- {link,ac}

In the following, the second, third and last three lines use data from the previous (RAL)
table to illustrate that RAR is the inverse of RAL (the rotates do not lose information,
they simply rearrange it):

   before          after
link   ac      link   ac
0      1001    1      0400
0      2002    0      1001
0      4004    0      2002
1      1001    1      4400
1      2002    0      5001
1      4004    0      6002
0      2003    1      1001
0      4005    1      2002
1      0011    1      4004

The first three examples above illustrate the arithmetic use of RAR: if the programmer
clears the link, RAR is like unsigned division by two (with the remainder in the link).

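The two rotate tables can be checked mechanically. The Python sketch below is only an
illustration (the function names ral and rar are hypothetical); it treats the link and the
accumulator as one 13-bit quantity, following the concatenation notation used for states
E0RAL and E0RAR.

```python
# Sketch of the 13-bit rotates on {link,ac}.
# Assumption: link is 1 bit and ac is 12 bits.

def ral(link, ac):
    """{link,ac} <- {ac,link}: rotate left one place."""
    value = (ac << 1) | link            # the 13-bit concatenation {ac,link}
    return value >> 12, value & 0o7777  # split it back into link and ac

def rar(link, ac):
    """{ac,link} <- {link,ac}: rotate right one place."""
    value = (link << 12) | ac           # the 13-bit concatenation {link,ac}
    return value & 1, value >> 1        # new link is old ac[0]; new ac is value[12:1]

for link, ac in [(0, 0o1001), (0, 0o2002), (0, 0o4004),
                 (1, 0o1001), (1, 0o2002), (1, 0o4004)]:
    l2, a2 = ral(link, ac)
    l3, a3 = rar(l2, a2)                # rotating back recovers the original
    print(f"{link} {ac:04o}  RAL-> {l2} {a2:04o}  RAR-> {l3} {a3:04o}")
```

Applying rar to the result of ral returns the starting values in every case, which is the
sense in which the rotates rearrange information without losing it.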
8.3.2.2.2 Additional non-memory reference instructions
If the instruction register is 7100 in state F3B, the instruction is what the programmer
calls a "CLear Link," and so the machine proceeds to state E0CLL, where zero is
assigned only to the link (the accumulator is left alone). If the instruction register is
7040 in state F3B, the instruction is what the programmer calls a "CoMplement Accumulator,"
and so the machine proceeds to state E0CMA, where ~ac is assigned only to
the accumulator (the link is left alone). If the instruction register is 7020 in state F3B,
the instruction is what the programmer calls a "CoMplement Link," and so the machine
proceeds to state E0CML, where ~link is assigned only to the link (the accumulator
is left alone).
If the instruction register is 7404 in state F3B, the instruction is what the programmer
calls "Or Switch Register," and so the machine proceeds to state E0OSR, where the
external sr input is ORed with the current value of the accumulator. Here is a typical
use of this instruction:

8.3.2.2.3 Additional memory reference instructions
There are six memory reference instructions in the instruction set of the PDP-8. Two of
these (TAD and DCA) were described earlier. Two of these (JMS and ISZ) are left as
exercises. The other two (AND and JMP) are described in this section.
If ir[11:9] is 0 in state F4B, the instruction is what the programmer refers to as
"AND," and so the machine proceeds to state E0AND. This state is similar to E0TAD,
except the AND instruction only changes the accumulator. (AND leaves the link register
alone.) Recall that & is the bitwise AND operator, and so the register transfer:

ac <- ac & mb

is equivalent to:

ac[11] <- ac[11] & mb[11]
ac[10] <- ac[10] & mb[10]
  ...
ac[0]  <- ac[0] & mb[0]

If ir[11:9] is 5 in state F4B, the instruction is what the programmer calls a "JuMP,"
and so the machine proceeds to state E0JMP. All general-purpose computers have some
kind of jump (sometimes known as branch) instruction. The purpose of a jump instruction
is to modify the program counter. The jump instruction allows high-level language
features (such as loops and decisions) to be translated into machine language.
Although the jump instruction of the PDP-8 is categorized as a memory reference instruction,
it does not actually reference memory. It simply takes the effective address
(from the memory address register) and uses this as the new value of the program
counter.

If ir[11:8] is 15 in state F3B, the instruction is one of the above skip instructions. If
the condition is met, the machine proceeds to state E0ASKIP, where the program counter
is scheduled to be incremented an extra time. If the condition is not met, the machine
proceeds to state E0BSKIP, where the machine leaves the program counter the way it
was.
The condition is determined by ir[6:3]. ir[3] is a bit that reverses the meaning of
the instruction; hence ir[3] is 0 for SMA, SZA and SNL, but ir[3] is 1 for SPA,
SNA and SZL. (If you think about it, you will realize SMA, SZA and SNL, respectively,
are the opposites of SPA, SNA and SZL.) ir[6] is 1 for SMA and SPA. ir[5]
is 1 for SZA and SNA. ir[4] is 1 for SNL and SZL. Therefore, the condition that
decides whether to proceed to state E0ASKIP is:

ir[3] ^ (ir[6]&ac[11] | ir[5]&(ac==0) | ir[4]&link)

where ^ is the exclusive OR, which reverses the meaning of the parenthesized expression
when ir[3] is one. Note: ac[11] is the "sign" bit of the accumulator (the bit
that indicates 12-bit negative twos complement values).

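As an illustrative check of this condition, the Python sketch below evaluates the same
expression for the six skip opcodes listed in the instruction table. It is a sketch only: the
function name skip_taken is hypothetical, and Python's bit operators stand in for the
notation used in the text.

```python
# Sketch of the skip condition ir[3] ^ (ir[6]&ac[11] | ir[5]&(ac==0) | ir[4]&link).
# Assumption: ir and ac are 12-bit values and link is a single bit.

def skip_taken(ir, link, ac):
    """Return True when the machine would proceed to state E0ASKIP."""
    def bit(n):
        return (ir >> n) & 1
    sign = (ac >> 11) & 1                       # ac[11], the sign bit
    cond = (bit(6) & sign) | (bit(5) & int(ac == 0)) | (bit(4) & link)
    return bool(bit(3) ^ cond)                  # ir[3] reverses the meaning

SZA, SNA, SMA, SPA, SZL, SNL = 0o7440, 0o7450, 0o7500, 0o7510, 0o7430, 0o7420

print(skip_taken(SZA, 0, 0))        # accumulator is zero
print(skip_taken(SMA, 0, 0o4000))   # accumulator is negative
print(skip_taken(SZL, 1, 0))        # link is one
```

Each opposite pair (SZA/SNA, SMA/SPA, SNL/SZL) differs only in ir[3], so for any
given link and accumulator exactly one instruction of the pair skips.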
As an illustration of how a programmer uses the skip instructions in conjunction with
the other instructions, consider the unsigned greater than or equal decision. Suppose
r1 (stored at 0032) and y (stored at 0101) are software variables that contain 12-bit
unsigned numbers. Should the high-level language programmer wish to test (in either
an if or a while) to see whether r1 is greater than or equal to y, there are several
equivalent ways to write this (given that the following is performed in 13-bit twos
complement arithmetic):

r1 >= y
r1 >= {0,y}
-{0,y} + r1 >= 0
{~0,~y}+1 + r1 >= 0

The last of these can be performed with the instructions described earlier. The final
signed 13-bit result in {link,ac} can be tested with the SZL instruction, as shown
below:
0014/7200     CLA
0015/7100     CLL      /{link,ac} = {0,0}
0016/1101     TAD y    /{link,ac} = {0,y}
0017/7040     CMA      /{link,ac} = {0,~y}
0020/7020     CML      /{link,ac} = {~0,~y}
0021/7001     IAC      /{link,ac} = {~0,~y}+1
0022/1032     TAD r1   /{link,ac} = {~0,~y}+1 + r1
0023/7430     SZL      /test whether {~0,~y}+1 + r1 >= 0
0024/5xxx     JMP xxx  /if {~0,~y}+1 + r1 < 0, goto xxx
    ...                /if {~0,~y}+1 + r1 >= 0, execute here
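The effect of this instruction sequence can be followed step by step in a short Python
sketch. It is an illustration under stated assumptions: {link,ac} is held in one 13-bit
integer, r1 and y are 12-bit unsigned values, and the function name r1_ge_y is
hypothetical.

```python
# Sketch: follow CLA, CLL, TAD y, CMA, CML, IAC, TAD r1, SZL on a 13-bit {link,ac}.
# Assumption: the variable la holds {link,ac}, with link as bit 12.

MASK13 = 0o17777

def r1_ge_y(r1, y):
    """Return True when SZL would skip the JMP, that is, when r1 >= y."""
    la = 0                        # CLA, CLL: {link,ac} = {0,0}
    la = (la + y) & MASK13        # TAD y:    {0,y}
    la ^= 0o07777                 # CMA:      {0,~y}
    la ^= 0o10000                 # CML:      {~0,~y}
    la = (la + 1) & MASK13        # IAC:      {~0,~y}+1
    la = (la + r1) & MASK13       # TAD r1:   {~0,~y}+1 + r1
    link = la >> 12               # sign bit of the 13-bit result
    return link == 0              # SZL skips when the link is zero

print(r1_ge_y(14, 7), r1_ge_y(6, 7), r1_ge_y(7, 7))
```

The link ends up as the sign of the 13-bit difference, so testing it with SZL is
equivalent to the unsigned comparison of the two 12-bit values.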
Figure 8-10. System composed of processor (controller and architecture) with
memory as a separate actor.
Let's assume that we will implement this machine using an asynchronous, volatile,
static, deterministic access time memory with separate data input and data output. The
choice of this kind of memory device simplifies the design in several ways. First, since
this memory is static, there is no need to refresh it. Second, since the access time is
known, proper functioning is easily guaranteed by choosing a sufficiently slow clock.
Third, since this memory has separate buses for data input and data output, there is no
need to introduce the complexity of tri-state buffers.
The one design complexity that must be dealt with is the fact that this memory is asynchronous.
The reason for choosing an asynchronous memory device is cost and availability.
The problem with doing so is that extra care must be taken in providing the

8.3.2.4.2 Pure behavioral ASM with memory as a separate actor
The ASM of section 8.3.2.1 can be rewritten to reflect that memory is a separate actor.
Every place (states F3A and F4A) where m[ma] is used on the right-hand side of a
register transfer in section 8.3.2.1, the revised ASM will use membus instead. All other
places (states E1ADEP and E1ADCA) that mention m[ma] are of the form m[ma]
<- mb. These can be replaced with an assertion of the write signal, as illustrated in
figure 8-11.
Figure 8-11. ASM with memory as a separate actor.

(Some models of the PDP-8 had an optional hardware feature, known as EAE, that assisted in performing
division.)

c) an external data output bus (ac), connected to twelve lights, so the "amicable"
user can observe the computation of the quotient (and other things) in binary as it
progresses. (If the clock is fast enough, the user will not notice anything except the
result.)
d) an external command output (present_state), connected to lights, so the
"amicable" user can know when ac has become the correct quotient (i.e., when
present_state becomes IDLE).

8.3.2.5.3 Childish division program in machine language
On the left side of the following is the PDP-8 machine language code for the childish
division algorithm. On the right side is the corresponding assembly language mnemonics
and symbolic operands, in the style explained in appendix A, with comments following
the slash. Only the machine language resides in memory. The commented assembly
language is shown only to clarify how the program operates:

0014/7200       CLA
0015/7100       CLL      / {link,ac} = {0,0}
0016/1101       TAD y    / {link,ac} = {0,y}
0017/7040       CMA      / {link,ac} = {0,~y}
0020/7020       CML      / {link,ac} = {~0,~y}
0021/7001       IAC      / {link,ac} = {~0,~y}+1
0022/1032       TAD r1   / {link,ac} = {~0,~y}+1 + r1
0023/7430       SZL      / test whether {~0,~y}+1 + r1 >= 0
0024/5000       JMP L0   / if {~0,~y}+1 + r1 < 0, exit while (goto L0)
                         / if {~0,~y}+1 + r1 >= 0, stay in while loop
   ...
                         /
                         / r2 = r2+1
                         /
   ...                   / {link,ac} = {0,0+r2}
   ...                   / {link,ac} = {0,0+r2}+1
   ...
0031/5014       JMP L1   / continue while loop
/
/ The following 2 words store data manipulated by
/ the childish division algorithm
/
0032/0000  r1,  0000
0033/0000  r2,  0000
/
/ The following 2 words store data input from the sr
/
0100/0000  x,   0000
0101/0000  y,   0000

High-level Operations                    Before    During    After
                                         while     while     while

r1 = x;                                  14
r2 = 0;                                  7
while (r1 >= y)                                    44        44
 {
  r1 = r1 - y;                                     7
  r2 = r2 + 1                                      24
 }                                                           5
display r2 in accumulator and halt                           18

Total Clock Cycles                       21        75        67

The times are listed in three columns. The left column indicates operations that execute
just once, before the while loop begins. The right column indicates operations that
execute just once, upon exiting from the while loop. The middle column indicates
operations that occur each time through the loop. The while statement itself involves
formation of the 13-bit twos complement (32 cycles) and testing (12 cycles). The entry
44 (32+12) occurs both in the middle and right columns because this machine code
occurs each time through the loop as well as the final time when the condition r1>=y
becomes false. Just as in chapter 2, the number of times the loop executes is proportionate
to the quotient. Neglecting how long it takes for the user to toggle in the inputs,
from the time the program actually starts computing the quotient (when the program
counter was 0011) until the machine returns to state IDLE is 21 + 67 + 75*quotient
clock cycles.

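The column totals and the resulting formula can be recomputed from the table entries.
The Python sketch below is illustrative only; the constant and function names are
hypothetical, and the per-row cycle counts are copied from the table rather than derived
from the ASM.

```python
# Sketch: total clock cycles for the PDP-8 childish division program,
# using the per-row entries of the timing table above.

BEFORE = 14 + 7         # r1 = x;  r2 = 0;
DURING = 44 + 7 + 24    # while test;  r1 = r1 - y;  r2 = r2 + 1
AFTER = 44 + 5 + 18     # final while test;  closing-brace row;  display and halt

def pdp8_cycles(quotient):
    """Clock cycles from program start until the machine returns to IDLE."""
    return BEFORE + AFTER + DURING * quotient

print(BEFORE, DURING, AFTER)
print(pdp8_cycles(14 // 7))
```

Only the middle column is multiplied by the quotient, so the fixed overhead is
21 + 67 = 88 cycles regardless of the inputs.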
8.3.2.5.5 Comparison with special-purpose implementation
This table compares different implementations of the childish division algorithm:

                 section   clock cycles       12-bit      12-bit memory   ctrl
                                              registers   words           states

                 2.2.7     3+quotient         3           0               2
special          2.2.3     2+2*quotient       2           0               4
purpose          2.2.2     3+3*quotient       2           0               5
hardware         2.2.5     2+3*quotient       3           0               5

PDP-8 software   8.3.2     88+75*quotient     5           30              31

The "section" column indicates where the ASM (and in the case of the PDP-8 software,
also the machine language) is defined. The "clock cycles" column indicates how long it
takes to compute the quotient, neglecting the time for the user to toggle in the binary
inputs. The "12-bit registers" column indicates how many hardware registers of this size
are required (in the case of the PDP-8, this is the accumulator, instruction register, memory
address register, memory buffer register and program counter). We neglect one-bit registers
such as halt and link as being insignificant in the total cost. We also neglect the
cost of the combinational logic that interconnects the registers within the architecture
(since this is the subject of section 8.4). The "12-bit memory words" column indicates how
many words the machine language version requires for both program and data. Special-purpose
implementations of this algorithm do not need memory because the hardware
registers continually hold the data, and the controller implements the algorithm.
The "ctrl states" column indicates how many states are required by the hardware ASM.
ons that execute Any way you look at the above table, software appears to be a real loser. Compared to
operations that the fastest special-purpose implementation listed above (section 2.2.7), the software
olumn indicates approaches being about seventy-five times slower for a large quotient:
It itself involves
tcles). The entry
is machine code Lim (8 8 +75*quotient)/(3+quotient)
d
= 75
Dndition rl>=y
I
quotient -
,cutes is propor-
gle in the inputs, 4
For the particular case traced above and in chapter two (quotient=14/7), the ratio is
238/5, or about 47 times slower. One reason why the hardware implementation in section
2.2.7 makes the software look so bad is that the hardware does the equivalent
of three high-level operations (test r1>=y, r1=r1-y and r2=r2+1) in parallel during
each clock cycle. The childish division algorithm has the potential for this parallelism,
and so we ought to exploit this.
On the other hand, if we wanted to handicap the hardware to make the contest seem
more sporting, the ASM of section 2.2.2 is the closest to the software implementation
because it only does one high-level operation at a time. For a very large quotient, the
software approaches being about 25 times slower than section 2.2.2:

$$\lim_{\text{quotient}\to\infty} \frac{88 + 75\cdot\text{quotient}}{3 + 3\cdot\text{quotient}} = 25$$

For the particular case traced above and in chapter two (quotient=14/7), the ratio is
238/9, or about 26 times slower.
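These ratios are easy to tabulate. The Python sketch below is a hypothetical illustration
that uses only the clock cycle formulas from the comparison table; the function names are
invented for this sketch.

```python
# Sketch: slowdown of the PDP-8 software relative to two hardware ASMs,
# using the clock cycle formulas from the comparison table.

def software_cycles(q):
    return 88 + 75 * q          # PDP-8 software, section 8.3.2

def hw_2_2_7_cycles(q):
    return 3 + q                # section 2.2.7

def hw_2_2_2_cycles(q):
    return 3 + 3 * q            # section 2.2.2

for q in (2, 100, 4095, 10**6):
    fast = software_cycles(q) / hw_2_2_7_cycles(q)
    slow = software_cycles(q) / hw_2_2_2_cycles(q)
    print(f"quotient={q}: {fast:.2f} times slower than 2.2.7, "
          f"{slow:.2f} times slower than 2.2.2")
```

As the quotient grows, the constant terms stop mattering and the ratios settle toward
75 and 25, the ratios of the per-iteration costs.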
Even when the hardware only does one thing at a time (as in section 2.2.2), the software
appears much slower. There are two reasons for this. First, it takes several PDP-8
instructions to do the equivalent of one high-level language statement (which is most
noticeable in implementing the while). Second, the way we have implemented the
ASM for the PDP-8, it takes several clock cycles (either five or seven) for each instruction
to execute.
Software requires general-purpose hardware in order to run. The PDP-8 is about as
simple as a general-purpose computer can be, but even so, it requires five registers.
Software also requires memory for programs and data. Because of technological differences
explained in section 8.1, the cost to store a bit in memory is usually several
times lower than to store a bit in a register. For the sake of argument, say that the cost
for a 12-bit word in memory is five times cheaper than for a 12-bit register. The storage
costs are then 2*5 for section 2.2.2 hardware, 3*5 for section 2.2.7 hardware, and
5*5+30 for the PDP-8 implementation (assuming we only pay for the memory actually
used to implement the childish division program). Therefore, section 2.2.2 storage cost
is about one-fifth that of the PDP-8 implementation, and section 2.2.7 storage cost is
about one-quarter that of the PDP-8 implementation.

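The storage comparison can be restated as a short calculation. The following Python
sketch is illustrative; the unit of cost (one 12-bit memory word = 1) and all names are
assumptions made for this sketch, following the "five times cheaper" figure chosen for the
sake of argument.

```python
# Sketch: storage cost in units of one 12-bit memory word.
# Assumption: a 12-bit register costs five times as much as a 12-bit memory word.

REGISTER_COST = 5
MEMORY_WORD_COST = 1

def storage_cost(registers, memory_words):
    """Return the cost of the registers plus the memory words actually used."""
    return registers * REGISTER_COST + memory_words * MEMORY_WORD_COST

costs = {
    "2.2.2 hardware": storage_cost(2, 0),
    "2.2.7 hardware": storage_cost(3, 0),
    "PDP-8 software": storage_cost(5, 30),
}
pdp8 = costs["PDP-8 software"]
for name, cost in costs.items():
    print(f"{name}: cost {cost}, {cost / pdp8:.2f} of the PDP-8 cost")
```

With these assumed unit costs the two hardware versions come to 10 and 15 units
against 55 for the PDP-8, which is where the rough one-fifth and one-quarter figures
come from.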
8.3.2.5.6 Trade-off between hardware and software
One cannot draw sweeping conclusions having examined only a single algorithm in
hardware and software, and having examined the software on a single implementation
of a single instruction set. The difference between hardware and software may be less
pronounced when the algorithm is more complicated or when the instruction set is
more capable. In particular, algorithms that require memory for storage of data structures,
such as arrays, may show software performance closer to that of special-purpose
hardware. However, for the childish division algorithm, we can conclude the software
solution gives lower performance and costs more.
Would you pay more to buy something slower? Paradoxically, in most instances, you
probably would because hardware speed and cost are often not the primary concern.
Certainly in this case, speed is unimportant when you consider the problem that the
childish division algorithm solves. It interactively obtains two 12-bit inputs, divides
them in a very inefficient way,11 and displays the answer. It is going to take the user
several seconds to toggle in the inputs, and several more for the user to comprehend the
output. Since the largest 12-bit quotient is 4095, the maximum total time for the PDP-8
implementation is 88+75*4095 = 307213 clock cycles. Although this seems awful in
comparison to the 4098 clock cycles required by the hardware implementation of section
2.2.7, it is less than the blink of an eye when the clock period is 100 nanoseconds
(a very moderate clock speed using current integrated circuit technology). The user
will only see a brief flash before the correct answer appears. Occasionally, the specification
of a problem has a real-time aspect to it. For example, if instead of our friendly
user, the input came from another machine that needed to divide two thousand numbers
per second, only the hardware of section 2.2.7 would be able to keep up.

11 Regardless of the underlying implementation (hardware or software), there are much better algorithms than
the childish division algorithm if you really want to divide fast.

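The real-time claim can be checked with a little arithmetic. The Python sketch below is a
hypothetical illustration that assumes every implementation runs with the same 100
nanosecond clock period and must handle the worst-case quotient of 4095; the names are
invented for this sketch.

```python
# Sketch: which implementations can divide 2000 numbers per second in the worst case?
# Assumptions: 100 ns clock period for every implementation, worst-case quotient 4095.

CLOCK_PERIOD_NS = 100
WORST_QUOTIENT = 4095
BUDGET_NS = 1_000_000_000 // 2000        # time available per division

cycle_formulas = {
    "2.2.7": lambda q: 3 + q,
    "2.2.3": lambda q: 2 + 2 * q,
    "2.2.2": lambda q: 3 + 3 * q,
    "2.2.5": lambda q: 2 + 3 * q,
    "PDP-8 software": lambda q: 88 + 75 * q,
}

worst_ns = {name: f(WORST_QUOTIENT) * CLOCK_PERIOD_NS
            for name, f in cycle_formulas.items()}
keeps_up = [name for name, ns in worst_ns.items() if ns <= BUDGET_NS]

for name, ns in worst_ns.items():
    print(f"{name}: {ns / 1e6:.3f} ms worst case")
print("keeps up:", keeps_up)
```

Under these assumptions the software worst case is roughly 31 milliseconds, far too brief
for a person to notice but well over the half millisecond available per division at two
thousand divisions per second.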
In most instances, however, the factor that matters more than hardware speed and cost
is design speed and cost. In other words, how long does it take and how much does it
cost for the designer to produce a correct design? Designers are willing to use hardware
which is, in a technological12 sense, more costly and slower than is theoretically
necessary because in doing so they obtain the benefit of rapid debugging. When a
designer finds an error, it is easier to change a few bits in the memory of a general-purpose
computer than it is to fabricate a corrected version of a special-purpose computer.
Also, many design changes occur not because of a designer's mistake but instead
are required due to changing specifications. Productivity tools for general-purpose computers,
such as compilers, assemblers, linkers, editors, debuggers, etc., make the software
designer's task of coping with bugs and changing specifications much easier than
the above machine language examples.
The situation that has existed for the last half century is that designers have had the choice
between using a general-purpose computer or building a special-purpose computer. If
the market price (in dollars, rather than in gates) of the general-purpose computer is
within budget and its speed is adequate (not the fastest, just adequate) and otherwise
meets physical constraints (size, weight, power consumption, ruggedness) for the intended
application, the designer typically chooses the general-purpose computer because
of the advantages of rapid debugging. Although most algorithms work adequately
on general-purpose computers, some demand special-purpose hardware. This has created
two different economic phenomena.
First is the emergence of the general-purpose computer industry, composed of only a
handful of companies worldwide that actually design CPUs. All together, these companies
employ only a few hundred computer designers at best, and so few of the readers
of this book will ever be employed as general-purpose computer designers. These designers
face a daunting challenge: they design machines that will be used for tasks that
no one has yet conceived. Programmers in the future will think of new things to do with
the general-purpose machines that designers are working on today. Why does knowing
how the machine will be used assist the designer? Speed is not the primary concern of
a designer solving a specific problem because the designer can easily tell if the machine
is fast enough. A special-purpose computer does not have to be the fastest computer
in existence; it just has to be fast enough, and, of course, do its job correctly.
General-purpose computer designers do not have the luxury of knowing what is fast

12 Here "technological" means measuring cost in terms of registers, gates, chip area, etc., rather than in
dollars.

RTN                                State(s)

ac <- 0                            E0CLA, E1BDCA
ac <- ac & mb                      E0AND
ac <- ac | sr                      E0OSR
ac <- ~ac                          E0CMA
link <- 0                          E0CLL
link <- ~link                      E0CML
{ac,link} <- {link,ac}             E0RAR
{link,ac} <- {ac,link}             E0RAL
{link,ac} <- 0                     E0CLACLL
{link,ac} <- {link,ac} + 1         E0IAC
{link,ac} <- {link,ac} + mb        E0TAD

pc <- ma                           E0JMP
pc <- pc + 1                       F2, E0ASKIP
pc <- sr                           E0PC

Note that implicitly, the {link,ac} group should be thought of as implementing the
following register transfers:

Here is where the creative part occurs. Whatever hardware structure the designer chooses
must be capable of implementing each of the above register transfers during the state(s)
indicated. The controller will take care of making sure the states happen at the proper
times, so we do not have to worry about that. Our concern now is that the architecture
can manipulate the data as listed above.
The first decision the designer must make is what kind of structural device will imple-
ment each register. One possibility would be to use enabled registers for every variable
in the ASM (other than memory); however, this will typically cause the architecture to
be more complex than if other types of registers are selected. A better approach is to
look at each group (corresponding to transfers to a particular register) individually and
note those register transfers where the right-hand side consists only of constants and/or
the variable (or concatenated variables) on the left-hand side.
For the link and accumulator group, there are several such register transfers:

{link,ac} <- {link,0}            E0CLA, E1BDCA
{link,ac} <- {0,ac}              E0CLL
{link,ac} <- 0                   E0CLACLL
{link,ac} <- {link,ac} + 1       E0IAC

For the halt flag, both of the possible register transfers are of this kind:

For the memory address register, only one of the register transfers meets this criterion:

ma <- ma + 1                     E0BDEP

Similarly, for the program counter, there is only one register transfer that uses pc on
both sides:

pc <- pc + 1                     F2, E0ASKIP

For the instruction register and memory buffer register, there are no such register
transfers.

halt <- 1                        E0HLT, E0INIT
ir <- membus                     F3A
ma <- ea(ir)                     F3B

The default (when linkctrl and acctrl are not mentioned in a state) is to hold
the accumulator and link as they are.
The ea(ir) function can be implemented by trivial combinational logic. We leave
this as a separate device since there are other addressing modes not implemented here
that are described in appendix B and that are left as exercises.

There is only one register transfer left for the halt flag, and so its input is a constant one.
Similarly, there is only one register transfer for the instruction register, and so its input
is the memory bus (which provides m[ma] to the architecture from the external memory
device).

The remaining register transfers can be provided for by placing muxes on the inputs of
the appropriate registers. The input to the memory buffer register is a 12-bit mux that
selects among sr, the accumulator and memory bus. The input to the memory address
register is a 12-bit mux that selects among sr, ea(ir) and the program counter. The
input to the program counter is a 12-bit mux that selects between the sr and the memory
address register.

Here is the block diagram of the architecture that was just derived for the subset PDP-8:
Figure 8-1.
The fourth inputs to the memory address and memory buffer muxes are not required
and therefore tied to zero. It is left as an exercise to show that these fourth inputs will
help to implement more of the instructions given in appendix B.
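
The input mux for the memory address register described above can be sketched as a small combinational module. This is only an illustration under stated assumptions: the module name, the port names and the assignment of select codes to inputs are hypothetical and are not taken from the architecture diagram. The only facts used from the text are that the mux is 12 bits wide, that it selects among sr, ea(ir) and the program counter, and that the unused fourth input is tied to zero.

```verilog
// Hypothetical input mux for the memory address register.
// The select encoding (0 = sr, 1 = ea(ir), 2 = pc) is an assumption.
module ma_input_mux(sr, ea_ir, pc, sel, y);
  input  [11:0] sr;     // switch register
  input  [11:0] ea_ir;  // output of the ea(ir) combinational logic
  input  [11:0] pc;     // program counter
  input  [1:0]  sel;    // mux control from the controller
  output [11:0] y;      // value presented to the ma register input
  reg    [11:0] y;

  always @(sr or ea_ir or pc or sel)
    case (sel)
      2'd0:    y = sr;
      2'd1:    y = ea_ir;
      2'd2:    y = pc;
      default: y = 12'o0000;  // unused fourth input tied to zero
    endcase
endmodule
```

Giving the fourth input a real source, as the exercise suggests, would only require changing the default branch.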
Figure 8-13. Mixed ASM for PDP-8 subset.
[14] Also, there is the chance the disk head has to move, which can take a significant
fraction of a second.

The tag memory is needed because a particular part of the cache may be associated
with more than one address at different times during the operation of the cache. In
contrast, a particular part of an ordinary (main) memory will always be associated with
one particular constant address. As explained in section 8.2.2.3.1, such a main memory
can be thought of as a mux which selects one of several register values. Each cell in a
main memory is always associated with its particular address because that address
specifies the port of the mux to which the corresponding register is wired.

There are two common approaches to designing a cache. In the direct mapped approach,
there is only one tag memory and one data memory. In the multi-way set associative
approach, there are several parallel tag and corresponding data memories. The
direct mapped approach is simpler and therefore allows a faster access time. On the
other hand, the direct approach is often not as successful in keeping the appropriate
words in the cache as the multi-way approach, and so even though the access time of
the multi-way approach is slower, it may be faster overall for some programs than the
direct approach. This section, however, will concentrate on the direct mapped approach,
which is easier to comprehend.
The typical cache memory uses the low-order bits of the address bus to select information
out of both the data and tag portions of the cache. In order for a memory access to
be fast, the information fetched from the tag memory must match the address bus.[15] If
it does not, the cache must be updated from some lower level of the memory hierarchy.
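
The lookup just described can be sketched in Verilog. This is a minimal illustration and not the design developed in this chapter: the module name, the port names other than ldtag and ldcont, and the four-word size are assumptions. Like the simplified description in the text, it keeps the whole 12-bit address in the tag memory instead of only the high-order bits.

```verilog
// Hypothetical direct mapped lookup: four words, full address kept as the tag.
module dm_lookup(sysclk, addr, ldtag, ldcont, din, dout, hit);
  input         sysclk;
  input  [11:0] addr;    // address bus from the CPU
  input         ldtag;   // load the tag memory at the next rising edge
  input         ldcont;  // load the content memory at the next rising edge
  input  [11:0] din;     // data arriving from the lower level of the hierarchy
  output [11:0] dout;    // word selected by the low-order address bits
  output        hit;     // true when the stored tag matches the address bus

  reg [11:0] tag     [0:3];
  reg [11:0] content [0:3];

  // The low-order bits of the address select one entry in both memories.
  wire [1:0] index = addr[1:0];

  assign dout = content[index];
  assign hit  = (tag[index] == addr);

  always @(posedge sysclk) begin
    if (ldtag)  tag[index]     <= addr;
    if (ldcont) content[index] <= din;
  end
endmodule
```

In simulation the hit output is unknown (x) until the selected tag entry has been loaded at least once, so a real design would also need some way to mark entries as valid.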
Commercial computer systems often have more than one level of cache. In such systems,
the first level is often on the same chip as the processor to maintain the highest
(single clock) speed. The second level (referred to as L2) is contained on separate chips
that allow access in a small number of cycles. The main memory is composed of dynamic
memory, with an access time of many clock cycles. In this section, however,
there will only be two levels in the memory hierarchy: the direct mapped cache and the
main memory.
In this chapter, we will assume each element of the cache content memory is a single
word. Often, in commercial systems, each element of the cache content memory is a
group of several contiguous words, known as a line. Using a line composed of several
words may improve the performance of the cache, but including such details here would
obscure the idea being discussed in this section: how a cache is a cost-effective way to
improve the performance of a general-purpose computer.

For example, assume a cache size of four words[16] with the following simple program
that goes through a loop eight times producing nine values[17] (7760, 7762, ... 7776 and
0000) in the accumulator:

[15] In an actual implementation, only the high-order bits need to be stored in the tag
memory and checked against the high-order bits of the address bus, but we will ignore this
detail for now.
[16] This is too small for practical use but will illustrate how a cache works.
[17] These are the nine decimal values -16, -14, ... -2 and 0.
However, when this TAD instruction is executed, the cache already has the data 7760
required by the processor. This is known as a cache hit. The second memory access
during this instruction is fast because it is a cache hit.
Fetching and executing the next instruction (1011) causes two cache misses:
      cache                 main memory
  tag       data            0000/7300
  0/0000    0/7300          0001/1006
  1/0011    1/0002          0002/1011
  2/0002    2/1011          0003/7510
  3/0003    3/7510          0004/5002
                            0005/7402
                            0006/7760
                            0011/0002

Fetching and executing the SPA instruction (7510) causes a cache hit, and so this memory
access is fast. Since the accumulator is negative, the skip does not occur, and the
processor needs to fetch the next (5002) instruction. This causes another cache miss:

      cache                 main memory
  tag       data            0000/7300
  0/0004    0/5002          0001/1006
  1/0011    1/0002          0002/1011
  2/0002    2/1011          0003/7510
  3/0003    3/7510          0004/5002
                            0005/7402
                            0006/7760
                            0011/0002
In total, there are six cache misses[18] and 29 cache hits in this example. With the given
value of A, this is a 17% miss rate and an 83% hit rate, although the hit rate would
increase for values of A that are more negative.[19]

The good performance that the above program exhibits using this little cache depends
heavily on how the instructions and data are arranged. For example, if B were located
at address 0007, there would be 20 cache misses and only 15 cache hits, which is a 57%
miss rate and 43% hit rate. A larger cache size will often improve performance. If the
program with B at address 0007 runs on a machine with a cache size of eight, the hit
rate becomes 100% because this entire tiny program can reside in the cache. If the
program with B at address 0011 runs on a machine with a cache size of eight, the hit
rate is 94% (two misses) because the program cannot all fit in the cache at once (0001
and 0011 cannot reside in a direct mapped cache of size eight at the same time).
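
The hit and miss totals for the four-word cache with B at address 0011 can be checked with a small simulation. The following is a sketch under stated assumptions, not code from this chapter: the module and task names are hypothetical, the sequence of memory accesses is written out by hand from the program in the main memory listing (one instruction fetch per instruction, plus one operand access per TAD), and the starting cache is assumed to already hold addresses 0006 and 0003, as implied by the two early hits described in the text. The other two entries are left unknown so that their first use counts as a miss.

```verilog
// Counts hits and misses for the loop example on a four-word direct mapped cache.
module cache_count;
  reg [11:0] tag [0:3];   // tag memory holds the full address, as in the text
  integer hits, misses, i;

  // One memory access: compare the selected tag and reload it on a miss.
  task access;
    input [11:0] addr;
    begin
      if (tag[addr[1:0]] === addr)
        hits = hits + 1;
      else begin
        misses = misses + 1;
        tag[addr[1:0]] = addr;
      end
    end
  endtask

  initial begin
    hits = 0;
    misses = 0;
    // Assumed starting contents; tag[0] and tag[1] stay unknown (x).
    tag[2] = 12'o0006;
    tag[3] = 12'o0003;

    access(12'o0000);              // fetch 7300
    access(12'o0001);              // fetch TAD 0006
    access(12'o0006);              // operand A
    for (i = 1; i <= 8; i = i + 1) begin
      access(12'o0002);            // fetch TAD 0011
      access(12'o0011);            // operand B
      access(12'o0003);            // fetch SPA
      if (i < 8)
        access(12'o0004);          // fetch JMP (skipped on the last pass)
    end
    access(12'o0005);              // fetch HLT

    $display("misses=%0d hits=%0d", misses, hits);  // prints misses=6 hits=29
  end
endmodule
```

Six misses out of 35 accesses is the 17% miss rate quoted above. The other figures in the paragraph depend on starting cache contents that are not reproduced here, so this sketch does not attempt to confirm them.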
8.5.2 Memory handshaking
Regardless of whether a machine uses cache memory, virtual memory or both in its
memory hierarchy, one thing is clear: the access time is non-deterministic. Although
we expect the majority of memory accesses to occur in a single cycle, some accesses
will take additional cycles. The ASM chart for fetch/execute given in section 8.3.2.4.2
assumes that every memory access can occur in one cycle, which is not the case for a
memory hierarchy. A more sophisticated ASM is required that waits for the memory

[18] The number of misses is the same in this program regardless of the value of A and
therefore of how many times the loop executes.
[19] The number of hits depends on how many times the loop executes.
Figure 8-16.

[20] Ignoring a trivial amount of propagation delay, as was done in earlier portions of
this chapter.
Figure 8-19. ASM for direct mapped write-through cache memory controller.

The ASM stays in state CACHEIDLE unless a memory request occurs. There are three
possibilities for a memory request. Two of these possibilities are when the memory
request is for a read operation (i.e., write is zero): either the requested data is in the
cache or the requested data is not in the cache. The third possibility is a memory write
request from the CPU (regardless of whether it is in the cache).

The first possibility is when the data being read by the CPU is already in the cache. In
this case, memrack will be true during the first clock cycle that memreq is asserted.
Because the memrack signal comes straight from the combinational logic comparator,
both the ASM for the CPU and the ASM for the cache controller proceed without
delay states. The ASM for the CPU makes a transition such as from state F4A to F4B,
and the ASM for the cache controller makes a transition from CACHEIDLE back to
that same state.
The second possibility is when the data being read by the CPU is not in the cache
during the first clock cycle that memreq is asserted. During this clock cycle memrack
will be false. It will stay false for as long as the output of the cache tag memory does
not equal the memory address bus from the CPU. In a case like this, when memrack is
false, the ASM for the CPU makes a transition such as from state F4A to F4WAIT, and
the ASM for the cache controller makes a transition from CACHEIDLE to state R1.
The ASM has an appropriate number of empty delay states (not shown) to allow for the
read access time of the asynchronous main memory. Then, in state RL, the cache
controller issues the ldcont command. This causes the cache content memory to be
loaded at the next rising edge of the clock with the data obtained from the slow main
memory. Also in state RL, the cache controller issues the ldtag command. This causes
the cache tag memory to be loaded at the next rising edge of the clock with the address
being provided by the CPU. Because of this change to the tag memory, when the cache
controller proceeds to the empty state, RA, the architecture will for the first time assert
memrack. The one empty state, RA, is all that is necessary to allow the CPU to make
a transition such as from state F4WAIT to F4B. Of course, the cache controller makes
a transition during that same clock cycle from state RA back to CACHEIDLE.
The third possibility is when the CPU makes a memory write request. The ASM for the
cache controller proceeds from state CACHEIDLE to state W1 during the same clock
cycle that the ASM for the CPU proceeds from a state such as E1ADCA to E1DCAWAIT.
The ASM has an appropriate number of delay states (not shown) that each assert
mainwrite. This allows for the write access time of the asynchronous main memory.[26]
Finally, in state WLA, the cache controller asserts ldcont, ldtag and memwack.
The assertion of ldcont and ldtag is not necessary for this write operation but is
required for any future read operations to be fast. Therefore, a separate empty state for
write acknowledgement is not necessary here as was the case for read acknowledgement.
Because memwack is asserted in state WLA, at the same time that the ASM for the
cache controller makes a transition from state WLA to state CACHEIDLE, the ASM
for the CPU makes a transition such as from state E1DCAWAIT to E1BDCA.
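
The three possibilities above can be summarized as a small explicit-style state machine. This is a hedged sketch and not the controller of figure 8-19: the module name, the reset input, the hit input (standing for the comparator output that produces memrack) and the two delay parameters are assumptions. In particular, the chains of delay states that the figure leaves unshown, starting with the first read or write state, are replaced here by a counter in a single RDELAY or WDELAY state.

```verilog
// Hypothetical write-through cache controller; delay states replaced by a counter.
module cache_ctrl_sketch(sysclk, reset, memreq, write, hit,
                         ldcont, ldtag, mainwrite, memwack);
  parameter READ_DELAY  = 3;  // assumed read access time in clock cycles (at least 1)
  parameter WRITE_DELAY = 3;  // assumed write access time in clock cycles (at least 1)

  input  sysclk, reset;
  input  memreq;              // memory request from the CPU
  input  write;               // one for a write request, zero for a read
  input  hit;                 // comparator output: tag equals address bus
  output ldcont, ldtag, mainwrite, memwack;

  localparam CACHEIDLE = 3'd0, RDELAY = 3'd1, RL = 3'd2,
             RA = 3'd3, WDELAY = 3'd4, WLA = 3'd5;

  reg [2:0] state;
  reg [7:0] count;

  always @(posedge sysclk)
    if (reset) begin
      state <= CACHEIDLE;
      count <= 0;
    end else
      case (state)
        CACHEIDLE:
          if (memreq && write) begin          // third possibility: write
            state <= WDELAY;
            count <= WRITE_DELAY;
          end else if (memreq && !hit) begin  // second possibility: read miss
            state <= RDELAY;
            count <= READ_DELAY;
          end                                 // read hit: stay in CACHEIDLE
        RDELAY: if (count <= 1) state <= RL;  else count <= count - 1;
        RL:     state <= RA;                  // tag and content load on this edge
        RA:     state <= CACHEIDLE;           // comparator now reports a hit
        WDELAY: if (count <= 1) state <= WLA; else count <= count - 1;
        WLA:    state <= CACHEIDLE;
        default: state <= CACHEIDLE;
      endcase

  assign ldcont    = (state == RL) || (state == WLA);
  assign ldtag     = (state == RL) || (state == WLA);
  assign mainwrite = (state == WDELAY);
  assign memwack   = (state == WLA);
endmodule
```

Because the read and write access times need not be the same, the two delays are kept as separate parameters.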
F3WAIT 0
The following example is a program that adds two numbers together and stores the sum
F3WAIT 0
in memory. Both state machines (CPU and memory controllers) cooperate to fetch F3B 0
instructions and data and to store results back in memory. This example illustrates each
of the three possibilities explained above. The first two instructions, as well as the first Fetching the op
word of data fetched, are already in the cache. In such an instance (shown in bold) the instruction caus
cache state remains in CACHEIDLE and the CPU does not need a wait state. This
situation is signaled by memreq and memrack both being one during the same clock
cycle.
26
The read and write access times need not be the same.
on the CPU designed in this chapter is slower and less efficient than when the childish
division algorithm is implemented in special-purpose hardware. The next chapter will
look at how this performance discrepancy can be diminished.

8.7 Further reading
BELL, C. GORDON and A. NEWELL, Computer Structures: Readings and Examples,
McGraw-Hill, New York, NY, 1971. Chapter 5 is the definitive description of the PDP-8
from the man who also invented the first HDL (a language known as ISP).
BELL, C. GORDON, J. C. MUDGE and JOHN E. MCNAMARA, Computer Engineering: A
DEC View of Hardware Systems Design, Digital Press, Bedford, MA, 1978. Chapter 8.
LAVINGTON, S., Early British Computers: The Story of Vintage Computers and the People
Who Built Them, Digital Press/Manchester University Press, Bedford, MA, 1980.
Describes the work of Kilburn, Williams, Turing, Wilkes and other British pioneers.
The Origins of Digital Computers: Selected Papers, 2nd ed., Edited by B. Randell,
Springer-Verlag, Berlin, 1982. Reprints of original papers by computer pioneers.
PATTERSON, DAVID A. and JOHN L. HENNESSY, Computer Organization and Design: The
Hardware/Software Interface, Morgan Kaufmann, San Mateo, CA, 1994. Chapter 7
explains virtual memory and multi-way set associative caches.
PROSSER, FRANKLIN P. and DAVID E. WINKEL, The Art of Digital Design: An Introduction
to Top-Down Design, 2nd ed., Prentice Hall PTR, Englewood Cliffs, NJ, 1987. Chapter
7 describes an elegant central ALU architecture for the complete PDP-8 instruction set.
SLATER, ROBERT, Portraits in Silicon, MIT Press, Cambridge, MA, 1987. Gives biographies
of several important pioneers including Babbage, Zuse, Atanasoff, Turing, Aiken,
Eckert, Mauchly, von Neumann, Forrester, Bell and Noyce.
WAYNER, P., "Smart Memory," BYTE, June 1995, p. 190.
WOLF, WAYNE, Modern VLSI Design: A Systems Approach, 2nd ed., Prentice Hall PTR,
Englewood Cliffs, NJ, 1994, pp. 356-370. Shows how to lay out a VLSI chip
that implements a PDP-8 architecture.
8.8 Exercises
8-1. Revise the ASM of section 8.3.2.1 to include the ISZ instruction described in
appendix B.
8-2. Revise the architecture of section 8.4.7 to correspond to problem 8-1.
8-3. Revise the mixed ASM of 8.4.6 to correspond to problem 8-2.
8-4. Revise the ASM of section 8.3.2.1 to include the JMS instruction described in
appendix B.
8-5. Revise the ASM of problem 8-4 to include all the addressing modes described in
appendix B.
8-6. Revise the ASM of problem 8-5 to include the interrupt instructions ION and IOF
and associated hardware described in appendix B.
8-7. Revise the architecture of section 8.4.7 to correspond to problem 8-6.
8-8. Revise the mixed ASM of 8.4.6 to correspond to problem 8-7.
8-9. Suppose a direct mapped write-through cache of size four contains the contents of
addresses 0004, 0001, 0002 and 0003 when starting to run the following PDP-8 program:

0000/7200
0001/1004
0002/3006
0003/7402
0004/1000
The existence of the NOP instruction (7000) is important to the design of the pipelined
fetch/execute. By putting a NOP in the pipeline when none existed in the original pro-
gram, it will be possible to cope with several special situations. The essential goal of
the pipelined machine is to end up with the same answer in memory and the accumula-
tor as would be obtained from a non-pipelined version. Since a NOP leaves both the
accumulator and memory alone, NOP provides for a safe way to stall later stages of the
pipeline while earlier stages of the pipeline are being filled. This is quite advantageous,
since it can eliminate the need for "FILL" and "FLUSH" states of the kind described in
chapter 6.
Here is a portion of the implicit style Verilog corresponding to this ASM:
The execution of each instruction must be described in a Mealy oval. When ir2 contains
a TAD instruction, the accumulator is scheduled to be updated by adding it to the
operand fetched in the previous stage. When ir2 contains a DCA instruction, the
accumulator is scheduled to be stored (m[ea(ir2)] <- ac) in parallel to scheduling
that the accumulator be cleared.
9.2 Example of independent instructions
The ASM of section 9.1 is only able to execute certain PDP-8 programs correctly. By
"correctly," we mean that the pipelined version produces (in fewer clock cycles) the
same result that the multi-cycle version (section 8.3.1.5) produces in more clock cycles.
Since the multi-cycle and the pipelined versions proceed differently, we have to wait
until both machines are halted to check if the results are the same. The limitation on the
kind of machine language program that figure 9-1 will execute properly is that each
instruction is independent of the others. In other words, there are no data dependencies.
(This is the only kind of pipelining discussed in chapter 6.) An example of such a
program is the one given in appendix A, which is used with the multi-cycle ASM in
section 8.3.1.6:
travel through the pipeline. In the first clock cycle after leaving IDLE ($time 359),
ir1 and ir2 contain NOPs, so nothing happens to the accumulator. In the next clock
cycle, ir1 contains the first instruction (7200), but ir2 still contains a NOP. Only in
the third clock cycle after leaving IDLE ($time 559) does an actual instruction from
the program execute; in this case the accumulator is scheduled to be cleared. This
action becomes visible at $time 659. At that same time the first TAD instruction is
ready to execute. In the previous clock cycle, the operand (0112) needed for this TAD
instruction was scheduled to be loaded into mb2. Therefore, at $time 659 the
ac <- ac + mb2 can be scheduled. The sum (0000+0112) becomes visible at $time 759.
The remaining TAD instructions have filled the pipeline, so they can execute one per
clock cycle. This is possible because the operands (0152 available at $time 759 and
0224 available at $time 859) have also been fetched. At $time 959, the correct sum
(0510) is stored into memory at address 0111.
9.3 Data dependencies
What happens if the instructions are not independent of each other? For software to do
practical things, often one instruction needs to depend on results computed by previous
instructions. This is known as a data dependency. For example, a slight variation of the
program from appendix A:

0100/7200
0101/1106
0102/1107
0103/3111 <- this is different from appendix A
0104/1111 <- this is also different
0105/7402
0106/0112
0107/0152
0110/0224
0111/0000

illustrates the problem that the above ASM has with instructions that are not independent.
In this program, instead of doing a third TAD at 0103, the DCA (3111) occurs.
This is followed by a TAD (1111) from this same location. The TAD instruction at 0104
is dependent on the DCA instruction at 0103. Here is the wrong result that figure 9-1
produces:
INIT pc=xxxx ir1=xxxx mb2=xxxx ir2=xxxx ac=xxxx h=x   59
F1   pc=0100 ir1=xxxx mb2=xxxx ir2=xxxx ac=xxxx h=1  159
IDLE pc=0100 ir1=xxxx mb2=xxxx ir2=xxxx ac=xxxx h=1  259
F1   pc=0100 ir1=7000 mb2=xxxx ir2=7000 ac=xxxx h=0  359
F1   pc=0101 ir1=7200 mb2=xxxx ir2=7000 ac=xxxx h=0  459
F1   pc=0102 ir1=1106 mb2=xxxx ir2=7200 ac=xxxx h=0  559
F1   pc=0103 ir1=1107 mb2=0112 ir2=1106 ac=0000 h=0  659
F1   pc=0104 ir1=3111 mb2=0152 ir2=1107 ac=0112 h=0  759
F1   pc=0105 ir1=1111 mb2=0000 ir2=3111 ac=0264 h=0  859
F1   pc=0106 ir1=7402 mb2=0000 ir2=1111 ac=0000 h=0  959
F1   pc=0107 ir1=0112 mb2=xxxx ir2=7402 ac=0000 h=0 1059
F1   pc=0110 ir1=0152 mb2=xxxx ir2=0112 ac=0000 h=1 1159
IDLE pc=0110 ir1=0152 mb2=xxxx ir2=0112 ac=0000 h=1 1259
IDLE pc=0110 ir1=7000 mb2=xxxx ir2=7000 ac=0000 h=0 1359

In the above, italics show how the instruction at 0104 travels through the pipeline.
Everything looks fine until $time 959. The mb2 register (shown in bold italics) should
contain the operand needed in the next clock cycle for the TAD (1111) instruction.
Unfortunately, at $time 859 when mb2 was scheduled to be loaded, memory at address
0111 still contains the zero put there originally. The DCA (3111) instruction that
    if (halt)
      . . .
    else
      begin
        pc  <= @(posedge sysclk) pc + 1;
        ir1 <= @(posedge sysclk) m[pc];
        ir2 <= @(posedge sysclk) ir1;
        if ((ir2[11:9] == 3)&&(ea(ir1)==ea(ir2)))
          mb2 <= @(posedge sysclk) ac;
        else
          mb2 <= @(posedge sysclk) m[ea(ir1)];

Figure 9-2.
As in the last example, italics show how the instruction at 0104 travels through the
pipeline. In this case, data forwarding only occurs at $time 859, because
ea(1111)==ea(3111) & ir2[11:9]==3. The underlining emphasizes the parts
of ir1 and ir2 that must be identical for data forwarding to occur. During that clock
cycle, the accumulator (shown in non-italic bold) contains 0264. The effect of the data
forwarding becomes visible at $time 959, when mb2 (shown in italic bold) becomes
0264, which is correct. At $time 1059, we see that the accumulator (shown in italic
bold) has the correct value because of this data forwarding.
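
The forwarding decision can also be viewed as a two-input mux in front of mb2. The following is a combinational sketch under stated assumptions, not the implicit style code of this chapter: the module and port names are hypothetical, and ea is assumed to be page-zero direct addressing (the low seven bits of the instruction), which is consistent with the addresses used in the examples but is not shown in this section.

```verilog
// Hypothetical combinational view of the data forwarding decision for mb2.
module mb2_forward(ir1, ir2, ac, memdata, mb2_next, forward);
  input  [11:0] ir1;       // instruction whose operand is being fetched
  input  [11:0] ir2;       // instruction currently executing
  input  [11:0] ac;        // accumulator that a DCA is about to store
  input  [11:0] memdata;   // m[ea(ir1)] as read from memory this cycle
  output [11:0] mb2_next;  // value scheduled to be loaded into mb2
  output        forward;   // one when the accumulator is forwarded

  // Assumed effective address: page zero, direct (low seven bits).
  function [11:0] ea;
    input [11:0] ir;
    ea = {5'b00000, ir[6:0]};
  endfunction

  // Forward only when a DCA in ir2 targets the address ir1 is reading.
  assign forward  = (ir2[11:9] == 3'd3) && (ea(ir1) == ea(ir2));
  assign mb2_next = forward ? ac : memdata;
endmodule
```

With ir1 = 12'o1111, ir2 = 12'o3111, ac = 12'o0264 and memdata = 12'o0000, forward is one and mb2_next is 12'o0264, which matches the value that becomes visible in mb2 at $time 959.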
Figure 9-3. Pipelined fetch/execute with JMP.

The following shows in bold how the Verilog must be changed to implement JMP
properly for the pipelined ASM:
    if (halt)
      . . .
    else
      begin
        if (ir1[11:9] == 5)
          begin
            pc  <= @(posedge sysclk) ea(ir1);
            ir1 <= @(posedge sysclk) 12'o7000;
            ir2 <= @(posedge sysclk) 12'o7000;
          end
        else
          begin
            pc  <= @(posedge sysclk) pc + 1;
            ir1 <= @(posedge sysclk) m[pc];
            ir2 <= @(posedge sysclk) ir1;
          end
        if ((ir2[11:9] == 3)&&(ea(ir1)==ea(ir2)))
This occurs when the instruction in ir1 is a JMP (rather than waiting until ir2 contains
the JMP). To illustrate how this works, consider the following variation of the
program in appendix A:

0100/7200
0101/1106
0102/5105 <- This is different from appendix A
0103/1110
0104/3111
0105/7402
0106/0112
0107/0152
0110/0224
0111/0000

Instead of a TAD instruction at address 0102, there is a JMP (5105) instruction that
avoids executing the TAD (1110) instruction at address 0103 and the DCA instruction
at 0104. The following shows how figure 9-3 executes this program correctly:

INIT pc=xxxx ir1=xxxx mb2=xxxx ir2=xxxx ac=xxxx h=x   59
F1   pc=0100 ir1=xxxx mb2=xxxx ir2=xxxx ac=xxxx h=1  159
IDLE pc=0100 ir1=xxxx mb2=xxxx ir2=xxxx ac=xxxx h=1  259
F1   pc=0100 ir1=7000 mb2=xxxx ir2=7000 ac=xxxx h=0  359
F1   pc=0101 ir1=7200 mb2=xxxx ir2=7000 ac=xxxx h=0  459
F1   pc=0102 ir1=1106 mb2=xxxx ir2=7200 ac=xxxx h=0  559
F1   pc=0103 ir1=5105 mb2=0112 ir2=1106 ac=0000 h=0  659
F1   pc=0105 ir1=7000 mb2=7402 ir2=7000 ac=0112 h=0  759
F1   pc=0106 ir1=7402 mb2=xxxx ir2=7000 ac=0112 h=0  859
F1   pc=0107 ir1=0112 mb2=xxxx ir2=7402 ac=0112 h=0  959
F1   pc=0110 ir1=0152 mb2=xxxx ir2=0112 ac=0112 h=1 1059
IDLE pc=0110 ir1=0152 mb2=xxxx ir2=0112 ac=0112 h=1 1159
IDLE pc=0110 ir1=7000 mb2=xxxx ir2=7000 ac=0112 h=0 1259
        if (ir2[11:9] == 1)
          ac <= @(posedge sysclk) ac + mb2;
        else if (ir2[11:9] == 3)
          begin
            m[ea(ir2)] <= @(negedge sysclk) ac;
            ac <= @(posedge sysclk) 0;
          end
        else if (ir2 == 12'o7200)
          ac <= @(posedge sysclk) 0;
        else if (ir2 == 12'o7402)
          halt <= @(posedge sysclk) 1;
        else if (ir2 == 12'o7041)
          ac <= @(posedge sysclk) -ac;
        else if (ir2 == 12'o7001)
          ac <= @(posedge sysclk) ac + 1;
        else if (ir2 == 12'o7000)
          ;
        else
          $display("other instructions...");
      end
    end
  end
The decision whether to nullify the instruction that follows the skip must occur at the
top of the algorithm. This is because each register, such as ir2, can only have one
value stored into it during each clock cycle. The normal behavior of the pipeline
(transferring ir1 into ir2) cannot occur when the next instruction is to be nullified.
Similarly, if that next instruction (in ir1) is a JMP (as is likely), the skip needs to take
precedence over the JMP. Therefore the precedence of the decisions at the top of the
algorithm is:
a) a skip instruction in ir2 that is to be taken
b) a JMP instruction in ir1
c) normal pipelined behavior
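
The precedence listed above can be sketched as one priority decision that computes the next values of pc, ir1 and ir2. This is an illustration under stated assumptions, not the code of this chapter: the module and port names are hypothetical, ea is assumed to be the low seven bits of the instruction, and a taken skip is assumed to replace the instruction on its way into ir2 with a NOP while fetching continues normally. The skip tests use the sign bit of the accumulator: SPA (7510) skips when it is zero and SMA (7500) skips when it is one.

```verilog
// Hypothetical next-value logic for the front of the pipeline.
module pipe_front(pc, ir1, ir2, ac, mdata, pc_next, ir1_next, ir2_next);
  input  [11:0] pc, ir1, ir2, ac;
  input  [11:0] mdata;                 // m[pc], the word being fetched
  output [11:0] pc_next, ir1_next, ir2_next;
  reg    [11:0] pc_next, ir1_next, ir2_next;

  parameter NOP = 12'o7000;

  // Assumed effective address: page zero, direct (low seven bits).
  function [11:0] ea;
    input [11:0] ir;
    ea = {5'b00000, ir[6:0]};
  endfunction

  wire skip_taken = ((ir2 == 12'o7510) && !ac[11]) ||   // SPA, ac not negative
                    ((ir2 == 12'o7500) &&  ac[11]);     // SMA, ac negative

  always @(pc or ir1 or ir2 or mdata or skip_taken)
    if (skip_taken) begin                 // a) taken skip in ir2
      pc_next  = pc + 1;
      ir1_next = mdata;
      ir2_next = NOP;                     // nullify the instruction in ir1
    end else if (ir1[11:9] == 3'd5) begin // b) JMP in ir1
      pc_next  = ea(ir1);
      ir1_next = NOP;
      ir2_next = NOP;
    end else begin                        // c) normal pipelined behavior
      pc_next  = pc + 1;
      ir1_next = mdata;
      ir2_next = ir1;
    end
endmodule
```

Writing the three cases as a single if/else chain makes the ordering explicit: a JMP sitting in ir1 is ignored whenever the skip ahead of it is taken.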
Any other precedence would be incorrect. At the time the algorithm makes this decision,
ir2 already contains the skip instruction. Therefore, the bottom of the algorithm
(which executes in parallel) needs to treat the 7500 or 7510 as a NOP, regardless of
whether or not the following instruction will be nullified.

The above also includes the IAC (Increment ACcumulator, 7001) and CIA (Complement
and Increment Accumulator, 7041) instructions. These non-memory reference
instructions are similar to the CLA (7200) instruction in that the pipeline follows its
normal behavior. To achieve simple pipelined behavior here with the CIA instruction,
we assume that the ALU can form the twos complement negation of the accumulator in
a single clock cycle.
9.7 Our old friend: division
The recurring example in this book is the childish division algorithm, introduced in
section 2.2. It is used in chapter 2 to illustrate Moore ASMs, used in chapter 3 to
illustrate Verilog test code, used in chapter 4 to illustrate behavioral, mixed and
structural Verilog, used in chapter 5 to illustrate Mealy ASMs, used in chapter 6 to
illustrate propagation delay and used in chapter 8 to benchmark the multi-cycle
general-purpose PDP-8 against the special-purpose hardware of earlier chapters. The
conclusion in chapter 8 is that special-purpose hardware implementations of the childish
division algorithm were considerably faster and cheaper than the same algorithm running
as software on the multi-cycle implementation of the general-purpose PDP-8. Yet most
algorithms are implemented in software rather than hardware because software is easier
to design and maintain. Pipelining allows a designer to create a more expensive
general-purpose computer where the speed of its software comes closer to that of
special-purpose hardware.

To illustrate what we have achieved by pipelining the PDP-8 as described in the
previous sections, recall the description of the childish division algorithm in C:

For simplicity, we will assume x and y already have their values stored in memory, and
that these values are less than 2048 (see footnote 1).

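The algorithm being translated can be read off the comments of the machine language program that follows (r1 = x, r2 = 0, then repeated subtraction of y while counting). As a minimal behavioral sketch in Verilog rather than C, assuming 12-bit registers and the same test values (x = 0016 octal, y = 0007 octal) used in this section; the module name is hypothetical:

```verilog
// Behavioral sketch of the childish division algorithm.
// Assumption: 12-bit words, as on the PDP-8; module name is illustrative.
module childish_division_demo;
  reg [11:0] x, y, r1, r2;

  initial begin
    x  = 12'o0016;          // 14 decimal
    y  = 12'o0007;          // 7 decimal
    r1 = x;                 // r1 ends up holding the remainder
    r2 = 0;                 // r2 ends up holding the quotient
    while (r1 >= y) begin   // skipped entirely when x < y
      r1 = r1 - y;
      r2 = r2 + 1;
    end
    $display("quotient=%0d remainder=%0d", r2, r1);
  end
endmodule
```

With these inputs the loop body runs twice, so r2 is 2 and r1 is 0 when the loop exits.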
As has been illustrated many times in earlier chapters, the while loop serves two
roles: it avoids entering the loop and thus keeps r2 zero when x<y, or otherwise it
stops the loop when it has repeated the proper number of times. In chapter 8, this was
implemented as a skip and JMP at the top of the software loop and an unconditional
JMP at the bottom. Such an approach is the easiest way to translate to machine
language, but it has the cost of requiring additional instructions to execute each time
through the loop. We need to find as good a machine language translation of this
algorithm as we can. Such a machine language program will make the best use of the
pipelined machine. The following uses an SPA instruction at the top to cause the loop
to be entered the first time, and an SMA instruction at the bottom to cause the loop to
exit:
         *0100
0100/7200      CLA
0101/1126      TAD X      // ac = 0+x
0102/3124      DCA R1     // r1 = x
0103/3125      DCA R2     // r2 = 0
0104/7200      CLA
0105/1127      TAD Y      // ac = 0+y
0106/7041      CIA        // ac = -y
0107/1124      TAD R1     // ac = r1-y
0110/7510      SPA        // if (r1-y >= 0) goto L1
0111/5123      JMP L2     // else goto L2
0112/3124  L1, DCA R1     // r1 = r1-y
                          // depends on ac containing r1-y
                          // on both paths to this inst.
0113/1125      TAD R2     // ac = 0+r2
0114/7001      IAC        // ac = r2+1
0115/3125      DCA R2     // r2 = r2+1
0116/1127      TAD Y      // ac = 0+y
0117/7041      CIA        // ac = -y
0120/1124      TAD R1     // ac = r1-y
0121/7500      SMA        // if (r1-y < 0) goto L2
0122/5112      JMP L1     // else goto L1
0123/7402  L2, HLT        // done
                          //
0124/0000  R1, 0000
0125/0000  R2, 0000
0126/0016  X,  0016       // These must be < 2048 (3777 octal)
0127/0007  Y,  0007
1 Since we have not implemented the link register of the PDP-8 in this pipelined version, larger values
of x and y could cause the program to malfunction.

... (7510) (see footnote 2). As described above, the skip is given precedence over the
JMP. Therefore, whether the next instruction (currently in ir1) will be nullified is
based on ac[11]. In this case, ac[11] == 0, so the SPA will nullify the following
instruction. At $time 1459, ir2 has become NOP (7000), but 3124 was fetched normally
into ir1 so that the algorithm can proceed sequentially.

A different situation occurs at $time 2259. Here the SMA (7500) does not nullify the
JMP instruction (5112) because the accumulator is not negative, so the behavior
described in section 9.5 occurs. Both ir1 and ir2 are loaded with NOPs (7000), as is
visible at $time 2359. The machine does not start executing useful instructions after
the JMP until $time 2559 because of the time required to fill the pipeline.

Finally, at $time 3259, the SMA (7500) does nullify the JMP instruction (5112) because
the accumulator is negative, so only ir2 has a NOP (7000) at $time 3359. This
allows sequential execution of the HLT (7402) at $time 3459.

Between $time 359 and $time 3459 are 32 clock cycles. In general, if the
quotient >= 1, the number of clock cycles is 12 + 10*quotient. The following table
summarizes implementations of the childish division algorithm given in this and
earlier chapters:

max   pipe  kind    hardware  software
int         of ASM  section   section    clock cycles

4095  n     Moore   2.2.7     n/a        3 + quotient
4095  n     Moore   2.2.3     n/a        2 + 2*quotient
4095  n     Moore   2.2.2     n/a        3 + 3*quotient
4095  n     Moore   2.2.5     n/a        2 + 3*quotient
4095  n     Mealy   5.2.1     n/a        2 + 2*quotient
4095  n     Mealy   5.2.3     n/a        3 + quotient
4095  n     Mealy   5.2.4     n/a        2 + quotient
4095  n     Moore   8.3.2.1   8.3.2.5.3  88 + 75*quotient
2047  n     Moore   8.3.2.1   9.7        55 + 55*quotient
2047  y     Mealy   9.6       9.7        12 + 10*quotient

The first seven lines above are for special-purpose computers whose ASMs implement
the childish division algorithm. The last three lines are for general-purpose computers
(whose ASMs implement fetch/execute) that need a machine language program to
implement division. The "max int" column shows the maximum allowable integer input,
which is 2047 for the software given in this section. The "pipe" column indicates whether
the hardware is pipelined. The "kind of ASM" indicates whether the ASM uses condi-

2 The 7510 in mb2 is sheer coincidence.

[Figure 9-5. MULTI-PORT MEMORY block with signals dout0, dout1, ma1, ma2 and ldm2.]

  always @(m[ma1])
    dout1 = m[ma1];

  always @(posedge sysclk)
    begin
      if (ldm2)
        m[ma2] = din2;
    end

... so that the architecture that instantiates the multi-port memory may do three things
to memory in parallel.
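
To make the "three things in parallel" concrete, here is a minimal self-contained sketch of such a memory with two asynchronous read ports and one clocked write port. The names dout0, dout1, ma1, ma2, din2 and ldm2 follow the fragments above; the address port ma0, the 12-bit widths, the module names and the test values are assumptions made only for illustration:

```verilog
// Sketch of a three-port memory (assumed interface, not the text's listing).
module multiport_mem(dout0, dout1, ma0, ma1, ma2, din2, ldm2, sysclk);
  output [11:0] dout0, dout1;
  input  [11:0] ma0, ma1, ma2, din2;
  input         ldm2, sysclk;

  reg [11:0] m [0:4095];

  // two independent combinational read ports
  assign dout0 = m[ma0];
  assign dout1 = m[ma1];

  // one write port, effective at the rising clock edge
  always @(posedge sysclk)
    if (ldm2)
      m[ma2] <= din2;
endmodule

module multiport_mem_test;
  reg  [11:0] ma0, ma1, ma2, din2;
  reg         ldm2, sysclk;
  wire [11:0] dout0, dout1;

  multiport_mem mem(dout0, dout1, ma0, ma1, ma2, din2, ldm2, sysclk);

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;

  initial begin
    // write 0016 (octal) to address 0124 while both read ports point elsewhere
    ma0 = 12'o0100; ma1 = 12'o0126; ma2 = 12'o0124;
    din2 = 12'o0016; ldm2 = 1;
    @(posedge sysclk) #1;
    ldm2 = 0;
    ma1 = 12'o0124;              // now read back the word just written
    #1 $display("m[0124]=%o", dout1);
    $finish;
  end
endmodule
```

The write uses a non-blocking assignment here so that a read of the same address in the same clock cycle sees the old contents; whether that matches the book's intent for this memory is an assumption.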
Figure 9-6. Architecture for pipelined PDP-8.

9.10 Conclusion
The pipelined PDP-8 designed in this chapter can run software in some situations about
five times faster than the multi-cycle PDP-8 given in the last chapter. Because the
propagation delays (which determine the clock frequency) in the pipelined and
multi-cycle versions are nearly identical, there are two other factors that determine the
speed. First is the number of clock cycles per instruction. (In chapter 8, most
instructions take five clock cycles, but in this chapter instructions other than JMP take
only one cycle.) Second is the mix of instructions in the program, such as how
frequently JMPs occur. (The example here is the childish division algorithm, which may
or may not be representative of how the algorithm you want to implement will perform.)

The major cost of the pipelined approach in this chapter is the multi-port memory,
which allows simultaneous access to memory for instructions and data. The problem is
that even with pipelining, this approach provides one-tenth the speed of the specialized
hardware for the childish division algorithm.
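
As a worked check, the "about five times" and "one-tenth" figures follow from the clock-cycle formulas tabulated in section 9.7, assuming equal clock periods for all three machines (the text says the delays are nearly identical). Writing q for the quotient, and comparing the pipelined software (12 + 10q) against the multi-cycle software (55 + 55q) and against the fastest special-purpose design (2 + q):

```latex
\[
\frac{55 + 55q}{12 + 10q} \;\longrightarrow\; \frac{55}{10} = 5.5
\quad (q \to \infty),
\qquad
\frac{12 + 10q}{2 + q} \;\longrightarrow\; 10
\quad (q \to \infty).
\]
% For the example in section 9.7 (x = 14, y = 7, so q = 2):
\[
\frac{55 + 55 \cdot 2}{12 + 10 \cdot 2} = \frac{165}{32} \approx 5.2,
\qquad
\frac{12 + 10 \cdot 2}{2 + 2} = \frac{32}{4} = 8 .
\]
```

Both ratios grow toward their limits as the quotient increases, so small quotients understate the gap in each direction.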
When you consider both cost and speed, special-purpose hardware is much better than
software running on a pipelined PDP-8, at least for this example. Although the relative
performance of other algorithms might be different, this example points out that other
techniques beyond pipelining of the PDP-8 are going to be required if software speed is
going to approach that of special purpose hardware. The next chapter illustrates some
of these techniques.
9.11 Further reading
STERNHEIM, ELIEZER, RAJVIR SINGH and YATIN TRIVEDI, Digital Design with Verilog HDL,
Automata Publishing, San Jose, CA, 1990. Chapter 3 gives a different approach to
modeling a pipelined general-purpose computer in Verilog.
9.12 Exercises
9-1. Modify the behavioral design in section 9.6 to include the ISZ instruction of the
PDP-8 (described in appendix B). Including an ISZ instruction in a program should
only increase its execution time by one clock cycle for each time the ISZ is executed.
Simulate the modified design with the following programs:
10.1 History of CISC versus RISC
One attempt to increase performance of general-purpose processors that became popular
in the 1970s is the idea of a Complex Instruction Set Computer (CISC). In essence,
the idea is to merge a simple general-purpose machine together with special hardware
(and special registers) that solve certain specific computations. The thought was that
this would give the user the best of both worlds (special-purpose and general-purpose
computers). To activate each special hardware unit requires including a new instruc-
tion in the instruction set. Rather than the handful of machine language instructions
described in appendix B for the PDP-8, a CISC machine might have thousands of
distinct instructions. Fitting all these instructions into a reasonable sized instruction
register requires that some instructions occupy multiple words, which is known as a
10.6 ARM subset
There are eleven different categories of ARM instructions described in appendix G. It
is possible to do very useful things in software using only a few of these instructions,
and so we can select a handful of these instructions to illustrate the design of a RISC
processor. Of the eleven categories of instructions in appendix G, we will only
implement the "data processing" and "branch" categories. The data processing category is
subdivided into sixteen different mnemonics, and the branch category is subdivided
into two different mnemonics. We will only implement four of the eighteen possible
mnemonics in these two categories.

10.6.1 Data processing instructions
There are zeros in instruction register bits 27 and 26 to indicate the data processing
category. Instruction register bits 24 down to 21 determine which one of the sixteen
data processing mnemonics is associated with that particular instruction. For simplicity,
we will only implement the following three of the sixteen possible mnemonics:

decoding   mnemonic
ir[24:21]
    2      SUB
    4      ADD
   13      MOV

The following example branch instruction forms an infinite loop by branching back to
itself. Because of the relative addressing mode, this same machine language instruction
will work identically regardless of the location where it occurs in a program:

L2   B L2                                 mnemonic (L2 is a label)

e a f f f f f e                           hexadecimal

1110 101 0 111111111111111111111110       binary
|    |   | |
|    |   | +-- two's complement -2 offset
|    |   +-- ir[24] ignored here
|    +-- ir[27:25] == 5 so it branches
+-- ir[31:28] == 4'b1110 so it executes

It may seem a little strange, but the -2 indicates branching back to the same instruction.
In other words, the new value of the program counter is the value of the program
counter at the time the instruction is fetched plus 4*offset+8, where the offset is
a sign extended version of instruction register bits 23 down to 0. The reason the ARM
designers chose to make -2 mean branching back to itself will become clear later in this
chapter.

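The branch arithmetic described above (fetch address plus 4*offset+8, with the 24-bit offset sign extended) can be checked with a short sketch. The function name branch_target and the sample fetch address are hypothetical, chosen only for this illustration:

```verilog
// Sketch: target address of a B instruction fetched at address pc.
module branch_target_demo;
  function [31:0] branch_target;
    input [31:0] pc;       // address from which the branch was fetched
    input [23:0] offset;   // ir[23:0] of the branch instruction
    reg   [31:0] offset4;
    begin
      // sign extend the 24-bit offset and multiply it by four
      offset4 = {{6{offset[23]}}, offset, 2'b00};
      branch_target = pc + offset4 + 8;
    end
  endfunction

  initial begin
    // eafffffe: the offset field is fffffe, i.e., -2
    $display("%h", branch_target(32'h00000010, 24'hfffffe));
  end
endmodule
```

Since 4*(-2)+8 is zero, the displayed target equals the fetch address (00000010 here), whatever address is chosen.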
10.6.3 Program status register
Another detail in which the ARM is different than the PDP-8 is the way in which the
ARM tests for conditions, such as testing for negative numbers. On the PDP-8, since
the accumulator is the only place where a number to be tested can reside, the hardware
simply uses the most significant bit of the accumulator to determine whether that
number is negative or not. On the ARM, there are sixteen different registers that a
programmer might choose to test, and so there are sixteen different sign bits that the
hardware might need to use, which would not be economical. Instead, the ARM allows the
programmer to specify a one as bit 20 of the instruction register for a data processing
instruction ("S" suffix on the mnemonic). When bit 20 is a one, certain critical
information about the result of the data processing instruction is saved in the program
status register. (The "S" suffix means set the PSR.) In this chapter, we will consider bit
31 of the program status register (PSR), which is known as the "N" (negative) flag. The N
flag stores the sign bit of the result of the most recent data processing instruction with
an "S" suffix mnemonic.

0121/7500      SMA
0122/5112      JMP L1

The analogous ARM instruction is BPL:

ADD  R2,R2,1                                mnemonic

e 2 8 2 2 0 0 1                             hexadecimal

 1110 00 1 0100 0 0010 0010 0000 00000001   binary
 |    |  | |    | |    |    |    |
 |    |  | |    | |    |    |    +-- ir[7:0] == 1 so `OPB is 1
 |    |  | |    | |    |    +-- ir[11:8] ignored here
 |    |  | |    | |    +-- ir[15:12] == 2 so `RD is r[2]
 |    |  | |    | +-- ir[19:16] == 2 so `OPA is r[2]
 |    |  | |    +-- ir[20] == 0 so don't set psr
 |    |  | +-- ir[24:21] == 4 so mnemonic is "ADD"
 |    |  |       i.e., `RD <- `OPA + `OPB
 |    |  |       i.e., r[2] <- r[2] + 1
 |    |  +-- ir[25] == 1 so `OPB is immediate
 |    +-- ir[27:26] == 0 so it is data processing
 +-- ir[31:28] == 4'b1110 so it executes

As another example, consider the ARM instruction that initializes the R1 register with
the decimal constant fourteen:

MOV  R1,0x0e                                mnemonic

e 3 a 0 1 0 0 e                             hexadecimal

 1110 00 1 1101 0 0000 0001 0000 00001110   binary
 |    |  | |    | |    |    |    |
 |    |  | |    | |    |    |    +-- ir[7:0] == 14 so `OPB is 14
 |    |  | |    | |    |    +-- ir[11:8] ignored here
 |    |  | |    | |    +-- ir[15:12] == 1 so `RD is r[1]
 |    |  | |    | +-- ir[19:16] ignored: not used by MOV
 |    |  | |    +-- ir[20] == 0 so don't set psr
 |    |  | +-- ir[24:21] == 13 so mnemonic is "MOV"
 |    |  |       i.e., `RD <- `OPB
 |    |  |       i.e., r[1] <- 14
 |    |  +-- ir[25] == 1 so `OPB is immediate
 |    +-- ir[27:26] == 0 so it is data processing
 +-- ir[31:28] == 4'b1110 so it executes

The state names are the same as the ones in the PDP-8's ASM, except for the execute
states. In the ASM for the ARM, state EODP occurs when a data processing instruction
(such as ADD or SUB) executes, and state EOB occurs when a branch instruction (B)
executes.
10.7.2 Fetch states
State INIT initializes the program counter, halt and program status registers. The
machine will then proceed to state F1 and to state IDLE.

When a program executes, the normal sequence is to proceed through states F1, F2,
F3A, F3B and one of the execute states. State F2 increments the program counter by
four (rather than by one) because the program counter refers to an address in terms of
eight-bit bytes but each 32-bit instruction is actually four bytes long. In a related way,
when state F3A fetches an instruction from memory, the memory address is shifted
over two bits to the right because the program counter is four times the required memory
address (see footnote 4).

10.7.3 The condx function
The decoding (F3B) and executing (EODP, EOB or EOHLT) states for the ARM are
quite different than the analogous states for the PDP-8. First of all, every instruction on
the ARM has the potential of being conditional, which is why instruction register bits
31 down to 28 are reserved for this purpose. The first decision that occurs in state F3B
is whether the instruction should be nullified or not. On the actual ARM, this decision
involves sixteen possibilities. Even though we are only going to implement four of
these (4, 5, e and f), it is prudent to isolate this detail in a function which we will refer
to as condx(ir[31:28],psr).

In the actual hardware, there will be some combinational logic that implements this
function. The important observation is that whether an instruction is executed or is
nullified depends only on two things: instruction register bits 31 down to 28 and the
current information in the program status register (which, in this implementation, only
contains the N flag). Because these details have been isolated inside the condx
function, the other twelve conditions (0-3, 6-d) not considered here could be implemented
fairly easily without having to change this ASM.

After recognizing that the condition for the instruction has been satisfied, state F3B
proceeds to decode the instruction. If it is a data processing instruction (instruction
register bits 27 and 26 equal zero), the ASM proceeds to state EODP. If it is a branch
instruction (instruction register bits 27 down to 25 equal 5), the ASM proceeds to EOB.
If it is a SWI instruction, the machine proceeds to the PDP-8 like state EOHLT for the
purpose of communicating with the Verilog top-level module that will test this
machine. (As mentioned above, the actual ARM would do something more complicated
for SWI.)
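
A minimal sketch of how such a function might look is given below. It is not the book's listing: the macro names follow the `PL, `MI, `AL and `NV names that appear later in this chapter, the values 4, 5, e and f are the four conditions named above (in standard ARM encoding, 4 is MI, 5 is PL, e is AL and f is NV), and the default branch is an assumption about how unimplemented conditions could be reported:

```verilog
// Sketch of condx: decides whether an instruction executes (1) or is
// nullified (0), using only ir[31:28] and the N flag in psr[31].
module condx_demo;
  `define MI 4'h4   // execute if negative (N set)
  `define PL 4'h5   // execute if positive or zero (N clear)
  `define AL 4'he   // always execute
  `define NV 4'hf   // never execute

  function condx;
    input [3:0]  condtype;
    input [31:0] psr;
    begin
      case (condtype)
        `MI: condx = psr[31];
        `PL: condx = !psr[31];
        `AL: condx = 1;
        `NV: condx = 0;
        default: begin
                   condx = 0;   // assumption: treat as nullified
                   $display("condition not implemented");
                 end
      endcase
    end
  endfunction

  initial begin
    $display("%b", condx(`PL, 32'h80000000));  // N set, so PL is nullified
    $display("%b", condx(`AL, 32'h80000000));  // AL executes regardless
  end
endmodule
```

Adding any of the other twelve conditions would only mean adding case items here, which is the isolation benefit the text describes.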

4 The reason for this inconsistency only becomes apparent with some of the instructions we are ignoring,
such as LDR and STR, that use byte-sized data in memory.

`define RD r[ir[15:12]]

allows us to describe the destination register for a data processing instruction without
having to mention the instruction register bits.

As in C, parentheses are a good idea to avoid creating precedence problems when
Verilog substitutes such a complicated macro.

The definition of `OPB is even more involved because instruction register bit 25
allows the programmer to choose between an immediate value or a register value. The
same problem with R15 mentioned above also must be considered:

`define OPB (ir[25]?ir[7:0]:(ir[3:0]!=15? r[ir[3:0]]:r[ir[3:0]]+4))

In fact, there are other issues about `OPB that we are ignoring here. (The actual ARM
allows rotation of `OPB, which would require a more complicated expression for `OPB.)
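
To see the macros at work, the following sketch decodes the ADD R2,R2,1 example (e2822001) from section 10.6.1. `RD and `OPB are taken from the text; the `OPA shown here is a simplification that ignores the R15 special case, and the module name and the starting value of r[2] are assumptions for illustration only:

```verilog
// Sketch: the macros hide which instruction register bits select operands.
module macro_demo;
  reg [31:0] ir;
  reg [31:0] r [0:15];

  `define RD  r[ir[15:12]]
  `define OPA r[ir[19:16]]     // simplified: no R15 adjustment
  `define OPB (ir[25]?ir[7:0]:(ir[3:0]!=15? r[ir[3:0]]:r[ir[3:0]]+4))

  initial begin
    r[2] = 5;
    ir = 32'he2822001;         // ADD R2,R2,1
    `RD = `OPA + `OPB;         // expands to r[2] = r[2] + 1
    $display("r[2]=%0d", r[2]);
  end
endmodule
```

Because ir[25] is one, `OPB selects the immediate field ir[7:0], so r[2] goes from 5 to 6.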

10.7.5 Branch
State EOB performs the relative branch by adding four plus four times the signed offset
(from the low-order 24 bits of the instruction register) to the program counter.
This function formally describes the SUB, ADD and MOV instructions. Except for the
$display statement, this function could be synthesized into the combinational ALU
required in the actual hardware. (The $display statement warns us if we attempt to
execute a data processing instruction that is one of the 13 not implemented here.) This
function can be reused as we improve the performance of the design. Because these
details have been isolated into a function, it is easy for a designer to know where to
modify the Verilog code in order to implement the remaining 13 operations.
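
The listing that this paragraph describes is not reproduced here, so the following is only a sketch of what such a function could look like. The name dp and its argument order follow the calls dp(ir2[24:21],`OPA,`OPB) used later in this chapter; the decodings 2, 4 and 13 are the ARM opcodes for SUB, ADD and MOV; the default result of zero and the wrapper module are assumptions:

```verilog
// Sketch of the data processing (ALU) function for the three
// implemented mnemonics, with the N-flag function f from the text.
module dp_demo;
  function [31:0] dp;
    input [3:0]  opcode;      // ir[24:21]
    input [31:0] opa, opb;
    begin
      case (opcode)
        4'b0010: dp = opa - opb;   // SUB
        4'b0100: dp = opa + opb;   // ADD
        4'b1101: dp = opb;         // MOV
        default: begin
                   dp = 0;         // assumption: result for other opcodes
                   $display("other DP instructions...");
                 end
      endcase
    end
  endfunction

  function [31:0] f;
    input [31:0] dpres;
    begin
      f = dpres & 32'h80000000;    // keep only the sign bit (N flag)
    end
  endfunction

  initial begin
    $display("%h", dp(4'b0100, 32'd5, 32'd1));      // ADD: 5+1
    $display("%h", f(dp(4'b0010, 32'd1, 32'd4)));   // SUB: 1-4 is negative
  end
endmodule
```

The second call shows how the two functions compose: the negative difference leaves only bit 31 set in the value that would be loaded into the program status register.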
    end
endfunction
Again, isolating this in a function makes it easy to know how to implement the remain-
ing operations. Also, as will be shown later, defining this function will prove extremely
helpful as we use more sophisticated techniques to improve performance.
For our subset of the ARM, we only implement the N flag in the program status regis-
ter. The function which creates this information from the result of the ALU is trivial:
function [31:0] f;
input [31:0] dpres;
begin
f = dpres & 32'h80000000;
end
endfunction
Generating all bits of the program status register is considerably more complicated, but
isolating it here helps some future designer whose job might be to do so.

A great deal of the abstraction needed for this design comes from the Verilog macros
mentioned in the last section. For this multi-cycle implementation, these are:

10.8 Pipelined implementation
The problem with the multi-cycle implementation is that it requires five cycles per
instruction. To improve this performance, we can use a pipelined approach. There are
several reasons why a pipelined implementation of our ARM instruction subset will be
easier than the pipelined PDP-8 discussed in chapter 9. First, the ARM has a RISC
instruction set which was designed to be pipelined. Second, we are neglecting memory
reference instructions, and so the issues of operand fetch and data forwarding may be
ignored here. Third, we can reuse the functions defined above without modification.
Fourth, the Verilog macros given earlier can easily be redefined to match the needs of
the pipelined implementation.

Figure 10-2. Pipelined ASM for ARM subset.

5 Although the ARM's designers may someday redefine the meaning of this machine code to be something
other than NOP, f0000000 is convenient since it is easy to recognize.

Except for the "ir2 is B" case, these are identical to the multi-cycle ASM given in
section 10.7. In the "ir2 is B" case, four times the sign extended offset (`OFFSET4) is
added to the program counter. Here is where we see that the ARM was designed to
work with a three-stage pipeline. The reason that an offset of -2 means branch back to
itself is that by the time the branch instruction has reached the final stage of the
pipeline, the program counter will already have been incremented twice, i.e., it is eight
greater than when the branch was fetched. When the offset is -2, `OFFSET4 is -8 and
so adding it to the program counter in this case puts the program counter back to where
the same branch instruction will be fetched again.

... two are contradictory. It is impossible for ir2 to contain a branch or data processing
instruction that modifies R15 in the same clock cycle that it contains an SWI
instruction. Also, when "ir2 is B," the "normal" case cannot occur. When these cases are
eliminated, we are left with eight cases to consider. The "B/R15" case might occur in
parallel with either the "nullify," "dp set," "dp no set" or "ir2 is B" case. Alternatively,
the "normal" case might occur together with either the "nullify," "dp set," "dp no set"
or "SWI" case.

The "B/R15" and "normal" cases are the only places where the instruction registers are
scheduled to be assigned, and so there is no problem with them. The "dp set" case is the
only place where the program status register is scheduled to be assigned a value, and so
it is fine. Also, the "SWI" case is the only place where the halt flag is scheduled to be
assigned a value; thus we do not need to be concerned with it. The danger arises with
the program counter and `RD, since `RD could be r[15], which is the program counter.
To avoid this danger, we must leave the program counter alone in the "B/R15" case,
because the program counter is modified in parallel by the "dp no set," "dp set" or
"ir2 is B" cases of this Mealy ASM.

  forever
    begin
      @(posedge sysclk) enter_new_state(`F1);
      ...
      else
        begin
          if (condx(ir2[31:28],psr) &&
              ((ir2[27:25]==3'b101)
               || (ir2[27:26]==2'b00&&ir2[15:12]==4'b1111)))
            begin // "B/R15"
              ir1 <= @(posedge sysclk) 32'hf0000000;
              ir2 <= @(posedge sysclk) 32'hf0000000;
            end
          else
            begin // "normal"
              `PC <= @(posedge sysclk) `PC + 4;
              ir1 <= @(posedge sysclk) m[`PC>>2];
              ir2 <= @(posedge sysclk) ir1;
            end
          if (condx(ir2[31:28],psr))
            begin
              if (ir2[27:26] == 2'b00)
                begin // "dp set" or "dp no set"
                  `RD <= @(negedge sysclk)
                       dp(ir2[24:21],`OPA,`OPB);
                  if (ir2[20]) // "dp set"
                    psr <= @(posedge sysclk)
                       f(dp(ir2[24:21],`OPA,`OPB));
                end
              else if (ir2[27:25] == 3'b101) // "ir2 is B"
                `PC <= @(posedge sysclk) `PC + `OFFSET4;
              else if (ir2[27:24] == 4'b1111) // "SWI"
                halt <= @(posedge sysclk) 1;
              else
                $display("other instructions...");
            end
        end
    end

Some of the macros need to be redefined to take into account that ir2 is the final stage
of this pipeline:

`define RD      r[ir2[15:12]]
`define OPA     r[ir2[19:16]]
`define OPB     (ir2[25] ? ir2[7:0] : r[ir2[3:0]])
`define OFFSET4 {ir2[23],ir2[23],ir2[23],ir2[23],ir2[23],ir2[23],ir2[23:0],2'b00}

Interestingly, because the original ARM was designed with a three-stage pipeline in
mind, the definitions of `OPA and `OPB are simpler than for the multi-cycle
implementation. This simplification occurs since r[15] does not have to be explicitly
mentioned. The value of r[15] at the time the instruction is in the final stage of the
pipeline is, by definition, the correct value to use.

Execution of a data processing instruction involves non-blocking assignment to `RD,
which is a macro that substitutes the subscripted Verilog array, r[ir2[15:12]].
This non-blocking assignment therefore uses negedge rather than posedge to be
portable for the reasons explained in section 6.5.2. (Remember that, in this pipelined
implementation, ir2 changes every clock cycle.)

10.9 Superscalar implementation
The pipelined implementation given in the last section has a speed that approaches (but
never quite reaches) one clock cycle per instruction. Because ARM data processing
instructions have three register operands (`RD, `OPA and `OPB), one basic computation,
such as incrementing r[2], can be performed per clock cycle. Although this can
be up to three times faster than the pipelined single-accumulator design described in
chapter 9, it still is certain to be no better than the slowest special-purpose designs in
chapter 2 (such as section 2.2.2). Even for a simple algorithm like childish division, it
is often possible for more than one computation to occur in parallel (e.g., incrementing
r[2] in parallel with subtracting from r[1]). A pipelined general-purpose processor
only works because of quite a bit of parallel activity in the implementation of
fetch/execute. Even so, a pipelined general-purpose computer cannot exploit the
parallelism in an algorithm. Such parallelism can be exploited by special-purpose
hardware (such as section 2.2.7).

Since the designer of a general-purpose computer can never be certain how fast is "fast
enough," it would be desirable if the general-purpose computer could execute more
than one instruction in parallel. Such an approach, known as a superscalar
implementation, is an extension to the pipelined approach. Superscalar implementation is
considerably more complex than the pipelined approach because the hardware itself must
take seemingly sequential instructions and recognize when it is permissible for them to
execute in parallel. In essence, some of the intelligence and skill of the hardware
designer (as illustrated by the design alternatives of chapter 2) must be placed inside the
hardware itself. Because the hardware of a superscalar general-purpose computer will
never have as much information about the software algorithm as the designer of a
special-purpose computer has about the ASM, a superscalar general-purpose machine
will not be as fast as the best special-purpose hardware. Also, the complexities of
superscalar design mean its hardware cost may be many times the cost of the equivalent
but faster special-purpose machine. However, the economies of scale for
general-purpose computers have made superscalar processors viable.

m[0]    first instruction
m[4]    second instruction
m[8]    third instruction
m[12]   fourth instruction

Figure 10-3. Non-interleaved memory.

Although a dual-ported memory for instructions would allow fetching of two
instructions per clock cycle, such a memory is expensive. A cheaper alternative is to use
an interleaved memory. A simple interleaved memory stores half of the instructions in one
bank and the adjacent instructions in another:
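
The two-bank arrangement described above can be sketched as follows. The bank names, sizes and the decision to use pc>>3 as the index within each bank are assumptions for illustration (they presume the program counter is divisible by eight when a pair is fetched); each word simply stores its own byte address so the result is easy to recognize:

```verilog
// Sketch of a two-bank interleaved instruction memory.
module interleave_demo;
  reg  [31:0] bank0 [0:3];   // words at byte addresses 0, 8, 16, 24
  reg  [31:0] bank1 [0:3];   // words at byte addresses 4, 12, 20, 28
  reg  [31:0] pc;

  // both banks are read in the same cycle with the same index
  wire [31:0] first  = bank0[pc >> 3];
  wire [31:0] second = bank1[pc >> 3];

  integer i;
  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      bank0[i] = 8*i;        // contents = the word's own byte address
      bank1[i] = 8*i + 4;
    end
    pc = 8;
    #1 $display("%0d %0d", first, second);
  end
endmodule
```

With pc = 8 the two single-ported banks deliver the words from byte addresses 8 and 12 together, which is the effect a dual-ported memory would otherwise be needed for.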

    SUB  R2,R1,R4
    ADD  R2,R2,1

It might appear that data forwarding (of R1 minus R4) could be helpful here. Such an
approach would be algorithmically correct but would be slow. To make these
instructions execute in parallel, the clock period would have to be slow enough to allow
enough time for both the ADD and the SUB:

[Figure: R1 and R4 feed an ALU doing SUB; its forwarded value feeds a second ALU
doing ADD, which produces the new R2.]

Instead of data forwarding in a situation like this, it is better for the machine to execute
only one instruction per clock cycle. At least this way, the clock cycle remains fast. In
, sufficient only other words, it behaves like the simple pipeline approach of section 10.8. The hope is
ctions, they will that after executing these two instructions sequentially, the machine will fetch some
n address divis- independent instructions (like the ones shown earlier) that it can execute in parallel.
address. From a
Some programs have combinations of instructions that simply cannot be executed in
-ray notation for parallel:
im['PC>>2].
SWI
ADD R2,R2,1
The machine is supposed to halt (in our subset, at least) before the ADD instruction
executes. In such a situation, we have to revert to a one instruction per clock cycle
(simple pipeline) approach, which allows the machine to process the SWI in exactly
the order the programmer intends. On a machine that actually implements interrupts
(unlike our subset), exact processing of interrupts and similar issues are significant.

It is interesting to note that ren_tag is five, rather than four, bits wide. This is
required because, in addition to the sixteen user registers, we need to indicate when the
renamed register is not valid. To do so, the following constant is defined:

`define INVALID 16

When an instruction cannot be executed speculatively (as in the SWI example from
section 10.9.3), the machine assigns `INVALID to ren_tag. In the next clock cycle,
this will cause ren_val to be ignored.

On the other hand, when an instruction can be executed speculatively, the machine
assigns the destination register number to ren_tag, the condition upon which that
assignment succeeds to ren_cond and the potential new value of that register to
ren_val.
words, we are going to describe a special-purpose machine that only executes one
(nonsensical) algorithm, which we will state in terms of ARM mnemonics:

@(posedge sysclk) #1;
  r[2] <= @(posedge sysclk) r[1] - r[4];
  psr <= f(r[1] - r[4]);
  ren_val <= @(posedge sysclk) r[3] + r[3];
  ren_tag <= @(posedge sysclk) 3;
  ren_cond <= @(posedge sysclk) `PL;
@(posedge sysclk) #1;
  ren_cond <= @(posedge sysclk) `NV;
  if ((ren_cond == `PL)&&(psr[31]==0) ||
      (ren_cond == `MI)&&(psr[31]==1) ||
      (ren_cond == `AL))
    r[3] <= @(posedge sysclk) ren_val+1; //renamed
  else
    r[3] <= @(posedge sysclk) r[3]+1; //not renamed

In parallel to the subtraction during the first clock cycle, the doubling of r[3] occurs
before the machine can know whether the difference will be positive. Therefore, the
machine saves the doubled value in ren_val, and at the same time makes note in
ren_cond of the condition (`PL) under which this speculative doubled result is to be
renamed as r[ren_tag]. In the second clock cycle, after the psr resulting from the
subtraction is valid, the machine makes a decision whether or not renaming occurs. If it
does not, incrementation of r[3] occurs based on the value already stored in the register
file from two or more clock cycles ago. If renaming does occur, there is a literal
substitution of ren_val for r[3] in this clock cycle. Regardless of whether renaming
occurs in the second clock cycle, ren_cond is set to `NV because the NOP will not cause
any renaming in the third cycle (not shown).

10.9.5.2 Second special-purpose renaming example
Let's consider a second example, similar to the last one, except the destination of the
third instruction (shown in bold) is not the same as the destination of the ADDPL
instruction that executes speculatively:

SUBS  R2,R1,R4 ;//sets psr
ADDPL R3,R3,R3 ;//speculative
ADD   R6,R3,1  ;//R3 same but not dest
NOP            ;//NOP to simplify discussion

Again, there is no problem when all we want to do is to carry out the same register
transfers as would occur when the above instructions execute on the pipelined
implementation of the general-purpose ARM given in section 10.8.1:

@(posedge sysclk) #1;
  r[2] <= @(posedge sysclk) r[1] - r[4];
  psr <= f(r[1] - r[4]);
@(posedge sysclk) #1;
  if (psr[31]==0)
    r[3] <= @(posedge sysclk) r[3] + r[3];
@(posedge sysclk) #1;
  r[6] <= @(posedge sysclk) r[3] + 1;
@(posedge sysclk) #1;

Of course, things get more interesting when we use speculative execution and register
renaming. The register transfers of the special-purpose machine below are similar to
those carried out when the equivalent instructions execute on the general-purpose
superscalar ARM given in section 10.9.6:

@(posedge sysclk) #1;
  r[2] <= @(posedge sysclk) r[1] - r[4];
  psr <= f(r[1] - r[4]);
  ren_val <= @(posedge sysclk) r[3] + r[3];
  ren_tag <= @(posedge sysclk) 3;
  ren_cond <= @(posedge sysclk) `PL;
@(posedge sysclk) #1;
  ren_cond <= @(posedge sysclk) `NV;
  if ((ren_cond == `PL)&&(psr[31]==0) ||
      (ren_cond == `MI)&&(psr[31]==1) ||
      (ren_cond == `AL))
    begin
      r[ren_tag] <= @(posedge sysclk) ren_val; //renamed
      r[6] <= @(posedge sysclk) ren_val+1;
    end
  else
    r[6] <= @(posedge sysclk) r[3]+1; //not renamed

The first clock cycle is identical to the speculative example in section 10.9.5.1; thus the
speculative doubling of r[3] occurs before the machine knows whether the difference
of r[1] and r[4] will be positive. Again, ren_val will contain the doubled value
and ren_cond will indicate the condition (`PL) when ren_val is to be renamed as
r[ren_tag]. In the second clock cycle, after the psr resulting from the subtraction
is valid, the machine makes a decision whether or not renaming occurs. If it does not,
the assignment to r[6] occurs based on the value of r[3] already in the register file
from two or more clock cycles ago. If renaming does occur, the situation is quite different
from the example of section 10.9.5.1. In this example, the destination of the third
instruction (r[6]) is different from the destination of the speculative instruction (r[3]).
There is still a literal substitution of ren_val for r[3], but there must also be storage .
Figure 10-6.

As of 1997, despite its suitability for superscalar implementation, ARM had not yet introduced such a
version of its processor, instead focusing on low-cost, low-power versions that use only pipelining.
Case 5 is interesting because it is the reason for using a renamed register. The value
scheduled to be assigned to the renamed register is the result from the parallel ALU.
The function computed by this ALU is based on ir1[24:21]. The other ALU uses
ir2[24:21]. The condition that says whether the value in the renamed register will
actually be used in the next clock cycle comes from ir1[31:28] in this clock cycle.
The tag for the renamed register is scheduled to become the register specified as the
destination in this instruction (ir1[15:12]). At the next rising edge of the clock
after case 5 occurs, ren_tag will be the register number that will be modified if this
speculatively executed instruction actually executes; ren_cond will indicate whether
the register indicated by ren_tag should change based on the program status register
in this next clock cycle and ren_val will be that new value.

There is a hidden detail in the depend function that relates to case 5. The depend
function prevents parallel execution if both the instructions in ir1 and ir2 set the
program status register. Because of this, distinguishing between case 5 versus cases 6
and 7 is simply a matter of looking at ir1[20]. If ir1[20] and ir2[20] indicate that
both instructions will modify the program status register, case 2 applies, and the
instructions will execute sequentially. The reason for this is that both instructions cannot
modify the program status register in the same clock cycle, but it is acceptable for each
of them to modify the program status register in sequence.

If by the point of the decision ir1[20] indicates that this instruction will modify the
program status register, we know that ir2 will not. This means the program status
register in the current clock cycle (rather than in the next clock cycle as was the
situation for case 5) accurately reflects the information needed to decide whether ir1 will
execute. Therefore, the decision to choose between cases 6 and 7 is
condx(ir1[31:28],psr). Note once again the advantage of being able to reuse
this Verilog function. If it is known in this clock cycle (case 6) that the instruction will
execute, ren_cond will become 4'b1110 (always) in the next clock cycle, rather
than whatever condition was present in ir1[31:28]. If it is known in this clock
cycle (case 7) that the instruction will not execute, ren_cond will become 4'b1111
(never) in the next clock cycle. This way we can use the same hardware that
implements speculative execution also to handle cases 6 and 7.

The reason we cannot use speculative execution here (i.e., making ren_cond be
ir1[31:28]) is that case 6 changes the program status register. If a conditional
instruction that changes the program status register (such as ADDPLS) executes due to
the current program status information, it is possible register renaming will fail to
happen in the next clock cycle because the condition is no longer true. That would prevent
an instruction that is supposed to execute from actually executing. Therefore, cases 6
and 7 evaluate condx with the current program status register and communicate this
unambiguously into the next clock cycle with the 4'b1110 or 4'b1111.
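
The choice of the next ren_cond in cases 5, 6 and 7 can be sketched as a small stand-alone function. This is only an illustration of the rule described above: next_ren_cond and condx_sketch are hypothetical names, and condx_sketch is a deliberately reduced stand-in for condx that understands only the MI, PL, always and never conditions and looks only at the N flag (psr[31]).

```verilog
module ren_cond_sketch;
  // ARM condition codes as they appear in instruction bits 31:28
  parameter MI = 4'b0100, PL = 4'b0101, AL = 4'b1110, NV = 4'b1111;

  // Reduced stand-in for condx (assumption: only MI, PL, AL and NV are handled)
  function condx_sketch;
    input [3:0] cond;
    input [31:0] psr;
    begin
      case (cond)
        MI:      condx_sketch = psr[31];
        PL:      condx_sketch = ~psr[31];
        AL:      condx_sketch = 1'b1;
        default: condx_sketch = 1'b0;
      endcase
    end
  endfunction

  // Value that ren_cond takes in the next clock cycle for cases 5, 6 and 7
  function [3:0] next_ren_cond;
    input [31:0] ir1, psr;
    begin
      if (!ir1[20])
        next_ren_cond = ir1[31:28];   // case 5: decide next cycle (speculative)
      else if (condx_sketch(ir1[31:28], psr))
        next_ren_cond = AL;           // case 6: known now that it executes
      else
        next_ren_cond = NV;           // case 7: known now that it is nullified
    end
  endfunction

  initial begin
    // 32'h50833003 is ADDPL R3,R3,R3; 32'h50933003 is ADDPLS R3,R3,R3
    $display("case 5: %b", next_ren_cond(32'h50833003, 32'h00000000)); // 0101
    $display("case 6: %b", next_ren_cond(32'h50933003, 32'h00000000)); // 1110
    $display("case 7: %b", next_ren_cond(32'h50933003, 32'h80000000)); // 1111
  end
endmodule
```

The sketch makes the asymmetry visible: only when bit 20 is clear is the original condition field carried forward, because only then can the decision safely wait for the next program status register.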
Cases 5, 6 and 7 have quite a few things in common. In each case, two data processing
instructions execute in parallel. This means the program counter needs to increment by
eight rather than four. Also two instructions need to be fetched in parallel from the
interleaved memory. In each case, ren_val is computed, whether or not it will actually
be used later. In theory, for case 7, ren_val need not be computed, but it is easier
(and harmless) to do so.

Cases 3 and 4 deal with a branch instruction in ir1. If we reach case 3 or 4, we know
(because of the depend function) that the instruction in ir2 will not affect the program
status register. (If it does, case 2 applies instead.) Therefore, the decision whether
to take the branch can be based on condx(ir1[31:28],psr). The reason is analogous
to the decision for cases 6 and 7. If the branch instruction is nullified (case 4), the
program counter is incremented by eight and two instructions are fetched (as in cases 5,
6 and 7). If the branch instruction in ir1 occurs (case 3), the instruction pipeline fills
with NOPs and the program counter changes (by adding `POFFSET4 + 4, similar to
the `OFFSET4+4 in the multi-cycle implementation). In either case 3 or case 4, the branch
instruction does not modify a user register; thus the tag for the renamed register
becomes `INVALID.

The fourth condition above is a form of hazard known as Write After Write (WAW).
This is a situation that has been warned against throughout this entire book: you cannot
have two non-blocking assignments to the same register in the same clock cycle. As
explained in section 10.9.7.1, the ASM is designed with the understanding that this
situation will never occur in case 5; thus the depend function must cause the ASM to
handle such situations in case 2 (i.e., ir2 and ir1 will execute sequentially).

The final three situations deal with instructions for which the ASM was not designed to
execute in parallel. Here is the Verilog function that detects these seven conditions that
cause the ASM to proceed to case 2:

function depend;
  input [31:0] ir1,ir2,psr;
  begin
    depend=(ir2[15:12] == ir1[19:16]
            && ir2[27:26] == 2'b00 && ir1[27:26] == 2'b00
            && condx(ir2[31:28],psr))//POPA bad (RAW)
        || (ir2[15:12] == ir1[3:0] && ir1[25]==0
            && ir2[27:26] == 2'b00 && ir1[27:26] == 2'b00
            && condx(ir2[31:28],psr))//POPB bad (RAW)
        || (ir2[20]
            && ir2[27:26] == 2'b00
            && condx(ir2[31:28],psr)
            && ir1[31:28] != 4'b1110
            && ir1[27:26] != 2'b00) //psr bad(RAW)non-dp
        || (ir1[20] && ir1[27:26] == 2'b00
            && ir2[20] && ir2[27:26] == 2'b00
            && condx(ir2[31:28],psr))//psr bad(WAW) dp
        || ((ir1[27:26] != 2'b00)
            &&(ir1[27:25] != 3'b101))//ir1 not dp or branch
        || ((ir2[27:26] != 2'b00)
            &&(ir2[27:25] != 3'b101))//ir2 not dp or branch
        || (ir1[27:26] == 2'b00 //ir1 has PC as ALUop
            && ((ir1[3:0] == 4'b1111 && ir1[25]==1'b0)
                ||ir1[15:12] == 4'b1111
                ||ir1[19:16] == 4'b1111));
  end
endfunction

Since the goal is to execute as many instructions in parallel as can be executed
correctly, it is useful to ignore instructions that are known to be nullified. Since we know
with certainty whether ir2 will be nullified (based on the current program status
register), conditions a-d (which mention ir2) can be ANDed with
condx(ir2[31:28],psr). This means the depend function only slows the machine
to one instruction per clock cycle when it is actually necessary. For example, the
following two instructions

ADDPL R1,R1,1
ADD   R2,R1,1

can be processed in parallel if the ADDPL is nullified but must execute sequentially if
the ADDPL is not nullified.

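The effect of ANDing with the condition can be tried out on just the first term of depend. The sketch below is not the full function: raw_opa and condx_sketch are hypothetical names, and condx_sketch is a reduced stand-in for condx that handles only MI, PL and the always condition using the N flag in psr[31]. The two instruction words are hand-assembled encodings of the ADDPL and ADD shown above.

```verilog
module raw_check_sketch;
  parameter MI = 4'b0100, PL = 4'b0101, AL = 4'b1110;

  // Reduced stand-in for condx (assumption: only MI, PL and AL are handled)
  function condx_sketch;
    input [3:0] cond;
    input [31:0] psr;
    begin
      case (cond)
        MI:      condx_sketch = psr[31];
        PL:      condx_sketch = ~psr[31];
        AL:      condx_sketch = 1'b1;
        default: condx_sketch = 1'b0;
      endcase
    end
  endfunction

  // First term of depend only: ir2 writes the register ir1 reads as operand A,
  // and ir2 is not going to be nullified.
  function raw_opa;
    input [31:0] ir1, ir2, psr;
    begin
      raw_opa = (ir2[15:12] == ir1[19:16]
                 && ir2[27:26] == 2'b00 && ir1[27:26] == 2'b00
                 && condx_sketch(ir2[31:28], psr));
    end
  endfunction

  initial begin
    // ir2 = ADDPL R1,R1,1 (32'h52811001), ir1 = ADD R2,R1,1 (32'he2812001)
    $display("N=0: %b", raw_opa(32'he2812001, 32'h52811001, 32'h00000000)); // 1
    $display("N=1: %b", raw_opa(32'he2812001, 32'h52811001, 32'h80000000)); // 0
  end
endmodule
```

With N clear the ADDPL will execute, so the dependency is reported and the pair must run sequentially; with N set the ADDPL is nullified and the same pair may run in parallel.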
10.9.8.2 Translating the ASM to Verilog
Once all the macros and functions are defined, it is easy to translate the ASM to Verilog.
For example, the following is the beginning portion of the Verilog code corresponding
to the first parallel activity of state F1 (parallel and speculative execution):

begin
  if (condx(ir2[31:28],psr) &&
      ((ir2[27:25] == 3'b101) ||
       (ir2[27:26] == 2'b00 &&
        ir2[15:12] == 4'b1111)))
    begin
      ir1 <= @(posedge sysclk) 32'hf0000000;
      ir2 <= @(posedge sysclk) 32'hf0000000;
      ren_tag <= @(posedge sysclk) `INVALID;
      `ifdef DEBUG
        $display(
          " 1. ir2 branch or R15 prevents ||",$time);
        cover(1);
      `endif
    end
  else ...

by careless designers, is the test code. This code, sometimes referred to as the testbench,
exercises the Verilog that simulates the hardware. Ideally, we would like to try every
possible case. For tiny special-purpose machines, such as the 12-bit childish division
test code example in section 3.7.3, this is barely possible. For a general-purpose
machine, it is impossible to test everything. The usefulness of simulation, however,
depends on how completely the Verilog code that simulates hardware has been tested by
the Verilog test code. It does not do any good to test the same correct Verilog statement
a million times but ignore another statement that has a bug in it. The advantage of
Verilog is that its software-like statements can be used to warn the designer that parts of
the Verilog code that simulates hardware have not been tested.

The superscalar implementation given in the last section is far more complex than any
of the earlier designs in this book. It is not feasible to test every possible program to see
if it works correctly. We will create several programs, but then Verilog will inform us
whether all of the cases we are interested in have been tested. The reason for doing this
is that the designer will more than likely make a mistake in guessing what cases a
moderately complex program will test. The operation of the superscalar machine is so
counterintuitive (even on a small program) that it is better for Verilog to keep track of
what is being tested.

`ifdef DEBUG
  $display(
    " 1. ir2 branch or R15 prevents ||",$time);
  cover(1);
`endif

`define DEBUG

Note that `ifdef is different than an if statement (where the tasks would be compiled,
but might not execute). In particular, `ifdef can be used to alter which control
statements are compiled into the code. For example, cases 9 and 10 of the renaming
parallel activity do nothing:

`ifdef DEBUG
  else
    begin
      $display("10. dp overwrites renamed r%d",
        ren_tag,$time);
      cover(10);
    end
`endif

If `DEBUG is not defined, there is no need for the else begin ... end to be
compiled. The above shows how the scope of statements that are conditionally compiled
can cross begin end boundaries. This is possible because the substitution occurs
at compile time.

In addition to calling on these tasks, the cover task has to be defined. We only want to
define it if the `DEBUG macro is defined. Therefore, the task and everything associated
with it will be enclosed in the `ifdef:

`ifdef DEBUG
  reg [`MAX_CASENO:0] coverage_set;

  task cover;
    input caseno;
    integer caseno;
    begin
      coverage_set = coverage_set |
        ((1 << `MAX_CASENO) >> caseno);
    end
  endtask

  initial
    begin
      coverage_set = 0;
      wait(halt === 1'b0);
      wait(halt === 1'b1);
      $display("coverage=%b",
        coverage_set[`MAX_CASENO-1:0]);
    end
`endif

After the program has halted, the initial block prints out the coverage set. The more
cases of the Verilog code that were covered by the program, the more ones there will be
in the coverage set. We will run several programs in order to obtain complete coverage.
(In reality, we have not considered enough test cases here to have total confidence in
this design, but this task can be expanded to cover an arbitrary number of cases.)
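
To see how case numbers map onto printed bits, the following stand-alone sketch reuses the shift expression from the cover task with an assumed `MAX_CASENO of 10. It is detached from the machine (there is no halt signal), and it declares the task input as input integer, which is Verilog-2001 syntax rather than the two-line declaration used above.

```verilog
`define MAX_CASENO 10

module coverage_sketch;
  reg [`MAX_CASENO:0] coverage_set;

  // Same bookkeeping as the cover task above: case 1 sets the leftmost
  // printed bit, case `MAX_CASENO sets the rightmost one.
  task cover;
    input integer caseno;
    begin
      coverage_set = coverage_set | ((1 << `MAX_CASENO) >> caseno);
    end
  endtask

  initial begin
    coverage_set = 0;
    cover(1); cover(2); cover(5); cover(6); cover(8); cover(10);
    $display("coverage=%b", coverage_set[`MAX_CASENO-1:0]);
    // prints: coverage=1100110101
  end
endmodule
```

Reading the printed string from left to right therefore gives cases 1 through 10 in order, so a zero in position n means that case n was never reached.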

10.9.9 Test programs
The special-purpose machines in chapters 4 and 5 are easier to test than a general-purpose
machine because the test code simply has to supply test data to the machine
being tested. A special-purpose machine is supposed to follow some algorithm that
manipulates the data, and it is often easy to tell if the result is correct. A general-purpose
machine implements a (sometimes intricate) variation of fetch/execute which in
turn interprets a program that manipulates the data. It is much harder to tell if a
general-purpose machine is correct.

A Verilog macro, `PROGRAM1, is defined when this is the program we want to use to
test the machine. This test code can be used with any of the implementations. For
example, the pipelined implementation produces the following:

PC=00000024 IR1=eafffffe IR2=e1a0f001 N=0 1251
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
PC=0000000c IR1=f0000000 IR2=f0000000 N=0 1351
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
PC=00000010 IR1=e08f3000 IR2=f0000000 N=0 1451
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
PC=00000014 IR1=e080400f IR2=e08f3000 N=0 1551
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
PC=00000018 IR1=e3b0e0ff IR2=e080400f N=0 1651
r0=fffffff8 r1=0000000c r2=00000010 r3=0000000c r4=00000018

Notice the contents of the registers at $time 1251 and the NOPs in the pipeline at
$time 1351. Also notice the contents of the registers at $time 1651. On the other
hand, the superscalar implementation produces equivalent results in an entirely
different way:

PC=00000024 IR1=eafffffe IR2=e1a0f001 N=0 1051
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
 1.ir2 branch or R15 prevents || 1051
other DP instructions...
 5.use || ALU noS ir1=f0000000 A=fffffff8 B=fffffff8 1151
PC=0000000c IR1=f0000000 IR2=f0000000 N=0 1151
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
PC=00000014 IR1=e080400f IR2=e08f3000 N=0
 cond=f,renR 0 =00000000 1251
r0=fffffff8 r1=0000000c r2=00000010 r3=00000014 r4=00000018
 2.depend prevents || 1251
 9.nullify renamed r 0 Idp 1251
 6.use || ALU S ir1=e3b0e0ff A=fffffff8 B=000000ff 1351
PC=00000018 IR1=e3b0e0ff IR2=e080400f N=0 1351
r0=fffffff8 r1=0000000c r2=00000010 r3=0000000c r4=00000018
PC=00000020 IR1=e1a0f001 IR2=e2400008 N=0

Notice the values of the registers at $time 1051 and the NOPs in the pipeline at
$time 1151. Also notice the value of the registers at $time 1451. These correspond
to what was visible in the pipelined implementation at $time 1251, 1351 and 1651.
This program does not execute much faster on the superscalar implementation than on
the pipelined implementation because the superscalar implementation properly
prevents execution of more than one instruction per clock cycle when R15 is involved.
This program is important as a means of testing the Verilog code because it illustrates
that case 2 happens due to the following portion of the depend function:

|| (ir1[27:26] == 2'b00
    && ((ir1[3:0] == 4'b1111 && ir1[25]==1'b0)
        ||ir1[15:12] == 4'b1111
        ||ir1[19:16] == 4'b1111));//ir1 has PC as ALUop

The coverage set for this program is 1100110101; in other words, cases 1, 2, 5, 6, 8 and
10 are covered. Notice how helpful the output from the $displays is in annotating
why these cases occur.

assembly language programmer refers to, but these are r[1] and r[2] in Verilog.
Also, it will be convenient for y to reside in R4. The implementations of this algorithm
given in chapters 8 and 9 made use of the accumulator of the PDP-8 to contain the
difference. We need to have a similar register on the ARM. In the following program,
let us use R0 to serve the same role as the accumulator. This illustrates an important
property of all RISC machines (not just the ARM): there is nothing special about R0;
we could have chosen any other available register to hold the difference. The following
is an ARM program that implements this algorithm in the most straightforward way
possible:

00000000/e3a0100e     MOV  R1,0x0e
00000004/e3a04007     MOV  R4,0x07
00000008/e3a02000     MOV  R2,0x00
0000000c/e0510004  L1 SUBS R0,R1,R4
00000010/4a000002     BMI  L2
00000014/e1a01000     MOV  R1,R0
00000018/e2822001     ADD  R2,R2,0x01
0000001c/eafffffa     B    L1
00000020/ef000000  L2 SWI

The above is analogous to the PDP-8 program given in section 8.3.2.5.3. The first three
MOV instructions set up R1, R4 and R2 to their initial values of x (14), y (7) and zero,
respectively. The SUBS is the only instruction that sets the program status register. The
purpose of the SUBS instruction is twofold: to compute the difference and to see if
R1>=R4. The BMI makes use of this program status information. As long as R1 >= R4,
the BMI is nullified and the loop continues. The difference would then be moved from
R0 to R1, and R2 is incremented. The unconditional branch to the label L1 causes the
test at the top of the loop to happen again. This loop repeats while the difference (in R0)

PC=00000014 IR1=4a000002 IR2=e0510004 N=0 2251
r0=00000000 r1=00000000 r2=00000002 r3=xxxxxxxx r4=00000007
PC=00000018 IR1=e1a01000 IR2=4a000002 N=1 2351
r0=fffffff9 r1=00000000 r2=00000002 r3=xxxxxxxx r4=00000007
PC=00000020 IR1=f0000000 IR2=f0000000 N=1 2451
r0=fffffff9 r1=00000000 r2=00000002 r3=xxxxxxxx r4=00000007
PC=00000024 IR1=ef000000 IR2=f0000000 N=1 2551
r0=fffffff9 r1=00000000 r2=00000002 r3=xxxxxxxx r4=00000007
PC=00000028 IR1=xxxxxxxx IR2=ef000000 N=1 2651
r0=fffffff9 r1=00000000 r2=00000002 r3=xxxxxxxx r4=00000007

The loop executes two times. The third execution of the SUBS ($time 2251) produces
a negative number (fffffff9), which sets the N flag. This in turn causes the BMI to
branch. Not shown earlier, the BMI had been nullified. On the other hand, running
`PROGRAM5 on the superscalar implementation produces equivalent results in an entirely
different way:

00000000/e3a0100e     MOV  R1,0x0e
00000004/e3a02000     MOV  R2,0x00
00000008/e3a04007     MOV  R4,0x07
0000000c/e0510004     SUBS R0,R1,R4
00000010/4a000003     BMI  L2
00000014/e1a01000  L1 MOV  R1,R0
00000018/e2822001     ADD  R2,R2,0x01
0000001c/e0510004     SUBS R0,R1,R4
00000020/5afffffb     BPL  L1
00000024/ef000000  L2 SWI

Running on the pipelined implementation, this program (let's refer to it as `PROGRAM4)
produces at $time 2051 the same results that `PROGRAM5 produces (also running
on the pipelined implementation) at $time 2651. This illustrates that to make good
use of a pipelined machine, a good compiler is essential. Manually created assembly
language programs are often not as effective as the automatically created machine
language from compilers. Running on this superscalar implementation, this program
produces at $time 1451 the same results that `PROGRAM5 (also running on the
superscalar implementation) produces at $time 1651.

The Verilog coverage set for the superscalar run of `PROGRAM4 is 1100110110.
`PROGRAM4 does not add to the coverage of the Verilog code provided by `PROGRAM5
(1110110111); thus we need an additional test program to cover cases 4 and 7.

for(i=0;i<3;i++)
  {
    r1 = r1 - y;
    r2 = r2 + 1;
  }

the compiler knows a priori how many times the loop will execute; thus the unrolled
code looks like:

r1 = r1 - y;
r2 = r2 + 1;
r1 = r1 - y;
r2 = r2 + 1;
r1 = r1 - y;
r2 = r2 + 1;

SUB R1,R1,R4
ADD R2,R2,0x01
SUB R1,R1,R4
ADD R2,R2,0x01
SUB R1,R1,R4
ADD R2,R2,0x01

If the program has a for loop that repeats too many times for it to be practical to unroll
the loop completely, it can be partially unrolled. For example:

for(i=0;i<1000;i++)
  {
    r1 = r1 - y;
    r2 = r2 + 1;
    r1 = r1 - y;
    r2 = r2 + 1;
    r1 = r1 - y;
    r2 = r2 + 1;
  }

which would incur the branch penalty one-third as often.

The difficulty with the childish division algorithm (and with many practical programs)
is that we do not know before we run the program how many times the loop will
execute. (In the case of childish division, the number of times the loop will execute is the
answer we are trying to compute.)

Here is where conditional data processing instructions come in handy. Assuming we
do not care about the result in r1, the childish division algorithm can be partially
unrolled as the following C code:

r1=x;
r2=0;
do
  {
    r1 = r1 - y;
    if (r1>=0) r2 = r2 + 1;
    if (r1>=0) r1 = r1 - y;
    if (r1>=0) r2 = r2 + 1;
  } while (r1>=0);

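As a quick sanity check of the partially unrolled loop, the following behavioral sketch runs it for x from 0 to 42 with y fixed at 7 and compares r2 with x/y. It models only the C code above, not the ARM program; the task name unrolled_body and the errors counter are hypothetical. Because Verilog (before SystemVerilog) has no do ... while, the body is called once and then repeated by an ordinary while.

```verilog
module unrolled_division_sketch;
  integer x, y, r1, r2, errors;

  // One pass through the partially unrolled loop body.
  task unrolled_body;
    begin
      r1 = r1 - y;
      if (r1 >= 0) r2 = r2 + 1;
      if (r1 >= 0) r1 = r1 - y;
      if (r1 >= 0) r2 = r2 + 1;
    end
  endtask

  initial begin
    y = 7;
    errors = 0;
    for (x = 0; x <= 42; x = x + 1) begin
      r1 = x;
      r2 = 0;
      unrolled_body;                  // the body always runs at least once
      while (r1 >= 0) unrolled_body;  // then repeats while r1 is not negative
      if (r2 != x/y) begin
        errors = errors + 1;
        $display("mismatch: x=%0d r2=%0d", x, r2);
      end
    end
    $display("errors=%0d", errors);   // prints: errors=0
  end
endmodule
```

Once r1 goes negative, every remaining guarded statement in the body is skipped, which is why r2 counts exactly the subtractions that left a non-negative difference.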
The reason that the ARM provides the ability to either set the program status register
(bit 20 equal to one) or leave the program status register alone (bit 20 equal to zero) is
so that several instructions can be made conditional on the same condition. In this
program, the SUBS (and possibly the SUBPLS) determine whether R1 >= 0. The ADDPL
and SUBPLS instructions use this program status information to decide whether to
execute. Since the pipelined and superscalar implementations allow execution of
conditional data processing instructions without branch penalties, such techniques can
often speed up a program.

The Verilog coverage set for the superscalar run of `PROGRAM3 is 0110110110.
`PROGRAM3 does not add to the coverage of the Verilog code provided by `PROGRAM5
(1110110111); thus we need to do a different test to cover cases 4 and 7. One such test
is identical to `PROGRAM3, except R1 is loaded with 6 rather than 14 as the value of
x.

The coverage set for the superscalar run of this modified program is 0101111110, which
covers cases 4 and 7. Therefore, all of the ten cases identified in the source code have
been tested at least once. This is not to say that the overall design is correct, but at least
we know we have checked all the Verilog statements translated from the original ASM.
Verilog has helped us make sure that all the code has been covered.

`ifdef DIVTEST
  cont = 0;
  t = 0;
  for (x=0; x<=42; x = x + 1)
    begin
      arm7_machine.m[0] =
        (arm7_machine.m[0] & 32'hffffff00) + x;
      arm7_machine.r[15] = 0;
      #200 cont = 1;
      #100 cont = 0;
      #400 wait(arm7_machine.halt);
      if (arm7_machine.r[2] != x/7)
        $display("error");
      $display("x=%d cl=%d r2=%d %d",x,
        ($time-t)/100,arm7_machine.r[2], $time);
      t = $time;
    end
  $finish;
`endif

The above Verilog modifies the MOV immediate at address 0 to initialize different
values of x, that range from 0 to 42 and causes the arm7_machine to run the modified
program. If the quotient (in r[2]) is not erroneous, the Verilog code simply prints
the number of clock cycles (of period 100) elapsed since the machine language program
started running for the given value of x. To use the above code, `DIVTEST, as
well as the macro for the desired machine language program (`PROGRAM3, `PROGRAM4
or `PROGRAM5), must be defined. When each of these three programs is
run on each of the three implementations (multi-cycle, pipelined and superscalar), we
obtain the following data:

P 5    13  20  27  34  41  48  ...  7*quotient + 13
P 4    13  15  21  27  33  39  ...  6*quotient + 9
P 3    14  14  21  21  35  35  ...  3.5*quotient + 14
S 5     9  13  17  21  25  29  ...  4*quotient + 9
S 4    10  11  15  19  23  27  ...  4*quotient + 7
S 3     9   9  13  13  17  17  ...  2*quotient + 9

The column on the left ("run") indicates which program (3, 4 or 5) was run on which
machine ("M" for multi-cycle, "P" for pipelined, or "S" for superscalar). For example,
"S 3" indicates `PROGRAM3 was run on the superscalar implementation. The column
on the right is an equation of an upper bound on this data for quotient > 0. Let us
look back at the interesting journey we have traveled with our old friend, the childish
division algorithm. The following table summarizes implementations of the childish
division algorithm given in this and earlier chapters:
10.13 Exercises
10-1. The following Verilog code for a special-purpose machine describes the register
transfers carried out by four ARM instructions followed by two NOPs run on the
superscalar general-purpose ARM. What are these four instructions?
10-3. In problem 10-1, which of the seven cases described in section 10.9.7.1 applies to
each state of the special-purpose machine?
10-4. In problem 10-1, which of the instructions is executed speculatively?
The external input is `OFFSET4 in the pipelined version and `OFFSET4+4 in the
multi-cycle version. Draw a block diagram that implements this synchronous register file.
10-6. Using a register file of the kind given in problem 10-5, design an architecture for
the multi-cycle ARM subset given in section 10.7, and give the corresponding mixed
ASM.
10-7. Using a register file of the kind given in problem 10-5, design an architecture for
the pipelined ARM subset given in section 10.8.1, and give the corresponding mixed
ASM.
10-8. The register file for the superscalar ARM is a multi-port memory with four read
ports and two write ports. The program counter (r[15]) must be able to be incremented
(by 4 or 8) or loaded (with r[15] plus an externally supplied 26-bit signed value plus
either 0 or 4) independently of the operations that occur on the other ports. Assume that
the register file has command inputs ldPC, incPC and plus4PC to deal with these
special operations:
Draw a block diagram that implements this synchronous register file.
10-9. The interleaved memory described in section 10.9.2 has two conventional memories.
For this problem, since the program does not change during execution, we will
replace these memories with ROMs (oddjm and evenm). One of the ROMs is for
words whose addr/4 is odd. The other ROM is for words whose addr/4 is even.
The problem is we cannot predict whether the CPU will need the odd and even instructions
fetched into ir1 and ir2 or vice versa. Give a block diagram for the interleaved
memory that overcomes this problem using three muxes and an incrementor in addition
to the ROMs.
10-10. Using a register file of the kind given in problem 10-8 and an interleaved memory
of the kind described in problem 10-9, design an architecture for the superscalar ARM
subset given in section 10.9.6, and give the corresponding mixed ASM.
10-11. As explained in appendix G, the ARM is actually a Princeton machine, which
stores its program and data in the same memory. Like many other RISC machines, the
ARM does not allow computation on values in memory. Rather, it only allows load and
store instructions. The two most important instructions of this kind are LDR
(ir[27:26]==1&ir[20]==1) and STR (ir[27:26]==1&ir[20]==0). There
are several addressing modes available, but for this problem only consider the simple
indexed addressing mode (ir[24:21] can be ignored in this problem) that accesses
m[`OPA+`OPB]. Assuming a single-port memory of the kind described in section
8.2.2.3.2, give multi-cycle behavioral Verilog to implement such LDR and STR
instructions along with the other instructions described in 10.7. Create appropriate test
code.
10-12. Assuming a multi-port memory of the kind described in section 9.6, modify the
pipelined behavioral Verilog of section 10.8.7 to implement the LDR and STR instructions
described in problem 10-11. Create appropriate test code. Unlike chapter 9, operand
fetch does not occur until the execution stage of the pipeline, because the ARM has
a load instruction (LDR), rather than the addition instruction (TAD) of the PDP-8 which
required an extra stage to complete. This is important because `OPA or `OPB may not
be available until that final clock cycle. For the same reason, a STR followed by a LDR
from the same address will not require forwarding.
10-13. Rework problem 10-7 to support the instructions of problem 10-12. Note that
for the STR instruction there will need to be a mux that provides ir2[15:12] to one
of the read ports of the register file.
10-14. Using a multi-port memory like that in section 9.6 for problem 10-12 may be
too expensive. Design a memory hierarchy, consisting of two direct mapped caches
(section 8.5) and a main memory (that takes five cycles per access). One cache is for
data manipulated by LDR and STR instructions, and uses the read, write, memreq,
memrack and memwack signals described in section 8.5.3. The other cache is only
for instructions being fetched, and uses ireq (which combines the roles of read and
memreq for this cache) and imemack. You may assume that no machine language
instruction will be modified during the execution of the program so that there is no
need for write-through with the instruction cache.
10-15. Draw a pure behavioral ASM chart which combines problems 10-12 and 10-14.
In the event that an instruction is not in the instruction cache, let NOP(s) enter the
pipeline to stall until imemack is asserted. In the event that an LDR executes when the
operand is not in the data cache, use a wait loop similar to those in section 8.5.2. For
STR instructions, use a write buffer, which consists of registers that hold the memory
address and contents while it is being written. The second of two successive STR
instructions will go to a wait state only if the first is still being processed by the memory
hierarchy. Because of the write buffer, an STR followed by a LDR from the same
address creates a dependency that will require forwarding.
10-16. Assuming a powerful multi-port memory of some kind exists, modify the
superscalar behavioral Verilog of section 10.9.8.2 to implement the LDR and STR
instructions described in problem 10-11. Create appropriate test code.
10-17. Modify the multi-cycle behavioral Verilog of section 10.7.6 to implement the
remaining data-processing instructions described by appendix G. Give test code.
10-18. Modify the pipelined behavioral Verilog of section 10.8.7 to implement the
remaining data-processing instructions described by appendix G. Give test code.
10-19. Modify the superscalar behavioral Verilog of section 10.9.8.2 to implement the
remaining data-processing instructions described by appendix G. Give test code.
10-20. The ARM has two multiplication instructions which are identified by
ir[27:22]==0 && ir[7:4]==9, MUL (ir[21]==0) and MLA (ir[21]==1):
MUL r[ir[19:16]]<-r[ir[3:0]]*r[ir[11:8]]
MLA r[ir[19:16]]<-r[ir[3:0]]*r[ir[11:8]]+r[ir[15:12]]
Assume that the ALU does not include a combinational multiply operation. Modify the
multi-cycle behavioral Verilog of section 10.7.6 to implement the MUL and MLA
instructions using a shift and add algorithm such as the one explained in problem 2-7.
Test with code that computes a quadratic polynomial, a*x*x+b*x+c.
10-21. Modify the pipelined behavioral Verilog of section 10.8.7 to implement the
MUL and MLA instructions assuming a combinational multiplier can produce one product
per clock cycle. Use the same test code as problem 10-20.
10-22. Modify the superscalar behavioral Verilog of section 10.9.8.2 to implement the
MUL and MLA instructions. Use the same test code as problem 10-20.
10-23. Modify condx and f to allow for all sixteen conditions. Hint: f will need a 33-bit
input to detect overflow. Give a written justification why your test code is adequate.
10-24. Modify `OPB to include shift and rotate.
Synthesis 439
Figure 11-1. Design flow for CPLD synthesis. VITO, VerilogEASY, MACHPRO
and PLDesigner-XL are specific tools discussed in section 11.1.3.
sized chip. The easiest and often most efficient approach is to let the synthesis tool
choose how to connect these bits to the physical pins. In some cases, such as using a
circuit board where the programmable logic chip already has its sysclk and similar
signals soldered to specific pins, it is necessary for the synthesis tool to use specific
pins. There is no standard syntax in Verilog to indicate pin numbers in the file (.v) that
contains the highest level module, but many synthesis tools allow the designer to force
the tool to connect specific bits to specific pins with the physical information file
(section 11.3.6) or a similar approach.
When the physical chip will not be operated near its maximum frequency, there often is
no need to simulate the back annotated Verilog resulting from synthesis.³ In such a
case, simulating only before synthesis may be reasonable. For example, the clock used
in this chapter is slow enough that propagation delay is not a concern with the designs
discussed below. We will do post-placement simulation only to illustrate the logical
correctness of the process, and not out of concern for speed. In commercial design,
speed is often an important issue, but correctness is always the first concern.
³ Provided that the subset of Verilog used means the same thing in both synthesis and simulation.
⁴ MINC has made a restricted version of this technology known as VerilogEASY available to readers of this
book. See appendix F for details.
⁵ Vantis is a spinoff from Advanced Micro Devices (AMD), and the M4-128/64 used to be known as the
AMD Mach445.
⁶ The restricted version of VerilogEASY only allows 40 of these pins to be used.
signer does not have to use all the terms possible, but there are fairly complex internal
constraints on how many and which terms may be used in particular macrocells. Because
of the internal complexity of the CPLD, it is necessary for the designer to use a
tool. This is true even if the designer were to create a netlist manually because the place
and route tool must transform the original netlist into one that fits within the complex
constraints of the CPLD.
Each I/O pin of the M4-128/64 has an optional flip flop, which the synthesis tool may
choose to disconnect (for a combinational logic function of the input). Considering the
macrocells (64 bonded to I/O pins and 64 hidden) and the I/O pins, the total number of
flip flops that the M4-128/64 contains is 192. When all its macrocells are fully in use,
the M4-128/64 is the equivalent of about 5000 gates.
Vantis makes a printed circuit board, known as a demo board, that has one M4-128/64
mounted on it together with additional hardware, such as a 1.8432MHz oscillator⁷ that
produces the sysclk signal. Although many similar types of devices exist,⁸ the reason
for describing the M4-128/64 demo board here is that it is well suited for small synthesis
experiments. The M4-128/64 demo board connects to a personal computer via the
parallel (printer) port of that computer. This personal computer runs the synthesis tools
(such as PLSynthesizer and PLDesigner) and also the MACHPRO software, which
downloads the configuration of hardware determined by the synthesis tools into the
M4-128/64. The downloading process changes which terms are connected to which
macrocells. If a designer makes a mistake, it is a simple matter to download a corrected
version of the design because the internal technology of the CPLD is similar to an
EEPROM.
⁷ When the inputs to every macrocell only come from the internal flip flops, the M4-128/64 may be clocked
up to 125 MHz. When macrocells are cascaded together to form complex combinational logic, the maximum
frequency is lower. The 1.8432 MHz is slow enough to be safe for most designs.
⁸ Of which the Field Programmable Gate Array (FPGA) is perhaps the most common. The FPGA uses a table
rather than the AND/OR structure of a CPLD, but such details are seldom important to a designer using a
synthesis tool. In the 1990s, companies such as Xilinx and Altera were leading suppliers of FPGAs.
11.2 Verilog synthesis styles
Regardless of whether the designer wants programmable logic or custom integrated
circuits, and regardless of which vendors' tools are involved, there are five basic styles
of Verilog code used in synthesis: behavioral registers, behavioral combinational logic,
behavioral implicit style state machines, behavioral explicit style state machines and
structural instantiation. Often a particular design contains a combination of these styles.
11.2.1 Behavioral synthesis of registers
As described in sections 3.7.2.2 and 4.4.4, the synthesizable model for a register is an
instantaneous assignment statement inside a block with a single time control syntax,
such as @(posedge sysclk), @(posedge sysclk or posedge reset)
or @(posedge sysclk or negedge reset) time control. All synthesis vendors
support this Verilog construct. Registers synthesize to a group of flip flops, typically
D-type flip flops. Often there is combinational logic associated with a register. An
example of synthesizing a register is given in section 11.3.
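As a minimal sketch of this style (the module name reg2 and its port names are hypothetical, and the plain <= form is assumed rather than taken from the earlier sections), a two-bit register with an active-high asynchronous reset might look like this. The only time control is the single @(posedge sysclk or posedge reset) at the top of the always block.

```verilog
// Hypothetical example: two-bit register with asynchronous reset.
module reg2(q, d, sysclk, reset);
  output [1:0] q;
  input  [1:0] d;
  input  sysclk, reset;
  reg    [1:0] q;

  // Single time control at the top of the always block.
  always @(posedge sysclk or posedge reset)
    if (reset)
      q <= 2'b00;   // asynchronous clear
    else
      q <= d;       // load at the rising edge of sysclk
endmodule
```

A synthesis tool would be expected to map q onto two D-type flip flops; nothing in the block describes behavior between clock edges.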
11.2.2 Behavioral synthesis of combinational logic
There are two ways to describe combinational logic using behavioral Verilog that all
synthesis tools accept: the continuous assign statement (section 7.2.1) and an always
block with a sensitivity list composed of all the variables in the block that are
not on the left of any of the =s inside the block (section 3.7.2.1).⁹ All synthesis vendors
support both of these constructs. Combinational logic synthesizes to the primitive
combinational units of the target hardware, which are AND/OR gates for CPLDs, lookup
tables for FPGAs and ROMs, and arbitrary combinational gates for custom logic. An
example of synthesizing combinational logic is given in section 11.4.
⁹ With the additional requirement that none of the variables on the left of the =s occur on the right of the =s.
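The two styles can be compared on a small case. The following is only a sketch, with hypothetical module names, of a 2-to-1 mux written once with a continuous assign and once with an always block whose sensitivity list names every variable that is read.

```verilog
// Hypothetical 2-to-1 mux, continuous assignment style.
module mux_assign(y, a, b, sel);
  output y;
  input  a, b, sel;
  assign y = sel ? a : b;
endmodule

// The same mux, always block style.
module mux_always(y, a, b, sel);
  output y;
  input  a, b, sel;
  reg    y;              // reg here does not imply a flip flop
  // a, b and sel are all read, so all three are in the sensitivity list.
  // y is assigned on every path and never appears on the right.
  always @(a or b or sel)
    if (sel)
      y = a;
    else
      y = b;
endmodule
```

Both modules describe the same function, so a synthesis tool would be expected to produce equivalent logic for either one.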
module LPM_DFF_2_x(Clk,ClkEn,D,Q);
  input Clk,ClkEn,D; output Q;
  wire net0, net1, net2, net3, net4;
  NAN2 I_2_NAN2(.I0(net0),.I1(net1),.O(net2));
  NAN2 I_3_NAN2(.I0(Q),.I1(net3),.O(net1));
  NAN2 I_4_NAN2(.I0(ClkEn),.I1(D),.O(net0));
  INV I_1_INV(.I0(ClkEn),.O(net3));
  DFF I0(.CLK(Clk),.D(net2),.Q(Q),.QBAR(net4));
endmodule
module LPM_DFF_1_x(Clk,ClkEn,D,Q);
  input Clk,ClkEn; input[1:0]D; output[1:0]Q;
  LPM_DFF_2_x I0(.Clk(Clk),.ClkEn(ClkEn),
                 .D(D[1]),.Q(Q[1]));
  LPM_DFF_2_x I1(.Clk(Clk),.ClkEn(ClkEn),
                 .D(D[0]),.Q(Q[0]));
endmodule
module enabled_register(di,do,enable,clk);
  input [1:0] di; output [1:0] do;
  input enable,clk;
  LPM_DFF_1_x do_x_x(.Clk(clk),.ClkEn(enable),
                     .D(di), .Q(do));
endmodule
There is no other way to write this with the positional syntax of section 3.10. The other
kind of syntax that is legal in Verilog is instantiation by name, which is illustrated
above in bold. Like many synthesis tools, PLSynthesizer uses this alternative syntax
because the modules generated by the tool may have lengthy portlists. The advantage
of instantiation by name is that the ports may be rearranged in any order and the
meaning is the same. For example, the following:
are among the twenty-four permutations that mean the same thing.
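A minimal sketch of the idea, assuming a four-port flip flop like the DFF of section 11.3.1 (the wrapper module perm_demo and its wire names are hypothetical): the two instances below list the same four connections in different orders, which are two of the 4! = 24 possible orderings.

```verilog
// Idealized four-port flip flop, defined here only so the sketch stands alone.
module DFF(CLK,D,Q,QBAR);
  input CLK,D; output Q,QBAR;
  reg Q;
  assign QBAR = ~Q;
  always @(posedge CLK) Q <= D;
endmodule

// Hypothetical wrapper showing two orderings of the same named connections.
module perm_demo(clk, d, q1, q2);
  input clk, d; output q1, q2;
  wire nq1, nq2;
  DFF u1(.CLK(clk), .D(d), .Q(q1), .QBAR(nq1));
  DFF u2(.QBAR(nq2), .Q(q2), .D(d), .CLK(clk));
endmodule
```

Because each connection names its port explicitly, u1 and u2 are wired identically, so q1 and q2 always carry the same value.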
11.3.2 Modules supplied by PLSynthesizer
The modules in section 11.3.1 (such as NAN2, INV and DFF) could contain detailed
gate-level timing information, but this netlist has not yet been placed. After placement
inside the M4-128/64 CPLD, the netlist is likely to be considerably different than the
one in section 11.3.1. Rather, the netlist in section 11.3.1 is primarily of use to show the
logical correctness of the transformation carried out by the synthesis tool. To illustrate
this transformation, we will define idealized versions of the modules it instantiates:
module NAN2(I0,I1,O);
  input I0,I1; output O; nand g1(O,I0,I1);
endmodule
module INV(I0,O);
  input I0; output O; not g2(O,I0);
endmodule
module DFF(CLK,D,Q,QBAR);
  input CLK,D; output Q,QBAR;
  reg Q;               // Q is assigned inside the always block
  assign QBAR = ~Q;
  always @(posedge CLK) Q = D;
endmodule
11.3.3 Technology specific mapping with PLDesigner
In addition to structural Verilog, PLSynthesizer produces the same netlist in a proprietary
form, known as DSL. The place and route tool, PLDesigner, uses the DSL to
generate a netlist that is fitted within the constraints of the M4-128/64 CPLD. The
output of PLDesigner includes the JEDEC netlist and an equivalent post-placement
Verilog netlist. Such post-placement structural Verilog more accurately reflects the result
of place and route than the netlist produced by PLSynthesizer. For this example,
the resulting structural Verilog¹⁰ of the enabled register is:
//Model automatically generated by Modgen Version 3.8
`timescale 1ns/100ps
module enabledo00(dol0r,dol1r,dil1r,enable,dil0r,clk);
  output dol0r, dol1r;
  input dil1r, enable, dil0r, clk; supply0 GND;
  wire pin_8,pin_11,pin_12,pin_13,pin_93,pin_94,tmp12,
    tmp14,tmp15,tmp16,tmp17,tmp18,tmp19,tmp20,tmp21,tmp22;
  portin PI1(pin_8,dil1r); portin PI2(pin_11,enable);
  portin PI3(pin_12,dil0r); portin PI4(pin_13,clk);
  portout PO1(dol0r,pin_93); portout PO2(dol1r,pin_94);
  mbuf B1(tmp12,pin_13); and A1(tmp15,pin_12,pin_11);
  not I1(tmp17,pin_11); and A2(tmp16,pin_93,tmp17);
  or O1(tmp14,tmp15,tmp16);
  dffarap DFF1(pin_93, tmp12, tmp14, GND, GND);
  mbuf B2(tmp18,pin_13); and A3(tmp20,pin_8,pin_11);
  not I2(tmp22,pin_11); and A4(tmp21,pin_94,tmp22);
  or O2(tmp19,tmp20,tmp21);
  dffarap DFF2(pin_94, tmp18, tmp19, GND, GND);
endmodule
¹⁰ This Verilog was edited slightly for brevity.
do[1].D=do[1]*/enable+di[1]*enable;
do[1].CLK=clk;
do[0].D=do[0]*/enable+di[0]*enable;
do[0].CLK=clk;
This is a much more primitive language than Verilog. The .D and .CLK notations
indicate the macrocells are being used as D-type flip flops. To put the above in the
more understandable Verilog form, the notation for Boolean operations ('*', '+', '/')
must be rewritten into the corresponding Verilog notation ('&', '|', '~'). The following
manual translation is the equivalent behavioral Verilog:
always @(posedge clk)
  begin
    do[1]<= #0 ((di[1]&enable)|(~enable&do[1]));
    do[0]<= #0 ((di[0]&enable)|(~enable&do[0]));
  end
The assignment statements must be non-blocking (with time control of #0) and must be
listed inside an always block with a single @(posedge clk) as the time control.
This non-blocking assignment is somewhat different than the one used in earlier chapters.
It is used above so that the order in which the Verilog statements occur will not
affect the result. Since the non-blocking assignments use #0, the effect is almost the
same as plain =, except all of the right-hand values will be evaluated before any of the
left-hand values are changed. Because of the single @(posedge clk) at the beginning
of the always block, do will only change at the rising edge of the clock. The
only reason to manually rewrite these logic equations back into Verilog is to describe
the meaning of the .doc file. This file explains the transformation that PLSynthesizer
has performed on the original behavioral Verilog more succinctly than the netlist.
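The property that every right-hand value is evaluated before any left-hand value changes can be seen in a small simulation-only sketch. The module name swap_demo and its registers are hypothetical, and the plain <= form is used here instead of the #0 form.

```verilog
// Hypothetical demonstration: two registers exchange values at a clock edge.
module swap_demo;
  reg clk;
  reg [3:0] x, y;

  // Both right-hand sides are sampled before either register changes,
  // so the exchange works whichever statement is listed first.
  always @(posedge clk)
    begin
      x <= y;
      y <= x;
    end

  initial
    begin
      clk = 0; x = 4'd3; y = 4'd9;
      #1 clk = 1;                      // rising edge triggers the exchange
      #1 $display("x=%d y=%d", x, y);  // x now holds 9 and y holds 3
      $finish;
    end
endmodule
```

With plain = in place of <=, the second statement would see the already updated x, and both registers would end up holding 9.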
module dffarap(Q,CLK,D,AR,AP);
  output Q;
  input CLK,D,AR,AP;
  reg Q;
  always @(posedge CLK or
           posedge AP or posedge AR)
    begin
      if (AP)
        Q = 1;
      else if (AR)
        Q = 0;
      else
        Q = D;
    end
endmodule
The above models a flip flop with an asynchronous reset (AR) and an asynchronous
preset (AP). Such asynchronous signals are typically only used to initialize a controller
when it is first powered up (see sections 4.4.4 and 7.1.6). In this example, these
asynchronous signals are not used, and so they are instantiated with a connection to GND.
¹¹ For a CPLD, there is no delay attributed to a particular AND or OR gate. Rather the delay is associated
with the macrocell. For this reason, PLDesigner-XL uses built-in delayless and, or and not. Place and
route tools for FPGAs or custom logic may take a different approach.
choose which signals will go in and out of the pins of the chip. In this instance, there are
four bits of information being input and two bits of information being output, as
illustrated in figure 11-2.
Figure 11-2. Physical pins of M4-128/64 used for two-bit enabled register.
The structural Verilog refers to the internal wires that connect to the I/O pins with the
prefix pin_. This could be a little confusing since, for example, pin_13 is not actually
the physical pin 13 but rather is the signal from that pin after it has been buffered
internally by the M4-128/64. In this design, pin_13 is logically the same as the clk
port, which presumably would connect to the global sysclk signal. The instance U1
of enabledo00 separates the individual one-bit nets from the multi-bit ports. There
are buses (which are oversimplified in this diagram) that connect the I/O pins to the
macrocells. The actual implementation of the synthesized design occurs in the
macrocells.
The synthesis tool has bit blasted the design into individual one-bit-wide bit slices,
each one of which fits into a single macrocell, as shown in figure 11-3.
The circuit in figure 11-3 is a literal transcription of the Verilog produced by the
synthesis tool. Notice how each bit slice of the mux has turned into an AND/OR gate
arrangement. When enable (pin_11) is asserted, the outputs of the A2 and A4 AND
gates will be zero. Thus, di[0] (pin_12) and di[1] (pin_8) will pass through
their respective OR gates (O1 and O2) to become the new values of their respective flip
flops (DFF1 and DFF2) at the next rising edge of clk (pin_13). When enable is
not asserted, the old values (pin_93 and pin_94) will be reloaded into their respective
flip flops (DFF1 and DFF2) at the next rising edge of clk.
Figure 11-3. Macrocells in M4-128/64 implementing enabled register bit slices.
11.3.6 Mapping to specific pins
The physical pin numbers shown in figure 11-2 were chosen by PLDesigner. Often a
designer wishes to override the choices automatically made by the place and route tool.
For example, on the M4-128/64 demoboard, certain pins are attached to other hardware
soldered on the board:
a1:93  a2: 5  a3:19  a4:31  sysclk:13
b1:94  b2: 6  b3:20  b4:32  reset: 4
c1:95  c2: 7  c3:21  c4:33  sw3:18
d1:96  d2: 8  d3:22  d4:34  sw2:54
e1:97  e2: 9  e3:23  e4:35  sw1:63
f1:98  f2:10  f3:24  f4:36  sw0:68
g1:99  g2:11  g3:25  g4:37
The wires whose names above begin with "a" through "g" are for Light Emitting
Diodes (LEDs) in seven-segment displays. For example, the active low a1 ... g1
signals control the leftmost digit. These seven segments are labeled clockwise, with a1
at the top and g1 at the center; thus b1, c1, f1 and g1 should be 0 and a1, d1 and e1
should be 1 to display the digit "4." The 1.8432MHz clock is available as sysclk,
and a debounced push button provides the active low reset, which is also activated
when the demoboard is powered up. There are four input DIP switches (sw0 ... sw3)
available on the demoboard.
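As a small sketch of the active low convention (the module name show4 is hypothetical, and the ports simply reuse the segment names from the pin list), the digit "4" could be driven onto the leftmost display with a constant pattern:

```verilog
// Hypothetical module: constant pattern for the digit "4" on the leftmost display.
// Active low: a 0 lights a segment, a 1 leaves it dark.
module show4(a1,b1,c1,d1,e1,f1,g1);
  output a1,b1,c1,d1,e1,f1,g1;
  // b1, c1, f1 and g1 are 0 (lit); a1, d1 and e1 are 1 (dark).
  assign {a1,b1,c1,d1,e1,f1,g1} = 7'b1001100;
endmodule
```

The concatenation lists the segments from a1 down to g1, so the bits of the constant read in the same order as the segment names.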
assign s[(
These input pins can be named anything the designer wishes. For example, in the en- assign s[:
abled register of section 11.3, it might be reasonable to take the enable from the (a[ll&a[(
switch on pin 54, and the di bus from the switches on pins 63 and 68. The two do bits (-a [ ]&-E
might directly drive the al and bl LED segments' 2 on pins 93 and 94. Note that be- assign c2
cause of the active low nature of the LEDs, the light will not illuminate when the bit is &b[21&b[1]
a one, but it will light up when the bit is a zero. The following file, whose name must be |(a[2]&a[1
assign s2
similar to the name of the file that contains the module to be synthesized but with the |(a[2]&-a[
extension .pi, is required to indicate the pin numbers to PLDesigner: (a[2]&-a[
(a[2]&a[1
(a[2]&a[0
{MAXSYMBOLS 0,MAX-PTERMS ,POLARITYCONTROL TRUE, |(-a[2]&a[
MAXXORPTERMS ,XORPOLARITYCONTROL FALSE}; |(-a[2]&-a
device target 'partnumber amd MACH445-12YC'; |(a[2]&-a[
OUTPUT do[l]:93;OUTPUT do[0]:94;INPUT clk:13; (-a[2]&a[
INPUT enable:54;INPUT di[l]:63;INPUT di[0]:68; assign s[3
end device; |(-a[3&-b
endmodule
As in the last example, the input and output definitions need a size (four bits in
this case). When the above is synthesized similarly to the last example, PLDesigner
produces a .doc file that describes a series of logic equations for each bit of s. The
following is a manual translation of this back into Verilog:
Of course, the designer would probably use the backannotated output from PLDesigner.
This lengthy output, which is equivalent to the above assign statements, has been
omitted for brevity. In this output, the internal name for the one-bit carry wire varies
depending on how the module is synthesized. The name might be something like
LPMADDSUBl_x__nO02. It might also be just c[3] as shown above.
This result from synthesis is quite a bit more complicated than one might expect when
solving the same problem manually using full-adders. The above is complex because
the place and route tool utilizes the wide AND/OR gates that exist in each macrocell of
the M4-128/64. In the classical ripple carry adder (section 2.5), there needs to be a
distinct carry signal input to each full-adder. Here the tool has eliminated the carry for
all but the most significant bit by merging the logic equations for several full-adders
together in a process known as node collapsing. This has the effect of lowering the
propagation delay.
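The idea of node collapsing can be sketched on a two-bit adder. This is only an illustration with hypothetical module names, not the output of any tool: the first module keeps an explicit carry node between the bit slices, and the second substitutes the carry expression directly into the equation for the upper sum bit.

```verilog
// Hypothetical two-bit adder with an explicit carry between bit slices.
module add2_ripple(s, a, b);
  output [1:0] s;
  input  [1:0] a, b;
  wire   c1;
  assign s[0] = a[0] ^ b[0];
  assign c1   = a[0] & b[0];
  assign s[1] = a[1] ^ b[1] ^ c1;
endmodule

// The same function after the carry node has been collapsed into s[1].
module add2_collapsed(s, a, b);
  output [1:0] s;
  input  [1:0] a, b;
  assign s[0] = a[0] ^ b[0];
  // No separate carry signal remains; s[1] depends only on the inputs.
  assign s[1] = a[1] ^ b[1] ^ (a[0] & b[0]);
endmodule
```

A place and route tool targeting wide AND/OR gates would go further and expand each output into a sum of products, which is why the synthesized equations become long as the bit position increases.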
Figure 11-4. Macrocells in the M4-128/64 for low-order two-bit slices of adder.
module test;
  integer ia,ib,numerr;
  reg [3:0] a,b; wire [3:0] sum;
  addpar a1(sum,a,b);
  initial
    begin
      numerr = 0;
      for (ia=0; ia<=15; ia=ia+1)
        for (ib=0; ib<=15; ib=ib+1)
          begin
            a=ia; b=ib;
            #1 $display("%b %b %b",a,b,sum);
            if ((ia+ib)%16 !== sum)
              begin
                $display("error");numerr=numerr+1;
              end
          end
      $display("numerr=",numerr);
    end
endmodule
The original behavioral adder, the preplacement netlist and the post-placement netlist
all produce the correct results for the 256 combinations of inputs. If the width of the
inputs were much larger, such an exhaustive test would be impossible.
As in the last example, we are ignoring the back annotated delay by supplying delayless
modules for portin, portout and mbuf. If the backannotation capability were
used, the #1 would have to be changed to an appropriate delay longer than the longest
propagation delay of the synthesized design.
module addpar(s,a,b);
  output [3:0] s;
  input [3:0] a,b;
  reg [3:0] s;
  wire [3:0] a,b;
  reg [3:0] c;
  function car;
    input a,b,c;
    begin
      case ({a,b,c})
        3'b000: car = 0;
        3'b001: car = 0;
        3'b010: car = 0;
        3'b011: car = 1;
        3'b100: car = 0;
        3'b101: car = 1;
        3'b110: car = 1;
        3'b111: car = 1;
      endcase
    end
  endfunction
  function sum;
    input a,b,c;
    begin
      case ({a,b,c})
        3'b000: sum = 0;
        3'b001: sum = 1;
        3'b010: sum = 1;
        3'b011: sum = 0;
        3'b100: sum = 1;
        3'b101: sum = 0;
        3'b110: sum = 0;
        3'b111: sum = 1;
      endcase
    end
  endfunction
  always @(a or b)
    begin
      c[0] = 0;
      s[0] = sum(a[0],b[0],c[0]);
      c[1] = car(a[0],b[0],c[0]);
      s[1] = sum(a[1],b[1],c[1]);
      c[2] = car(a[1],b[1],c[1]);
      s[2] = sum(a[2],b[2],c[2]);
      c[3] = car(a[2],b[2],c[2]);
      s[3] = sum(a[3],b[3],c[3]);
    end
endmodule
supplying the
simulation. Ai
Here car is a function that models the carry required for the next higher bit position
when adding three bits, and sum is the corresponding result in the current bit position.
These functions may be coded several ways.
The case statement approach used above is a direct expression of the truth table for a
full-adder. For synthesis, we do not consider bx and bz values in the cases the way
that might be necessary for simulation. This is because the synthesis tool implements
the case statement using == rather than ===, which is all that is physically possible in
hardware. The same truth table can also be expressed with nested if statements.
Either the case or the if statement approach is acceptable because for the three bits
of input, all 2^3 possible cases are listed. Such a situation is known as a full case. The
problem is that without a default clause in the case or an equivalent else at the
end of the nested ifs, the synthesis tool will not synthesize combinational logic properly
except for a full case. For example, the following case, which only lists the ones
of the function, is not full:
  case ({a,b,c})
    3'b001: sum = 1;
    3'b010: sum = 1;
    3'b100: sum = 1;
    3'b111: sum = 1;
  endcase
A case like this that is not full will synthesize to what is known as a latch, which is an
asynchronous sequential circuit, rather than the desired combinational logic. To make
the case full requires using default: sum=0; in the above or using a full_case
synthesis directive. A synthesis directive is a comment which would be ignored by a
simulator, but which causes the synthesis tool to alter its operation. Use of synthesis
directives such as full_case is common, but is dangerous because it may cause
synthesis to disagree with simulation. It is better to make the case statement be full by
supplying the appropriate default since that acts the same in both synthesis and
simulation. Another common but dangerous directive is parallel_case, which
changes how synthesis interprets the case to be like ifs without elses.
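As a minimal sketch of the recommendation to supply a default (the module name sumdefault is hypothetical and does not come from the text), the ones-only case can be made full so that sum is assigned on every path through the always block and no latch should be inferred:

```verilog
// Hypothetical module name; illustrates a case made full with a default.
module sumdefault(sum,a,b,c);
  input a,b,c;
  output sum;
  wire a,b,c;
  reg sum;

  always @(a or b or c)
    begin
      case ({a,b,c})
        3'b001: sum = 1;
        3'b010: sum = 1;
        3'b100: sum = 1;
        3'b111: sum = 1;
        default: sum = 0;   // covers the four combinations not listed above
      endcase
    end
endmodule
```

Because the default is ordinary Verilog rather than a comment directive, a simulator and a synthesis tool both see the same assignment for the unlisted combinations.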
An alternative approach to the case statement would have been to use logic equations
inside the functions, such as sum=a^b^c.
In any event, the combinational logic is defined using an always block having the
same sensitivity list as the example in the last section that invokes the functions. Another
way this could have been defined is with eight separate continuous assignment
statements rather than the one always block:
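A sketch of what such a continuous assignment version might look like follows. It is an illustration under stated assumptions rather than the original listing: the module name addpar_assign is hypothetical, s and c are declared as wires instead of regs, and the functions are written with logic equations (one of the several ways of coding them mentioned above).

```verilog
// Sketch only: addpar_assign is a hypothetical name, and the logic-equation
// bodies of sum and car stand in for the case statement versions.
module addpar_assign(s,a,b);
  output [3:0] s;
  input [3:0] a,b;
  wire [3:0] s,a,b;
  wire [3:0] c;

  function car;
    input a,b,c;
    car = (a&b)|(a&c)|(b&c);
  endfunction

  function sum;
    input a,b,c;
    sum = a^b^c;
  endfunction

  // eight continuous assignments replace the single always block
  assign c[0] = 0;
  assign s[0] = sum(a[0],b[0],c[0]);
  assign c[1] = car(a[0],b[0],c[0]);
  assign s[1] = sum(a[1],b[1],c[1]);
  assign c[2] = car(a[1],b[1],c[1]);
  assign s[2] = sum(a[2],b[2],c[2]);
  assign c[3] = car(a[2],b[2],c[2]);
  assign s[3] = sum(a[3],b[3],c[3]);
endmodule
```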
cycle approach because less computation occurs per clock cycle. The multi-cycle ap-
proach takes several of these faster clock cycles to achieve the same result that the
single-cycle approach achieves in one slower clock cycle.
always
  begin
    @(posedge sysclk) `ENS;
      c[0] <= `CLK 0;
    @(posedge sysclk) `ENS;
      s[0] <= `CLK sum(a[0],b[0],c[0]);
      c[1] <= `CLK car(a[0],b[0],c[0]);
    @(posedge sysclk) `ENS;
      s[1] <= `CLK sum(a[1],b[1],c[1]);
      c[2] <= `CLK car(a[1],b[1],c[1]);
    @(posedge sysclk) `ENS;
      s[2] <= `CLK sum(a[2],b[2],c[2]);
      c[3] <= `CLK car(a[2],b[2],c[2]);
    @(posedge sysclk) `ENS;
      s[3] <= `CLK sum(a[3],b[3],c[3]);
  end
11.5.2 Macros needed for implicit style synthesis
In order for the implicit style to be practical, the result of simulation of implicit style
Verilog before synthesis must agree with the result of simulation after synthesis (and,
of course, the behavior of the physical hardware). Some synthesis tools are restricted
as to the use of time control, but as discussed in section 3.8.2.1, simulators need # time
control to simulate non-blocking assignment properly inside implicit style blocks.
Therefore, in order that simulation agree with synthesis, it is recommended that all of the
time control required for simulation be coded as macros. Only the time control needed
by the synthesis tool (the @(posedge sysclk) that denotes a state boundary outside
a non-blocking assignment) is written without a macro. The other two forms of
time control [the #1 and the @(posedge sysclk) inside the non-blocking assignment]
are written using macros (`ENS and `CLK). This way, they can simulate properly
when the macros are defined as shown above, but they can be synthesized properly
when the macros are defined as empty.
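As a minimal sketch of how the two sets of macro definitions might be arranged in one file, the following uses conditional compilation. The flag name SYNTHESIS and the module toggle are assumptions made for illustration; the macro bodies are inferred from the description above (`ENS standing for the #1, and `CLK for the @(posedge sysclk) inside the non-blocking assignment).

```verilog
// Assumption: SYNTHESIS is a hypothetical flag that is defined only when the
// file is given to the synthesis tool.
`ifdef SYNTHESIS
  `define ENS
  `define CLK
`else
  `define ENS #1
  `define CLK @(posedge sysclk)
`endif

// Hypothetical one-state implicit style machine that toggles q every cycle.
module toggle(q,sysclk);
  output q;
  input sysclk;
  reg q;
  wire sysclk;

  initial q = 0;                 // gives simulation a known starting value

  always
    begin
      @(posedge sysclk) `ENS;    // state boundary, kept for synthesis
        q <= `CLK ~q;            // time control inside <= comes from the macro
    end
endmodule
```

With the simulation definitions, the assignment expands to q <= @(posedge sysclk) ~q; with the empty definitions it collapses to q <= ~q, leaving only the state-boundary @(posedge sysclk).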
11.5.3 Using a shift register approach
A disadvantage of the code in section 11.5.1 is that it performs similar computations on
different bits of the data. The synthesis tool will either have to duplicate the hardware
to implement the sum and car functions multiple times, or use muxes to allow resource
sharing, in a way analogous to the central ALU approach. To avoid this problem,
we can use a shift register approach:
  reg c;
  reg [3:0] r1,r2;

    @(posedge sysclk) `ENS;
      r2 <= `CLK y; r1 <= `CLK x;
      c <= `CLK 0;
    @(posedge sysclk) `ENS;
      r1 <= `CLK {sum(r1[0],r2[0],c),r1[3:1]};
      c <= `CLK car(r1[0],r2[0],c);
      r2 <= `CLK r2 >> 1;
    @(posedge sysclk) `ENS;
      r1 <= `CLK {sum(r1[0],r2[0],c),r1[3:1]};
      c <= `CLK car(r1[0],r2[0],c);
      r2 <= `CLK r2 >> 1;
    @(posedge sysclk) `ENS;
      r1 <= `CLK {sum(r1[0],r2[0],c),r1[3:1]};
      c <= `CLK car(r1[0],r2[0],c);
      r2 <= `CLK r2 >> 1;
    @(posedge sysclk) `ENS;
      r1 <= `CLK {sum(r1[0],r2[0],c),r1[3:1]};
      c <= `CLK car(r1[0],r2[0],c);
      r2 <= `CLK r2 >> 1;
 1: always
 2:   begin
 3:     ready <= `CLK 1;
 4:     @(posedge sysclk) `ENS;  //ff_4
 5:      r2 <= `CLK y;
 6:      r3 <= `CLK 1;
 7:      c <= `CLK 0;
 8:      if (pb)
 9:        begin
10:          ready <= @(posedge sysclk) 0;
11:          @(posedge sysclk) `ENS;  //ff_11
12:           r1 <= `CLK x;
13:           while (~r3[3])
14:             begin
15:               @(posedge sysclk) `ENS;  //ff_15
16:                r1 <= `CLK {sum(r1[0],r2[0],c),r1[3:1]};
17:                c <= `CLK car(r1[0],r2[0],c);
18:                r2 <= `CLK r2 >> 1;
19:                r3 <= `CLK r3 << 1;
20:             end
21:        end
22:   end
When the most significant bit of r3 becomes one, the loop stops. In other words, r3
contains the unary values 0001, 0010, 0100 and 1000 in successive clock cycles. The
effect is similar to what would happen by counting 0, 1, 2 and 3. Since the computation
only depends on the number of times the loop repeats, and not on the value of r3, the
above unary code is just as reasonable as a binary code. A binary code might produce a
somewhat smaller synthesized netlist, but the unary code will produce a synthesized
circuit that typically runs faster and is easier to understand.
The above Verilog includes the friendly user interface described in sections 2.2.1 and
7.4.2. The signal ready is asserted when the machine is able to accept inputs. The
user pulses pb for exactly one clock cycle to cause the machine to compute the sum,
which will be available in r1 when the machine exits from the while loop.
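To watch the unary loop counter at work, the following simulation-only sketch strips the loop down to its essentials. It is an illustration, not the module from the text: the name loopdemo, the clock period, the operand values and the logic-equation versions of sum and car are all assumptions, and the pb/ready interface is omitted.

```verilog
`define ENS #1
`define CLK @(posedge sysclk)

module loopdemo;                 // hypothetical simulation-only module
  reg sysclk;
  reg [3:0] x,y,r1,r2,r3;
  reg c;

  function car;
    input a,b,c;
    car = (a&b)|(a&c)|(b&c);
  endfunction

  function sum;
    input a,b,c;
    sum = a^b^c;
  endfunction

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;   // assumed clock period of 100 time units

  initial
    begin
      x = 5; y = 6;
      @(posedge sysclk) `ENS;
        r2 <= `CLK y;
        r3 <= `CLK 1;            // unary counter starts at 0001
        c  <= `CLK 0;
      @(posedge sysclk) `ENS;
        r1 <= `CLK x;
        while (~r3[3])           // repeats while r3 is 0001, 0010 and 0100
          begin
            @(posedge sysclk) `ENS;
              r1 <= `CLK {sum(r1[0],r2[0],c),r1[3:1]};
              c  <= `CLK car(r1[0],r2[0],c);
              r2 <= `CLK r2 >> 1;
              r3 <= `CLK r3 << 1;
          end
      @(posedge sysclk) `ENS;    // let the last scheduled assignments occur
      $display("x=%d y=%d r1=%d",x,y,r1);
      $finish;
    end
endmodule
```

The loop body executes four times (with r3 equal to 0001, 0010, 0100 and 1000), so for the operands 5 and 6 the displayed r1 should be 11.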
11.5.5 Test code
The implicit style block of section 11.5.4 together with the function definitions from
section 11.4.2 can be placed inside the module to be synthesized, vsyaddl.
  always @(posedge sysclk) #20
    $display("%d r1=%d r2=%d pb=%b ready=%b",
      $time, sum,r2, pb, ready);
endmodule
The active low reset signal is necessary for the VITO preprocessor described in
chapter 7 and appendix F. The test code detects no errors, so it is reasonable to
synthesize vsyaddl.
11.5.6 Synthesizing
Since PLSynthesizer does not support the implicit style, the first step in synthesizing
vsyaddl is to use the VITO preprocessor.[13] VITO passes through the module definitions
and functions unchanged, which allows use of these names in the code generated
by VITO. VITO generates a one hot controller using continuous assignment and one
bit regs according to the principles described in chapter 7. VITO uses the line number
in the names of the wires and regs generated. In this particular machine, the states
correspond to ff_4, ff_11 and ff_15. When the code generated by VITO is run
through PLSynthesizer, logic equations are formed that describe the inputs to these
macrocell flip flops. PLSynthesizer and PLDesigner will eliminate most of the redundant
wire names created by VITO. The following is the manual translation of the
.doc file into Verilog for the logic equations of the one hot controller:
[13] The preprocessor is not necessary with synthesis tools, such as Synopsys, that support the implicit style.
The above Verilog is equivalent to figure 11-5, which is a kind of specialized shift
register that is loadable (in state ff_4 which includes statement s5) and only shifts
right (in state ff_15 which includes statement s18). Again, logic equations are
given in the .doc file that describe the inputs to each macrocell flip flop. The following
is the manually translated Verilog for the logic equations that correspond to r2:
Although it might have appeared from figure 11-5 that there would be two macrocells
of delay (for each mux), the synthesis tool merged the logic equations of the two muxes
together into a single macrocell per bit slice. Except for r2[3], each bit slice is similar
to the others. For example, there are three cases to consider for r2[0]. First, when
ff_4 is active, two terms of the logic equation, r2[0]&~ff_15&~ff_4 and
r2[1]&ff_15&~ff_4, are guaranteed to be zero. This leaves only y[0]&ff_4,
which passes through the proper bit of y into the input of the r2[0] flip flop. Second,
when ff_15 is active, we know (because of the nature of one hot controllers) that
ff_4 could not be active, but the synthesis tool did not know this. Therefore, the tool
generates r2[1]&ff_15&~ff_4. The ~ff_4 is not necessary considering the total
one hot system but is necessary to achieve the mux behavior shown in figure 11-5.
Because the other two terms of the logic equation, r2[0]&~ff_15&~ff_4 and
y[0]&ff_4, are guaranteed to be zero in this case (ff_15 active and ff_4 inactive),
the remaining term, r2[1]&ff_15&~ff_4, passes through the right-shifted
bit (r2[1]) into the input of the r2[0] flip flop. Third, the last possibility is that
neither ff_4 nor ff_15 is active. In this case, r2[0]&~ff_15&~ff_4 holds the
former value of the r2[0] flip flop.
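The three product terms discussed above for the r2[0] bit slice can be collected into a single equation. The following is a sketch only: the module name r2bit0 and its port names are hypothetical, and it is not the tool-generated or manually translated listing.

```verilog
// Sketch only: r2bit0 and its port names are hypothetical; the three product
// terms are the ones discussed in the text for the r2[0] bit slice.
module r2bit0(r2_0,r2_1,y_0,ff_4,ff_15,sysclk);
  output r2_0;
  input r2_1,y_0,ff_4,ff_15,sysclk;
  reg r2_0;
  wire r2_1,y_0,ff_4,ff_15,sysclk;
  wire d;

  assign d = (r2_0 & ~ff_15 & ~ff_4)   // neither state active: hold
           | (r2_1 &  ff_15 & ~ff_4)   // ff_15 active: shift right
           | (y_0  &  ff_4);           // ff_4 active: load from y

  always @(posedge sysclk)
    r2_0 <= d;                         // the macrocell flip flop
endmodule
```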
Of course, as mentioned earlier, it is tedious to have to manually translate the
nonstandard .doc file back into Verilog. The designer would probably prefer to use the
structural Verilog automatically generated by PLDesigner. The instance and wire names
shown in figure 11-6 and in the following may vary slightly, depending on tool-specific
details:
  mbuf B5(tmp85,pin_13);
  and A17(tmp90,tmp91,tmp92,pin_12);
  and A16(tmp87,ff_15,tmp88,pin_46);
  not I19(tmp88,ff_4);
  not I20(tmp91,ff_15);
  and A18(tmp93,ff_4,pin_25);
  not I21(tmp92,ff_4);
  or O5(tmp86,tmp87,tmp90,tmp93);
  dffarap DFF5(pin_12,tmp85,tmp86,GND,GND);
  mbuf B6(tmp94,pin_13);
  and A20(tmp98,tmp99,tmp100,pin_46);
  and A19(tmp96,ff_15,tmp97,pin_44);
  not I22(tmp97,ff_4);
  not I24(tmp100,ff_4);
  and A21(tmp101,ff_4,pin_23);
  not I23(tmp99,ff_15);
  or O6(tmp95,tmp96,tmp98,tmp101);
  dffarap DFF6(pin_46,tmp94,tmp95,GND,GND);
Ideally, when a person flips a switch on, we would hope that the output of the switch
would become and remain one until the person flips the switch off. Unfortunately, real
switches do not behave this way, as is illustrated by the following timing diagram:
[Timing diagram: ideal switch versus actual switch, with the first bounce and second bounce periods marked]
Figure 11-7. Ideal versus actual switch behavior shows need for debouncing.
reg [1S
Happily, real switches bounce for less than a constant time t seconds. For example, always
even the very awkward DIP switches'4 soldered onto the M4-128/64 demoboard bounce begin
@ (poE
for less than a quarter of a second.
pb <=
One solution to the bounce problem is to design a debounce machine' whose input is if (S
the actual switch, and whose output is the idealized pb signal needed by many of the cnt
else
designs in this book. Most of the time, the actual switch is quiet; thus the debounce
whi1
machine continually reassigns 0 to pb. The debounce machine does something differ-
beg
ent when the actual switch makes its first transition to a one. During this first t second
period when bounce occurs, we assume that the output of the actual switch will eventu- i
ally stabilize to 1. Therefore, the number of clock cycles when the actual switch could
be zero during this first bounce period is less than t times the clock frequency. After i
the first bounce period but before the second bounce period, the actual switch continu-
ally reads as a one. A second bounce period occurs when the switch is released. end
end
The total number of clock cycles during which the actual switch reads as a zero from endmodule
the time of the first transition to one until the final transition to zero is less than twice t
times the clock frequency. The designer precomputes this constant, which will be loaded Assuming cr
into a counter when the machine first detects that the actual switch has become a one. alone and the:
For example, with the M4-128/64, two times 0.25 seconds times 1.8432 MHz is ap- that sw3 is o
proximately one million. Since 0.25 is an overestimation of t, the exact number of zero again du
clock cycles is not too important, as long it is near one million. A convenient number machine ente
around this size is 22- 1. when sw3 is
significant ar
cnt might oi
14 People often use pencils to move these tiny switches, which aggravates the bounce problem. The constant
the least signi
t tends to be smaller for switches that are easier for people to manipulate, but the underlying cause of bounce
is always electrical.
single cycle.
15 The design here assumes that a single-pole single-throw switch is used and that the debounce machine
machine retur
must be completely digital. Much more economical solutions are possible that either use a few analog primitive DIP
components, such as a capacitor and a resistor, or that use a single-pole double-throw switch. In the case of
the M4-128/64 demoboard, neither alternative is possible without external components.
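The arithmetic behind the counter constant can be checked with a few lines of simulation-only Verilog. This is a sketch; the names debconst and CLKHZ are hypothetical.

```verilog
// Sketch only: debconst and CLKHZ are hypothetical names.
module debconst;
  parameter CLKHZ = 1843200;     // 1.8432 MHz demoboard clock
  integer bound;
  reg [19:0] cnt;

  initial
    begin
      bound = 2 * CLKHZ / 4;     // two times 0.25 seconds times the frequency
      cnt = 20'hfffff;           // 2^20 - 1, all twenty bits set
      $display("bound=%0d constant=%0d",bound,cnt);
    end
endmodule
```

The bound works out to 921600 cycles and the twenty-bit constant to 1048575, so the constant comfortably exceeds the bound while still fitting in a twenty-bit counter.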
In addition to debouncing the switch, we need to make sure that the pb output of the
debounce machine lasts for exactly one clock cycle. Otherwise, it would be as though
the user is making millions of requests for computation, when in fact the user makes
just one request. The following implicit style module solves both the debouncing and
single pulsing aspects of this problem:
module debounce(sw3,pb,cnt,sysclk,reset);
  input sw3,sysclk,reset;
  output pb;
  output [19:0] cnt;
  wire sw3,sysclk,reset;
  reg pb;
  reg [19:0] cnt;
  always
    begin
      @(posedge sysclk) `ENS;
        pb <= `CLK 0;
        if (sw3 == 1)
          cnt <= `CLK 20'hfffff;
        else
          while (cnt[19:1] != 0)
            begin
              @(posedge sysclk) `ENS;
                if (sw3 == 0)
                  cnt <= `CLK cnt - 1;
                if (cnt[19:1] == 0)
                  pb <= `CLK 1;
            end
    end
endmodule
Assuming cnt is zero and the actual switch, sw3, is zero, the machine leaves cnt
alone and therefore does not enter the while loop. The first time the machine detects
that sw3 is one, the machine assigns the constant to cnt. Eventually, sw3 becomes
zero again during the first bounce period. Since cnt now contains the constant, the
machine enters the while loop. Inside the while loop, cnt is decremented only
when sw3 is zero. The while loop exits when all bits of cnt other than the least
significant are zero (i.e., cnt is 1). During this last clock cycle in the while loop,
cnt might or might not be decremented one last time (hence the reason for ignoring
the least significant bit). In that same clock cycle, pb is scheduled to become one for a
single cycle. (pb will be scheduled to return to zero in the next clock cycle when the
machine returns to the top state.) Therefore, the above code allows us to use the rather
primitive DIP switch, sw3, as an ideal push button, pb.
The above code corresponds to what was called the pure structural stage in chapter 4,
but for brevity, the above uses only behavioral statements. (The present state register
and next state logic are not given in separate modules as was done in chapter 4.) Al-
though similar in operation to the implicit style design given in section 11.6, the ex-
plicit style design is much more tedious to understand. Also, the designer must give a
Verilog architecture (not shown) consisting of a counter (controlled by ldcnt and
deccnt) and an enabled register (controlled by ldpb and clrpb). Finally, the de-
signer must instantiate the controller and architecture to make a module that is identi-
cal to section 11.6:
module debounce(sw3,pb,cnt,sysclk,reset);
  input sw3,sysclk,reset;
  output pb;
  output [19:0] cnt;
  wire sw3,sysclk,reset;
  wire [19:0] cnt;
  wire pb,cnteq0_1,clrpb,ldpb,ldcnt,deccnt;
  debounarch architec(pb,cnteq0_1,cnt,
             clrpb,ldpb,ldcnt,deccnt,sysclk);
  debouncontrol controller(sw3,cnteq0_1,
             clrpb,ldpb,ldcnt,deccnt,sysclk,reset);
endmodule
In this case, the binary encoding makes only a slight savings in macrocells (3%) com-
pared to the one hot encoding used by VITO. As in many other designs, the majority of
the macrocells are devoted to the architecture. Those macrocells must be present, re-
gardless of whether the original Verilog was implicit or explicit style. All of the extra
manual coding required for the explicit style was not worth the effort.
11.8 Putting it all together: structural synthesis
A typical design often uses a combination of the above techniques. For example, consider
a machine activated by the debounced sw3 DIP switch that takes a three bit
binary number from the other DIP switches ({sw2, sw1, sw0}) and does bit serial
addition of this to a four-bit accumulator, r1, whose output is displayed in hexadecimal
on the LEDs a1 .. g1. In order to reuse the code given above, the designer needs
structural instances of vsyaddl and debounce:
module mach445(sw3,sw2,sw1,sw0,sysclk,reset,
               a1,b1,c1,d1,e1,f1,g1);
  input sw3,sw2,sw1,sw0,sysclk,reset;
  output a1,b1,c1,d1,e1,f1,g1;
  wire sw3,sw2,sw1,sw0,sysclk,reset;
  reg a1,b1,c1,d1,e1,f1,g1;
  function [7:0] sevenseg;
    input [3:0] i;
    ...
  endfunction
  wire pb,ready;
  wire [3:0] r1,r2;
  reg [3:0] y;
  wire [19:0] cnt;
  vsyaddl v1(pb,ready,r1,y,r1,r2,reset,sysclk);
  debounce deb1(sw3,pb,cnt,sysclk,reset);
  always @(sw2 or sw1 or sw0)
    y = {sw2,sw1,sw0};
  always @(r1)
    {a1,b1,c1,d1,e1,f1,g1} = sevenseg(r1);
endmodule
The vsyaddl and debounce module definitions are given in the same file as the
above module. In the above, y is simply another name for {sw2, sw1, sw0}. Note
that r1 connects both to the v1.r1 output as well as the v1.x input for the instance
of vsyaddl. In other words, r1 plus y will eventually replace the old value of r1.
The function sevenseg (whose case statement definition is not shown) takes a
four-bit binary input, i, and outputs the seven bits required to drive one LED digit in
hexadecimal. This combinational logic output is complemented to accommodate the
active low requirements of the LEDs.
The .pi file must be defined using the pin numbers given in section 11.3.6. When
synthesized and downloaded to the M4-128/64 demoboard, the above design will operate
properly.
11.9 A bit serial PDP-8
All the designs in chapters 8 through 10 use bit parallel arithmetic to illustrate concepts
about general-purpose computers. In contrast, many early general-purpose computers,
including the Manchester Mark I, used bit serial arithmetic because it required less
hardware. Most modern general-purpose computers are designed with bit parallel arithmetic
because it is faster and easier. As a concluding synthesis example, however, let
us build a bit serial PDP-8. This allows the CPU to fit within one M4-128/64 chip, and
it simplifies the connections to an external memory chip, which must be wired manually
to the demoboard.
The PDP-8 subset chosen for this example is the same as section 9.6 (CLA, TAD,
DCA, HLT, JMP, SPA, SMA and CIA), with the addition of the SNA, SZA, CMA and
IAC instructions described in appendix B. The link as well as additional instructions
are not implemented in this example. This subset is sufficient for the childish division
program given in section 9.7. Bit serial arithmetic is necessarily a multi-cycle approach,
and so the multi-cycle PDP-8 ASM of section 8.3.1.3 is a good starting point for the
design, but there are several algorithmic variations required for the CPU to fit into the
M4-128/64.
First, bit serial addition loops are used for incrementing pc (the user interface and
states F3A and EIASKIP), incrementing ac (state EOIAC) and adding to ac (state
EOTAD). Second, bit parallel comparisons, such as ir==12'o7200 for CLA, need
to be replaced with comparisons of only the appropriate bits, such as ir[11:9]==7
& ir[8]==0 & ir[7]==1 for CLA. Third, like the original PDP-8
(but unlike chapters 8 and 9), combined instructions (e.g., CMA and IAC to form CIA)
are allowed at no extra cost because the bits of the instruction register are tested
individually. Fourth, memory accesses occur one bit at a time with a one-bit-wide mb
register wired to the data in pin of the memory chip and a one-bit-wide membus wired to
the data out pin of the memory chip. Fifth, like section 8.3.2.4 and figure 8-11, memory
must be a separate actor so that it can be physically wired to the M4-128/64. Sixth, the
write signal is active low for the memory chip used here, which is the opposite of
figure 8-11. Seventh, since the number of bits in a memory chip is a power of two but
the number of bits in the PDP-8's memory is a multiple of twelve, the simplest approach
is to disregard four out of every sixteen bits from the one-bit-wide memory
chip. In other words, bitmem[0] through bitmem[11] form the twelve-bit m[0],
and bitmem[16] through bitmem[27] form m[1]. Eighth, in addition to the memory
address register, ma, the bit serial approach needs a bit address register, ba, which
provides the low-order four bits of the address going to the memory chip. At any time,
the bit from the memory chip currently being processed by the CPU is
bitmem[{ma,ba}]. Ninth, ba also serves as a binary counter for bit serial arithmetic
loops, rather than the unary r3 counter described in section 11.5.4. Tenth, because
this subset only implements the direct page zero addressing mode (and not the
full set of addressing modes described in appendix B), the memory address register
only needs to be seven bits wide (a reduction which saves several macrocells).
Eleventh, the user interface of chapter 8 (but_DEP, but_PC, but_MA, cont and the
twelve-bit switch register) has been replaced with a simpler but workable scheme using
four undebounced switches and a push button, cont, that must be externally
debounced. Twelfth, swin, which is the concatenation of the four switches, determines
the user interface action taken when cont is pressed:
  swin   action
  0000   ba <- 0
  001-   ba <- 0; pc <- {swin[0],pc[11:1]}
  010-   bitmem[{pc,ba}] <- swin[0]; Advance {pc,ba}
  011-   Advance {pc,ba}
  1000   Execute
where advancing {pc, ba} means incrementing just ba, except in the case when
ba==4'b1011. In that special case, pc is incremented and ba becomes zero.
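The rule for advancing {pc, ba} can be sketched behaviorally as follows. This is only an illustration of the rule: the module advdemo and the task advance are hypothetical names, blocking assignment is used for brevity, and the actual CPU increments pc with a bit serial loop rather than with the + operator.

```verilog
// Behavioral sketch only: advdemo and advance are hypothetical names, and
// the real CPU performs the pc increment bit serially.
module advdemo;
  reg [11:0] pc;
  reg [3:0] ba;

  task advance;
    begin
      if (ba == 4'b1011)         // last of the twelve bits in a word
        begin
          pc = pc + 1;
          ba = 0;
        end
      else
        ba = ba + 1;
    end
  endtask

  initial
    begin
      pc = 12'o0100; ba = 4'b1010;
      advance; $display("pc=%o ba=%d",pc,ba);   // ba moves from 10 to 11
      advance; $display("pc=%o ba=%d",pc,ba);   // pc increments, ba wraps to 0
      advance; $display("pc=%o ba=%d",pc,ba);   // ba moves from 0 to 1
    end
endmodule
```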
                  ba <= `CLK ba + 1;
                end
              @(posedge sysclk) `ENS;
                ba <= `CLK 0;
            end
        end
      end
    end
  else
    begin
      @(posedge sysclk) `ENS;  //F2
      while (ba != 11)
        begin
          @(posedge sysclk) `ENS;  //F3A
            ir <= `CLK {membus,ir[11:1]};
            pc <= `CLK {sum(pc[0],0,c),pc[11:1]};
            c <= `CLK car(pc[0],0,c);
            ba <= `CLK ba + 1;
        end
module mem(mabus,babus,mbbus,membus,write);
  input mabus,babus,mbbus,write;
  output membus;
  wire [11:0] mabus;
  wire [3:0] babus;
  wire mbbus, write;
  reg membus;
  reg [11:0] m[0:127];
  reg [11:0] temp;
  always @(mabus or babus)
    begin
      temp = m[mabus]; membus = temp[babus];
    end
  always @(negedge write)
    begin
      #50 membus = mbbus; temp = m[mabus];
      temp[babus] = membus; m[mabus] = temp;
    end
endmodule
The above models memory as twelve-bit words but interfaces to the CPU one bit at a
time. An attempt to access one of the four unused bits will result in 1'bx because of
the way Verilog treats bit selects that are out of bounds. The above must be instantiated
together with the CPU:
module pdp8_system(swin,cont,halt,sysclk,reset);
  input swin,cont,sysclk,reset;
  output halt;
  wire cont,sysclk,reset,halt,mb, membus, write;
  wire [3:0] swin,ba;
  wire [11:0] ma;
  pdp8_cpu cpu(swin,write,membus,cont,
               ba,ma,mb,halt,reset,sysclk);
  mem memory(ma,ba,mb,membus,write);
endmodule
Assuming pdp8_system is instantiated as pdp8_machine, the test code can initialize
a memory location using a twelve-bit word referred to with hierarchical reference
to the array pdp8_machine.memory.m[ ... ]. In order to simulate the
pushing of cont, a task is helpful:
The time control in the task depends upon what swin selection was requested. For
example, for the test code to set the program counter to 12'o0100 and then execute a
program, the task waits 200 units of $time for each bit shifted into the program counter
and then waits until the CPU halts:
  push(4'b0010);push(4'b0010);push(4'b0010);//0
  push(4'b0010);push(4'b0010);push(4'b0010);//0
  push(4'b0011);push(4'b0010);push(4'b0010);//1
  push(4'b0010);push(4'b0010);push(4'b0010);//0
  push(4'b1000);//Execute until HLT
Assuming the same clock period, the bit serial approach is about five times slower than
the multi-cycle bit parallel approach of chapter 8, which in turn is about five times
slower than the pipelined bit parallel approach of chapter 9. To execute one instruction,
it takes on average about one cycle for the pipelined bit parallel machine of section 9.6,
five cycles for the multi-cycle bit parallel machine of section 8.3.2.1 and twenty-seven
cycles for the multi-cycle bit serial machine of section 11.9.1. In the latter case, it takes
twelve cycles to fetch the instruction, twelve cycles to fetch the data and three cycles
for the other typical states (i.e., F1, F2 and F3B).
  PDP-8    M4       2102     PDP-8    M4       2102
  signal   header   pin      signal   header   pin
  ba[0]    JP5-27   1        GND      JP4-2    9
  ba[1]    JP5-25   2        Vcc      solder   10
  write    JP5-12   3        mb       JP5-1    11
  ba[2]    JP5-23   4        membus   JP4-27   12
  ba[3]    JP5-19   5        GND      JP4-2    13
  ma[0]    JP5-31   6        ma[3]    JP4-1    14
  ma[1]    JP4-7    7        ma[4]    JP4-26   15
  ma[2]    JP4-11   8        ma[5]    JP4-28   16
It is desirable that the ma and ba signals also be attached to external LEDs to provide
feedback to the user. (The onboard LEDs cannot be used because of the place and route
limitations of the M4-128/64.) The five-volt power supply (Vcc) to the memory chip
must be soldered on the demoboard power connection. In addition, the following external
switches must be connected: cont (externally debounced) to JP4-34, swin[1]
to JP4-6, swin[3] to JP4-4, swin[0] to JP5-32 and swin[2] to JP5-26.
11.10 Conclusions
Five kinds of synthesizable Verilog were considered in this chapter: behavioral registers,
behavioral combinational logic, behavioral implicit style state machines, behavioral
explicit style state machines and structural instantiation. Of these, the implicit
style is the best choice because it has such a close relationship to the behavioral ASMs
discussed in earlier chapters. Often a designer must use some of the other kinds of
Verilog, such as combinational logic, to create a complete design, but implicit style
should be the first choice for synthesizing hardware.
This chapter has used the M4-128/64 CPLD with VITO, PLSynthesizer and PLDesigner.
Although the details of performing synthesis using chips and software from different
vendors may vary somewhat from those described here, the design flow for Verilog
synthesis is similar. Simulation is a critical part of this design flow. Even though simulation
takes some effort by the designer, in most cases, a bug discovered during simulation
will be much less expensive than one that remains hidden until after the hardware
is fabricated. Synthesis as well as place and route tools output structural Verilog
netlists, which can be used with test code to verify the operation of the synthesized
design.
is, Prentice Hall A. MACHINE AND
ASSEMBLY LANGUAGE'
Most people use programs written in high-level languages. High-level languages are
hardware-independent, complex languages that are relatively easy to use. Hardware
independent means that programs written in high-level languages will run on nearly
any general-purpose computer. Examples of high-level languages include Pascal, Verilog
and C.
In contrast, low-level languages are simple in form and closer to how computers actually
operate. This makes them harder for the programmer to use. Low-level languages
are hardware dependent and have one statement per machine operation. Hardware
dependent means that low-level languages are designed for a specific computer's hardware.
Each statement is called a mnemonic. Mnemonics are easily memorized symbols
that represent each fundamental computer operation in a textual form for the
programmer's use. An instruction is a binary word that represents these fundamental
operations in a form the computer can process.
Low-level languages include machine language and assembly language. Assembly language
is made up of instructions represented by mnemonics. Machine language consists
of the instructions represented in binary. Assembly language has four major parts:
1. labels - symbolic names for places in memory (where variables are stored).
2. mnemonics - indications of what the computer will do.
3. operands - the data operated on by the instruction.
4. comments - a guide to the program that is ignored by the computer.
One statement in a high-level language program often corresponds to many assembly
language and machine language instructions. For example, the machine language file
of a program written in C and the machine language file of the same program written in
assembly language are basically equivalent. But, the assembly language version is much
longer than the C program. Consider the following very simple program:
¹ This appendix was written by Susan Taylor McClendon and Mark G. Arnold.
This is equivalent to the following assembly language program written for the PDP-8,
a simple general-purpose computer used as an example in chapters 8, 9 and 11:
label   mnemonic  operand  comment
        *0100              /starting addr
        CLA                /put zero in AC
        TAD       ENGL     /add ENGL to 0
        TAD       COSC     /add COSC to ENGL
        TAD       MATH     /add MATH to COSC+ENGL
        DCA       TUIT     /store in TUIT, clear AC
        HLT                /halt
ENGL,   0112               /74 dollars
COSC,   0152               /106 dollars
MATH,   0224               /148 dollars
TUIT,   0000
The *0100 indicates the starting address of the program in octal. The mnemonics indicate
what each instruction does. The operand refers to a label defined later in the program.
The following shows this example program translated to PDP-8 machine language
code:
0100/7200
0101/1106
0102/1107
0103/1110
0104/3111
0105/7402
0106/0112
0107/0152
0110/0224
0111/0000
The four digits to the left of the "/" indicate a memory address in octal. The four
digits to the right indicate the contents, which show the octal values of the bit patterns
representing the machine language equivalent of each mnemonic. Starting at address
0106₈ the contents are data values, not instructions.
TAD performs a Two's complement ADdition of the operand to the contents held in the
AC. DCA, Deposit and Clear the AC, deposits the value held in the AC into memory
and then clears the AC. CLA and HLT are non-memory reference instructions. The CLA
instruction CLears the AC and the HLT instruction causes the fetch/execute algorithm
to stop. The machine language code for CLA is 7200₈ and for HLT is 7402₈. More
details about these and other instructions of the PDP-8 are given in appendix B.
B. PDP-8 COMMANDS¹
The commands listed below are the Memory Reference Instructions (MRI) and the
non-memory reference instructions of the PDP-8. The bits referred to below are given
in little endian notation.
¹ This appendix was written by Susan T. McClendon and Mark G. Arnold.
Why do we need other addressing modes? One reason lies in the number of addresses
we can represent using the page addressing bits. The page addressing bits are bits 6-0
of the ir. Only 2^7 or 128₁₀ addresses (starting at address 0₁₀) can be represented by
these seven bits. To represent the other 3968₁₀ memory locations possible with the 12-bit
address bus, the PDP-8 subdivides the 4096₁₀ memory locations into 32₁₀ pages
(starting at page zero) of 128₁₀ memory locations (4096₁₀ DIV 128₁₀ = 32₁₀). To access
a particular page, the PDP-8 uses two types of addressing modes: direct and indirect.
Bit eight indicates either direct or indirect addressing mode and bit seven indicates
either page zero or current page. Page zero is normally used for global variables and
constants and the current page (01₈-37₈) is normally used for local data and corresponding
code. The following lists the combinations of bits seven and eight for each
possible addressing mode:
ir[8]  ir[7]  Addressing Mode        Effective Address
0      0      Direct Page Zero       ir[6:0]
0      1      Direct Current Page    {pc[11:7],ir[6:0]}
1      0      Indirect Page Zero     m[ir[6:0]]
1      1      Indirect Current Page  m[{pc[11:7],ir[6:0]}]
Direct page zero computes the Effective Address (EA) as simply the low-order seven
bits of the instruction register. This is the only addressing mode used in appendix A.
Direct current page computes EA as the high-order five bits of the program counter
concatenated to the low-order seven bits of the instruction register. This is useful for
programs that do not fit in the 128 words of page zero. Indirect page zero computes EA
as the contents of memory pointed to by the low-order seven bits of the instruction
register. Similarly, indirect current page computes EA as the contents of memory pointed
to by the concatenation of the high-order five bits of the program counter and low-order
seven bits of the instruction register. These indirect addressing modes are useful
when the address of data varies during runtime, and also in conjunction with the JMP
instruction to return from a subroutine (called by a JMS instruction) or from an interrupt
service routine.
The indirect addressing modes are slower, but more powerful, than the direct addressing
modes since the EA comes from memory. First, the machine obtains the address of
the EA from the instruction register (and possibly the program counter). Next, it accesses
memory to obtain the EA. Finally, it accesses memory to obtain the data.
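The four addressing modes in the table can be tried out in simulation. The following is only a minimal sketch, not code from the PDP-8 models in the chapters: the module name ea_sketch, the function direct_addr, the task show_ea and the contents of m are assumptions made for illustration, and autoincrement is not modeled.

```verilog
// Hypothetical sketch: all names and memory contents here are illustrative.
module ea_sketch;
  reg [11:0] m [0:4095];   // 4096 twelve-bit memory words
  reg [11:0] pc, ir, ea;

  // Address formed from the instruction register (and possibly the pc).
  function [11:0] direct_addr;
    input [11:0] p, i;
    direct_addr = i[7] ? {p[11:7], i[6:0]}     // current page
                       : {5'b00000, i[6:0]};   // page zero
  endfunction

  // Compute and print the effective address for the current ir.
  task show_ea;
    begin
      ea = ir[8] ? m[direct_addr(pc, ir)]      // indirect: EA comes from memory
                 : direct_addr(pc, ir);        // direct
      $display("ir=%o ea=%o", ir, ea);
    end
  endtask

  initial begin
    pc = 12'o0205;                // an instruction located on page 1
    m[12'o0020] = 12'o1234;       // made-up pointer stored on page zero
    ir = 12'o1020; show_ea;       // direct page zero:     ea is 0020
    ir = 12'o1220; show_ea;       // direct current page:  ea is 0220
    ir = 12'o1420; show_ea;       // indirect page zero:   ea is 1234
  end
endmodule
```

The indirect case performs one extra read of m before the data itself could be fetched, which is the extra memory access that makes the indirect modes slower.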
Autoincrement occurs on the PDP-8 with indirect addressing when the address of the
EA (not the EA itself) is between 0010₈ and 0017₈. In these eight cases, the EA in
memory is incremented prior to execution of the instruction. For example, the instruction
1417₈ increments the word at m[0017₈], and then adds m[m[0017₈]] to the
accumulator.
Non-memory reference instructions
Group 1 microinstructions
1. CLA (7200₈) - CLear the Accumulator, bit 7 on. This instruction sets the ac to
   0000₈.
2. CLL (7100₈) - CLear the Link, bit 6 on. This instruction sets the link to 0.
3. CMA (7040₈) - CoMplement the Accumulator, bit 5 on. This instruction complements
   (sets all 1's to 0's and 0's to 1's) the ac.
4. CML (7020₈) - CoMplement the Link, bit 4 on. This instruction complements
   the link.
5. RAR (7010₈) - Rotate the Accumulator and link Right, bit 3 on. This instruction
   rotates the link and ac together one position to the right. The link shifts to bit 11
   and bit 0 shifts to the link. All other bits shift one position to the right.
6. RTR (7012₈) - Rotate the accumulator and link Twice Right, bits 3 and 1 on.
   Bit 0 shifts to bit 11, the link shifts to bit 10 and bit 1 shifts to the link. All
   other bits shift two positions to the right.
7. RAL (7004₈) - Rotate the Accumulator and link Left, bit 2 on. This instruction
   shifts bits 10 through 0 one position to the left. The link shifts to bit 0 and bit 11
   shifts to the link.
8. RTL (7006₈) - Rotate the accumulator and link Twice Left, bits 2 and 1 on. Bit
   11 shifts to bit 0, the link shifts to bit 1 and bit 10 shifts to the link. All other
   bits shift two positions to the left.
9. IAC (7001₈) - Increment the ACcumulator, bit 0 on. Adds 1 to the contents of
   the ac. If the ac is 7777₈ the link will be complemented (as in the CML
   instruction). This allows the link and ac to act together as a 13-bit counter
   register.
10. NOP (7000₈) - No OPeration, bits 0-7 off.
The Group 1 microinstructions can be combined together. For example CLA CLL is
7300₈.
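The rotate and increment instructions are easiest to follow when the link and ac are treated as one 13-bit register. The following is a minimal sketch under that assumption. The module name group1_sketch and the starting values are illustrative, and the statements model only the register transfers described above, not the PDP-8 designs given in the chapters.

```verilog
// Hypothetical sketch of RAR, RAL and IAC acting on {link, ac}.
module group1_sketch;
  reg        link;
  reg [11:0] ac;

  initial begin
    link = 1'b1; ac = 12'o0001;

    // RAR: link goes to bit 11, bit 0 goes to the link.
    {link, ac} = {ac[0], link, ac[11:1]};
    $display("RAR link=%b ac=%o", link, ac);   // link=1 ac=4000

    // RAL: bit 11 goes to the link, link goes to bit 0.
    {link, ac} = {ac, link};
    $display("RAL link=%b ac=%o", link, ac);   // link=1 ac=0001

    // IAC: link and ac behave as a 13-bit counter.
    link = 1'b0; ac = 12'o7777;
    {link, ac} = {link, ac} + 13'd1;
    $display("IAC link=%b ac=%o", link, ac);   // link=1 ac=0000
  end
endmodule
```

Applying the RAR statement twice gives the RTR behavior, and applying the RAL statement twice gives RTL.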
Group 2 microinstructions
1. SMA (7500₈) - Skip on Minus Accumulator, bit 6 is 1₂ and bit 3 is 0₂. Normally
   used with signed data. Skips the next instruction if the value in the ac is negative.
2. SPA (7510₈) - Skip on Positive Accumulator, bit 6 is 1₂ and bit 3 is 1₂. Normally
   used with signed data. Skips the next instruction if the value in the ac is positive.
3. SZA (7440₈) - Skip on Zero Accumulator, bit 5 is 1₂ and bit 3 is 0₂. Skips the next
   instruction if the value in the ac is equal to zero.
4. SNA (7450₈) - Skip on Non-zero Accumulator, bit 5 is 1₂ and bit 3 is 1₂. Skips the
   next instruction if the value in the ac is not equal to zero.
5. SZL (7430₈) - Skip on Zero Link, bit 4 is 1₂ and bit 3 is 1₂. Skips the next
   instruction if the link is 0₂.
6. SNL (7420₈) - Skip on Non-zero Link, bit 4 is 1₂ and bit 3 is 0₂. Skips the next
   instruction if the link is not equal to zero.
7. SKP (7410₈) - SKiP unconditionally, bit 3 is 1₂. Skips the next instruction.
8. HLT (7402₈) - HaLTs the computer. Implemented by setting the HALT bit.
9. OSR (7404₈) - Inclusive Or of the Switch Register with the ac. The result is left
   in the ac and the original content of the ac is destroyed.
Note that all the memory reference instructions begin with 0₈ to 5₈ and that all the
non-memory reference instructions (group 1 and group 2 microinstructions) begin with 7₈.
The I/O (Input/Output) instructions are not given here, but they all begin with 6₈.
Interrupts are external signals that cause temporary suspension of the fetch/execute
cycle. On the PDP-8, there are two instructions, ION (6001₈) and IOF (6002₈), that
control whether interrupts are ignored. ION sets the interrupt enable flag, and IOF
clears it. On the PDP-8, an interrupt is ignored unless the last instruction was not 6001₈
and the interrupt enable flag is 1. If these conditions are met, the interrupt causes the same
action as executing the instruction 4000₈ without fetching such a machine code from
memory. The interrupt also causes the interrupt enable flag to become 0. At that point,
the fetch/execute cycle resumes. At the end of the interrupt service routine, the programmer
must put an ION instruction followed by a JMP indirect instruction.
planetary model that says that the sun orbits around the earth in a perfect circle every
twenty-four hours is an acceptable model of reality for everyday problems. The mathematical
simplicity of a circle is compelling, but there are problems where the highly
simplified model is insufficient. A more accurate model would say that the earth orbits
the sun in an elliptical path as the earth itself rotates. Although not as simple as a circle,
an ellipse is still fairly straightforward to describe with simple mathematics. For some
problems the elliptical model would also be insufficient, and a very complex model,
considering lunar interaction, etc., might be required.
C.1.1 Ideal combinational logic model
Speed and cost are not the first concerns of the designer. Producing a design that implements
a correct algorithm is the top priority. For this reason, it will be convenient to
think of combinational logic as being instantaneous. Such idealized combinational logic
cannot exist in the physical world and is analogous to saying the sun orbits around the
earth. Although an idealized model may seem too simple, it is the proper model for
automatic Verilog synthesis, which helps ensure that the designer gets the product to
market on time. Most of this book (with the primary exception of chapter 6) assumes
idealized combinational logic.
C.1.2 Worst case delay model
Just as in the planetary analogy, sometimes the problem will demand something more
accurate. As illustrated in chapter 6, there are problems where the computer designer
must meet certain speed and cost constraints. Rather than jumping from no detail to
every detail, it would be nice to have a simple, but reasonably accurate, model, analogous
to the elliptical model of planetary motion. In computer design, the worst case
propagation delay model satisfies this need. Worst case propagation indicates the maximum
number of gates through which a signal change must pass in the worst case. An
assumption commonly used in this model is that each gate has a delay of one unit of
$time in a Verilog simulation.
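The unit delay assumption can be illustrated with a small simulation. This is a minimal sketch, not an example from chapter 6: the module name delay_sketch, the two-gate circuit and the stimulus times are all assumptions chosen for illustration.

```verilog
// Hypothetical sketch: two gates in series, each with one unit of delay.
module delay_sketch;
  reg  a, b, c;
  wire ab, y;

  and #1 g1 (ab, a, b);    // first stage
  or  #1 g2 (y, ab, c);    // second stage

  initial begin
    $monitor("%0d a=%b b=%b c=%b y=%b", $time, a, b, c, y);
    a = 0; b = 0; c = 0;
    #5 a = 1; b = 1;       // this change must pass through both gates
    #5 $finish;
  end
endmodule
```

In this sketch the change on a and b reaches y two time units later, since the worst case path passes through two gates. A change on c alone would reach y after only one unit, which is why the model speaks of the worst case.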
the gates that compose the device. A Verilog simulator is a tool that allows the designer
to try out many combinations of values to see how long it takes for the simulated
combinational device to process the information under different circumstances.
C.1.4 Physical delay model
The most accurate but cumbersome model considers all the physical and geometric
factors that compose the machine. Such a model considers physical laws that govern
analog electronics, including factors such as the speed of light, capacitance, inductance,
etc. Although there are times when computer designers must confront these harsh
realities, the goal of a good top-down technique is to insulate the designer from physical
reality to as great an extent as possible. This book never considers this level of
detail.
C.2 Bus
The fundamental building block of all combinational logic is the bus. A bus is a device
that transmits information from a source to a destination. The symbol for a bus is a line
with a slash drawn through it. Next to the slash is a number, which indicates the number
of bits that the bus transmits at any instant.
C.2.1 Unidirectional
A bus is either unidirectional or bidirectional. Most of the buses used in this book are
unidirectional. A unidirectional bus is drawn as a line with an arrow pointing in one
direction. The arrow indicates the direction in which information flows through the
bus. For example, the following is a four-bit unidirectional bus:
abstraction of an n-bit-wide bus is physically implemented as n "wires" running conceptually
in parallel to each other. For example, the above four-bit bus would actually
be four "wires":
a[3]
a[2]
a[1]
a[0]
Figure C-2. Implementation of a four-bit bus.
Each "wire" transmits one bit of binary information from the source. For example, to
send the number a=15 from the source to the destination, all four transmit a one:
1
1
1
1
Figure C-3. Transmitting 15 on a four-bit bus.
On the other hand, to send the number a=7, the most significant "wire" instead transmits
a zero:
0
1
1
1
Figure C-4. Transmitting 7 on a four-bit bus.
is equivalent to figure C-4. Such geometric details need not concern us because tools
(known as place and route) determine this automatically.
C.2.3 Active high versus active low
At any instant, each "wire" will be at one of two voltages, known as high and low. The
physical values of these voltages seldom concern the designer.²
Voltage by itself is not information. The goal of this book is to describe how to design
machines that process binary information. Additional abstractions are necessary to relate
physical voltages to the binary information being processed by an algorithm. On
each wire, one of two possible abstractions is chosen (perhaps by the synthesis tool
rather than the human designer) to forever describe how a voltage on that wire translates
into binary information. These two abstractions are active high and active low.
In the active high abstraction, the high voltage means 1 and the low voltage means 0. In
the active low abstraction, the opposite holds.³ The easiest approach is to assume all
wires are active high, which is the approach used in this book in order to avoid confusion.
Beware that with actual physical chips it is common that some wires will be
² The numeric values of these voltages vary depending on the technology used. Typically, the lower the
voltage, the faster the machine. For the rugged TTL logic families commonly used in educational labs, high
is five volts, and low is zero volts. Faster, more modern but less rugged chips based on CMOS use lower
voltages, such as 3.3 volts. Slow vacuum tube machines of the 1950s used around +50 volts and -25 volts.
³ When all the signals are active high, the system is known as positive logic. When all the signals are active
low, the system is known as negative logic. When the system is a mixture of both, it is known as mixed logic
(not to be confused with the very different concept in Verilog of mixed behavioral structural design, as
described in chapter 4).
Accomplishing the same thing in hardware is essentially free. You simply run the bus
to two different places, and refer to the same physical bus by different names at these new
locations:
Figure C-7. Transmitting on one bus to multiple destinations for free.
One of many geometric arrangements of "wires" that can accomplish this is:
Note there is no connection between a[1] and a[0], although there are connections
between a[0], b[0] and c[0] and between a[1], b[1] and c[1]. Within the
time it takes light to travel the physical distance of the bus, the voltages at b[1] and
c[1] will be the same as a[1], which means that bit of information has been transmitted
to those two locations.
It is common for a designer to use different names for the same bus, but doing so can be
confusing. It would be better whenever possible to use the same name at both the
source and destinations:
Figure C-9. Using the same name at every node.
since this more accurately reflects physical reality. Nevertheless, there will be times
when it is advantageous to re-label the same bus with different names. A rose by any
other name is just as sweet, and a bus by any other name is just as cheap.
C.2.6 Subbus
There are certain operations in the binary number system that are trivial to implement.
For example, unsigned division by two, b=a/2 (also known as shift right, b=a>>1)
appears to require some special device:
but in fact can be implemented at no cost simply by rearranging how a subset of the
wires of the bus a is connected to b:
This subset bus is known as a subbus. A designer can select any bits of the source bus
to form a subbus. The notation we use for this is the concatenation syntax of Verilog,
which separates the name of the individual wires with commas inside {}. For example,
the bus b can be described as {a[3],a[2],a[1]}. Since subbuses that take a continuous
group of bits from the source are common, there is another notation, known as
bit select, that can be used: a[3:1] means the same as {a[3],a[2],a[1]}.
0     -->  b[3]
a[3]  -->  b[2]
a[2]  -->  b[1]
a[1]  -->  b[0]
a[0]       (no connection)
Figure C-13. Implementation of figure C-12.
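The claim that a subbus with a concatenated zero gives the same result as a shift can be checked exhaustively for four bits. The following is a minimal sketch, and the module name subbus_sketch and the loop that drives a are assumptions for illustration only.

```verilog
// Hypothetical sketch: b is formed purely by rewiring a, with no gates.
module subbus_sketch;
  reg  [3:0] a;
  wire [3:0] b = {1'b0, a[3:1]};   // zero on the left, a[0] left unconnected
  integer    i;

  initial begin
    for (i = 0; i < 16; i = i + 1) begin
      a = i;
      #1;                          // let the continuous assignment settle
      if (b !== (a >> 1))
        $display("mismatch at a=%d", a);
    end
    $display("checked all 16 values of a");
  end
endmodule
```

Since a[3:1] is only a renaming of three existing wires, a synthesis tool has nothing to build for b other than the constant zero.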
C.3 Adder
Many algorithms, even those that are not primarily mathematical, often need to do
addition of binary numbers. One way to accomplish this is to provide a combinational
logic unit that performs this computation:
Figure C-14. Combinational device to add two n-bit values (n+1 bit output).
The block diagram symbol for an adder is simply a rectangle with a "+" or the word
"adder" inside it. The number of bits in the output bus is one more than the number of
bits in the larger of the input buses to allow for the largest possible sum. Note that a, b
and sum are typically unsigned. (The low-order n bits of sum are also valid when they
are signed twos complement; however there are complications with signed values beyond
the scope of the discussion here.)
C.3.1 Carry out
It is common for the extra bit in the sum to be broken into a separate carry out (cout)
signal, with wordsum being a subbus:
Figure C-15. Treating the high-order bit as carry out.
where sum={cout, wordsum}. The above is often drawn as:
where the n-bit wordsum is a valid result that fits within the same sized word as a and
b only when cout is 0. When cout is 1, an overflow error is said to have occurred.
The physical implementation of this approach is identical to the earlier view of the
adder. The advantage of this view is that all buses are the same width, which often
simplifies the design of a larger system. The disadvantage is that cout must be observed
while the system is in operation to detect the possibility of an error. Sometimes,
however, the designer has a priori knowledge that wordsum is small, and so cout can
be ignored.
Some people draw muxes as a rectangle with the word "mux" written inside. The mux
has a select input, which is k bits wide. The mux also has (at most) 2^k other buses that
are data inputs (i0, i1, i2, ... imax), each n bits wide. The mux has one data
output which is n bits wide. If any input bus has fewer bits, assume zeros are concatenated
on the left.
C.4.1 Speed and cost
There are several ways a mux can be implemented physically. In the most common it is better to
approach, the mux shown above would be implemented using n OR gates (each hav- adder is ineffi
ing 2k inputs), n*2kAND gates (each having k inputs) and k inverters. This approach
I
needs only three stages of propagation delay. Sometimes it is possible to reduce this
~ombinational down to two stages (by eliminating the need for the inverters') and so muxes imple-
.puts to be its mented this way are quite fast.
savior is sym-
C.5.1 Incrementor
One of the most common operations is adding one to a number:
it is better to specify an incrementor if that is all the problem needs. Using a general
adder is inefficient both in terms of speed and cost.
⁶ The need for inverters can be eliminated in so-called dual rail designs, where every signal is provided in
both active high and active low form. The reason the inverters are not needed is because certain devices,
such as flip flops, naturally provide both active high and active low versions of the same signal at no extra
cost.
C.5.1.1 Speed and cost
Although ripple carry addition of two arbitrary numbers requires n stages of worst case
propagation delay, incrementation can be done in only two stages of propagation delay
using n-1 OR gates (each with two inputs), 2*n-2 AND gates (each with two inputs),
n-1 AND gates (of various sizes) and some inverters.
C.5.2 Ones complementor
The ones complement, -a-1 (also known as bitwise not, ~a):
Figure C-21. Symbol for ones complementor.
is often part of a larger computation.
C.5.4 Subtractor
The building block for a combinational logic subtractor is analogous to addition.
C.5.5 Shifters
Multiplication and division by constant powers of two can be accomplished at essentially
no cost through subbusing and concatenation. For example, multiplication by 4
(shifting left two places):
a (12 bits)  -->  <<2  -->  4*a (14 bits)
simply concatenates a to two bits that are zero on the right. The reason this does not
cost anything is because the power of two is a constant.
Sometimes the shifter has another input, known as the shift in (si), that allows the
designer to specify what the least significant bits are:
a, si  -->  4*a+si
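Both constant shifters can be written as concatenations. The following is a minimal sketch that assumes the 12-bit input and 14-bit output of the example above. The module name shift_sketch and the output names times4 and times4_plus_si are made up for illustration.

```verilog
// Hypothetical sketch: constant left shift by two, with and without shift in.
module shift_sketch (
  input  [11:0] a,
  input  [1:0]  si,               // shift in: supplies the two low-order bits
  output [13:0] times4,
  output [13:0] times4_plus_si
);
  assign times4         = {a, 2'b00};  // 4*a: two zero bits on the right
  assign times4_plus_si = {a, si};     // 4*a+si
endmodule
```

Neither assignment requires any gates, since both outputs are only a relabeling of wires (plus two constant zeros in the first case).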
Figure C-27. Symbol for barrel shifter with shift count input.
This can be implemented in two (or three) levels of worst case propagation delay as
constant shifters and a mux:
Figure C-28. Possible implementation of barrel shifter.
An alternative implementation which is slower but less costly uses k muxes, each with
two inputs.
A similar right shifter can be implemented for division by variable powers of two.
Barrel shifters can be arranged to allow for both multiplication or division by arbitrary
powers of two, and to allow for arbitrary shift input (rather than concatenation with
zeros).
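Behaviorally, a left barrel shifter is a shift by a variable count. The following is a minimal sketch that assumes an n-bit input a, a k-bit shift count sc and an output wide enough to hold a*2^sc without losing bits. The module name barrel_sketch and the parameters N and K are illustrative, and how a synthesis tool implements the shift (mux of constant shifters, or k stages of two-input muxes) is left to the tool.

```verilog
// Hypothetical sketch of a left barrel shifter with shift count input.
module barrel_sketch #(parameter N = 4, K = 2) (
  input  [N-1:0]        a,
  input  [K-1:0]        sc,
  output [N+(1<<K)-2:0] out     // n + 2^k - 1 bits, enough for the largest shift
);
  // a is widened to the width of out before shifting, so no bits are lost.
  assign out = a << sc;
endmodule
```

With the default parameters the output is seven bits wide, which holds a four-bit value shifted left by up to three places.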
Note that the product has twice as many bits as the input buses. We will normally
assume that a, b and product are unsigned. It takes a physically different device to
multiply signed numbers.
C.5.7 Division
Division (by non-powers of two) is even more costly than multiplication when imple-
mented as a combinational logic building block. Division is seldom implemented as
combinational logic. Most of this book uses an example of one simple way that divi-
sion can be implemented using sequential logic and ASM charts.
choose to put whatever functionality in an ALU as is appropriate for a particular problem,
however it may be convenient to use an ALU that has already been designed,
such as the 74xx181.
Regardless of what details are inside the ALU that a designer chooses, the basic principle
of how a combinational logic ALU operates is the same. There is a k-bit bus,
aluctrl, that customizes the ALU for the particular function that needs to be computed.
Figure C-30. Symbol for Arithmetic Logic Unit (ALU).
Conceptually, an ALU could be implemented as a mux which selects from the various
combinational functions which that particular ALU is capable of performing:
⁸ This quite descriptive term was coined by Synopsys, the pioneering vendor of Verilog synthesis tools in the
early 1990s.
The fact that a[0] and b[0] are both one in this example ultimately affects f[0],
f[1] and f[2]. This ripple effect is why addition has a worst case propagation delay
proportional to n.
The following table shows some of the most useful arithmetic operations available in
the 74xx181 ALU:
Because the 74xx181 is a low-cost ALU which is readily available for educational
laboratory experiments, it does not implement multiplication or division. Section 2.3.1
shows how to use one of these ALUs to implement division using a slow but simple
algorithm.
From these three outputs, the other three conditions can be derived, for example
ageb=agtb|aeqb. Many problems only need an equality test:
C.8 Demux
The demultiplexor (demux) is a specialized combinational building block which is not
used in the early chapters of this book. It plays an important role in implementing
concepts found in later chapters.
Like the mux, the demux has a k-bit input bus known as select. Some people may
draw the demux as a box. All but one of the n-bit output buses will be zero. The selected
output bus will pass through unchanged the value on the input bus.
C.8.1 Speed and cost
Demuxes are simply a large collection of AND gates that operate independently. The
demux shown above requires n*2^k AND gates (each having k+1 inputs) and k inverters.
Such an implementation would have a worst case propagation delay of only two
gates. Sometimes, the inverters can be eliminated, in which case the propagation delay
is only one gate.
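For k=2 the demux has four output buses. The following is a minimal sketch assuming an n-bit input bus named in and outputs out0 through out3. The module name demux4_sketch and the parameter N are illustrative.

```verilog
// Hypothetical sketch of a demux with k=2 select bits and four n-bit outputs.
module demux4_sketch #(parameter N = 8) (
  input  [N-1:0] in,
  input  [1:0]   select,
  output [N-1:0] out0, out1, out2, out3
);
  // The selected output passes in through unchanged; all others are zero.
  assign out0 = (select == 2'd0) ? in : {N{1'b0}};
  assign out1 = (select == 2'd1) ? in : {N{1'b0}};
  assign out2 = (select == 2'd2) ? in : {N{1'b0}};
  assign out3 = (select == 2'd3) ? in : {N{1'b0}};
endmodule
```

Each output depends only on in and select, which reflects the statement that the AND gates operate independently.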
There is no harm in result1 being 2*a simultaneously with result0 being a+1.
In hardware, it is often more economical to compute everything you might need
and ignore those results that are not pertinent under particular circumstances.
Therefore, demuxes are not needed in the early chapters of this book. Demuxes are
important in more advanced design topics, such as the design of memory systems
(section 8.2.2.3.1) and the implementation of one-hot controllers (chapter 7).
C.9 Decoders
A decoder is a specialized combinational device that converts from a binary code to
some other code. The most common decoder converts from binary to what is called a
unary code. The following table lists these codes for the numbers between 0 and 7:

value   binary   unary
0       000      00000001
1       001      00000010
2       010      00000100
3       011      00001000
4       100      00010000
5       101      00100000
6       110      01000000
7       111      10000000

Such a decoder can be thought of in two ways. First, it can be thought of as a building
block that simply takes k bits of binary input, and produces 2^k bits of unary output:
Figure C-37. Symbol for binary to unary decoder.
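For k=3, the black-box view of the decoder can be sketched with a single shift, which reproduces the table of unary codes given earlier. The module name is an assumption; the port names follow the figure.

```verilog
// Sketch of a binary to unary decoder for k = 3 (2^k = 8 unary bits).
// The module name is assumed.
module decoder3to8(unary, binary);
  output [7:0] unary;
  input  [2:0] binary;

  // Exactly one bit of unary is one: the bit whose index equals binary.
  assign unary = 8'b00000001 << binary;
endmodule
```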
Figure C-39. Alternate implementation of decoder.
C.10 Encoders
Sometimes, a designer needs to convert from a unary code to binary. A combinational
building block that performs this conversion is known as an encoder. If the designer
could be sure the input were always a proper unary code (with exactly one bit that is
one), the encoder could be implemented simply with k OR gates. But there are only 2^k
valid unary codes out of the very large number (two raised to the 2^k) of bit patterns
that might appear on the input.
[Figure: symbol for a priority encoder with a possibly unary 2^k-bit input and a
k-bit binary output.]
The priority encoder is useful for counting how many leading bits of a number are zero.
This is a computation that is necessary to implement floating-point arithmetic. Priority
encoders are also often used so that a general-purpose computer can select which one
of several external interrupts has the highest priority.
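A minimal sketch of a priority encoder for k=3 follows. It assumes that the most significant one bit has the highest priority, so that for a nonzero input the number of leading zeros is 7 minus the output. The module name, port names, and the choice to output 0 when no bit is set are assumptions of this sketch.

```verilog
// Sketch of an 8-input priority encoder (k = 3).
// Assumption: the most significant one bit wins.
module priority_encoder8(binary, in);
  output [2:0] binary;
  input  [7:0] in;
  reg    [2:0] binary;

  always @(in)
    begin
      if      (in[7]) binary = 3'd7;
      else if (in[6]) binary = 3'd6;
      else if (in[5]) binary = 3'd5;
      else if (in[4]) binary = 3'd4;
      else if (in[3]) binary = 3'd3;
      else if (in[2]) binary = 3'd2;
      else if (in[1]) binary = 3'd1;
      else            binary = 3'd0;   // also the result when no bit is one
    end
endmodule
```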
C.11 Programmable devices
Almost any imaginable mathematical function can be realized as a combinational
building block if it involves a small enough number of bits of input. With Verilog
synthesis tools available since the mid 1990s, functions involving around sixteen or
fewer bits of input are routinely converted into combinational logic without the designer
having to worry about their technological or gate-level implementation. The synthesis
tool produces a file that can be downloaded into one of many kinds of programmable
devices. The process of transferring the design into a programmable device is known as
programming it. Such programming is a mechanical process, which does not require
human intervention or creativity. The term burning is sometimes used to mean the same
thing as programming. This use of the term programming should not be confused with
its use in software (chapter 8), where the term programming means the same thing as
design, which, of course, requires lots of creativity.
There are many kinds of programmable devices available, including Programmable
Logic Arrays (PLAs), simple and Complex Programmable Logic Devices (CPLDs),
Field Programmable Gate Arrays (FPGAs) and Read Only Memories (ROMs). CPLDs
and FPGAs also have provision for sequential logic (see appendix D), but ROMs are
pure combinational logic. The combinational logic implemented by all ROMs and by
many FPGAs is based on truth tables without the need for expressing logic equations.
In contrast, PLAs and CPLDs are based on logic equations (sum of products) rather
than truth tables. Synthesis tools automatically produce truth tables or logic equations,
depending on the target technology the designer selects.
C.11.1 Read only memory
Automatic synthesis of combinational logic for functions involving more than about
sixteen bits depends on the complexity of the function. A simple function like addition
can be implemented for an arbitrarily large number of input bits with combinational
logic because the function decomposes into smaller combinational logic units, e.g., full
adders in the case of addition. The synthesis tool is well aware of the properties of
commonly used functions like addition. The decomposition of more complicated
functions (whose properties are not built into the synthesis tool) is often less obvious.
Synthesis tools explore many possible implementations for the combinational logic
requested by the designer; however, as the number of input bits increases, the number of
design alternatives grows exponentially. It becomes difficult for the synthesis tool to
derive the logic equations needed for technologies such as CPLDs. ROMs tend to be
the most practical approach for complex functions as the number of input bits grows.
This is because with ROMs, all the designer has to do is tabulate the desired behavior,
rather than find logic equations that produce that behavior. This avoids using a
synthesis tool that has to explore exponential possibilities.
There are several ways of describing a ROM. The usual viewpoint of the designer is
that the ROM is a black box, specialized for computing some particular function:
Figure C-41. Symbol for a Read Only Memory (ROM).
The number of input bits, k, and output bits, n, need not be the same, although often
k==n. The input bus to the ROM is known as the address. The output bus of the ROM
is known as the contents. This address and contents terminology is borrowed from
memory systems (section 8.2.2.3.1). However, such a ROM is not truly a "memory"
because once a value is burned into a ROM, it cannot be changed.
Normally, the designer will indicate more than just the word "ROM" inside the box,
since the ROM could be programmed to implement any function. For instance, the
designer might need a "square root ROM," or something like that. The designer is then
responsible for providing a table of the contents that need to be burned into that
particular ROM.
Another viewpoint of the ROM is to describe it in terms of the combinational logic that
it implements. A ROM is simply a mux whose data inputs are connected internally to
the constants (c0, c1, c2, ... cmax) that the designer has burned into the ROM.
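The view of a ROM as a mux of constants can be sketched with a case statement. The example below is a hypothetical "square root ROM" with k=4 address bits and n=2 contents bits, where each constant is the integer part of the square root of the address. The module name and the chosen sizes are assumptions.

```verilog
// Hypothetical square root ROM: contents = floor(sqrt(address)).
// Each case arm plays the role of one constant wired to a mux data input.
module sqrt_rom(contents, address);
  output [1:0] contents;
  input  [3:0] address;
  reg    [1:0] contents;

  always @(address)
    case (address)
      4'd0:                         contents = 2'd0;
      4'd1, 4'd2, 4'd3:             contents = 2'd1;
      4'd4, 4'd5, 4'd6, 4'd7, 4'd8: contents = 2'd2;
      default:                      contents = 2'd3;   // addresses 9 to 15
    endcase
endmodule
```

The designer only tabulates the desired behavior; no logic equations are written.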
Read Only Memories (ROMs) are not actually memory because they do not have the
ability to forget. ROMs are simply a different, more convenient, approach for imple-
menting combinational logic. The use of ROMs as well as the use of programmable
logic with Verilog synthesis tools has made the design of specialized combinational
logic relatively easy.
PROSSER, FRANKLIN P. and DAVID E. WINKEL, The Art of Digital Design: An Introduction
to Top Down Design, 2nd ed., Prentice Hall PTR, Englewood Cliffs, NJ, 1987. Chapter
3.
C.14 Exercises
Using the combinational logic building block devices listed in each of the following
problems, give a block diagram that implements the more complex combinational
building block described by the data output(s). The buses in these problems should be
interpreted as unsigned binary integers.
C-1. Control Inputs: CTRL (3 bits)
Data Inputs: A (32 bits), B (32 bits), C (32 bits), D (32 bits), E (32 bits)
Data Output: F (32 bits)
Devices: one 32-bit adder, one 32-bit 2-input mux, one 32-bit 4 input mux
011 F=D
100 F=A+E
101 F=B+E
110 F=C+E
111 F=D+E
C-8. Control Inputs: ALUCTRL (6 bits), CTRL (1 bit)
Data Inputs: H (8 bits), L (8 bits), M (8 bits)
Data Outputs: F (8 bits), G (8 bits)
Devices: two 8-bit integer ALUs (74LS181),
one 8-bit 2-input mux

ALUCTRL   CTRL   Data output
100100    0      F=H+L; G=H+2*L
100100    1      F=H+M; G=H+L+M
101101    0      F=H&L; G=H&L
101101    1      F=H&M; G=H&L&M
000100    0      F=H|L; G=H|L
000100    1      F=H|M; G=H|L|M

[Figure: synchronous sequential device 1 and synchronous sequential device 2, both
connected to sysclk.]
This connection need not be drawn because it is understood that all synchronous se-
quential devices connect to this same signal. For example, the following is understood
to mean the same as the above:
[Figure: the same two synchronous sequential devices, drawn without the sysclk
connection.]
Figure D-3. An analog waveform for the system clock signal.
which shows how the analog voltage (vertical axis) on the sysclk wire varies with
time (horizontal axis). Physical properties, like capacitance and inductance, affect the
ragged shape of the analog voltage shown on the oscilloscope. Computer designers are
not concerned with analog voltages, and so this rather messy physical reality is
abstracted to an idealized square wave:
Figure D-4. A digital abstraction of the system clock signal.
Such a square wave is not physically possible; however, as explained in section C.1.1,
computer designers often use models of reality that are physically unrealistic because
such simplified models emphasize only those things which are algorithmically
important.
1 Computer designers now often use more sophisticated kinds of test equipment.
In the case of the sysclk signal, the only thing that is important is that it subdivides
time into equal-sized intervals, known as clock periods or clock cycles:
Figure D-5. The system clock divides time into cycles.
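In simulation, such an idealized square wave can be produced by a few lines of Verilog. The following is only a sketch: the module name and the HALF_PERIOD parameter (in units of $time) are assumptions. With the default value, each clock cycle lasts 100 units and rising edges occur at times 50, 150, 250 and so on.

```verilog
// Sketch of an idealized system clock for simulation.
// Module name and HALF_PERIOD are assumed for illustration.
module clock_gen(sysclk);
  parameter HALF_PERIOD = 50;
  output sysclk;
  reg    sysclk;

  initial sysclk = 0;
  // Toggle forever; two toggles make one clock cycle.
  always #HALF_PERIOD sysclk = ~sysclk;
endmodule
```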
Each clock period begins and ends on the rising edge of sysclk. (Some kinds of
sequential logic use the falling edge; however in this book all synchronous sequential
building blocks use the rising edge.)
Synchronous logic is a restriction on physical reality where changes in the values shown
in a timing diagram occur only at the exact instant of the rising edge of sysclk. For
example, in the following timing diagram, one bit of data is being manipulated by an
algorithm:
[Figure: timing diagram of data and sysclk.]
Figure D-7. A realistic synchronous timing diagram with propagation delay.
but as discussed in section C.1.1, we normally ignore propagation delay. At the
beginning of the design process, the primary concern of the designer is getting the
algorithm right. Worrying about physical reality is a distraction from the designer's most
important mission: ensuring that the algorithm is correct.
The following diagram is not synchronous. It is known as asynchronous because the
data pulse might occur at any time with respect to sysclk:
Figure D-8. An asynchronous timing diagram.
With only one exception that happens when a machine is first turned on (described in
sections 4.4.5 and 7.1.6), we will not use such asynchronous logic.
Synchronous design is safe and easy. Asynchronous design is hard and dangerous.
Commercial synthesis tools concentrate on synchronous design. Therefore,
synchronous design is widely used in industry.
D.4 Bus timing diagrams
Digital computers represent values other than zero and one using a group of bits on a
bus with the binary number system. The physical reality is that each wire in a bus
represents a separate bit of information. But from an algorithmic viewpoint, the
designer wants to look at the bus as containing a single binary value. Suppose the value of
a variable, v, goes through the sequence 0, 1, 2, 3, 0, 1, 2, 3, 0 .... This could be shown
on a timing diagram as two separate bits that change synchronously with sysclk:
Figure D-9. Timing diagram showing individual bits of a bus.
However dealing with separate bits is quite tedious. Instead timing diagrams usually
show the numeric value of the complete bus during each clock cycle:
shows the instant in time (a particular rising edge of sysclk) when the numeric value
of the bus changes. It is only necessary for one bit of the bus to change for the numeric
value of the bus to be completely different.
Some people refer to din as the D input and dout as the Q output. When n=1, this
device is referred to as a D-type flip flop. In fact, an n-bit D-type register is usually
built from n D-type flip flops.
In the D-type register, dout is simply a delayed version of din. Put another way,
dout in the present clock cycle is the same as din in the previous clock cycle.
Suppose that din just happens to be going through the binary sequence:
Figure D-13. Example timing diagram for D-type register.
dout will also go through the same sequence, but it will lag by one clock cycle. In the
above, x means unknown (see section 3.5.3 for details on how 'bx is used in Verilog
simulation), because there is not enough information to predict what is in the register at
the beginning.
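The one-cycle lag can be reproduced with a short behavioral model. This is a sketch only; the module name, the SIZE parameter and the use of a non-blocking assignment are choices made here for illustration. Until the first rising edge of sysclk, dout holds 'bx in simulation, which matches the unknown value described above.

```verilog
// Sketch of a simple D-type register: dout follows din one cycle later.
// Module name and SIZE are assumed for illustration.
module d_register(dout, din, sysclk);
  parameter SIZE = 1;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  sysclk;
  reg    [SIZE-1:0] dout;

  // dout changes only at the rising edge of sysclk.
  always @(posedge sysclk)
    dout <= din;
endmodule
```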
As another example, consider what happens when din is somewhat more random:
[Figure: timing diagram of din, dout and sysclk.]
The reason the D-type register by itself is often inadequate for many problems is that
the D-type register only remembers the old value for one clock cycle. Most algorithms
have variables that must remain unchanged for multiple clock cycles. This requires a
more sophisticated kind of register, discussed in the next section.
[Figure: symbol for an enabled register with n-bit din, n-bit dout and a ld input.]
The following action table describes what the enabled register does based on the ld
input:
An action table is not a truth table, because unlike a truth table, an action table includes
the concept of time.
For example, suppose the following din and ld signals are provided to an enabled
register:
Figure D-16. Example timing diagram for enabled D-type register.
In this example, ld happens to be 0 at certain times when dout happens to be 3. This
means in the clock cycle after ld is 0, dout will continue to hold the value 3,
regardless of what din happens to be. On the other hand, when ld is 1, the enabled
register acts just like a simple D-type register.
The enabled register can be implemented as a mux connected to a simple D-type
register:
Figure D-17. Implementation of enabled D-type register using simple D-type
and mux.
When ld is 0, the mux passes through the old value of dout to be reloaded into the
simple D-type register. When ld is 1, the mux passes through the new din value to be
loaded into the simple D-type register.
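The mux-plus-register arrangement just described might be written as follows. This is a sketch under assumptions: the module name, the SIZE parameter and the internal wire name next are not from the text.

```verilog
// Sketch of an enabled register built from a mux and a simple D-type
// register. Module name, SIZE and the wire "next" are assumed.
module enabled_register(dout, din, ld, sysclk);
  parameter SIZE = 1;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  ld, sysclk;
  reg    [SIZE-1:0] dout;
  wire   [SIZE-1:0] next;

  // The mux: new value when ld is 1, old value when ld is 0.
  assign next = ld ? din : dout;

  // The simple D-type register.
  always @(posedge sysclk)
    dout <= next;
endmodule
```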
Other arrangements of hardware not based on the simple D-type register can also imple-
ment an enabled register. Therefore, in the top-down approach, designers typically
specify an enabled (or loadable) register without concern for how it is implemented.
In the TTL logic family, the 74xx377 (for n=8) and 74xx378 (for n=6) chips implement
the same actions as the above.
D.7 Up counter
Perhaps the most important of these special operations is counting. Most algorithms
include steps that involve counting. In fact, the very first practical machine ever built
with digital electronics (by Wynn-Williams in 1932) was a counter used to count alpha
particles for a physics experiment conducted by Lord Rutherford. Since that time,
billions of counters have been fabricated.
There are many variations on how to build a counter. In this book, we will concentrate
only on the two most important kinds of counters: the synchronous loadable binary up
counter (described in this section), and the synchronous loadable binary up/down counter
(described in section D.8). We will refer to these more simply as the up counter and the
up/down counter, respectively. When the word counter is used by itself in this book, it
means the synchronous loadable binary up counter.
The up counter has three command inputs. The ld command signal is the same as it is
in an enabled register. The clr command signal causes the counter to become zero at
the next rising edge of the clock. The count command signal (sometimes referred to
as the inc command signal) causes the counter to increment at the next rising edge of
the clock.
The behavior of the up counter is summarized by the following action table:

ld   clr   count   action
0    0     0       hold
0    0     1       increment
0    1     0       clear
0    1     1       clear
1    0     0       load
1    0     1       load
1    1     0       load
1    1     1       load

Note that the ld signal has a higher priority than clr and count. Also clr has a
higher priority than count. An up counter can be constructed from a simple D-type
register, three muxes and an incrementor:
Recall that the combinational logic incrementor (section C.5.1) is considerably faster
than an adder. Even so, there are other more efficient ways of constructing a counter
than the technique shown above. For example, in the TTL logic family, the 74xx163
chip provides for n=4 the same actions as the above using fewer gates and less
propagation delay.
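The action table for the up counter can be captured in a short behavioral sketch. It is written as a chain of if statements rather than as the register, muxes and incrementor described above, and the module name and SIZE default are assumptions. The order of the tests gives ld priority over clr, and clr priority over count.

```verilog
// Behavioral sketch of the up counter action table.
// Module name and SIZE are assumed for illustration.
module up_counter(dout, din, ld, clr, count, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  ld, clr, count, sysclk;
  reg    [SIZE-1:0] dout;

  always @(posedge sysclk)
    if (ld)         dout <= din;        // load
    else if (clr)   dout <= 0;          // clear
    else if (count) dout <= dout + 1;   // increment
                                        // otherwise hold
endmodule
```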
D.8 Up/down counter
Some algorithms involve both incrementing and decrementing the same variable. For
such algorithms, the use of an up/down counter may be appropriate. The up/down counter
has three command inputs. The ld command signal is the same as in the earlier regis-
ters. The count command signal causes the counter to increment or decrement at the
next rising edge of the clock, depending on the up command signal. If up is 1 when
count is 1, the counter increments. If up is 0 when count is 1, the counter decre-
ments.
[Figure: symbol for an up/down counter with n-bit din, n-bit dout and command inputs
ld, count and up.]
The behavior of the up/down counter is summarized by the following action table:

ld   count   up   action
0    0       0    hold
0    0       1    hold
0    1       0    decrement
0    1       1    increment
1    0       0    load
1    0       1    load
1    1       0    load
1    1       1    load

An up/down counter can be constructed from a simple D-type register, two muxes, a
combinational logic incrementor and a combinational logic decrementor:
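As with the up counter, the up/down counter's action table can be sketched behaviorally. The module name and SIZE default are assumptions, and the sketch does not show the individual muxes, incrementor and decrementor.

```verilog
// Behavioral sketch of the up/down counter action table.
// Module name and SIZE are assumed for illustration.
module updown_counter(dout, din, ld, count, up, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  ld, count, up, sysclk;
  reg    [SIZE-1:0] dout;

  always @(posedge sysclk)
    if (ld)
      dout <= din;                 // load has the highest priority
    else if (count)
      begin
        if (up) dout <= dout + 1;  // increment
        else    dout <= dout - 1;  // decrement
      end
endmodule
```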
D.9 Shift register
Like counting, multiplication and division by two, as well as the related operations of
rotation, can be implemented within a specialized device. Shift registers are sequential
building blocks that implement these operations internally. There are many kinds of
shift registers. The kind used in this book is a synchronous parallel loadable left/right
shift register, with left and right shift (serial) inputs. This device is referred to simply as
a shift register in this book.
The shift register has a clr signal (similar to the up counter) and a two-bit shiftctrl
signal. The action table for this shift register is:

clr   shiftctrl   action
0     00          hold
0     01          right
0     10          left
0     11          load
1     00          zero
1     01          zero
1     10          zero
1     11          zero

In addition to the n-bit-wide din bus that all synchronous registers have, the shift
register has two inputs, rsi and lsi, each one bit wide, that only play a role when the
shift register is shifting:
Figure D-22. Symbol for shift register.
The one-bit input rsi is ignored except when the register is shifting right
(shiftctrl=01), in which case rsi determines the value of the most significant bit
of dout for the next clock cycle. Similarly, lsi is ignored except when the register is
shifting left (shiftctrl=10), in which case lsi determines the value of the least
significant bit of dout for the next clock cycle.
This can be implemented using two combinational logic shifters, two muxes and a
simple D-type register.
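A behavioral sketch of this shift register follows. It uses concatenation in place of separate shifter and mux instances; the module name and SIZE default are assumptions, and SIZE is assumed to be at least 2.

```verilog
// Behavioral sketch of the shift register action table (SIZE >= 2).
// Module name and SIZE are assumed for illustration.
module shift_register(dout, din, rsi, lsi, clr, shiftctrl, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  rsi, lsi, clr, sysclk;
  input  [1:0] shiftctrl;
  reg    [SIZE-1:0] dout;

  always @(posedge sysclk)
    if (clr)
      dout <= 0;                                 // zero
    else
      case (shiftctrl)
        2'b00: dout <= dout;                     // hold
        2'b01: dout <= {rsi, dout[SIZE-1:1]};    // right: rsi enters the MSB
        2'b10: dout <= {dout[SIZE-2:0], lsi};    // left: lsi enters the LSB
        2'b11: dout <= din;                      // load
      endcase
endmodule
```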
Recall that the combinational logic shifters do not cost anything. There are other ways
of constructing this than the technique shown above. For example, in the TTL logic
family, the 74xx194 chip provides for n=4 the same actions as the above using fewer
gates and less propagation delay.
D.10 Unused inputs
Sometimes a designer needs more capability than an enabled register, but not as much
as is offered by one of the other register building blocks described above. For example,
a designer may need a register that omits any one of the three command inputs of an up
counter:
[Figure: symbols for three registers: an enabled clearable register (ld and clr), a
non-clearable counter (ld and count), and a non-loadable counter (clr and count).]
The register on the left omits the count signal and is therefore not truly a counter. The
register on the left is known as an enabled clearable register. The register in the middle is
a counter that does not ever need to be cleared but that instead is loaded with din. The
register on the right is a counter that never has to be loaded and therefore does not need
a din bus.
All three of these are specializations of the up counter described in section D.7. They
can be implemented by tying one of the three command inputs of an up counter to 0:
Figure D-25. Implementations for these registers using a loadable clearable
up counter.
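Tying a command input to 0 might look like the following in Verilog. Everything named here is hypothetical: the up_counter module is a behavioral stand-in for the building block of section D.7, and the three wrapper module names are invented for this sketch. For the non-loadable counter, din is tied to zero in this sketch since it is never loaded.

```verilog
// Behavioral stand-in for the up counter of section D.7 (names assumed).
module up_counter(dout, din, ld, clr, count, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  ld, clr, count, sysclk;
  reg    [SIZE-1:0] dout;

  always @(posedge sysclk)
    if (ld)         dout <= din;
    else if (clr)   dout <= 0;
    else if (count) dout <= dout + 1;
endmodule

// count tied to 0: an enabled clearable register.
module enabled_clearable_register(dout, din, ld, clr, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  ld, clr, sysclk;
  up_counter #(SIZE) u(dout, din, ld, clr, 1'b0, sysclk);
endmodule

// clr tied to 0: a counter that is loaded instead of cleared.
module nonclear_counter(dout, din, ld, count, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  [SIZE-1:0] din;
  input  ld, count, sysclk;
  up_counter #(SIZE) u(dout, din, ld, 1'b0, count, sysclk);
endmodule

// ld tied to 0: a counter that is never loaded, so it has no din port.
module nonload_counter(dout, clr, count, sysclk);
  parameter SIZE = 4;
  output [SIZE-1:0] dout;
  input  clr, count, sysclk;
  up_counter #(SIZE) u(dout, {SIZE{1'b0}}, 1'b0, clr, count, sysclk);
endmodule
```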
Figure D-26. Symbol for a non-clearable up counter.
5 Vcc and ground supply power to a chip. The chip will not operate without these
connections. Likewise, synchronous devices will not operate without a connection to
sysclk.
Figure D-28. Symbol for a non-clearable shift register.
This can be implemented as:
the building blocks given in earlier sections, only to have the synthesis tool convert
those building blocks into some more efficient specialized one which is specific to
their particular algorithm.
From a theoretical viewpoint, every computer can be thought of as a single very big
register, whose value is meaningless to the human mind. In essence, this theoretical
approach treats this one register as the concatenation of every piece of information the
computer needs to remember. Mathematicians like to conceptualize things this way,
but such an approach is an oversimplification that does not help a practical designer.
The building blocks given earlier are at the right level of abstraction for practical use.
They are available as isolated chips (74xx377, 74xx378, 74xx163, 74xx669 and
74xx194) suitable for laboratory experiments which build the confidence of novice
designers. They are commonly used by synthesis tools, even though synthesis tools
may sometimes do something more sophisticated. In order to understand the more
sophisticated things that synthesis tools do, one must already be familiar with the building
blocks given in the earlier sections of this appendix.
D.13 Exercises
D-1. Complete the following timing diagram to show dout, given an enabled
register with a 4-bit din, and a control input ld:
[Timing diagram: sysclk, ld and din, with din taking the values 5, 3 and 1.]
E. TRI-STATE DEVICES
A tri-state device is a special kind of combinational building block that has the ability
to disconnect its output logically from the bus to which that output is physically
connected. For simplicity, the combinational devices defined in appendix C and used
throughout most of this book do not have tri-state capabilities, although many actual
chips do. This appendix describes what tri-state devices are, and shows two common
uses for them.
E.1 Switches
As explained in section C.2.1, a bus is composed of several wires that run in parallel to
each other. The bit transmitted on each wire of the bus originates at the output of some
gate (such as an AND gate), and is received at the input(s) of other gate(s). Although
computer designers normally prefer to abstract away the electronic details of how a
gate operates, some understanding of how a non-tri-state device operates is necessary
to understand the extra feature provided by a tri-state device.
Each non-tri-state gate is actually composed of several simpler switching devices, such
as transistors. Although the details in the operation of these switching devices depend
upon the technology family used (CMOS, TTL, etc.), the effect they have on the gate's
output is partly analogous to the effect that a wall switch has on the voltage across the
filament of a light bulb. When the wall switch is open, the light is turned off because
the voltage at point a is independent of the voltage at point b:
[Figure E-1: light bulb circuit with the switch open; points a, b and c are labeled.]
Saying that the switch is open is the same as saying a is disconnected from b. Since the
filament of an ordinary light bulb is really just a wire that is a poor conductor (a resis-
tor), the voltage at b will be the same as at c. For this reason, the filament is cool, and
the light does not shine. On the other hand, when the switch is closed, the light is turned
on because the voltage at a is identical to the voltage at b.
Figure E-2. A closed switch causes the light to be on.
Saying that the switch is closed is the same as saying a is connected to b.
E.1.1 Use of switches in non-tri-state gates
Non-tri-state gates are more complicated than light switches in two ways. First, the
gate has to compute the desired output bit (which may require switching devices not
described here). Second, the gate has to connect the output wire to the proper voltage.
In most technologies, connecting the output wire to the proper voltage requires two
switches: the top switch connects the output wire to the voltage for the bit 1, and the
bottom switch connects the output wire to the voltage for the bit 0. For example,
suppose the gate needs to output the bit 0. To do this, the "1" switch is open and the "0"
switch is closed:
[Figure: gate output stage with the "1" switch open and the "0" switch closed.]
[Figure: single-bit tri-state gate with inputs in and enable and output out.]
The behavior of this gate can be described by the following truth table:

enable   in   out
0        0    z
0        1    z
1        0    0
1        1    1

In other words, the tri-state driver gate is really nothing more than an electronically
controlled switch. When enable is 1, the switch is closed:
[Figure: tri-state driver gate with its switch closed, connecting in to out.]
There is a Verilog built-in gate, known as bufif1, that implements this. For example,
the following instance:
    wire out,in,enable;
    bufif1 b1(out,in,enable);
is equivalent to the single-bit tri-state gate shown above.
As described in section 6.3.4, Verilog allows you to indicate the propagation delay of a
built-in gate, such as the bufif1:
    wire out,in,enable;
    bufif1 #10 b1(out,in,enable);
Also, Verilog allows you to indicate different delays for the (rising) time required to
change to a one and the (falling) time required to change to a zero. For built-in gates
such as bufif1, there is a third separate time that may be of interest in some designs,
the turn off delay, which is how long it takes when the output changes to 1'bz. For
example:
    wire out,in,enable;
    bufif1 #(10,20,30) b1(out,in,enable);
takes 10 units of $time if out becomes one, 20 units of $time if out becomes zero
and 30 units of $time if out becomes 1'bz.
Also Verilog provides other forms of tri-state gates, such as bufif0, which has an
active low enable signal. For example, the following:
    wire out,in,enable,enable_low;
    not i1(enable_low,enable);
    bufif0 #(10,20,30) b1(out,in,enable_low);
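To compare the two forms in simulation, the bufif1 gate and the inverter followed by bufif0 can be placed side by side. This sketch omits the delays, and the module name and the output names out1 and out0 are hypothetical. For every combination of in and enable, the two outputs are expected to agree, including the 1'bz case when enable is 0.

```verilog
// Sketch: bufif1 next to an inverter plus bufif0, without delays.
// Module name and the output names out1 and out0 are assumed.
module tri_demo(out1, out0, in, enable);
  output out1, out0;
  input  in, enable;
  wire   enable_low;

  bufif1 b1(out1, in, enable);        // active high enable
  not    i1(enable_low, enable);
  bufif0 b0(out0, in, enable_low);    // active low enable
endmodule
```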
Figure E-9. Tri-state bus driver.
The symbol for a tri-state bus driver looks like a mux, except there is only one input
bus (which is n bits wide). Since a mux always has at least two input busses, there
should be no reason to confuse these two devices, both of which are symbolically
represented as triangles.

Physically, the tri-state bus driver is composed of n independent tri-state driver gates,
each one of which is physically a bufif1 instance. Like all other gate-level features
of Verilog, working with bufif1 gates is not easy, and so it is better to think of an
n-bit-wide tri-state bus driver like any other bus-width building block device, using the
combinational logic modeling technique described in section 3.7.2.1:

    module tristate_buffer(out,in,en);
      parameter SIZE=1;
      output out;
      input in,en;
      reg [SIZE-1:0] out;
      wire [SIZE-1:0] in;
      wire en;
      always @(in or en)
        begin
          if (en === 1)
            out = in;
          else if (en === 0)
            out = 'bz;
          else
            out = 'bx;
        end
    endmodule

The 'bz provides as many 1'bz values as is required by SIZE.

E.4 Uses of tri-state
There are two main uses of tri-state devices: replacement for muxes and bidirectional
buses.

E.4.1 Tri-state buffers as a mux replacement
The first primary use of tri-state bus drivers is to create a structure that is a replacement
for a mux. For example, the following:
serves the same role as a two-input mux. The above can be described in Verilog by
using two instances of the tristate_buffer defined in the last section:

    module sillymux(out,i0,i1,sel);
      parameter SIZE=1;
      output out;
      input i0, i1, sel;
      wire [SIZE-1:0] out,i0,i1;
      wire sel;
      wire nsel;
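
As a sketch of how the two instances might be wired: the instance names b1 and b2 and the rule that b1 is tri-stated when sel is 1 come from the discussion of this example; the inverter name n1, the port order, the #(SIZE) parameter passing, the ranged port declarations and the test module sillymux_test are assumptions made here for illustration.

```verilog
// tristate_buffer restated with ranged ports so the sketch is self-contained
module tristate_buffer(out,in,en);
  parameter SIZE=1;
  output [SIZE-1:0] out;
  input  [SIZE-1:0] in;
  input  en;
  reg    [SIZE-1:0] out;
  always @(in or en)
    if (en === 1'b1)      out = in;
    else if (en === 1'b0) out = {SIZE{1'bz}};
    else                  out = {SIZE{1'bx}};
endmodule

module sillymux(out,i0,i1,sel);
  parameter SIZE=1;
  output [SIZE-1:0] out;
  input  [SIZE-1:0] i0, i1;
  input  sel;
  wire   nsel;
  not n1(nsel,sel);
  tristate_buffer #(SIZE) b1(out,i0,nsel); // drives out when sel is 0
  tristate_buffer #(SIZE) b2(out,i1,sel);  // drives out when sel is 1
endmodule

// Hypothetical test module
module sillymux_test;
  reg  [3:0] i0, i1;
  reg  sel;
  wire [3:0] out;
  sillymux #(4) m(out,i0,i1,sel);
  initial
    begin
      // wait one unit so the always blocks are already waiting on their inputs
      #1 i0 = 4'b0011; i1 = 4'b1100; sel = 0;
      #1 $display("sel=%b out=%b", sel, out);
      sel = 1;
      #1 $display("sel=%b out=%b", sel, out);
      $finish;
    end
endmodule
```

With sel at 0 the bus should carry i0, and with sel at 1 it should carry i1, because exactly one of the two drivers is enabled at any time.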
There is an algorithm built into Verilog that models the physical behavior of a wire,
based upon the output port(s) of instantiated modules to which that wire is connected.
When there is only one output port connected to a wire, the value of the wire
in question reflects the value of that single-output port. When that single-output port
changes, the wire connected to it is instantaneously and automatically changed. This
is the situation that occurs throughout most of the structural examples in this book.

The situation is more complicated when there are two or more output ports connected
to the same wire. In this example, the output ports of b1 and b2 both drive
the same wire. In hierarchical naming (section 3.10.8) the output ports are b1.out
and b2.out, and the wire they both drive is simply out. The following table describes
what Verilog computes automatically as a particular bit of the wire out, given
the corresponding bits of b1.out and b2.out:

                 b2.out
    b1.out    0   1   z   x
       0      0   x   0   x
       1      x   1   1   x
       z      0   1   z   x
       x      x   x   x   x

If we guarantee either that every bit of b1.out is 1'bz or that every bit of
b2.out is 1'bz, we can be certain that no bit of out will be 1'bx because of a conflict
between the two drivers (see the z row and the z column above).
This is precisely what the two tri-state drivers do for us. When sel is 1, every bit of
b1.out is tri-stated, but when sel is 0, every bit of b2.out is tri-stated.
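
The resolution table can be reproduced by letting two drivers share one wire. This is a minimal sketch: the module resolve_demo, the regs a and b (standing in for b1.out and b2.out) and the use of continuous assignments as the two drivers are assumptions for illustration.

```verilog
// Hypothetical demonstration of how Verilog resolves two drivers on one wire.
module resolve_demo;
  reg  a, b;        // stand-ins for b1.out and b2.out
  wire out;
  reg  val [0:3];   // the four possible bit values
  integer i, j;

  assign out = a;   // first driver
  assign out = b;   // second driver

  initial
    begin
      val[0] = 1'b0; val[1] = 1'b1; val[2] = 1'bz; val[3] = 1'bx;
      for (i = 0; i < 4; i = i + 1)
        for (j = 0; j < 4; j = j + 1)
          begin
            a = val[i]; b = val[j];
            #1 $display("b1.out=%b b2.out=%b out=%b", a, b, out);
          end
    end
endmodule
```

The sixteen lines printed should agree, entry by entry, with the table.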

E.4.2 Bidirectional buses
Although most of this book assumes that a wire is essentially free in the fabricated
hardware, in fact a wire does cost something. The cost is fairly reasonable when the
wire remains hidden inside a physical chip, but the cost is quite high when that wire
must be routed outside the chip.

Each bit of a Verilog wire that must be routed outside a chip requires a physical pin.
One of the most severe limitations in hardware design is the number of pins available
for a chip. Therefore, hardware designers often wish to make maximum utilization of
the pins that are available. A bidirectional bus is one which sends information both
ways:

[Figure: a chip with a bidirectional bus of width SIZE]

Routing a bidirectional bus off chip requires half the number of pins that routing two
unidirectional buses requires:

Bidirectional buses are especially important in the design of memory systems (section
8.2.2.1).

4 Although it is not necessary, an input port could have an intervening buffer into the chip,
and an output port could have an intervening buffer out of the chip.

The algorithm Verilog uses to determine the value on an inout port combines the
value outside the module together with the value inside the module, according to the
table given in section E.4.1.1 (except the names will be at a different point in the
hierarchy than b1.out and b2.out). The distinction between an input port and an
inout port is not visible within the instantiated module (containing the inout
declaration). This distinction is only visible within some other instantiating module (which
instantiates the module having the inout port).

E.4.2.2 A read/write register
To illustrate how a bidirectional bus can reduce the number of pins on a chip, consider
a register whose values can be read and written using a single bidirectional bus:

Figure E-13. A read/write register with a bidirectional bus.

If this device were fabricated on a single chip, it would require 5+SIZE pins (including
the clock and power). In comparison, the enabled register using unidirectional buses
(described in sections D.6 and 4.2.1.1) would require 4+2*SIZE pins, which is almost
twice as many.

In order for the bidirectional bus to do double duty, there must be two command inputs:
rd and wr. When rd is one, this device drives the bus (provides output) to show the
current contents of the register. When wr is one, this device leaves the bus alone ('bz)
and instead the bus provides the input which the register will load at the next rising
edge of the clock. Here is the internal structure of this register:

Figure E-14. Implementation of figure E-13.

Here is an example of using two instances of the read/write register defined above:

    reg r1rd,r1wr,r2rd,r2wr;
    wire [3:0] bus1;

Unlike the sillymux example, there is nothing in the above to guarantee that bus1
will avoid becoming 'bx. Instead, it is the responsibility of the designer to ensure that
r1rd and r2rd are never simultaneously one.

For example, to implement the register transfer r1 <- r2 requires generating the
commands:
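
A minimal behavioral sketch of such a transfer follows. It is not the book's structural implementation: the module name rw_register, its port order, the internal reg q, the clock name sysclk and the test module are assumptions. The signal names r1rd, r1wr, r2rd, r2wr and bus1 follow the text.

```verilog
// Hypothetical behavioral read/write register with one bidirectional bus.
module rw_register(bus,rd,wr,clk);
  parameter SIZE=1;
  inout  [SIZE-1:0] bus;
  input  rd, wr, clk;
  reg    [SIZE-1:0] q;

  // drive the bus only while rd is one; otherwise leave it alone ('bz)
  assign bus = rd ? q : {SIZE{1'bz}};

  // load from the bus at the rising edge of the clock when wr is one
  always @(posedge clk)
    if (wr) q <= bus;
endmodule

module rw_register_test;
  reg  r1rd,r1wr,r2rd,r2wr,sysclk;
  wire [3:0] bus1;

  rw_register #(4) r1(bus1,r1rd,r1wr,sysclk);
  rw_register #(4) r2(bus1,r2rd,r2wr,sysclk);

  initial
    begin
      sysclk = 0;
      r1rd = 0; r1wr = 0; r2rd = 0; r2wr = 0;
      r1.q = 4'd3; r2.q = 4'd9;   // initial contents, set hierarchically for the demo
      // r1 <- r2: r2 drives the bus while r1 loads from it
      r2rd = 1; r1wr = 1;
      #10 sysclk = 1;             // rising edge performs the transfer
      #10 sysclk = 0;
      r2rd = 0; r1wr = 0;
      #1 $display("r1=%d r2=%d bus1=%b", r1.q, r2.q, bus1);
      $finish;
    end
endmodule
```

Only r2rd and r1wr are asserted, so r2 is the sole driver of bus1 during the transfer; after the edge both registers should hold the value that was in r2, and the bus returns to 'bz once r2rd is dropped.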

E.5 Further Reading
PALNITKAR, S., Verilog HDL: A Guide to Digital Design and Synthesis, Prentice Hall
PTR, Upper Saddle River, NJ, 1996. Chapter 5.

E.6 Exercises
E-1. Revise the architecture of the two-state division machine (whose Verilog code is
given in section 4.2.3) so as to eliminate the instance of mux2 and instead use two
instances of the tristate_buffer defined in section E.3. Use the test code given
in section 4.1.1.1.
E-2. Define a behavioral Verilog module (binmem) for an asynchronous bidirectional
memory (section 8.2.2.1) consisting of 4096 words, each 12-bits wide. The ports are a
12-bit addr bus, a 12-bit data bus and the commands write and enable. The
following table describes the actions of this memory:

[Timing diagram: ras and cas waveforms; addr carries 'bx, row, col, 'bx; data carries 'bz, m[{row,col}], 'bz]

Writing to such a memory is similar, except a write signal is asserted and the new
content is provided to the chip on the data bus during the entire time. Define a structural
Verilog module for a memory containing 16 twelve-bit words using twenty instances
of the rw_register defined in section E.4.2.2 together with additional combinational
logic.

    veriwell file1.v file2.v ...

which will produce the output of $display commands both on the screen and in a
file known as veriwell.log. For the GUI versions, you need to create a "project
file" by selecting Project (Alt P) New (Alt N) and choose a name (.prj) for the project
file. Then select Project (Alt P) Add file (Alt F) to specify the .v file name(s). To run
the simulator, select Project (Alt P) rUn (Alt U).

Most of the designs in this book are able to simulate on the free version. Wellspring
Solutions sells a hardware key that removes the limitations of the free version and also
sells a separate package for graphical output:

    Wellspring Solutions
    7 Tudor Drive, Suite 300
    Salem, NH 03079
    (603) 898-1100

F.3 M4-128/64 demoboard
Documentation for the M4-128/64 CPLD chip used in chapter 11 can be downloaded
from www.vantis.com. The demoboard, power supply, download cable and
MACHPRO software can be obtained from:

F.4 Wirewrap supplies
To build the CPU described in chapter 11 requires wirewrap wire, a wirewrap tool
(such as an "all in one" tool that strips, wraps and unwraps) and a wirewrap socket. It
also requires a memory ("RAM") chip, such as the 2102. Some of these may be available
at local electronics stores, but there are several mail-order companies, such as
Jamesco (www.jamesco.com), that carry a complete selection of such supplies.

F.5 VerilogEASY
The synthesis package used in chapter 11, known as VerilogEASY, is sold by MINC,
Inc. (www.minc.com). VerilogEASY comes in several versions, each targeting different
vendors' programmable logic. A limited version of VerilogEASY that targets the
M4-128/64, but that is restricted on the number of inputs and outputs, will be available
to readers of this book in the last quarter of 1998. There is no charge for downloading
this limited version, but MINC requires that people downloading their software register
at their Web site. VerilogEASY accepts the common synthesizable subset of Verilog
other than implicit style. VerilogEASY produces two output files: .src (in the proprietary
DSL language) and .v (structural Verilog netlist). To fabricate working hardware
requires other tools, described in sections F.3 and F.6. MINC also sells a full version of
VerilogEASY and an even more powerful synthesis tool, known as PLSynthesizer:

design for the M4-128/64 without using this tool. PLDesigner can be purchased from
MINC. The following directions apply to PLDesigner: At the PLDesigner menu, choose
File (Alt F) Open (Alt O), and enter the name of the .src file created by VerilogEASY.
Do a File (Alt F) eXit (Alt X). Select Device (Alt D) Parameters (P) and choose the
M4-128/64 (MACH445) and say OK. Select Settings (Alt ) Options (Alt O) and be
sure Timing Models are set only to generic Verilog. Select Project (Alt P) Build all (Alt
B) to create the JEDEC file (.j1). To create the back annotated Verilog, select Project
(Alt P) Generate Timing Model (Alt T), which will put the .v file in a model
subdirectory (since a similarly named .v file (the input to VerilogEASY) will already
exist).

F.7 VITO
The Verilog Implicit To One hot (VITO) preprocessor is a freely available synthesis
preprocessor written by James D. Shuler and Mark G. Arnold. It may be downloaded
from the Prentice Hall Web site. It can also be downloaded from
www.cs.brockport.edu/~jshuler or plum.uwyo.edu/~vito. UNIX and
MSDOS versions are available at those Web sites. The theory of how this tool operates
is discussed in chapter 7. It is a command-line program, and the following is a typical
use:

    vito -t implicit.v >explicit.v

where implicit.v is the name of a file consisting of one or more modules that have
implicit style state machines. The -t option generates comments that explain the
transformation. The output of VITO is redirected to another file (explicit.v), which
would then be used as the input to VerilogEASY (or another synthesis tool). The
designer is free to choose other file names.

    vito.out
    vito.stmt
    vito.arch
    s_
    sT-
    tmp_
    join_
    iff
    qual_
    reset
    sysclk
    @(posedge sysclk)
    new_
    @(posedge sysclk or negedge reset)
    -reset
    vito.tail

The other names in this file are the prefixes of wire names that VITO will generate, and
the temporary files VITO uses. This file is based on position, and so extra blank lines
are not allowed.

F.8 Open Verilog International (OVI)
The independent organization that developed the Verilog standard (IEEE 1364) is known
as Open Verilog International (www.ovi.org). OVI is co-sponsor of the International
Verilog Conference (www.hdlcon.org) held each spring at the Santa Clara
Convention Center. OVI sells a language reference manual, which is the authoritative
source for questions of Verilog syntax and semantics:

    Open Verilog International
    15466 Los Gatos Blvd.
    Suite 109-071
    Los Gatos, CA 95032
    (408) 353-8899
    [email protected]

F.9 Other Verilog and programmable logic vendors
Here is a partial list of other Verilog and programmable logic vendors' Web sites:
www.altera.com, www.avanticorp.com, www.cadence.com, www.fintronic.com,
www.sunburst-design.com, www.synopsys.com, www.simucad.com,
www.synplicity.com, www.veribest.com, www.xilinx.com and
www.eg.bucknell.edu/~cs320/1995-fall/verilog-manual.html.

F.10 PDP-8
Additional resources relating to the PDP-8 can be found at strawberry.uwyo.edu,
www.in.net/~bstern/PDP8/pdp8.html, www.faqs.org/faqs/dec-fac
and www.cs.uiowa.edu/~jones/pdp8. A portion of this information is also
available at the Prentice Hall website.

F.11 ARM
Additional resources relating to the ARM can be found at www.arm.com.

ARM documentation is copyright 1997 Advanced RISC Machines, Ltd. and is reprinted by permission.

The semantics of <= and <=#delay are more subtle than the extra always analogy
explained in section 3.8.2, which only applies to <= @(posedge sysclk). In general,
non-blocking expressions are evaluated immediately and put into a simulator queue
to be stored after all blocking assignments at the $time given by the specified #delay.
Clifford Cummings of Sunburst Design, Inc. (www.sunburst-design.com)
gave a very informative presentation on the use of such non-blocking assignment
statements at the 1998 International Verilog Conference. He suggested that the blocking
assignment, =, should be primarily limited to modeling combinational logic, as in:

    always @(a or b)
      sum = a + b;

H.1 Sequential logic
According to Cummings's guidelines, sequential logic, such as a simple D type register,
should use the non-blocking assignment:

    always @(posedge sysclk)
      dout <= din;

The <= without time control above has the same meaning as <= #0 (sections 11.3.3
and 11.5.6). It causes the simulator to put the assignment into a special "non-blocking
event queue" that stores a new value after all = and =#0 have finished but before
$time advances. The <= without time control is useful for situations where an explicit
style module uses the same regs on opposite sides (example in bold) of different
assignments that model distinct sequential devices, as in:
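
A minimal sketch of such a pair of always blocks, in which r1 and r2 each appear on opposite sides of the two assignments. The 4-bit width, the initial values and the test module swap_demo are assumptions for illustration.

```verilog
// Hypothetical demonstration: two registers that exchange values each clock.
module swap_demo;
  reg sysclk;
  reg [3:0] r1, r2;

  always @(posedge sysclk)
    r1 <= r2;

  always @(posedge sysclk)
    r2 <= r1;

  initial
    begin
      sysclk = 0; r1 = 4'd5; r2 = 4'd10;
      #10 sysclk = 1;                      // one rising edge
      #1 $display("r1=%d r2=%d", r1, r2);  // the two values have traded places
      $finish;
    end
endmodule
```

Both right-hand sides are evaluated at the edge before either register is stored, which is why the result does not depend on which always block the simulator runs first.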
Assuming r1 and r2 were initialized (not shown), the above exchanges r1 and r2,
regardless of the order in which the simulator schedules the two always blocks. Using
= instead of <= above would have the incorrect effect of duplicating the value of
one of the registers (seemingly chosen at random) into both of them. To use = in this
situation requires intervening combinational logic:

    always @(posedge sysclk)
      r2 = new_r1;
    always @(r1)
      new_r1 = r1; //identity

with similar code for r2, which is hard for designers to remember when the combinational
logic is simply the identity function. This problem does not occur when the
interacting always blocks are in separate modules because the port(s) act like the
intervening combinational logic.

Most existing explicit style designs, including examples in this book (sections 3.7.2.2,
7.2.2.1, 11.3 and 11.7), use = properly (with intervening combinational logic or ports)
rather than <=. Probably many designers stumble onto correct sequential logic using =
without understanding why it is correct. Even more alarming, some incorrect sequential
logic using = may appear to be correct because of the arbitrary order in which the
Verilog simulator schedules the assignments. Cummings suggested designers use only
<= for sequential logic to guarantee correct operation without making the designer
remember the intervening combinational logic or ports.

H.2 $strobe
Cummings also suggested using the $strobe system task, which works like $display,
but shows the result of non-blocking assignment at the same $time the assignment
is made. For example, instead of the $display code with delay used in many
examples in this book, as typified by section 3.8.2.3.2, one can use $strobe,
which has the advantage that the values that will take effect during a particular clock
cycle will be displayed at the actual $time of the rising edge. With $display, there
must be at least #1 delay (#20 in this example) beyond @(posedge sysclk) to
view the values changed by non-blocking assignments.
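
The difference can be seen by calling both tasks at the same rising edge. This is a minimal sketch, not the book's example: the module strobe_demo, the 4-bit counter count and the clock timing are assumptions.

```verilog
// Hypothetical comparison of $display and $strobe at a rising edge.
module strobe_demo;
  reg sysclk;
  reg [3:0] count;

  always @(posedge sysclk)
    count <= count + 1;

  always @(posedge sysclk)
    begin
      // runs before the non-blocking store, so it shows the old value
      $display("display at %0d: count=%d", $time, count);
      // runs at the end of this $time, so it shows the newly stored value
      $strobe("strobe  at %0d: count=%d", $time, count);
    end

  initial
    begin
      sysclk = 0; count = 0;
      #10 sysclk = 1;
      #10 sysclk = 0;
      #10 sysclk = 1;
      #10 $finish;
    end
endmodule
```

At each edge the $strobe line should report a count one greater than the $display line for the same $time.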

H.3 Inertial versus transport delay
An interesting contrast between blocking and non-blocking assignment that Cummings
illustrated is the difference between transport delay, which retains all the values signal
has, regardless of how briefly they exist:

    always @(signal)

    $time   0 1 2 3 4 5 6 7 8 9
    signal  1 1 1 1 2 3 3 3 3 3 ...
    dela    x x x 1 1 1 1 2 3 3 ...
    delb    x x x 1 1 1 1 2 2 2 ...
    delc    x x x 1 1 1 1 3 3 3 ...

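
The three rows of the table can be reproduced with three different placements of a #3 delay. The statement forms below are assumptions chosen because they yield the dela, delb and delc rows shown; they are not necessarily the statements used in the original presentation. The module name delay_demo and the 2-bit width are also assumptions.

```verilog
// Hypothetical reconstruction: three ways to delay signal by 3 units.
module delay_demo;
  reg [1:0] signal, dela, delb, delc;

  // non-blocking with intra-assignment delay: every value is queued (transport)
  always @(signal)
    dela <= #3 signal;

  // blocking with intra-assignment delay: samples now, stores 3 units later,
  // and misses any change of signal while it is blocked
  always @(signal)
    delb = #3 signal;

  // delay before a blocking assignment: waits 3 units, then samples signal
  always @(signal)
    #3 delc = signal;

  initial
    begin
      $monitor("%0d signal=%d dela=%d delb=%d delc=%d",
               $time, signal, dela, delb, delc);
      #0 signal = 1;   // #0 lets the always blocks reach @(signal) first
      #4 signal = 2;   // lasts only one unit of $time
      #1 signal = 3;
      #5 $finish;
    end
endmodule
```

Only dela records the value 2 and then moves on to 3; delb stays at 2 because it was blocked when signal became 3, and delc skips 2 because it samples signal only after its wait.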

H.4 Sequence preservation
Because of their queued implementation, non-blocking assignments at the same $time
occur in the sequence the <=s executed. For example, a<=1 followed by a<=2 stores
2 into a. As emphasized in section 9.6, hardware registers (described in implicit style
code with <= @(posedge sysclk)) cannot store multiple values during a single
clock cycle. Even though Verilog allows it, it is inappropriate for implicit style code to
have more than one <= @(posedge sysclk) to a given reg during a particular
clock cycle. On the other hand, Cummings pointed out that it is useful in explicit style
code for a plain <= to give a default value to the output of a state machine, which a
later <= can modify at the same $time.
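
A minimal sketch of the default-value idiom follows. It is reduced to a single output rather than a full state machine, and the names default_output_demo, go and ready are assumptions for illustration.

```verilog
// Hypothetical illustration: a default <= overridden by a later <= at the same $time.
module default_output_demo;
  reg sysclk, go, ready;

  always @(posedge sysclk)
    begin
      ready <= 0;      // default value for this clock edge
      if (go)
        ready <= 1;    // queued after the default, so this store happens last
    end

  initial
    begin
      sysclk = 0; go = 0;
      #10 sysclk = 1;
      #1 $display("go=%b ready=%b", go, ready);  // only the default was queued
      #9 sysclk = 0; go = 1;
      #10 sysclk = 1;
      #1 $display("go=%b ready=%b", go, ready);  // the later <= overrode the default
      $finish;
    end
endmodule
```

Because the queue preserves the order in which the two <= statements executed, the override is stored last, and no else branch is needed to give ready a value on every edge.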

I. GLOSSARY

The following include terms used in computer design. Terms marked with * are unique
to this book. Synonyms for terms not used in this book are also given. In addition, the
following includes Verilog features (courier font), some of which are not described
elsewhere in this book. See the references given at the end of chapter 3 for details
about Verilog features not described in this book.

Access time: The propagation delay of a memory.

Active high: A pin of a physical chip where 1 is represented as a high voltage.

Active low: A pin of a physical chip where 1 is represented as a low voltage.

*Actor: A machine or person that interacts with the machine being designed.

Address bus: A bus used to indicate which word of a memory is selected.

Algorithmic State Machine, see ASM

Architecture: 1. The hardware of a machine that manipulates data, as opposed to the
controller. Is present in mixed(1) and pure structural designs. Also known as a
datapath. 2. The programmer's model and instruction set of a general-purpose
computer. See also computer architecture. 3. A feature of VHDL that provides greater
abstraction of instantiation than Verilog does.

ALU (Arithmetic Logic Unit): Combinational logic capable of computing several
different functions of its input based on a command signal. Typically, the functions
include arithmetic operations, such as addition, and bitwise (logical) operations, such
as AND.

ASM (Algorithmic State Machine): A graphic notation for finite state machines
consisting of rectangles(1), diamonds (or equivalently hexagons), and possibly (for
Mealy machines) ovals. A pure behavioral ASM is equivalent to implicit style Verilog
with non-blocking assignment. Moore mixed(1) ASMs can be implemented as
implicit style Mealy Verilog.

Asynchronous: Logic which has memory but which does not use the system clock.

Backannotation: Recording the propagation delay in a netlist after synthesis.

Behavioral: Code which describes what a machine does, rather than how to build it.
See also pure behavioral.

CPU (Central Processing Unit): The main element of a general-purpose computer,
besides memory.

Combinational: Logic which has no memory. In Verilog, ideal combinational logic
(including a bus or tri-state device) is modeled with @ followed by a sensitivity list
or by a continuous assignment.

Combinatorial: see Combinational

Command signal: 1. An internal signal output from a controller that tells the
architecture(1) what to do. Found only at the mixed(1) and pure structural stages. 2. An
external signal output from a controller to another actor.

Computer architecture: 1. see Programmer's model and instruction set architecture.
2. A generic term for a field of study that encompasses the computer design topics in
this book along with more abstract modeling concerns not discussed here, such as
networked general-purpose computers, disk drives and associated software operating
system issues.

Concatenation: The joining together of bits, indicated by {} in Verilog.

Conditional: Non-blocking assignments (RTN) and/or command signals that occur in
a particular state only under certain conditions. See Mealy and oval.

Continuous assignment: A shorthand for instantiating a hidden module that defines
behavioral combinational logic. Allows assignment to a Verilog wire. Eliminates
the need to declare ports and sensitivity lists.

Controller: The hardware of a machine that keeps track of what step of the algorithm
is currently active. Described as an ASM at the mixed(1) stage, but as a present state
and next state logic at the pure structural stage.

CPLD (Complex Programmable Logic Device): A fixed set of AND/OR gates optionally
attached to flip flops with a programmable interconnection network allowing
the downloading of arbitrary netlists.

Data bus: A bus used to transmit words to and from a memory.

Falling delay: The propagation delay it takes for an output to change to 0.

$fclose: System task, whose argument is a file handle, that closes the associated
file. See also $fopen.

$fdisplay: A variation of $display that outputs to a file. See also $fopen,
$fclose.

Field Programmable Gate Array: see FPGA

Finite state machine, see ASM.

Flip Flop: A sequential(1) logic device that stores one bit of information. Used to
build registers and controllers.

$fopen: System function, whose argument is a quoted file name, that returns an integer
file handle used by $fdisplay, $fstrobe or $fwrite:

    integer handle;
    initial
      begin
        handle = $fopen("example.txt");
        $fdisplay(handle,"Example of file output");
        $fclose(handle);
      end

fork: An alternative to begin that allows parallel execution of each statement. For
example, the following stores into b at $time 2 but stores into d at $time 3:

    initial
      fork
        #1 a=10;
        #2 b=20;
      join
    initial
      begin
        #1 c=10;
        #2 d=20;
      end

Four-valued logic: A simulation feature of Verilog that models each bit as being one
of four possible values: 0, 1, high impedance (1'bz) or unknown value (1'bx).

FPGA (Field Programmable Gate Array): A fixed set of lookup (truth) tables optionally
attached to flip flops with a programmable interconnection network allowing
the downloading of arbitrary netlists.

$fstrobe: A variation of $strobe that outputs to a file. See also $fopen, $fclose.

Full case: A synthesis directive that causes a case statement to act as though all
possible binary patterns are listed. May cause synthesis to disagree with simulation.

$fwrite: A variation of $write that outputs to a file. See also $fopen, $fclose.

General-purpose computer: A machine that fetches machine language instructions
from memory and executes them. The machine language describes the algorithm
desired by the user, as opposed to a special-purpose computer. Also known as a
stored program computer.

Glitch: see Hazard.

Goto: A high-level language statement not found in Verilog. Similar to state transitions
in explicit style Verilog. Equivalent to assembly language jump or branch
instructions. Gotos are useful for implementing bottom testing loops. The closest
statement in Verilog is disable, which has drawbacks when used for this purpose.
Avoidance of gotos is part of structured programming, and is possible with implicit
style Verilog.

Handshaking: The synchronization required when two actors of different speed transfer
data.

Hazard: The momentary spurious incorrect result produced by combinational logic of
non-zero propagation delay.

Hexagon: Equivalent to diamond in ASM notation.

Mealy: A finite state machine that, unlike a Moore machine, produces command signals
that are a function of both the present state and the status inputs. Such a command
is indicated by an oval in ASM notation.

module: The basic construct of Verilog which is instantiated to create hierarchical
and structural designs.

Moore machine: A finite state machine that, unlike a Mealy machine, produces command
signals that are a function of only the present state. All commands in a Moore
ASM are given in rectangles.

Multi-cycle: A machine that requires several fast clock cycles to produce one result.
Compare with single cycle and pipeline.

Multi-port memory: Allows simultaneous access to multiple words within one clock
cycle.

Netlist: A structural design described at the level of connections between one-bit wires
and gates.

Next state: Combinational logic that computes what the next step is in the algorithm
based on the present state and status inputs to the controller.

Node collapsing: An optimization technique used by place and route tools.

Non-blocking assignment: A Verilog statement (<=) that evaluates an expression
now but that schedules the storage of the result to occur later. Several non-blocking
assignments can execute in parallel without delay. There are several forms, but the
one used most in this book ( <= @(posedge sysclk) ) is equivalent to the RTN
<- used in the pure behavioral stage for ASMs.

notif1: A variation of bufif1 that complements its output.

One hot: An approach for the controller that uses one flip flop for each state.

output: A Verilog feature that only allows a port to be used for information transfer
out of a module.

Oval: The ASM symbol for a Mealy command.

Parallel: 1. Two or more independent computations that occur at the same physical
time. 2. Two or more computations (dependent or independent) that occur at the
same simulation $time. In Verilog, $time is a separate issue from sequence. 3.
When one assumes physical time and sequence are the same, the opposite of sequential.

Parallel case: A synthesis directive that allows parallel evaluation of the conditions
given in a case statement.

parameter: A constant within an instantiation of a module that can be different in
each instance.

Pin: The physical connection of an integrated circuit to a printed circuit board.

Pipeline: A machine that requires, on average, slightly more than one fast clock cycle
to produce one result, provided that each result is independent of other results.
Compare with single cycle and multi-cycle.

Place and route: A post synthesis tool that maps the synthesized design into the limited
resources of a particular technology, such as a CPLD or FPGA.

PLI (Programming Language Interface): A way to interface Verilog simulations to
C software, and thus extend the capabilities of Verilog.

Port: The aspect of a module that allows structural instantiation.

posedge: The rising edge of a signal, such as sysclk.

Present state: The register that indicates what is the step in the algorithm which is
currently active.

Programmable logic: Integrated circuits manufactured with a fixed set of devices that
can be reconfigured by downloading a netlist. See CPLD and FPGA.

Programming: 1. The act of downloading a synthesized netlist into programmable
logic or a truth table into a ROM. 2. The act of designing software for a general-purpose
computer.

Programming Language Interface: See PLI.

Programmer's model: The registers of a general-purposecomputer visible to the Some ven
machine language programmer. RTN. Thi
Propagation delay: The time required for combinational logic to stabilize on the cor- implicit, e
rect result after its inputs change. RTN (Regis
*Pure behavioral: The stage where the design is thought of only as an algorithm evaluates
using RTN. Equivalent to implicit style Verilog. of the left
Verilog nc
*Pure structural: The stage where the controller and the architecture(]) are both
structural. SDF (Stand
after place
RAM: see memory.
$readmemb: System task, whose arguments are the quoted name of a text file and an
array. Reads words represented as a pattern of '0', '1', 'x' and/or 'z' from the text
file into the array.
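As a rough illustration of the $readmemb entry, the sketch below writes a two-word text file and reads it back into an array. The file name words.txt, the array name mem and the module name are hypothetical, chosen only for this example.

```verilog
// Sketch only: words.txt, mem and readmemb_demo are hypothetical names.
module readmemb_demo;
  reg [3:0] mem [0:1];   // array of two four-bit words
  integer   fd;

  initial
    begin
      // create a small text file so that the example is self-contained
      fd = $fopen("words.txt");
      $fdisplay(fd, "0101");
      $fdisplay(fd, "1x0z");
      $fclose(fd);

      // first argument: quoted file name; second argument: the array
      $readmemb("words.txt", mem);
      $display("mem[0]=%b mem[1]=%b", mem[0], mem[1]);
    end
endmodule
```

Each line of the text file supplies one word, and the 'x' and 'z' characters are carried into the array unchanged.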
$readmemh: System task, similar to $readmemb, except for hexadecimal.
Rectangle: 1. The ASM symbol for a Moore command. 2. The block diagram symbol
for most devices.
Reduction: The unary application of a bitwise operator which acts as though the
operator was inserted between each bit of the word. For example, if a is three bits, &a
is a[2]&a[1]&a[0].
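The reduction definition can be checked exhaustively for a three-bit word. The following is a minimal sketch; the module name and the variables i and mismatches are assumptions introduced for the example.

```verilog
// Sketch only: reduction_demo, i and mismatches are hypothetical names.
module reduction_demo;
  reg [2:0] a;
  integer   i, mismatches;

  initial
    begin
      mismatches = 0;
      for (i = 0; i < 8; i = i + 1)
        begin
          a = i;
          // unary & acts as if & were inserted between the bits of a
          if (&a !== (a[2] & a[1] & a[0])) mismatches = mismatches + 1;
          // the same holds for the other bitwise operators, such as | and ^
          if (|a !== (a[2] | a[1] | a[0])) mismatches = mismatches + 1;
          if (^a !== (a[2] ^ a[1] ^ a[0])) mismatches = mismatches + 1;
        end
      $display("mismatches = %0d", mismatches);
    end
endmodule
```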
reg: The declaration used when a value is generated by behavioral Verilog code.
Register: A sequential (1) device that can load, and for some register types otherwise
manipulate a value. The value in a synchronous register changes at the next rising
edge of the clock. Contrast with combinational.
repeat: A Verilog loop that repeats a known number of times. Very different than the
bottom testing loop.
Reset: The only asynchronous signal used in this book, which clears the present state.
Resource sharing: A synthesis optimization where the same hardware unit is used for
multiple computations.
Rising delay: The propagation delay it takes for an output to change to 1.
ROM (Read Only Memory): A tabular replacement for combinational logic. Not an
actual memory because it does not have the ability to forget.
RTL: 1. "Register Transfer Logic." In the pre-Verilog literature, the term RTL meant
the logic equations generated by the controller to implement register transfers
(section 4.4.1). Today, RTL most commonly means explicit style behavioral Verilog.
Some vendors (notably Synopsys) also use RTL to describe implicit style design and
RTN. This book avoids the use of the term RTL, in favor of the more precise terms:
implicit, explicit and RTN. 2. "Rotate Two Left", a PDP-8 instruction.
RTN (Register Transfer Notation): A left arrow (←) inside a rectangle or oval of an ASM
that evaluates an expression during the current clock cycle, but that schedules the
change of the left-hand register to occur at the next rising edge of the clock. Similar
to the Verilog non-blocking assignment (<= @(posedge sysclk)).
SDF (Standard Delay File): A way to backannotate delay information into a netlist
after place and route.
Sensitivity list: The list of input variables of combinational logic. The variables in
the sensitivity list occur inside @ separated by or. Failure to list all variables can
cause unwanted latches.
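A small simulation sketch of the sensitivity list entry follows. The signal names a, b, sel, good and bad are hypothetical. The second block leaves b out of its list, so in simulation its output keeps a stale value when only b changes; this stale value is the simulation counterpart of the unwanted memory the entry warns about.

```verilog
// Sketch only: all names here are hypothetical.
module sens_demo;
  reg a, b, sel;
  reg good, bad;

  // complete list: every variable the block reads appears inside @( )
  always @(a or b or sel)
    if (sel) good = a; else good = b;

  // incomplete list: b is read but not listed
  always @(a or sel)
    if (sel) bad = a; else bad = b;

  initial
    begin
      #1 a = 0; b = 0; sel = 0;   // both blocks evaluate with b == 0
      #1 b = 1;                   // only the first block is triggered
      #1 $display("good=%b bad=%b", good, bad);   // good follows b, bad does not
    end
endmodule
```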
Sequence: The order in which Verilog statements execute in simulation. Statements in
a particular always or initial block execute sequentially, regardless of time
control.
Sequential: 1. A device that has memory, such as a controller or a register, as opposed
to combinational logic. 2. Two or more dependent computations that occur in a
particular sequence, even if they occur at the same $time in a Verilog simulation. 3.
When one assumes physical time and sequence are the same, the opposite of parallel.
Simulation: The interpretation of Verilog source code to produce textual output and
timing diagrams.
Single-cycle: A machine that requires one slow clock cycle to produce one result.
Compare with multi-cycle and pipeline.
Special-purpose computer: A machine that is customized to implement only one
algorithm, as opposed to a general-purpose computer. Special-purpose computers
are often referred to simply as digital logic.
State: A step that is active in an algorithm during a particular clock cycle.
Status: See internal status and external status.
Structural: An interconnection of wires and gates (or combinational and register
devices) that forms a machine. Represented by a block diagram, circuit diagram,
instances of modules or a netlist.
There are three consequences of using an oval in an ASM chart. The first consequence,
which can be described with implicit style (pure behavioral) Verilog, is to allow
computations dependent on a decision to be initiated in parallel to the decision. For
example, the decoding and execution of a TAD instruction in chapter 9 illustrate a
decision (ir2[11:9] == 1) and a computation (ac+mb2) that occur in parallel:
    if (ir2[11:9] == 1)
      ac <= @(posedge sysclk) ac + mb2;
As in most of the examples in this book, the statement that carries out the computation
is a non-blocking assignment, so the effect will not be observable until the next rising
edge of the clock. When viewed by itself, the architecture is a Moore machine (it has
registers that only change at the rising edge). Since the output (ac) of the complete
machine (the controller together with the architecture) only changes at the clock edge,
the complete machine is Moore. Only the controller is Mealy. For this reason, implicit
style Verilog with non-blocking assignment can model such situations.
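The timing described above can be observed with a small simulation. This is only a sketch: the module name, the clock period, the initial register values and the $display calls are assumptions added for illustration; only ac, mb2, ir2 and sysclk come from the text.

```verilog
// Sketch only: tad_demo, the clock period and the initial values are assumed.
module tad_demo;
  reg        sysclk;
  reg [11:0] ac, mb2, ir2;

  // free-running clock with rising edges at $time 50, 150, ...
  initial sysclk = 0;
  always #50 sysclk = ~sysclk;

  initial
    begin
      ac  = 12'o0003;
      mb2 = 12'o0004;
      ir2 = 12'o1000;              // ir2[11:9] == 1
      #10;                         // decision and computation begin together
      if (ir2[11:9] == 1)
        ac <= @(posedge sysclk) ac + mb2;   // sum evaluated now, stored at the edge
      #1  $display("%0d before the edge: ac=%o", $time, ac);  // ac is still unchanged
      #50 $display("%0d after the edge:  ac=%o", $time, ac);  // ac now holds the sum
      $finish;
    end
endmodule
```

The non-blocking assignment does not suspend the block, so the first $display runs before the rising edge and still sees the old value of ac.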
The second consequence of using an oval in an ASM chart arises only at the mixed
stage, such as figure 5-2. Depending on how complicated the architecture is, there may
be hazards created between the controller and architecture during simulation that an
implicit style Verilog description of the controller will not process properly. The 1994
paper mentioned below describes a 'bx handshaking technique with an
exit_current_state task that overcomes this problem for Verilog simulation.
This technique is an extension to the enter_new_state method given in this book.
The third consequence of using an oval in an ASM chart arises only when a decision
involves an input to a machine, and RTN is not used to produce the corresponding
output. Figure 5-7 is an illustration of such a situation. For such ASMs, the machine
cannot be modeled just with implicit style Verilog (@(posedge sysclk) inside
always) because the output of the machine is supposed to follow the input. In other
words, if the input makes multiple changes during one clock cycle, the output should
make corresponding changes during that clock cycle. The implicit style cannot model
this, since the behavioral block will execute only once per clock cycle. Since figure 5-7
is simple combinational logic (single-state ASM), the designer uses the appropriate
sensitivity list instead of @(posedge sysclk). In general, Mealy machines often have
multiple states, but there is no implicit style notation to describe this reexecution of
the behavioral code that must take place in each Mealy state. It is necessary to use
explicit style Verilog instead. The 1998 paper gives more information about this.
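The difference between the two modeling styles can be seen in a short simulation. This sketch does not reproduce figure 5-7; the module name, the signals in, out_comb and out_impl, and the inverter used as the output function are all hypothetical.

```verilog
// Sketch only: not the machine of figure 5-7; all names are hypothetical.
module mealy_follow;
  reg sysclk, in;
  reg out_comb;   // modeled with a sensitivity list
  reg out_impl;   // modeled with @(posedge sysclk) inside always

  initial sysclk = 0;
  always #50 sysclk = ~sysclk;    // rising edges at $time 50, 150, ...

  // combinational model: reexecutes every time the input changes
  always @(in)
    out_comb = ~in;

  // implicit style model: executes once per clock cycle
  always
    begin
      @(posedge sysclk) #1;
      out_impl = ~in;
    end

  initial
    begin
      #1  in = 0;
      #60 in = 1;                 // first change within the clock cycle
      #1  $display("%0d in=%b out_comb=%b out_impl=%b", $time, in, out_comb, out_impl);
      #20 in = 0;                 // second change within the same clock cycle
      #1  $display("%0d in=%b out_comb=%b out_impl=%b", $time, in, out_comb, out_impl);
      $finish;
    end
endmodule
```

Here out_comb tracks both input changes, while out_impl keeps the value computed just after the rising edge at $time 50.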
It is possible to use a hybrid implicit/explicit style to cope with a machine that has
Mealy external outputs, such as the ASM in section 5.2.4 (figure 5-6). This hybrid
approach is synthesizable. The following shows in bold the distinctions between the
simulation only technique of section 5.3 and the hybrid implicit/explicit approach:
    reg s;
    always //implicit block
      begin
        s <= @(posedge sysclk) 0;
        @(posedge sysclk) #1;
          r1 <= @(posedge sysclk) x;
          //ready = 1;
          if (pb)
            begin
              r2 <= @(posedge sysclk) 0;
              while (r1 >= y)
                begin
                  s <= @(posedge sysclk) 1;
                  @(posedge sysclk) #1;
                    r1 <= @(posedge sysclk) r1 - y;
                    if (r1 >= y)
                      r2 <= @(posedge sysclk) r2 + 1;
                    //else
                    //  ready = 1;
                end
            end
      end

    always @(s or r1 or y) //explicit block
      begin
        if (s==0) ready = 1;
        else if (r1 >= y) ready = 0;
        else ready = 1;
      end
Index
  with `ENS, 461-462

A
A priori worst case timing analysis, 202
ABC computer, 277, 287
Abstract propagation delay, 209
Access time, 282
Acorn RISC:
  Machine (ARM) Family Data Manual, 434
  Microprocessor, 378
Action table, 531
Active high and low voltage, 495
Actor, 14, 20, 568
  memory as a separate, 312, 475
Adder, 54, 500
  bit parallel, 460
  bit serial, 461
  ripple carry, 460
Address, 280
Addressing modes, 202
  direct, 487
  indirect, 487
  PDP-8, 292, 487
Advanced Micro Devices (AMD), 442
Advanced RISC Machines, Ltd., 561
Advanced RISC Microprocessor, see ARM
Aiken, Howard, 278
Algorithmic State Machine, see ASM
Altera, 443
ALU (Arithmetic Logic Unit), 507:
  central, 48
  multiple, 403
alu181 portlist, 152
aluctrl, 508
always:
  block, 70
  with disable statement, 213
  ASM chart, 100, 113
  in infinite loop, 100
  with forever (synthesis), 72
AMD Mach445, see M4-128/64
Analog information, 1
Analysis, childish division software, 319
AND instructions, 309
and, 74, 169, 201, 448
AND/OR structure of CPLD, 443
`AND, 510
`ANDNA, 510
`ANDNB, 510
Arbitrary gotos, 194
Architecture, 20, 568
  computer, see programmer's model
  division machine, 154
  instruction set, 573
  memory hierarchy, 344
  methodical versus central ALU, 48
  multi-cycle, 238
  pipelined, 241
  pipelined PDP-8, 374
  Princeton versus Harvard, 379
  quadratic evaluator, 235
  single cycle, 235
Arithmetic Logic Unit, see ALU
Arithmetic operations, 70, 510
ARM (Advanced RISC Microprocessor), 378
  branch instruction, 383
  compared to PDP-8, 383
  instruction set, 561
  program status register, 384, see also psr
  resources and website, 560, 563
  macros used in, 391
  multi-cycle, 388
  pipelined, 400
  superscalar Verilog, 417
  Thumb, 381
Arnold, Mark G., 275, 582
ASM (Algorithmic State Machine), 4, 7, see also finite state machine
  behavioral PDP-8, 303
    memory as separate actor, 314
  behavioral Verilog, 138-149, 186-188
  chart, 7
    decisions in, 12
  commands, 9
buf, 74, 169
bufif0, 547
bufif1, 546
Building block:
  devices, 150
  combinational, 491
  sequential logic, 525
Burning, 518
Bus, 493, 543:
  driver, 547
  timing diagrams, 528
  unidirectional, 496:
    versus bidirectional, 281, 493
  bidirectional, 496, 551
  broadcasting with, 496
'bx, 77-80
'bz, 77-80, 221, 229, 545

C
C, C++, 1, 4, 6
  childish division program, 23, 316, 424, 429
Cache:
  consistency, 344
  hit, 340
  instruction, 437
  memory, 337
  miss, 339
  size, 351
  test program, 339
  write-back, 344-345
  write-through, 344-345
Cadence Design Systems, 4, 65
Cambridge University, 279
car function, 457
Carry out signal (cout), 500, 510-511
case:
  adder, 116, 457
  controller, 164-165
  full, 459
  parallel, 459
  statement, 69, 458
casex, 569
casez, 569
Central ALU, 569
  architecture, 48
Chart, ASM, 7
Childish division:
  algorithm, 22, 314, 368
  ARM, 424, 426:
    conditional instructions, 428
  effect of cache size on, 351
  implementations, comparison, 318, 371, 431, 481
  Mealy, 182, 184-185
  Moore, 26, 30, 34, 39
  PDP-8, 317, 369
  program:
    C, 23, 316, 423, 428
    machine language, 318, 369, 424, 426, 429
    Verilog, 134, 143-148, 186
Chip, 3, 539
Circuit diagram, 53, 170, 253, 539
CISC (Complex Instruction Set Computer):
  processors, 561
  versus RISC, 377
CLA, 489
Clair, C. R., 59
CLL, 308, 489
Clock:
  cycle, 7, 222, 527:
    multi-cycle, 224, 240
    pipelined, 226, 244
    single-cycle, 237
  frequency, 199
  period, see clock cycle
CLPD, 558
CMA, 489
CML, 308, 489
CMOS, 495
Code coverage, 419
Colossus (computer), 3, 277, 287
Combinational:
  adder, synthesis of, 454
  logic, 491:
Decision:
  in ASM charts, 12
  time control within, 191
  translated as one bit wide demux, 250
Declaration, see also variable:
  event, 212
  function, 114
  inout, 551
  input, 118
  integer, 67
  output, 119
  real, 115
  reg, 67, 119
  task, 110
  tri, 550
  triand, 578
  trior, 578
  trireg, 578
  variable, 67
  wand, 579
  wire, 67, 118-119, 494, 579
  wor, 579
Decoder, 515
Decoding instructions, 297
`DECREMENT, 511
default, 69, 116, 459
`define, 73
defparam, 570
Delay:
  inertial, 566
  line, 288
  propagation, see propagation delay
    minimum/typical/maximum, 207
    rising/falling, 207
  transport, 566
Delayed assignment, 12
Demoboard (Vantis), see M4-128/64
Demultiplexer, see demux
Demux (demultiplexer), 513:
  in memory, 284
  misuse of, 514
  translated from a decision, 250
depend function, 417
Dependency:
  data, 359
  examples, 404
  software, 26
Design:
  automation, 22, 438
  flow, synthesis, 439
  hierarchical, 52
Deterministic access time, 282
Devices, programmable, 518
Diagram:
  block and circuit, 54, 539
  bus timing, 528
  timing, 526
Diamond, 7, 12, 13
`DIFFERENCE, 511
Digital:
  building blocks, 6, 491, 525
  design, 3
  electronics, 533
Digital Equipment Corp., see DEC
DIMMs (Dual In-line Memory Modules), 289
Direct, 202:
  addressing mode, 487
  current page, 488
  page zero, 292, 475
Directive, synthesis, 459
disable:
  inside forever with bottom testing loop, 273
  statement, 213, 273
Discrete electronic devices, 120
Division:
  childish (see also: Childish division):
    algorithm, 23
    with conditional instructions, 428
  combinational, 507
  mixed two-state example, 271
  pure behavioral two state example, 270
  machine:
    architecture, 154
    controller, 157
    Mealy version, 184
    propagation delay in, 215
    two stage, pure structural state, 161
.doc, 442
`DOUB, 511
`DOUBINCR, 511
dp function, 391
DSL, 442
Dual Inline Package (DIP), 120
Dual rail design, 503
Dynamic memory, 286, 555

E
ea, see effective address
Eckert, John P., 278
EDSAC computer, 279
EDVAC, 278
EEPROM, 519
Effective address (ea), 292, 296, 303, 488
else, see if else
Enabled register, 451, 531:
  74xx377 (8 bit), 151
  74xx378 (6 bit), 151
enabled_register:
  portlist, 151
  synthesis of, 445
Encoder, 517
end, 69
endcase, 69
endfunction, 114
Endian notation, 67
endmodule, 118
endspecify, 208
endtask, 110
ENIAC, 278, 287
enter_new_state task, 112
`EQU, 510
event, 212
Event variables in Verilog, 212
Example:
  behavioral:
    instance, 121, 123
    Mealy machine, 178
    one hot Verilog, 270
  dependency, 404
  hierarchy, 57
  independent instructions, 357
  machine language program, 298
  mixed, 40-41, 45-46:
    Mealy machine, 179
    one hot Verilog, 271
  Moore command with Mealy <=, 266
  netlist propagation delay, 200
  one hot, 251
  pipeline, childish division, 368
  pure:
    behavioral, 22, 134, 143
    structural, 49
  quadratic evaluator:
    multi-cycle, 224
    pipelined, 226
    single cycle, 219
  real function, 115
  special purpose renaming, 407, 409
  structural instance, 123
  structural Mealy machine, 180
  task, 110
  traffic light:
    <=, 97, 98
    ASM chart, 8
    behavioral Mealy, 178
    bottom testing loop, 189
    computer, 1, 2, 97, 113
    mixed Mealy, 180
    structural Mealy, 180
    task, 112
Execution:
  parallel, 413
  pipeline, 413
  speculative, 406, 413
Exhaustive test, 66
exit_current_state, 580
Explicit style, 99, 162-165, 472:
  behavioral synthesis, 445
  switch debouncer, 472
  versus implicit, 99
Expression, 69
External:
  command output, 18
  data:
    input in ASM, 16
    output, 18
External status, 14, 571
  input, 16
Extra state for interface, 312

F
Factory, analogy to pipeline, 227
Ferranti, 279
Fetch state, 390
Fetch/execute, 1:
  ASM for, 294, 304
  behavioral, 290
  mixed, 324
  registers needed for, 292
Field Programmable Gate Array (FPGA), 443
Fighting outputs, 78
Filling pipeline, 231
Finite state machine, 8, see also ASM:
  ARM, 388, 400
  logic equation, 167-168
  Mealy, 182, 184-185
  Moore, 26, 30, 34, 39, 220, 224, 232
  netlist, 169
  PDP-8, 294, 302, 333
  Verilog:
    behavioral, 138-149, 186-188
    explicit, 472
    implicit, 258-265, 270-271, 464, 471
    mixed, 158-159
    structural, 162-165
Flattened netlist, 54
Flip flop:
  D type, 249, 530
  macrocell, 442
  one hot, 249
Flushing pipeline, 231
Font, 5
for, 69, 80, 85
forever, 72, 106
fork, 571
Forrester, Jay W., 288
Forwarding data, 360
Four-state division machine, 134
Four-valued logic, 77, 549
FPGA (Field Programmable Gate Array), 443
Friendly user, 24, 141
Full:
  adder, 54
  case, 459
function, 109:
  car, 458
  combinational logic, 115
  condx, 390
  depend, 417
  dp, 391
  state_gen, 163
  syntax, 114

G
Gajski, Daniel D., 59, 247, 521, 541
Gate level modeling, advanced, 207
Gate:
  instantiation in Verilog, 75
  non-tristate, 544
  tristate, 545
General purpose computer, 1, 561:
  benchmarks, 320, 351, 371, 431-432, 481
  bit serial, 476
  history, 277
  PDP-8, 485
  pipelined, 354
  RISC and CISC, 475
  structure, 279
  superscalar, 411
Glitch, 205
Goto, arbitrary, 194
  PDP-8:
    architecture, 374
    illustration, 368
  quadratic evaluator, 226
  register, 241
  single cycle, multi-cycle compared, 217
  skip instructions, 365
  stage, 241
Place and route, 443, 494
Planetary analogy, 492
PLD, Complex, 520
PLDesigner-XL, 442, 558:
  modules supplied by, 450
  technology mapping, 448
PLSynthesizer, 442:
  modules supplied by, 447
  using VITO code, 466
`PLUS, 511
`POPA, 403, 416
`POPB, 416
Port:
  by name, 123, 447
  by position, 447
  external, 15
  inout, 119, 442
  input, 118, 442
  internal, 15
  multiple, register file, 403
  output, 119, 442
  versus pin, 119
Portlist:
  alu181, 152
  comparator, 153
  counter_register, 151
  enabled_register, 151
  mux2, 153
posedge, 88, 90, see also @(posedge sysclk)
Post synthesis simulation, 170
`PRD, 403
Present state, 8, 50, 161-165:
  resetting of, 172
Primitive logical operations, 509
Princeton versus Harvard architecture, 379
Princeton, Institute for Advanced Studies, 279
Problem with <= for RTN simulation, 96
Procedural assignment:
  blocking, 10
  non-blocking, 95
Program:
  cache test, 339
  childish division, 23, 316-319, 424, 426, 429
  counter, see pc and R15
  R15 test, 422
  status register, 384, see also psr
Programmable devices, 518
Programmer's model, 293, 382, 561
Programming, object oriented, 127
Propagation delay, 199:
  abstracting of, 209
  division machine, 215
  inadequate models of, 209
  netlist, 200
  pipelined machine, 244
Prosser, Franklin P., 60, 276, 352, 521, 541
psr (program status register), 417:
  conditional assignment of, 391
Pure, 56:
  behavioral, 19, 22, 134, 143, 148, 270, 576
  structural, 576
    controller, 162
    example, 49, 51
    stage, division machine, 161
Push button, see User interface and Switch debouncer

Q
Q output, 530
Quadratic evaluator, see Example, quadratic evaluator
Shift register, 462, 536
Shifter, 505-506
Shuler, James D., 275, 558, 582
Sign extension macro, 393
Signal, see Command, Command signal
Silicon foundry, 54
SIMMs (Single In-line Memory Modules), 289
Simulation, 64, 81:
  post synthesis, 170, 440
  timing analysis, 204
  versus synthesis, 64
Singh, Rajvir, 131, 375
Single accumulator, one address instruction, 291
Single alternative, 103
Single cycle:
  adder, 460
  architecture, 235
  behavioral, 219
  multi-cycle and pipeline compared, 217
  quadratic evaluator, 219
Single pulsing and switch debouncing, 469
Single state Mealy ASMs, 196
Skip instructions, 310:
  pipeline, 365
SKP, 490
Slater, Robert, 352
SMA, 490
Smith, Douglas J., 131
SNA, 490
SNL, 490
Software:
  dependency, ASM, 26
  hardware tradeoff, 322, 371, 432, 481
Source code overview, 135
SPA, 490
Special purpose computer, 1, 277:
  renaming, 407, 409
specify block, 207
Speculative:
  execution, 406
  parallel and pipeline, 413
Speed and cost, 198, 496:
  adder, 501
  binary to unary decoder, 517
  comparator, 512
  demux, 513
  incrementor, 504
  multiplier, 507
  mux, 502
  ripple carry, 501, 504
SSEC, 278
State, 8, 306:
  see also implicit, explicit and finite state machine ASM
  fetch, 390
Statement, 69, 71:
  $display, 72, 420
  $dumpfile, 570
  $fclose, 571
  $fdisplay, 572
  $finish, 99
  $fopen, 571
  $fstrobe, 572
  $fwrite, 572
  $readmemb, 576
  $readmemh, 576
  $strobe, 566
  $stop, 94
  $write, 72
  behavioral Verilog, 68
  case, 69, 458-459
  casex, 569
  casez, 569
  disable, 213
  for, 69, 80, 85
  forever, 72, 106
  fork, 571
  if, 103, 108
  if else, 101, 258
  repeat, 576
  wait, 93, 136
  while, 72, 104, 463
  wire, 67, 118-119, 494, 579
  importance of, 141
  role of, 65
  state_gen, 165
  xor, 86
Test programs, 422
Testbench, see test code
Thomas, Donald E., 131
Three stage pipeline, ASM for, 396
Time:
  access, 282
  control, 83
    assignment with, 94
    within a decision, 191
  real, 81
`timescale, 448
Timing analysis:
  a priori worst case, 202
  simulation of, 204
Timing diagram, 526
Top down design, 19
Top level:
  module, 117
  structure of the machine, 313
Torres y Quevedo, Leonardo, 277
Traffic light example, see Example, traffic light
Transistor, 3, 543
Translating:
  algorithms into hardware, 3
  complex ASMs into Verilog, 188
  conditional commands into Verilog, 194
  goto-less ASMs to behavioral Verilog, 99
  if at the bottom of forever, 108
  Mealy ASM to Verilog, 186
  Moore ASM to one hot, 251
Transport delay, 566
tri, 550
triand, 578
trior, 578
trireg, 578
Tristate:
  as mux replacement, 548
  buffer, 119
  bus driver symbol, 548
  device, 543
  gate, 545:
    structural Verilog, 545
  uses, 548
Trivedi, Yatin, 131, 375
TTL logic family, 495, 533
Turing, Alan, 277
Turn off delay, 546
Two state division:
  implicit Verilog, 270
  machine:
    behavioral stage, 148
    mixed stage, 150, 271
    structural stage, 161
Twos complementor, 504
Typography, 5

U
Unary code, 463, 515, 517
Unconditional command statement, 178
Unidirectional Bus, 281, 493, 496
United States:
  Air Force, 288
  Department of Defense, 65
  Navy, 278
UNIVAC, 279
University of Manchester, England, 279, 336
Unknown value, 78
Unused inputs, 538
Up counter register, 533
Up/down counter, 535
User interface, 317:
  hardware, 25
  software, 317
User mode, 378

V
Vacuum tube, 2, 287, 495
Vantis, 442, 557:
  address and website, 557
  demoboard, see M4-128/64
Write-through cache, 344-345
Wynn-Williams, C. E., 2, 277, 533

X
Xilinx, 443
xor, 74, 85, 201, 208
`XOR, 510

Z
`ZERO, 510
Zuse, Konrad, 278, 287, 289