ECE 327 Slides VHDL Verilog Digital Hardware Design
Mark Aagaard 2011t1Winter University of Waterloo Dept of Electrical and Computer Engineering
Contents

I Lecture Notes

1 VHDL
  1.1 Introduction to VHDL
    1.1.1 Levels of Abstraction
    1.1.2 VHDL Origins and History
    1.1.3 Semantics
    1.1.4 Synthesis of a Simulation-Based Language
    1.1.5 Solution to Synthesis Sanity
    1.1.6 Standard Logic 1164
  1.2 Comparison of VHDL to Other Hardware Description Languages
  1.3 Overview of Syntax
    1.3.1 Syntactic Categories
    1.3.2 Library Units
    1.3.3 Entities and Architecture
    1.3.4 Concurrent Statements
    1.3.5 Component Declaration and Instantiations
    1.3.6 Processes
    1.3.7 Sequential Statements
    1.3.8 A Few More Miscellaneous VHDL Features
  1.4 Concurrent vs Sequential Statements
    1.4.1 Concurrent Assignment vs Process
    1.4.2 Conditional Assignment vs If Statements
    1.4.3 Selected Assignment vs Case Statement
    1.4.4 Coding Style
  1.5 Overview of Processes
    1.5.1 Combinational Process vs Clocked Process
    1.5.2 Latch Inference
  1.6 Details of Process Execution
    1.6.1 Simple Simulation
    1.6.2 Temporal Granularities of Simulation
    1.6.3 Intuition Behind Delta-Cycle Simulation
    1.6.4 Definitions and Algorithm
      1.6.4.1 Process Modes
      1.6.4.2 Simulation Algorithm
      1.6.4.3 Delta-Cycle Definitions
    1.6.5 Example 1: Process Execution (Bamboozle)
    1.6.6 Example 2: Process Execution (Flummox)
    1.6.7 Ex: Need for Provisional Asn
    1.6.8 Delta-Cycle Simulations of Flip-Flops
  1.7 Register-Transfer-Level Simulation
    1.7.1 Overview
    1.7.2 Technique for Register-Transfer Level Simulation
    1.7.3 Examples of RTL Simulation
      1.7.3.1 RTL Simulation Example 1
  1.8 VHDL and Hardware Building Blocks
    1.8.1 Basic Building Blocks
    1.8.2 Deprecated Building Blocks for RTL
    1.8.3 Hardware and Code for Flops
      1.8.3.1 Flops with Waits and Ifs
      1.8.3.2 Flops with Synchronous Reset
      1.8.3.3 Flop with Chip-Enable and Mux on Input
      1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
    1.8.4 An Example Sequential Circuit
  1.9 Arrays and Vectors
  1.10 Arithmetic
    1.10.1 Arithmetic Packages
    1.10.2 Shift and Rotate Operations
    1.10.3 Overloading of Arithmetic
    1.10.4 Different Widths and Arithmetic
    1.10.5 Overloading of Comparisons
    1.10.6 Different Widths and Comparisons
    1.10.7 Type Conversion
  1.11 Synthesizable vs Non-Synthesizable Code
    1.11.1 Unsynthesizable Code
      1.11.1.1 Initial Values
      1.11.1.2 Wait For
      1.11.1.3 Different Wait Conditions
      1.11.1.4 Multiple if rising_edge in Process
      1.11.1.5 if rising_edge and wait in Same Process
      1.11.1.6 if rising_edge with else Clause
      1.11.1.7 if rising_edge Inside a for Loop
      1.11.1.8 wait Inside of a for loop
  1.12 Synthesizable VHDL Coding Guidelines
2 RTL Design with VHDL
  2.1 Prelude to Chapter
  2.2 FPGA Background and Coding Guidelines
    2.2.1 Generic FPGA Hardware
      2.2.1.1 Generic FPGA Cell
    2.2.2 Area Estimation
      2.2.2.1 Interconnect for Generic FPGA
      2.2.2.2 Clocks for Generic FPGAs
      2.2.2.3 Special Circuitry in FPGAs
    2.2.3 Generic-FPGA Coding Guidelines
  2.3 Design Flow
  2.4 Algorithms and High-Level Models
  2.5 Finite State Machines in VHDL
    2.5.1 Introduction to State-Machine Design
      2.5.1.1 Mealy vs Moore State Machines
      2.5.1.2 Introduction to State Machines and VHDL
      2.5.1.3 Explicit vs Implicit State Machines
    2.5.2 Implementing a Simple Moore Machine
      2.5.2.1 Implicit Moore State Machine
      2.5.2.2 Explicit Moore with Flopped Output
      2.5.2.3 Explicit Moore with Combinational Outputs
      2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
      2.5.2.5 E-C+N Moore with Comb Proc
    2.5.3 Implementing a Simple Mealy Machine
    2.5.4 Reset
    2.5.5 State Encoding
  2.6 Dataflow Diagrams
    2.6.1 Dataflow Diagrams Overview
    2.6.2 Dataflow Diagrams, Hardware, and Behaviour
    2.6.3 Dataflow Diagram Execution
    2.6.4 Performance Estimation
    2.6.5 Area Estimation
    2.6.6 Design Analysis
    2.6.7 Area / Performance Tradeoffs
  2.7 Design Example: Massey
  2.8 Design Example: Vanier
    2.8.1 Requirements
    2.8.2 Algorithm
    2.8.3 Initial Dataflow Diagram
    2.8.4 Reschedule to Meet Requirements
    2.8.5 Optimize Resources
    2.8.6 Assign Names to Registered Values
    2.8.7 Input/Output Allocation
    2.8.8 Tangent: Combinational Outputs
    2.8.9 Register Allocation
    2.8.10 Datapath Allocation
    2.8.11 Hardware Block Diagram and State Machine
      2.8.11.1 Control for Registers
      2.8.11.2 Control for Datapath Components
      2.8.11.3 Control for State
      2.8.11.4 Complete State Machine Table
    2.8.12 VHDL Code with Explicit State Machine
    2.8.13 Peephole Optimizations
    2.8.14 Notes and Observations
  2.9 Pipelining
    2.9.1 Introduction to Pipelining
    2.9.2 Partially Pipelined
    2.9.3 Terminology
  2.10 Design Example: Pipelined Massey
  2.11 Memory Arrays and RTL Design
    2.11.1 Memory Operations
    2.11.2 Memory Arrays in VHDL
    2.11.3 Data Dependencies
    2.11.4 Memory and Dataflow Diagrams
    2.11.5 Ex: Mem Array and Dataflow Diagram
  2.12 Input / Output Protocols
  2.13 Example: Moving Average
    2.13.1 Requirements and Environmental Assumptions
    2.13.2 Algorithm
    2.13.3 Pseudocode and Dataflow Diagrams
    2.13.4 Control Tables and State Machine
    2.13.5 VHDL Code
3 Performance Analysis and Optimization
  3.1 Introduction
  3.2 Defining Performance
  3.3 Comparing Performance
    3.3.1 General Equations
    3.3.2 Example: Performance of Printers
  3.4 Clock Speed, CPI, Program Length, and Performance
    3.4.1 Mathematics
    3.4.2 Example: CISC vs RISC and CPI
    3.4.3 Effect of Instruction Set on Performance
    3.4.4 Effect of Time to Market on Relative Performance
    3.4.5 Summary of Equations
  3.5 Performance Analysis and Dataflow Diagrams
    3.5.1 Dataflow Diagrams, CPI, and Clock Speed
    3.5.2 Examples of Dataflow Diagrams for Two Instructions
      3.5.2.1 Scheduling of Operations for Different Clock Periods
      3.5.2.2 Performance Computation for Different Clock Periods
      3.5.2.3 Example: Two Instructions Taking Similar Time
      3.5.2.4 Example: Same Total Time, Different Order for A
    3.5.3 Example: From Algorithm to Optimized Dataflow
  3.6 General Optimizations
    3.6.1 Strength Reduction
      3.6.1.1 Arithmetic Strength Reduction
      3.6.1.2 Boolean Strength Reduction
    3.6.2 Replication and Sharing
      3.6.2.1 Mux-Pushing
      3.6.2.2 Common Subexpression Elimination
      3.6.2.3 Computation Replication
    3.6.3 Arithmetic
  3.7 Retiming
4 Functional Verification
  4.1 Overview
    4.1.1 Terminology: Validation / Verification / Testing
    4.1.2 The Difficulty of Designing Correct Chips
      4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
      4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
  4.2 Test Cases and Coverage
    4.2.1 Coverage
    4.2.2 Floating Point Divider Example
  4.3 Testbenches
    4.3.1 Overview of Test Benches
    4.3.2 Reference Model Style Testbench
    4.3.3 Relational Style Testbench
    4.3.4 Coding Structure of a Testbench
    4.3.5 Datapath vs Control
    4.3.6 Verification Tips
  4.4 Functional Verification for Datapath Circuits
    4.4.1 A Spec-Less Testbench
    4.4.2 Use an Array for Test Vectors
    4.4.3 Build Spec into Stimulus
    4.4.4 Have Separate Specification Entity
    4.4.5 Generate Test Vectors Automatically
    4.4.6 Relational Specification
  4.5 Functional Verification of Control Circuits
    4.5.1 Overview of Queues in Hardware
    4.5.2 VHDL Coding
      4.5.2.1 Package
      4.5.2.2 Other VHDL Coding
    4.5.3 Code Structure for Verification
    4.5.4 Instrumentation Code
    4.5.5 Assertions
    4.5.6 VHDL Coding Tips
    4.5.7 Queue Specification
    4.5.8 Queue Testbench
  4.6 Example: Microwave Oven
5 Timing Analysis
  5.1 Delays and Definitions
    5.1.1 Background Definitions
    5.1.2 Clock-Related Timing Definitions
      5.1.2.1 Clock Skew
      5.1.2.2 Clock Latency
      5.1.2.3 Clock Jitter
    5.1.3 Storage-Related Timing Definitions
      5.1.3.1 Flops and Latches
    5.1.4 Propagation Delays
    5.1.5 Timing Constraints
      5.1.5.1 Minimum Clock Period
      5.1.5.2 Hold Constraint
      5.1.5.3 Example Timing Violations
  5.2 Timing Analysis of Latches and Flip Flops
    5.2.1 Simple Multiplexer Latch
      5.2.1.1 Structure and Behaviour of Multiplexer Latch
      5.2.1.2 Strategy for Timing Analysis of Storage Devices
      5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
      5.2.1.4 Setup Timing of a Multiplexer Latch
      5.2.1.5 Hold Time of a Multiplexer Latch
      5.2.1.6 Example of a Bad Latch
  5.3 Critical Paths and False Paths
    5.3.1 Introduction to Critical and False Paths
      5.3.1.1 Example of Critical Path in Full Adder
      5.3.1.2 Preliminaries for Critical Paths
      5.3.1.3 Longest Path and Critical Path
    5.3.2 Longest Path
    5.3.3 Detecting a False Path
      5.3.3.1 Preliminaries
      5.3.3.2 Almost-Correct Algorithm to Detect a False Path
      5.3.3.3 Examples of Detecting False Paths
    5.3.4 Finding the Next Candidate Path
      5.3.4.1 Algorithm to Find Next Candidate Path
      5.3.4.2 Examples of Finding Next Candidate Path
    5.3.5 Correct Algorithm to Find Critical Path
      5.3.5.1 Rules for Late Side Inputs
      5.3.5.2 Monotone Speedup
      5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
      5.3.5.4 Complete Algorithm
      5.3.5.5 Complete Examples
    5.3.6 Further Extensions to Critical Path Analysis
    5.3.7 Increasing the Accuracy of Critical Path Analysis
  5.4 Elmore Timing Model
    5.4.1 RC-Networks for Timing Analysis
    5.4.2 Derivation of Analog Timing Model
      5.4.2.1 Example Derivation: Equation for Voltage at Node 3
      5.4.2.2 General Derivation
    5.4.3 Elmore Timing Model
    5.4.4 Examples of Using Elmore Delay
      5.4.4.1 Interconnect with Single Fanout
      5.4.4.2 Interconnect with Multiple Gates in Fanout
  5.5 Practical Usage of Timing Analysis
    5.5.1 Speed Binning
      5.5.1.1 FPGAs, Interconnect, and Synthesis
    5.5.2 Worst Case Timing
      5.5.2.1 Fanout delay
      5.5.2.2 Derating Factors
6 Power Analysis and Power-Aware Design
  6.1 Overview
    6.1.1 Importance of Power and Energy
    6.1.2 Industrial Names and Products
    6.1.3 Power vs Energy
    6.1.4 Batteries, Power and Energy
      6.1.4.1 Do Batteries Store Energy or Power?
      6.1.4.2 Battery Life and Efficiency
      6.1.4.3 Battery Life and Power
  6.2 Power Equations
    6.2.1 Switching Power
    6.2.2 Short-Circuited Power
    6.2.3 Leakage Power
    6.2.4 Glossary
    6.2.5 Note on Power Equations
  6.3 Overview of Power Reduction Techniques
  6.4 Voltage Reduction for Power Reduction
  6.5 Data Encoding for Power Reduction
    6.5.1 How Data Encoding Can Reduce Power
    6.5.2 Example Problem: Sixteen Pulser
      6.5.2.1 Problem Statement
      6.5.2.2 Additional Information
      6.5.2.3 Answer
  6.6 Clock Gating
    6.6.1 Introduction to Clock Gating
    6.6.2 Implementing Clock Gating
    6.6.3 Design Process
    6.6.4 Effectiveness of Clock Gating
    6.6.5 Example: Reduced Activity Factor with Clock Gating
    6.6.6 Clock Gating with Valid-Bit Protocol
      6.6.6.1 Valid-Bit Protocol
      6.6.6.2 How Many Clock Cycles for Module?
      6.6.6.3 Adding Clock-Gating Circuitry
    6.6.7 Example: Pipelined Circuit with Clock-Gating
7 Fault Testing and Testability
  7.1 Faults and Testing
    7.1.1 Overview of Faults and Testing
      7.1.1.1 Faults
      7.1.1.2 Causes of Faults
      7.1.1.3 Testing
      7.1.1.4 Burn In
      7.1.1.5 Bin Sorting
      7.1.1.6 Testing Techniques
      7.1.1.7 Design for Testability (DFT)
    7.1.2 Example Problem: Economics of Testing
    7.1.3 Physical Faults
      7.1.3.1 Types of Physical Faults
      7.1.3.2 Locations of Faults
      7.1.3.3 Layout Affects Locations
      7.1.3.4 Naming Fault Locations
    7.1.4 Detecting a Fault
      7.1.4.1 Which Test Vectors will Detect a Fault?
    7.1.5 Mathematical Models of Faults
      7.1.5.1 Single Stuck-At Fault Model
    7.1.6 Generate Test Vector to Find a Mathematical Fault
      7.1.6.1 Algorithm
      7.1.6.2 Example of Finding a Test Vector
    7.1.7 Undetectable Faults
      7.1.7.1 Redundant Circuitry
      7.1.7.2 Curious Circuitry and Fault Detection
  7.2 Test Generation
    7.2.1 A Small Example
    7.2.2 Choosing Test Vectors
      7.2.2.1 Fault Domination
      7.2.2.2 Fault Equivalence
      7.2.2.3 Gate Collapsing
      7.2.2.4 Node Collapsing
      7.2.2.5 Fault Collapsing Summary
    7.2.3 Fault Coverage
    7.2.4 Test Vector Generation and Fault Detection
    7.2.5 Generate Test Vectors for 100% Coverage
      7.2.5.1 Collapse the Faults
      7.2.5.2 Check for Fault Domination
      7.2.5.3 Required Test Vectors
      7.2.5.4 Faults Not Covered by Required Test Vectors
      7.2.5.5 Order to Run Test Vectors
      7.2.5.6 Summary of Technique to Find and Order Test Vectors
    7.2.6 One Fault Hiding Another
  7.3 Scan Testing in General
    7.3.1 Structure and Behaviour of Scan Testing
    7.3.2 Scan Chains
      7.3.2.1 Circuitry in Normal and Scan Mode
      7.3.2.2 Scan in Operation
      7.3.2.3 Scan in Operation with Example Circuit
    7.3.3 Summary of Scan Testing
    7.3.4 Time to Test a Chip
      7.3.4.1 Example: Time to Test a Chip
  7.4 Boundary Scan and JTAG
    7.4.1 Scan Instructions
  7.5 Built In Self Test
    7.5.1 Block Diagram
      7.5.1.1 Components
      7.5.1.2 Linear Feedback Shift Register (LFSR)
      7.5.1.3 Maximal-Length LFSR
    7.5.2 Test Generator
    7.5.3 Signature Analyzer
    7.5.4 Result Checker
    7.5.5 Arithmetic over Binary Fields
    7.5.6 Shift Registers and Characteristic Polynomials
      7.5.6.1 Circuit Multiplication
    7.5.7 Bit Streams and Characteristic Polynomials
    7.5.8 Division
    7.5.9 Signature Analysis: Math and Circuits
  7.6 Scan vs Self Test
8 Review
  8.1 Overview of the Term
  8.2 VHDL
    8.2.1 VHDL Topics
    8.2.2 VHDL Example Problems
  8.3 RTL Design Techniques
    8.3.1 Design Topics
    8.3.2 Design Example Problems
  8.4 Functional Verification
    8.4.1 Verification Topics
    8.4.2 Verification Example Problems
  8.5 Performance Analysis and Optimization
    8.5.1 Performance Topics
    8.5.2 Performance Example Problems
  8.6 Timing Analysis
    8.6.1 Timing Topics
    8.6.2 Timing Example Problems
  8.7 Power
    8.7.1 Power Topics
    8.7.2 Power Example Problems
  8.8 Testing
    8.8.1 Testing Topics
    8.8.2 Testing Example Problems
  8.9 Formulas to be Given on Final Exam
CHAPTER 1. VHDL
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.
Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.
Register transfer level: Hardware is modeled as assignments to registers and combinational signals. Basic unit of time is one clock cycle.
Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. Languages: VHDL, SystemC, or SystemVerilog.
Electronic-system level: Looks at an entire electronic system, with both hardware and software.
1.1.2 VHDL Origins and History
VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit
"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)
1.1.3 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulation of the code "c <= a AND b;" produces waveforms for signals a, b, and c.]
But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit. Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).
[Figure: synthesis of the code "c <= a AND b;" produces an AND gate with inputs a and b and output c.]
Synthesis
Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.
[Figure: synthesis of the code "c <= a AND b;" produces an AND gate with inputs a and b and output c.]
CAD Tools
CAD tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.
NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.
Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesis of the code "c <= a AND b;" produces an AND gate with inputs a and b and output c.]
Synthesis vs Simulation
The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.
[Figure: the code "c <= a AND b;" can be synthesized into circuits with different structure. Simulation of each circuit produces the same behaviour on signals a, b, and c.]
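The point that VHDL semantics fix behaviour but not structure can be sketched with two architectures for one entity. This is only an illustration: the entity name and2 and the architecture names direct and demorgan are hypothetical and do not come from the lecture. Both architectures produce the same output for every combination of 0 and 1 on the inputs, yet they describe different gates. A synthesis tool is free to implement either description with whatever structure it chooses.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only for this illustration.
entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

-- Direct description: reads as a single AND gate.
architecture direct of and2 is
begin
  c <= a AND b;
end direct;

-- De Morgan form: two inverters feeding an OR gate, then an inverter.
-- Same behaviour as "direct", different described structure.
architecture demorgan of and2 is
begin
  c <= NOT ((NOT a) OR (NOT b));
end demorgan;
```

Simulating either architecture gives identical waveforms on c, which is why simulation alone cannot tell you which structure a synthesis tool produced.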
1.1.5 Solution to Synthesis Sanity
Pick a high-quality synthesis tool and study its documentation thoroughly.
Learn the idioms of the tool.
Different VHDL code with the same behaviour can result in very different circuits.
Be careful if you have to port VHDL code from one tool to another.
KISS: Keep It Simple, Stupid.

VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
Follow the coding guidelines and examples from lecture.
As you write VHDL, think about the hardware you expect to get.
Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc).
13
1.1.6
Standard Logic 1164
std_logic_1164: IEEE standard for signal values in VHDL.
'U'  uninitialized
'X'  strong unknown
'0'  strong 0
'1'  strong 1
'Z'  high impedance
'W'  weak unknown
'L'  weak 0
'H'  weak 1
'-'  don't care
The most common values are: 'U', 'X', '0', '1'. If you see 'X' in a simulation, it usually means that there is a mistake in your code.
1.3 1.3.1
1.3.2
Library Units
This section reserved for your reading pleasure
1.3.3
Entities and Architecture
[Figure: an entity and its architecture.]
Entity
library ieee;
use ieee.std_logic_1164.all;

entity and_or is
  port (
    a, b, c : in  std_logic;
    z       : out std_logic
  );
end and_or;
Example of an entity
Architecture
architecture main of and_or is
  signal x : std_logic;
begin
  x <= a AND b;
  z <= x OR (a AND c);
end main;
Example of architecture
1.3.4
Concurrent Statements
architecture main of bowser is
begin
  x1 <= a AND b;
  x2 <= NOT x1;
  z  <= NOT x2;
end main;

architecture main of bowser is
begin
  z  <= NOT x2;
  x2 <= NOT x1;
  x1 <= a AND b;
end main;
[Figure: circuit with inputs a and b and internal signals x1 and x2.]
1.3.6
Processes
- Processes are used to describe complex and potentially unsynthesizable behaviour
- A process is a concurrent statement (Section 1.3.4).
- The body of a process contains sequential statements (Section 1.3.7)
- Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
Sensitivity List
The sensitivity list contains the signals that are read in the process. A process is executed when a signal in its sensitivity list changes value. An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.
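The sensitivity-list guideline can be sketched as follows. This is a minimal illustration, not code from the lecture: the entity name sens_demo and its ports are hypothetical. The first process is combinational and lists every signal it reads; the second is a flip-flop, so only the clock appears in its sensitivity list.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only to illustrate sensitivity lists.
entity sens_demo is
  port (
    clk, a, b, sel : in  std_logic;
    y, q           : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Combinational process: every signal that is read (a, b, sel)
  -- appears in the sensitivity list.
  comb : process (a, b, sel)
  begin
    if sel = '1' then
      y <= a;
    else
      y <= b;
    end if;
  end process;

  -- Flip-flop: the process reads a, but only clk is needed in the
  -- sensitivity list because q can change only on a clock edge.
  flop : process (clk)
  begin
    if rising_edge(clk) then
      q <= a;
    end if;
  end process;
end main;
```

If sel were left out of the first sensitivity list, a simulator would not re-run the process when sel changes, while a synthesis tool would most likely still build a multiplexer, which is the kind of simulation/synthesis mismatch the guideline is meant to avoid.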
1.3.7
Sequential Statements
Used inside processes and functions.
wait               wait until . . . ;
signal assignment  . . . <= . . . ;
if-then-else       if . . . then . . . elsif . . . end if;
case               case . . . is when . . . | . . . => . . . ; when . . . => . . . ; end case;
loop               loop . . . end loop;
while loop         while . . . loop . . . end loop;
for loop           for . . . in . . . loop . . . end loop;
next               next . . . ;
1.4
Concurrent vs Sequential Statements
All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.
1.4.1
Concurrent Assignment vs Process
The two code fragments below have identical behaviour:

architecture main of tiny is
begin
  b <= a;
end main;

architecture main of tiny is
begin
  process (a)
  begin
    b <= a;
  end process;
end main;
1.4.4
Coding Style
Code that's easy to write with sequential statements, but difficult with concurrent:
case <expr> is
  when <choice1> =>
    if <cond> then
      o <= <expr1>;
    else
      o <= <expr2>;
    end if;
  when <choice2> =>
    ...
end case;
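A concrete instance of the template above might look like the sketch below. The entity style_demo, the signal names, and the choice of operations are all hypothetical; they are only there to make the case-with-nested-if pattern compile.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a small selector written with sequential statements.
entity style_demo is
  port (
    op      : in  std_logic_vector(1 downto 0);
    c, a, b : in  std_logic;
    o       : out std_logic
  );
end style_demo;

architecture main of style_demo is
begin
  -- Combinational process: all signals that are read are in the
  -- sensitivity list, and o is assigned on every path.
  process (op, c, a, b)
  begin
    case op is
      when "00" =>
        if c = '1' then
          o <= a;
        else
          o <= b;
        end if;
      when "01" =>
        o <= a AND b;
      when others =>
        o <= '0';
    end case;
  end process;
end main;
```

Writing the same behaviour with only concurrent statements would need a selected assignment whose first choice contains a nested conditional expression, which is harder to read.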
1.5
Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Process Semantics
- VHDL mimics hardware
- Hardware (gates) execute in parallel
- Processes execute in parallel with each other
- All possible orders of executing processes must produce the same simulation results (waveforms)
- If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must produce the same waveforms
architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;
[Figure: three possible execution sequences that interleave A1, A2, A3 with B1, B2.]
Combinational process:
- Executing the process takes part of one clock cycle
- Target signals are outputs of combinational circuitry
- A combinational process must have a sensitivity list
- A combinational process must not have any wait statements
- A combinational process must not have any rising_edges, or falling_edges
Clocked process:
- Executing the process takes one (or more) clock cycles
- Target signals are outputs of flops
- Process contains one or more wait or if rising_edge statements
- Hardware contains combinational circuitry and flip flops
Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 327 we'll refer to synthesizable processes as either combinational or clocked.
1.5.2
Latch Inference
The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.
process (a, b, c)
begin
  if (a = '1') then
    z1 <= b;
    z2 <= b;
  else
    z1 <= c;
  end if;
end process;
[Figure: waveforms for a, b, c, z1, and z2.]
When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value. If you want a latch or a flip-flop for the signal, then latch inference is good. If you want combinational circuitry, then latch inference is bad.
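One common way to avoid an unwanted latch is to give every target signal a default assignment at the top of the process. The sketch below reworks the z1/z2 example under that assumption; the entity name no_latch and the default value chosen for z2 (c) are illustrative assumptions, not part of the original example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity wrapping the z1/z2 example.
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment (assumed value): z2 now receives a value on
    -- every pass through the process, so nothing has to be remembered.
    z2 <= c;
    if a = '1' then
      z1 <= b;
      z2 <= b;   -- overrides the default on this path
    else
      z1 <= c;
    end if;
  end process;
end main;
```

Because both z1 and z2 are assigned on every pass, the process describes purely combinational circuitry.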
1.6
Details of Process Execution
1.6.1
Simple Simulation
[Figure: timing diagram of signals b, c, d, and e with events at 10 ns, 12 ns, and 15 ns.]
1.6.2
Temporal Granularities of Simulation
1.6.3
In zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through combinational circuitry.
Two fundamental rules for zero-delay simulation:
1. events appear to propagate through combinational circuitry instantaneously.
2. all of the gates appear to operate in parallel
1.6.4 1.6.4.1
[Figure: state diagram of the three process modes (suspended, postponed, active) with transitions resume, activate, and suspend.]
Suspended
- Nothing to currently execute
- A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement
Postponed
- Wants to execute, but not currently active
- A process stays postponed until the simulator chooses it from the pool of postponed processes
Active
- Currently executing
- A process stays active until it hits a wait statement or sensitivity list, at which point it suspends
1.6.4.2
Simulation Algorithm
The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard. This algorithm does not support delayed assignments; for example: (a <= b after 2 ns;).
A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.
The Algorithm
Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).
1. While there are postponed processes:
   (a) Pick one or more postponed processes to execute (become active).
   (b) Provisionally execute assignments (new values become visible at step 3)
   (c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
2. Each process checks its sensitivity list or wait condition to see if it should resume
3. Update signals with their provisional values
4. If no postponed processes, then increment simulation time to next event.
1.6.4.3
Delta-Cycle Definitions
Definition simulation step: Executing one sequential assignment or process mode change.
Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.
Definition delta cycle: A simulation cycle that does not advance simulation time.
Definition simulation round: A sequence of simulation cycles that all have the same simulation time.
proc1: process (a, b, c) begin
  c <= a AND b;
  d <= NOT c;
end process;
proc2: process (b, d) begin
  e <= b AND d;
end process;
proc3: process begin
  a <= '1';
  b <= '0';
  wait for 3 ns;
  b <= '1';
  wait for 99 ns;
end process;
[Figure: delta-cycle simulation table (sim round, sim cycle, delta cycle) for proc1, proc2, and proc3, and waveforms for a, b, c, d, and e from 0 ns to 102 ns; all signals start at 'U'.]
Question: What are the different granularities of time that occur when doing delta-cycle simulation?
Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?
1.6.7
Circuit to illustrate need for provisional assignments
architecture main of swindle is
begin
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
end main;
1. Start with all signals at '0'.
2. Simultaneously change to a = '1' and b = '1'.
[Figure: delta-cycle simulation tables for p_c, p_d, a, b, c, and d, starting with all signals at 0, for different orders of executing the processes.]
1.6.8
p_a : process begin
  a <= '0';
  wait for 15 ns;
  a <= '1';
  wait for 20 ns;
end process;
[Figure: delta-cycle simulation table for processes p_a, p_clk, and flop, and waveforms for a, clk, and q from 0 ns to 35 ns.]
Back-to-Back Flops
p_a : process begin
  a <= '0';
  wait for 15 ns;
  a <= '1';
  wait for 20 ns;
end process;
p_clk : process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;
flops : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
    q2 <= q1;
  end if;
end process;
[Figure: delta-cycle simulation table for processes p_a, p_clk, and flops with signals a, clk, q1, and q2 at 10 ns, 15 ns, 20 ns, 30 ns, and 35 ns, and waveforms from 0 ns to 35 ns.]
architecture mathilde of sauve is
  signal clk, a, b : std_logic;
begin
  process begin
    clk <= '1';
    wait for 10 ns;
    clk <= '0';
    wait for 10 ns;
  end process;
  process begin
    wait for 20 ns;
    a1 <= '1';
  end process;
  process begin
    wait until rising_edge(clk);
    a1 <= '1';
  end process;
  process begin
    wait until rising_edge( clk );
    b1 <= a1;

flop : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
  end if;
end process;
[Figure: delta-cycle simulation table for processes env, flop1, and flop2 with signals a, clk, and q1, and the corresponding waveforms.]
Warning
Note: Testbench signals
For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.
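A testbench that follows this warning might be sketched as below. The entity name tb_demo and the specific delays are assumptions for illustration: the clock changes at multiples of 10 ns, and after its initial value the stimulus a changes only at 5 ns, 25 ns, 45 ns, and so on, so it never changes at the same time as the clock.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: stimulus is offset from the clock edges.
entity tb_demo is
end tb_demo;

architecture main of tb_demo is
  signal clk, a, q1 : std_logic;
begin
  -- Clock: rising edges at 10 ns, 30 ns, 50 ns, ...
  p_clk : process
  begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  -- Environment: after the initial value at 0 ns, a changes only at
  -- 5 ns, 25 ns, 45 ns, ..., which is never a clock edge.
  env : process
  begin
    a <= '0';
    wait for 5 ns;
    loop
      a <= '1';
      wait for 20 ns;
      a <= '0';
      wait for 20 ns;
    end loop;
  end process;

  -- Flop under test.
  flop : process (clk)
  begin
    if rising_edge(clk) then
      q1 <= a;
    end if;
  end process;
end main;
```

With this offset, the value that the flop samples at each rising edge does not depend on the order in which the simulator runs the processes.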
[Figure: three sets of waveforms for a, clk, and q1 from 0 ns to 60 ns, where a is the output of a timed process (testbench or environment); one is labelled POOR DESIGN and one GOOD DESIGN.]
1.7
Register-Transfer-Level Simulation
[Figure: delta-cycle simulation table for proc1, proc2, and proc3 and RTL-simulation waveforms for a, b, c, d, and e at 0 ns, 1 ns, 2 ns, 3 ns, and 102 ns.]
1.7.1
Overview
- Much simpler than delta cycle
- Columns are real time: clock cycles, nanoseconds, etc.
- Can simulate both synthesizable and unsynthesizable code
- Cannot simulate combinational loops
- Same values as delta-cycle at end of simulation round
Question: In this code, what value should b have at 10 ns?
process begin
  a <= '0';
  wait for 10 ns;
  a <= '1';
  ...
end process;
process begin
  b <= '0';
  wait for 10 ns;
  b <= a;
  ...
end process;
1.7.3 1.7.3.1
We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.
proc1: process (a, b, c) begin
  d <= NOT c;
  c <= a AND b;
end process;
proc2: process (b, d) begin
  e <= b AND d;
end process;
proc3: process begin
  a <= '1';
  b <= '0';
  wait for 3 ns;
  b <= '1';
  wait for 99 ns;
end process;
Decomposed
Sorted
Waveforms
[Figure: waveforms for a, b, c, d, and e at 0 ns, 1 ns, 2 ns, 3 ns, and 102 ns; all signals start at 'U'.]
huey: process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;
dewey: process begin
  a <= to_unsigned(0,4);
  wait until re(clk);
  while (a < 4) loop
    a <= a + 1;
    wait until re(clk);
  end loop;
end process;
louie: process begin
  d <= '1';
  wait until re(clk);
  if (a >= 2) then
    d <= '0';
    wait until re(clk);
  end if;
end process;
[Figure: waveforms for clk, a, and d.]
1.8 1.8.1
1.8.2
Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.
- Latches: Use flops, not latches
- T, JK, SR, etc flip-flops: Limit yourself to D-type flip-flops
- Tri-State Buffers: Use multiplexers, not tri-state buffers
Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects completed after that date will need to pay royalties to PalmChip
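As a small illustration of the "multiplexers, not tri-state buffers" guideline, two sources can share one destination through a conditional assignment rather than through 'Z' drivers on a shared bus. The entity name bus_mux, the port names, and the 8-bit width are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a 2:1 multiplexer used in place of a
-- tri-state bus with two drivers.
entity bus_mux is
  port (
    sel    : in  std_logic;
    d0, d1 : in  std_logic_vector(7 downto 0);
    y      : out std_logic_vector(7 downto 0)
  );
end bus_mux;

architecture main of bus_mux is
begin
  -- Exactly one source drives y at any time; no 'Z' values are used.
  y <= d1 when sel = '1' else d0;
end main;
```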
What is This?
process (a) begin
  if rising_edge(a) then
    c <= b;
  end if;
end process;
1.8.3
Hardware and Code for Flops
1.8.3.1
Flops with Waits and Ifs
process (clk) begin
  if rising_edge(clk) then
    q <= d;
  end if;
end process;
1.8.3.2
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      q <= '0';
    else
      q <= d;
    end if;
  end if;
end process;
process (clk, reset) begin
  if (reset = '1') then
    q <= '0';
  else
    if rising_edge(clk) then
      q <= d;
    end if;
  end if;
end process;
process begin
  if (reset = '1') then
    q <= '0';
  else
    q <= d0;
  end if;
  wait until rising_edge(clk);
end process;
[Figure: circuits with a mux on the input and a mux on the output of flops; signals d1, clk, q0, sel, q, and q1.]
Question: For the circuits with mux-on-input and mux-on-output, does q have the same behaviour in both circuits?
1.8.3.3 Input
Hint: Chip Enable
process (clk) begin
  if rising_edge(clk) then
    if (ce = '1') then
      q <= d;
    end if;
  end if;
end process;
1.8.3.4 Reset
1.8.4
1.9
1.10 Arithmetic
VHDL includes all of the common arithmetic and logical operators. Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you.
1.10.1
Arithmetic Packages
To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic. numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.
Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
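A minimal sketch of arithmetic with numeric_std follows. The entity name add_demo, the port names, and the 8-bit width are assumptions made for illustration; only numeric_std is used for arithmetic, in line with the advice above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- the only arithmetic package in use

-- Hypothetical entity: an 8-bit unsigned adder.
entity add_demo is
  port (
    a, b : in  std_logic_vector(7 downto 0);
    sum  : out std_logic_vector(7 downto 0)
  );
end add_demo;

architecture main of add_demo is
  signal ua, ub, usum : unsigned(7 downto 0);
begin
  ua   <= unsigned(a);              -- treat the input bits as unsigned
  ub   <= unsigned(b);
  usum <= ua + ub;                  -- 8-bit result; the carry out is dropped
  sum  <= std_logic_vector(usum);
end main;
```

The synthesis tool is left to choose the adder implementation, as recommended in Section 1.10.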
1.10.2
1.10.3
Overloading of Arithmetic
This section reserved for your reading pleasure
1.10.4
1.10.5
Overloading of Comparisons
This section reserved for your reading pleasure
1.10.6
Different Widths and Comparisons
Overloading of Comparison Operations (=, /=, >=, >, <)
src1/2     src2/1     result
unsigned   integer    OK
signed     integer    OK
unsigned   signed     fails in analysis
1.10.7
Type Conversion
The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors. If you convert between two types of the same width, then no additional hardware will be generated. The listing below summarizes the types of these functions.
unsigned( val : std_logic_vector )             return unsigned;
signed( val : std_logic_vector )               return signed;
to_integer( val : signed )                     return integer;
to_integer( val : unsigned )                   return integer;
to_unsigned( val : integer; width : natural)   return unsigned;
to_signed( val : integer; width : natural)     return signed;
Note: More details in course notes
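The conversions listed above can be chained as in the sketch below. The entity conv_demo and its ports are hypothetical, and the increment is only there to give the integer something to do.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: converts a 4-bit vector to an integer,
-- increments it, and converts the result back.
entity conv_demo is
  port (
    v     : in  std_logic_vector(3 downto 0);
    v_inc : out std_logic_vector(3 downto 0)
  );
end conv_demo;

architecture main of conv_demo is
  signal u : unsigned(3 downto 0);
  signal i : integer range 0 to 15;
begin
  u <= unsigned(v);        -- std_logic_vector -> unsigned (same width)
  i <= to_integer(u);      -- unsigned -> integer
  -- integer -> unsigned -> std_logic_vector; mod keeps the value in 0..15
  v_inc <= std_logic_vector(to_unsigned((i + 1) mod 16, 4));
end main;
```

The conversions between std_logic_vector and unsigned keep the same width, so, as noted above, they should not generate additional hardware.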
1.11.1 1.11.1.1
Initial values on signals (UNSYNTHESIZABLE)
signal bad_signal : std_logic := '0';
Reason: At powerup, the values on signals are random (except for some FPGAs).
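A portable alternative to an initial value is to drive the signal to a known value with a reset. The sketch below assumes a synchronous, active-high reset, in the same style as the flop-with-reset code in Section 1.8.3; the entity name init_demo and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a flop that gets its starting value from a
-- reset signal rather than from an initial value in the declaration.
entity init_demo is
  port (
    clk, reset, d : in  std_logic;
    q             : out std_logic
  );
end init_demo;

architecture main of init_demo is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        q <= '0';    -- known value after reset
      else
        q <= d;
      end if;
    end if;
  end process;
end main;
```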
1.11.1.2
Wait For
Wait for length of time (UNSYNTHESIZABLE)
wait for 10 ns;
Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2 ns delay in all environments.
1.11.1.3
wait statements with different conditions in a process (UNSYNTHESIZABLE)
-- different clock signals
process begin
  wait until rising_edge(clk1);
  x <= a;
  wait until rising_edge(clk2);
  x <= a;
end process;
Reason: Would require the flip flops to use different clock signals at different times.
1.11.1.4
Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)
process (clk) begin
  if rising_edge(clk) then
    q0 <= d0;
  end if;
  if rising_edge(clk) then
    q1 <= d1;
  end if;
end process;
Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic (idiotic?) restrictions to make their jobs simpler.
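The same two flops can be written with a single if rising_edge statement, which matches the idiom that synthesis tools expect. This is a sketch; the entity name two_flops and its port list are assumed for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: two flops described in one clocked process.
entity two_flops is
  port (
    clk, d0, d1 : in  std_logic;
    q0, q1      : out std_logic
  );
end two_flops;

architecture main of two_flops is
begin
  process (clk)
  begin
    -- One rising_edge test covers both assignments.
    if rising_edge(clk) then
      q0 <= d0;
      q1 <= d1;
    end if;
  end process;
end main;
```

Splitting the flops into two separate processes, each with its own single if rising_edge statement, would be an equally conventional choice.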
1.11.1.6
The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).
process (clk) begin
  if rising_edge(clk) then
    q0 <= d0;
  else
    q0 <= d1;
  end if;
end process;
Reason: Generally, an if-then-else statement synthesizes to a multiplexer.
1.11.1.7
An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)
process (clk) begin
  for i in 0 to 7 loop
    if rising_edge(clk) then
      q(i) <= d;
    end if;
  end loop;
end process;
Reason: just an idiom of the synthesis tool. Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden.
Synthesizable Alternative
A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for-loop.
process (clk) begin
  if rising_edge(clk) then
    for i in 0 to 7 loop
      q(i) <= d;
    end loop;
  end if;
end process;
1.11.1.8
wait statements in a for loop (UNSYNTHESIZABLE)
process begin
  for i in 0 to 7 loop
    wait until rising_edge(clk);
    x <= to_unsigned(i,4);
  end loop;
end process;
Reason: Unknown. while-loops with the same behaviour are synthesizable.
Note: Combinational for-loops
Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.
Note: Clocked for-loops
Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.
2.1
Prelude to Chapter
This section reserved for your reading pleasure
2.2 FPGA Background and Coding Guidelines 2.2.1 Generic FPGA Hardware
123
2.2.1.1
Cell = Logic Element (LE) in Altera = Configurable Logic Block (CLB) in Xilinx
[Figure: generic FPGA cell with a combinational block (comb) and a flop (D, CE); signals carry_in, carry_out, data_in, data_out, ctrl_in, flop_data_in, and flop_data_out.]
2.2.2
Area Estimation
To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup-table can implement any function with up to four inputs and one output. We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:
1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.
2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.
Question: How many cells are needed to implement a 4:1 mux?
more than four inputs However, have more than four signals as input, then further back in the fanin, the circuit will collapse back to four or fewer signals.
Question: Map the combinational circuits below onto generic FPGA cells.
[Figure: generic FPGA cells, each with a comb block and a flop (D, CE, Q); signals a, b, c, d, and z.]
2.2.2.1
2.2.2.2
Very few wires that traverse the entire chip and can be connected to every flip-flop.
2.2.2.3
For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.
Microprocessors
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.
                        Hard                           Soft
Altera                  Arm 922T with 200 MIPs         Nios with ?? MIPs
Xilinx: Virtex-II Pro   Power PC 405 with 420 D-MIPs   Microblaze with 100 D-MIPs
The Xilinx Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.
Arithmetic Circuitry
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.
Altera: Mercury          16 × 16 at 130 MHz
Xilinx: Virtex-II Pro    18 × 18 at ??? MHz
Using these resources can significantly improve both the area and performance of a design.
Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.
          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)
2.2.3
140
Use It or Lose
Aim for using 80-90% of the cells on a chip.
Reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
Reason: If you use less than 80% of the cells, then probably:
- there are optimizations that will increase performance and still allow the design to fit on the chip; or
- you spent too much human effort on optimizing for low area; or
- you could use a smaller (cheaper!) chip.
Exception: In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.
2.3
Design Flow
This section reserved for your reading pleasure
2.4
144
2.5
Finite State Machines in VHDL
2.5.1
Introduction to State-Machine Design
2.5.1.1
Mealy vs Moore State Machines
Moore Machines
- Outputs are dependent upon only the state
- No combinational paths from inputs to outputs
[Figure: Moore state diagram with states s0/0, s1/1, s2/0, and s3/0; s0 goes to s1 on a and to s2 on !a.]
Mealy Machines
- Outputs are dependent upon both the state and the inputs
- Combinational paths from inputs to outputs
[Figure: Mealy state diagram with states s0, s1, s2, and s3 and transitions labelled a/1, !a/0, /0, and /0.]
2.5.1.2 VHDL
A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.
Design Decisions
- Moore vs Mealy (Sections 2.5.2 and 2.5.3)
- Implicit vs Explicit (Section 2.5.1.3)
- State values in explicit state machines: Enumerated type vs constants (Section 2.5.5)
- State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)
2.5.1.3
There are two styles of writing state machines in VHDL: explicit and implicit.
Explicit
- State signal appears explicitly in VHDL code
- At most one wait statement per process
- Two sub-categories of explicit state machines
Explicit-Current
- State signal represents current state
- Next-state computation done in a clocked process
Explicit-Current+Next
- Two state signals: current state and next state
- Next-state computation done in a combinational process
- Current-state <= next-state is registered assignment
Implicit
- Use multiple wait statements in a process to describe state machine implicitly
Explicit-Current+Next
- Most detailed, closest to hardware
- Greatest opportunity for manual optimization
- Most labour-intensive
- Susceptible to small, subtle, hard-to-find bugs
Explicit-Current
- Almost as much opportunity for manual optimization as Explicit-Current+Next
- Easier to write than Explicit-Current+Next
- Less susceptible to subtle bugs
Implicit
- Taught infrequently
- Least detailed, furthest from actual hardware
- Rely on synthesis for optimization
- Usually least labour to write, shortest code
- Easiest to write correctly (But must understand VHDL synthesis!)
Terminology
Note: The terminology of explicit and implicit is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having implicit state machines. There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.
2.5.2.1
architecture moore_implicit_v1a of simple is
begin
  process begin
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
    else
      z <= '0';
    end if;
    wait until rising_edge(clk);
    z <= '0';
    wait until rising_edge(clk);
  end process;
end moore_implicit_v1a;
2.5.2.2
architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z     <= '1';
          else
            state <= s2;
            z     <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z     <= '0';
        when s3 =>
          state <= s0;
          z     <= '0';  -- output in s0 is 0 (s0/0 in the state diagram)
      end case;
    end if;
  end process;
end moore_explicit_v1;
2.5.2.5
architecture moore_explicit_v4 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  process (state, a) begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;
  z <= '1' when (state = s1) else '0';
end moore_explicit_v4;
2.5.4
Reset
All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset. There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
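One way to add a reset to an explicit-current state machine is sketched below, assuming a synchronous, active-high reset that returns the machine to s0. The entity name simple_rst and the architecture name are hypothetical; the states and transitions follow the Moore examples in Section 2.5.2.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: the simple Moore machine with a reset input.
entity simple_rst is
  port (
    clk, reset, a : in  std_logic;
    z             : out std_logic
  );
end simple_rst;

architecture moore_explicit_reset of simple_rst is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- reset is tested on every clock cycle, before the case statement
      if reset = '1' then
        state <= s0;
      else
        case state is
          when s0 =>
            if a = '1' then
              state <= s1;
            else
              state <= s2;
            end if;
          when s1 | s2 =>
            state <= s3;
          when s3 =>
            state <= s0;
        end case;
      end if;
    end if;
  end process;

  -- Moore output: depends only on the state
  z <= '1' when state = s1 else '0';
end moore_explicit_reset;
```

Because the reset test wraps the whole case statement, the machine returns to s0 on the first clock edge after reset is asserted, whatever state it was in.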
2.5.5
State Encoding
This section reserved for your reading pleasure
171
2.6 2.6.1
Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.
Purpose:
- Provide a disciplined approach for designing datapath-centric circuits
- Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.
- Estimate area and performance
- Make tradeoffs between different design options
Background
- Based on techniques from high-level synthesis tools
- Some similarity between high-level synthesis and software compilation
- Each dataflow diagram corresponds to a basic block in software compiler terminology.
Data-Dependency Graph
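[Figure: data-dependency graph with inputs a, b, c, d, e, and f and additions producing x1, x2, x3, x4, and z.]
Dataflow Diagrams
[Figure: dataflow diagrams for the same additions, with inputs a, b, c, d, e, and f, intermediate signals x1 to x4, and output z.]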
Latency
[Figure: dataflow diagram with inputs a, b, c, d, e, and f and output z.]
Latency = 6 clock cycles
[Figure: dataflow diagram for the same computation with a different schedule.]
Latency = 4 clock cycles
Flip Flops
[Figure: dataflow diagrams with inputs a, b, c, d, e, and f, intermediate signals x1 to x4, and output z, illustrating flip flops.]
Datapath Components
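[Figure: dataflow diagram with inputs a, b, c, d, e, and f, intermediate signals x1 to x4, and output z, illustrating datapath components.]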
Inputs
- Unconnected signal tails are inputs
- Horizontal lines mark clock cycle boundaries
[Figure: dataflow diagram with inputs a, b, c, d, e, and f, intermediate signals x1 to x4, and output z.]
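[Figure: hardware, dataflow diagram, and behaviour (waveforms for clk, i, and x).]
Register Input
[Figure: hardware, dataflow diagram, and behaviour (waveforms for clk, i, and x).]
Register Signal
[Figure: hardware, dataflow diagram, and behaviour (waveforms for clk, i1, i2, and x) for an adder with inputs i1 and i2.]
Combinational-Component Output
[Figure: hardware, dataflow diagram, and behaviour (waveforms for clk, i1, i2, and x) for an adder with inputs i1 and i2.]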
2.6.3
[Figure: step-by-step execution of a dataflow diagram with additions producing x1, x2, x3, x4, x5, and z over clock cycles 0 to 6, alongside waveforms for clk, a, x1, x2, x3, x4, x5, and z.]
2.6.4
Definition Latency: Number of clock cycles from inputs to outputs.
- A combinational circuit has latency of zero.
- A single register has a latency of one.
- A chain of n registers has a latency of n.
Min clock period (Max clock speed) limited by longest path in a clock cycle
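The "chain of n registers" case can be illustrated with n = 3. In the sketch below, the entity name reg_chain and the signal names are assumptions; a value presented on i appears on o three clock cycles later.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a chain of three registers (latency of three).
entity reg_chain is
  port (
    clk, i : in  std_logic;
    o      : out std_logic
  );
end reg_chain;

architecture main of reg_chain is
  signal r1, r2, r3 : std_logic;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      r1 <= i;    -- first register
      r2 <= r1;   -- second register: reads the old value of r1
      r3 <= r2;   -- third register: reads the old value of r2
    end if;
  end process;

  o <= r3;        -- combinational connection: adds no latency
end main;
```

There is no combinational logic between the registers, so the longest path within a clock cycle is a single register-to-register connection.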
199
2.6.5
Area Estimation
- Maximum number of blocks in a clock cycle is total number of that component that are needed
- Maximum number of signals that cross a cycle boundary is total number of registers that are needed
- Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
- Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed
These estimates give lower bounds. Other constraints might force you to use more components.
Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.
- With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
- With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
- In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.
2.6.6
Design Analysis
[Figure: dataflow diagrams with inputs a, b, c, d, e, and f, additions producing x1 to x5, and output z, annotated with num inputs, num outputs, num registers, and latency.]
2.6.7
[Figure: two schedules of the same additions (x1 to x5 and z), one finishing in clock cycles 5 and 6 and one in clock cycles 3 and 4, with waveforms for clk, a, x1, x2, x3, x4, and x5 over clock cycles 0 to 6.]
Note: wasted.
Design Comparison
One add per clock cycle
[Figure: two dataflow diagrams with inputs a, b, c, d, e, and f and additions producing x1 to x5 and z; one finishes in clock cycles 5 and 6, the other in clock cycles 3 and 4.]
6 1 6 1 op + 1 add 6
6 1 6 2 op + 2 add 4
206
2.7
2.8
We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control
Design Process
1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize
2.8.1
Requirements
Cost requirements
Maximum of two adders
Maximum of two multipliers
Unlimited registers
Maximum of three inputs and one output
Maximum of 5000 student-minutes of design effort
2.8.2
Algorithm
2.8.3
2.8.4
2.8.5
Optimize Resources
Analysis
Question: Should we move the second addition from third clock cycle to second?
Define Entity
Having finalized our input/output scheduling, we can write our entity.
Note: we will add a reset signal later, when we design the state machine to control the datapath.
entity vanier is
  port (
    clk      : in  std_logic;
    i_1, i_2 : in  std_logic_vector(15 downto 0);
    o_1      : out std_logic_vector(15 downto 0)
  );
end vanier;
2.8.6
Question: Why do we not need to assign a new name to x1, x2, and x4 the second time they cross a clock cycle boundary?
2.8.7
Input/Output Allocation
[Figure: dataflow diagram with inputs d, b, a, c and named signals x1 to x8, producing z.]
VHDL Code!
architecture hlm_v1 of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
    wait until rising_edge(clk);
    x_8 <= x_6 + (x_4 + x_7);
  end process;
  o_1 <= std_logic_vector(x_8);
end hlm_v1;
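A software reference model can be handy when checking the high-level model in simulation. The Python sketch below mirrors the arithmetic of hlm_v1, assuming (from the dataflow diagrams) that i_1 carries d and then a, and i_2 carries b and then c. The function name `vanier_model` is hypothetical.

```python
MASK16 = 0xFFFF  # signals are 16 bits wide

def vanier_model(a, b, c, d):
    """Reference model: (a*d) + (d*b) + b + c, with 8-bit multiplier operands."""
    low8 = lambda v: v & 0xFF
    x4 = low8(d) * low8(b)      # x_4 <= x_1(7 downto 0) * x_2(7 downto 0)
    x6 = low8(a) * low8(d)      # x_6 <= x_3(7 downto 0) * x_1(7 downto 0)
    x7 = (b + c) & MASK16       # x_7 <= x_2 + x_5
    return (x6 + ((x4 + x7) & MASK16)) & MASK16  # x_8 <= x_6 + (x_4 + x_7)

print(vanier_model(a=2, b=3, c=4, d=5))
```

The masks model the fixed 16-bit width of unsigned addition in VHDL, so sums that overflow wrap around exactly as they would in the hardware.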
[Figure: dataflow diagram for signals x1 to x8 with inputs i1 and i2 and output o1, and a chart of which signals occupy registers r1 to r5 in each clock cycle.]
2.8.8
architecture hlm_v1c of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
  end process;
  o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
end hlm_v1c;
[Figure: dataflow diagram for hlm_v1c, with d and a arriving on i1, b and c arriving on i2, signals x1 to x7, and z driven on o1.]
2.8.9
Register Allocation
[Figure: dataflow diagram with signals x1 to x8 allocated to registers r1 to r5.]
architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    wait until rising_edge(clk);
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;
2.8.10
Datapath Allocation
[Figure: dataflow diagram with registers r1 to r5 and datapath components allocated.]
2.8.11.1
Build a table with one row per state, one column per register.
[Figure: dataflow diagram divided into states S0 to S3, with multiplier m1 and adders a1 and a2, next to an empty control table with rows S0 to S3 and columns ce and d for registers r1 to r5.]
Chip enable: a register holds a value for multiple clock cycles. Mux: a register loads values from multiple sources.
2.8.11.2
Table for datapath components. One row per state. One column per datapath component. Sub-columns for sources and instructions (e.g. add/sub for ALU).
[Figure: the dataflow diagram divided into states S0 to S3, next to an empty datapath control table with rows S0 to S3.]
2.8.11.3
We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor: S0 -> S1 -> S2 -> S3 -> S0 ...
2.8.11.4
S0 S1 S2 S3
Question:
2.8.12 chine
We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.
architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
[Figure: the dataflow diagram for states S0 to S3 alongside the datapath hardware, with registers r1 to r5, multiplier m1, adders a1 and a2, and output o1.]
2.8.13
Peephole Optimizations
-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;
-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if ____ then
      r_1 <= i_1;
    end if;
  end if;
end process;
Peephole Optimizations
-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;
-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;
Peephole Optimizations
-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0 => state <= ____;
        when S1 => state <= ____;
        when S2 => state <= ____;
        when S3 => state <= ____;
      end case;
    end if;
  end if;
end process;
-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;
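The optimized state machine is a 4-bit rotate of the one-hot state vector. A minimal Python model of that rotation, using the one-hot encodings from explicit_v1 (the function name `next_state` is hypothetical), shows that each state moves to its successor and S3 wraps back to S0:

```python
S0, S1, S2, S3 = 0b0001, 0b0010, 0b0100, 0b1000  # one-hot encodings

def next_state(st):
    """Rotate a 4-bit one-hot vector left by one: bit i moves to bit (i+1) mod 4."""
    return ((st << 1) | (st >> 3)) & 0xF

state = S0
sequence = []
for _ in range(5):
    sequence.append(state)
    state = next_state(state)
print([format(s, "04b") for s in sequence])
```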
2.8.14
Our functional requirements were written as:
output = (a × d) + (d × b) + b + c
Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):
output = (a × d) + b + (d × b) + c
Alternative: (a × d) + c + (d × b) + b
[Figure: dataflow diagrams for the alternative orderings, producing z.]
2.9
Pipelining
Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions). The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.
2.9.1
Introduction to Pipelining
[Figure: dataflow diagram and timing diagram for an unpipelined sum of a to f that reuses registers r1 and r2 and adder add1 over clock cycles 0 to 5.]
[Figure: dataflow diagram and timing diagram for a pipelined sum of a to f with five stages (stage1 to stage5), registers r1 to r10, and a separate adder per stage.]
[Figure: unpipelined hardware with registers r1 and r2, adder add1, and output o1.]
Pipelined Hardware
[Figure: pipelined hardware with five stages, inputs i1 to i6, registers r1 to r10, adders add1 to add5, and output o1.]
2.9.2
Partially Pipelined
Fully pipelined: throughput is one parcel per clock cycle.
Partially pipelined: throughput is less than one parcel per clock cycle.
Superscalar: throughput is more than one parcel per clock cycle.
[Figure: partially pipelined dataflow diagram and timing diagram with three stages, registers r1 to r6, and adders add1 to add3.]
[Figure: partially pipelined hardware with reset, State(0), State(1), three stages, registers r1 to r6, adders add1 to add3, and output o1.]
2.9.3
Terminology
Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.
Definition Latency: The latency of a pipeline is measured the same as for an unpipelined circuit: the number of clock cycles from inputs to outputs.
Definition Throughput: The number of parcels consumed or produced per clock cycle.
Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used respectively to refer to earlier and later stages in the pipeline. For example, stage1 is upstream from stage2.
Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a bubble.
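Latency and throughput together determine how long a stream of parcels takes. The sketch below is a simple model under the assumption that parcels enter back to back with no bubbles; `cycles_per_parcel` is a hypothetical parameter equal to 1/throughput (1 for a fully pipelined circuit, 2 for a partially pipelined circuit that accepts a parcel every other cycle).

```python
def cycles_for_parcels(n_parcels, latency, cycles_per_parcel=1):
    """Clock cycles until the last of n parcels leaves the pipeline.

    The first parcel needs `latency` cycles; each later parcel finishes
    `cycles_per_parcel` cycles after its predecessor.
    """
    if n_parcels == 0:
        return 0
    return latency + (n_parcels - 1) * cycles_per_parcel

print(cycles_for_parcels(10, latency=5))                       # fully pipelined
print(cycles_for_parcels(10, latency=3, cycles_per_parcel=2))  # partially pipelined
```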
Question: How do we know whether the output of the pipeline is a bubble or is valid data?
six 8-bit numbers:
Maximum of five adders
Small miscellaneous hardware (e.g. muxes) is unlimited
Maximum of six inputs and one output
Design effort is unlimited
VHDL Code
-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
  r3 <= i3;
  r4 <= i4;
  v1 <= i_valid;
end process;
a1 <= r1 + r2;
a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1;
  r6 <= a2;
  r7 <= i5;
  r8 <= i6;
  v2 <= v1;
end process;
a3 <= r5 + r6;
a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9  <= a3;
  r10 <= a4;
  v3  <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z <= a5;
o_valid <= v3;
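A cycle-level Python model can help when reasoning about when o_valid rises. The sketch below assumes that r1 to r4 and v1 are loaded in stage 1, r5 to r8 and v2 in stage 2, and r9, r10, and v3 in stage 3 (the placement of the r4, r8, and valid-bit assignments is inferred from the register numbering). Bit widths are not modelled, and the names `clock_edge` and `outputs` are hypothetical.

```python
def clock_edge(regs, valids, i_valid, inputs):
    """Return (regs, valids) after one rising edge; all registers update together."""
    i1, i2, i3, i4, i5, i6 = inputs
    r = regs
    new_regs = {
        1: i1, 2: i2, 3: i3, 4: i4,                     # stage 1
        5: r[1] + r[2], 6: r[3] + r[4], 7: i5, 8: i6,   # stage 2 (a1, a2)
        9: r[5] + r[6], 10: r[7] + r[8],                # stage 3 (a3, a4)
    }
    new_valids = (i_valid, valids[0], valids[1])        # v1, v2, v3
    return new_regs, new_valids

def outputs(regs, valids):
    return regs[9] + regs[10], valids[2]                # z (a5), o_valid

regs = {k: 0 for k in range(1, 11)}
valids = (False, False, False)
stimulus = [                       # i5 and i6 arrive one cycle after i1..i4
    (True,  (1, 2, 3, 4, 0, 0)),
    (False, (0, 0, 0, 0, 5, 6)),
    (False, (0, 0, 0, 0, 0, 0)),
]
trace = []
for i_valid, inputs in stimulus:
    regs, valids = clock_edge(regs, valids, i_valid, inputs)
    trace.append(outputs(regs, valids))
print(trace)
```

The valid bit travels alongside the data, so o_valid is true only in the cycle where z holds the sum of a complete parcel; in every other cycle the output is a bubble.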
2.11 Memory Arrays and RTL Design
2.11.1 Memory Operations
Read of Memory with Registered Inputs
[Figure: hardware and behaviour (waveforms of clk, we, a, di, M(a), do) for memory read and write operations, including a dual-port memory with ports DI0, A1, DO1 and outputs do0 and do1.]
2.11.2
2.11.3
Data Dependencies
[Figure: pairs of accesses to M[i] illustrating each type of dependency.]
Read after Write (True dependency)
Write after Write (Load dependency)
Write after Read (Anti dependency)
Instructions in a program can be reordered, so long as the data dependencies are preserved.
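The three dependency types can be sketched as a small classifier. This is an illustrative model only: an operation is written as a hypothetical `(kind, address)` pair, with `"R"` for a read and `"W"` for a write, and the pair is given in program order.

```python
def dependency(first, second):
    """Classify the dependency of `second` on an earlier operation `first`.

    Returns "RAW", "WAW", "WAR", or None when the two may be freely reordered.
    """
    (kind1, addr1), (kind2, addr2) = first, second
    if addr1 != addr2:
        return None            # different locations: no dependency
    if kind1 == "R" and kind2 == "R":
        return None            # two reads never conflict
    return {("W", "R"): "RAW", ("W", "W"): "WAW", ("R", "W"): "WAR"}[(kind1, kind2)]

print(dependency(("W", 3), ("R", 3)))  # write M[3], then read M[3]
print(dependency(("R", 3), ("W", 0)))  # different addresses
```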
Purpose of Dependencies
W0:  R3 := ......
W1:  R3 := ......          (producer)
R1:  ... := ... R3 ...     (consumer)
W2:  R3 := ......
WAW ordering prevents W0 from happening after W1.
RAW ordering prevents R1 from happening before W1.
WAR ordering prevents W2 from happening before R1.
Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.
Initial Program
Initial Program
Valid Modification
M[3] := 31 C := M[3]
M[3] := 32 M[0] := 01
Initial Program
2.11.4
[Figure: legend (input port, output port, state signal, array read, array write) and dataflow diagrams of reads and writes on the array mem, using rd_addr, wr_addr, wr1_addr, wr2_addr, data1, data2, wr_data, rd_data, and data_out.]
2.11.5 gram
[Figure: dataflow diagrams of the example program's memory writes, M(wr), and reads, M(rd), scheduled over clock cycles 1 to 7.]
Minimal Dependencies
[Figure: dataflow diagrams of the program's memory reads and writes with only the minimal dependencies retained.]
2.13.2
Algorithm
$avg_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i)/4$
Decompose into sum and avg:
$sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i$
$avg_i = sum_i/4$
Look for patterns and potential optimizations:
$sum_5 = x_2 + (x_3 + x_4 + x_5)$
$sum_6 = (x_3 + x_4 + x_5) + x_6 = sum_5 - x_2 + x_6$
Generalized recurrence equation:
$sum_i = sum_{i-1} - x_{i-4} + x_i$
$avg_i = sum_i/4$
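A quick way to gain confidence in the recurrence is to compare it against the direct four-term sum on sample data. The Python sketch below does that; the sample values and function names are hypothetical.

```python
def window_sums_direct(x):
    """sum_i = x[i-3] + x[i-2] + x[i-1] + x[i] for every complete window."""
    return [x[i - 3] + x[i - 2] + x[i - 1] + x[i] for i in range(3, len(x))]

def window_sums_recurrence(x):
    """sum_i = sum_(i-1) - x[i-4] + x[i], seeded with the first complete window."""
    sums = [x[0] + x[1] + x[2] + x[3]]
    for i in range(4, len(x)):
        sums.append(sums[-1] - x[i - 4] + x[i])
    return sums

samples = [4, 8, 12, 16, 20, 24]
sums = window_sums_recurrence(samples)
averages = [s // 4 for s in sums]   # dividing by 4 is a 2-bit right shift
print(sums, averages)
```

The recurrence needs one subtraction and one addition per sample, instead of three additions, which is what makes the single add/subtract datapath possible.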
Summary of Behaviour
1. Define a signal new for the value of i_data each time that i_valid is 1.
2. Define a memory array M to store a sliding window of the four most recent values of i_data.
3. Define a signal old for the oldest data value from the sliding window.
4. Update $sum_i$ with $sum_{i-1} - old_i + new_i$
Sliding Window
Two design patterns to choose from: shift register vs circular buffer
[Figure: a shift register (M[3], M[2], M[1], M[0]) and a circular buffer (M[0..3]), each with new data entering and old data leaving.]
For FIFO behaviour, a circular buffer is usually preferred: smaller and lower power.
[Figure: circular-buffer hardware with four 8-bit registers M[0] to M[3], whose chip enables (ce) are selected by the bits of idx, and an 8-bit output q.]
2.13.3
First Pseudocode
Real 3-address pseudocode
new    = i_data
old    = M[idx]
tmp    = sum - old
sum    = tmp + new
M[idx] = new
idx    = idx rol 1
o_avg  = sum/4
[Figure: dataflow diagram of the pseudocode, with M (Rd, Wr), idx, i_data, new, old, tmp, sum, a wired shift, and o_avg.]
2.13.3 Pseudocode and Dataflow Diagrams
Remove intermediate signal old
new    = i_data
tmp    = sum - M[idx]
sum    = tmp + new
M[idx] = new
idx    = idx rol 1
o_avg  = sum/4
Reading new from memory
tmp    = sum - M[idx]
M[idx] = i_data
new    = M[idx]
sum    = tmp + new
idx    = idx rol 1
o_avg  = sum/4
Remove intermediate signal new
tmp    = sum - M[idx]
M[idx] = i_data
sum    = tmp + M[idx]
idx    = idx rol 1
o_avg  = sum/4
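The final version of the pseudocode translates almost line for line into a software model. In the Python sketch below, the class name `MovingAverage4` is hypothetical, the memory is assumed to start at zero, and the one-hot "idx rol 1" is modelled as an integer index that wraps modulo 4.

```python
class MovingAverage4:
    """Behavioural model of the four-entry circular-buffer moving average."""

    def __init__(self):
        self.M = [0, 0, 0, 0]   # sliding window (assumed to reset to zero)
        self.idx = 0            # position of the oldest value
        self.sum = 0

    def push(self, i_data):
        tmp = self.sum - self.M[self.idx]   # tmp    = sum - M[idx]
        self.M[self.idx] = i_data           # M[idx] = i_data
        self.sum = tmp + self.M[self.idx]   # sum    = tmp + M[idx]
        self.idx = (self.idx + 1) % 4       # idx    = idx rol 1
        return self.sum // 4                # o_avg  = sum/4 (wired shift)

avg = MovingAverage4()
results = [avg.push(x) for x in [4, 8, 12, 16, 20]]
print(results)
```

Note that the first three outputs average a window that still contains initial zeros, which is one reason a real design pairs the data with a valid signal.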
Dataflow Diagram
Latency of three clock cycles
[Figure: dataflow diagrams over states S0, S1, and S2 with memory M (Wr, Rd), i_data, idx, sum, a wired shift, and o_avg.]
Two clock cycles potentially preferable for performance, but requires an additional multiplexer.
Latency of two clock cycles with registered address
[Figure: dataflow diagrams over states S0 and S1 with memory M (Wr, Rd), i_data, idx, sum, adder/subtracter as1, a wired shift, and o_avg.]
2.13.4
[Figure: dataflow diagram over states S0 and S1 with memory M (Wr, Rd), rol, as1, a wired shift, and o_avg.]
Register control table
        M                  idx          sum
        we   addr   d      ce   d       ce   d
S0      1    idx    x      0    -       1    as1
S1      0    idx    -      1    rol     1    as1
Datapath control table
        as1                    rol
        sub   src1   src2      src1   src2
S0      0     M      sum       -      -
S1      1     sum    M         idx    1
Optimized control table
Static assignments in control table
M.addr   = idx
M.d      = x
idx.d    = rol
sum.d    = as1
as1.src1 = sum
as1.src2 = M
State Machine
state   i_valid   valid1
S0      1         0
S1      0         1
idle    0         0
Final control table with state encoding
state   i_valid   valid1   M.we   idx.ce   sum.ce   as1.sub
S0      1         0        1      0        1        0
S1      0         1        0      1        1        1
idle    0         0        0      0        0        0
M.we    = i_valid
idx.ce  = valid1
sum.ce  = i_valid OR valid1
as1.sub = valid1
2.13.5
VHDL Code
-- valid bits
process begin
  wait until rising_edge(clk);
  valid1  <= i_valid;
  o_valid <= valid1;
end process;
-- idx
process begin
  wait until rising_edge(clk);
  if reset = '1' then
    idx <= "0001";
  else
    if valid1 = '1' then
      idx <= idx rol 1;
    end if;
  end if;
end process;
-- sliding window
process begin
  wait until rising_edge(clk);
  for i in 3 downto 0 loop
    if (i_valid = '1') and (idx(i) = '1') then
      M(i) <= i_data;
    end if;
  end loop;
end process;
mem_out <= M(0) when idx(0) = '1'
      else M(1) when idx(1) = '1'
      else M(2) when idx(2) = '1'
      else M(3);
-- add sub
add_sub <= sum - mem_out when valid1 = '1'
      else sum + mem_out;
-- sum
process begin
  wait until rising_edge(clk);
  if i_valid = '1' or valid1 = '1' then
    sum <= add_sub;
  end if;
end process;
Hardware
[Figure: hardware with inputs i_valid and i_data, register valid1, memory M, one-hot index idx, an add/sub unit, register sum, wired shifts, and outputs o_valid and o_avg.]
3.1
Introduction
Hennessy and Patterson's Quantitative Computer Architecture (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.
3.2
Defining Performance
$\text{Performance} = \frac{\text{Work}}{\text{Time}}$
You can double your performance by:
doing twice the work in the same amount of time
OR
doing the same amount of work in half the time
Benchmarking
$\text{Performance} = \frac{\text{Work}}{\text{Time}}$
Measuring time is easy, but how do we accurately measure work?
The game of benchmarketing is finding a definition of work that makes your system appear to get the most work done in the least amount of time.
Measure of Work      Measure of Performance
clock cycle          MHz
instruction          MIPs
synthetic program    Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
real program         SPEC
travel 1/4 mile      drag race
SPEC Benchmarks
The Spec Benchmarks are among the most respected and accurate predictions of real-world performance.
Definition SPEC: Standard Performance Evaluation Corporation. MISSION: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems. http://www.spec.org
The Spec organization has different benchmarks for integer software, floating-point software, web-serving software, etc.
3.3 3.3.1
Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:
$n\% = \frac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B}$
Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:
$n\% = \frac{\text{Time}_B - \text{Time}_A}{\text{Time}_A}$
In general, the equation for a fast system to be n% faster than a slow system is:
$n\% = \frac{T_{Slow} - T_{Fast}}{T_{Fast}}$
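As a quick numeric check of the "n% faster" formula, the Python sketch below evaluates it for hypothetical execution times (the function name and the numbers are illustrative only).

```python
def percent_faster(t_slow, t_fast):
    """n% = (T_slow - T_fast) / T_fast, expressed as a percentage."""
    return 100.0 * (t_slow - t_fast) / t_fast

# A task that takes 15 s on the slow system and 10 s on the fast system.
print(percent_faster(t_slow=15.0, t_fast=10.0))
```

Note that the denominator is the fast system's time: a system that halves the execution time is 100% faster, not 50% faster.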
Another useful formula is the average time to do one of k different tasks, each of which happens $\%_i$ of the time and takes an amount of time $T_i$ each time it is done:
$T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i)$
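The weighted average can be sketched in a few lines of Python. The task mix below (75% of tasks at 100 ns, 25% at 200 ns) is hypothetical and chosen only to illustrate the formula.

```python
def average_time(tasks):
    """T_avg = sum of (fraction_i * time_i); fractions are expected to sum to 1."""
    return sum(fraction * time for fraction, time in tasks)

tasks = [(0.75, 100.0), (0.25, 200.0)]  # (fraction of occurrences, time in ns)
t_avg = average_time(tasks)
print(t_avg)
```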
We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers....)
3.3.2
3.4 Clock Speed, CPI, Program Length, and Performance
3.4.1 Mathematics
CPI            Cycles per instruction
NumInsts       Number of instructions
ClockSpeed     Clock speed
ClockPeriod    Clock period
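These four quantities are related by the standard execution-time equation, Time = NumInsts × CPI × ClockPeriod, with ClockPeriod = 1 / ClockSpeed. The Python sketch below applies it to hypothetical numbers; the function name is illustrative.

```python
def execution_time_ns(num_insts, cpi, clock_speed_mhz):
    """Time = NumInsts x CPI x ClockPeriod, where ClockPeriod = 1 / ClockSpeed."""
    clock_period_ns = 1000.0 / clock_speed_mhz
    return num_insts * cpi * clock_period_ns

# Hypothetical program: 1000 instructions, CPI of 2, 200 MHz clock (5 ns period).
print(execution_time_ns(num_insts=1000, cpi=2.0, clock_speed_mhz=200.0))
```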
3.4.2
The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's SPARC instruction set). Assume that it requires 20% more instructions to write a program in the SPARC instruction set than the same program requires in IA-32.
Question:
Relative CPI
Question: What is the ratio between the CPIs of the two microprocessors?
Absolute CPI
Question: Can you determine the absolute (actual) CPI of either microprocessor?
Options
You have three options:
option 1 : no change
option 2 : add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3 : add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.
Question:
Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
3.4.5
Summary of Equations
3.5 Performance Analysis and Dataflow Diagrams
3.5.1 Dataflow Diagrams, CPI, and Clock Speed
One of the challenges in designing a circuit is to choose the clock speed. Choosing a clock period affects many aspects of the design, not just the overall performance.
Some goals will push you toward a short clock period.
Some goals will push you toward a long clock period.
Goal
Minimize area
Increase flexibility
scheduling
Decrease percentage of clock cycle spent in flops (overhead: time in flops is not doing useful work)
Decrease time to execute an instruction
1. Start with the smallest possible clock period.
2. Allocate operations to clock cycles.
3. Calculate the average time to execute an instruction.
4. If latency > 1, then: increase the clock period until the latency is reduced; return to Step 2.
   Else (latency = 1): choose the clock period and dataflow diagram that resulted in the highest performance.
5. Optimize the dataflow diagram to reduce area.
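The search over clock periods can be sketched in Python for the simplified case of a single instruction made of a chain of operations. This is an illustration under stated assumptions, not the full procedure: operations are packed greedily in order, every cycle pays one flop delay, and the delays (30 ns, 50 ns, 5 ns flop) and candidate periods are hypothetical.

```python
def schedule(op_delays_ns, clock_period_ns, flop_ns):
    """Greedily pack a chain of operations into clock cycles.

    Returns the latency in cycles, or None if an operation cannot fit
    in one cycle alongside the flop overhead.
    """
    budget = clock_period_ns - flop_ns   # time left for logic in each cycle
    cycles, used = 1, 0.0
    for delay in op_delays_ns:
        if delay > budget:
            return None
        if used + delay > budget:        # operation spills into a new cycle
            cycles += 1
            used = 0.0
        used += delay
    return cycles

def best_clock_period(op_delays_ns, candidates_ns, flop_ns):
    """Return (period, latency, total time) with the smallest total time."""
    best = None
    for period in candidates_ns:
        latency = schedule(op_delays_ns, period, flop_ns)
        if latency is None:
            continue
        total = latency * period
        if best is None or total < best[2]:
            best = (period, latency, total)
    return best

ops = [30.0, 50.0]
print(best_clock_period(ops, candidates_ns=[35.0, 55.0, 85.0], flop_ns=5.0))
```

With several instructions, the same loop would compare the frequency-weighted average time per instruction instead of a single total.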
[Figure: dataflow diagrams for Instruction A and Instruction B, built from operations including f (30 ns), g (50 ns), and i (40 ns), scheduled with a 55 ns clock period.]
Scheduling (2)
15 ns 25 ns 15 ns 25 ns
Scheduling (3)
15 ns 25 ns
Tavg
Tavg
This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.
Instruction    Algorithm                            Frequency of Occurrence
InstP          a × b × ((a × b) + (b × d) + e)      75%
InstQ          (i + j + k + l) × m                  25%
Component      Delay
2-input Mult   40ns
2-input Add    25ns
Register       5ns
NOTES
There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
You must put registers on your inputs; you do not need to register your outputs. The environment will directly connect your outputs (its inputs) to registers.
Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once; if you need to use a value in multiple clock cycles, you must store it in a register.
Questions
Question: What clock period will result in the best overall performance?
Question: Find a minimal set of resources that will achieve the performance you calculated.
3.6 3.6.1
3.6.1.1
Multiply by a constant power of two
Multiply by a power of two
Divide by a constant power of two
Divide by a power of two
Multiply by 3
3.6.1.2
Boolean tests that can be implemented as wires:
is odd, is even
is neg, is pos
By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire. For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not do this reduction.
When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison:
(state = S0 or state = S3 or state = S4) = '1'
can be reduced from looking at 3 bits, to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.
3.6.2 3.6.2.1
Pushing multiplexors into the fanin of a signal can reduce area.
Before
z <= a + b when (w = '1') else a + c;
After
tmp <= b when (w = '1') else c;
z <= a + tmp;
The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.
3.6.2.2 tion
Introduce new signals to capture subexpressions that occur multiple places in the code.
Before
y <= a + b + c when (w = '1') else d;
z <= a + c + d when (w = '1') else e;
Subexpression Elimination
Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
3.6.2.3
Computation Replication
To improve performance
If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
To reduce area
If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.
Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.
3.6.3
Arithmetic
VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism. Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
3.7
Retiming
[Figure: hardware and timing diagram over states S0 to S3 showing state, a, b, c, sel, x, y, and z, with the critical path marked.]
process begin
  wait until rising_edge(clk);
  if state = S1 then
    z <= a + c;
  else
    z <= b + c;
  end if;
end process;
[Figure: timing diagram over states S0 to S3 showing state, a, b, c, sel, x, y, and z.]
process (state) begin
  if state = S1 then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if state = ____ then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;
4.1
Overview
4.1.1 Terminology: Validation / Verification / Testing
4.1.2 The Difficulty of Designing Correct Chips
4.2 4.2.1
To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i+n_s}$ different cases when doing functional verification.
Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i+n_s+n_c}$ different cases?
4.2.2
This example illustrates the difficulty of achieving significant coverage on realistic circuits. Consider doing the functional simulation for a double precision (64-bit) floating-point divider.
Given Information
Data width: 64 bits
Number of gates in circuit: 10 000
Number of assembly-language instructions to simulate one gate for one test case: 100
Number of clock cycles required to execute one assembly-language instruction on the computer that is running the simulation: 0.5
Clock speed of computer that is running the simulation: 1 Gigahertz
Number of Cases
Question: How many cases must be considered?
Coverage
Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=$10^9$
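One way to set up the arithmetic is sketched below in Python. It rests on an assumption that is not stated in the given information: the divider has two 64-bit operands and no internal state, so there are 2^128 input combinations. A year is taken as 365 days.

```python
DATA_WIDTH_BITS = 64
GATES = 10_000
INSTRS_PER_GATE = 100
CYCLES_PER_INSTR = 0.5
CYCLES_PER_SEC = 10**9
COMPUTERS = 10
SECONDS_PER_YEAR = 365 * 24 * 3600

# Assumption: two 64-bit operands and no internal state.
total_cases = 2 ** (2 * DATA_WIDTH_BITS)

cycles_per_case = GATES * INSTRS_PER_GATE * CYCLES_PER_INSTR
cases_per_sec = CYCLES_PER_SEC / cycles_per_case          # per computer
cases_per_year = COMPUTERS * cases_per_sec * SECONDS_PER_YEAR
coverage = cases_per_year / total_cases

print(f"cases simulated in one year: {cases_per_year:.4g}")
print(f"coverage: {coverage:.3g}")
```

Under these assumptions the year of simulation covers a vanishingly small fraction of the input space, which is the point of the example: exhaustive simulation is not feasible for realistic datapaths.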
By tapeout, over 200 billion simulation cycles had been run on a network of computers. All of these simulations represent less than two minutes of running a real processor.
4.3 4.3.1
Implementation: Circuit that you're checking for bugs; also known as design under test or unit under test
Stimulus: Generates test vectors
Specification: Describes desired behaviour of implementation
Check: Checks whether implementation obeys specification
4.3.2
stimulus
implementation
4.3.3
relational testbench
stimulus
check
implementation
4.3.4
testbench stimulus
implementation
architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;
4.3.5
Datapath vs Control
implementation
relational testbench
stimulus
check
implementation
4.3.6
Verification Tips
Suggested order of simulation for functional verification:
1. Write high-level model.
2. Simulate high-level model until it has correct functionality and latency.
3. Write synthesizable model.
4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.
7. Use timing simulation (uw-timsim) to check behaviour of optimized model against high-level model.
Section 4.4 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.
Implementation
entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;
architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1') else '0';
end main;
4.4.1
A Spec-Less Testbench
First, use waveform viewer to check that implementation generates reasonable outputs for a small set of inputs.
entity and2_tb is
end and2_tb;
architecture main_tb of and2_tb is
  component and2 ... end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;
4.4.2
architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a    b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;
4.4.3
stimulus : process
  type test_datum_ty is record
    ra, rb, rc : std_logic;
  end record;
  type test_vectors_ty is array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    -- a, b: inputs
    -- c   : expected output
    --   a    b    c
    ( ( '0', '0', '0'),
      ( '0', '1', '0'),
      ( '1', '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
4.4.4
entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;
architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;
4.4.5
architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;
4.4.6
Relational Specification
Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specic value. To do this, we drop the spec process, and put the brains into the check process.
architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process ... end process;
  ------------------------------------------
  -- sensitive to every signal that the check reads
  check : process (ta, tb, tc_impl) begin
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;
In this section, we will explore the functional verification of state machines via a First-In First-Out queue.
4.5.1
Structure of queue
[Figure: queue contents during a write sequence, a read sequence, and writes that wrap around once the queue holds B to J, followed by the queue's signals (do_rd, wr_idx, data_rd, empty).]
4.5.2 4.5.2.1
package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;
package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;
4.5.2.2
4.5.3 Code Structure for Verification
This section reserved for your reading pleasure
4.5.3
Verification things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions
4.5.4
Instrumentation Code
Added to implementation to support verification
Usually keeps track of previous values of signals
Does not create hardware (optimized away during synthesis)
Does not feed any output signals
Must use synthesizable subset of VHDL
process (clk) begin
  if rising_edge(clk) then
    prev_rd_idx <= rd_idx;
    prev_wr_idx <= wr_idx;
    prev_do_rd  <= do_rd;
    prev_do_wr  <= do_wr;
  end if;
end process;
4.5.5
1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was 1, or reset is 1.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was 1, or reset is 1.
5. And many others....
Assertion Template
process (signals read) begin
  assert (required condition)
    report "error: message" severity warning;
end process;
4.5.6
type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);
Functions
function to_idx (i : natural range data_array'low to data_array'high) return idx_ty is
begin
  return to_unsigned(i, idx_ty'length);
end to_idx;
Conversion to Index
Without Function                 With Function
rd_idx <= to_unsigned(5, 3);     rd_idx <= to_idx(5);
The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.
Attributes
function inc_idx (idx : idx_ty) return idx_ty is
begin
  if idx < data_array'high then
    return (idx + 1);
  else
    return (to_idx(data_array'low));
  end if;
end inc_idx;
4.5.7
Queue Specification
Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices. The specification should be obviously correct. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.
Things to Notice
Things to notice in queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don't Care
rd_data <= data_array(rd_idx) when (do_rd = '1') else (others => '-');
4.5.8
Queue Testbench
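Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data ('U')
3. std_match to compare spec and impl data

[Table: values matched by = versus std_match. With =, '0' matches only '0' and '1' matches only '1'. With std_match, '0' matches '0' and 'L', '1' matches '1' and 'H', and '-' matches everything.]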
With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
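The difference between = and std_match can be seen in a small model. The Python sketch below is an illustration only: it represents std_logic vectors as strings and mimics the matching behaviour described above; the function names are hypothetical, and the treatment of 'U', 'X', 'Z', and 'W' (they match nothing except '-') follows my understanding of the IEEE numeric_std package.

```python
def std_match_bit(a, b):
    """Model of std_match on two std_logic values given as characters."""
    if a == "-" or b == "-":
        return True  # don't care matches everything
    strength = {"0": "0", "L": "0", "1": "1", "H": "1"}
    if a in strength and b in strength:
        return strength[a] == strength[b]  # weak and strong values match
    return False  # 'U', 'X', 'Z', 'W' match nothing except '-'

def std_match(spec, impl):
    """Element-wise match of two equal-length vectors written as strings."""
    return len(spec) == len(impl) and all(
        std_match_bit(s, i) for s, i in zip(spec, impl))
```

For example, a specification value of "----" fails an equality check against "1010" but passes std_match.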
4.6
This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.
INSTRUCTIONS:
1. Assume that the code as currently written is correct: any change to the code that causes a change to the behaviour of the signals heat or count is a bug.
2. For each of the two proposed code changes, answer whether the code change will cause a bug.
3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.
4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.

Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or as an assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.
prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2 If the door is open, then heat is off.
prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.
Implementation
entity microwave is
  port (
    timer   : in unsigned(7 downto 0);  -- time input from user
    reset,                              -- resets microwave
    clk,                                -- clock signal input
    is_open,                            -- detects when door is open
    start   : in std_logic;             -- start button input from user
    heat    : out std_logic             -- 1=on, 0=off
  );
end microwave;

architecture main of microwave is
  signal count  : unsigned(7 downto 0);  -- internal time count
  signal x_heat : std_logic;
begin

  -- heat process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        x_heat <= '0';
      elsif (is_open = '0') and (start = '1') and (timer > 0) then
        x_heat <= '1';
      elsif (is_open = '0') and (count > 0) then
        x_heat <= x_heat;
      else
        x_heat <= '0';
      end if;
    end if;
  end process;
Properties
prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2 If the door is open, then heat is off.
prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.
Change #1
From:
  elsif (start = '1') then
    count <= timer;
  elsif (count > 0) then
    count <= count - 1;
To:
  elsif (count > 0) then
    count <= count - 1;
  elsif (start = '1') then
    count <= timer;
Change #2
From:
  elsif (is_open = '0') and (start = '1') and (timer > 0) then
    x_heat <= '1';
  elsif (is_open = '0') and (count > 0) then
    x_heat <= x_heat;
To:
  elsif (is_open = '0') and ((start = '1') or (count > 0)) then
    x_heat <= '1';
  else
    x_heat <= '0';
Coverage
Question: If msb of src1 is 1 and lsb of src2 is 0 or sum(3) is 1, then result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?
5.1
In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.
5.1.1
Background Definitions
This section reserved for your reading pleasure
5.1.2 5.1.2.1
Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.
Clock skew is caused by the difference in interconnect delays to different points on the chip.
5.1.2.2
Clock Latency
[Figure: clock tree showing master clock, intermediate clock, and final clock, with the latency between them]
Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: different points in the clock generation circuitry.)
Note: Clock latency does not affect the limit on the minimum clock period.
5.1.2.3
ideal clock
Clock Jitter
Definition Clock Jitter: Difference between actual clock period and ideal clock period.
- temperature and voltage variations across different locations on a chip
- manufacturing variations between different parts
5.1.3 5.1.3.1
clk d q
Flop Behaviour
Latch Behaviour
Storage devices have two modes: load mode and store mode. Flops are edge sensitive; they are in load mode just before the clock edge. Latches are level sensitive; they are in load mode while their enable signal is asserted (high for active-high latches, low for active-low latches).
Timing Parameters
[Figure: timing diagrams of d, clk, and q showing setup, hold, and clock-to-Q for a flip-flop, an active-high latch, and an active-low latch]
Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store data correctly. Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.
5.1.4
Propagation Delays
Propagation delay: time it takes a signal to travel from the source (driving) flop to the destination flop
propagation delay = load delay + interconnect delay
Load delay: combinational gates between the flops
Interconnect delay: wires between gates and flops
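To show how these parameters combine, here is a small Python sketch. It uses the usual textbook timing budget (clock-to-Q, then propagation, then setup, plus margins for skew and jitter) as an assumption; it is not copied from the slides, and the function and parameter names are invented for illustration.

```python
def min_clock_period(clock_to_q, load_delay, interconnect_delay, setup,
                     skew=0.0, jitter=0.0):
    """Smallest clock period that leaves room for one flop-to-flop transfer.

    Assumed budget: clock-to-Q + propagation delay + setup, padded by
    worst-case skew and jitter. All values share one time unit.
    """
    propagation = load_delay + interconnect_delay
    return clock_to_q + propagation + setup + skew + jitter
```

For example, clock-to-Q of 1, load delay of 4, interconnect delay of 2, and setup of 1 give a minimum period of 8 time units before any margin is added.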
5.1.5 5.1.5.1
clk1 clk2 a b
ClockPeriod >
5.1.5.2 5.1.5.3
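[Figure: timing diagram of signals a, clk, b, c, d with clock-to-Q marked]
Setup Violation
[Figure: timing diagram of signals a, clk, b, c, d with clock-to-Q, propagation, and setup marked; after a setup violation the values of c and d are unknown (???)]
Hold Violation
[Figure: timing diagrams of signals a, clk, b, c, d; after a hold violation the captured value is unknown (???)]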
5.2.1
Storage mode
Latch implementation
Latch Glitching
[Figure: latch circuit with inputs d and clk and output o]
Note (inverters on clk): Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there were only one inverter, a glitch would occur. For more on this, see section 5.2.1.6.
[Figure: latch circuit annotated with signal values for four cases: Loading 0, Loading 1, Storing 0, Storing 1]
5.2.1.3 Latch
[Figure: four copies of the latch circuit with signals d, clk, cn, c2, l1, l2, s1, s2, qn, q]
5.2.1.4
[Figure: latch circuit shown with d and clk values]
Setup Violation
[Figure: sequence of latch snapshots as d changes shortly before the clock edge]
t=1: the new value of d propagates through AND; clk propagates through inverter.
Trouble: inconsistent values on load path and store path. The old value is still in the store path when the store path is enabled.
[Figure: latch snapshots and waveforms of l1, l2, qn, q, s1, s2, clk, cn, c2]
We now repeat the analysis of setup violation, but illustrate the minimum violation (input transitions from to 3 time-units before the clock edge).
[Figure: sequence of latch snapshots for the minimum setup violation]
Trouble: inconsistent values on load path and store path. The old value is still in the store path when the store path is enabled.
[Figure: waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, c2 with the setup window marked]
5.2.1.5
[Figure: four copies of the latch circuit with signals d, clk, cn, c2, l1, l2, s1, s2, qn, q]
5.2.1.6
[Figure: waveforms of d, l1, l2, qn, q, s1, s2, clk, c2, cn]
5.3
Definition false path: a path along which an edge cannot travel from beginning to end.
Outline
The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.
1. Section 5.3.2: Find the longest path ignoring the possibility of false paths.
2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.
Notes
Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.
Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.
gate  delay
NOT   2
AND   4
OR    4
XOR   6
5.3.1.1 Adder
Question:
Find the critical path through the full-adder circuit shown below.
[Figure: full-adder circuit with inputs ci, a, b; internal signals i, j, k; outputs co, s]
Alternative Excitation
Question: Do the input values of ci=0, a=, b=1 exercise the critical path?
[Figure: full-adder circuit with inputs ci, a, b; internal signals i, j, k; outputs co, s]
5.3.1.2 5.3.1.3
The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.
[Figure: circuits with inputs a, b and output y, shown for four cases: a = 0, b = 0→1; a = 0, b = 1→0; a = 1, b = 0→1; a = 1, b = 1→0]
Question:
[Figure: circuit with inputs a, b, c]
5.3.2
Longest Path
The primary input signal with the maximum delay is the start of the longest path. The delay annotation of this signal is the delay of the longest path. The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.
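The procedure above can be sketched in Python. The gate delays are the ones from the table in the notes; the netlist is an assumption, since the full-adder figure did not survive: it is the standard full adder (i = a xor b, s = i xor ci, j = a and b, k = i and ci, co = j or k). The function names are hypothetical. Each signal is annotated with the maximum delay from that signal to any output, then the path is traced forward from the primary input with the largest annotation.

```python
GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}  # delays from the notes

# Assumed full-adder netlist: output signal -> (gate type, input signals).
FULL_ADDER = {
    "i": ("XOR", ["a", "b"]),
    "s": ("XOR", ["i", "ci"]),
    "j": ("AND", ["a", "b"]),
    "k": ("AND", ["i", "ci"]),
    "co": ("OR", ["j", "k"]),
}

def annotate(netlist):
    """Annotate each signal with the maximum delay from it to any output."""
    fanout = {}  # signal -> list of (gate output, gate delay)
    for out, (gate, ins) in netlist.items():
        for sig in ins:
            fanout.setdefault(sig, []).append((out, GATE_DELAY[gate]))
    delay = {}

    def visit(sig):
        if sig not in delay:
            delay[sig] = max((d + visit(out) for out, d in fanout.get(sig, [])),
                             default=0)
        return delay[sig]

    for sig in list(fanout) + list(netlist):
        visit(sig)
    return delay, fanout

def longest_path(netlist):
    """Return (path, delay) for the longest path, ignoring false paths."""
    delay, fanout = annotate(netlist)
    primary = [s for s in fanout if s not in netlist]
    sig = max(primary, key=lambda s: delay[s])  # start of the longest path
    path = [sig]
    while fanout.get(sig):
        # pick the fanout signal with the maximum remaining delay
        sig, _ = max(fanout[sig], key=lambda f: f[1] + delay[f[0]])
        path.append(sig)
    return path, delay[path[0]]
```

With the assumed netlist, the longest path runs from a through i and k to co. This is only a candidate critical path; the false-path test still has to be applied to it.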
5.3.3 5.3.3.1
The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs. The controlled output value is the value produced by the controlling input value.
Gate   Controlling Value   Controlled Output
AND    0                   0
OR     1                   1
NAND   0                   1
NOR    1                   0
XOR    none                none
Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.
Reconvergent Fanout
Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.
[Figure: circuit with signals a, b, c, d, e, f, g, h and outputs y, z illustrating reconvergent fanout]
If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable 0 or 1.
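[Figure: rules for propagating an edge through a gate, given the value of the side input; surviving labels: 0 OR, 1 XOR]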
Missing Rules?
Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?
a b a c b c
Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value. As always, section 5.3.5 has the complete viability condition.
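The early-side-input viability condition can be written as a short check. The Python sketch below is an illustration under assumptions: side-input values are supplied directly as stable 0 or 1 values (it does not derive them from primary inputs or detect contradictions), the controlling values are the standard ones for each gate type, and the names are hypothetical.

```python
CONTROLLING = {"AND": 0, "NAND": 0, "OR": 1, "NOR": 1}  # XOR has none

def path_is_viable(gates_on_path, values):
    """Early-side-input viability: every side input is non-controlling.

    gates_on_path: list of (gate type, list of side-input signal names).
    values: dict mapping signal name to its stable value, 0 or 1.
    """
    for gate, side_inputs in gates_on_path:
        ctrl = CONTROLLING.get(gate)  # None for gates such as XOR
        if any(values[s] == ctrl for s in side_inputs):
            return False  # a controlling side input blocks the edge
    return True
```

A path through an AND gate with side input b and an OR gate with side input c is viable only when b = 1 and c = 0.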
5.3.3.3
False-Path Example 1
Question: Determine if the longest path in the circuit below is a false path.
a 16 b 12 c 10
d 14
f 12
12 g 8 12 6 h 4 i 4 8 8
2 4 4
j 0 k 0
e 8
5.3.4
If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.
5.3.4.1 Path
1. Initialize path table with primary inputs, their potential delay, and fanout.
2. Sort path table by potential delay.
3. If the partial path with the max delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:
   (a) Extend path through unused fanout with max delay.
   (b) Delete this fanout signal from the list of unused fanout signals.
4. Compute the constraint that the side input has a non-controlling value.
5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:
   (a) Mark this partial path as false.
   (b) For each partial path that is a prefix of the false path:
5.3.4 Finding the Next Candidate Path side input non-controlling value constraint
5.3.5 Path
We now remove the assumption that side inputs always arrive earlier than path inputs.
5.3.5.1
Late Side
[Figure: two cases under monotone speedup: the path input propagates; the side input causes a glitch]
The complete and correct rule: a path input excites the gate if the side-input is non-controlling or the side-input arrives late and the path input is controlling.
5.3.5.2
Monotone Speedup
Definition monotonic: A function f is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: \( x < y \implies f(x) \le f(y) \).
Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.
Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.
If we find a contradiction on the path, check for side inputs that are on previously discovered false paths. If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input. For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input). To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that: the path input has a controlling value and at least one of the prefixes is viable.
5.3.5.5
Question:
potential unused delay fanout path false a,b,d,e,f,g 10 g, c a 10 a,c,f,g side input non-controlling value constraint f[e] 1 a g[a] 1 a
Complete Example 2
Question: Find the critical path in the circuit below.
a
8 8 8 14
f4
8 8 8
4 4 4
j 0
b 12 c
d 12 e 10
10
g 8 12
12
h8 i
Complete Example 3
Monotone speedup
0 a
0 0
b c
2 2
0 a
0 0
b c
2 2
4 e
0 6 f
Fast timing
Complete Example 4
Late side inputs sometimes must have an edge. Find the second-longest path with contradiction using early sides: c d k e a i j b g f h
a b
c 0 d 1
0
e 1 g 4 h 6
1 6
1 0
0 f 2
c 2 d 4
0
a 0 b
e4 8
6
48
i 810
j
10 12
14 k 16
0 f 2
g 4 h 6
Complete Example 5
Late side paths must be viable.
Question:
a b c
i k j
h d f
- finding all input values that will exercise the critical path
- multiple paths with the same delay to the same gate
5.4 5.4.1
[Figure: mask-level layout; surviving labels: GND, metal]
A Pair of Inverters
[Figure: a pair of inverters driving signal b, drawn at transistor level (VDD, GND), gate level, and mask level]
[Figure: circuit with signals a, b, c, d between VDD and GND, drawn at transistor level and at mask level]
[Figure: equivalent RC models using Rpu, Rpd, Cp, RW1, CW1, RW2, CW2, RV, CL]
5.4.2
[Figure: input voltage and output voltage versus time]
Definition Trip Points: A high or 1 trip point is the voltage level where an upwards transition means the signal represents a 1. A low or 0 trip point is the voltage level where a downwards transition means the signal represents a 0.
The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors, and resistors. Resistors are numbered according to the capacitor to their right. Multiple resistors in series without an intervening capacitor are lumped into a single resistor. All nodes except the source start at GND. We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).
The process for analyzing a transition from VDD to GND on a node is the dual of the process just described. The source node is GND, all other nodes start at VDD, we calculate the voltage when we turn on the N-transistor (connect it to GND).
[Figure: RC tree with source VDD as node 0 and nodes 1 to 5; resistors R1 to R5 (Rpu, RW1, RW2, RV); capacitors Cp, CW1, CW2, CL]
Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground.
Example: down(2) = {C2, C3, C4, C5}.
Example: down(3) = {C3, C4}.
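\[ V_3(t) = V_0(t) - \sum_{r \in path(3)} R_r I_r(t) = V_0(t) - \left( R_1 I_1(t) + R_2 I_2(t) + R_3 I_3(t) \right) \]
The current through a resistor is the sum of the currents through all of the downstream capacitors:
\[ I_r(t) = \sum_{c \in down(r)} I_c \]
\[ I_1(t) = I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_2(t) = I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_3(t) = I_{c3} + I_{c4} \]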
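CHAPTER 5. TIMING ANALYSIS
Substitute I_r into the equation for V_3:
\[ V_3(t) = V_0(t) - \left( R_1 (I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_2 (I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_3 (I_{c3} + I_{c4}) \right) \]
Use associativity to group terms by currents:
\[ V_3(t) = V_0(t) - \left( I_{c1} R_1 + I_{c2} (R_1 + R_2) + I_{c3} (R_1 + R_2 + R_3) + I_{c4} (R_1 + R_2 + R_3) + I_{c5} (R_1 + R_2) \right) \]
5.4.2 Derivation of Analog Timing Model
Current through a capacitor:
\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]
Substitute I_c into the equation for V_3:
\[ V_3(t) = V_0(t) - \left( R_1 C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + (R_1 + R_2) C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + (R_1 + R_2) C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \right) \]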
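\[ R_{i,k} = \sum_{r \in path(i) \cap path(k)} R_r \]
R_{3,1} = R1
R_{3,2} = R1 + R2
R_{3,3} = R1 + R2 + R3
R_{3,4} = R1 + R2 + R3
R_{3,5} = R1 + R2
Substitute R_{i,k} into V_3:
\[ V_3(t) = V_0(t) - \left( R_{3,1} C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + R_{3,2} C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + R_{3,3} C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + R_{3,4} C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + R_{3,5} C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \right) \]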
5.4.2.2
General Derivation
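\[ V_i(t) = V_0(t) - (\text{voltage drop from Node}_0 \text{ to Node}_i) \]
The voltage drop is the sum of the voltage drops across the resistors on the path from Node0 to Nodei:
\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r I_r(t) \]
The current through a resistor is the sum of the currents through all of the downstream capacitors:
\[ I_r(t) = \sum_{c \in down(r)} I_c \]
Substitute I_r into the equation for V_i:
\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r \sum_{c \in down(r)} I_c = V_0(t) - \sum_{r \in path(i)} \sum_{c \in down(r)} R_r I_c \]
Current through a capacitor:
\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]
Substitute I_c into the equation for V_i:
\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} \sum_{c \in down(r)} R_r C_c \frac{\partial V_c(t)}{\partial t} \]
Swap the order of summation. Capacitor k is downstream of a resistor r on path(i) exactly when r is also on path(k):
\[ V_i(t) = V_0(t) - \sum_{k \in Nodes} \sum_{r \in path(i) \cap path(k)} R_r C_k \frac{\partial V_k(t)}{\partial t} \]
With \( R_{i,k} = \sum_{r \in path(i) \cap path(k)} R_r \):
\[ V_i(t) = V_0(t) - \sum_{k \in Nodes} R_{i,k} C_k \frac{\partial V_k(t)}{\partial t} \]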
5.4.3
Assume that V0(t) is a step function from 0 to 1 at time 0. Derive upper and lower bounds for Vi(t). Find RC time constants for upper and lower bounds. Elmore delay is guaranteed to be between upper and lower bounds.
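As a numeric companion, the Python sketch below computes the Elmore delay for a node of the five-node RC tree used in the derivation, using the standard definition TD_i = sum over k of R_{i,k} C_k. The tree encoding is an assumption read off the node numbering and the down() examples (node 5 hangs off node 2, node 4 follows node 3); the function names are hypothetical.

```python
# Assumed RC tree: node -> (parent node, name of resistor feeding the node)
PARENT = {1: (0, "R1"), 2: (1, "R2"), 3: (2, "R3"), 4: (3, "R4"), 5: (2, "R5")}

def path(node):
    """Set of resistors on the path from the source (node 0) to node."""
    resistors = set()
    while node != 0:
        parent, r = PARENT[node]
        resistors.add(r)
        node = parent
    return resistors

def shared_resistance(i, k, R):
    """R_{i,k}: total resistance common to the paths to nodes i and k."""
    return sum(R[r] for r in path(i) & path(k))

def elmore_delay(i, R, C):
    """TD_i = sum over all nodes k of R_{i,k} * C_k."""
    return sum(shared_resistance(i, k, R) * C[k] for k in PARENT)
```

With every resistor and capacitor set to 1, the shared resistances for node 3 are 1, 2, 3, 3, 2, matching the R_{3,k} table, and the Elmore delay at node 3 is their sum.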
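[Figure: upper bound, Elmore approximation, and lower bound on V_i(t); time-axis marks at TD_i - TR_i, TR_i, TP - TR_i, TP, and TD_i]
Lower bounds:
\[ V_i(t) \ge 1 - \frac{TD_i}{t + TR_i} \]
\[ V_i(t) \ge 1 - \frac{TD_i}{TP} e^{(TP - TR_i - t)/TP} \]
Time constants:
\[ TP = \sum_{k \in Nodes} R_{k,k} C_k \]
\[ TD_i = \sum_{k \in Nodes} R_{k,i} C_k \]
\[ TR_i = \sum_{k \in Nodes} \frac{R_{k,i}^2 C_k}{R_{i,i}} \]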
5.4.4 5.4.4.1
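[Figure: gates G1 and G2 connected through antifuses and wire segments, with the equivalent RC model; surviving labels: Ra1, Ra4, Rw1, Rw3, C1, C2, C3, CG2, Rpu, Rpd, Cp, Vi; legend symbols: G*, C*, Ra*, Rw*]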
Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?
G2 G3 G1
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2
Delay to G2 vs G3
Question: Assuming all wire segments at same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?
R3 C3 R5 C4 R6 G3 C7 C6 R2 C2
R4 C5 G2 C1 R1
G1
G1
G2
Rpu R1 n1 R2 n2 Cp Rpd G3 R5 n6 R6 C6 C1 C2 n3 R3 n4 R4 C3 C4
Vi
n5 C5
n7 C7
5.5
Speed Grading
- Fabs sort chips according to their speed (sorting is known as speed grading or speed binning).
- Faster chips are more expensive.
- In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
- Propagation delay is the average of the rising and falling propagation delays.
- Typical speed grades for FPGAs:
  Std  standard speed grade
  1    15% faster than Std
  2    25% faster than Std
  3    35% faster than Std
Worst-Case Timing
Slow-slow conditions (process variation/corner which results in slow p-channel and slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners.
Derating factor is a number used to adjust timing numbers to account for voltage and temperature conditions.
ASIC manufacturers' classes, based on variety of environments:
Class       VDD        Temperature (TA = ambient temp, TC = case temp)
Commercial  5V ± 5%    0 to +70°C
Industrial  5V ± 10%   -40 to +85°C
Military    5V ± 10%   -55 to +125°C
What is important is the transistor temperature inside the chip, TJ (junction temperature).
5.5.1
Speed Binning
Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably. Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run. A speed bin is the clock speed that chips will be labeled with when sold. Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).
5.5.2 5.5.2.1
In Smith's book, Table 5.2 (Fanout delay) combines two separate parameters:
5.5.2.2
Derating Factors
Delays are dependent upon supply voltage and temperature.
Temp ↑ ⟹ Delay ↑
Supply voltage ↑ ⟹ Delay ↓
Temperature
Temp ↑ ⟹ Delay ↑
Temp ↑ ⟹ Resistivity of wires ↑
As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.
Supply Voltage
Supply voltage ↑ ⟹ Delay ↓
Supply voltage ↑ ⟹ current ↑ (V = IR)
current ↑ ⟹ time to charge load capacitors to threshold voltage ↓
6.1 6.1.1
- Laptops, PDAs, cell phones, etc.: obvious!
- For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost.
- Approx 25% of operating expense of server farm goes to energy bills.
- (Dis)Comfort of Unix labs in E2.
- Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros).
- High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Pentium 4 processor thermal throttling.
- In 2000, information technology consumed 8% of total power in US.
- Future power viruses: cell phone viruses cause cell phone to run in full power mode and consume battery very quickly; PC viruses that cause CPU to melt down batteries.
6.1.2
Note: Lots of links from E&CE 327 web pages under Documentation
6.1.3
Power vs Energy
Most people talk about power reduction, but sometimes they mean power and sometimes energy. Power minimization is usually about heat removal
6.1.4
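Batteries rated in Amp-hours at a voltage.
\[ \text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts} = \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} = \text{Coulombs} \times \text{Volts} = \text{Energy} \]
Batteries store energy.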
6.1.4.2
To extend battery life, we want to increase the amount of work done and/or decrease energy consumed. Work and energy are the same units, therefore to extend battery life, we truly want to improve efficiency. Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?
\[ \frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}} \]
Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.
Question:
6.1.4.3
Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?
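One way to work through the numbers is sketched below in Python. This is an illustration, not an official solution: it assumes the battery delivers its full rated energy at a constant 70W draw, and the function and parameter names are invented for the example.

```python
def simulation_steps(volts, amp_hours, watts, clock_hz, cpi, instr_per_step):
    """Estimate how many simulation steps fit into one battery charge."""
    energy_joules = volts * amp_hours * 3600  # Volts * Amps * Seconds
    runtime_s = energy_joules / watts         # Energy / Power
    instr_per_s = clock_hz / cpi              # instructions per second
    return runtime_s * instr_per_s / instr_per_step
```

With the values from the question, the battery holds 90,000 J, which lasts about 1286 seconds at 70W; at 700 simulation steps per second that is roughly 900,000 steps.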
6.2
Power Equations
Power = SwitchPower + ShortPower + LeakagePower
where DynamicPower = SwitchPower + ShortPower and StaticPower = LeakagePower

Dynamic Power: dependent upon clock speed
  Switching Power: useful, charges up transistors
  Short Circuit Power: not useful, both N and P transistors are on
Static Power: independent of clock speed
  Leakage Power: not useful, leaks around transistor
Dynamic Power
Dynamic power is proportional to how often signals change their value (switch). Roughly 20% of signals switch during a clock cycle.
Need to take glitches into account when calculating activity factor. Glitches increase the activity factor. Equations for dynamic power contain clock speed and activity factor.
6.2.1
Switching Power
[Figure: charging a capacitor (input 1->0, output 0->1 across CapLoad) and discharging a capacitor (input 0->1, output 1->0 across CapLoad)]
Switching Power
When a capacitor C is charged to a voltage V, the energy stored in the capacitor is \( \frac{1}{2} C V^2 \).
The energy required to charge the capacitor from 0 to V is \( C V^2 \). Half of the energy (\( \frac{1}{2} C V^2 \)) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.
When the capacitor discharges from V to 0, the energy stored in the capacitor (\( \frac{1}{2} C V^2 \)) is dissipated as heat through the pulldown resistance.
Switching Power
f: frequency at which inverter goes through complete charge-discharge cycle (eqn 15.4 in Smith)
\[ \text{average switching power} = f \times \text{CapLoad} \times \text{VoltSup}^2 \]
ClockSpeed: clock speed
ActFact: average number of times that signal switches from 0→1 or from 1→0 during a clock cycle
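The definitions above can be tied together in a short Python sketch. It rests on one assumption that follows from the definitions rather than from a surviving formula: a complete charge-discharge cycle contains two transitions, so f = ActFact × ClockSpeed / 2. The function name and units (farads, volts, hertz, watts) are choices made for the example.

```python
def switching_power(act_fact, clock_speed, cap_load, volt_sup):
    """Average switching power in watts.

    act_fact: average number of 0->1 or 1->0 transitions per clock cycle.
    clock_speed in Hz, cap_load in farads, volt_sup in volts.
    """
    f = act_fact * clock_speed / 2.0  # two transitions per charge-discharge cycle
    return f * cap_load * volt_sup ** 2
```

The quadratic dependence on the supply voltage is visible directly: halving volt_sup cuts the result to a quarter.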
6.2.2
IShort Vi Vo
Short-Circuited Power
Gate Voltage
6.2.3
Leakage Power
Vi Vo
I
N P N P P
ILeak V
N-substrate
ILeak e
6.2.4
Glossary
This section reserved for your reading pleasure
6.2.5
Analog Parameters
Power reduction parameters at the analog level.
- capacitance: for example, Silicon on Insulator (SOI)
- resistance: for example, copper wires
- voltage: low-voltage circuits
Analog Techniques
Power reduction techniques at the analog level.
- dual-VDD: Two different supply voltages: high voltage for performance-critical portions of design, low voltage for remainder of circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
- dual-Vt: Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of design (can switch more quickly, but more leakage power), transistors with high threshold voltage for remainder of circuit (switches more slowly, but reduces leakage power).
- exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
- adiabatic circuits: Special circuitry that consumes power on 0→1 transitions, but not 1→0 transitions. These sacrifice performance for reduced power.
- clock trees: Up to 30% of total power can be consumed in clock generation and clock tree.
Digital Parameters
Power-reduction parameters at the digital level.
- capacitance (number of gates)
- activity factor
- clock frequency
Digital Techniques
Power-reduction techniques at the digital level.
- multiple clocks: Put a high speed clock in performance-critical parts of design and a low speed clock for remainder of circuit.
- clock gating: Turn off clock to portions of a chip when they are not being used.
- data encoding: Gray coding vs one-hot vs fully encoded vs ...
- glitch reduction: Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
- asynchronous circuits: Get rid of clocks altogether....
Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html
Power =
we observe:
\[ \text{Power} \propto \text{VoltSup}^2 \]
ILeak e
And increasing the leakage current increases the power:
\[ \text{Power} \propto \text{ILeak} \]
So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.
6.5
CHAPTER 6. POWER ANALYSIS AND POWER-AWARE DESIGN
Decimal  Gray  Binary
0        0000  0000
1        0001  0001
2        0011  0010
3        0010  0011
4        0110  0100
5        0111  0101
6        0101  0110
7        0100  0111
8        1100  1000
9        1101  1001
10       1111  1010
11       1110  1011
12       1010  1100
13       1011  1101
14       1001  1110
15       1000  1111
8-bit Counter
Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?
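A quick way to explore this question is to count bit flips. The Python sketch below is a rough proxy, not a full power model: it counts only toggles of the counter's state bits over one complete pass (including the wrap back to zero) and ignores the combinational logic, glitches, and capacitance differences. The helper names are hypothetical.

```python
def gray(n):
    """Gray code of n (reflected binary code)."""
    return n ^ (n >> 1)

def toggles(codes):
    """Total bit flips over one full pass, including the wrap to the start."""
    total = 0
    for cur, nxt in zip(codes, codes[1:] + codes[:1]):
        total += bin(cur ^ nxt).count("1")
    return total

def counter_toggles(bits):
    """Return (binary toggles, Gray toggles) for a counter of the given width."""
    values = list(range(2 ** bits))
    return toggles(values), toggles([gray(v) for v in values])
```

For eight bits, the binary sequence flips 510 bits per pass and the Gray sequence flips 256, so by this measure the binary counter's state bits switch roughly twice as often.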
Random Data
Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?
6.5.2 6.5.2.1
Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)
[Figure: required behaviour: waveforms of clk and done over cycles 1 to 33, with done pulsing once every 16 clock cycles]
You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)
Question: What is the relative amount of power consumption for the different options?
6.5.2.2
Additional Information
Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.
[Figure: FPGA cell containing a PLA and a flip-flop]
1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents
Data Encoding
Decimal  Gray  One-Hot           Binary
0        0000  0000000000000001  0000
1        0001  0000000000000010  0001
2        0011  0000000000000100  0010
3        0010  0000000000001000  0011
4        0110  0000000000010000  0100
5        0111  0000000000100000  0101
6        0101  0000000001000000  0110
7        0100  0000000010000000  0111
8        1100  0000000100000000  1000
9        1101  0000001000000000  1001
10       1111  0000010000000000  1010
11       1110  0000100000000000  1011
12       1010  0001000000000000  1100
13       1011  0010000000000000  1101
14       1001  0100000000000000  1110
15       1000  1000000000000000  1111
6.5.2.3
Capacitance
cap number subtotal cap Gray d() PLAs Flops done PLAs Flops 1-Hot d() PLAs Flops done PLAs Flops Binary d() PLAs Flops done PLAs Flops
Gray coding
One-hot coding
Binary coding
1-Hot
6.6
Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.
6.6.1
Condition                                           Circuitry turned off
(condition not recovered)                           Everything except core state (PC, registers, caches, etc.)
No floating point instructions for k clock cycles  Floating point circuitry
Instruction cache miss                              Instruction decode circuitry
No instruction in pipe stage i                      Pipe stage i
6.6.2
Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.
[Figure: circuit with i_data, i_valid, clk, o_data, o_valid, shown without gating and with a gated clock cool_clk]
6.6.3 6.6.4
Parameters to characterize effectiveness of clock gating:
Eff = effectiveness of clock gating
PctValid = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
PctClk = percentage of clock cycles that clock toggles
Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off.
Equation for effectiveness of clock gating:
\[ \text{Eff} = \frac{\text{PctClkOff}}{\text{PctInvalid}} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}} \]
Question:
Question:
Effect of Effectiveness
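We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:
[Figure: gated activity factor A' versus Eff from 0 to 1, falling from A at Eff = 0 to PctValid * A at Eff = 1]
\[ A' = A - (1 - \text{PctValid}) \times \text{Eff} \times A \]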
- 70% of the time the main circuit has valid data
- clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
- clock gating circuit has 10% of the area of the main circuit
- clock gating circuit has same activity factor as main circuit
- neglect short-circuiting and leakage power
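These figures can be plugged into the effectiveness relation with a short Python sketch. It is an illustration under assumptions: capacitance is taken to be proportional to area, the gating logic itself is never gated, and only switching power is counted. The function names are hypothetical.

```python
def pct_clk(pct_valid, eff):
    """Fraction of clock cycles in which the gated clock toggles.

    Rearranged from Eff = (1 - PctClk) / (1 - PctValid).
    """
    return 1.0 - eff * (1.0 - pct_valid)

def relative_power(pct_valid, eff, gating_area):
    """Switching power with gating divided by switching power without gating."""
    main = pct_clk(pct_valid, eff)  # main circuit: activity scales with the clock
    gating = gating_area            # gating logic: always clocked, same activity factor
    return main + gating
```

With 70% valid data, 90% effectiveness, and a gating circuit of 10% area, the clock toggles in 73% of cycles and the gated design uses about 83% of the original switching power.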
6.6.6 6.6.6.1
[Figure: circuit with outputs o_valid and o_data]
Valid-Bit Protocol
[Figure: waveforms of clk, i_valid, i_data, o_valid, o_data]
i_valid: high when i_data has valid data; signifies whether circuit should pay attention to or ignore data.
o_valid: high when o_data has valid data; signifies whether environment should pay attention to output of circuit.
For more on circuit protocols, see section 2.12.
Microscopic Analysis
Which clock edges are needed?
[Figure: waveforms of i_valid, clk, o_valid]
6.6.6.3
[Figure: clock-gated circuit; signals cool_clk, wakeup_out]
hot_clk: clock that always toggles
cool_clk: gated clock; sometimes toggles, sometimes stays low
wakeup: alerts circuit that valid data will be arriving soon
clk_en: turns on cool_clk
latency varies from 5 to 10 clock cycles, even distribution of latencies
contains a maximum of 6 instructions (parcels of data)
60% of incoming parcels are valid
average length of continuous sequence of valid parcels is 80
use input and output valid bits for wakeup
leakage current is negligible
short-circuit current is negligible
LUTs have a capacitance of 1, flops have a capacitance of 2
7.1 7.1.1
7.1.1.1
During manufacturing, faults can occur that make the physical product behave incorrectly.
Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.
[Figure: good wires, shorted wires, open wire]
7.1.1.2
Causes of Faults
Fabrication process (initial construction is bad): chemical mix, impurities, dust
Manufacturing process (damage during construction):
handling: probing, cutting, mounting
materials: corrosion, adhesion failure, cracking, peeling
7.1.1.3
Testing
Definition Testing: the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.
7.1.1.4
Burn In
Definition Burn-in: The process of subjecting chips to extreme conditions (high and low temps, high and low voltages, high and low clock speeds) before and during testing.
7.1.1.5
Bin Sorting
Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.
7.1.1.6 7.1.1.7
7.1.3
Physical Faults
7.1.3.1
Good Circuit
[Figures: good circuit with signals a, b, c, d, and faulty versions, including a short to GND]
7.1.3.2
Locations of Faults
Each segment of wire, poly, diffusion, via, etc is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.
7.1.3.3
[Figure: circuit with signals a to h and fault locations L1 to L5]
7.1.3.4
Two ways to name a fault location:
pin-fault model: Faults are modelled as occurring on input and output pins of gates.
net-fault model: Faults are modelled as occurring on segments of wires.
In E&CE 327, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.
7.1.4
Detecting a Fault
To detect a fault, we compare the actual output of the circuit against the expected value.
7.1.4.1 Fault?
Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?
[Figures: good circuit and faulty circuit, with inputs a, b, c and internal signals d, e]
Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.
a b c | good faulty
1 1 0 |  1     0
Another fault: the test vector 110 can catch both this fault and the previous one.
Note: Detect vs. diagnose. Testing detects faults. Testing does not diagnose which fault occurred.
7.1.5
Goal: develop a reliable and predictable technique for detecting faults in circuits.
Observations:
The possible faults in a circuit are dependent upon the physical layout of the circuit.
There is a very wide variety of possible faults.
A single test vector can catch many different faults.
Need: a mathematical model for faults that is abstracted from complexities of circuit layout and plethora of possible faults, yet still detects most or all possible faults.
7.1.5.1
Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence "single")
2. All faults are either:
(a) stuck-at 1: short to VDD
(b) stuck-at 0: short to GND
(hence "stuck at")
Question: If we consider all possible stuck-at faults, how many faulty circuits would we need to test for?
Question: If we consider only single-stuck-at faults, how many faulty circuits would we need to test for?
7.1.6.1
Algorithm
1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in region of disagreement is a test vector that will detect fault
5. any assignment outside of region of disagreement will result in same output on both correct and faulty circuit
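The region of disagreement can also be found by brute force over the truth table, which is what the Karnaugh-map comparison does graphically. The following is a minimal sketch, assuming the correct circuit ab + bc and the faulty circuit a + c (fault L2@1) that appear in the worked examples later in this chapter; the function name is hypothetical.

```python
from itertools import product


def detecting_vectors(good, faulty, num_inputs):
    """Return every input assignment on which the two circuits disagree."""
    return [
        vector
        for vector in product((0, 1), repeat=num_inputs)
        if good(*vector) != faulty(*vector)
    ]


def good(a, b, c):
    return (a & b) | (b & c)  # correct circuit: ab + bc


def faulty(a, b, c):
    return a | c  # b stuck at 1 (L2@1): a + c


if __name__ == "__main__":
    for vector in detecting_vectors(good, faulty, 3):
        print("".join(str(bit) for bit in vector))
```

Any vector in the returned list is a test vector for the fault; any vector outside it produces the same output on both circuits.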
7.1.6.2
[Figures: good circuit and faulty circuit (signals a, b, c, d, e) with their Karnaugh maps]
7.1.7
Undetectable Faults
1. If a circuit is irredundant then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If not trying to find all of the faults in a circuit, then a fault that you aren't looking for can mask a fault that you are looking for.
7.1.7.1
Redundant Circuitry
Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.
Timing Hazards
Static hazard
Dynamic hazard
Timing hazards are often removed by adding redundant circuitry.
Redundant Circuitry
[Figure: irredundant circuit (signals a to g) annotated with signal transitions]
Glitch on g is caused because the AND gate for e turns off before f turns on.
Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates
[Figure: the circuit from the static-hazard example, annotated with signal transitions]
Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.
[Figure: circuit with inputs a, b, c and output z, fault locations L1 to L3, with K-map and difference from the correct circuit for fault L2@1]
7.2 7.2.1
[Figure: circuit for ab + bc with fault locations]
[Table: faults 1) L2@1, 2) L4@1, 3) L5@1, with columns eqn and K-map]
7.2.2
The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.
7.2.2.1
Fault Domination
fault      eqn     test vectors
1) L5@1    ab+c    101, 001
2) L6@1    1
(The K-map and "Diff w/ ckt" columns are figures.)
Definition dominates: f1 dominates f2 if any test vector that detects f1 will also detect f2. When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.
Question: To detect both L5@1 and L6@1, can we ignore one of the faults?
7.2.2.2
Fault Equivalence
fault      eqn
1) L1@1    b
2) L3@1    b
(The K-map and "Diff w/ ckt" columns are figures.)
Definition fault equivalence: f1 is equivalent to f2 if f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa. When choosing test vectors we can ignore one of the faults and just include the other.
7.2.2.3
Gate Collapsing
A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.
Definition Gate collapsing: The technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.
Sets of collapsible faults for common gates:
[Figure: AND and OR gates with equivalent stuck-at faults (@0, @1) marked on inputs and outputs]
7.2.2.4
Node Collapsing
Note: Node collapsing is relevant only for the pin-fault model.
7.2.2.5
When calculating the test vectors to detect a set of faults, apply the fault collapsing techniques of:
gate collapsing
node collapsing (if using pin-fault model)
general fault equivalence (intelligent collapsing)
fault domination
to reduce the number of faults that you must examine.
7.2.3
Fault Coverage
Definition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.
$\mathrm{FaultCoverage} = \dfrac{\mathrm{DetectedFaults}}{\mathrm{DetectableFaults}}$
Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.
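Fault coverage can be computed directly from a list of faults and a set of test vectors. This is a sketch only: it borrows the circuit ab + bc and the eight faulty equations from the worked example later in this chapter, and the function name and data layout are assumptions made for illustration.

```python
from itertools import product


def good(a, b, c):
    return (a & b) | (b & c)  # correct circuit: ab + bc


# Faulty-circuit equations from the worked example for ab + bc.
FAULTS = {
    "L2@1": lambda a, b, c: a | c,
    "L3@1": lambda a, b, c: b,
    "L4@1": lambda a, b, c: a | (b & c),
    "L5@1": lambda a, b, c: (a & b) | c,
    "L6@0": lambda a, b, c: b & c,
    "L7@0": lambda a, b, c: a & b,
    "L8@0": lambda a, b, c: 0,
    "L8@1": lambda a, b, c: 1,
}


def fault_coverage(test_vectors, faults=FAULTS):
    """DetectedFaults / DetectableFaults for the given test vectors."""
    all_vectors = list(product((0, 1), repeat=3))

    def detects(vectors, faulty):
        return any(good(*v) != faulty(*v) for v in vectors)

    detectable = [n for n, f in faults.items() if detects(all_vectors, f)]
    if not detectable:
        raise ValueError("no detectable faults in the fault list")
    detected = [n for n in detectable if detects(test_vectors, faults[n])]
    return len(detected) / len(detectable)


if __name__ == "__main__":
    print(fault_coverage([(1, 1, 0)]))
    print(fault_coverage([(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 0, 1)]))
```

Using the alternative definition mentioned above would only change the denominator from the detectable faults to the full fault list.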
override the normal datapath to send test vectors, rather than normal inputs, as inputs to flops
compare outputs of flops to expected result
[Figure: circuit for ab + bc with fault locations L1 to L8]
7.2.5.1
Gate Collapsing
[Table: gate, faults, kept fault]
For each set of equivalent faults, we will keep the fault shown in bold and eliminate the other faults. A good heuristic for choosing which fault to keep: keep the fault closest to the output. The closer a fault is to the output, the easier it is to analyze its behaviour, because the equation for the output will be simpler.
Intelligent Collapsing
1. delete faults that we previously decided could be ignored
2. by intelligent analysis of the circuit, find equivalent faults
[Figure: circuit with faults L1@0,1 through L5@0,1 marked]
7.2.5.2
fault      eqn
1) L2@1    a+c
2) L3@1    b
3) L4@1    a+bc
4) L5@1    ab+c
5) L6@0    bc
6) L7@0    ab
7) L8@0    0
8) L8@1    1
(The K-map and "Diff w/ ckt" columns are figures.)
Dominated faults:
7.2.5.3
Definition required test vector: A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.
fault      eqn
1) L3@1    b
2) L4@1    a+bc
3) L5@1    ab+c
4) L6@0    bc
5) L7@0    ab
(The K-map and "Diff w/ ckt" columns are figures.)
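Required test vectors can be found mechanically: a vector is required when some fault has exactly one detecting vector. The sketch below applies this to the five faults in the table above for the circuit ab + bc; the function name is hypothetical.

```python
from itertools import product


def good(a, b, c):
    return (a & b) | (b & c)  # correct circuit: ab + bc


# The five faults that remain in the table above.
FAULTS = {
    "L3@1": lambda a, b, c: b,
    "L4@1": lambda a, b, c: a | (b & c),
    "L5@1": lambda a, b, c: (a & b) | c,
    "L6@0": lambda a, b, c: b & c,
    "L7@0": lambda a, b, c: a & b,
}


def required_vectors(faults=FAULTS):
    """Map each required test vector to the faults that only it can detect."""
    required = {}
    for name, faulty in faults.items():
        tests = [v for v in product((0, 1), repeat=3) if good(*v) != faulty(*v)]
        if len(tests) == 1:
            required.setdefault(tests[0], []).append(name)
    return required


if __name__ == "__main__":
    for vector, names in required_vectors().items():
        print(vector, names)
```

Faults that are not covered by a required vector (here L4@1 and L5@1) still need at least one additional vector chosen from their detecting sets.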
fault      eqn
1) L4@1    a+bc
7.2.5.5
The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults. Build a table for which faults each test vector will detect.
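The table of which faults each vector detects can be built programmatically and then used to order the vectors. The sketch below uses a greedy rule (at each step, run the vector that detects the most not-yet-detected faults); the greedy rule beyond the first vector is an assumption for illustration, as are the function names. The circuit ab + bc and its faulty equations come from the worked example in this chapter.

```python
from itertools import product


def good(a, b, c):
    return (a & b) | (b & c)  # correct circuit: ab + bc


FAULTS = {
    "L2@1": lambda a, b, c: a | c,
    "L3@1": lambda a, b, c: b,
    "L4@1": lambda a, b, c: a | (b & c),
    "L5@1": lambda a, b, c: (a & b) | c,
    "L6@0": lambda a, b, c: b & c,
    "L7@0": lambda a, b, c: a & b,
    "L8@0": lambda a, b, c: 0,
    "L8@1": lambda a, b, c: 1,
}


def detected_by(vector):
    """Names of the faults that a single test vector detects."""
    return {n for n, faulty in FAULTS.items() if faulty(*vector) != good(*vector)}


def order_vectors(vectors):
    """Greedy order: most newly detected faults first (ties keep input order)."""
    remaining = list(vectors)
    covered = set()
    order = []
    while remaining:
        best = max(remaining, key=lambda v: len(detected_by(v) - covered))
        order.append(best)
        remaining.remove(best)
        covered |= detected_by(best)
    return order


if __name__ == "__main__":
    for vector in order_vectors([(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 0, 1)]):
        print(vector, sorted(detected_by(vector)))
```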
[Table: one row per fault (1) L1@0, 2) L1@1, ..., 14) L7@1, 15), 16)) and one column per test vector (110, 010, 011, 101), marking which faults each vector detects]
7.2.6
[Figure: circuit with inputs a, b, c, output z, and fault locations L1 to L5]
Assume that we are not trying to detect all faults.
L1 is viewed as not being at risk for faults, but L3 is at risk for faults.
Fault Hiding
[Figure: circuit with inputs a, b, c, output z, and fault locations L1 and L3]
Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0. In the presence of other faults, the set of test vectors to detect a fault will change.
fault(s)      eqn
L3@0          ab
L1@1,L3@0     b
(The K-map and "Diff w/ ckt" columns are figures.)
7.3
[Figures: normal circuit (signals data_in, zeta_in) and the same circuit with scan chain 0 and scan chain 1 (outputs scan_out0, scan_out1)]
7.3.2
Scan Chains
[Figure: normal circuit with scan chains; signals mode0, scan_in0, mode1, scan_in1, data_in, zeta_in]
7.3.2.1
[Figures: scan chain in Normal Mode and in Scan Mode; signals mode0, scan_in0, scan_out0, scan_out1]
7.3.2.2
Scan in Operation
[Figures and waveforms: two scan chains (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1, clk) stepping through a test. Phases: Unload Prev Result / Load Cur Test Vector (1 cycle per bit); Unload Cur Result / Load New Test Vector (1 cycle per bit)]
7.3.3
7.3.4
If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.
If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.
ScanLength = number of flip flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan = number of clock cycles to run test suite = NumVectors × (ScanLength + 1) + ScanLength
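The cycle count can be sketched as a small function. The formula is the one given above, NumVectors × (ScanLength + 1) + ScanLength; the function names and the small example values are illustrative only.

```python
def scan_test_cycles(scan_length, num_vectors):
    """Clock cycles to run a test suite when scan-in and scan-out overlap."""
    if scan_length < 0 or num_vectors < 1:
        raise ValueError("need scan_length >= 0 and num_vectors >= 1")
    return num_vectors * (scan_length + 1) + scan_length


def scan_test_seconds(scan_length, num_vectors, clock_hz):
    """Wall-clock time for the test suite at the given test clock frequency."""
    return scan_test_cycles(scan_length, num_vectors) / clock_hz


if __name__ == "__main__":
    n = 5
    print("one test:        ", scan_test_cycles(n, 1))   # equals 2n + 1
    print("three, overlap:  ", scan_test_cycles(n, 3))
    print("three, no overlap", 3 * scan_test_cycles(n, 1))
```

For a single vector the formula reduces to 2n + 1, and for three vectors the overlap saves 2n cycles compared with running three separate tests.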
7.3.4.1
An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.
7.4
Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).
The goal was to replace bed-of-nails style testing with a technique that would work for high-density PCBs (lots of small wires close together).
Now used to test both boards and chip internals.
Used both on boundaries (I/O pins) and internal flops.
1 optional signal (Scan Pin: TRST)
protocol to connect circuit under test to tester and other circuits
state machine to drive test circuitry on chip
Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports
JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones.
Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.
JTAG Structure
[Figures: high-level view and detailed view of the JTAG structure; labels include BSR, BSC, circuit under test, scan registers, TAP Controller, Instruction Decoder, IR, BR, IDCODE, TDI, TDO, TCK, TMS]
7.4.1
Scan Instructions
This is the set of required instructions; other instructions are optional.
EXTEST: Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE: Sample result data
PRELOAD: Load test vector
BYPASS: Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.
IDCODE: Output manufacturer and part number
7.5 7.5.1
[Figures: circuit under test with a test generator; signals i_data(0) to i_data(3), d(0) to d(3), o_data(1), o_data(2)]
7.5.1.1
[Figure: BIST block diagram: LFSR test generator (mode, test gen), circuit under test (i_data, d, o_data), signature analyzers 0 to 2 (ok(0) to ok(2))]
generates a pseudo-random set of test vectors
for n output bits, generates all vectors from 1 to $2^n - 1$ in a pseudo-random order
built with a linear-feedback shift register (shift-register portion is the input flops)
Test Generator
Signature Analyzer
checks that the output it is examining has the correct results for the complete set of tests that are run
only has a meaningful result at the end of the entire test sequence
built with a linear-feedback shift register
similar to a hash function or a lossy compression function
if there are no faults, the signature analyzer will definitely say ok (no false negatives)
if there is a fault, the signature analyzer might say ok or might say bad (false positives are possible)
design tradeoff: more accurate signature analyzers require more hardware
Result Checker
signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
the result checker looks at test vector inputs to detect the end of the test suite and outputs all ok if all signature analyzers report ok at that moment
implemented as an AND gate
number of flip-flops
external or internal XOR
feedback taps (coefficients)
external-input or self-contained
reset or set
[Figure: LFSR Example with reset, input i, and flops d0/q0, d1/q1, d2/q2]
Example LFSRs
[Figures: example LFSRs built from three flops (d0/q0, d1/q1, d2/q2), with reset or set inputs and with or without an external input i]
In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields. External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated as Galois fields.
7.5.1.3
Maximal-Length LFSR
Definition maximal-length linear feedback shift register: An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.

Definition pseudo random: The same elements in the same order every time, but the relationship between consecutive elements is apparently random.

Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.
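A maximal-length LFSR can be simulated in a few lines. The sketch below models an internal-XOR LFSR without an external input, under the assumption that flop k receives the output of flop k-1 XORed with the feedback from the last flop whenever the exponent k appears in the characteristic polynomial. The polynomial x^3 + x + 1 and the function names are chosen for illustration.

```python
def lfsr_step(state, taps, data_in=0):
    """One clock edge of an internal-XOR LFSR.

    state: tuple (q0, ..., q_{n-1}); taps: exponents k < n present in p(x).
    """
    feedback = state[-1]
    return tuple(
        (data_in if k == 0 else state[k - 1]) ^ (feedback if k in taps else 0)
        for k in range(len(state))
    )


def lfsr_cycle(seed, taps):
    """States visited from the seed until the LFSR returns to the seed."""
    states = [seed]
    state = lfsr_step(seed, taps)
    while state != seed and len(states) < 2 ** len(seed):
        states.append(state)
        state = lfsr_step(state, taps)
    return states


if __name__ == "__main__":
    # p(x) = x^3 + x + 1: three flops, taps at exponents 0 and 1.
    for state in lfsr_cycle((1, 1, 1), {0, 1}):
        print(state)
    print(lfsr_cycle((0, 0, 0), {0, 1}))
```

Starting from any non-zero seed, this LFSR visits all seven non-zero states before repeating. Starting from 000 it never leaves 000, because XORing zeros only produces zeros, which is why the flops are set rather than reset.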
[Figures: two maximal-length LFSRs with three flops (d0/q0, d1/q1, d2/q2) and a set input]
Question: Why do maximal-length LFSRs not generate the test vector 0...00?
7.5.2
Test Generator
[Figure: maximal-length LFSR with flops d0/q0, d1/q1, d2/q2 and a set input]
The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip flop is connected to the output of the previous flip flop. In normal mode, the input of each flip flop is connected to the environment.
[Figures: test generator flops with a mode-controlled multiplexor on each input (d0 to d2, q0 to q2, i_d(0) to i_d(2))]
7.5.3
Signature Analyzer
There are four things that change between different signature analyzers:
number of flops (more flops means more area and more accuracy)
choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
bubbles on input to AND gate for ok: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer.
Signature Analyzer
This circuit:
Two flops; most analyzers use more (the HP boards in the 1970s used 37 flops!)
Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
Also contains ok tester (AND gate).
Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of bubble on AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)
[Figure: two-flop signature analyzer with reset, input i, flops d0/q0 and d1/q1, and output ok]
Signature Analyzer
[Waveform: reset, clk, i, d0, q0, d1, q1 for the input sequence i6 ... i0]
7.5.4
Result Checker
The purpose of the result checker is to check the ok circuit at the end of the test sequence.
[Figure: result checker with inputs reset, q0, q1, q2, ok and output all_ok]
7.5.5
Galois Fields!
Two operations: + and ×
Two values: 0 and 1
Bit vectors and shift-registers are written as polynomials in terms of x.
Addition
+ represents XOR
expression   result
0+0          0
0+1          1
1+0          1
1+1          0
x+x          0
Multiplication
× represents concatenating shift registers
expression        result
$x^4 \times 1$     $x^4$
$x^2 \times x^3$   $x^5$
Example
Calculate $(x^3 + x^2 + 1) \times (x^2 + x)$:
$x^2 \times (x^3 + x^2 + 1) = x^5 + x^4 + x^2$
$x \times (x^3 + x^2 + 1) = x^4 + x^3 + x$
Sum: $x^5 + x^3 + x^2 + x$ (the two $x^4$ terms cancel)
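The same multiplication can be carried out on bit vectors. In this sketch a polynomial is stored as a Python integer whose bit k is the coefficient of x^k (an encoding chosen for illustration), addition is XOR, and multiplying by x is a left shift; the function names are hypothetical.

```python
def gf2_mul(a, b):
    """Multiply two polynomials over GF(2); bit k holds the x^k coefficient."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # add (XOR) the current shifted copy of a
        a <<= 1          # multiply a by x
        b >>= 1
    return result


def poly_str(p):
    """Render an encoded polynomial, highest exponent first."""
    terms = [k for k in range(p.bit_length() - 1, -1, -1) if (p >> k) & 1]
    names = {0: "1", 1: "x"}
    return " + ".join(names.get(k, f"x^{k}") for k in terms) or "0"


if __name__ == "__main__":
    a = 0b1101  # x^3 + x^2 + 1
    b = 0b0110  # x^2 + x
    print(poly_str(gf2_mul(a, b)))
```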
The maximum exponent denotes the number of flops.
The other exponents denote the flops that tap off of the feedback line from the last flop.
From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.
[Figures: example LFSRs (flops d0/q0 to d3/q3, reset, input i, positions x0 to x4) labelled with their characteristic polynomials: $p(x) = x^3$; $p(x) = x^3 + x$; $p(x) = x^3 + 1$; $p(x) = x^3 + x + 1$; $p(x) = x^3 + x^2 + x + 1$; $p(x) = x^4 + x^3 + x + 1$]
7.5.6.1
Circuit Multiplication
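[Figure: shift-register circuits for $x^2 + x$ and $x^3 + x^2 + 1$, and for their product]
$(x^2 + x) \times (x^3 + x^2 + 1) = x \times (x^3 + x^2 + 1) + x^2 \times (x^3 + x^2 + 1) = x^5 + x^3 + x^2 + x$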
7.5.8
Division
With rules for multiplication and addition, we can define division. A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:
$m = q \times p + r$, with the degree of r less than the degree of p
Long Division
In Galois fields, we do division just as with long division in elementary school.
Given: $m(x) = x^6 + x^4 + x^3$ and $p(x) = x^4 + x$
Calculate the quotient $q(x)$ and remainder $r(x)$ for $m(x) \div p(x)$:
Step 1: $x^2 \times p(x) = x^6 + x^3$; adding this to $m(x)$ leaves $x^4$
Step 2: $1 \times p(x) = x^4 + x$; adding this to $x^4$ leaves $x$
Quotient $q(x) = x^2 + 1$
Remainder $r(x) = x$
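Long division over GF(2) is short to express in code, because subtraction and addition are both XOR. The sketch below encodes a polynomial as an integer whose bit k is the coefficient of x^k; the encoding and the function name are assumptions made for illustration.

```python
def gf2_divmod(m, p):
    """Divide m(x) by p(x) over GF(2); return (quotient, remainder)."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient = 0
    degree_p = p.bit_length() - 1
    while m.bit_length() - 1 >= degree_p:
        shift = (m.bit_length() - 1) - degree_p
        quotient ^= 1 << shift  # next quotient term: x^shift
        m ^= p << shift         # subtract (XOR) x^shift * p(x)
    return quotient, m


if __name__ == "__main__":
    m = 0b1011000  # x^6 + x^4 + x^3
    p = 0b0010010  # x^4 + x
    q, r = gf2_divmod(m, p)
    print(bin(q), bin(r))
```

For the example above, the quotient encodes x^2 + 1 and the remainder encodes x, and multiplying the quotient by p(x) and adding the remainder gives back m(x).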
The sequence of output bits forms a quotient, q(x), of length n − l
The flops in the analyzer form a remainder, r(x), of length l
e(x) is the error polynomial
bits in the message that are flipped have a coefficient of 1 in e(x)
e(x) mod p(x) = 0
That is, e(x) must be a multiple of p(x). The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
Components
[Figure: BIST components: circuit with fault locations L1 to L8, test generator flops t0 to t2, signature analyzer flops r0 to r2]
CHAPTER 7. FAULT TESTING AND TESTABILITY
Question: Determine if L2@1 will be detected
Equation for correct circuit: ab + bc
Equation for faulty circuit: a + c
Output sequences for correct and faulty circuits:
t0 t1 t2 | correct faulty
 a  b  c |    z      z
 1  1  1 |    1      1
 1  1  0 |    1      1
 0  1  1 |    1      1
 1  0  0 |    0      1
 0  1  0 |    0      0
 0  0  1 |    0      1
 1  0  1 |    0      1
Signature analyzer sequence for correct circuit (initial values = 0; the last column of r0, r1, r2 is the remainder; rows marked "in" are unlabelled in the source table and are read here as the flop inputs):
z            1 1 1 0 0 0 0
r0 in        1 1 1 1 0 0 1
r0         0 1 1 1 1 0 0 1
r1 in        0 1 1 0 1 0 1
r1         0 0 1 1 0 1 0 1
r2 in        0 0 1 0 0 1 1
r2         0 0 0 1 0 0 1 1
Signature analyzer sequence for faulty circuit (initial values = 0):
z            1 1 1 1 0 1 1
r0 in        1 1 1 0 0 1 1
r0         0 1 1 1 0 0 1 1
r1 in        0 1 1 0 0 0 1
r1         0 0 1 1 0 0 0 1
r2 in        0 0 1 0 0 0 0
r2         0 0 0 1 0 0 0 0
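The whole self-test run can be replayed in software. This is a sketch under stated assumptions: the tap positions are not given in the text and are inferred here from the tabulated sequences (test generator x^3 + x^2 + 1 seeded with 111, signature analyzer x^3 + x^2 + x + 1 starting from 000), and all function names are hypothetical.

```python
def lfsr_step(state, taps, data_in=0):
    """One clock edge of an internal-XOR LFSR; taps are exponents of p(x)."""
    feedback = state[-1]
    return tuple(
        (data_in if k == 0 else state[k - 1]) ^ (feedback if k in taps else 0)
        for k in range(len(state))
    )


def run_bist(circuit, num_vectors=7):
    """Return (output sequence, final signature) for one self-test run."""
    generator = (1, 1, 1)  # t0, t1, t2 after set (assumed seed)
    analyzer = (0, 0, 0)   # r0, r1, r2 after reset
    outputs = []
    for _ in range(num_vectors):
        z = circuit(*generator)
        outputs.append(z)
        analyzer = lfsr_step(analyzer, {0, 1, 2}, data_in=z)  # inferred taps
        generator = lfsr_step(generator, {0, 2})              # inferred taps
    return outputs, analyzer


def good(a, b, c):
    return (a & b) | (b & c)  # correct circuit: ab + bc


def faulty(a, b, c):
    return a | c  # L2@1: a + c


if __name__ == "__main__":
    print("correct:", run_bist(good))
    print("faulty: ", run_bist(faulty))
```

The two runs end with different signatures, so under these assumptions L2@1 is detected; a fault goes unnoticed only when its signature happens to match the correct one.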
7.6
Scan
Scan          Self Test
less hardware          more hardware
slower          faster
well defined coverage          ill defined coverage
test vectors are easy to modify          test vectors are hard to modify
Chapter 8 Review
This chapter lists the major topics of the term. The Topics List section for each major area is meant to be relatively complete.
CHAPTER 8. REVIEW
8.1
timing analysis
8.2 8.2.1
simple syntax and semantics
things that you should know simply by having done the labs and project
behavioural semantics of VHDL
synthesis semantics of VHDL
synthesizable and unsynthesizable code
8.2.2
identify whether a particular signal will be the output of combinational circuitry or a flop
identify whether a particular process is combinational or clocked
legal, synthesizable, and good code
perform delta-cycle simulation of VHDL
perform RTL simulation of VHDL
identify whether two VHDL fragments have same behaviour
match VHDL code with waveforms
match VHDL code with hardware
choose the VHDL fragment that generates smaller or faster hardware
8.3 8.3.1
coding guidelines
generic FPGA hardware
area estimation
finite state machines: implicit, explicit-current, explicit-current+next
8.3.2
choose design guidelines to follow in different situations
estimate area to implement a circuit in an FPGA
calculate resource usage for a dataflow diagram
calculate performance data for a dataflow diagram
given an algorithm, design a dataflow diagram
given a dataflow diagram, design the datapath and finite state machine
optimize a dataflow diagram to improve performance or reduce resource usage
given a dataflow diagram, calculate the clock period that will result in the maximum performance
8.4 8.4.1
test cases
measuring coverage
time for verification
test benches
assertions
coverage monitors
relational specification
functional specification
boundary conditions / corner cases
8.4.2
choose first cases to test
identify corner cases
choose technique to detect bug (test case, assertion/test bench)
determine whether a code change will cause a bug
identify a test case and either assertion or test bench to catch a bug
Performance Topics
8.5.2
calculate performance / area tradeoffs
calculate performance / time tradeoffs
compare performance data between products
evaluate performance criteria
8.6 8.6.1
8.6.2
timing parameters for minimum clock period
timing parameters for hold constraint
find the critical path and assignment to exercise it
compute Elmore delay constant
compare accuracy of different timing models
determine if a storage device will work correctly
compute timing parameters of storage device
identify timing violation, suggest remedy
suggest design change to increase clock speed
8.7 8.7.1
8.7.2
predict effect of new fabrication process on power
predict effect of environment change (temp, supply voltage, etc) on power consumption
predict effect of design change on power consumption (capacitance, activity factor)
design data-encoding scheme for a circuit, predict effect on power consumption
design clock gating scheme for a circuit, predict effect on power consumption
assess validity of various power- or energy-consumption metrics
8.8 8.8.1
causes of faults
locations of faults
physical faults
single stuck-at fault model
testable / untestable fault
economics of testing
fault coverage
test vector generation
order test vectors to reduce test time
8.8.2
compute optimal amount of testing to maximize profits
compute coverage for a given set of test vectors
find test vectors to catch a set of faults, choose order to run test vectors
determine if a fault is detectable
choose an LFSR to use for BIST test generation
choose an LFSR to use for BIST signature analysis
determine if a given BIST will catch a given fault
determine probability that a given BIST technique will report that a faulty circuit is correct
determine if a given fault-testing scheme will detect a physical fault
match LFSR to characteristic polynomial
match BIST hardware to Galois mathematics
perform Galois field mathematics, compare to waveforms
8.9
S =
M =
Formulas II
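$P = \left(\tfrac{1}{2} \times A \times C_L \times V^2 \times F\right) + (\tau \times A \times V \times I_{Sh} \times F) + (V \times I_L)$
$q = 1.60218 \times 10^{-19}\,\mathrm{C}$
$k = 1.38066 \times 10^{-23}\,\mathrm{J/K}$
$F \propto \dfrac{(V - V_{Th})^2}{V}$
$I_L \propto e^{\frac{-q \times V_{Th}}{k \times T}}$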