Advanced Computer Architecture: 563 L02.1 Fall 2011
Computer Architecture >> instruction sets. Computer architecture skill sets are different:
- Quantitative approach to design (5 quantitative principles of design)
- Solid interfaces that really work
- Technology tracking and anticipation
Review (continued)

Other fields often borrow ideas from architecture.

Quantitative Principles of Design:
1. Take Advantage of Parallelism
2. Principle of Locality
3. Focus on the Common Case
4. Amdahl's Law
5. The Processor Performance Equation

Define, quantify, and summarize relative performance; define and quantify relative cost, dependability, and power.

Culture of anticipating and exploiting advances in technology; culture of well-defined interfaces that are carefully implemented and thoroughly checked.
Amdahl's Law

A program's execution time on a uniprocessor is t, and x% of the execution has to be sequential. If we deploy n processors (each the same as the original uniprocessor), what is the best execution time on this multiprocessor system?

Best case: the sequential fraction still takes t * (x/100), while the rest is perfectly parallelized, so t_best = t * (x/100) + t * (1 - x/100) / n.
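The question above can be sketched in a few lines; the t = 100 s, x = 10%, n = 10 numbers are invented for illustration:

```python
def best_time(t, x_pct, n):
    """Best execution time on n processors when x% of the
    uniprocessor run must remain sequential (Amdahl's Law)."""
    seq = t * x_pct / 100.0        # fraction that cannot be sped up
    par = t * (1 - x_pct / 100.0)  # fraction perfectly divided over n
    return seq + par / n

# t = 100 s, 10% sequential, 10 processors:
# 10 + 90/10 = 19 s, a 5.26x speedup rather than 10x
print(best_time(100, 10, 10))  # 19.0
```

Note how the sequential 10% caps the speedup at 10x no matter how large n grows.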
Metrics beyond raw speed:
- Cost
- Execution time
- Reliability: resiliency to electrical noise and part failure; robustness to bad software and operator error
- Maintainability
- Compatibility
Cost of Processor

- Design cost (non-recurring engineering costs, NRE): dominated by engineer-years (~$200K per engineer-year)
- Cost of die: die area, die yield (maturity of manufacturing process, redundancy features), cost/size of wafers; die cost ~= f(die area^4) with no redundancy
- Cost of packaging: number of pins (signal + power/ground pins), power dissipation
- Cost of testing: built-in test features? logical complexity of design, choice of circuits (minimum clock rates, leakage currents, I/O drivers)

System-level costs: power supply and cooling, support chipset, off-chip SRAM/DRAM/ROM, off-chip peripherals.
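A tiny illustration of the slide's rule of thumb that die cost grows roughly as the fourth power of die area (with no redundancy); the function and the 2x area ratio are hypothetical, just to show the sensitivity:

```python
def relative_die_cost(area_ratio):
    """Relative cost of a die that is area_ratio times larger,
    under the slide's rough cost ~ area^4 rule of thumb."""
    return area_ratio ** 4

# Doubling die area => roughly 16x the cost under this rule
print(relative_die_cost(2.0))  # 16.0
```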
What is Performance?
- Latency (response time)
- Bandwidth (or throughput)

Definition: Performance

performance(X) = 1 / execution_time(X)

"X is n times faster than Y" means n = performance(X) / performance(Y) = execution_time(Y) / execution_time(X).
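A minimal sketch of the definition; the 10 s and 15 s execution times are made up:

```python
def performance(exec_time):
    """performance(X) = 1 / execution_time(X)"""
    return 1.0 / exec_time

time_x, time_y = 10.0, 15.0     # seconds (illustrative)
n = time_y / time_x             # = performance(X) / performance(Y)
print(n)  # 1.5 -> "X is 1.5 times faster than Y"
```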
Performance Guarantees
Types of Benchmark

- Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone, Whetstone
- Toy programs: 100-line programs from beginning programming assignments, e.g., N-queens, quicksort, Towers of Hanoi
- Kernels: small, key pieces of real applications, e.g., matrix multiply, FFT, sorting, Livermore Loops, Linpack
- Simplified applications: extract the main computational skeleton of a real application to simplify porting, e.g., NAS parallel benchmarks, TPC
- Real applications: things people actually use their computers for, e.g., car crash simulations, relational databases, Photoshop, Quake
Usually rely on benchmarks vs. real workloads. To increase predictability, collections of benchmark applications (benchmark suites) are popular.

SPEC CPU: popular desktop benchmark suite
- CPU only, split between integer and floating-point programs
- SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs
- SPEC CPU2006 to be announced Spring 2006
- SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks

Transaction Processing Council measures server performance and cost-performance for databases:
- TPC-C: complex queries for online transaction processing
- TPC-H: models ad hoc decision support
- TPC-W: a transactional web benchmark
- TPC-App: application server and web services benchmark
Summarizing Performance

            Rate (Task 1)   Rate (Task 2)
System A        10              20
System B        20              10

- Average throughput (arithmetic mean of rates): A = 15, B = 15 -- a tie.
- Throughput relative to B: A = (10/20 + 20/10)/2 = 1.25, B = 1.0 -- A looks 25% faster.
- Throughput relative to A: A = 1.0, B = (20/10 + 10/20)/2 = 1.25 -- B looks 25% faster.

The summary you get depends on which machine you normalize to.
Summarizing Performance over a Set of Benchmark Programs

Arithmetic mean of execution times t_i (in seconds): (1/n) * sum_i t_i

Harmonic mean of execution rates r_i: n / [sum_i (1/r_i)]

Both are equivalent to a workload where each program is run the same number of times. Weighting factors can be added to model other workload distributions.
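The two means above can be sketched directly; the per-program times are invented. The harmonic mean of the rates is exactly the reciprocal of the arithmetic mean of the times, which is the "same equivalent workload" claim:

```python
def arithmetic_mean(times):
    """Arithmetic mean of execution times t_i (seconds)."""
    return sum(times) / len(times)

def harmonic_mean(rates):
    """Harmonic mean of execution rates r_i: n / sum(1/r_i)."""
    return len(rates) / sum(1.0 / r for r in rates)

times = [2.0, 4.0, 8.0]           # illustrative per-program times
rates = [1.0 / t for t in times]  # corresponding rates

print(arithmetic_mean(times))     # mean time in seconds
print(harmonic_mean(rates))       # mean rate = 1 / (mean time)
```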
Normalized execution time (as in SPECratio): ratio = t_Ref / t_A, the reference machine's execution time divided by machine A's time.
Twelve ways to fool the masses (benchmarking abuses):
- Only report the pieces of the workload that work well on your design; ignore the others
- Use unrealistic data set sizes for the application (too big or too small)
- Report throughput numbers for a latency benchmark
- Report latency numbers for a throughput benchmark
- Report performance on a kernel and claim it represents an entire application
- Use 16-bit fixed-point arithmetic (because it's fastest on your system) even though the application requires 64-bit floating-point arithmetic
- Use a less efficient algorithm on the competing machine
- Report speedup for an inefficient algorithm (bubblesort)
- Compare hand-optimized assembly code with unoptimized C code
- Compare your design using next year's technology against a competitor's year-old design (1% performance improvement per week)
- Ignore the relative cost of the systems being compared
- Report averages and not individual results
- Report speedup over an unspecified base system, not absolute times
- Report efficiency, not absolute times
- Report MFLOPS, not absolute times (use an inefficient algorithm)

[David Bailey, "Twelve ways to fool the masses when giving performance results for parallel supercomputers"]
Variance in performance for parallel architectures is going to be much worse than for serial processors
SPECcpu means only really work across very similar machine configurations
Packaging costs: power has to be brought in and distributed around the chip -- hundreds of pins and multiple interconnect layers just for power.

Why power matters:
- Power supply rail design
- Chip and system cooling costs
- Noise immunity and system reliability
- Battery life (in portable systems)
- Environmental concerns: office equipment accounted for 5% of total US commercial energy usage in 1993; Energy Star compliant systems
Peak power:
- determines power/ground wiring design
- sets packaging limits
- impacts signal noise margin and reliability analysis

Energy: Joules = Watts * seconds. A lower energy number means less power is needed to perform the same computation at the same frequency.
[Figure: power vs. time curves for systems A and B -- integrate each power curve to get energy; Peak A is higher than Peak B.]

System A has higher peak power, but lower total energy. System B has lower peak power, but higher total energy.
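The figure's point can be reproduced numerically; the sampled power curves below are invented (A: 50 W peak for 2 s, B: 30 W peak for 4 s), integrated with the trapezoidal rule:

```python
def energy(power_samples, dt):
    """Joules = integral of watts over time (trapezoidal rule)."""
    total = 0.0
    for p0, p1 in zip(power_samples, power_samples[1:]):
        total += 0.5 * (p0 + p1) * dt
    return total

power_a = [50.0, 50.0, 50.0]              # 2 s at a 50 W peak
power_b = [30.0, 30.0, 30.0, 30.0, 30.0]  # 4 s at a 30 W peak

print(energy(power_a, 1.0))  # 100.0 J: higher peak, lower energy
print(energy(power_b, 1.0))  # 120.0 J: lower peak, higher energy
```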
A "typical" RISC ISA:
- 32-bit fixed-format instructions (3 formats)
- 32 32-bit GPRs (R0 contains zero)
- 3-address, reg-reg arithmetic instructions
- Single addressing mode for load/store: base + displacement (no indirection)

See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: MIPS instruction formats

Register-Register:
  31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-6: (unused) | 5-0: Opx

Register-Immediate:
  31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate

Branch:
  31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-0: immediate

Jump / Call:
  31-26: Op | 25-0: target
[Figure: 5-stage MIPS pipeline datapath -- Instruction Fetch, Instruction Decode/Register Fetch, Execute, Memory Access, Write Back -- with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Register transfers annotated on the figure: IF: IR <= mem[PC]; PC <= PC + 4. MEM: WB <= rslt. WB: Reg[IRrd] <= WB.]
Multicycle state machine (fill in the blanks):

Ifetch -> opFetch-DCD ->
  br:  if bop(A,B) then PC <= __
  jmp: PC <= __
  RR:  r <= A op_IRop B;   WB <= r;       Reg[IRrd] <= WB
  RI:  r <= A op_IRop __;  WB <= r;       Reg[IRrd] <= WB
  LD:  r <= A __ __;       WB <= Mem[r];  Reg[IRrd] <= WB
The same state machine, completed:

Ifetch -> opFetch-DCD ->
  br:  if bop(A,b) then PC <= PC + IRim
  jmp: PC <= IRjaddr
  RR:  r <= A op_IRop B;     WB <= r;       Reg[IRrd] <= WB
  RI:  r <= A op_IRop IRim;  WB <= r;       Reg[IRrd] <= WB
  LD:  r <= A + IRim;        WB <= Mem[r];  Reg[IRrd] <= WB
[Figure: the same 5-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB), repeated without the register-transfer annotations.]
Visualizing Pipelining
Figure A.2, Page A-8
[Figure A.2: four instructions, in program order, overlapped in the pipeline -- each proceeds Ifetch, Reg, ALU, DMem, Reg, advancing one stage per clock cycle.]
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
[Figure: a single memory port shared by instruction fetch and data access -- one instruction's DMem access collides with a later instruction's Ifetch in the same cycle, a structural hazard.]
[Figure: the structural hazard resolved by stalling -- a bubble is inserted so that only one instruction accesses memory in each cycle.]
Example: Machine A has dual-ported memory (Harvard architecture). Machine B has single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate. Ideal CPI = 1 for both, and loads are 40% of instructions executed. On B each load stalls one cycle, so CPI_B = 1 + 0.4 = 1.4, and speedup = (1.4 / 1.05) / 1 = 1.33: Machine A is 1.33 times faster.
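The claim can be checked with the CPI x cycle-time product (clock periods normalized to Machine A):

```python
# Dual-ported Machine A vs single-ported Machine B with a 1.05x faster
# clock; on B, loads (40% of instructions) stall one cycle each.
ideal_cpi = 1.0
load_frac = 0.40

cpi_a = ideal_cpi                  # no structural hazard on A
cpi_b = ideal_cpi + load_frac * 1  # one-cycle stall per load on B
cycle_a = 1.0                      # normalized clock period
cycle_b = 1.0 / 1.05               # B's clock is 1.05x faster

speedup = (cpi_b * cycle_b) / (cpi_a * cycle_a)
print(round(speedup, 2))  # 1.33 -> Machine A is 1.33 times faster
```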
Data Hazard on R1
Figure A.6, Page A-17
[Figure A.6: an instruction writes r1, followed by dependent instructions (..., or r8,r1,r9; xor r10,r1,r11) -- arrows show each read of r1 relative to the register write in WB.]

Among these 4 arrows, which are hazards and which are not?
Read After Write (RAW): Instr_J tries to read an operand before Instr_I writes it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Write After Read (WAR): Instr_J writes an operand before Instr_I reads it.
Called an "anti-dependence" by compiler writers; it results from reuse of the name r1. Can't happen in the MIPS 5-stage pipeline because:
- all instructions take 5 stages,
- reads are always in stage 2, and
- writes are always in stage 5
Write After Write (WAW): Instr_J writes an operand before Instr_I writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers; this also results from reuse of the name r1. Can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5, so writes occur in program order.
[Figure: forwarding (bypassing) -- results are fed from the EX/MEM and MEM/WB pipeline registers back to the ALU inputs of the dependent instructions (..., xor r10,r1,r11), avoiding the stalls.]
[Figure: the hardware change for forwarding -- muxes added at the ALU inputs select among the register file outputs and the EX/MEM and MEM/WR pipeline registers.]
[Figure: forwarding also covers memory operands -- the dependent sequence here (ending xor r10,r9,r11) gets its values in time through the bypass paths.]
[Figure: a load followed by a dependent instruction -- the loaded value is not available until after MEM, so forwarding alone cannot supply it to the instruction immediately after the load.]
[Figures: the load-use hazard forces a one-cycle bubble despite forwarding; compiler scheduling reorders the code (LW LW LW ADD LW SW SUB SW) so that loads are not immediately followed by the instructions that use them.]
If CPI = 1, 30% of instructions are branches, and each stalls 3 cycles, the new CPI = 1 + 0.3 * 3 = 1.9! Two-part solution:
- determine branch taken or not sooner, AND
- compute the taken-branch address earlier

MIPS solution:
- move the zero test to the ID/RF stage (use xor)
- add an adder to calculate the new PC in the ID/RF stage
- 1 clock cycle branch penalty instead of 3
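The CPI arithmetic above, written out (base CPI and branch statistics come from the slide):

```python
# Base CPI 1, 30% branches: a 3-cycle stall gives CPI 1.9;
# resolving branches in ID cuts the penalty to 1 cycle.
base_cpi, branch_frac = 1.0, 0.30

print(round(base_cpi + branch_frac * 3, 2))  # 1.9 with a 3-cycle stall
print(round(base_cpi + branch_frac * 1, 2))  # 1.3 with a 1-cycle penalty
```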
[Figure: revised pipelined datapath -- the Zero? test and a second adder for the branch target are moved into the ID/RF stage, so branches resolve one cycle after fetch.]
Predict not taken: execute successor instructions in sequence, and squash the instructions in the pipeline if the branch is actually taken (taking advantage of late pipeline state update).
- 47% of MIPS branches are not taken on average; PC+4 is already calculated, so use it to fetch the next instruction
- 53% of MIPS branches are taken on average, but the branch target address hasn't been calculated yet in MIPS
  - MIPS still incurs a 1-cycle branch penalty
  - other machines: branch target known before outcome
Delayed branch: define the branch to take place AFTER a following instruction:
  branch instruction
  sequential successor_1
  sequential successor_2
  ...
  sequential successor_n
  branch target if taken

A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline; MIPS uses this.
A is the best choice: it fills the delay slot and reduces instruction count (IC). In B, the sub instruction may need to be copied, increasing IC.
Delayed Branch
- Compilers fill about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- So about 50% (60% x 80%) of slots are usefully filled

Downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot. Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches; growth in available transistors has made dynamic approaches relatively cheaper.
Assume 4% unconditional branches, 6% conditional branches untaken, 10% conditional branches taken. Branch penalties (cycles) and resulting CPI:

Scheduling scheme     U   C-UT   C-T   CPI
Stall pipeline        2    3      3    1.56
Predict taken         2    3      2    1.46
Predict not taken     2    0      3    1.38
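The CPI column follows directly from the branch frequencies and penalties; recomputing it:

```python
# Branch frequencies and per-class penalties from the slide's table.
freqs = {"U": 0.04, "C-UT": 0.06, "C-T": 0.10}
penalties = {
    "Stall pipeline":    {"U": 2, "C-UT": 3, "C-T": 3},
    "Predict taken":     {"U": 2, "C-UT": 3, "C-T": 2},
    "Predict not taken": {"U": 2, "C-UT": 0, "C-T": 3},
}

for scheme, pen in penalties.items():
    cpi = 1.0 + sum(freqs[c] * pen[c] for c in freqs)
    print(f"{scheme}: CPI = {cpi:.2f}")
# Stall pipeline: 1.56, Predict taken: 1.46, Predict not taken: 1.38
```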
Example: a sound card interrupts when it needs more audio output samples (an audible click happens if it is left waiting).

Problem: the exception or interrupt must appear precisely between two instructions (I_i and I_i+1):
- the effect of all instructions up to and including I_i is totally complete
- no effect of any instruction after I_i can have taken place

The interrupt (exception) handler either aborts the program or restarts at instruction I_i+1.
Key observation: architected state only changes in the memory-access and register-write stages.
Summary:
- Difficult to compare widely differing machines on a benchmark suite
- Control via state machines and microprogramming
- Pipelining: just overlap tasks; easy if tasks are independent
- Speedup <= pipeline depth; if ideal CPI is 1, then:

  Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) * (Cycle Time_unpipelined / Cycle Time_pipelined)

- Hazards limit performance:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
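The speedup equation can be sketched with illustrative numbers (a 5-stage pipeline, 0.4 stall cycles per instruction, and equal unpipelined/pipelined cycle times are assumptions for the example):

```python
def pipeline_speedup(depth, stall_cpi, t_unpiped, t_piped):
    """Speedup = (depth / (1 + stall CPI)) * (T_unpipelined / T_pipelined)."""
    return (depth / (1.0 + stall_cpi)) * (t_unpiped / t_piped)

# 5 stages, 0.4 stall cycles per instruction, equal cycle times:
print(round(pipeline_speedup(5, 0.4, 1.0, 1.0), 2))  # 3.57, not the ideal 5
```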
Announcements: please check the Sakai page. Move class from 10/17 to 10/16?