1-Intro
© 2003 by Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
ECE 252 / CPS 220 Lecture Notes: Introduction
What is This Course All About?

State-of-the-art computer hardware design

Topics
• Uniprocessor architecture (i.e., microprocessors)
• Memory architecture
• I/O architecture
• Brief look at multithreading and multiprocessors

Fundamentals, current systems, and future systems
• Will read from the textbook, classic papers, and brand-new papers

Course Goals and Expectations

Course goals
+ Understand how current processors work
+ Understand how to evaluate and compare processors
+ Learn how to use a simulator to perform experiments
+ Learn research skills by performing a term project

Course expectations
• Will loosely follow the text
• Major emphasis on cutting-edge issues
• Students will read a list of research papers
• Term project
Grading

Grading breakdown
• Homework: 30%
• Midterm: 15%
• Project: 25%
• Final: 30%

Late policy
• Late homework:
  • less than 1 day late: 50% off
  • more than 1 day late: zero
• No late term project will be accepted. Period.

Conduct

Lectures
• Consult the course website for the tentative schedule

Academic misconduct
• University policy will be followed strictly
• Zero tolerance for cheating and/or plagiarism
Now, moving on to computer architecture ...
Architecture and Other Disciplines

(layered view, top to bottom:)
Application Software
Operating Systems, Compilers, Networking Software
Computer Architecture

Levels of Computer Architecture

architecture
• functional appearance to the immediate user
• opcodes, addressing modes, architected registers

implementation (microarchitecture)
• logical structure that performs the architecture
• pipelining, functional units, caches, physical registers
Moore’s Law

“performance doubles every 18 months”
• common interpretation of Moore’s Law, not its original intent
• wrong! “performance” doubles every ~2 years
• self-fulfilling prophecy (Moore’s Curve)

• 2X every 2 years = ~3% increase per month
• 3% per month is used to judge performance features
• if a feature adds 9 months to the schedule...
• ...it should add at least 30% to performance (1.03^9 ≈ 1.30 → 30%)
• Itanium: under Moore’s Curve in a big way

Q: what do (big bang–2001) and (2000–2003) have in common?
A: the same absolute increase in computing power

Evolution of Single-Chip Microprocessors

                   1971–1980    1981–1990    1991–2000    2010
Transistor Count   10K–100K     100K–1M      1M–100M      1B
Clock Frequency    0.2–2MHz     2–20MHz      20MHz–1GHz   10GHz
IPC                <0.1         0.1–0.9      0.9–2.0      10 (?)
MIPS/MFLOPS        <0.2         0.2–20       20–2,000     100,000

some perspective: from 1971 to 2001, performance improved 35,000X!!!
• what if cars had improved at this rate?
• 1971: 60 MPH / 10 MPG
• 2001: 2,100,000 MPH / 350,000 MPG
• but... what if cars crashed as often as computers do?
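The schedule arithmetic behind the 3%-per-month rule is easy to verify. A minimal sketch (the variable names are mine, not from the slides):

```python
# "2X every 2 years" implies a monthly growth factor of 2**(1/24),
# which is where the ~3%-per-month rule of thumb comes from.
monthly_growth = 2 ** (1 / 24)
print(f"monthly growth factor: {monthly_growth:.4f}")  # ~1.0293

# A feature that delays the schedule by 9 months must therefore add
# at least 1.03**9, i.e. ~30%, to performance just to break even.
break_even = 1.03 ** 9
print(f"9-month break-even speedup: {break_even:.2f}")  # ~1.30
```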
Performance Metrics

latency: response time, execution time
• good metric for a fixed amount of work (minimize time)

throughput: bandwidth, work per unit time, “performance”
• = (1 / latency) when there is NO overlap
• > (1 / latency) when there is overlap
• in real processors there is always overlap (e.g., pipelining)
• good metric for a fixed amount of time (maximize work)

comparing performance
• A is N times faster than B iff
  perf(A) / perf(B) = time(B) / time(A) = N
• A is X% faster than B iff
  perf(A) / perf(B) = time(B) / time(A) = 1 + X/100

Performance Metric I: MIPS

MIPS (millions of instructions per second)
• (instruction count / execution time in seconds) × 10^-6
– instruction count is not a reliable indicator of work
  • some optimizations add instructions
  • work per instruction varies (FP multiply >> register move)
  • instruction sets are not equal (3 Pentium instrs != 3 Alpha instrs)
– may vary inversely with actual performance

relative MIPS
• (time_reference / time_new) × MIPS_reference
+ a little better than native MIPS
– but very sensitive to the reference machine
• upshot: may be useful if same ISA, compiler, OS, and workload
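The definitions above fit in a few lines of code. A sketch with helper names of my own choosing and made-up example numbers:

```python
def mips(instr_count, time_s):
    """Native MIPS = (instruction count / execution time in seconds) x 10^-6."""
    return instr_count / time_s * 1e-6

def relative_mips(time_ref, time_new, mips_ref):
    """Relative MIPS: scale the reference machine's rating by the speedup."""
    return (time_ref / time_new) * mips_ref

def times_faster(time_a, time_b):
    """A is N times faster than B iff time(B) / time(A) = N."""
    return time_b / time_a

# Made-up numbers: a 1e9-instruction program takes 2 s on A, 4 s on B.
print(mips(1e9, 2.0))                  # 500.0 native MIPS on A
print(times_faster(2.0, 4.0))          # 2.0: A is 2 times faster than B
print(relative_mips(4.0, 2.0, 100.0))  # 200.0 relative MIPS
```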
CPU Performance Equation

instructions / program: dynamic instruction count
• mostly determined by program, compiler, and ISA

cycles / instruction: CPI
• mostly determined by ISA and CPU/memory organization

seconds / cycle: cycle time, clock time, 1 / clock frequency
• mostly determined by technology and CPU organization

uses of the CPU performance equation
• high-level performance comparisons
• back-of-the-envelope calculations
• helping architects think about compilers and technology

CPU Performance Comparison

famous example: the “RISC Wars” (RISC vs. CISC)
• assume
  • instructions / program: CISC = P, RISC = 2P
  • CPI: CISC = 8, RISC = 2
• CISC time = P × 8 × T = 8PT
• RISC time = 2P × 2 × T = 4PT
• RISC time = CISC time / 2

the truth is much, much, much more complex
• actual data from the IBM AS/400 (CISC -> RISC in 1995):
  • CISC (IMPI) time = P × 7 × T = 7PT
  • RISC (PPC) time = 3.1P × 3 × T/3.1 = 3PT (helped by +1 technology generation)
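The comparison above is just the performance equation applied twice. A sketch with P and T normalized to 1 (my normalization, not the slides'):

```python
def cpu_time(instructions, cpi, cycle_time):
    """CPU time = (instructions/program) x (cycles/instruction) x (seconds/cycle)."""
    return instructions * cpi * cycle_time

P, T = 1.0, 1.0  # normalize program size and cycle time

# "RISC Wars" back-of-the-envelope from the slide:
cisc = cpu_time(P, 8, T)       # 8PT
risc = cpu_time(2 * P, 2, T)   # 4PT
print(cisc / risc)             # 2.0: RISC twice as fast under these assumptions

# IBM AS/400 data (IMPI -> PowerPC, 1995); the RISC side also gained a
# technology generation, modeled here as cycle time T/3.1:
impi = cpu_time(P, 7, T)
ppc = cpu_time(3.1 * P, 3, T / 3.1)
print(impi / ppc)              # ~2.33
```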
• old CPI = 0.43 + 0.21 + (0.12 × 2) + (0.24 × 2) = 1.36
• new CPI = 0.43 + 0.21 + (0.12 × 1) + (0.24 × 2) = 1.24
• speedup = (P × 1.36 × T) / (P × 1.24 × 1.15T) = 0.95 (a net slowdown)

measuring CPI breakdown
• hardware event counters (Pentium Pro, Alpha DCPI)
• calculate CPI using instruction frequencies and event costs
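The arithmetic above can be reproduced directly:

```python
# CPI as a weighted sum of per-class frequency x cycle count,
# using the numbers from the example above.
old_cpi = 0.43 + 0.21 + (0.12 * 2) + (0.24 * 2)   # = 1.36
new_cpi = 0.43 + 0.21 + (0.12 * 1) + (0.24 * 2)   # = 1.24

# The change also stretched cycle time by 15%, so the "speedup" is:
speedup = (1.36 * 1.0) / (1.24 * 1.15)
print(f"{old_cpi:.2f} {new_cpi:.2f} {speedup:.2f}")  # 1.36 1.24 0.95
```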
Benchmarks and Benchmarking

“program” as the unit of work
• millions of them, many different kinds, which to use?

benchmarks
• standard programs for measuring and comparing performance
+ represent programs people care about
+ repeatable!!

benchmarking process
• define a workload
• extract benchmarks from the workload
• execute the benchmarks on candidate machines
• project performance on the new machine
• run the workload on the new machine and compare
• not close enough? -> repeat

Benchmarks: Instruction Mixes

instruction mix: instruction type frequencies
– ignores dependences
+ OK for a non-pipelined, scalar processor without caches
  • the way all processors used to be

example: the Gibson Mix, developed in the 1950s at IBM
• load/store: 31%, branches: 17%
• compare: 4%, shift: 4%, logical: 2%
• fixed add/sub: 6%, float add/sub: 7%
• float mult: 4%, float div: 2%, fixed mul: 1%, fixed div: <1%
• qualitatively, these numbers are still useful today!
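A mix alone yields a CPI estimate once a cycle cost is attached to each instruction class. A sketch in which only the frequencies come from the slide; the per-class cycle counts are invented purely for illustration:

```python
# Gibson Mix frequencies from the slide (the remaining ~21.5% is
# treated as "other" instructions).
mix = {
    "load/store": 0.31, "branch": 0.17, "compare": 0.04, "shift": 0.04,
    "logical": 0.02, "fixed add/sub": 0.06, "float add/sub": 0.07,
    "float mult": 0.04, "float div": 0.02, "fixed mul": 0.01,
    "fixed div": 0.005,
}

# HYPOTHETICAL per-class cycle counts (not from the slides), roughly in
# the spirit of a simple non-pipelined machine.
cycles = {
    "load/store": 2, "branch": 2, "compare": 1, "shift": 1, "logical": 1,
    "fixed add/sub": 1, "float add/sub": 4, "float mult": 6,
    "float div": 20, "fixed mul": 8, "fixed div": 20,
}

# CPI estimate = sum of frequency x cost; assume 1 cycle for "other".
other = 1.0 - sum(mix.values())
cpi = sum(mix[k] * cycles[k] for k in mix) + other * 1
print(f"estimated CPI: {cpi:.2f}")
```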
SPEC CPU2000

12 integer programs (C, C++)
• gcc (compiler), perl (interpreter), vortex (database)
• bzip2, gzip (replace compress), crafty (chess, replaces go)
• eon (rendering), gap (group-theoretic enumerations)
• twolf, vpr (FPGA place and route)
• parser (grammar checker), mcf (network optimization)

14 floating-point programs (C, FORTRAN)
• swim (shallow water model), mgrid (multigrid field solver)
• applu (partial diffeq’s), apsi (air pollution simulation)
• wupwise (quantum chromodynamics), mesa (OpenGL library)
• art (neural network image recognition), equake (wave propagation)
• fma3d (crash simulation), sixtrack (accelerator design)
• lucas (primality testing), galgel (fluid dynamics), ammp (chemistry)

Benchmarking Pitfalls

• benchmark properties mismatched with the features studied
  • e.g., using SPEC for large cache studies
• careless scaling
  • using only the first few million instructions (initialization phase)
  • reducing program data size
• choosing benchmarks from the wrong application space
  • e.g., in a realtime environment, choosing troff
  • others: SPECweb, TPC-W (amazon.com)
• using old benchmarks
• “benchmark specials”: benchmark-specific optimizations

benchmarks must be continuously maintained and updated!
GM Weirdness

what about averaging ratios (speedups)?
• AM and HM flip their verdict depending on which machine is the base

            machine A   machine B   speedup B/A   speedup A/B
Program1    1           10          10            0.1
Program2    1000        100         0.1           10

AM:  (10 + 0.1)/2 = 5.05            (0.1 + 10)/2 = 5.05
     “B is 5.05 times faster!”      “A is 5.05 times faster!”
HM:  2/(1/10 + 1/0.1) ≈ 0.198       2/(1/0.1 + 1/10) ≈ 0.198
     “A is ~5 times faster!”        “B is ~5 times faster!”
GM:  √(10 × 0.1) = 1                √(0.1 × 10) = 1
     “they are equal”               “they are equal”

– the geometric mean of ratios is not proportional to total time!
• by total execution time, B is 9.1 times faster (1001 vs. 110)
• GM says they are equal

Amdahl’s Law

“Validity of the Single-Processor Approach to Achieving Large-Scale
Computing Capabilities” – G. Amdahl, AFIPS, 1967

• let an optimization speed up fraction f of the program by factor s
• speedup = old / ([(1-f) × old] + [f/s × old]) = 1 / (1 - f + f/s)

• f = 95%, s = 1.1 → 1/[(1-0.95) + (0.95/1.1)] = 1.094
• f = 5%, s = 10 → 1/[(1-0.05) + (0.05/10)] = 1.047
• f = 5%, s = ∞ → 1/[(1-0.05) + (0.05/∞)] = 1.052
• f = 95%, s = ∞ → 1/[(1-0.95) + (0.95/∞)] = 20

make the common case fast, but...
...the uncommon case eventually limits performance
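Both slides are easy to check numerically. A sketch using Python's statistics module; the helper `amdahl` is my name for the formula above:

```python
from statistics import geometric_mean, harmonic_mean, mean

# Speedup ratios from the GM Weirdness table, in both directions.
b_over_a = [10, 0.1]   # Program1, Program2
a_over_b = [0.1, 10]

# AM and HM each flip their verdict depending on the base machine:
print(mean(b_over_a), mean(a_over_b))                    # 5.05 5.05
print(harmonic_mean(b_over_a), harmonic_mean(a_over_b))  # both ~0.198

# GM is base-independent, but it calls the two machines equal:
print(geometric_mean(b_over_a))                          # ~1.0

# Total execution time tells the real story: B is ~9.1x faster.
print((1 + 1000) / (10 + 100))                           # 9.1

def amdahl(f, s):
    """Amdahl's Law: speedup when fraction f is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

print(amdahl(0.95, 1.1))  # ~1.09: 95% sped up by only 1.1x barely helps
print(amdahl(0.05, 10))   # ~1.05: 5% sped up by 10x barely helps either
```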
Tradeoffs

“Bandwidth problems can be solved with money. Latency problems are
harder, because the speed of light is fixed and you can’t bribe God.”
– someone famous (John Cocke?)

well...
• can convert some latency problems into bandwidth problems
• solve those with money
• the famous “bandwidth/latency tradeoff”

architecture is the art of making tradeoffs

Bursty Behavior

Q: to sustain 2 IPC, how many instructions should the processor be able to
• fetch per cycle?
• execute per cycle?
• complete per cycle?

A: NOT 2 (more than 2)
• dependences will cause stalls (under-utilization)
• if the desired performance is X, peak performance must be > X

programs don’t always obey “average” behavior
• can’t design a processor to handle only average behavior
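The sustained-versus-peak point can be illustrated with a toy model (entirely my own, not from the slides): if some fraction of cycles stall, peak width must exceed the target IPC.

```python
def sustained_ipc(width, stall_fraction):
    """Toy model: complete `width` instructions on useful cycles, 0 on stalls."""
    return width * (1 - stall_fraction)

target, stall = 2.0, 0.2   # want 2 IPC; assume 20% of cycles stall

# A machine whose peak width merely equals the target falls short...
print(sustained_ipc(2, stall))      # 1.6

# ...so peak width must exceed the target:
needed_width = target / (1 - stall)
print(needed_width)                 # 2.5
```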