
Hi, I’m Carrie Anne and welcome to CrashCourse Computer Science! As we’ve discussed throughout the series, computers have come a long way from mechanical devices capable of maybe one calculation per second, to CPUs running at kilohertz and megahertz speeds. The device you’re watching this video on right now is almost certainly running at gigahertz speeds - that’s billions of instructions executed every second. Which, trust me, is a lot of computation!

In the early days of electronic computing, processors were typically made faster by improving the switching time of the transistors inside the chip - the ones that make up all the logic gates, ALUs and other stuff we’ve talked about over the past few episodes. But just making transistors faster and more efficient only went so far, so processor designers have developed various techniques to boost performance, allowing not only simple instructions to run fast, but also much more sophisticated operations to be performed.

INTRO

Last episode, we created a small program for our CPU that allowed us to divide two numbers. We did this by doing many subtractions in a row... so, for example, 16 divided by 4 could be broken down into the smaller problem of 16 minus 4, minus 4, minus 4, minus 4. When we hit zero, or a negative number, we knew that we were done. But this approach gobbles up a lot of clock cycles, and isn’t particularly efficient.
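As a concrete illustration, here is a minimal Python sketch of that repeated-subtraction approach; the function name divide_by_subtraction is just for this example.

    def divide_by_subtraction(dividend, divisor):
        """Divide by repeatedly subtracting, as described above.

        Returns (quotient, remainder). Assumes positive inputs.
        """
        quotient = 0
        while dividend >= divisor:      # stop once we would go negative
            dividend -= divisor         # one subtraction per loop iteration
            quotient += 1
        return quotient, dividend

    # 16 / 4 takes four subtractions: 16 - 4 - 4 - 4 - 4 = 0
    print(divide_by_subtraction(16, 4))   # (4, 0)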

So most computer processors today have divide as one of the instructions that the ALU can perform in hardware.

Of course, this extra circuitry makes the ALU bigger and more complicated to design, but also more capable - a complexity-for-speed tradeoff that has been made many times in computing history. For instance, modern computer processors now have special circuits for things like graphics operations, decoding compressed video, and encrypting files - all of which are operations that would take many, many, many clock cycles to perform with standard operations.

You may have even heard of processors with MMX, 3DNow!, or SSE. These are processors with additional, fancy circuits that allow them to execute additional, fancy instructions - for things like gaming and encryption. These extensions to the instruction set have grown and grown over time, and once people have written programs to take advantage of them, it’s hard to remove them. So instruction sets tend to keep getting larger and larger, keeping all the old opcodes around for backwards compatibility.

The Intel 4004, the first truly integrated CPU, had 46 instructions - which was enough to build a fully functional computer. But a modern computer processor has thousands of different instructions, which utilize all sorts of clever and complex internal circuitry.

Now, high clock speeds and fancy instruction sets lead to another problem - getting data in and out of the CPU quickly enough. It’s like having a powerful steam locomotive, but no way to shovel in coal fast enough. In this case, the bottleneck is RAM.

RAM is typically a memory module that lies outside the CPU. This means that data has to be transmitted to and from RAM along sets of data wires, called a bus. This bus might only be a few centimeters long, and remember those electrical signals are traveling near the speed of light, but when you are operating at gigahertz speeds - where each clock cycle lasts just a billionth of a second - even this small delay starts to become problematic. It also takes time for RAM itself to look up the address, retrieve the data, and configure itself for output. So a “load from RAM” instruction might take dozens of clock cycles to complete, and during this time the processor is just sitting there, idly waiting for the data.

One solution is to put a little piece of RAM right on the CPU -- called a cache. There isn’t a lot of space on a processor’s chip, so most caches are just kilobytes or maybe megabytes in size, whereas RAM is usually gigabytes. Having a cache speeds things up in a clever way.

When the CPU requests a memory location from RAM, the RAM can transmit not just one single value, but a whole block of data. This takes only a little bit more time than transmitting a single value, but it allows this data block to be saved into the cache. This tends to be really useful because computer data is often arranged and processed sequentially. For example, let’s say the processor is totalling up daily sales for a restaurant. It starts by fetching the first transaction from RAM at memory location 100. The RAM, instead of sending back just that one value, sends a block of data, from memory location 100 through 200, which are then all copied into the cache.

Now, when the processor requests the next transaction to add to its running total, the value at address 101, the cache will say “Oh, I’ve already got that value right here, so I can give it to you right away!” And there’s no need to go all the way to RAM. Because the cache is so close to the processor, it can typically provide the data in a single clock cycle -- no waiting required. This speeds things up tremendously over having to go back and forth to RAM every single time. When data requested from RAM is already stored in the cache like this, it’s called a cache hit, and if the data requested isn’t in the cache, so you have to go to RAM, it’s called a cache miss.
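To make the hit-and-miss idea concrete, here is a toy Python sketch - not how real cache hardware is built. A dictionary stands in for the cache, and a miss pulls in a whole block of neighbouring addresses, so the next sequential read is a hit. The block size of 100 just mirrors the example above, and the contents of RAM are made up.

    # Toy model: RAM as a list, cache as a dict of {address: value}.
    RAM = [i * 10 for i in range(1024)]   # made-up contents
    CACHE = {}
    BLOCK_SIZE = 100                      # addresses copied in per RAM access

    def read(address):
        if address in CACHE:
            print(f"cache hit  @ {address}")
            return CACHE[address]
        print(f"cache miss @ {address} -> loading a whole block from RAM")
        start = (address // BLOCK_SIZE) * BLOCK_SIZE
        for a in range(start, min(start + BLOCK_SIZE, len(RAM))):
            CACHE[a] = RAM[a]             # copy the whole block into the cache
        return CACHE[address]

    read(100)   # miss: goes to RAM, caches the surrounding block
    read(101)   # hit: already in the cache, no trip to RAM needed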

The cache can also be used like a scratch space, storing intermediate values when performing a longer, or more complicated, calculation. Continuing our restaurant example, let’s say the processor has finished totalling up all of the sales for the day, and wants to store the result in memory address 150. Like before, instead of going all the way back to RAM to save that value, it can be stored in the cached copy, which is faster to save to, and also faster to access later if more calculations are needed.

But this introduces an interesting problem -- the cache’s copy of the data is now different from the real version stored in RAM. This mismatch has to be recorded, so that at some point everything can get synced up. For this purpose, the cache has a special flag for each block of memory it stores, called the dirty bit -- which might just be the best term computer scientists have ever invented.

Most often this synchronization happens when the cache is full, but a new block of memory is being requested by the processor. Before the cache erases the old block to free up space, it checks its dirty bit, and if it’s dirty, the old block of data is written back to RAM before loading in the new block.
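Continuing the same toy model, here is a hedged sketch of the dirty-bit idea: writes update only the cached copy and set a flag, and a dirty block is written back to RAM only when it gets evicted. The helper names write and evict are illustrative, not from the video.

    CACHE = {}    # {block_start: {"data": [...], "dirty": False}}

    def write(address, value, ram, block_size=100):
        start = (address // block_size) * block_size
        block = CACHE.setdefault(start, {"data": ram[start:start + block_size],
                                         "dirty": False})
        block["data"][address - start] = value   # update the cached copy only
        block["dirty"] = True                    # remember that RAM is now stale

    def evict(block_start, ram):
        block = CACHE.pop(block_start)
        if block["dirty"]:                        # write back only if modified
            ram[block_start:block_start + len(block["data"])] = block["data"]

    RAM = [0] * 1024
    write(150, 999, RAM)      # the daily total lives in the cache, RAM untouched
    evict(100, RAM)           # on eviction the dirty block is synced back to RAM
    print(RAM[150])           # 999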

Another trick to boost CPU performance is called instruction pipelining. Imagine you have to wash an entire hotel’s worth of sheets, but you’ve only got one washing machine and one dryer. One option is to do it all sequentially: put a batch of sheets in the washer and wait 30 minutes for it to finish. Then take the wet sheets out and put them in the dryer and wait another 30 minutes for that to finish. This allows you to do one batch of sheets every hour.

Side note: if you have a dryer that can dry a load of laundry in 30 minutes, please tell me the brand and model in the comments, because I’m living with 90 minute dry times, minimum. But, even with this magic clothes dryer, you can speed things up even more if you parallelize your operation.

As before, you start off putting one batch of sheets in the washer. You wait 30 minutes for it to finish. Then you take the wet sheets out and put them in the dryer. But this time, instead of just waiting 30 minutes for the dryer to finish, you simultaneously start another load in the washing machine. Now you’ve got both machines going at once. Wait 30 minutes, and one batch is now done, one batch is half done, and another is ready to go in. This effectively doubles your throughput. Processor designs can apply the same idea.

In episode 7, our example processor performed the fetch-decode-execute cycle sequentially and in a continuous loop: fetch-decode-execute, fetch-decode-execute, fetch-decode-execute, and so on. This meant our design required three clock cycles to execute one instruction. But each of these stages uses a different part of the CPU, meaning there is an opportunity to parallelize!

While one instruction is getting executed, the next instruction could be getting decoded, and the instruction beyond that fetched from memory. All of these separate processes can overlap so that all parts of the CPU are active at any given time. In this pipelined design, an instruction is executed every single clock cycle, which triples the throughput.
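Here is a rough Python sketch of that overlap, assuming a simple three-stage pipeline with no hazards; it only prints the schedule, and the instruction names are placeholders.

    instructions = ["LOAD A", "LOAD B", "ADD A B", "STORE C"]
    stages = ["fetch", "decode", "execute"]

    # In cycle c, instruction i is in stage (c - i): three instructions are
    # in flight at once, and one finishes every cycle after the ramp-up.
    for cycle in range(len(instructions) + len(stages) - 1):
        active = []
        for i, instr in enumerate(instructions):
            stage_index = cycle - i
            if 0 <= stage_index < len(stages):
                active.append(f"{instr}: {stages[stage_index]}")
        print(f"cycle {cycle}: " + " | ".join(active))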

But just like with caching, this can lead to some tricky problems. A big hazard is a dependency in the instructions. For example, you might fetch something that the currently executing instruction is just about to modify, which means you’ll end up with the old value in the pipeline. To compensate for this, pipelined processors have to look ahead for data dependencies, and if necessary, stall their pipelines to avoid problems.

High end processors, like those found in laptops and smartphones, go one step further and can dynamically reorder instructions with dependencies in order to minimize stalls and keep the pipeline moving, which is called out-of-order execution. As you might imagine, the circuits that figure this all out are incredibly complicated. Nonetheless, pipelining is tremendously effective and almost all processors implement it today.

Another big hazard is conditional jump instructions -- we talked about one example, a JUMP NEGATIVE, last episode. These instructions can change the execution flow of a program depending on a value. A simple pipelined processor will perform a long stall when it sees a jump instruction, waiting for the value to be finalized. Only once the jump outcome is known does the processor start refilling its pipeline. But this can produce long delays, so high-end processors have some tricks to deal with this problem too.

Imagine an upcoming jump instruction as a fork in a road - a branch. Advanced CPUs guess which way they are going to go, and start filling their pipeline with instructions based off that guess - a technique called speculative execution. When the jump instruction is finally resolved, if the CPU guessed correctly, then the pipeline is already full of the correct instructions and it can motor along without delay. However, if the CPU guessed wrong, it has to discard all its speculative results and perform a pipeline flush - sort of like when you miss a turn and have to do a u-turn to get back on route, and stop your GPS’s insistent shouting.

To minimize the effects of these flushes, CPU manufacturers have developed sophisticated ways to guess which way branches will go, called branch prediction. Instead of being a 50/50 guess, today’s processors can often guess with over 90% accuracy!
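The video doesn’t say how the guessing works; one classic textbook scheme (not necessarily what any particular CPU uses) is a two-bit saturating counter per branch, sketched here in Python. The idea is that a single surprise shouldn’t flip a well-established prediction.

    # One 2-bit saturating counter: 0-1 predict "not taken", 2-3 predict "taken".
    class TwoBitPredictor:
        def __init__(self):
            self.counter = 2          # start off weakly predicting "taken"

        def predict(self):
            return self.counter >= 2  # True means "guess the branch is taken"

        def update(self, taken):
            # Nudge the counter toward the actual outcome, clamped to 0..3.
            if taken:
                self.counter = min(3, self.counter + 1)
            else:
                self.counter = max(0, self.counter - 1)

    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False]   # a loop branch taken 9 times, then it exits
    correct = 0
    for taken in outcomes:
        correct += (p.predict() == taken)
        p.update(taken)
    print(f"{correct}/{len(outcomes)} predictions correct")   # 9/10 here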

In an ideal case, pipelining lets you complete one instruction every single clock cycle, but then superscalar processors came along, which can execute more than one instruction per clock cycle. During the execute phase, even in a pipelined design, whole areas of the processor might be totally idle.

For example, while executing an instruction that fetches a value from memory, the ALU is just going to be sitting there, not doing a thing. So why not fetch-and-decode several instructions at once, and whenever possible, execute instructions that require different parts of the CPU all at the same time!?

But we can take this one step further and add duplicate circuitry for popular instructions. For example, many processors will have four, eight or more identical ALUs, so they can execute many mathematical instructions all in parallel!
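A loose Python sketch of that issue logic, assuming the only constraint is how many matching execution units are free each cycle; a real scheduler would also have to check data dependencies, which this ignores.

    # Each cycle, issue pending instructions to any free, matching unit.
    units = {"ALU": 4, "LOAD/STORE": 1}              # four duplicated ALUs
    pending = [("ADD", "ALU"), ("SUB", "ALU"), ("MUL", "ALU"),
               ("LOAD", "LOAD/STORE"), ("ADD", "ALU"), ("LOAD", "LOAD/STORE")]

    cycle = 0
    while pending:
        free = dict(units)                            # units available this cycle
        issued, still_waiting = [], []
        for op, unit in pending:
            if free.get(unit, 0) > 0:                 # a matching unit is free
                free[unit] -= 1
                issued.append(op)
            else:
                still_waiting.append((op, unit))      # try again next cycle
        print(f"cycle {cycle}: issued {issued}")
        pending = still_waiting
        cycle += 1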

Ok, the techniques we’ve discussed so far primarily optimize the execution throughput of a single stream of instructions, but another way to increase performance is to run several streams of instructions at once with multi-core processors. You might have heard of dual core or quad core processors. This means there are multiple independent processing units inside of a single CPU chip.

In many ways, this is very much like having multiple separate CPUs, but because they’re tightly integrated, they can share some resources, like cache, allowing the cores to work together on shared computations.
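As a rough software analogy of running several instruction streams at once, this Python sketch splits a big sum across four worker processes so each can run on its own core. Real cores also share caches in hardware, which this doesn’t model; the data is made up.

    from concurrent.futures import ProcessPoolExecutor

    def partial_sum(chunk):
        # Each worker process (ideally on its own core) totals one slice.
        return sum(chunk)

    if __name__ == "__main__":
        sales = list(range(1_000_000))                 # made-up data
        chunks = [sales[i::4] for i in range(4)]       # four streams of work
        with ProcessPoolExecutor(max_workers=4) as pool:
            total = sum(pool.map(partial_sum, chunks))
        print(total)                                   # same answer, computed in parallel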

But, when more cores just isn’t enough, you can build computers with multiple independent CPUs! High end computers, like the servers streaming this video from YouTube’s datacenter, often need the extra horsepower to keep it silky smooth for the hundreds of people watching simultaneously.

Two- and four-processor configurations are the most common right now, but every now and again even that much processing power isn’t enough. So we humans get extra ambitious and build ourselves a supercomputer!

If you’re looking to do some really monster calculations - like simulating the formation of the universe - you’ll need some pretty serious compute power. A few extra processors in a desktop computer just isn’t going to cut it. You’re going to need a lot of processors. No.. no... even more than that. A lot more!

When this video was made, the world’s fastest computer was located in the National Supercomputing Center in Wuxi, China. The Sunway TaihuLight contains a brain-melting 40,960 CPUs, each with 256 cores! That’s over ten million cores in total... and each one of those cores runs at 1.45 gigahertz. In total, this machine can process 93 quadrillion -- that’s 93 million-billions -- floating point math operations per second, known as FLOPS.
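For a back-of-the-envelope check on those figures (the per-core and per-cycle numbers below are derived arithmetic, not stated in the video):

    cpus = 40_960
    cores_per_cpu = 256
    clock_hz = 1.45e9                 # 1.45 gigahertz
    total_flops = 93e15               # 93 quadrillion FLOPS

    cores = cpus * cores_per_cpu                            # 10,485,760 cores
    flops_per_core = total_flops / cores                    # ~8.9 billion per second
    flops_per_core_per_cycle = flops_per_core / clock_hz    # ~6 per clock cycle
    print(cores, round(flops_per_core / 1e9, 1), round(flops_per_core_per_cycle, 1))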

And trust me, that’s a lot of FLOPS!! No word on whether it can run Crysis at max settings, but I suspect it might.

So long story short, not only have computer processors gotten a lot faster over the years, but also a lot more sophisticated, employing all sorts of clever tricks to squeeze out more and more computation per clock cycle. Our job is to wield that incredible processing power to do cool and useful things. That’s the essence of programming, which we’ll start discussing next episode. See you next week.
