
Hi, I’m Carrie Anne and welcome to CrashCourse Computer Science! As we’ve discussed throughout the series, computers have come a long way from mechanical devices capable of maybe one calculation per second, to CPUs running at kilohertz and megahertz speeds. The device you’re watching this video on right now is almost certainly running at gigahertz speeds - that’s billions of instructions executed every second. Which, trust me, is a lot of computation!

In the early days of electronic computing, processors were typically made faster by improving the switching time of the transistors inside the chip - the ones that make up all the logic gates, ALUs and other stuff we’ve talked about over the past few episodes. But just making transistors faster and more efficient only went so far, so processor designers have developed various techniques to boost performance, allowing not only simple instructions to run fast, but also much more sophisticated operations to be performed.

INTRO

Last episode, we created a small program for our CPU that allowed us to divide two numbers. We did this by doing many subtractions in a row... so, for example, 16 divided by 4 could be broken down into the smaller problem of 16 minus 4, minus 4, minus 4, minus 4. When we hit zero, or a negative number, we knew that we were done. But this approach gobbles up a lot of clock cycles, and isn’t particularly efficient.
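As a concrete illustration, here is a minimal Python sketch of that repeated-subtraction approach; the function name divide_by_subtraction is just for this example.

    def divide_by_subtraction(dividend, divisor):
        """Divide by repeatedly subtracting, as described above.

        Returns (quotient, remainder). Assumes positive inputs.
        """
        quotient = 0
        while dividend >= divisor:      # stop once we would go negative
            dividend -= divisor         # one subtraction per loop iteration
            quotient += 1
        return quotient, dividend

    # 16 / 4 takes four subtractions: 16 - 4 - 4 - 4 - 4 = 0
    print(divide_by_subtraction(16, 4))   # (4, 0)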

So most computer processors today have divide as one of the instructions that the ALU can perform in hardware.

Of course, this extra circuitry makes the ALU bigger and more complicated to design, but also more capable - a complexity-for-speed tradeoff that has been made many times in computing history. For instance, modern computer processors now have special circuits for things like graphics operations, decoding compressed video, and encrypting files - all of which are operations that would take many, many, many clock cycles to perform with standard operations.

You may have even heard of processors with MMX, 3DNow!, or SSE. These are processors with additional, fancy circuits that allow them to execute additional, fancy instructions - for things like gaming and encryption. These extensions to the instruction set have grown and grown over time, and once people have written programs to take advantage of them, it’s hard to remove them. So instruction sets tend to keep getting larger and larger, keeping all the old opcodes around for backwards compatibility.

The Intel 4004, the first truly integrated CPU, had 46 instructions - which was enough to build a fully functional computer. But a modern computer processor has thousands of different instructions, which utilize all sorts of clever and complex internal circuitry.

Now, high clock speeds and fancy instruction sets lead to another problem - getting data in and out of the CPU quickly enough. It’s like having a powerful steam locomotive, but no way to shovel in coal fast enough. In this case, the bottleneck is RAM.

RAM is typically a memory module that lies outside the CPU. This means that data has to be transmitted to and from RAM along sets of data wires, called a bus. This bus might only be a few centimeters long, and remember those electrical signals are traveling near the speed of light, but when you are operating at gigahertz speeds - where each clock cycle lasts just a billionth of a second - even this small delay starts to become problematic. It also takes time for RAM itself to look up the address, retrieve the data, and configure itself for output. So a “load from RAM” instruction might take dozens of clock cycles to complete, and during this time the processor is just sitting there, idly waiting for the data.

One solution is to put a little piece of RAM right on the CPU -- called a cache. There isn’t a lot of space on a processor’s chip, so most caches are just kilobytes or maybe megabytes in size, whereas RAM is usually gigabytes. Having a cache speeds things up in a clever way.

When the CPU requests a memory location from RAM, the RAM can transmit not just one single value, but a whole block of data. This takes only a little bit more time than transmitting a single value, but it allows this data block to be saved into the cache. This tends to be really useful because computer data is often arranged and processed sequentially. For example, let’s say the processor is totalling up daily sales for a restaurant. It starts by fetching the first transaction from RAM at memory location 100. The RAM, instead of sending back just that one value, sends a block of data, from memory location 100 through 200, which are then all copied into the cache.

Now, when the processor requests the next transaction to add to its running total, the value at address 101, the cache will say “Oh, I’ve already got that value right here, so I can give it to you right away!” And there’s no need to go all the way to RAM. Because the cache is so close to the processor, it can typically provide the data in a single clock cycle -- no waiting required. This speeds things up tremendously over having to go back and forth to RAM every single time. When data requested from RAM is already stored in the cache like this, it’s called a cache hit, and if the data requested isn’t in the cache, so you have to go to RAM, it’s called a cache miss.
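To make the hit-and-miss idea concrete, here is a toy Python sketch - not how real cache hardware is built. A dictionary stands in for the cache, and a miss pulls in a whole block of neighbouring addresses, so the next sequential read is a hit. The block size of 100 just mirrors the example above, and the contents of RAM are made up.

    # Toy model: RAM as a list, cache as a dict of {address: value}.
    RAM = [i * 10 for i in range(1024)]   # made-up contents
    CACHE = {}
    BLOCK_SIZE = 100                      # addresses copied in per RAM access

    def read(address):
        if address in CACHE:
            print(f"cache hit  @ {address}")
            return CACHE[address]
        print(f"cache miss @ {address} -> loading a whole block from RAM")
        start = (address // BLOCK_SIZE) * BLOCK_SIZE
        for a in range(start, min(start + BLOCK_SIZE, len(RAM))):
            CACHE[a] = RAM[a]             # copy the whole block into the cache
        return CACHE[address]

    read(100)   # miss: goes to RAM, caches the surrounding block
    read(101)   # hit: already in the cache, no trip to RAM needed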

The cache can also be used like a scratch space, storing intermediate values when performing a longer, or more complicated, calculation. Continuing our restaurant example, let’s say the processor has finished totalling up all of the sales for the day, and wants to store the result in memory address 150. Like before, instead of going all the way back to RAM to save that value, it can be stored in the cached copy, which is faster to save to, and also faster to access later if more calculations are needed.

But this introduces an interesting problem -- the cache’s copy of the data is now different from the real version stored in RAM. This mismatch has to be recorded, so that at some point everything can get synced up. For this purpose, the cache has a special flag for each block of memory it stores, called the dirty bit -- which might just be the best term computer scientists have ever invented.

Most often this synchronization happens when the cache is full, but a new block of memory is being requested by the processor. Before the cache erases the old block to free up space, it checks its dirty bit, and if it’s dirty, the old block of data is written back to RAM before loading in the new block.
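Continuing the same toy model, here is a hedged sketch of the dirty-bit idea: writes update only the cached copy and set a flag, and a dirty block is written back to RAM only when it gets evicted. The helper names write and evict are illustrative, not from the video.

    CACHE = {}    # {block_start: {"data": [...], "dirty": False}}

    def write(address, value, ram, block_size=100):
        start = (address // block_size) * block_size
        block = CACHE.setdefault(start, {"data": ram[start:start + block_size],
                                         "dirty": False})
        block["data"][address - start] = value   # update the cached copy only
        block["dirty"] = True                    # remember that RAM is now stale

    def evict(block_start, ram):
        block = CACHE.pop(block_start)
        if block["dirty"]:                        # write back only if modified
            ram[block_start:block_start + len(block["data"])] = block["data"]

    RAM = [0] * 1024
    write(150, 999, RAM)      # the daily total lives in the cache, RAM untouched
    evict(100, RAM)           # on eviction the dirty block is synced back to RAM
    print(RAM[150])           # 999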

Another trick to boost CPU performance is called instruction pipelining. Imagine you have to wash an entire hotel’s worth of sheets, but you’ve only got one washing machine and one dryer. One option is to do it all sequentially: put a batch of sheets in the washer and wait 30 minutes for it to finish. Then take the wet sheets out and put them in the dryer and wait another 30 minutes for that to finish. This allows you to do one batch of sheets every hour.

Side note: if you have a dryer that can dry a load of laundry in 30 minutes, please tell me the brand and model in the comments, because I’m living with 90 minute dry times, minimum. But, even with this magic clothes dryer, you can speed things up even more if you parallelize your operation.

As before, you start off putting one batch of sheets in the washer. You wait 30 minutes for it to finish. Then you take the wet sheets out and put them in the dryer. But this time, instead of just waiting 30 minutes for the dryer to finish, you simultaneously start another load in the washing machine. Now you’ve got both machines going at once. Wait 30 minutes, and one batch is now done, one batch is half done, and another is ready to go in. This effectively doubles your throughput. Processor designs can apply the same idea.

In episode 7, our example processor performed the fetch-decode-execute cycle sequentially and in a continuous loop: fetch-decode-execute, fetch-decode-execute, fetch-decode-execute, and so on. This meant our design required three clock cycles to execute one instruction. But each of these stages uses a different part of the CPU, meaning there is an opportunity to parallelize!

While one instruction is getting executed, the next instruction could be getting decoded, and the instruction beyond that fetched from memory. All of these separate processes can overlap so that all parts of the CPU are active at any given time. In this pipelined design, an instruction is executed every single clock cycle, which triples the throughput.
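Here is a rough Python sketch of that overlap, assuming a simple three-stage pipeline with no hazards; it only prints the schedule, and the instruction names are placeholders.

    instructions = ["LOAD A", "LOAD B", "ADD A B", "STORE C"]
    stages = ["fetch", "decode", "execute"]

    # In cycle c, instruction i is in stage (c - i): three instructions are
    # in flight at once, and one finishes every cycle after the ramp-up.
    for cycle in range(len(instructions) + len(stages) - 1):
        active = []
        for i, instr in enumerate(instructions):
            stage_index = cycle - i
            if 0 <= stage_index < len(stages):
                active.append(f"{instr}: {stages[stage_index]}")
        print(f"cycle {cycle}: " + " | ".join(active))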

But just like with caching, this can lead to some tricky problems. A big hazard is a dependency in the instructions. For example, you might fetch something that the currently executing instruction is just about to modify, which means you’ll end up with the old value in the pipeline. To compensate for this, pipelined processors have to look ahead for data dependencies, and if necessary, stall their pipelines to avoid problems.

High end processors, like those found in laptops and smartphones, go one step further and can dynamically reorder instructions with dependencies in order to minimize stalls and keep the pipeline moving, which is called out-of-order execution. As you might imagine, the circuits that figure this all out are incredibly complicated. Nonetheless, pipelining is tremendously effective and almost all processors implement it today.

Another big hazard is conditional jump instructions -- we talked about one example, a JUMP NEGATIVE, last episode. These instructions can change the execution flow of a program depending on a value. A simple pipelined processor will perform a long stall when it sees a jump instruction, waiting for the value to be finalized. Only once the jump outcome is known does the processor start refilling its pipeline. But this can produce long delays, so high-end processors have some tricks to deal with this problem too.

Imagine an upcoming jump instruction as a fork in a road - a branch. Advanced CPUs guess which way they are going to go, and start filling their pipeline with instructions based off that guess - a technique called speculative execution. When the jump instruction is finally resolved, if the CPU guessed correctly, then the pipeline is already full of the correct instructions and it can motor along without delay. However, if the CPU guessed wrong, it has to discard all its speculative results and perform a pipeline flush - sort of like when you miss a turn and have to do a u-turn to get back on route, and stop your GPS’s insistent shouting.

To minimize the effects of these flushes, CPU manufacturers have developed sophisticated ways to guess which way branches will go, called branch prediction. Instead of being a 50/50 guess, today’s processors can often guess with over 90% accuracy!
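The video doesn’t say how the guessing works; one classic textbook scheme (not necessarily what any particular CPU uses) is a two-bit saturating counter per branch, sketched here in Python. The idea is that a single surprise shouldn’t flip a well-established prediction.

    # One 2-bit saturating counter: 0-1 predict "not taken", 2-3 predict "taken".
    class TwoBitPredictor:
        def __init__(self):
            self.counter = 2          # start off weakly predicting "taken"

        def predict(self):
            return self.counter >= 2  # True means "guess the branch is taken"

        def update(self, taken):
            # Nudge the counter toward the actual outcome, clamped to 0..3.
            if taken:
                self.counter = min(3, self.counter + 1)
            else:
                self.counter = max(0, self.counter - 1)

    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False]   # a loop branch taken 9 times, then it exits
    correct = 0
    for taken in outcomes:
        correct += (p.predict() == taken)
        p.update(taken)
    print(f"{correct}/{len(outcomes)} predictions correct")   # 9/10 here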

In an ideal case, pipelining lets you complete one instruction every single clock cycle, but then superscalar processors came along, which can execute more than one instruction per clock cycle. During the execute phase, even in a pipelined design, whole areas of the processor might be totally idle.

For example, while executing an instruction that fetches a value from memory, the ALU is just going to be sitting there, not doing a thing. So why not fetch-and-decode several instructions at once, and whenever possible, execute instructions that require different parts of the CPU all at the same time!?

But we can take this one step further and add duplicate circuitry for popular instructions. For example, many processors will have four, eight or more identical ALUs, so they can execute many mathematical instructions all in parallel!
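A loose Python sketch of that issue logic, assuming the only constraint is how many matching execution units are free each cycle; a real scheduler would also have to check data dependencies, which this ignores.

    # Each cycle, issue pending instructions to any free, matching unit.
    units = {"ALU": 4, "LOAD/STORE": 1}              # four duplicated ALUs
    pending = [("ADD", "ALU"), ("SUB", "ALU"), ("MUL", "ALU"),
               ("LOAD", "LOAD/STORE"), ("ADD", "ALU"), ("LOAD", "LOAD/STORE")]

    cycle = 0
    while pending:
        free = dict(units)                            # units available this cycle
        issued, still_waiting = [], []
        for op, unit in pending:
            if free.get(unit, 0) > 0:                 # a matching unit is free
                free[unit] -= 1
                issued.append(op)
            else:
                still_waiting.append((op, unit))      # try again next cycle
        print(f"cycle {cycle}: issued {issued}")
        pending = still_waiting
        cycle += 1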

Ok, the techniques we’ve discussed so far primarily optimize the execution throughput of a single stream of instructions, but another way to increase performance is to run several streams of instructions at once with multi-core processors. You might have heard of dual core or quad core processors. This means there are multiple independent processing units inside of a single CPU chip.

In many ways, this is very much like having multiple separate CPUs, but because they’re tightly integrated, they can share some resources, like cache, allowing the cores to work together on shared computations.
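As a rough software analogy of running several instruction streams at once, this Python sketch splits a big sum across four worker processes so each can run on its own core. Real cores also share caches in hardware, which this doesn’t model; the data is made up.

    from concurrent.futures import ProcessPoolExecutor

    def partial_sum(chunk):
        # Each worker process (ideally on its own core) totals one slice.
        return sum(chunk)

    if __name__ == "__main__":
        sales = list(range(1_000_000))                 # made-up data
        chunks = [sales[i::4] for i in range(4)]       # four streams of work
        with ProcessPoolExecutor(max_workers=4) as pool:
            total = sum(pool.map(partial_sum, chunks))
        print(total)                                   # same answer, computed in parallel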

But, when more cores just isn’t enough, you can build computers with multiple independent CPUs! High end computers, like the servers streaming this video from YouTube’s datacenter, often need the extra horsepower to keep it silky smooth for the hundreds of people watching simultaneously.

Two- and four-processor configurations are the most common right now, but every now and again even that much processing power isn’t enough. So we humans get extra ambitious and build ourselves a supercomputer!

If you’re looking to do some really monster calculations - like simulating the formation of the universe - you’ll need some pretty serious compute power. A few extra processors in a desktop computer just isn’t going to cut it. You’re going to need a lot of processors. No.. no... even more than that. A lot more!

When this video was made, the world’s fastest computer was located in the National Supercomputing Center in Wuxi, China. The Sunway TaihuLight contains a brain-melting 40,960 CPUs, each with 256 cores! That’s over ten million cores in total... and each one of those cores runs at 1.45 gigahertz. In total, this machine can process 93 quadrillion -- that’s 93 million-billions -- floating point math operations per second, known as FLOPS.
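For a back-of-the-envelope check on those figures (the per-core and per-cycle numbers below are derived arithmetic, not stated in the video):

    cpus = 40_960
    cores_per_cpu = 256
    clock_hz = 1.45e9                 # 1.45 gigahertz
    total_flops = 93e15               # 93 quadrillion FLOPS

    cores = cpus * cores_per_cpu                            # 10,485,760 cores
    flops_per_core = total_flops / cores                    # ~8.9 billion per second
    flops_per_core_per_cycle = flops_per_core / clock_hz    # ~6 per clock cycle
    print(cores, round(flops_per_core / 1e9, 1), round(flops_per_core_per_cycle, 1))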

And trust me, that’s a lot of FLOPS!! No word on whether it can run Crysis at max settings, but I suspect it might.

So long story short, not only have computer processors gotten a lot faster over the years, but also a lot more sophisticated, employing all sorts of clever tricks to squeeze out more and more computation per clock cycle. Our job is to wield that incredible processing power to do cool and useful things. That’s the essence of programming, which we’ll start discussing next episode. See you next week.
