Fundamentals of Computer Design
1.1 Introduction
The concept of the stored-program computer appeared in 1945, when John von
Neumann drafted the first version of EDVAC (Electronic Discrete Variable
Automatic Computer). Those ideas have been milestones of computer design ever since.
Four lines of evolution have emerged from the first computers (the definitions
are loose, and in many cases the borders between the different classes are
blurring).
For many years the evolution of computers was constrained by the problem of
object-code compatibility: a new architecture had to be, at least partly,
compatible with older ones, so that older programs ("the dusty deck") would
run without changes on the new machines. A dramatic example is the IBM PC
architecture: launched in 1981, it proved so successful that later
developments had to conform to the first release, despite the flaws that
became apparent within a couple of years.
What does it mean to say that one computer is faster than another? It depends
upon your point of view: if you are an end user, you say a computer is faster
when it runs your program in less time, and you think of the time from the
moment you launch the program until you get the results, the so-called
wall-clock time. On the other hand, if you are a system manager, you say a
computer is faster when it completes more jobs per unit of time.
As a user you are interested in reducing the response time (also called the
execution time or latency). The computer manager is more interested in
increasing the throughput (also called bandwidth), the number of jobs
done in a certain amount of time.
Answer:
A faster CPU decreases the response time and, at the same time, increases
the throughput.
a) both the response time and the throughput are increased;
b) several tasks can be processed at the same time, but none of them gets done
faster; hence only the throughput is improved.
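Case b) can be made concrete with a small sketch; the 2-second job time and the machine counts below are hypothetical, chosen only to illustrate the two points of view:

    #include <stdio.h>

    int main(void) {
        double job_time = 2.0;   /* assumed response time of one job, in seconds */

        /* one processor: jobs finish one after another */
        printf("1 CPU : response time %.1f s, throughput %.2f jobs/s\n",
               job_time, 1.0 / job_time);

        /* two processors working in parallel: each job still takes 2 s
           (response time unchanged), but twice as many jobs complete per
           second (throughput doubles) */
        printf("2 CPUs: response time %.1f s, throughput %.2f jobs/s\n",
               job_time, 2.0 / job_time);
        return 0;
    }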
You can compute the best-case access time for a hard disk, as well as the
worst-case access time; what happens in real life is that the completion time
(response time) of a disk request depends not only upon the hardware
characteristics of the disk (best/worst-case access time), but also upon other
factors, such as what the disk is doing at the moment you issue the request
and how long the queue of waiting tasks is.
Comparing Performance
The statement "machine A is n% faster than machine B" means:

\frac{\text{Execution time}_B}{\text{Execution time}_A} = 1 + \frac{n}{100}

or, equivalently, in terms of performance (the reciprocal of execution time):

\frac{\text{Performance}_A}{\text{Performance}_B} = 1 + \frac{n}{100}
Answer:
"Machine A is faster than machine B by n%" can be written as:

\frac{\text{Execution time}_B}{\text{Execution time}_A} = 1 + \frac{n}{100}

n = \frac{\text{Execution time}_B - \text{Execution time}_A}{\text{Execution time}_A} \times 100

n = \frac{6 - 5}{5} \times 100 = 20\%
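The same relation is easy to wrap in a small helper; in the sketch below the function name and the sample times are ours, not from the text:

    #include <stdio.h>

    /* "A is n% faster than B":  n = (T_B - T_A) / T_A * 100 */
    static double percent_faster(double time_a, double time_b) {
        return (time_b - time_a) / time_a * 100.0;
    }

    int main(void) {
        /* hypothetical times: the program takes 8 s on A and 10 s on B */
        printf("A is %.1f%% faster than B\n", percent_faster(8.0, 10.0));  /* 25.0 */
        return 0;
    }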
CPU Performance
The time the CPU spends running a program is the number of clock cycles the
program takes multiplied by the clock cycle time:

\text{CPU time} = \text{Clock\_cycles\_per\_program} \times T_{ck}

This is not the elapsed time: it does not make sense to compute the elapsed
time as a function of Tck, mainly because the elapsed time also includes the
I/O time, and the response time of I/O devices is not a function of Tck.
If we know the number of instructions that are executed from the start of the
program until the very end, let's call it the Instruction Count (IC), then we
can compute the average number of clock cycles per instruction (CPI) as
follows:

CPI = \frac{\text{Clock\_cycles\_per\_program}}{IC}
The goal of the designer is to lower the CPU time, which, combining the two
relations above, can be written as:

\text{CPU time} = IC \times CPI \times T_{ck}

The parameters that can be modified to achieve this are therefore the
instruction count IC, the clock cycles per instruction CPI, and the clock
cycle time Tck. These parameters are not independent of each other: when you
change one of them (because you want a lower CPU time) you must thoroughly
check how the change affects other parts of the system. For example, you may
consider a change in organization that lowers CPI; however, this may increase
Tck, thus offsetting the improvement in CPI.
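The trade-off can be illustrated with a short numeric sketch (all figures below are invented): an organizational change that lowers CPI by 20% does not pay off if it stretches the clock cycle by 30%.

    #include <stdio.h>

    int main(void) {
        double ic  = 200e6;   /* instruction count                    */
        double cpi = 2.0;     /* average clock cycles per instruction */
        double tck = 10e-9;   /* clock cycle time: 10 ns (100 MHz)    */

        /* CPUtime = IC * CPI * Tck */
        printf("before: %.2f s\n", ic * cpi * tck);    /* 4.00 s */

        /* a change that lowers CPI to 1.6 but stretches Tck to 13 ns */
        printf("after : %.2f s\n", ic * 1.6 * 13e-9);  /* 4.16 s */
        return 0;
    }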
A final remark: CPI has to be measured, not simply calculated from the
system's specification, because CPI strongly depends on the memory hierarchy
organization: a program running on a system without a cache will certainly
have a larger CPI than the same program running on the same machine with a
cache.
Architecture is the art and science of building. Vitruvius, in the 1st century
BC, wrote that a building should embody utilitas, firmitas and venustas, in
English commodity, firmness and delight. This definition recognizes that
architecture embraces functional, technological and aesthetic aspects.
Software compatibility
    Object code: frozen architecture; programs move easily from one machine to another without any investment.
    High-level language: the designer has maximum freedom; a substantial effort in software (compilers) is needed.

Standards
    Buses: VME, SCSI, IPI, etc.
    Floating point: IEEE 754, IBM, DEC.
    Operating systems: UNIX, DOS, Windows NT, OS/2, proprietary.
    Networks: Ethernet, FDDI, etc.
    Programming languages: the choice will influence the instruction set.
A new design must accommodate not only the circuits that are available now, at
design time (and that will eventually become obsolete), but also the circuits
that will be available when the product reaches the market.
A common concern, however, is to detect the common case and to compute the
performance gain obtained when this case is optimized. This is easy when you
have to design a system for a dedicated application: its behavior can be
observed and the possible optimizations applied. On the other hand, when you
design a general-purpose machine, you must be very cautious about improvements
aimed at what seems to be the common case: if they slow down other parts of
the machine, the system will run fine for applications that fit the common
case, but results will be poor in the other cases.
Amdahl's Law
Suppose you somehow enhance your machine to make it run faster; the speedup is
defined as:

\text{speedup} = \frac{T_{old}}{T_{new}}
where Told represents the execution time without the enhancement and Tnew the
execution time with the enhancement. When the enhancement can be used only for
a fraction of the running time, the overall speedup is given by Amdahl's law:
S_{overall} = \frac{1}{(1 - F_{enhanced}) + \frac{F_{enhanced}}{S_{enhanced}}}    (1)
where Soverall represents the overall speedup, Fenhanced is the fraction of
the time that the enhancement can be used (measured before the enhancement is
applied), and Senhanced is the speedup of the enhanced part alone.
(A figure here shows Told split into its unenhanced part, (1 - Fenhanced) * Told, and its enhanced part, together with the shorter Tnew that results.)
As can easily be seen, the running time after the enhancement is applied is:

T_{new} = (1 - F_{enhanced}) \times T_{old} + \frac{F_{enhanced} \times T_{old}}{S_{enhanced}}
Answer:
Fenhanced = 20% = 0.2
Senhanced = 10

S_{overall} = \frac{1}{(1 - 0.2) + \frac{0.2}{10}} = \frac{1}{0.82} \approx 1.22
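Relation (1) and the expression for Tnew can be checked against each other numerically; the sketch below uses the figures from the answer above and an assumed Told of 100 s (any value would do, since it cancels):

    #include <stdio.h>

    /* overall speedup according to relation (1) */
    static double amdahl(double f_enhanced, double s_enhanced) {
        return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced);
    }

    int main(void) {
        double f = 0.2, s = 10.0;
        double t_old = 100.0;                               /* assumed */

        double t_new = (1.0 - f) * t_old + f * t_old / s;   /* 80 + 2 = 82 s */
        printf("Tnew               = %.1f s\n", t_new);
        printf("Told/Tnew          = %.2f\n", t_old / t_new);
        printf("relation (1) gives = %.2f\n", amdahl(f, s));
        return 0;
    }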
It is important to stress that, in using Amdahl's law, you have to use the
fraction of time that can benefit from the enhancement, measured before the
enhancement is applied; it is a mistake to use relation (1) with a value of
Fenhanced measured after the enhancement has been in use.
The following example gives a relation for the overall speedup that uses the
fraction of time the enhancement represents out of the total, measured with
the enhancement in use.
Suppose you speed up all floating point operations by a factor of 10; the
enhanced floating point operations represent f = 20% of the running time,
measured with the enhancement in use. What is the overall speedup?
Answer:
(A figure here shows Tnew split into (1 - f) * Tnew and f * Tnew, and Told reconstructed by expanding the enhanced part to f * Tnew * Senhanced.)

T_{old} = (1 - f) \times T_{new} + f \times T_{new} \times S_{enhanced}

S_{overall} = \frac{T_{old}}{T_{new}} = (1 - f) + f \times S_{enhanced}

S_{overall} = (1 - 0.2) + 0.2 \times 10 = 2.8
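The two forms are consistent: if f is measured with the enhancement in use, the corresponding fraction measured before the enhancement is Fenhanced = f * Senhanced / ((1 - f) + f * Senhanced) (this conversion is our own derivation, not from the text), and relation (1) then yields the same overall speedup. A small numeric check with the figures above:

    #include <stdio.h>

    int main(void) {
        double f = 0.2, s = 10.0;

        /* overall speedup from the "measured after" relation */
        double s_after = (1.0 - f) + f * s;                        /* 2.8   */

        /* convert f into the fraction measured before the enhancement,
           then apply relation (1) */
        double f_before = f * s / ((1.0 - f) + f * s);             /* 0.714 */
        double s_before = 1.0 / ((1.0 - f_before) + f_before / s); /* 2.8   */

        printf("speedup (after-form)   = %.2f\n", s_after);
        printf("Fenhanced (converted)  = %.3f\n", f_before);
        printf("speedup (relation (1)) = %.2f\n", s_before);
        return 0;
    }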
Answer:
Senhanced = 2
Fenhanced = 75% = 0.75

S_{overall} = \frac{1}{(1 - 0.75) + \frac{0.75}{2}} = \frac{1}{0.625} = 1.6

Since the price increases by a factor of 1.8 while the performance increases
only by a factor of 1.6, the improvement is not worthwhile.
Locality of reference
The 90/10 rule of thumb says that a program spends 90% of its execution
time in only 10% of the code. There are two aspects of reference locality:
• spatial locality: items that are near to each other in memory tend
to be referenced near one to another in time; data structures and
arrays are good illustrations for spatial locality.
FIGURE 1.1 In a normal program, the number of references is not uniformly distributed over the whole address space. (The figure plots the number of references against memory addresses.)
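A classic illustration of both kinds of locality (a sketch; the array size is arbitrary) is the order in which a two-dimensional array is traversed. C stores arrays row by row, so the first loop below touches consecutive addresses (good spatial locality), while the second jumps a whole row ahead at every step; both loops re-execute the same few instructions over and over (temporal locality).

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* row-wise traversal: consecutive iterations touch neighbouring
           addresses, so spatial locality is good */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* column-wise traversal: consecutive accesses are N*sizeof(double)
           bytes apart, so spatial locality is poor */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }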
Exercises
1.1 The 4 Mbit DRAM chip was introduced in 1990. When do you think
the 64 Mbit DRAM chip will be available?
1.2 Suppose you have enhanced your machine with a floating point
coprocessor; all floating point operations are faster by a factor of 10 when
the coprocessor is in use:
b) you know that 40% of the run-time is spent in floating point operations
in the enhanced mode; you could buy new floating point hardware for a high
price (10 times the price of your current hardware) that doubles the floating
point performance, or you may consider an improvement in software (the
compiler). By how much should the percentage of floating point utilization
increase, compared with the present usage, for the increase in performance to
be the same as with the new hardware? Which investment is better? Don't
forget to state all your assumptions.
1.3 Compute the average CPI for a program that runs for 10 seconds
(without I/O) on a machine with a 50 MHz clock rate, if the number of
instructions executed in this time is 150 million.