Chapter 2 Questions and Notes
Performance Issues
Contents

PROBLEMS
    Review Question 2.1
    Problem 2.1
    Problem 2.2
    Problem 2.3
Chapter Notes
    Performance
    Technology
    Parallelism
METRICS
    Measuring Performance
    CPU Execution Time
        Clock cycles
    CPU Performance and Its Factors
    Instruction Performance
    Classic CPU Performance Equation
PROBLEMS

Review Question 2.1
List and briefly define some of the techniques used in contemporary processors to
increase speed.
1. Pipelining:
The execution of an instruction involves multiple stages of operation, including fetching
the instruction, decoding the opcode, fetching operands, performing a calculation, and so
on. Pipelining overlaps these stages across successive instructions, so the processor works
on several instructions at once, each in a different stage.
2. Branch prediction:
The processor looks ahead in the instruction code fetched from memory and predicts
which branches, or groups of instructions, are likely to be processed next.
3. Superscalar execution:
This is the ability to issue more than one instruction in every processor clock cycle.
4. Data flow analysis:
The processor analyzes which instructions depend on each other's results, or data, in order
to create an optimized schedule of instructions.
5. Speculative execution:
Using branch prediction and data flow analysis, some processors speculatively execute
instructions ahead of their actual appearance in the program execution, holding the results
in temporary locations.
Problem 2.1:
A benchmark program is run on a 40 MHz processor. The executed program consists of
100,000 instruction executions, with the following instruction mix and clock cycle count:
Determine the effective CPI, MIPS rate, and execution time for this program.
Solution:
To determine the effective CPI, MIPS rate, and execution time for this program, let's go
through each calculation step by step.
Given:
Processor clock rate = 40 MHz (40 million cycles per second)
Total instructions executed = 100,000
Instruction mix:
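The problem's table of per-type instruction counts and cycles per instruction was not
reproduced in these notes, so the calculation is sketched below in Python. The mix used is
an assumption (illustrative values standing in for the missing table); the formulas are the
standard ones: effective CPI = Σ(ICi × CPIi) / ΣICi, MIPS = f / (CPI × 10^6), and execution
time T = IC × CPI / f.

```python
# Effective CPI, MIPS rate, and execution time for an instruction mix.
# Formulas: CPI = sum(IC_i * CPI_i) / sum(IC_i)
#           MIPS = f / (CPI * 10**6)
#           T    = IC * CPI / f

CLOCK_RATE_HZ = 40e6          # 40 MHz processor, as given in the problem

# Illustrative instruction mix: (type, instruction count, cycles per instruction).
# NOTE: these counts and CPIs are assumptions standing in for the problem's table;
# substitute the actual table values before relying on the printed numbers.
mix = [
    ("integer arithmetic", 45_000, 1),
    ("data transfer",      32_000, 2),
    ("floating point",     15_000, 2),
    ("control transfer",    8_000, 2),
]

total_instructions = sum(count for _, count, _ in mix)
total_cycles = sum(count * cpi for _, count, cpi in mix)

effective_cpi = total_cycles / total_instructions
mips_rate = CLOCK_RATE_HZ / (effective_cpi * 1e6)
execution_time = total_instructions * effective_cpi / CLOCK_RATE_HZ

print(f"Effective CPI  = {effective_cpi:.2f}")
print(f"MIPS rate      = {mips_rate:.1f}")
print(f"Execution time = {execution_time * 1e3:.3f} ms")
```

With the illustrative mix above, the sketch gives an effective CPI of 1.55, a MIPS rate of
about 25.8, and an execution time of roughly 3.9 ms; rerun it with the actual table values to
obtain the answer for this problem.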
Problem 2.2:
Consider two different machines, with two different instruction sets, both of which have a
clock rate of 200 MHz. The following measurements are recorded on the two machines
running a given set of benchmark programs:
a. Determine the effective CPI, MIPS rate, and execution time for each machine.
b. Comment on the results.
Solution:
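The measurement tables for the two machines were not reproduced in these notes. The
Python sketch below shows the method; the instruction counts (in millions) and per-class
CPIs assigned to machines A and B are assumptions used only for illustration, so substitute
the values from the problem's table.

```python
# CPI, MIPS rate, and execution time for two machines sharing a 200 MHz clock.
# The per-machine mixes below are assumptions (the original table is not
# reproduced in these notes); replace them with the problem's values.

CLOCK_RATE_HZ = 200e6

machines = {
    # machine: list of (instruction count in millions, cycles per instruction)
    "A": [(8, 1), (4, 3), (2, 4), (4, 3)],
    "B": [(10, 1), (8, 2), (2, 4), (4, 3)],
}

for name, mix in machines.items():
    instructions = sum(ic for ic, _ in mix) * 1e6
    cycles = sum(ic * cpi for ic, cpi in mix) * 1e6
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    exec_time = instructions * cpi / CLOCK_RATE_HZ
    print(f"Machine {name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, "
          f"time = {exec_time:.3f} s")
```

Note that even with the assumed mixes the sketch reproduces the pattern discussed in
part (b): the machine with the higher MIPS rate is not necessarily the one with the shorter
execution time.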
Part (b): Comment on the Results
The results show something interesting: Machine B has a higher MIPS (Millions of
Instructions Per Second) rate than Machine A, meaning it seems to process more
instructions each second. However, Machine B actually takes longer to finish the program.
This happens because MIPS alone doesn’t tell the whole story. Just because a machine can
execute more instructions per second doesn’t mean it will finish a program faster. The time
it takes also depends on how many instructions the program requires in total and on the
average number of clock cycles each instruction needs (CPI), so a higher MIPS rate can be
outweighed by a larger instruction count.
Problem 2.3
Early examples of CISC and RISC design are the VAX 11/780 and the IBM RS/6000,
respectively. Using a typical benchmark program, the following machine characteristics
result:
The recorded measurements show that the VAX required 12 times longer than the IBM,
measured in CPU time.
a. What is the relative size of the instruction count of the machine code for this
benchmark program running on the two machines?
b. What are the CPI values for the two machines?
Solution:
Given data:
VAX 11/780:
Clock Frequency: 5 MHz
Performance: 1 MIPS
CPU Time: 12 times longer than IBM RS/6000
IBM RS/6000:
Clock Frequency: 25 MHz
Performance: 18 MIPS
CPU Time: x seconds (this is our baseline for comparison)
Since the VAX takes 12 times longer than the IBM, the CPU time for the VAX is 12x seconds.
Part (a): Relative Instruction Count
The instruction count for each machine can be estimated using the formula for MIPS:
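The working below reconstructs the step not reproduced in these notes, using
MIPS = instruction count / (execution time × 10^6), i.e. instruction count = MIPS × 10^6 × execution time:

Instruction count (VAX) = 1 × 10^6 × 12x = 12x × 10^6
Instruction count (IBM) = 18 × 10^6 × x = 18x × 10^6

Relative size = (12x × 10^6) / (18x × 10^6) = 12/18 = 2/3

So the VAX executes about two-thirds as many instructions as the IBM for the same
benchmark program.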
This comparison shows that even though the IBM RS/6000 needs more instructions to
complete the program, it is still much faster due to its higher MIPS rate and efficiency.
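Part (b): CPI Values
The CPI values follow from CPI = clock rate / (MIPS × 10^6), again reconstructing the
working that is not reproduced in these notes:

CPI (VAX) = (5 × 10^6) / (1 × 10^6) = 5
CPI (IBM) = (25 × 10^6) / (18 × 10^6) ≈ 1.39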
Final Answers
1. Relative Instruction Count: The VAX instruction count is two-thirds that of the IBM
for the same program.
2. CPI Values:
• VAX CPI = 5
• IBM CPI ≈ 1.39
Analysis of Results
1. Instruction Count:
The VAX has an instruction count that is two-thirds of the IBM's. This means that, for the
same program, the VAX executes fewer instructions than the IBM does. This could suggest
that the VAX's CISC (Complex Instruction Set Computing) design allows it to perform more
work per instruction, as each instruction may be more complex and able to accomplish
more.
In contrast, the IBM RS/6000’s RISC (Reduced Instruction Set Computing) design might rely
on simpler instructions, which means it requires more instructions to complete the same
task.
2. CPI (Cycles Per Instruction):
- The VAX has a higher CPI of 5. This indicates that, on average, each instruction on the
VAX requires more clock cycles to complete. This is typical of CISC architectures, where
instructions are complex and take multiple cycles to execute.
- The IBM has a much lower CPI of approximately 1.39. This is expected in RISC
architectures, where instructions are designed to be simple and executed in fewer cycles.
3. Clock Frequency and Execution Speed:
- The IBM RS/6000 has a significantly higher clock rate (25 MHz) compared to the VAX
(5 MHz). Combined with a lower CPI, this means the IBM can execute instructions much
faster than the VAX.
- As a result, even though the IBM requires more instructions to complete the same
program, it compensates for this with faster, simpler instructions and a higher clock rate,
leading to a shorter overall execution time.
4. Overall Performance:
- The IBM RS/6000 completes the program 12 times faster than the VAX. This highlights
the effectiveness of the RISC design in achieving high performance through a combination
of simple instructions, low CPI, and high clock speed.
- Despite the VAX’s advantage in needing fewer instructions (lower instruction count), its
higher CPI and lower clock rate make it slower overall.
In summary, this comparison underscores the fundamental design trade-offs between
CISC and RISC:
- CISC (VAX): Fewer, more complex instructions with a higher CPI.
- RISC (IBM): More, simpler instructions with a lower CPI and higher clock speed.
The results show that RISC architectures, like the IBM RS/6000, can achieve superior
performance by leveraging higher clock rates and simpler instructions, despite requiring
more instructions to complete the same task. This has historically been one reason for the
popularity of RISC designs in high-performance computing.
Chapter Notes
Performance
The most important measure of the performance of a computer is how quickly it can execute
programs. The speed with which a computer executes programs is affected by the following:
Technology
The technology to fabricate the electronic circuits for a processor on a single chip is a critical
factor in the speed of execution of machine instructions.
The speed of switching between the 0 and 1 states in logic circuits is largely determined
by the size of the transistors that implement the circuits. Smaller transistors switch
faster. Advances in fabrication technology over several decades have reduced transistor
sizes dramatically.
This has two advantages: Instructions can be executed faster, and more transistors can be
placed on a chip, leading to more logic functionality and more memory storage capacity.
Parallelism
Performance can be increased by performing a number of operations in parallel.
Parallelism can be implemented on many different levels.
• Instruction-level Parallelism
The simplest way to execute a sequence of instructions in a processor is to complete all
steps of the current instruction before starting the steps of the next instruction. If we
overlap the execution of the steps of successive instructions, total execution time will be
reduced.
For example, the next instruction could be fetched from memory at the same time that an
arithmetic operation is being performed on the register operands of the current instruction.
This form of parallelism is called pipelining.
• Multicore Processors
Multiple processing units can be fabricated on a single chip. In technical literature, the
term core is used for each of these processors. The term processor is then used for the
complete chip.
Hence, we have the terminology dual-core, quad-core, and octo-core processors for chips
that have two, four, and eight cores, respectively.
• Multiprocessors
Computer systems may contain many processors, each possibly containing multiple
cores. Such systems are called multiprocessors. These systems either execute a
number of different application tasks in parallel, or they execute subtasks of a single large
task in parallel. All processors usually have access to all of the memory in such systems,
and the term shared memory multiprocessor is often used to make this clear.
The high performance of these systems comes with much higher complexity and cost,
arising from the use of multiple processors and memory units, along with more complex
interconnection networks.
In contrast to multiprocessor systems, it is also possible to use an interconnected group of
complete computers to achieve high total computational power. The computers normally
have access only to their own memory units. When the tasks they are executing need to
share data, they do so by exchanging messages over a communication network. This
property distinguishes them from shared-memory multiprocessors, leading to the name
message passing multi-computers.
METRICS
Measuring Performance
Time is the measure of computer performance: the computer that performs the same
amount of work in the least time is the fastest. Program execution time is measured in
seconds per program.
However, time can be defined in different ways, depending on what we count. The most
straightforward definition of time is called wall clock time, response time, or elapsed
time. These terms mean the total time to complete a task, including disk accesses,
memory accesses, input/output (I/O) activities, operating system overhead—everything.
Computers are often shared, however, and a processor may work on several programs
simultaneously. In such cases, the system may try to optimize throughput rather than
attempt to minimize the elapsed time for one program. Hence, we often want to distinguish
between the elapsed time and the time over which the processor is working on our behalf.
CPU Execution Time
CPU execution time or simply CPU time, which recognizes this distinction, is the time the
CPU spends computing for this task and does not include time spent waiting for I/O or
running other programs. (Remember, though, that the response time experienced by the
user will be the elapsed time of the program, not the CPU time.) CPU time can be further
divided into two:
1. the CPU time spent in the program, called user CPU time,
and
2. the CPU time spent in the operating system performing tasks on behalf of the
program, called system CPU time.
Differentiating between system and user CPU time is difficult to do accurately, because it is
often hard to assign responsibility for operating system activities to one user program
rather than another and because of the functionality differences among operating systems.
For consistency, we maintain a distinction between performance based on elapsed time
and that based on CPU execution time. We will use the term system performance to refer to
elapsed time on an unloaded system and CPU performance to refer to user CPU time.
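To make the distinction concrete, here is a minimal Python sketch (not from the original
notes) that contrasts elapsed time with the CPU time actually consumed by a process,
using the standard library's time.perf_counter() and time.process_time():

```python
# Contrast elapsed (wall-clock) time with CPU time for this process.
# time.perf_counter() measures elapsed time; time.process_time() measures the
# user + system CPU time consumed by this process, excluding sleeps and waits.
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(1.0)                            # waiting: elapsed time grows, CPU time barely does
total = sum(i * i for i in range(10**6))   # computing: both clocks advance

wall_elapsed = time.perf_counter() - wall_start
cpu_used = time.process_time() - cpu_start

print(f"Elapsed (response) time: {wall_elapsed:.3f} s")
print(f"CPU time used:           {cpu_used:.3f} s")
```

While the program sleeps or waits for I/O, the elapsed time keeps growing but the CPU
time barely moves; during the computation both advance.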
Although as computer users we care about time, when we examine the details of a
computer it’s convenient to think about performance in other metrics. In particular,
computer designers may want to think about a computer by using a measure that relates to
how fast the hardware can perform basic functions.
Clock cycles
Almost all computers are constructed using a clock that determines when events take
place in the hardware. These discrete time intervals are called clock cycles (or ticks, clock
ticks, clock periods, clocks, cycles).
Designers refer to the length of a clock period both as the time for a complete clock cycle
(e.g., 250 picoseconds, or 250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which
is the inverse of the clock period. In the next subsection, we will formalize the relationship
between the clock cycles of the hardware designer and the seconds of the computer user.
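The two figures quoted above are the same quantity expressed both ways; a one-line check
(a sketch, not part of the original notes):

```python
# Clock rate is the reciprocal of the clock period: f = 1 / T.
clock_period_s = 250e-12             # 250 picoseconds
clock_rate_hz = 1 / clock_period_s   # 4e9 Hz = 4 GHz

print(f"{clock_rate_hz / 1e9:.0f} GHz")                # 4 GHz
print(f"{1 / clock_rate_hz * 1e12:.0f} ps per cycle")  # 250 ps
```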
1. Suppose we know that an application that uses both personal mobile devices and the
Cloud is limited by network performance. For the following changes, state whether only the
throughput improves, both response time and throughput improve, or neither improves.
a. An extra network channel is added between the PMD and the Cloud, increasing the
total network throughput and reducing the delay to obtain network access (since
there are now two channels).
b. The networking software is improved, thereby reducing the network communication
delay, but not increasing throughput.
c. More memory is added to the computer.
3. Computer C’s performance is 4 times as fast as the performance of computer B,
which runs a given application in 28 seconds. How long will computer C take to run that
application?
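Since performance is the reciprocal of execution time, a machine that is 4 times as fast
needs one quarter of the time, so computer C takes 28 s / 4 = 7 s.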
CPU Performance and Its Factors

A simple formula relates clock cycles and clock cycle time to CPU time:

CPU execution time for a program = CPU clock cycles for the program × Clock cycle time
                                 = CPU clock cycles for the program / Clock rate

This formula makes it clear that the hardware designer can improve performance by
reducing the number of clock cycles required for a program or the length of the clock cycle.
However, the designer often faces a trade-off between the number of clock cycles needed
for a program and the length of each cycle. Many techniques that decrease the number of
clock cycles may also increase the clock cycle time.
Instruction Performance
The performance equations above did not include any reference to the number of
instructions needed for the program. However, since the compiler clearly generated
instructions to execute, and the computer had to execute the instructions to run the
program, the execution time must depend on the number of instructions in a program.
One way to think about execution time is that it equals the number of instructions
executed multiplied by the average time per instruction. Therefore, the number of clock
cycles required for a program can be written as:

CPU clock cycles = Instructions for a program × Average clock cycles per instruction

The term clock cycles per instruction, which is the average number of clock cycles each
instruction takes to execute, is often abbreviated as CPI. Since different instructions may
take different amounts of time depending on what they do, CPI is an average of all the
instructions executed in the program. CPI provides one way of comparing two different
implementations of the same instruction set architecture, since the number of instructions
executed for a program will, of course, be the same.
Classic CPU Performance Equation

Combining the pieces above gives the classic CPU performance equation in terms of
instruction count, CPI, and clock cycle time:

CPU time = Instruction count × CPI × Clock cycle time
         = (Instruction count × CPI) / Clock rate

These formulas are particularly useful because they separate the three key factors that
affect performance. We can use these formulas to compare two different implementations
or to evaluate a design alternative if we know its impact on these three parameters. The
performance of a program depends on the algorithm, the language, the compiler, the
architecture, and the actual hardware.
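As a quick illustration (a sketch with made-up numbers, not figures from these notes), the
equation can be used to compare two design alternatives:

```python
# A minimal sketch of the classic CPU performance equation:
#   CPU time = instruction_count * CPI / clock_rate
# The two "designs" below use hypothetical numbers, chosen only to show how
# the three factors trade off against each other.

def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Classic CPU performance equation."""
    return instruction_count * cpi / clock_rate_hz

# Design X: lower CPI, slower clock (hypothetical values).
t_x = cpu_time(instruction_count=2e9, cpi=1.5, clock_rate_hz=2.5e9)
# Design Y: higher clock rate, but each instruction takes more cycles on average.
t_y = cpu_time(instruction_count=2e9, cpi=2.2, clock_rate_hz=3.0e9)

print(f"Design X: {t_x:.2f} s")   # 1.20 s
print(f"Design Y: {t_y:.2f} s")   # ~1.47 s: the faster clock does not win here
```

In this made-up comparison the design with the higher clock rate loses because its higher
CPI more than offsets the faster clock, which is exactly the trade-off described above.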
Clock Period
Clock cycle: Also called a tick, clock tick, clock period, clock, or cycle. The time for one clock
period, usually of the processor clock, which runs at a constant rate.
Clock period: The length of each clock cycle.
The following table summarizes how these components affect the factors in the CPU
performance equation.
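The table itself was not reproduced in these notes; reconstructed in outline (an assumption
based on the standard presentation of this material), it makes roughly the following points:
- Algorithm: affects the instruction count, and possibly the CPI.
- Programming language: affects the instruction count and the CPI.
- Compiler: affects the instruction count and the CPI.
- Instruction set architecture: affects the instruction count, the clock rate, and the CPI.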