Chapter 2

Chapter 2 discusses the evolution and optimization of computing performance, highlighting advancements in processor design, multicore architectures, and the use of GPUs for general-purpose computing. Key techniques for enhancing performance include pipelining, branch prediction, and speculative execution, while challenges such as memory bottlenecks and power consumption are also addressed. The chapter concludes with a comparative analysis of multicore CPUs, MICs, and GPGPUs, emphasizing the shift towards parallel processing for improved efficiency.

Chapter 2

Performance

DESIGNING FOR PERFORMANCE:


1. Continuous Improvement in Computing Power
• The cost of computer systems continues to decline while performance and capacity increase
significantly each year.
• Technological advancements enable the development of highly complex and powerful
applications that were once unimaginable.
2. Applications Requiring High Computing Power
• Several modern applications demand significant processing power, including:
o Image processing – Editing, filtering, and enhancing digital images.
o 3D rendering – Creating high-quality 3D models for gaming, animation, and
simulations.
o Speech recognition – Converting spoken language into text or executing voice
commands.
o Videoconferencing – Enabling real-time video and audio communication over the
internet.
o Multimedia authoring – Creating and editing interactive digital media, including
animations, videos, and presentations.
o Voice and video annotation – Adding voice or video comments to digital files for
enhanced communication and collaboration.
o Simulation modeling – Running complex simulations for research, engineering, and
artificial intelligence applications.
3. Key Factors Behind Performance Design
• Despite advances in computing, modern processors still rely on fundamental building blocks
similar to those in early computers.
• The key to maximizing performance lies in optimizing how resources are utilized and how
efficiently tasks are executed.
• Moore’s Law states that the number of transistors on a chip doubles approximately every
two years, allowing chipmakers to release new, more powerful processors frequently.
• Memory capacity has also expanded significantly, with DRAM capacities quadrupling every
three years.
• Processor performance has improved 4-5 times every three years due to enhanced
microarchitecture and shrinking component sizes.
4. Techniques for Performance Enhancement
To fully utilize processor power, designers implement several optimization techniques:
a) Pipelining
• The execution of an instruction is divided into multiple stages (e.g., fetch, decode, execute,
write-back).
• Instead of waiting for one instruction to complete before starting another, multiple
instructions are processed simultaneously in different stages.
• This approach is similar to an assembly line, where each stage of processing is handled
independently but concurrently.
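The assembly-line analogy can be made concrete with a back-of-the-envelope timing model (an illustrative sketch, not from the chapter): with k stages and cycle time t, an unpipelined processor needs n × k × t seconds to run n instructions, while an ideal pipeline needs only (k + n − 1) × t, since after the first instruction fills the pipeline, one instruction completes every cycle.

```python
def unpipelined_time(n, k, tau):
    # Each instruction passes through all k stages before the next one starts.
    return n * k * tau

def pipelined_time(n, k, tau):
    # The first instruction takes k cycles to fill the pipeline; each of the
    # remaining n - 1 instructions then completes one per cycle.
    return (k + n - 1) * tau

# Illustrative values: 1000 instructions, 4 stages, 2.5 ns cycle time.
n, k, tau = 1000, 4, 2.5e-9
print(unpipelined_time(n, k, tau))  # ≈ 1e-05 s
print(pipelined_time(n, k, tau))    # ≈ 2.5e-06 s, roughly a 4x speedup
```

For large n the speedup approaches the number of stages k, which is why deeper pipelines were long seen as an easy win (until hazards and branch penalties intervene).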
b) Branch Prediction
• The processor analyzes the instruction flow and predicts the most likely execution path.
• Prefetching the predicted instructions minimizes stalls and keeps the processor busy.
• Advanced processors predict multiple branches ahead, further improving efficiency.
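As an illustration (the chapter does not specify a predictor design), a classic 2-bit saturating-counter predictor mispredicts a well-behaved loop branch only once per loop exit, because a single not-taken outcome only moves it from "strongly taken" to "weakly taken":

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at 0 and 3 so one anomalous outcome cannot flip a strong state.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, not taken once at loop exit, then taken 8 more.
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "/", len(outcomes))  # 16 / 17 correct
```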
c) Superscalar Execution
• Modern processors can issue multiple instructions per clock cycle, increasing instruction
throughput.
• Multiple execution units allow simultaneous processing of independent instructions,
improving parallel execution.
d) Data Flow Analysis
• The processor determines which instructions depend on each other’s outputs and optimizes
their execution order.
• Instead of following the program order strictly, instructions are executed as soon as their
dependencies are resolved.

e) Speculative Execution
• The processor executes instructions before it is certain they are needed, storing results in
temporary locations.
• This prevents idle time and ensures the processor is utilized efficiently.
• If a predicted instruction path is incorrect, the results are discarded without affecting
execution.
5. Performance Balancing Challenges
• While processor speeds have increased significantly, other computer components (e.g.,
memory and I/O devices) have not kept pace.
• The most significant performance bottleneck is the interface between the processor and
main memory.
• If memory access is too slow, the processor must wait for data, resulting in wasted clock
cycles and reduced performance.
6. Solutions to Performance Bottlenecks
• Increasing Memory Access Efficiency
o Wider DRAM chips retrieve more bits at once, reducing memory access delays.
o Faster memory interfaces and improved bus architectures reduce transfer latency.
• Implementing Advanced Caching Techniques
o Caches store frequently accessed data closer to the processor, reducing the need for
slow main memory access.
o Multiple cache levels (L1, L2, L3) improve data retrieval efficiency.
• Enhancing Interconnect Bandwidth
o High-speed buses and hierarchical bus structures improve communication between
memory and the processor.
o Advanced interconnection techniques prevent data transfer bottlenecks.
• Improving I/O Device Management
o Peripheral devices (e.g., graphics cards, SSDs, and network interfaces) require
efficient data transfer mechanisms.
o Caching, buffering, and high-speed interconnects optimize I/O operations.
o Multiple-processor systems help distribute processing workloads and manage I/O-
intensive tasks more effectively.
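One standard way to quantify the benefit of caching (a textbook formula; the numbers below are illustrative, not from the chapter) is the average memory access time: AMAT = hit time + miss rate × miss penalty. Every access pays the hit time, and only misses pay the additional penalty of going to main memory.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time: all accesses pay the hit time;
    # the fraction that miss additionally pays the miss penalty.
    return hit_time + miss_rate * miss_penalty

# Illustrative values: 1 ns L1 hit time, 5% miss rate, 60 ns DRAM penalty.
print(amat(1.0, 0.05, 60.0))  # ≈ 4.0 ns
```

Even a 95% hit rate cuts the effective access time from 60 ns to about 4 ns, which is why processors dedicate so much chip area to cache.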
7. Improvements in Chip Organization & Architecture
To further improve performance, three key strategies are used:
a) Increasing Hardware Speed
• Shrinking logic gates on processor chips reduces signal propagation time, allowing faster
operation.
• Higher clock speeds enable faster execution of individual instructions.
b) Enhancing Cache Size & Speed
• Placing caches directly on the processor chip reduces access time.
• Modern processors dedicate over half of their chip area to cache memory.
• Improved cache efficiency significantly reduces reliance on slower main memory.
c) Optimizing Processor Architecture
• Parallelism is used to enhance instruction execution speed.
• Pipelining, superscalar execution, and out-of-order execution improve processing efficiency.
8. Challenges in Increasing Processor Speed
As clock speed and logic density increase, new challenges arise:
• Power Consumption
o Higher transistor density leads to increased heat dissipation.
o Efficient cooling solutions and power management strategies are necessary to
prevent overheating.
• RC Delay (Resistance-Capacitance Delay)
o The speed at which signals travel on a chip is limited by wire resistance and
capacitance.
o Miniaturization increases resistance and capacitance, reducing signal transmission
speed.
• Memory Latency & Throughput
o Memory speed improvements lag behind processor advancements, creating a
performance bottleneck.
o Efficient memory management techniques, such as prefetching and caching, help
mitigate this issue.

MULTICORE, MICS, AND GPGPUS


1. Multicore Processors
• Concept: Instead of using a single complex processor, manufacturers place multiple
processing units (cores) on the same chip.
• Advantages:
o Increases performance without increasing the clock rate.
o Consumes less power compared to a single complex processor.
o Allows efficient parallel processing if the software supports it.
• Evolution:
o Began with dual-core processors, followed by 4-core, 8-core, 16-core, and beyond.
o Led to multi-level cache systems (L1, L2, and L3) to enhance performance.

2. Many Integrated Cores (MICs)


• Concept: When the number of cores increases significantly (e.g., 50+ cores per chip), the
system is referred to as MIC architecture.
• Purpose:
o Designed to handle large-scale parallel computations.
o Aims to maximize performance for high-performance computing (HPC) applications.
• Challenge: Developing software that can effectively utilize such a large number of cores is
complex.

3. General-Purpose Computing on GPUs (GPGPUs)


• Concept: Initially, GPUs (Graphics Processing Units) were specialized for rendering 2D/3D
graphics and video processing. However, their parallel processing capabilities have made
them useful for general-purpose computing. This approach is called GPGPU computing.
• Advantages:
o Efficient for parallel operations on large datasets.
o Used in applications requiring high-speed computation, such as AI, deep learning,
and simulations.
• Blurring CPU-GPU Line:
o GPUs can now handle tasks traditionally performed by CPUs.
o Some modern processors integrate both CPU and GPU components on the same
chip.

Summary
• Multicore processors improve performance by adding multiple cores on a single chip,
enabling parallel execution.
• MICs extend this approach by massively increasing the number of cores for high-
performance computing.
• GPGPUs leverage GPU parallelism for general-purpose applications beyond graphics, making
them useful for AI, simulations, and data-intensive computations.
This evolution reflects the shift towards parallel processing to achieve higher efficiency, lower
power consumption, and improved computational capabilities in modern processors.

Comparative Analysis: Multicore CPU vs. MIC vs. GPGPU

Feature            | Multicore CPU                    | MIC (Many Integrated Cores) | GPGPU (General-Purpose GPU)
-------------------|----------------------------------|-----------------------------|----------------------------------------------
Cores              | 2-64                             | 50-100+                     | 1000s
Task Suitability   | General-purpose, mixed workloads | Highly parallel workloads   | Data-parallel workloads
Memory Access      | Shared cache & RAM               | High-bandwidth memory       | High-bandwidth, optimized for large datasets
Programming Models | OpenMP, pthreads                 | OpenMP, MPI, Intel TBB      | CUDA, OpenCL
Best Use Cases     | OS, applications, databases      | Scientific computing, AI    | AI, deep learning, simulations

Amdahl’s Law
Amdahl's Law states that the maximum speedup of a program using multiple processors is
limited by the fraction of the program that must be executed sequentially. If f is the
fraction of the program that can be parallelized and N is the number of processors, the
speedup is given by the formula:

Speedup = 1 / [(1 - f) + f/N]

Amdahl's law can be generalized to evaluate any design or technical improvement in a
computer system. Consider an enhancement to a feature of a system that is used a fraction
f of the time and yields a speedup SUf when used. The overall speedup can be expressed as:

Speedup = 1 / [(1 - f) + f/SUf]
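A minimal sketch of the multiprocessor form of the law, Speedup = 1 / ((1 − f) + f/N), where f is the parallelizable fraction and N the number of processors (illustrative values):

```python
def amdahl_speedup(parallel_fraction, n):
    # Speedup = 1 / ((1 - f) + f / N): the serial fraction (1 - f)
    # is untouched; only the parallel fraction f is divided across N units.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# A program that is 90% parallelizable gains far less than 8x on 8 cores:
print(amdahl_speedup(0.9, 8))   # ≈ 4.71
# Even with unlimited cores, the 10% serial part caps the speedup near 10x:
print(amdahl_speedup(0.9, 10**6))  # ≈ 10.0
```

The second call shows why the serial fraction, not the core count, sets the ceiling.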
Little’s Law
Little’s Law states that in a steady-state system with no leakage, the average number of
items (L) in a system is equal to the arrival rate of items (λ) multiplied by the average
time (W) each item spends in the system. Mathematically, it is represented as:

L = λ × W

This law applies to any queuing system where items arrive, wait for service, get processed,
and then leave. It is widely used in computer systems, networking, and performance
analysis.
1. Queuing System – A system where items (e.g., processes, packets, or I/O requests)
arrive, wait, get serviced, and then depart.
2. L (Average Number in System) – The number of items, processes in a queue, or
instructions in a pipeline at any given time.
3. λ (Arrival Rate) – The rate at which items enter the system (e.g., the number of
requests per second).
4. W (Time in System) – The average time an item spends from arrival to departure.
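A quick numeric check of L = λ × W, with illustrative values (not from the chapter):

```python
def items_in_system(arrival_rate, time_in_system):
    # Little's Law: L = lambda * W
    return arrival_rate * time_in_system

# A server receiving 200 requests/s, each spending 50 ms in the system,
# holds about 10 requests in flight at any moment.
print(items_in_system(200, 0.05))  # ≈ 10.0
```

The law is useful precisely because it holds regardless of arrival distribution or service discipline, as long as the system is in steady state.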

Basic Measures of Computer Performance


1. Clock Speed (Frequency)
• Measured in Hertz (Hz), typically GHz (gigahertz) for modern processors.
• Represents the number of cycles a CPU can execute per second.
• Higher clock speed generally indicates faster performance but is not the sole determinant.
2. Instructions Per Cycle (IPC)
• Defines how many instructions a CPU can execute per clock cycle.
• A higher IPC means better efficiency at the same clock speed.
3. CPU Performance (CPI and MIPS)
• Cycles Per Instruction (CPI): Average number of clock cycles per instruction.
• Million Instructions Per Second (MIPS): Measures the execution speed of instructions.
• Lower CPI and higher MIPS indicate better performance.
4. Floating Point Operations Per Second (FLOPS)
• Measures performance in scientific and engineering applications.
• Used in high-performance computing (HPC) and AI workloads.
5. Throughput
• The number of tasks or processes a system can complete per unit of time.
• Important for multi-core processors and parallel computing.
6. Latency (Response Time)
• The time taken to complete a single task.
• Lower latency indicates faster system responsiveness.
7. Memory Performance (Bandwidth and Latency)
• Memory Bandwidth: The rate at which data is transferred between memory and CPU,
measured in GB/s.
• Memory Latency: The delay before data is available after a request.
8. Cache Performance
• Cache hit rate: The percentage of memory accesses served from the cache.
• Cache miss penalty: Time delay when data is not found in the cache.
9. Disk Performance (IOPS and Throughput)
• Input/Output Operations Per Second (IOPS): The number of read/write operations per
second.
• Disk Throughput: The rate at which data is read/written from storage.
10. Power Efficiency
• Measured in performance per watt.
• Important for mobile and energy-efficient computing.
11. Benchmarking Scores
• Standardized tests (SPEC, Geekbench, Cinebench) used to compare performance across
systems.

Instruction Execution Rate: Understanding Processor Performance


1. Clock Speed and Cycle Time
A processor operates based on a clock signal with a fixed frequency f (measured in Hertz (Hz)). The
cycle time t (measured in seconds per cycle) is the time required for one clock cycle:
t = 1/f
For example, a processor with a 400 MHz clock speed has a cycle time:
t = 1 / (400 × 10^6) ≈ 2.5 ns (nanoseconds)
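The chapter's 400 MHz example can be verified directly:

```python
def cycle_time(freq_hz):
    # Cycle time is the reciprocal of clock frequency: t = 1 / f.
    return 1.0 / freq_hz

print(cycle_time(400e6))  # ≈ 2.5e-09 s, i.e. 2.5 ns per cycle
```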
2. Instruction Count (Ic) and CPI (Cycles Per Instruction)
• Instruction Count (Ic): The total number of instructions a program executes before completion.
• Cycles Per Instruction (CPI): The average number of clock cycles required to execute one
instruction.
Different instruction types (e.g., arithmetic, load/store, branch) take varying numbers of
cycles to execute. The overall CPI for a program is calculated as a weighted average:

CPI = Σ (CPIi × Ii) / Ic

where CPIi is the cycle count and Ii the number of executed instructions for instruction
type i.

Example: Consider a program with 2 million instructions running on a 400 MHz processor, with the
following instruction mix:

Instruction Type              | CPI | Instruction Mix (%)
------------------------------|-----|--------------------
Arithmetic/Logic              | 1   | 60%
Load/Store (Cache Hit)        | 2   | 18%
Branch                        | 4   | 12%
Memory Reference (Cache Miss) | 8   | 10%

The average CPI is calculated as:

CPI = (1×0.6) + (2×0.18) + (4×0.12) + (8×0.10) = 2.24

CPI = 2.24
3. Execution Time Calculation
The execution time T of a program is the product of instruction count, average CPI, and
cycle time:

T = Ic × CPI × t

For the given program:

T = (2 × 10^6) × (2.24) × (2.5 × 10^-9) ≈ 11.2 milliseconds
4. MIPS (Million Instructions Per Second)
MIPS measures the execution rate of instructions per second:

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

For our example:

MIPS = (400 × 10^6) / (2.24 × 10^6) ≈ 178 MIPS
This means the processor executes 178 million instructions per second.
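The whole worked example can be reproduced in a few lines:

```python
# The chapter's worked example: 2 million instructions on a 400 MHz processor.
mix = [          # (CPI, fraction of instructions)
    (1, 0.60),   # arithmetic/logic
    (2, 0.18),   # load/store, cache hit
    (4, 0.12),   # branch
    (8, 0.10),   # memory reference, cache miss
]
ic = 2e6         # instruction count
f = 400e6        # clock frequency in Hz

cpi = sum(c * frac for c, frac in mix)  # weighted-average CPI ≈ 2.24
t_exec = ic * cpi * (1.0 / f)           # T = Ic * CPI * t ≈ 0.0112 s = 11.2 ms
mips = f / (cpi * 1e6)                  # MIPS = f / (CPI * 10^6) ≈ 178.6
print(cpi, t_exec, mips)
```

Note how MIPS depends only on clock rate and CPI, not on the program's length, while execution time depends on all three factors.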
5. Factors Affecting Performance
Expanding execution time as T = Ic × [p + (m × k)] × t exposes the five performance factors:
1. Instruction Count (Ic) → Affected by the instruction set architecture and compiler optimization.
2. Processor Cycles per Instruction (p) → Depends on how the processor executes instructions.
3. Memory References per Instruction (m) → More references increase execution time.
4. Memory Access Speed Ratio (k) → The ratio of memory cycle time to processor cycle time.
5. Cycle Time (t) → Determined by the processor clock speed.
6. Floating-Point Performance (MFLOPS)
For scientific and engineering workloads, the execution rate is expressed in floating-point
operations rather than instructions:

MFLOPS rate = (number of executed floating-point operations in a program) / (execution time × 10^6)
Calculating the Mean
In Computer Organization and Architecture (COA), calculating the mean (average) is essential for
performance analysis. Different types of means—Arithmetic Mean (AM), Geometric Mean (GM),
and Harmonic Mean (HM)—are used based on the nature of performance metrics.
Why Calculate the Mean in COA?
1. Comparing Performance Across Benchmarks
o Different programs have different execution times on various processors. Using the
right mean helps in fair comparisons.
2. MIPS and Execution Time Analysis
o When comparing multiple processors, AM, GM, and HM are used to average MIPS
(Millions of Instructions Per Second) or execution times.
3. Selecting the Right Mean:
o Arithmetic Mean (AM): Used when adding independent values, such as execution
times of different programs.
o Geometric Mean (GM): Used when normalizing performance ratios, e.g., when
comparing relative speedups.
o Harmonic Mean (HM): Used when averaging rates, like MIPS or IPC (Instructions Per
Cycle), since it properly handles reciprocal values.
4. Performance Ranking of Processors
o GM is commonly used in benchmarking suites (e.g., SPEC benchmarks) to provide a
fair ranking of CPU performance.

NOTE - The three common formulas used for calculating a mean are arithmetic, geometric,
and harmonic. Given a set of n real numbers (x1, x2, …, xn), the three means are defined as
follows:

AM = (x1 + x2 + … + xn) / n
GM = (x1 × x2 × … × xn)^(1/n)
HM = n / (1/x1 + 1/x2 + … + 1/xn)
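The three means can be computed directly; the MIPS ratings below are illustrative values, not from the chapter. Note that for rates, HM ≤ GM ≤ AM always holds, which is why the arithmetic mean overstates average throughput:

```python
import math

def arithmetic_mean(xs):
    # AM: appropriate for summing independent quantities such as execution times.
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # GM: appropriate for normalized performance ratios (e.g., SPEC-style speedups).
    return math.prod(xs) ** (1.0 / len(xs))

def harmonic_mean(xs):
    # HM: appropriate for averaging rates such as MIPS, since it averages reciprocals.
    return len(xs) / sum(1.0 / x for x in xs)

rates = [100.0, 200.0, 400.0]   # hypothetical MIPS ratings of three programs
print(arithmetic_mean(rates))   # ≈ 233.3
print(geometric_mean(rates))    # ≈ 200.0
print(harmonic_mean(rates))     # ≈ 171.4
```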
