Modern C++ Programming
20. Performance Optimization I
Basic Concepts
Federico Busato
2023-11-14
Table of Contents
1 Introduction
Moore’s Law
Moore’s Law Limitations
Reasons for Optimizing
2 Basic Concepts
Asymptotic Complexity
Time-Memory Trade-off
Development Cycle
Amdahl’s Law
Throughput, Bandwidth, Latency
Performance Bounds
Arithmetic Intensity
3 Basic Architecture Concepts
Instruction-Level Parallelism (ILP)
Data-Level Parallelism (DLP)
RISC, CISC Instruction Sets
4 Memory Hierarchy
Memory Hierarchy Concepts
Memory Locality
Introduction
Performance and Technological Progress
Moore’s Law 1/2
Moore’s Law 2/2
Moore’s Law is not (yet) dead, but the same cannot be said for clock frequency, single-thread performance, power consumption, and cost. How can we provide value?
Single-Thread Performance Trend
Higher performance over time is not merely dictated by the number of transistors. Specific hardware improvements, software engineering, and algorithms play a crucial role in driving computer performance.
Moore’s Law Limitations - Two Examples 2/2
Specialized Hardware
Reduced precision, matrix-multiplication engines, and sparsity have provided orders-of-magnitude performance improvements for AI applications
Reasons for Optimizing
• In the first decades of computing, performance was extremely limited. Low-level optimizations were essential to fully exploit the hardware
• Modern systems provide much higher performance, but we can no longer rely on hardware improvements in the short term
• Performance and efficiency add market value (a fast program for a given task), e.g. search, page loading, etc.
• Optimized code uses fewer resources, e.g. in a program that runs on a server for months or years, a small reduction in execution time/power consumption translates into large savings in energy and cost
Software Optimization is Complex
Asymptotic Complexity 1/2
Asymptotic analysis estimates the execution time or memory usage as a function of the input size (the order of growth)
The asymptotic behavior is opposed to a low-level analysis of the code (instruction/loop counting/weighting, cache accesses, etc.)
Drawbacks:
• The worst case is not the average case
• Asymptotic complexity does not consider small inputs (think of insertion sort; see the sketch below)
• The hidden constant can be relevant in practice
• Asymptotic complexity does not consider instruction costs and hardware details
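For instance, a minimal sketch of the insertion-sort case mentioned above (the code is illustrative, not from the original): despite its O(n²) worst case, its tiny constant factor and sequential memory accesses make it the method of choice for very small arrays, which is why many std::sort implementations switch to it for small subranges.

#include <cstddef>

// O(n^2) insertion sort: for very small n (e.g. n <= 16) the low
// overhead and sequential memory accesses beat O(n log n) algorithms,
// even though asymptotic analysis suggests the opposite
void insertion_sort(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];  // shift larger elements one slot right
            --j;
        }
        a[j] = key;  // place the element in its sorted position
    }
}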
Asymptotic Complexity 2/2
Be aware that only real-world problems with a small asymptotic complexity or a small input size can be solved in a time acceptable to the user
Example:
• Matrix multiplication: O(N³). Even for small sizes N (e.g. 8K, 16K), it requires special accelerators (e.g. GPU, TPU, etc.) to achieve acceptable performance
Time-Memory Trade-off
A time-memory trade-off spends more memory to reduce execution time (e.g. lookup tables, memoization, caching) or, vice versa, spends more computation to reduce memory usage (e.g. data compression). A sketch follows below.
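A minimal sketch of the trade-off (the function and table size are illustrative assumptions): precomputing the population count of every byte value trades 256 bytes of memory for much cheaper queries at run time.

#include <array>
#include <cstdint>

// Precomputed population counts for every byte value:
// 256 bytes of memory traded for O(1) lookups at run time
constexpr std::array<std::uint8_t, 256> make_popcount_table() {
    std::array<std::uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i) {
        int bits = 0;
        for (int v = i; v != 0; v >>= 1)
            bits += v & 1;
        table[i] = static_cast<std::uint8_t>(bits);
    }
    return table;
}

constexpr auto kPopcount = make_popcount_table();

// Count the set bits of a 32-bit word with 4 table lookups
// instead of iterating over all 32 bits
int popcount32(std::uint32_t x) {
    return kPopcount[x & 0xFF] + kPopcount[(x >> 8) & 0xFF] +
           kPopcount[(x >> 16) & 0xFF] + kPopcount[(x >> 24) & 0xFF];
}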
Development Cycle 1/3
Development Cycle 3/3
• Most of the time, there is no single perfect algorithm for all cases (e.g. insertion, merge, radix sort). Optimizing also means finding the right heuristics for different program inputs/platforms, rather than modifying the existing code (see the sketch below)
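A minimal sketch of such a heuristic (the threshold and dispatch criteria are illustrative assumptions, not from the original): the strategy is selected from properties of the input, while the underlying algorithms stay untouched.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Dispatch on input characteristics rather than hard-coding one algorithm
void adaptive_sort(std::vector<std::uint32_t>& v) {
    if (v.size() < 32) {
        // Tiny input: insertion sort has the smallest constant factor
        for (std::size_t i = 1; i < v.size(); ++i)
            for (std::size_t j = i; j > 0 && v[j - 1] > v[j]; --j)
                std::swap(v[j], v[j - 1]);
    } else if (std::is_sorted(v.begin(), v.end())) {
        return;  // already sorted: nothing to do
    } else {
        // General case: fall back to the standard introsort
        std::sort(v.begin(), v.end());
    }
}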
Amdahl’s Law 1/3
Amdahl’s Law
Amdahl’s law expresses the maximum improvement possible by improving a particular part of a system
Amdahl’s Law 2/3

Overall Improvement = 1 / ((1 − P) + P / S)

where P is the fraction of the execution time affected by the improvement, and S is the speedup of that fraction. Overall improvement for a fraction P (rows) improved by a factor S (columns):

 P \ S  1.25x  1.5x   1.75x  2x     3x     4x     5x     10x    ∞
 10%    1.02x  1.03x  1.04x  1.05x  1.07x  1.08x  1.09x  1.10x  1.11x
 20%    1.04x  1.07x  1.09x  1.11x  1.15x  1.18x  1.19x  1.22x  1.25x
 30%    1.06x  1.11x  1.15x  1.18x  1.25x  1.29x  1.31x  1.37x  1.43x
 40%    1.09x  1.15x  1.20x  1.25x  1.36x  1.43x  1.47x  1.56x  1.67x
 50%    1.11x  1.20x  1.27x  1.33x  1.50x  1.60x  1.66x  1.82x  2.00x
 60%    1.14x  1.25x  1.35x  1.43x  1.67x  1.82x  1.92x  2.17x  2.50x
 70%    1.16x  1.30x  1.43x  1.54x  1.88x  2.10x  2.27x  2.70x  3.33x
 80%    1.19x  1.36x  1.52x  1.67x  2.14x  2.50x  2.78x  3.57x  5.00x
 90%    1.22x  1.43x  1.63x  1.82x  2.50x  3.08x  3.57x  5.26x  10.00x
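A worked example of reading the table (the numbers are computed from the formula, not taken from the original): if a profiler shows that a region accounts for P = 90% of the execution time and we speed it up by S = 2x, the overall improvement is 1 / ((1 − 0.9) + 0.9 / 2) = 1 / 0.55 ≈ 1.82x, matching the 90% row, 2x column. Note that even an infinite speedup of that region cannot exceed 1 / (1 − P) = 10x.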
Amdahl’s Law 3/3
Throughput, Bandwidth, Latency
Memory bandwidth is the amount of data that can be loaded from or stored into a particular memory space in a given amount of time
Peak bandwidth:
(Frequency in Hz) x (Bus width in bits / 8) x (Pump rate, memory type multiplier)
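A worked example with the formula above (a typical configuration, not a number from the original): a DDR4-3200 module runs its bus at 1600 MHz, transfers over a 64-bit bus, and is double-pumped (two transfers per clock), so its peak bandwidth per channel is 1.6e9 Hz x (64 / 8) bytes x 2 = 25.6 GB/s.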
Performance Bounds 2/2
• Latency-bound. The program spends its time primarily waiting for data to be ready (instruction/memory dependencies). Performance is limited by the latency of the CPU/memory
• I/O-bound. The program spends its time primarily performing I/O operations (network, user input, storage, etc.). Performance is limited by the speed of the I/O subsystem
Arithmetic Intensity 1/2
Arithmetic Intensity
Arithmetic/Operational Intensity is the ratio of total operations to total data
movement (bytes or words)
* What Is a Flop?
Arithmetic Intensity 2/2
Example: n x n dense matrix multiplication (float) performs 2n³ operations and moves 3 · 4n² = 12n² bytes (three matrices of n² four-byte elements):

R = ops / bytes = 2n³ / 12n² = n / 6

which means that for every byte accessed, the algorithm performs n/6 operations → compute-bound
A modern CPU performs about 100 GFLOPS and has about 50 GB/s of memory bandwidth
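To connect the two numbers (a back-of-the-envelope estimate, not from the original): such a CPU needs an arithmetic intensity of at least 100 GFLOPS / 50 GB/s = 2 operations per byte to keep the compute units busy. For the matrix multiplication above, n/6 ≥ 2 as soon as n ≥ 12, so for any realistic size the kernel is compute-bound rather than memory-bound.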
Basic Architecture Concepts
Instruction-Level Parallelism (ILP) 1/3
Instruction-Level Parallelism (ILP) 2/3
Microarchitecture   Pipeline stages
Core                14
Bonnell             16
Sandy Bridge        14
Silvermont          14 to 17
Haswell             14
Skylake             14
Kaby Lake           14
ILP and Little’s Law
Little’s Law expresses the relation between latency and throughput: the throughput λ of a system is equal to the average number of elements in the system L divided by the average time W (latency) that each element spends in the system:

L = λW  →  λ = L / W

• L: average number of customers in a store
• λ: arrival rate (throughput)
• W: average time spent (latency)
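Applied to ILP, Little’s law says that to sustain a throughput of λ instructions per cycle on a unit with a latency of W cycles, about L = λW independent instructions must be in flight. A minimal sketch of the idea (the latency figure in the comments is an illustrative assumption): a reduction written with several independent accumulators exposes enough parallelism to hide floating-point latency.

#include <cstddef>

// Single accumulator: each addition depends on the previous one,
// so the loop runs at one add per FP-add latency (e.g. ~4 cycles)
double sum_serial(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Four independent accumulators: up to 4 additions in flight,
// approaching the L = latency x throughput operations needed
double sum_ilp(const double* data, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];      // the four chains do not depend
        s1 += data[i + 1];  // on each other, so the CPU can
        s2 += data[i + 2];  // overlap their latencies
        s3 += data[i + 3];
    }
    for (; i < n; ++i)      // leftover elements
        s0 += data[i];
    return (s0 + s1) + (s2 + s3);
}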
Data-Level Parallelism (DLP)
Thread-Level Parallelism (TLP)
A thread is a single sequential execution flow within a program, with its own state (instructions, data, PC, register state, and so on)
Single Instruction Multiple Threads (SIMT)
RISC, CISC Instruction Sets
CISC
ARM vs x86: What’s the difference between the two processor architectures?
CISC vs. RISC
• Hardware market:
- RISC (ARM, IBM): Qualcomm Snapdragon, Amazon Graviton, Nvidia Grace, Nintendo Switch, Fujitsu Fugaku, Apple M1, Apple iPhone/iPod/Mac, Tesla Full Self-Driving Chip, PowerPC
- CISC (Intel, AMD): all x86-64 processors
• Software market:
- RISC: Android, Linux, Apple OS, Windows
- CISC: Windows, Linux
• Power consumption:
- CISC: Intel i5 10th generation: 64 W
- RISC: ARM-based smartphone: < 5 W
ARM Quote
“Incidentally, the first ARM1 chips required so little power, when the
first one from the factory was plugged into the development system to
test it, the microprocessor immediately sprung to life by drawing current
from the IO interface – before its own power supply could be properly
connected.”
Happy birthday, ARM1: It’s 35 years since Britain’s Acorn RISC Machine chip sipped power for the first time
Memory Hierarchy
The Von Neumann Bottleneck
Moving data to and from main memory consumes the vast majority of the time and energy of the system
Memory Hierarchy
Memory Hierarchy 1/3
Source: twitter.com/MIT_CSAIL
Memory Hierarchy 2/3
Source: “Accelerating Linear Algebra on Small Matrices from Batched BLAS to Large Scale Solvers”, ICL, University of Tennessee
Memory Hierarchy 3/3
[Table: size, latency, latency ratio, bandwidth, and bandwidth ratio for each memory hierarchy level]
• en.wikichip.org/wiki/WikiChip
• Memory Bandwidth Napkin Math
Memory Hierarchy Concepts 1/4
A cache is a small, fast memory located close to the processor that stores frequently used instructions and data. It is part of the processor package and takes 40 to 60 percent of the chip area
Characteristics and content:
Registers  Program counter (PC), general-purpose registers, instruction register (IR), etc.
L1 Cache   Instruction cache and data cache, private/exclusive per CPU core, located on-chip
L2 Cache   Private/exclusive per single CPU core or shared by a cluster of cores, located on-chip in modern processors
L3 Cache   Shared between all cores; historically off-chip (e.g. on the motherboard), on-chip/on-package in modern processors, up to 128/256 MB
Memory Hierarchy Concepts 2/4
Memory Hierarchy Concepts 3/4
A cache line or cache block is the unit of data transfer between the cache and main memory; that is, memory is loaded at the granularity of a cache line (see the sketch after the list below)
The typical size of a cache line is 64 bytes. A cache line can be further organized in banks or sectors
Cache access types:
Cold   The data is not in any cache level
Warm   The data is in the L2 or L3 caches
Hot    The data is in the L1 cache
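A minimal sketch of what cache-line granularity implies (the array size and stride are illustrative assumptions): both loops execute the same number of iterations, but the strided one touches a new 64-byte line on every access, so it pays a full line transfer per element and is typically far slower.

#include <cstddef>
#include <vector>

constexpr std::size_t kLine = 64;                    // typical cache-line size
constexpr std::size_t kStride = kLine / sizeof(int); // 16 ints per line

// Sequential: 16 consecutive ints share one cache line,
// so only 1 memory transfer occurs every 16 accesses
long sum_sequential(const std::vector<int>& v, std::size_t count) {
    long sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += v[i];
    return sum;
}

// Strided: every access lands on a different cache line, so every
// access pays for a full 64-byte transfer
// (the caller must ensure v.size() >= count * kStride)
long sum_strided(const std::vector<int>& v, std::size_t count) {
    long sum = 0;
    for (std::size_t i = 0; i < count * kStride; i += kStride)
        sum += v[i];
    return sum;
}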
Memory Hierarchy Concepts 4/4
• A cache hit occurs when requested data is successfully found in the cache
• The cache hit rate is the number of cache hits divided by the number of memory requests
• A cache miss occurs when requested data is not found in the cache
• The miss penalty is the extra time required to load the data into the cache from main memory when a cache miss occurs
• A page fault occurs when requested data is in the process address space but is not currently located in main memory (swap/pagefile)
• Page thrashing occurs when page faults are frequent and the OS spends significant time swapping data in and out of physical RAM
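These quantities combine into the standard average memory access time formula (the cycle counts below are illustrative assumptions, not from the original):

AMAT = hit time + miss rate x miss penalty

For example, with a 4-cycle hit time, a 95% hit rate, and a 100-cycle miss penalty: AMAT = 4 + 0.05 x 100 = 9 cycles, more than twice the hit time even though only 1 access in 20 misses.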
Memory Locality
• Temporal Locality refers to the reuse of the same data within a relatively small time window and, as a consequence, allows exploiting the faster levels of the memory hierarchy (caches), e.g. multiple sparse accesses to the same data (see the sketch below)
Heavily used memory locations can be accessed more quickly than less heavily used locations
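A minimal sketch of exploiting temporal locality (the scenario is illustrative): fusing two passes over a large array into one reuses each element while it is still in a register/cache, instead of reloading it from main memory after it has been evicted.

#include <algorithm>
#include <utility>
#include <vector>

// Two separate passes: for an array larger than the cache, every
// element is loaded from main memory twice (no temporal reuse)
// (both functions assume a non-empty vector)
std::pair<int, int> min_max_two_pass(const std::vector<int>& v) {
    int lo = *std::min_element(v.begin(), v.end());
    int hi = *std::max_element(v.begin(), v.end());
    return {lo, hi};
}

// Fused pass: each element is loaded once and used for both
// results while it is still close to the processor
std::pair<int, int> min_max_fused(const std::vector<int>& v) {
    int lo = v[0], hi = v[0];
    for (int x : v) {
        lo = std::min(lo, x);
        hi = std::max(hi, x);
    }
    return {lo, hi};
}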