Cache-Conscious Algorithms and Data Structures

The document discusses techniques for designing algorithms and data structures to be cache-conscious and optimize for memory hierarchy. It explores how different algorithms like searches and sorts perform on various machines and cache levels. It advocates analyzing algorithms through a cache performance model and presents case studies where cache-optimized designs like multi-way tree searches and implicit k-d trees outperform standard implementations. The goal is to raise awareness of the memory hierarchy when programming complex applications.


1-Jun-00 Bentley: Cache-Conscious Algs & DS 1

Cache-Conscious Algorithms and Data Structures
Jon Bentley
Avaya Labs

A Programming Puzzle
A Cost Model
Case Studies
Principles
A Programming Puzzle
Which is faster for representing sequences: arrays or lists?
Technical details
Random insertions
Into a sorted sequence
Same sequence of comparisons
Different overhead
Pointer chasing in lists
Knuth, v. 3: Search is 4C in arrays, 6C in lists
Sliding a sequence of an array
A Testbed
Main Loop in Pseudocode
S = empty
while S.size() < n
    S.insert(bigrand())
About n^2/4 comparisons
C++ Classes for Arrays and Linked Lists
Which is faster?
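A minimal C sketch of the two representations in the testbed, assuming a linear scan for the insertion point (which is consistent with the n^2/4 comparison count). The names ArraySet, ListSet, array_insert, and list_insert are illustrative; they are not the talk's C++ classes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sorted set of ints in one contiguous block (capacity supplied by caller). */
typedef struct {
    int *x;
    size_t n, cap;
} ArraySet;

/* Insert t if absent: scan for its place, then slide the tail right by one. */
void array_insert(ArraySet *s, int t)
{
    size_t i = 0;
    while (i < s->n && s->x[i] < t)
        i++;
    if (i < s->n && s->x[i] == t)
        return;                          /* already present */
    assert(s->n < s->cap);
    memmove(&s->x[i + 1], &s->x[i], (s->n - i) * sizeof s->x[0]);
    s->x[i] = t;
    s->n++;
}

/* Sorted set of ints as a singly linked list. */
typedef struct Node {
    int val;
    struct Node *next;
} Node;

typedef struct {
    Node *head;
    size_t n;
} ListSet;

/* Insert t if absent: same comparisons, but every step chases a pointer. */
void list_insert(ListSet *s, int t)
{
    Node **pp = &s->head;
    while (*pp != NULL && (*pp)->val < t)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->val == t)
        return;                          /* already present */
    Node *q = malloc(sizeof *q);
    assert(q != NULL);
    q->val = t;
    q->next = *pp;
    *pp = q;
    s->n++;
}
```

Feeding both structures the same stream of random values until each holds n elements gives the comparison the slide asks about: identical comparisons, different memory traffic.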
An Experiment


[Figure: ArrList -- PII 400. x-axis: n; y-axis: nanosecs/n^2; series: Arrays, Lists.]
Average access time as a function of set size
Display on a Log Scale
[Figure: ArrList -- PII 400, log scale. x-axis: n; y-axis: nanosecs/n^2; series: Arrays, Lists.]
Other Machines
[Figure: ArrList -- K6 400 and ArrList -- R10000 250, each on linear and log scales. x-axis: n; y-axis: nanosecs/n^2; series: Arrays, Lists.]
Lessons Across Machines
Knees at L1, L2, RAM boundaries



Smaller structures have later knees
In L1: All accesses are cheap
Above L1: Sequential is faster than random
RAM Caches
A Cost Model for Memory
Goal: A Program to Estimate Access Costs
The Key Loop (n is array size, d is delta)
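/* Touch count elements of x[0..n-1], stepping by d and wrapping around */
for (i = 0; i < count; i++) {
    sum += x[j];
    j += d;
    if (j >= n)
        j -= n;    /* one subtraction suffices when d < n */
}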
A Real Program
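One way the key loop could be wrapped into a measurable program, as a sketch only: the function names walk and ns_per_access are assumptions, and timing uses the standard clock() rather than whatever the talk's program used.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Sum count elements of x[0..n-1], starting at index 0 and stepping by d
   with wraparound. Requires d < n so a single subtraction wraps j. */
long walk(const int *x, size_t n, size_t d, long count)
{
    long sum = 0;
    size_t j = 0;
    assert(n > 0 && d < n);
    for (long i = 0; i < count; i++) {
        sum += x[j];
        j += d;
        if (j >= n)
            j -= n;
    }
    return sum;
}

/* Estimated nanoseconds per access for array size n and stride d. */
double ns_per_access(size_t n, size_t d, long count)
{
    int *x = malloc(n * sizeof *x);
    assert(x != NULL && count > 0);
    for (size_t i = 0; i < n; i++)
        x[i] = 1;
    volatile long sink;                  /* keeps the loop from being optimized away */
    clock_t start = clock();
    sink = walk(x, n, d, count);
    clock_t end = clock();
    (void)sink;
    free(x);
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)count;
}
```

Calling ns_per_access over a range of n with d=1 and d=11 would produce the kind of data the following plots show; count needs to be large enough for clock() to resolve the elapsed time.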
Results of the Model

[Figure: MemEx -- PII 400. x-axis: n; y-axis: nanosecs/access; series: d=1, d=11.]
Other Machines

[Figure: MemEx -- K6 400 and MemEx -- R10000 250. x-axis: n; y-axis: nanosecs/access; series: d=1, d=11.]
Trends Across Machines
Same shapes, different constants
Transitions at cache boundaries
Constant cost in L1
Sequential is cheaper above L1
Differences grow substantially


What happens with complex software?
Awk's Associative Arrays
Interpretation and data structures dominate
Algorithms in Awk are cache-insensitive
[Figure: Awk MemEx -- PII 400. x-axis: n; y-axis: microsecs/access; series: d=1, d=11.]
Sorting Algorithms
How do different sorts behave under caching?
Two easy O(n log n) sorts
Quicksort
Heapsort
Which is faster?
Cache-Insensitive Sorting
[Figure: Sorts in Awk -- PII 400. x-axis: n; y-axis: microsecs/n; series: qsort, hsort.]
Quicksort vs. Heapsort

[Figure: Sorting -- PII 400. x-axis: n; y-axis: nanosecs/n; series: qsort, hsort.]
Sorting on Other Machines

[Figure: Sorting -- K6 400 and Sorting -- R10000 250. x-axis: n; y-axis: nanosecs/n; series: qsort, hsort.]
Cache-Conscious Sorting
Early work on tapes and disks
LaMarca and Ladner, 1997 SODA
Quicksort: Undo Sedgewick's final sort; one multiway partition
Heapsort: Build towards root; multiway branching
Merge Sort: Tiling (sort a cache-full in the first pass); multiway merge
Radix Sort
Detailed Analyses
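The tiling idea for merge sort can be sketched as follows. This is not LaMarca and Ladner's implementation: the tile parameter (how many elements count as a "cache-full") is an assumption left to the caller, the first pass simply uses qsort on each tile, and later passes use ordinary two-way merges rather than a multiway merge.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge sorted a[lo..mid) and a[mid..hi) into out[lo..hi). */
static void merge(const int *a, int *out, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        out[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid)
        out[k++] = a[i++];
    while (j < hi)
        out[k++] = a[j++];
}

/* Pass 1 sorts each tile while it is cache-resident;
   later passes merge runs of doubling width. */
void tiled_merge_sort(int *a, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t lo = 0; lo < n; lo += tile) {
        size_t len = (n - lo < tile) ? n - lo : tile;
        qsort(a + lo, len, sizeof a[0], cmp_int);
    }
    if (n <= tile)
        return;
    int *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    int *src = a, *dst = tmp;
    for (size_t width = tile; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge(src, dst, lo, mid, hi);
        }
        int *t = src; src = dst; dst = t;    /* output becomes next input */
    }
    if (src != a)
        memcpy(a, src, n * sizeof a[0]);
    free(tmp);
}
```

Starting the merge passes at runs of length tile instead of length 1 removes the early passes that would otherwise stream the whole array through the cache for little work.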

Searching
A Rich History
Represent 3-level subtrees on disk pages
Linear search within pages, followed by multiway branch
Landauer (IEEE TEC, 1963; ISAM)
B-Trees (Bayer and McCreight, 1970)
Fun Problems
Hashing (Binstock, DDJ April 1996)
How to search in a (preprocessed) array?
Binary Search
Array: 0 1 2 3 4 5 6
Search Code
l = 0;
u = n-1;
for (;;) {
    if (l > u)
        return -1;
    m = (l + u) / 2;
    if (x[m] < t)
        l = m+1;
    else if (x[m] == t)
        return m;
    else /* x[m] > t */
        u = m-1;
}
Timing Binary Search
My First Timing Code
// start clock
for (i = 0; i < n; i++)
    assert(search(x[i]) == i);
// end clock
Problems?
Cache-Insensitive Search
[Figure: Searching in Awk -- PII 400. x-axis: n; y-axis: microsecs/search; series: Ordered, Scrambled.]
Observed Run Times
[Figure: Binary Search -- PII 400. x-axis: n; y-axis: nanosecs/search; series: Seq, Scrambled.]
Timing Binary Search, cont.
Whack-a-Mole Cost Model
Final Timing Code
// scramble perm vector p
// start clock
for (i = 0; i < n; i++)
    assert(search(x[p[i]]) == p[i]);
// end clock
A General Problem
Perhaps a Solution?
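The final timing code can be filled out roughly as below. This is a sketch under stated assumptions: the helper names (binary_search, scramble, ns_per_search) are illustrative, x must hold distinct sorted values, and the shuffle uses rand(), whose limited range and modulo bias make it adequate only for a demonstration.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Return the index of t in sorted x[0..n-1], or -1 if absent. */
int binary_search(const int *x, int n, int t)
{
    int l = 0, u = n - 1;
    while (l <= u) {
        int m = l + (u - l) / 2;
        if (x[m] < t)
            l = m + 1;
        else if (x[m] == t)
            return m;
        else
            u = m - 1;
    }
    return -1;
}

/* Fill p[0..n-1] with a shuffled permutation of 0..n-1 (Fisher-Yates). */
void scramble(int *p, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = p[i]; p[i] = p[j]; p[j] = t;
    }
}

/* Search for every element once, in the order given by p, and report
   nanoseconds per search. Checking the result keeps the search live. */
double ns_per_search(const int *x, const int *p, int n)
{
    assert(n > 0);
    clock_t start = clock();
    for (int i = 0; i < n; i++) {
        int r = binary_search(x, n, x[p[i]]);
        assert(r == p[i]);
        (void)r;
    }
    return 1e9 * (double)(clock() - start) / CLOCKS_PER_SEC / n;
}
```

Passing the identity permutation instead of a scrambled one reproduces the first timing code, so the same harness measures both the sequential and the scrambled case.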
HeapSearch
Tree:
      3
  1       5
0   2   4   6
Array: 3 1 5 0 2 4 6
Search Code
p = 1;
while (p <= n) {
    if (t == y[p])
        return p;
    else if (t < y[p])
        p = 2*p;
    else /* t > y[p] */
        p = 2*p + 1;
}
return -1;
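The heap-ordered array can be built from a sorted array by an in-order walk of the implicit tree. The sketch below assumes 1-based indexing in y, as in the search code above; heap_layout and fill are illustrative names, not the talk's.

```c
#include <assert.h>

/* In-order walk of the implicit tree rooted at p: left subtree, node, right
   subtree. Consumes sorted x[next..] and returns the next unused index. */
static int fill(const int *x, int *y, int n, int p, int next)
{
    if (p > n)
        return next;
    next = fill(x, y, n, 2 * p, next);
    y[p] = x[next++];
    return fill(x, y, n, 2 * p + 1, next);
}

/* Lay out sorted x[0..n-1] into y[1..n] so that node p has children
   2p and 2p+1. y must have room for n+1 ints; y[0] is unused. */
void heap_layout(const int *x, int *y, int n)
{
    assert(n >= 0);
    fill(x, y, n, 1, 0);
}

/* Search y[1..n] for t; return its position or -1. */
int heap_search(const int *y, int n, int t)
{
    int p = 1;
    while (p <= n) {
        if (t == y[p])
            return p;
        p = (t < y[p]) ? 2 * p : 2 * p + 1;
    }
    return -1;
}
```

For the sorted input 0 1 2 3 4 5 6 this produces the array 3 1 5 0 2 4 6 shown on the slide. The first few levels of the tree sit in adjacent array cells, so the early probes of every search touch the same few cache lines.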
Multiway HeapSearch
View as implicit, static B-trees
b-way branching
b=8 for 32-byte cache lines
Aligned on cache boundaries
Recursive code builds the array in linear time
Speed up by loop unrolling
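A sketch of an implicit, static multiway search tree in the same spirit. Several details are assumptions: the slides do not say whether b counts keys or children, so this version stores KEYS = 8 four-byte keys per node (one 32-byte line) with KEYS + 1 children; unused slots are padded with INT_MAX, so that value cannot be searched for; and cache-line alignment of y is left to the caller.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define KEYS 8    /* keys per node (assumed: 8 x 4-byte int = 32 bytes) */

/* In-order fill of node k: child 0, key 0, child 1, key 1, ..., last child.
   Children of node k are nodes k*(KEYS+1) + 1 + i for i = 0..KEYS. */
static size_t mw_fill(const int *x, size_t n, int *y,
                      size_t nodes, size_t k, size_t next)
{
    if (k >= nodes)
        return next;
    for (size_t i = 0; i < KEYS; i++) {
        next = mw_fill(x, n, y, nodes, k * (KEYS + 1) + 1 + i, next);
        y[k * KEYS + i] = (next < n) ? x[next++] : INT_MAX;
    }
    return mw_fill(x, n, y, nodes, k * (KEYS + 1) + 1 + KEYS, next);
}

/* Build the layout from sorted x[0..n-1]; y needs nodes * KEYS ints.
   Returns the number of nodes. Each slot is written once: linear time. */
size_t mw_build(const int *x, size_t n, int *y)
{
    size_t nodes = (n + KEYS - 1) / KEYS;
    assert(x != NULL && y != NULL);
    mw_fill(x, n, y, nodes, 0, 0);
    return nodes;
}

/* Return the index in y holding t, or -1. Requires t < INT_MAX. */
long mw_search(const int *y, size_t nodes, int t)
{
    size_t k = 0;
    while (k < nodes) {
        const int *node = y + k * KEYS;
        size_t i = 0;
        while (i < KEYS && node[i] < t)      /* simple linear scan in the node */
            i++;
        if (i < KEYS && node[i] == t)
            return (long)(k * KEYS + i);
        k = k * (KEYS + 1) + 1 + i;          /* descend to child i */
    }
    return -1;
}
```

Each level of the search reads one node's worth of contiguous keys, so a search touches roughly one cache line per level instead of one per comparison. The inner scan has a fixed bound, which is what makes unrolling it straightforward.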
Search Performance

[Figure: Ordered Searches, PII 400. x-axis: n; y-axis: nanosecs/search; series: Binary; b=2; b=8, Simple; b=8, Unrolled.]
Searching on Other Machines
[Figure: Ordered Searches, K6 400 and Ordered Searches, R10000 250. x-axis: n; y-axis: nanosecs/search; series: Binary; b=2; b=8, Simple; b=8, Unrolled.]
A Philosophical Digression
Approaches to Cache-Conscious Coding
Head-in-the-sand big-ohs
System Tools
VTune
Compilers (and more)
Detailed Analyses
LaMarca and Ladner
Knuth's MMIX Simulator
High-level, heuristic, machine-independent
A Supermarket Analogy
Vector Chains
What is the longest chain in a set of n vectors in 3-space?
Erdos and Szekeres; Ulam; Baer and Brock; Logan and Shepp; Vershik and Kerov; Bollobas and Winkler; Odlyzko and Rains
Key structure: a 2-d antichain
Sequence of 2-d points with increasing x values and decreasing y values
Key Decisions
Represent points as (x, y) pairs, not by pointers
How to represent a sorted sequence of m = n^(1/3) points (n ~ 10^9)?
STL Maps: Search in O(lg m), insert in O(lg m)
Tiny code; guaranteed performance
Sorted Arrays: Search in O(lg m); insert in O(m)
Long (buggy) code; small and sequential
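The sorted-array option can be sketched as below, with points stored by value and ordered by x. Point, lower_bound_x, and insert_point are illustrative names, and the antichain maintenance itself (discarding points that no longer belong) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int x, y;
} Point;                      /* stored by value, not by pointer */

/* First index in a[0..m) whose x is >= the given x: O(lg m) probes. */
size_t lower_bound_x(const Point *a, size_t m, int x)
{
    size_t lo = 0, hi = m;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid].x < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Insert p in x order by sliding the tail right: O(m) moves, but the
   moves are one sequential block copy. Returns the new element count. */
size_t insert_point(Point *a, size_t m, size_t cap, Point p)
{
    assert(m < cap);
    size_t i = lower_bound_x(a, m, p.x);
    memmove(&a[i + 1], &a[i], (m - i) * sizeof a[0]);
    a[i] = p;
    return m + 1;
}
```

With m near n^(1/3), about a thousand points for n ~ 10^9, the whole array is a few kilobytes, which is the "small and sequential" property the slide credits to this representation.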
Run Times

[Figure: Vector Chains -- PII 400. x-axis: Active Vectors; y-axis: Nanosecs/Vector; series: Array, STL.]
Other Machines

[Figure: Vector Chains -- K6-2 400 and Vector Chains -- R10000 250. x-axis: Active Vectors; y-axis: Nanosecs/Vector; series: Array, STL.]
An Ancient Problem
"Ideally one would desire an indefinitely large memory capacity such that any particular [word] would be immediately available. It does not seem possible to achieve such a capacity. We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, Burks, Goldstine, von Neumann, 1946
k-d Trees
Search for All Nearest Neighbors
Internal Nodes (A Cutting Hyperplane)
struct inode {
    char nodetype;
    char cutdim;
    int cutpt;
    iptr lokid;
    iptr hikid;
};
External Nodes (A Set of Points)
Two indices into a perm vector of point indices
Cache-Conscious k-d Trees
No pointers to (indices of) points
Copy values (perhaps entire points)
Implicit Tree
Internal Nodes
Parallel arrays: cutdim[], cutval[]
Drop 24 bytes/node to 5
External Nodes
Permutation vector of (copies of) points
Future
Cluster subtrees by cache line size
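A sketch of how the implicit internal nodes might look with parallel arrays. The assumptions are that internal nodes form a complete tree numbered 1..ninternal with children at 2p and 2p+1, that int is 4 bytes (giving 1 + 4 = 5 bytes per node), and that MAXNODES, ImplicitKd, and kd_find_leaf are illustrative names rather than the talk's.

```c
#include <assert.h>

enum { MAXNODES = 1 << 10 };

/* No nodetype field and no child pointers: position in the arrays
   encodes both. Node p's children are 2p and 2p+1; slot 0 is unused. */
typedef struct {
    unsigned char cutdim[MAXNODES];   /* 1 byte per node */
    int cutval[MAXNODES];             /* 4 bytes per node (assumed int size) */
    int ninternal;                    /* internal nodes are 1..ninternal */
} ImplicitKd;

/* Descend from the root to the external node whose region contains pt.
   Returns that node's number, which is greater than ninternal. */
int kd_find_leaf(const ImplicitKd *t, const int *pt)
{
    int p = 1;
    assert(t->ninternal < MAXNODES);
    while (p <= t->ninternal)
        p = (pt[t->cutdim[p]] < t->cutval[p]) ? 2 * p : 2 * p + 1;
    return p;
}
```

The descent reads only the two small arrays, so many more nodes fit in each cache line than with a record holding two child links.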
Ordering the Searches
Recall Testbed for Binary Search
Searching for x[0], x[1], x[2], ... was very fast
Random searches were slower (and more realistic)
Neighbor Searches in Random Order
for (i = 0; i < n; i++)
    nntab[i] = nnsearch(i);
Searches in Permutation Order
for (i = 0; i < n; i++)
    nntab[i] = nnsearch(perm[i]);
k-d Tree Run Times
[Figure: k-d Tree Search -- PII 400. x-axis: n; y-axis: nanosecs/search; series: Simp-Rand, Simp-Perm, CC-Rand, CC-Perm.]
Times on Other Machines
[Figure: k-d Tree Search -- K6 400 and k-d Tree Search -- R10000 250. x-axis: n; y-axis: nanosecs/search; series: Simp-Rand, Simp-Perm, CC-Rand, CC-Perm.]
Caches in Programming Pearls
Vector Rotation
Dolphin vs. block swap vs. reversal
Don't optimize {I/O, cache}-bound code
Binary search
Original testbed timed (adjacent, fast) searches
Final timed random searches
Set representations
Weird times on arrays vs. lists
STL sets thrash
Markov Text
Order-1: The table shows how many contexts; it uses two or
equal to the sparse matrices were not chosen. In Section
13.1, for a more efficient that ``the more time was
published by calling recursive structure translates to build
scaffolding to try to know of selected and testing
Order-2: The program is guided by verification ideas, and
the second errs in the STL implementation (which
guarantees good worst-case performance), and is
especially rich in speedups due to Gordon Bell. Everything
should be to use a macro: for n=10,000, its run time;
Order-3: A Quicksort would be quite efficient for the main-
memory sorts, and it requires only a few distinct values in
this particular problem, we can write them all down in the
program, and they were making progress towards a
solution at a snail's pace.

Markov Text Algorithms
Original Data Structures
Original text as one long string
Suffix array of pointers to each word
Algorithm
Read input
Sort words by k-grams
Use binary search to make transitions
Cache-Conscious Version
Hash each word on input
Replace a pointer to a text string with an index into the hash table
Sort (copied) k-grams of hash indices
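The last step of the cache-conscious version might look like the sketch below. It assumes the words have already been replaced by integer ids (the hashing step is not shown), fixes the order at 2 through the hypothetical constant ORDER, and uses illustrative names KGram and build_kgrams.

```c
#include <assert.h>
#include <stdlib.h>

#define ORDER 2               /* k, the number of words of context (assumed) */

/* A copied k-gram: ORDER word ids plus the id of the word that follows. */
typedef struct {
    int gram[ORDER];
    int next;
} KGram;

/* Lexicographic comparison on the ids; no pointers into the text. */
static int cmp_kgram(const void *a, const void *b)
{
    const KGram *p = a, *q = b;
    for (int j = 0; j < ORDER; j++)
        if (p->gram[j] != q->gram[j])
            return (p->gram[j] > q->gram[j]) - (p->gram[j] < q->gram[j]);
    return 0;
}

/* Copy every k-gram of ids[0..nwords) with its successor into out, then
   sort. out needs room for nwords - ORDER records. Returns the count. */
size_t build_kgrams(const int *ids, size_t nwords, KGram *out)
{
    size_t m = 0;
    assert(ids != NULL && out != NULL);
    for (size_t i = 0; i + ORDER < nwords; i++, m++) {
        for (int j = 0; j < ORDER; j++)
            out[m].gram[j] = ids[i + j];
        out[m].next = ids[i + ORDER];
    }
    qsort(out, m, sizeof out[0], cmp_kgram);
    return m;
}
```

Because each record carries its own copy of the ids, a comparison during the sort or a later binary search reads one small record instead of following a pointer into the original text.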
A Choice About Binary Search
Find Equal Elements in a Sorted Array

Warm Start
l = binarysearch(t, 0, n-1, <)
u = binarysearch(t, l, n-1, =)
Cold Start
l = binarysearch(t, 0, n-1, <)
u = binarysearch(t, 0, n-1, =)
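The two choices can be made concrete as below. This is one reading of the slide's notation, stated as an assumption: the first search finds where the "<" region ends and the second finds where the "=" region ends, using half-open ranges. The names first_ge, first_gt, equal_range_warm, and equal_range_cold are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* First index in x[lo..hi) whose element is >= t (end of the "<" region). */
size_t first_ge(const int *x, size_t lo, size_t hi, int t)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x[mid] < t)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First index in x[lo..hi) whose element is > t (end of the "=" region). */
size_t first_gt(const int *x, size_t lo, size_t hi, int t)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x[mid] <= t)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Warm start: the second search begins where the first one ended. */
void equal_range_warm(const int *x, size_t n, int t, size_t *l, size_t *u)
{
    assert(l != NULL && u != NULL);
    *l = first_ge(x, 0, n, t);
    *u = first_gt(x, *l, n, t);
}

/* Cold start: the second search restarts on the whole array, so its
   first probes repeat positions the first search just visited. */
void equal_range_cold(const int *x, size_t n, int t, size_t *l, size_t *u)
{
    assert(l != NULL && u != NULL);
    *l = first_ge(x, 0, n, t);
    *u = first_gt(x, 0, n, t);
}
```

Both versions return the same range [l, u) of elements equal to t. They differ only in which array positions the second search probes, which is the quantity a cache-oriented analysis has to count.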
Whack-a-Mole Analysis
Details in DDJ, March 2000
[Diagram: a sorted array divided into <, =, and > regions, with l and u marked.]
Time of Markov Algorithms
[Figure: Markov Text -- PII 400. x-axis: n; y-axis: nanosecs/word; series: pp2, cc1, cc2.]
Times on Other Machines
[Figure: Markov Text -- K6 400 and Markov Text -- R10000 250. x-axis: n; y-axis: nanosecs/word; series: pp2, cc1, cc2.]
A Sampler of Related Work
Cache-Conscious Databases, Object Code, Record Layouts, Compilers, Languages, ...
Scientific Computing: Blocking, etc.
LaMarca: Understanding and Optimizing Cache Performance
www.lamarca.org/anthony/caches.html
Board, Chatterjee, et al: TUNE
www.cs.unc.edu/Research/TUNE/
Vitter et al: External Memory Algorithms
www.cs.duke.edu/~jsv/Papers/catalog/
Frigo, Leiserson, et al: Cache-Oblivious Algorithms
1999 FOCS
Lessons for Programmers
Canonical Curves
Experimenters beware
Implementers exploit
Down: Lower access cost
Out: Shrink size
Cost Model
Whack-a-Mole Analysis
Techniques from the Cases (Max slope reductions)
Arrays vs. Lists (6)
Sorting an Array (16)
Searching in a Static Array (3.5)
Vector Chains (3.6)
k-d Trees (13)
Markov Chains (6)
Cache-Conscious Coding
Traits of Fast Programs
Small structures
Arbitrary access -> Repeated -> Sequential
Top-Down Heapsort -> Bottom-Up -> Quicksort
Programming Techniques
Avoid pointers
Copy information
Links -> Arrays
Implicit structures
Respect cache size and alignment
Multiway branching
Compression and recomputation
Records -> Parallel arrays
Carry a signature of an object
Order operations to induce locality
