Cache-Conscious Algorithms and Data Structures
The document discusses techniques for designing algorithms and data structures to be cache-conscious and optimize for memory hierarchy. It explores how different algorithms like searches and sorts perform on various machines and cache levels. It advocates analyzing algorithms through a cache performance model and presents case studies where cache-optimized designs like multi-way tree searches and implicit k-d trees outperform standard implementations. The goal is to raise awareness of the memory hierarchy when programming complex applications.
1-Jun-00 Bentley: Cache-Conscious Algs & DS 1
Cache-Conscious Algorithms and Data Structures
Jon Bentley
Avaya Labs
  A Programming Puzzle
  A Cost Model
  Case Studies
  Principles

A Programming Puzzle
  Which is faster for representing sequences: arrays or lists?
  Technical details
    Random insertions into a sorted sequence
    Same sequence of comparisons
    Different overhead
      Pointer chasing in lists
      Knuth, v. 3: Search is 4C in arrays, 6C in lists
      Sliding a sequence of an array

A Testbed
  Main Loop in Pseudocode
    S = empty
    while S.size() < n
        S.insert(bigrand())
  About n^2/4 comparisons
  C++ Classes for Arrays and Linked Lists
  Which is faster?

An Experiment
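To repeat the experiment, the two insert operations behind the testbed might look like the C sketch below. It is an illustration only: the talk used C++ classes that are not shown, so the names ArraySeq, Node, array_insert and list_insert are hypothetical, and treating the sequence as a set (duplicates are skipped) is an assumption suggested by the size-based loop condition. Both routines make the same sequence of comparisons; they differ only in sliding part of an array versus chasing pointers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sorted-array sequence with a fixed capacity. */
typedef struct {
    int *x;
    size_t n, cap;
} ArraySeq;

/* Insert t into the sorted array unless it is already present.
   Linear scan, then slide the tail up by one. Returns 1 if inserted. */
int array_insert(ArraySeq *s, int t)
{
    size_t i = 0;
    while (i < s->n && s->x[i] < t)
        i++;
    if (i < s->n && s->x[i] == t)
        return 0;                       /* duplicate: set semantics (assumed) */
    assert(s->n < s->cap);              /* caller must provide room */
    memmove(&s->x[i + 1], &s->x[i], (s->n - i) * sizeof s->x[0]);
    s->x[i] = t;
    s->n++;
    return 1;
}

/* Hypothetical singly linked list node. */
typedef struct Node {
    int val;
    struct Node *next;
} Node;

/* Insert t into the sorted list unless it is already present.
   Same comparisons as array_insert, but each step follows a pointer.
   Returns 1 if inserted, 0 if duplicate, -1 if allocation fails. */
int list_insert(Node **head, int t)
{
    Node **p = head;
    Node *q;
    while (*p != NULL && (*p)->val < t)
        p = &(*p)->next;
    if (*p != NULL && (*p)->val == t)
        return 0;
    q = malloc(sizeof *q);
    if (q == NULL)
        return -1;
    q->val = t;
    q->next = *p;
    *p = q;
    return 1;
}
```

A driver would call one of these with random values until the sequence holds n elements, then divide the elapsed time by n^2 to get the quantity plotted in the figures.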
  [Figure: "ArrList -- PII 400". Average access time as a function of set size. x-axis: n (100 to 1,000,000); y-axis: nanosecs / n^2 (0 to 45); series: Arrays, Lists.]

Display on a Log Scale
  [Figure: "ArrList -- PII 400", log scale. x-axis: n (100 to 1,000,000); y-axis: nanosecs / n^2 (1 to 100); series: Arrays, Lists.]

Other Machines
  [Figures: "ArrList -- K6 400" and "ArrList -- R10000 250", each on linear and log scales. x-axis: n (100 to 1,000,000); y-axis: nanosecs / n^2; series: Arrays, Lists.]

Lessons Across Machines
  Knees at L1, L2, RAM boundaries
  Smaller structures have later knees
  In L1: All accesses are cheap
  Above L1: Sequential is faster than random
  RAM Caches

A Cost Model for Memory
  Goal: A Program to Estimate Access Costs
  The Key Loop (n is array size, d is delta)
    for (i = 0; i < count; i++) {
        sum += x[j];
        j += d;
        if (j >= n)
            j -= n;
    }
  A Real Program

Results of the Model
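A minimal C sketch of a program built around the key loop follows. It is not the program from the talk: the function names stride_sum and ns_per_access, the use of clock() for timing, and filling the array with ones are all assumptions made for illustration. The loop body itself matches the slide.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Make `count` accesses to x[0..n-1], stepping by d and wrapping at n.
   Requires 0 < d <= n so that a single subtraction is enough to wrap.
   The sum is returned so the compiler cannot discard the loop. */
long stride_sum(const int *x, size_t n, size_t d, long count)
{
    long sum = 0;
    size_t j = 0;
    assert(d > 0 && d <= n);
    for (long i = 0; i < count; i++) {
        sum += x[j];
        j += d;
        if (j >= n)
            j -= n;
    }
    return sum;
}

/* Estimate nanoseconds per access for array size n and delta d.
   Returns -1.0 if the array cannot be allocated. */
double ns_per_access(size_t n, size_t d, long count)
{
    int *x = malloc(n * sizeof *x);
    volatile long sink;                 /* keeps the timed call alive */
    clock_t start;
    double secs;

    if (x == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        x[i] = 1;

    start = clock();
    sink = stride_sum(x, n, d, count);
    secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    (void)sink;

    free(x);
    return 1e9 * secs / (double)count;
}
```

Calling ns_per_access for a range of n with d=1 and d=11 gives one point per call on curves of the kind the slides plot. Because clock() is coarse on many systems, count should be large enough for the run to take a noticeable fraction of a second.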
  [Figure: "MemEx -- PII 400". x-axis: n (100 to 1,000,000); y-axis: nanosecs / access (0 to 90); series: d=1, d=11.]

Other Machines
  [Figures: "MemEx -- K6 400" and "MemEx -- R10000 250". x-axis: n; y-axis: nanosecs / access; series: d=1, d=11.]

Trends Across Machines
  Same shapes, different constants
  Transitions at cache boundaries
  Constant cost in L1
  Sequential is cheaper above L1
  Differences grow substantially
  What happens with complex software?

Awk's Associative Arrays
  Interpretation and data structures dominate
  Algorithms in Awk are cache-insensitive
  [Figure: "Awk MemEx -- PII 400". x-axis: n (10 to 1,000,000); y-axis: microsecs / access (0 to 30); series: d=1, d=11.]

Sorting Algorithms
  How do different sorts behave under caching?
  Two easy O(n log n) sorts
    Quicksort
    Heapsort
  Which is faster?

Cache-Insensitive Sorting
  [Figure: "Sorts in Awk -- PII 400". x-axis: n (10 to 1,000,000); y-axis: microsecs / n (0 to 1800); series: qsort, hsort.]

Quicksort vs. Heapsort
  [Figure: "Sorting -- PII 400". x-axis: n (100 to 1E+07); y-axis: nanosecs / n (0 to 3500); series: qsort, hsort.]

Sorting on Other Machines
  [Figures: "Sorting -- K6 400" and "Sorting -- R10000 250". x-axis: n (100 to 1E+07); y-axis: nanosecs / n; series: qsort, hsort.]

Cache-Conscious Sorting
  Early work on tapes and disks
  LaMarca and Ladner, 1997 SODA
    Quicksort: Undo Sedgewick's final sort; one multiway partition
    Heapsort: Build towards root; multiway branching
    Merge Sort: Tiling (sort a cache-full in the first pass); multiway merge
    Radix Sort
    Detailed Analyses
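The tiling idea for merge sort can be sketched in C as follows. This is not LaMarca and Ladner's implementation: the function name tiled_merge_sort and the tile parameter are hypothetical, each tile is sorted with the standard library's qsort, and the later passes use plain two-way merges instead of the multiway merge the slide mentions. The point is only the first pass, which sorts one cache-sized tile at a time so that each tile is sorted while it is resident in cache.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge sorted runs a[lo..mid) and a[mid..hi) into out[lo..hi). */
static void merge_runs(const int *a, int *out,
                       size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        out[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid)
        out[k++] = a[i++];
    while (j < hi)
        out[k++] = a[j++];
}

/* Sort a[0..n-1]. `tile` is the number of elements assumed to fit in
   cache. Returns 0 on success, -1 if scratch space cannot be allocated.
   Overflow of the run width for extremely large n is not handled. */
int tiled_merge_sort(int *a, size_t n, size_t tile)
{
    int *buf, *src, *dst, *tmp;

    assert(tile > 0);
    if (n < 2)
        return 0;
    buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1;

    /* First pass: sort each tile independently. */
    for (size_t lo = 0; lo < n; lo += tile) {
        size_t len = (n - lo < tile) ? n - lo : tile;
        qsort(a + lo, len, sizeof *a, cmp_int);
    }

    /* Later passes: merge runs of doubling width, ping-ponging buffers. */
    src = a;
    dst = buf;
    for (size_t width = tile; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        tmp = src; src = dst; dst = tmp;
    }
    if (src != a)
        memcpy(a, src, n * sizeof *a);
    free(buf);
    return 0;
}
```

A reasonable starting value for tile is the L1 or L2 data cache size divided by sizeof(int), halved to leave room for other data; the best value is machine dependent and worth measuring.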

Searching
  A Rich History
    Represent 3-level subtrees on disk pages
    Linear search within pages, followed by multi-way branch
    Landauer (IEEE TEC, 1963; ISAM)
    B-Trees (Bayer and McCreight, 1970)
  Fun Problems
    Hashing (Binstock, DDJ April 1996)
    How to search in a (preprocessed) array?

Binary Search
  Array: 0 1 2 3 4 5 6
  Search Code
    l = 0; u = n-1;
    for (;;) {
        if (l > u) return -1;
        m = (l + u) / 2;
        if (x[m] < t) l = m+1;
        else if (x[m] == t) return m;
        else /* x[m] > t */ u = m-1;
    }

Timing Binary Search
  My First Timing Code
    // start clock
    for (i = 0; i < n; i++)
        assert(search(x[i]) == i);
    // end clock
  Problems?

Cache-Insensitive Search
  [Figure: "Searching in Awk -- PII 400". x-axis: n (10 to 100,000); y-axis: microsecs / search (0 to 600); series: Ordered, Scrambled.]

Observed Run Times
  [Figure: "Binary Search -- PII 400". x-axis: n (100 to 1,000,000); y-axis: nanosecs / search (0 to 3000); series: Seq, Scrambled.]

Timing Binary Search, cont.
  Whack-a-Mole Cost Model
  Final Timing Code
    // scramble perm vector p
    // start clock
    for (i = 0; i < n; i++)
        assert(search(x[p[i]]) == p[i]);
    // end clock
  A General Problem
  Perhaps a Solution?

HeapSearch
  Tree:
          3
      1       5
    0   2   4   6
  Array: 3 1 5 0 2 4 6
  Search Code
    p = 1;
    while (p <= n) {
        if (t == y[p]) return p;
        else if (t < y[p]) p = 2*p;
        else /* t > y[p] */ p = 2*p + 1;
    }
    return -1;

Multiway HeapSearch
  View as implicit, static B-trees
  b-way branching
    b=8 for 32-byte cache lines
    Aligned on cache boundaries
  Recursive code builds the array in linear time
  Speed up by loop unrolling

Search Performance
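To experiment with the binary (b=2) HeapSearch layout, the array can be built from a sorted array by an in-order walk of the implicit tree, as in the C sketch below. The slides say only that recursive code builds the array in linear time; the particular recursion here and the names fill_inorder, heap_layout and heap_search are assumptions. The search loop follows the slide, with y indexed from 1 and the children of node p at 2p and 2p+1.

```c
#include <assert.h>
#include <stddef.h>

/* Visit the implicit tree rooted at p in order, assigning the next
   unused element of sorted x to each node. Returns the new `next`. */
static size_t fill_inorder(const int *x, int *y, size_t n,
                           size_t p, size_t next)
{
    if (p > n)
        return next;
    next = fill_inorder(x, y, n, 2 * p, next);      /* smaller keys */
    y[p] = x[next++];                               /* this node */
    return fill_inorder(x, y, n, 2 * p + 1, next);  /* larger keys */
}

/* Build y[1..n] in heap order from sorted x[0..n-1].
   y must have room for n+1 elements; y[0] is unused. */
void heap_layout(const int *x, int *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    fill_inorder(x, y, n, 1, 0);
}

/* Search y[1..n] for t. Returns the position p with y[p] == t,
   or -1 if t is absent. */
long heap_search(const int *y, size_t n, int t)
{
    size_t p = 1;
    while (p <= n) {
        if (t == y[p])
            return (long)p;
        else if (t < y[p])
            p = 2 * p;
        else /* t > y[p] */
            p = 2 * p + 1;
    }
    return -1;
}
```

For the sorted input 0 1 2 3 4 5 6 this produces the array 3 1 5 0 2 4 6 shown on the HeapSearch slide. The top levels of the tree sit next to each other at the front of y, which is what makes the early probes of every search likely to hit in cache.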
  [Figure: "Ordered Searches, PII 400". x-axis: n (100 to 10,000,000); y-axis: nanosecs / search (0 to 3500); series: Binary, b=2, b=8 Simple, b=8 Unrolled.]

Searching on Other Machines
  [Figures: "Ordered Searches, K6 400" and "Ordered Searches, R10000 250". x-axis: n (100 to 1E+07); y-axis: nanosecs / search; series: Binary, b=2, b=8 Simple, b=8 Unrolled.]

A Philosophical Digression
  Approaches to Cache-Conscious Coding
    Head-in-the-sand big-ohs
    System Tools
      VTune
      Compilers (and more)
    Detailed Analyses
      LaMarca and Ladner
      Knuth's MMIX Simulator
    High-level, heuristic, machine-independent
  A Supermarket Analogy

Vector Chains
  What is the longest chain in a set of n vectors in 3-space?
    Erdos and Szekeres; Ulam; Baer and Brock; Logan and Shepp; Vershik and Kerov; Bollobas and Winkler; Odlyzko and Rains
  Key structure: a 2-d antichain
    Sequence of 2-d points with increasing x values and decreasing y values

Key Decisions
  Represent points as (x, y) pairs, not by pointers
  How to represent a sorted sequence of m = n^(1/3) points (n ~ 10^9)?
    STL Maps: Search in O(lg m), insert in O(lg m)
      Tiny code; guaranteed performance
    Sorted Arrays: Search in O(lg m); insert in O(m)
      Long (buggy) code; small and sequential

Run Times
  [Figure: "Vector Chains -- PII 400". x-axis: Active Vectors (10,000 to 1,000,000); y-axis: Nanosecs / Vector (0 to 45000); series: Array, STL.]

Other Machines
  [Figures: "Vector Chains -- K6-2 400" and "Vector Chains -- R10000 250". x-axis: Active Vectors (10,000 to 1,000,000); y-axis: Nanosecs / Vector; series: Array, STL.]

An Ancient Problem
  "Ideally one would desire an indefinitely large memory capacity such that any particular [word] would be immediately available. It does not seem possible to achieve such a capacity. We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
  Preliminary discussion of the logical design of an electronic computing instrument, Burks, Goldstine, von Neumann, 1946

k-d Trees
  Search for All Nearest Neighbors
  Internal Nodes (A Cutting Hyperplane)
    struct inode {
        char nodetype;
        char cutdim;
        int cutpt;
        iptr lokid;
        iptr hikid;
    };
  External Nodes (A Set of Points)
    Two indices into a perm vector of point indices

Cache-Conscious k-d Trees
  No pointers to (indices of) points
    Copy values (perhaps entire points)
  Implicit Tree
  Internal Nodes
    Parallel arrays: cutdim[], cutval[]
    Drop 24 bytes/node to 5
  External Nodes
    Permutation vector of (copies of) points
  Future
    Cluster subtrees by cache line size

Ordering the Searches
  Recall Testbed for Binary Search
    Searching for x[0], x[1], x[2], ... was very fast
    Random searches were slower (and more realistic)
  Neighbor Searches in Random Order
    for (i = 0; i < n; i++)
        nntab[i] = nnsearch(i);
  Searches in Permutation Order
    for (i = 0; i < n; i++)
        nntab[i] = nnsearch(perm[i]);

k-d Tree Run Times
  [Figure: "k-d Tree Search -- PII 400". x-axis: n (100 to 10,000,000); y-axis: nanosecs / search (0 to 16000); series: Simp-Rand, Simp-Perm, CC-Rand, CC-Perm.]

Times on Other Machines
  [Figures: "k-d Tree Search -- K6 400" and "k-d Tree Search -- R10000 250". x-axis: n (100 to 1E+07); y-axis: nanosecs / search; series: Simp-Rand, Simp-Perm, CC-Rand, CC-Perm.]

Caches in Programming Pearls
  Vector Rotation
    Dolphin vs. block swap vs. reversal
    Don't optimize {I/O, cache}-bound code
  Binary search
    Original testbed timed (adjacent, fast) searches
    Final timed random searches
  Set representations
    Weird times on arrays vs. lists
    STL sets thrash

Markov Text
  Order-1: The table shows how many contexts; it uses two or equal to the sparse matrices were not chosen. In Section 13.1, for a more efficient that ``the more time was published by calling recursive structure translates to build scaffolding to try to know of selected and testing
  Order-2: The program is guided by verification ideas, and the second errs in the STL implementation (which guarantees good worst-case performance), and is especially rich in speedups due to Gordon Bell. Everything should be to use a macro: for n=10,000, its run time;
  Order-3: A Quicksort would be quite efficient for the main-memory sorts, and it requires only a few distinct values in this particular problem, we can write them all down in the program, and they were making progress towards a solution at a snail's pace.

Markov Text Algorithms
  Original
    Data Structures
      Original text as one long string
      Suffix array of pointers to each word
    Algorithm
      Read input
      Sort words by k-grams
      Use binary search to make transitions
  Cache-Conscious Version
    Hash each word on input
    Replace a pointer to a text string with an index into the hash table
    Sort (copied) k-grams of hash indices

A Choice About Binary Search
  Find Equal Elements in a Sorted Array
  Warm Start
    l = binarysearch(t, 0, n-1, <)
    u = binarysearch(t, l, n-1, =)
  Cold Start
    l = binarysearch(t, 0, n-1, <)
    u = binarysearch(t, 0, n-1, =)
  Whack-a-Mole Analysis
    Details in DDJ, March 2000
  [Diagram: array regions labeled <, =, >, with markers l and u.]

Time of Markov Algorithms
  [Figure: "Markov Text -- PII 400". x-axis: n (10,000 to 10,000,000); y-axis: nanosecs / word (0 to 25000); series: pp2, cc1, cc2.]

Times on Other Machines
  [Figures: "Markov Text -- K6 400" and "Markov Text -- R10000 250". x-axis: n (10,000 to 10,000,000); y-axis: nanosecs / word; series: pp2, cc1, cc2.]

A Sampler of Related Work
  Cache-Conscious Databases, Object Code, Record Layouts, Compilers, Languages, ...
  Scientific Computing: Blocking, etc.
  LaMarca: Understanding and Optimizing Cache Performance
    www.lamarca.org/anthony/caches.html
  Board, Chatterjee, et al: TUNE
    www.cs.unc.edu/Research/TUNE/
  Vitter et al: External Memory Algorithms
    www.cs.duke.edu/~jsv/Papers/catalog/
  Frigo, Leiserson, et al: Cache-Oblivious Algorithms
    1999 FOCS

Lessons for Programmers
  Canonical Curves
    Experimenters beware
    Implementers exploit
      Down: Lower access cost
      Out: Shrink size
  Cost Model
    Whack-a-Mole Analysis
  Techniques from the Cases (Max slope reductions)
    Arrays vs. Lists (6)
    Vector Chains (3.6)
    Sorting an Array (16)
    k-d Trees (13)
    Searching in a Static Array (3.5)
    Markov Chains (6)

Cache-Conscious Coding
  Traits of Fast Programs
    Small structures
    Arbitrary access Repeated Sequential
    Top-Down Heapsort Bottom-Up Quicksort
  Programming Techniques
    Avoid pointers
      Copy information
      Links -> Arrays
      Implicit structures
    Respect cache size and alignment
      Multiway branching
      Compression and recomputation
    Records -> Parallel arrays
    Carry a signature of an object
    Order operations to induce locality
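
As one concrete illustration of the techniques listed above (avoid pointers, implicit structures, parallel arrays), the C sketch below stores the internal nodes of a k-d tree in two parallel arrays and finds children by index arithmetic. It is not the code from the talk: the names KdTree and kd_find_leaf, the fixed DIMS, and the heap-style numbering (children of node p at 2p and 2p+1) are assumptions. With a one-byte cut dimension and a four-byte cut value, each internal node takes 5 bytes, which agrees with the per-node figure on the k-d tree slide.

```c
#include <assert.h>
#include <stddef.h>

enum { DIMS = 2 };  /* dimensionality, chosen only for illustration */

/* Internal nodes of an implicit k-d tree, stored as parallel arrays.
   Nodes are numbered from 1; index 0 of each array is unused.
   Any index greater than ninternal denotes an external node. */
typedef struct {
    const unsigned char *cutdim;  /* cutdim[p]: dimension that node p cuts */
    const int *cutval;            /* cutval[p]: position of the cut */
    size_t ninternal;             /* internal nodes are 1..ninternal */
} KdTree;

/* Descend from the root to the external node whose region contains pt.
   Returns that node's implicit index. No pointers are followed. */
size_t kd_find_leaf(const KdTree *t, const int pt[DIMS])
{
    size_t p = 1;
    assert(t != NULL && pt != NULL);
    while (p <= t->ninternal) {
        if (pt[t->cutdim[p]] < t->cutval[p])
            p = 2 * p;          /* low side of the cutting hyperplane */
        else
            p = 2 * p + 1;      /* high side */
    }
    return p;
}
```

A caller would map the returned index to a slice of the permutation vector of copied points and scan that slice sequentially.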