
A Systolic Array For Implementing LRU Replacement: September 2002

The document describes a systolic array implementation of LRU cache replacement that can handle an arbitrarily high degree of associativity. It presents a simple systolic array design that maintains LRU information by advancing line indices through the array. On a cache hit, the index is advanced until a match is found, at which point indices are copied to maintain ordering. The design is modified to allow one cache access per cycle by removing registers to effectively double the clock rate. It is further extended to support multiple accesses per cycle.



A Systolic Array for Implementing LRU Replacement

Article · September 2002


Source: CiteSeer


1 author:

J.P. Grossman
D. E. Shaw Research

All content following this page was uploaded by J.P. Grossman on 30 June 2013.



ARIES-TM-18, March 13, 2002

A Systolic Array for Implementing LRU Replacement


J.P. Grossman

Project Aries Technical Memo ARIES-TM-18
Artificial Intelligence Laboratory
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, MA, USA

Sponsored by DARPA/AFOSR Contract Number F306029810172

Abstract

Increasing the associativity of a cache reduces both the miss rate and the power consumption. It also makes LRU replacement more difficult to implement. We present a simple systolic array that can be used to implement LRU replacement in arbitrarily associative caches.

1 Introduction

One of the important design parameters of a hardware cache is its degree of associativity. Increasing a cache’s associativity improves performance by reducing the miss rate [Hennessy96] and leads to a lower power implementation [Zhang00]. However, it also makes a least recently used (LRU) replacement policy more difficult to implement. As a result, hardware designers opt for simpler replacement strategies such as round-robin [Clark01], even though the LRU policy is known to provide better performance [Smith82].

In this paper we present a simple systolic array that can keep track of LRU information for a set of cache lines. Since the length of the critical path is constant, the approach can be used for N-way associative caches with N arbitrarily large. We begin by constructing a systolic array that can handle one cache access on every other cycle. In Section 3 we modify the design to allow a cache access on every cycle. Finally, in Section 4 we show how to accommodate multiple accesses per cycle.

[Figure 1: Keeping track of LRU information using (a) an atomically updated list (b) a systolic array. In (a), presenting L=3 to the list 4 1 3 0 5 2 yields 4 1 0 5 2 3. In (b), the indices 1 and then 2 are presented to an array initially holding 2 1 3 0; the successive array states shown are 2 1 3 0, 2 3 3 0, 3 3 0 0 and 3 0 0 1.]

2 Implementation

The central idea is to maintain a list of cache line indices sorted from LRU to MRU (most recently used). When a cache line is accessed its line index L is presented to the list, and that index is rotated to the MRU position at the end (Figure 1a). We can implement this list as a systolic array by advancing L one node per clock cycle, along with a single-bit “matched” signal M, indicating whether or not the index has found a match within the array. Until a match is found, L is advanced without any changes being made. Once a match is found, nodes begin copying values from their neighbours to the right. Finally, L is deposited in the last node. This is illustrated in Figure 1b.

We can use the same design for all nodes by wiring together the last node’s inputs, as shown in Figure 1b. This ensures that L will be deposited because by the end of the array we are guaranteed that M=1, so the last node will attempt to copy a value from the right, and with the inputs wired together this value is L. Note that we can only present indices to the array on every other cycle. For example, if in Figure 1b ‘2’ were presented on the cycle immediately following ‘1’, then the value ‘1’ would erroneously be copied into the first node instead of the correct ‘3’.

Figure 2 shows a hardware implementation of the systolic array node. The forward signals are the line index L and the match bit M; the backward signal is the current index, which is used to shift values when M=1. The node contains two logN-bit registers, one single-bit register, a logN-bit multiplexer, a logN-bit comparator, and an OR gate. No extra hardware is required to set up the array, as it can be initialized simply by setting M=1 and presenting all N line indices in N consecutive cycles followed by N copies of the last index (N – 1) in the next N consecutive cycles.

[Figure 2: Systolic array node. Labels: L, M, index, current index, comparator (=?).]

In normal operation the input M to the first node is always 0. On a cache hit, the line index L is presented to the array. On a cache miss, the output of the first node gives the LRU line index; this line is replaced and the index is fed back into the array. On a cycle with no cache activity, the index of the most recently accessed line is presented, which does not change the state of the array (this technique avoids the need for a separate “valid” bit).

3 One Access per Cycle

To accommodate one cache line access per cycle, the systolic array must be modified to behave as though it were being clocked twice as fast. We can accomplish this by simply removing every other set of forward registers and modifying the backward ‘index’ signal slightly to obtain the new node implementation shown in Figure 3. The index signal is taken from the input rather than the output of the bottom register to ensure that when the previous node attempts to copy the index value, it obtains the value that would be stored in this register after the node has finished processing its current inputs. This new systolic array, which contains N/2 nodes, can be initialized by presenting all N line indices in N consecutive cycles with M=1.

[Figure 3: Modified systolic array node. Labels: L, M, index, index0, index1, comparators (=?).]

4 Multiple Accesses per Cycle

Some modern processors allow up to four cache accesses on a single cycle [Bradley02]. A systolic array capable of maintaining LRU information with k > 1 accesses per cycle is naturally more complicated, but can be implemented as follows: Each node holds 2k indices. There are k forward line index signals L1, …, Lk, and we replace the match bit M with a log(k+1)-bit match counter indicating how many of the index signals have found their match. The comparator and OR gate used to update M are replaced by k comparators, a k-input OR gate, and a log(k+1)-bit incrementer. Each index register is fed by a (k+1)-input multiplexer taking its inputs from that register as well as the k above it; the multiplexer is controlled by M. Finally, a given multiplexer input connects to an index register’s output if the register is in the same node, otherwise it connects to the index register’s input. This generalizes the node design presented in Section 3; Figure 4 shows the resulting design for k = 2.

[Figure 4: Systolic array node for 2 accesses per cycle. Labels: L1, L2, M, index1, index2, comparators (=?), incrementers (+).]

References

[Bradley02] David Bradley, Patrick Mahoney, Blaine Stackhouse, “The 16kB Single-Cycle Read Access Cache on a Next-Generation 64b Itanium Microprocessor”, Proc. ISSCC 2002, pp. 110-111.
[Clark01] Lawrence T. Clark, Eric J. Hoffman, Jay Miller, Manish Biyani, Yuyun Liao, Stephen Strazdus, Michael Morrow, Kimberley E. Velarde, Mark A. Yarc, “An Embedded 32b Microprocessor Core for Low-Power and High-Performance Applications”, IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, November 2001, pp. 1599-1608.
[Hennessy96] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Mateo, CA, 1996.
[Smith82] A. J. Smith, “Cache Memories”, ACM Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.
[Zhang00] Michael Zhang, Krste Asanović, “Highly-Associative Caches for Low-Power Processors”, Proc. Kool Chips Workshop, 33rd International Symposium on Microarchitecture, Monterey, CA, Dec. 2000.
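
The behaviour described in Section 2 can be checked with a small cycle-level software model. The following Python sketch is illustrative only and is not taken from the memo: the names lru_reference, simulate and gap are assumptions introduced here, an empty pipeline slot stands in for a cycle on which no index is presented, and the M=1 initialization sequence is not modelled (the index registers are simply preloaded). Each node compares the incoming index L with its current index, ORs the result into M, and, when M=1, copies the value held by its right-hand neighbour at the start of the cycle; the last node copies L itself, as if its inputs were wired together.

```python
def lru_reference(order, line):
    """Atomically updated list (Figure 1a): rotate `line` to the MRU end."""
    return [x for x in order if x != line] + [line]


def simulate(initial, accesses, gap=2):
    """Cycle-level sketch of the Section 2 array.

    initial  -- index registers, LRU first, MRU last
    accesses -- line indices presented to the first node, in order
    gap      -- cycles between presentations (2 = every other cycle)
    """
    n = len(initial)
    index = list(initial)      # index register of each node
    pipe = [None] * n          # (L, M) token at each node's input, or None

    # Presentation schedule: each access followed by gap-1 empty slots,
    # then n empty cycles so the last token drains through the array.
    schedule = []
    for line in accesses:
        schedule += [line] + [None] * (gap - 1)
    schedule += [None] * n

    for presented in schedule:
        # The M input of the first node is always 0 in normal operation.
        pipe[0] = None if presented is None else (presented, False)
        new_index = list(index)
        out = [None] * n
        for i, token in enumerate(pipe):
            if token is None:
                continue
            line, matched = token
            matched = matched or line == index[i]   # comparator + OR gate
            if matched:
                # Copy from the right neighbour's value at the start of the
                # cycle; the last node deposits L itself.
                new_index[i] = index[i + 1] if i + 1 < n else line
            out[i] = (line, matched)
        index = new_index              # all registers update together
        pipe = [None] + out[:-1]       # tokens advance one node per cycle
    return index


print(simulate([2, 1, 3, 0], [1, 2]))          # [3, 0, 1, 2]
print(simulate([2, 1, 3, 0], [1, 2], gap=1))   # [1, 3, 0, 2]
```

With the default spacing the model reproduces the Figure 1b example and agrees with the list of Figure 1a. With gap=1 the first node picks up the stale value 1 instead of 3, which is the hazard noted in Section 2.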
