Caches
In the second part of this book, we shall focus on the design of the memory system. To sustain a high
performance pipeline, we need a high performance memory system. Otherwise, we will not be able to realise
the gains of having a high performance pipeline. It is like having a strong body and a strong mind. Unless
we have a strong body, we cannot have a strong mind, and vice versa.
The most important element in the on-chip memory system is the notion of a cache that stores a subset
of the memory space, and the hierarchy of caches. In this section, we assume that the reader is well aware
of the basics of caches, and is also aware of the notion of virtual memory. We shall provide a very brief
introduction to these topics in this section for the sake of recapitulation. However, this might be woefully
insufficient for readers with no prior background. Hence, readers are requested to take a look at some basic
texts such as [Sarangi, 2015] to refresh their basics.
In line with this thinking, we shall provide a very quick overview of cache design in Section 7.1 and
virtual memory in Section 7.2. Then, we shall move on to discuss methods to analytically estimate the area,
timing, and power consumption of caches in Section 7.3. This will give us a practical understanding of the
issues involved in designing caches. We shall then extend this section to consider advanced cache design
techniques in Section 7.4.
We will then proceed to look at a very unconventional design – Intel’s trace cache – in Section 7.5. It is
designed to store sequences of instructions, whereas conventional caches are designed to store just blocks of
bytes. Storing traces gives us a lot of benefits. In most cases we can completely skip the fetch and decode
stages.
The next half of the chapter focuses on using methods to improve the efficiency of the memory system by
using complicated logic that resides outside the caching structures. In Sections 7.6 and 7.7, we shall focus on
prefetching mechanisms where we try to predict the memory blocks that are required in the near future and
try to fetch them in advance. This reduces the average memory latency. Prefetching techniques are highly
effective for both instructions and data.
The idea of storing both programs and data in the memory system is one of the most revolutionary ideas in the history of computing. The credit goes to early pioneers such as Alan Turing
and John von Neumann.
Figure 7.1: A simple processor-memory system (Von Neumann architecture)
In a simplistic model we have the processor connected to the memory system as shown in Figure 7.1:
the memory system stores both instructions and data. This is the Von Neumann architecture. The main
problem with this organisation is that a single unified memory is too large and too slow. If every memory
access takes 100s of cycles, the IPC will be less than 0.01. A ready fix to this issue is to use registers, which
are named storage locations within the processor. Each register takes less than a cycle to access and this is
why we use registers for most instructions. However, to keep the register file fast, we need to keep it small.
Hence, we have a limited number of on-chip registers. The number is typically limited to 8 or 16.
Compilers often run out of registers while mapping variables to registers. It is thus necessary to spill the
values of some registers to memory to free them. The spilled values can then be read back from memory,
whenever we need to use them. Additionally, it is also necessary to store the registers to memory before
calling a function. This is because the function may overwrite some of the registers, and their original
contents will be lost. Finally, we need to restore their values once the function returns. Along with local
variables in functions, most programs also use large arrays and data structures that need to be stored in
the memory system. Because of multiple such reasons, memory accesses are an integral part of program
execution. In fact, memory instructions account for roughly a third of all the instructions in most programs.
We have till now discussed only data accesses. However, instructions are also stored in the memory
system, and every cycle we need to fetch them from the memory system. The part of the memory system
that stores the instructions is traditionally known as the instruction memory. Having one single memory
for both instructions and data is an inefficient solution because it needs to simultaneously provide both
instructions and data. This increases the overheads significantly. Hence, a more practical method is to split
the unified memory into separate instruction and data memories as shown in Figure 7.2. This is known as
the Harvard architecture.
To store all the instructions and data in a modern program, we need large memories. Large memories
are slow, have large area, and consume a lot of power. Another direct consequence of a large memory size is
that such memories cannot be fit within the CPU. They need to reside outside the CPU (off-chip). Accessing
off-chip memory (also referred to as main memory) for both instructions and data is simply not practical or feasible. In most processors, it takes 200-300 cycles to get data back from off-chip memory. This will
decrease our IPC significantly.
Figure 7.2: Separate instruction and data memories (the Harvard architecture)
Fortunately, most programs spend a large fraction of their time executing loops. Loops are typically small pieces of code, particularly when we consider the entire
program. Most programs typically move from function to function. In each function they do some intense
activity within a hierarchy of loops, and then move to another function where they do the same. Such patterns are very common in programs, and are often referred to using two very special
names: spatial locality and temporal locality.
Spatial Locality
Spatial locality refers to a pattern where we access objects that are in some sense proximate (close by) in a
small interval of time. Before looking at deeper aspects of this definition, let us explain with an example.
Consider the instructions in a loop. In terms of PC addresses, we access instructions that have addresses that
are close by. Thus, we have spatial locality. Similarly, when we are accessing an array, we also have spatial
locality if we are accessing it sequentially – from indices 0 to N . Spatial locality is an inherent property of
most programs that we use in our everyday life. As a result, most computer architects take spatial locality
in programs for granted.
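As a concrete illustration, consider the small C routine below. This is a sketch written for this discussion (the array and function names are arbitrary); it shows where the two kinds of locality named above come from.

```c
#include <stddef.h>

/* Sketch: summing an array sequentially.
   Spatial locality: arr[i] and arr[i+1] are only 4 bytes apart, and they are
   accessed within a few cycles of each other.
   Temporal locality (discussed below): i, sum, and the loop's own
   instructions are accessed in every single iteration. */
long sum_array(const int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}
```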
Note that there is some vagueness in the definition. We have not precisely defined what exactly we mean
by “a small interval of time”. Is it in nanoseconds, microseconds, or hours? This is subjective, and depends
on the scenario.
Consider a loop. If in every iteration of a loop we access an array location in sequence, we have spatial
locality because of the way we are accessing the array. Now assume that we take the same loop and in every
iteration we insert a function call that takes a few thousand cycles to complete. Do we still have spatial
locality? The answer is, no. This is because it is true that we are accessing nearby addresses in the array;
however, this is being done across thousands of cycles. This is by no means a small interval of time as
compared to the time it takes to access a single location in an array. Hence, we do not have spatial locality.
Let us further augment the code of the loop to include an access to the hard disk that takes a million
cycles. Let the hard disk accesses be to consecutive sectors (blocks of 512 bytes on the disk). Do we have
spatial locality? We do not have spatial locality for the array accesses; however, we do have spatial locality
for the disk accesses. This is because in the time scale of a disk access (a few million cycles), the instructions
of the loop qualify as a “small interval of time”. The summary of this discussion is that we need to deduce
spatial locality on a case by case basis. The accesses that we are considering to be “spatially local” should
occupy a non-trivial fraction of the interval under consideration.
A related concept is the notion of the working set. It is defined as the set of memory addresses that a
program accesses repeatedly in a short time interval. Here again, the definition of short is on the same lines
as the definition for spatial locality – there is a degree of subjectivity. This subjectivity can be reduced if
we realise that program execution can typically be divided into phases: in each phase a program executes
the same piece of code and accesses the same region of data over and over again, and then moves to another
region of the code – the next phase begins. We typically have spatial locality for accesses within a phase,
and the set of addresses accessed repeatedly in a phase comprise the working set at that point of time.
Definition 38
We can divide program execution into phases, where each phase has a distinct pattern in terms of
instruction and data accesses. Within a phase, we typically access data that is proximate in terms of
memory addresses – this is known as spatial locality. The set of addresses accessed repeatedly in a phase
comprises the working set of the program at that point of time.
Temporal Locality
Let us consider a program with loops once again. There are some variables and regions of memory that we
tend to access frequently in a small interval of time. For example, in a loop we access the loop variables
frequently, and also we execute the instructions in the loop in every iteration. Even while walking through
an array we access the base register of the array on every access. Such patterns, where we keep accessing the
same memory locations over and over again, are referred to as temporal locality. Note that temporal locality
has been found to be a general property in most programs. Temporal locality is observed while accessing
instructions or data. In fact, we can see temporal locality in almost all memory and storage structures inside
the chip.
Most schemes in computer architecture are designed to make use of temporal locality. For example, a
branch predictor uses this fact to keep a small table of saturating counters. The expectation is that the hit
rate (probability of finding an entry) in this table will be high; this is guaranteed by temporal locality. The
branch target buffer operates on a similar principle. Even predictors such as value predictors or dependence
predictors rely on the same phenomenon. Had we not had temporal locality, most of our architectural
structures would have never come into being. In the case of the memory system as well, we shall explicitly
rely on temporal locality in our memory access patterns.
Definition 39
Temporal locality refers to an access pattern where we repeatedly access the same locations over and over
again in a small interval of time.
The bottom line is that caches store frequently accessed data and instructions. Hence, due to temporal locality, we are
guaranteed to see a high cache hit rate (probability of successfully finding a value).
Definition 40
If we find a value in the cache, then this event is known as a cache hit, otherwise it is known as a cache
miss.
Figure 7.3: The pipeline with an instruction cache (i-cache) and a data cache (d-cache), backed by a combined memory that stores both instructions and data
Taking inspiration from the Harvard and Von Neumann architectures, we arrive at the design in Fig-
ure 7.3, where the pipeline reads in data from the instruction cache (i-cache), and reads or writes data to
the data cache (d-cache). These are small caches, which as of 2020 are from 16 KB to 64 KB in size. They
are referred to as the level 1 or L1 caches. Often when people use the word “L1 cache” they refer to the
data cache. We shall sometimes use this terminology. The usage will be clear from the context.
Observe that in Figure 7.3, we have the processor, two caches, and a combined memory that stores
instructions and data: we have successfully combined the Harvard and Von Neumann paradigms. The access
protocol is as follows. For both instructions and data, we first access the respective caches. If we find the
instruction or data bytes, then we use them. This event is known as a cache hit. Otherwise, we have a cache
miss. In this case, we go down to the lower level, which is a large memory that contains all the code and
data used by the program. It is guaranteed to contain everything (all code and data bytes). Recall that we
store instructions as data in structures that store both kinds of information. Finally, note that this is a very
simplistic picture. We shall keep on refining it, and adding more detail in the subsequent sections.
Up till now we have not taken advantage of spatial locality. Let us take advantage of it by grouping bytes
into blocks. Instead of operating on small groups of bytes at a time, let us instead create blocks of 32 or 64
bytes. Blocks are atomic units in a cache. We always fetch or evict an entire block in one go – not in smaller
units. One advantage of this is that we automatically leverage spatial locality. Let’s say we are accessing
an array, and the array elements are stored in contiguous memory locations as is most often the case. We
access the first element, which is 4 bytes wide. If the block containing the element is not in the cache, then
we fetch the entire block from the memory to the cache. If this element was at the beginning of the 32-byte
block, then the remaining 28 bytes are automatically fetched because they are a part of the same block. This
means that for the next 7 accesses (28 bytes = 7 accesses * 4 bytes/access), we have the elements in the
cache. They can be quickly accessed. In this case, by creating blocks, we have taken advantage of spatial
locality, and consequently reduced the time it takes to access the array elements.
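The arithmetic can be illustrated with a short C program. The base address and the 32-byte block size below are assumptions carried over from the example in the text.

```c
#include <stdio.h>

#define BLOCK_SIZE 32u   /* 32-byte blocks, as in the running example */

int main(void) {
    /* Hypothetical array of 4-byte integers starting at this address. */
    unsigned int base = 0x1000;
    for (unsigned int i = 0; i < 8; i++) {
        unsigned int addr   = base + 4 * i;          /* address of element i */
        unsigned int block  = addr / BLOCK_SIZE;     /* block address        */
        unsigned int offset = addr % BLOCK_SIZE;     /* byte offset in block */
        printf("element %u -> block %u, offset %2u\n", i, block, offset);
    }
    /* All 8 elements fall in block 0x1000/32 = 128, so after the first miss
       the remaining 7 accesses hit in the cache. */
    return 0;
}
```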
Way Point 7
We now know that memory access patterns exhibit both temporal and spatial locality. Most modern
memory systems take advantage of these properties. This is done as follows:
• We create a small structure called a cache that stores the values corresponding to a subset of the
memory addresses used by the program. Because of temporal locality, we are guaranteed to find
our data or instructions in the caches most of the time. Two such structures that most processors
typically use are the instruction cache (i-cache) and the data cache (d-cache).
• To take advantage of spatial locality, we group consecutive bytes into blocks. Furthermore, we treat
blocks atomically, and fetch or evict data at the granularity of blocks within the memory system.
The advantage of fetching 32-64 bytes at once is conspicuously visible when we are accessing a
sequence of contiguous instructions or accessing an array. If we read a memory word (4 bytes),
then with a very high probability we shall find the next memory word in the same block. Since
the block has already been fetched from memory, and kept in the cache, the access time for other
memory words in the same block will get reduced significantly. In other words, if our access pattern
has spatial locality, then we will find many memory words in the cache because other words in the
same blocks would have already been fetched. This will reduce our overall memory access time.
Figure 7.4: Hierarchy of caches (L1 caches, L2 cache, L3 cache, and main memory)
Example 3
A natural question that might arise is: why do we stop at 2 or 3 cache levels? Why do we not have 7 or
8 levels of caches?
Answer: Caches are not free; they have a cost in terms of transistor area. Given that
we cannot synthesise very large silicon dies, we have to limit the number of transistors that we place on
the chip. The typical silicon die size is about 200 mm² for desktop processors and 400-500 mm² for
server class processors. In larger dies, we can place more transistors; however, we cannot place enough
to create additional caching levels.
Also note that as we go to lower and lower cache levels the miss rates typically become high and
saturate. The incremental benefit of having more cache levels goes down.
Finally, additional layers of caches introduce additional overheads in terms of the area and power
consumed by the caches. Moreover, the miss penalty increases for a block that is being accessed for the
first time, because now the request has to pass through multiple caches.
Figure 7.5: Splitting a 32-bit memory address into two parts: a 26-bit block address and a 6-bit byte offset
Typically, the terms “cache line” and “cache block” are used synonymously and interchangeably. However,
we shall use the term “cache line” to refer to the entire entry in the cache and the term “cache block” to
refer to the actual, usable contents: 64 bytes in the case of our running example. A cache line thus contains
the cache block along with some additional information. Typically, this subtle distinction does not matter
in most cases; nevertheless, if there are two terms, it is a wise idea to precisely define them and use them
carefully.
Now, we are clearly not going to search all 1024 entries for a given block address. This is too slow, and
too inefficient. Let us take a cue from a course in computer algorithms, and design a method based on the
well known technique called hashing [Cormen et al., 2009]. Hashing is a technique where we map a large
set of numbers to a much smaller set of numbers. This is a many-to-one mapping. Here, we need to map a
26-bit space of block addresses to a 10-bit space of cache lines (1024 = 2¹⁰).
The simplest solution is to extract the 10 LSB (least significant) bits from the block address, and use
them to access the corresponding entry in the cache, which we are currently assuming to be a simple one-
dimensional table. Each entry is identified by its row number: 0 to 1023. This is a very fast scheme, and is
very efficient. Extracting the 10 LSB bits and then using them to access a hardware table is a very quick
and power efficient operation. Recall that this is similar to how we were accessing the branch predictor.
However, there is a problem. We can have aliasing, which means that two block addresses can map to
the same entry. We need to have a method to disambiguate this process. We can do what was suggested
way back in Section 3.2, which is to store some additional information with each entry. We can divide a
32-bit address into three parts: 16-bit tag, 10-bit index, and 6-bit offset.
The 10-bit index is used to access the corresponding line in the cache, which for us is a one-dimensional
table (or an array). The 6-bit offset will be used to fetch the byte within the block. And, finally the 16-bit
tag will be used to uniquely identify a block. Even if two separate blocks have the last 10 bits of the block
address (index) in common, they will have different tags. Otherwise, the block addresses are the same,
and the blocks are not different (all 26 bits of the block address are common). This is pictorially shown in
Figure 7.6.
Let us explain this differently. Out of a 32-bit memory address, the upper 26 bits (more significant)
comprise the block address. Out of this, the lower 10 bits form the index that we use to access our cache.
The upper 16 bits can thus vary between two block addresses that map to the same line in the cache.
However, if this information is also stored along with each line in the cache, then while accessing the cache
we can compare these 16 bits with the tag part of the memory address, and decide if we have a cache hit or
miss. We are thus comparing and using all the information that is present in a memory address. There is
no question of aliasing here. Figure 7.7 explains this concept graphically.
Figure 7.7: The concept of the tag explained along with aliasing
Let us now take a look at one of the simplest cache designs that uses this concept.
Figure 7.8: A direct mapped cache (the index selects an entry in the tag and data arrays; a comparison with the tag part of the address generates the hit/miss signal)
Let us summarise the process with a set of bullet points. Assume that the function tag(A) provides the tag part (upper 16 bits in this case) of the block address. Here, A is the block address of the memory access, and A′ is the block address whose block currently resides at the indexed line in the cache. We will use A and A′ in the following description.
• Every byte is uniquely identified by a 32-bit memory address.
• Out of these 32 bits, we have taken out the 6 least significant bits because they specify the offset of a
byte within a block. We are left with 26 bits.
• Out of these 26 bits, we have used the 10 least significant bits to index the tag and data arrays. By
this point, we have used 16 bits out of the 32-bit address. This means that between addresses A and
A′, we have the lower 16 bits in common. We are not sure of the upper (more significant) 16 bits. If
they match, then the addresses are the same.
• The upper 16 bits are the tag part of the address. Thus, the problem gets reduced to a simpler problem:
check if tag(A) = tag(A′).
• The upper 16 bits of A′ are stored in the tag array. We need to compare them with the upper 16 bits
of A, which is tag(A). These bits can be easily extracted from the address A and sent to a comparator
for comparison, as shown in Figure 7.8.
Thus, accessing a direct mapped cache is per se rather simple. We access the data and tag arrays in
parallel. Simultaneously, we compute the tag part of the address. If there is a tag match – the tag part of
the address matches the corresponding contents of the tag array – then we declare a hit, and use the contents
of the corresponding entry in the data array. Otherwise, we declare a miss.
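The following C sketch mirrors this procedure for our running example (a 64 KB direct mapped cache with 64-byte blocks and 32-bit addresses). The arrays are only software stand-ins for the tag and data arrays; real hardware performs the indexing and the tag comparison with decoders and comparators, not with code.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 1024

static uint16_t tag_array[NUM_LINES];        /* 16-bit tags      */
static bool     valid[NUM_LINES];            /* valid bits       */
static uint8_t  data_array[NUM_LINES][64];   /* 64-byte blocks   */

bool dm_lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t offset = addr & 0x3F;           /* bits [5:0]   */
    uint32_t index  = (addr >> 6) & 0x3FF;   /* bits [15:6]  */
    uint32_t tag    = addr >> 16;            /* bits [31:16] */

    if (valid[index] && tag_array[index] == (uint16_t)tag) {
        *byte_out = data_array[index][offset];   /* cache hit  */
        return true;
    }
    return false;                                /* cache miss */
}
```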
The logic for having separate tag and data arrays will be clear in Section 7.3. Let us proceed to look at
other variants of caches.
Figure 7.9: A set of CAM cells. The match line is a wired-AND bus (computes a logical AND of all the inputs).
However, we can also access the array by the contents of an array element, and get the index if the array
contains the value. For example, we can issue the statement get(vals, 10). In this case, the answer will be
3 because the value 10 exists at the array index 3 (we start counting from 0). CAM arrays can be used to
see if a given value exists within an array, and for finding the index of the row that contains the value. If
there are multiple copies of the same value, then we can either return the lowest index, or any of the indices
at random. The behaviour in such cases is often undefined.
Now let us use the CAM array to build a cache (refer to Figure 7.10). We show a simple example with
only 4 entries (can be scaled for larger designs). For our problem at hand, the CAM array is the best
structure to create the tag array. The input is the tag part of the memory address; we need to quickly
search if it is contained within any row of the tag array. In this case, we do not have an index. Instead, the
address is split into two parts: 6-bit offset and 26-bit tag. The tag is large because we are not dedicating
any bits to index the tag array. The output will be the index of the entry that contains the same tag or
a miss signal (tag not present). Once we get the index, we can access the data array with that index, and
fetch the contents of the block. Note that there is a one-to-one correspondence between the entries of the
tag array and the data array.
Figure 7.10: Fully associative cache (the tag array is built out of CAM cells; an encoder produces the index of the matching entry, which is used to access the data array)
The main technological innovation that allowed us to build this cache, which is called a fully associative
cache, is the CAM array. It allows for a quick comparison with the contents of each entry, and thus we gain
the flexibility of storing an entry anywhere in the array. Even though such an approach can reduce the miss
rate by taking care of aliasing, it has its share of pitfalls.
The CAM cell is large and slow. In addition, the process of comparing with each entry in the CAM array
is extremely inefficient when it comes to power. This approach is not scalable, and it is often very difficult
to construct CAM arrays of any practical significance beyond 64 entries. It is a very good structure when
we do not have a large number of entries (≤ 64).
We have seen a direct mapped cache and a fully associative cache. The former has a low access time, and
the latter has a low miss rate at the cost of access time and power. Is a compromise possible?
Indeed, it is possible. Such a design is called a set associative cache. Here, we offer limited flexibility,
and do not significantly compromise on the access time and power. The idea is as follows. Instead of offering
the kind of flexibility that the fully associative cache provides, let us restrict a given block address to a few
locations instead of all the locations. For example, we can restrict a given address to 2 or 4 locations in the
tag and data arrays.
Let us proceed as follows. Let us divide the tag array into equal-sized groups called sets. Sets often contain
2, 4, or 8 blocks. Let us implement our running example by creating a 4-way set associative cache, where
each set contains 4 entries. In general, if a set contains k entries, we call it a k-way set associative cache.
Each entry in the set is called a way. In other words, a way corresponds to an entry in the tag array and
data array that belongs to a given set. In a k-way set associative cache we have k ways per set.
Definition 41
If a set contains k entries, we call it a k-way set associative cache. Each entry in the set is called a way.
Recall that we needed to create a 64 KB cache with a 64-byte block size in a 32-bit memory system. The
number of blocks in the data array and the number of entries in the tag array is equal to 1024 (64 KB/ 64
B). As we had done with direct mapped caches, we can split the 32-bit memory address into a 6-bit offset
and 26-bit block address.
Let us now perform a different kind of calculation. Each set has 4 entries. We thus have 1024/4 = 256
sets. The number of bits required to index each set is log₂(256), which is 8. We can thus use the 8 LSB bits
of the block address to find the number or index of the set. Once we get the number of the set we can access
all the entries in the set. We can trivially map the entries in the tag array to sets as follows. For a 4-way
set associative cache, we can assume that entries 0 to 3 belong to set 0, entries 4 to 7 belong to set 1, and
so on. To find the starting entry of each set, we just need to multiply the set index by 4, which is very easy
to do in binary (left shift by 2 positions).
Till now we have used 6 bits for the offset within the block and 8 bits for the set id. We are left with
18 (32 - 14) bits. This is the part of the address that is not common to different addresses that map to the
same set, and thus by definition it is the tag. The size of the tag is thus 18 bits. The breakup of an address
for a 4-way set associative cache in our running example is shown in Figure 7.11.
To access (read/write) a data block, we first send the address to the tag array. We compute the set index,
and access all the entries belonging to that set in parallel. In this case, we read out 4 tags (corresponding
to each way) from the tag array, and compare each of them with the tag part of the address using a set of
comparators. If there is no match, then we have a miss, otherwise we have a hit. If the tags match, then
we can easily compute the index in the tag array that contains the matching tag with an encoder. Refer to
Figure 7.12 for a representative design of a set associative cache.
We subsequently index the data array with the index and read out the data block. Sometimes, for
the sake of efficiency we can read out the contents of the 4 corresponding data blocks in parallel. After
computing the tag match, we can choose one of the blocks as the final output. This is faster because it
creates an overlap between accessing the tag array and reading the data blocks; however, it is inefficient in
terms of power because it involves extra work.
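A software sketch of the lookup in our running 4-way set associative example (256 sets, 18-bit tags) is shown below. In hardware the four tag comparisons happen in parallel rather than in a loop; the code is only meant to make the address arithmetic concrete.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256
#define WAYS     4

static uint32_t tags[NUM_SETS][WAYS];     /* 18-bit tags (stored in 32 bits) */
static bool     valid[NUM_SETS][WAYS];
static uint8_t  data[NUM_SETS][WAYS][64];

/* Returns the matching way (0..3) on a hit, or -1 on a miss. */
int sa_lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t offset = addr & 0x3F;          /* bits [5:0]   */
    uint32_t set    = (addr >> 6) & 0xFF;   /* bits [13:6]  */
    uint32_t tag    = addr >> 14;           /* bits [31:14] */

    for (int way = 0; way < WAYS; way++) {  /* done in parallel in hardware */
        if (valid[set][way] && tags[set][way] == tag) {
            *byte_out = data[set][way][offset];
            return way;
        }
    }
    return -1;
}
```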
Figure 7.12: A set associative cache (the set index selects a set; the tags of all the ways are compared in parallel, and an encoder produces the index of the matching way)
Important Point 11
If we think about it, a direct mapped cache is also a set associative cache. Here, the size of each set is 1.
A fully associative cache is also a set associative cache, where the size of the set is equal to the number
of blocks in the cache.
The set associative cache represents an equitable trade-off between direct mapped and fully associative
caches. It is not as slow as fully associative caches, because it still uses the faster SRAM cells. Additionally,
its miss rate is not as high as that of direct mapped caches. This is because there is some degree of immunity against
aliasing. If two addresses conflict (map to the same block) in a direct mapped cache, they can be placed
in different ways of a set in a set associative cache. In many such cases, we can avoid misses altogether as
compared to a direct mapped cache.
Important Point 12
Why is it necessary to first fetch a block to the L1 cache before writing to it?
Answer: A block is typically 32 or 64 bytes long. However, most of the time we write 4 bytes or 8
bytes at a time. If the block is already present in the cache, then there is no problem. However, if there
is a cache miss, then the question that we need to answer is, “Do we wait for the entire contents of the
block to arrive from the lower levels of the memory hierarchy?” A naive approach would be to not wait: we could go ahead and write the data at the appropriate positions within an empty cache block, even
though the rest of the contents of the actual block are not present in the cache. However, this method is
fraught with difficulties. We need to keep track of the bytes within a block that we have updated. Later
on, these bytes have to be merged with the rest of the contents of the block (after they arrive). The
process of merging is complicated because some of the bytes would have been updated, and the rest would
be the same. We need to maintain information at the byte level with regards to whether a given byte has
been updated or not. Keeping these complexities in mind, we treat a block as an atomic unit, and do not
keep track of any information at the intra-block level. Hence, before writing to a block, we ensure that it
is present in the cache first.
A common approximation of LRU is to associate a small saturating counter with each block in a set, and to increment it every time the block is accessed. This is sadly not enough because over time the counters will tend to saturate, and all the blocks within
a set will hold the maximum count. To ensure that this does not happen and we capture recent history, we
need to periodically decrement the values of all the counters such that the counts of unused blocks move
towards 0. We can thus deduce that the higher the count, the higher is the probability of accesses in the recent
past. The least recently used block is expected to have the least count. Note that periodically decrementing
counters is absolutely necessary, otherwise we cannot maintain information about the recent past.
Since this approach approximates the LRU scheme, it does not always identify the least recently used
block; however, in all cases it identifies one of the least used blocks in the recent past. This is a good
candidate for replacement.
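A minimal sketch of this counter-based approximation is shown below. The 4-bit counter width and the idea of a periodic decay routine are assumptions made for illustration; real designs choose these parameters based on the workload.

```c
#include <stdint.h>

#define WAYS      4
#define MAX_COUNT 15   /* 4-bit saturating counters (assumed width) */

typedef struct {
    uint8_t count[WAYS];   /* one counter per way in the set */
} set_state_t;

void on_access(set_state_t *s, int way) {
    if (s->count[way] < MAX_COUNT)     /* saturate at the maximum */
        s->count[way]++;
}

/* Called periodically (e.g., every few thousand cycles) so that the
   counters of unused blocks drift towards 0. */
void periodic_decay(set_state_t *s) {
    for (int w = 0; w < WAYS; w++)
        if (s->count[w] > 0)
            s->count[w]--;
}

/* The victim is the way with the smallest count: one of the least
   recently used blocks, though not necessarily *the* least recent. */
int choose_victim(const set_state_t *s) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->count[w] < s->count[victim])
            victim = w;
    return victim;
}
```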
To replace a block, we evict the block that has the least count, and in its place we bring in the new
block. The process of eviction is not easy, and there are various complexities. Let us quickly understand the
trade-offs between different approaches.
There are two paradigms: write-through and write-back. In a write-through cache, whenever we perform a
write operation, we write the data to the lower level immediately. Because our memory system is inclusive,
we are guaranteed to find the block in the lower level as well. Also recall that we never write to all the bytes
of a typical 64-byte block together. Instead we write at a much finer granularity – 2 to 8 bytes at a time.
These small writes are sent to the lower level cache, where the corresponding block is updated. Thus, we
never have stale copies while using a write-through cache. In this case, eviction is very easy. We do not have
the concept of fresh and stale copies. Since all the writes have been forwarded to the lower level, the block
is up to date in the lower level cache. We can thus seamlessly evict the block and not do anything more.
However, a write-back cache is more complicated: here we do not send writes automatically to the lower
level. We keep a bit known as the modified bit with each cache line that indicates if the line has been modified
with a write operation or not. If it is modified, then we need to perform additional steps, when this line
is going to be evicted. Note that in this case, the write-back cache will contain the up to date copy of the
data, whereas the cache at the immediately lower level will contain a stale copy because it has not received
any updates. Hence, whenever we observe the modified bit to be 1 during eviction, we write the block to
the lower level. This ensures that the lower level cache contains all the modifications made to the contents
of the block. However, if the modified bit is 0, then it means that the block has not been written to. There
is thus no need to write back the block to the lower level. It can be seamlessly evicted.
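The difference between the two eviction procedures can be summarised in a few lines of C. This is only a sketch: write_block_to_lower_level() is a hypothetical helper that stands in for the actual transfer down the memory hierarchy.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     modified;   /* the modified (dirty) bit used by a write-back cache */
    uint8_t  data[64];
} cache_line_t;

/* Placeholder: in a real system the block travels down the hierarchy. */
void write_block_to_lower_level(const cache_line_t *line) { (void)line; }

/* Write-through cache: every write was already forwarded downwards,
   so the lower level is up to date and eviction is trivial. */
void evict_write_through(cache_line_t *line) {
    line->valid = false;
}

/* Write-back cache: write the block back only if it was modified. */
void evict_write_back(cache_line_t *line) {
    if (line->valid && line->modified)
        write_block_to_lower_level(line);
    line->valid    = false;
    line->modified = false;
}
```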
The write-back cache basically defers writing back the block till the point of eviction. In comparison, a
write-through cache writes back the block every time it is written. In a system with high temporal locality
and a lot of writes, a write-through cache is clearly very inefficient. First, because of temporal locality we
expect the same block to be accessed over and over again. Additionally, because of the high write traffic in
some workloads, we will need to write a lot of blocks back to the lower level – every time they are modified
at the upper level. These writes are unnecessary and can be avoided if we use a write-back cache. However,
on the flip side, write-through caches are simple and support seamless eviction.
There is one more subtle advantage of write-through caches. Assume we have a three level cache hierarchy.
Because of the property of inclusiveness, we will have three entries for a given block in all the three caches:
L1, L2, and L3. Now, assume that there is an eviction in L3, or an I/O device wishes to write to the block.
Many I/O devices write directly to main memory, and not to the caches. In this case, it is necessary to
evict the block from all three caches. In the case of write-through caches, this is simple. The blocks can be
seamlessly evicted. However, in the case of a write-back cache, we need to check the modified bits at each
level, and perform a write back to main memory if necessary. This has high overheads.
Given the nature of the requirements, we need to make a judicious choice.
Compulsory or Cold Misses These misses happen when we read in instructions or data for the first time.
In this case, misses are inevitable, unless we can design a predictor that can predict future accesses and
prefetch them in advance. Such mechanisms are known as prefetchers. We shall discuss prefetching
schemes in detail in Sections 7.6 and 7.7. Prefetching is a general technique and is in fact known to
reduce all kinds of misses.
Conflict Misses This is an artefact of having finite sized sets. Assume that we have a 4-way set associative
cache and we have 5 blocks in our access pattern that map to the same set. Since we cannot fit more
than 4 blocks in the same set, the moment the 5th block arrives, we shall have a miss. Such misses can
be reduced by increasing the associativity. This will increase the hit time and thus may not always be
desirable.
The standard ways to reduce the miss rate are to use better prefetching schemes, or to increase the cache size or the associativity. However, these increase the hardware overheads. Here is a low-overhead scheme that is
very effective.
Victim Cache A victim cache is a small cache that is normally added between the L1 and L2 caches. The
insight is that sometimes we have some sets in the L1 cache that see a disproportionate number of accesses,
and we thus have conflict misses. For example, in a 4-way set associative cache, we might have 5 frequently
used blocks mapping to the same set. In this case, one of the blocks will frequently get evicted from the
cache. Instead of going to the lower level, which is a slow process, we can add a small cache between L1
and L2 that contains such victim blocks. It can have 8-64 entries making it very small and very fast. For a
large number of programs, victim caches prove to be extremely beneficial in spite of their small size. It is
often necessary to wisely choose which blocks need to be added to a victim cache. We would not like to add
blocks that have a very low probability of being accessed in the future. We can track the usage of sets with
counters and only store evicted blocks of those sets that are frequently accessed.
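A sketch of such a victim cache is shown below. The 16-entry size and the simple FIFO replacement are assumptions made for this example; as noted above, a real design may also track which sets deserve victim-cache entries.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VC_ENTRIES 16

typedef struct {
    bool     valid;
    uint32_t block_addr;   /* address of the 64-byte block */
    uint8_t  data[64];
} vc_entry_t;

static vc_entry_t vc[VC_ENTRIES];
static int        vc_next = 0;   /* simple FIFO pointer */

/* On an L1 eviction, stash the victim block in the victim cache. */
void vc_insert(uint32_t block_addr, const uint8_t *block) {
    vc[vc_next].valid      = true;
    vc[vc_next].block_addr = block_addr;
    memcpy(vc[vc_next].data, block, 64);
    vc_next = (vc_next + 1) % VC_ENTRIES;
}

/* On an L1 miss, probe the victim cache before going down to L2. */
bool vc_lookup(uint32_t block_addr, uint8_t *block_out) {
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].block_addr == block_addr) {
            memcpy(block_out, vc[i].data, 64);
            vc[i].valid = false;   /* the block moves back into L1 */
            return true;
        }
    }
    return false;
}
```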
Let us now look at schemes to reduce the miss penalty. One such scheme is critical word first and early restart. The insight is as follows. Reads are almost always on the critical path because instructions are waiting for the loaded data, whereas writes typically are not. Furthermore,
we often read data at the granularity of 4 bytes or 8 bytes, whereas a block is 32-128 bytes. This means
that we often read a very small part of every block. We thus need not pay the penalty of fetching the entire
block. Consider a 64-byte block. In most on-chip networks we can only transfer 8 bytes in a single cycle,
and it thus takes 8 cycles to transfer a full block. The 8-byte packets are ordered as per their addresses.
We can instead transfer them in a different order. We can transfer the memory word (4 or 8 bytes) that
is immediately needed by the processor first, and transfer the rest of the words in the block in later cycles.
This optimisation is known as fetching the critical word first. Subsequently, the processor can process the
memory word, and start executing the consumer instructions. This is known as early restart because we are
not waiting to fetch the rest of the bytes in the block.
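The following short C program sketches the resulting transfer order for a 64-byte block when the processor is waiting for the word at byte offset 40. The wrap-around order used here is one common choice; it is an assumption, not necessarily the order a given processor uses.

```c
#include <stdio.h>

int main(void) {
    unsigned int block_offset   = 40;                 /* byte offset the CPU is waiting for */
    unsigned int critical_chunk = block_offset / 8;   /* 8-byte chunk containing it (chunk 5) */

    for (unsigned int i = 0; i < 8; i++) {
        unsigned int chunk = (critical_chunk + i) % 8;   /* wrap-around order */
        printf("cycle %u: transfer bytes %2u-%2u\n", i, chunk * 8, chunk * 8 + 7);
    }
    /* The processor can restart as soon as cycle 0 delivers bytes 40-47,
       instead of waiting all 8 cycles for the whole block. */
    return 0;
}
```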
This optimisation helps reduce the miss penalty because we are prioritising the data that needs to be
sent first. This is easy to implement, and is heavily used. Furthermore, this technique is the most useful
when applied between the L1 and L2 caches. It fails to show appreciable benefits when the miss penalty is
large. The savings in the miss penalty do not significantly affect the AMAT in this case.
Next, let us discuss the write buffer that reduces the latency of writes to the lower level. It temporarily
buffers the write and allows other accesses to go through.
Write Buffer A write buffer is a small fully associative cache for writes that we add between levels in the
memory hierarchy, or attach with the store unit of the pipeline. It typically contains 4-8 entries where each
entry stores a block. The insight is as follows. Due to spatial locality we tend to write to multiple words in a
block one after the other. If we have repeated misses for different words in the same block, we do not want to
send separate write-miss requests to the lower level. We should ideally send a single write-miss message for
the block, and merge all the writes. This is the job of the write buffer’s entry. For every write operation, we
allocate a write buffer entry, which absorbs the flurry of writes to the same block. It is managed like a regular
cache, where older entries are purged out and written to the lower level of the memory hierarchy. Subsequent
read operations to the same block get their value from the write buffer entry. Other read operations that
do not find their blocks in the write buffer can bypass the write buffer and can directly be sent to the lower
levels of the memory system. To summarise, write buffers allow the processor to move ahead with other
memory requests, reduce the pressure on the memory system by merging write operations to the same block,
and can quickly serve read requests if their data is found in the write buffer.
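A simplified write buffer with write merging and read forwarding might look as follows. The 4-entry size matches the range mentioned above; the per-byte valid bitmap is an assumption made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t block_addr;
    uint8_t  data[64];
    uint64_t byte_written;   /* one bit per byte of the block that has been written */
} wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];

/* Record a small write (e.g., 4 or 8 bytes); merge it with an existing entry
   for the same block if possible. Returns false if the buffer is full and the
   oldest entry must first be drained to the lower level. */
bool wb_write(uint32_t block_addr, uint32_t offset, const uint8_t *bytes, uint32_t len) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block_addr) {      /* merge */
            memcpy(&wb[i].data[offset], bytes, len);
            for (uint32_t b = 0; b < len; b++)
                wb[i].byte_written |= 1ULL << (offset + b);
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!wb[i].valid) {                                       /* allocate a new entry */
            wb[i].valid = true;
            wb[i].block_addr = block_addr;
            wb[i].byte_written = 0;
            memcpy(&wb[i].data[offset], bytes, len);
            for (uint32_t b = 0; b < len; b++)
                wb[i].byte_written |= 1ULL << (offset + b);
            return true;
        }
    }
    return false;   /* buffer full: drain an entry, then retry */
}

/* A later read to the same block can pick up its bytes from the buffer. */
bool wb_read(uint32_t block_addr, uint32_t offset, uint8_t *byte_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block_addr &&
            (wb[i].byte_written & (1ULL << offset))) {
            *byte_out = wb[i].data[offset];
            return true;
        }
    }
    return false;   /* bypass the write buffer and go to the lower level */
}
```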
Now that we have gained a good high level understanding of caches in modern processors, let us proceed
to understand the internals of modern caches in Section 7.3. Before that we need to understand the concept
of virtual memory.
Let us now systematically peel away these simplifying assumptions, and discuss the problems that we shall face in a real system. As described in the introductory text by Sarangi [Sarangi, 2015], there are
two problems in such implementations: the overlap problem and the size problem.
Definition 42
A process is defined as a running instance of a program. Note that for one program we can create any
number of processes. All of them are independent of each other, unless we use sophisticated mechanisms
to pass messages between them. A process has its own memory space, and it further assumes that it has
exclusive access to all the regions within its memory space.
The reader can press Ctrl-Alt-Del on her Microsoft® Windows® machine and see the list of processes that are currently active. She will see that tens of processes are active even if she has just one core on her laptop. This is because the processor is being time shared across all the processes. Furthermore, these processes are being switched so quickly (tens of times a second) that the human brain is not
able to perceive this. This is why we can have an editor, web browser, and video player running at the same
time.
Note that these programs running on the processor have been compiled at different places by different
compilers. All of them assume that they have complete and exclusive control over the entire memory space.
This means that they can write to any memory location that they please. The problem arises if these sets of
memory addresses overlap across processes. It is very much possible that process A and process B access the
same memory address. In this case, one process may end up overwriting the other process’s data, and this
will lead to incorrect execution. Even worse, one process can steal secret data such as credit card numbers
from another process. Such overlaps fortunately do not happen in real systems. This is because additional
steps are taken to ensure that processes do not inadvertently or maliciously corrupt each other’s memory
addresses. This means that even if we maintain the abstraction that each process’s memory space belongs to
it exclusively, we somehow ensure that two processes do not corrupt each other’s data. We need to somehow
ensure that in reality the memory spaces do not unintentionally overlap.
Let us now look at the size problem. Assume that the size of the main memory is 1 GB. We have been
assuming till now that all the accesses find their data in the main memory (miss rate is 0). This means
that the maximum amount of memory that any process is allowed to use is limited to 1 GB. This is too
restrictive in practice. It should be possible to run larger programs. In this case, we need to treat the main
memory as the cache, and create a lower level beneath it. This is exactly how modern systems are organised
as shown in Figure 7.14. The level beneath the main memory is the hard disk, which is a large magnetic
storage device that typically has 10-100 times more capacity than main memory. We dedicate a part of the
hard disk known as the swap space to store data that does not fit in the main memory.
Figure 7.14: The memory hierarchy extended with a swap space on the hard disk beneath main memory
This should happen seamlessly, and the programmer or the compiler should not be able to know about the
movement of data between main memory and the swap space. It should thus be possible for the programmer
to use more memory than the capacity of the off-chip main memory. In this specific case, it should for
example be possible to run a program that uses 3 GB of memory. Some of the data blocks will be in main
memory, and the rest need to be in the swap space. The relationship is typically not inclusive.
Such a pristine view of the memory space is known as virtual memory. Every process has a virtual view of memory where it assumes that the size of the memory that it can access is 2ᴺ bytes, where we assume that valid memory addresses are N bits wide. For example, on a 32-bit machine the size of the virtual address space is 2³² bytes, and on a 64-bit machine it is 2⁶⁴ bytes. Furthermore, the process can unreservedly write to any location within its memory space without any fear of interference from other programs. The total amount of memory that a process can actually use is, however, limited by the combined size of the main memory and the swap space. This memory space is known as the virtual address space.
Definition 43
Virtual memory is defined as a view of the memory space that is seen by each process. A process assumes
its memory space (known as the virtual address space) is as large as the range of valid memory addresses.
The process has complete and exclusive control over the virtual address space, and it can write to any
location in the virtual address space at will without any fear of interference from other programs.
Note that at this point of time, virtual memory is just a concept. We are yet to provide a physical
realisation for it. Any physical implementation has to be consistent with the memory hierarchy that we have
defined. Before proceeding further, let us enumerate the advantages of virtual memory:
1. It automatically solves the overlap problem. A process cannot unknowingly or even maliciously write
in the memory space of another process.
2. It also automatically solves the size problem. We can store data in an area that is as large as the swap
space and main memory combined. We are not limited by the size of the main memory.
3. Since we can write to any location at will within the virtual address space, the jobs of the programmer and the compiler become much easier. They can create code that is not constrained by a restrictive set
of allowed memory addresses.
Important Point 13
Many students often argue that virtual memory is not required. We can always ask a process to use
a region of memory that is currently unused, or we can force different programs at run time to use a
different set of memory addresses. All of these approaches that seek to avoid the use of virtual memory
have problems.
A program is compiled once, and run millions of times. It is not possible for the compiler to know
about the set of memory addresses that a program needs to use in a target system to avoid interference
from other programs. What happens if we run two copies of the same program? They will have clashing
addresses.
Another school of thought is to express all addresses in a program as an offset from a base address,
which can be set at runtime. The sad part is that this still does not manage to solve the overlap problem
completely. This will work if the set of memory addresses in a program are somewhat contiguous. Again,
if the memory footprint grows with time, we need to ensure that irrespective of how much it grows it will
never encroach into the memory space of another process. This is fairly hard to ensure in practice.
Definition 44
A physical address refers to an actual location of a byte or a set of bytes in the on-chip or off-chip
memory structures such as the caches, main memory, and swap space. The available range of physical
addresses is known as the physical address space.
This process is shown in Figure 7.15. If we assume a 32-bit address then the input to the translator is a
32-bit address, and let’s say the output is also a 32-bit address (the virtual and physical addresses need not have the same size). The only distinction is that the former is
a virtual address, and the latter is a physical address, which can be used to access the memory system. Let
us delve into this further.
Figure 7.15: The address translator converts a virtual address (32 bits) into a physical address (32 bits)
Figure 7.16: Pages in the virtual address spaces of different processes are mapped to frames in the physical address space
It is easy to realise that we are solving the overlap problem seamlessly. If we never map the same frame
to two different pages, then there is no way that writes from one process will be visible to another process.
The virtual memory system will never translate addresses such that this can happen.
Solving the size problem is also straightforward (refer to Figure 7.17). Here, we map some of the pages
to frames in memory and some to frames in the swap space. Even if the virtual address space is much larger
than the physical address space, this will not pose a problem.
Figure 7.17: Pages mapped to frames in both main memory and the swap space
The mappings between pages and frames are stored in a page table that resides in memory; if we had to walk the page table for every load, store, and instruction fetch, each memory operation would require at least one additional memory access. This will offset all the gains that we have made in creating a sophisticated out-of-order pipeline and
an advanced memory system.
Thankfully, a simple solution exists. We can use the same ideas that we used in caching: temporal
locality and spatial locality. We keep a small hardware cache known as the Translation Lookaside Buffer
(TLB) with each core. This keeps a set of the most recently used mappings from virtual pages to physical
frame ids. Given the property of temporal locality, we expect to have a very high hit rate in the TLB.
This is because most programs tend to access the same set of pages repeatedly. In addition, we can also
exploit spatial locality. Since most accesses will be to nearby addresses, they are expected to be within the
same page. Hence, saving a mapping at the level of pages is expected to be very beneficial because it has a
potential for significant reuse.
Definition 45
• The page table is a data structure that maintains a mapping between page ids and their corresponding frame ids. In most cases this data structure is maintained in a dedicated memory region
by software. Specialised modules of the operating system maintain a separate page table for each
process. However, in some architectures, notably latest Intel processors, the process of looking up
a mapping (also referred to as a page walk) is performed by hardware.
• To reduce the overheads of address translation, we use a small cache of mappings between pages
and frames. This is known as the Translation Lookaside Buffer or the TLB. It typically contains
32-128 entries, and can be accessed in less than 1 clock cycle. It is necessary to access the TLB
every time we issue a load/store request to the memory system.
The process of memory address translation is thus as follows. The virtual address first goes to the TLB
where it is translated to the physical address. Since the TLB is typically a very small structure that contains
32-128 entries its access time is typically limited to a single cycle. Once we have the translated address, it
can be sent to the memory system, which is either the i-cache (for instructions) or the d-cache (for data).
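A sketch of this translation flow is shown below. The 4 KB page size, the 64-entry fully associative TLB, and the naive replacement policy are assumptions made for the example; page_table_walk() is a placeholder for the page table lookup performed by the OS or a hardware page walker.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12    /* assumed 4 KB pages */
#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint32_t page;    /* virtual page number   */
    uint32_t frame;   /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Placeholder page walk: in reality this reads the page table maintained by
   the OS (or uses a hardware page walker). Here we fabricate a mapping. */
uint32_t page_table_walk(uint32_t page) { return page; }

uint32_t translate(uint32_t vaddr) {
    uint32_t page   = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {   /* TLB lookup: the common case is a hit */
        if (tlb[i].valid && tlb[i].page == page)
            return (tlb[i].frame << PAGE_SHIFT) | offset;
    }

    uint32_t frame = page_table_walk(page);   /* TLB miss: walk the page table */
    int victim = page % TLB_ENTRIES;          /* naive replacement for this sketch */
    tlb[victim] = (tlb_entry_t){ true, page, frame };
    return (frame << PAGE_SHIFT) | offset;
}
```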
Figure 7.18: The address translation flow – on a TLB hit, the mapping is sent to the processor; on a TLB miss, if the frame is in memory we populate the TLB and send the mapping to the processor; otherwise we create or update the mapping in the page table
Definition 46
A page fault is defined as an event where a page’s corresponding frame is not found in main memory.
It either needs to be initialised in main memory, or its contents need to be read from the hard disk and
stored in the main memory.
SRAM Cell
The question that we wish to answer is how do we store 1 bit of information? We can always use latches
and flip-flops. However, these are area intensive structures and cannot be used to store thousands of bits.
We need a structure that is far smaller.
Let us extend the idea of a typical SR latch as shown in Figure 7.19. An SR latch can store a single bit.
If we set S to 1 and R to 0, then we store a 1. Conversely, if we set S = 0 and R = 1, we store a 0.
Figure 7.19: A basic SR latch
Let us appreciate the basic structure of this circuit. We have a cross-coupled pair of NAND gates. By
cross-coupling, we mean that the output of one NAND gate is connected to the input of the other and
likewise for the other gate. Unfortunately, a NAND gate has four transistors, and thus this circuit is not
area efficient. Let us take this idea, and build a circuit that has two cross-coupled inverters as shown in
Figure 7.20.
This structure can also store a bit. It just uses four transistors, because one CMOS inverter can be made
out of one NMOS transistor and one PMOS transistor. To write a value, we can simply set the value of
node Q to the value that we want to write. The other node will always hold the complement $\overline{Q}$ because of the two inverters. Reading a value is also easy. It is the output at node Q.
However, this circuit has a drawback. There is no way to enable or disable the circuit. Enabling and
disabling a memory cell is very important because we are not reading or writing to the cell in every cycle.
We want to maintain the value in the cell when we are not accessing it. During this period, its access should
Figure 7.20: Cross-coupled inverters: (a) cross-coupled inverters, (b) cross-coupled inverters implemented using CMOS logic
be disabled. Otherwise, whenever there is a change in the voltages of nodes Q or $\overline{Q}$, the value stored in the
cell will change. If we can effectively disconnect the cell from the outside world, then we are sure that it will
maintain its value.
This is easy to do. We can use the most basic property of a transistor, which is that it works like a switch.
We can connect two transistors to both the nodes of the cross-coupled inverter. This design is shown in
Figure 7.21. We add two transistors – W1 and W2 – at the terminals Q and $\overline{Q}$, respectively. These are called word line transistors; they connect the inverter pair to two bit lines on either side. The gates of W1 and W2 are connected to a single wire called the word line. If the word line is set to 1, both the transistors get enabled (the switches close). In this case, the bit lines get connected to the terminals Q and $\overline{Q}$, respectively. We can then read the value stored in the inverter pair and also write to it. If we set the word line to 0, then the switches get disconnected, and the 4-transistor inverter pair is disconnected from the bit lines. We
cannot read the value stored in it, or write to it. Thus, we have a 6-transistor memory cell; this is known as
an SRAM (static random access memory) cell. This is a big improvement as compared to the SR latch in
terms of the number of transistors that are used.
Figure 7.21: Cross-coupled inverters with enabling transistors (the word line transistors W1 and W2 connect Q and $\overline{Q}$ to the bit lines BL and $\overline{BL}$)
SRAM Array
Now that we have designed a memory cell that can store a single bit, let us create an array of SRAM cells.
We can use this array to store data.
To start with, let us create a matrix of cells as shown in Figure 7.22. We divide the address used to
access the SRAM array into two parts: row address, and column address. We send the row address to the
row decoder. Assume that it contains r bits. The decoder takes in these r bits, and it sets one out of 2ʳ output lines, which are the word lines, to 1. Each word line is identified by the binary number represented by the row address. For example, if the row address is 0101 (5 in decimal), then output line number 5 of the decoder is set to 1, i.e., word line 5 is set to 1. The rest of the output lines (word lines) are set to 0.
The decoder is a standard electronic circuit, and can be easily constructed out of logic gates. The benefit
of using the decoder is that it enables only one of the word lines. This word line subsequently enables a row
of cells in the 2D array (matrix) of memory cells. All the memory cells in a row can then be accessed.
Let us consider the process of writing first. Every cell is connected to two wires that are called bit lines.
Each bit line is connected to a node of the memory cell (Q or $\overline{Q}$) via a transistor switch that is enabled by a word line. The bit lines carry complementary logic values. Let us refer to the left bit line as BL and the right bit line as $\overline{BL}$. To write values we use fast write drivers that can quickly set the voltage of the bit lines. If we want to write a 1 to a given cell, then we set its bit lines as follows: BL to a logical 1 and $\overline{BL}$ to a logical 0. We do the reverse if we wish to write a logical 0. The pair of inverters gets reprogrammed once we set the voltages of BL and $\overline{BL}$.
The more difficult operation is reading the SRAM array. In this case, once we enable the memory cell,
the bit lines get charged to the values that are contained in the memory cell. For example, if the value of
a node of the memory cell is a logical 1 (assume a logical 1 is 1 V), then the voltage on the corresponding
bit line increases towards 1 V. Similarly, the voltage on the other bit line starts moving towards 0 V. This
situation is not the same as writing a value. While writing a value we could use large write drivers that can
Figure 7.22: An SRAM array (a row decoder drives the word lines of a matrix of SRAM cells; the column mux/demux connects the selected bit lines to the write drivers and sense amplifiers)
pump in a lot of current into the bit lines. In this case, we only have a small 6-transistor cell that is charging
the bit lines. It is far weaker than the powerful write drivers. As a result, the process of charging (towards
a logical 1) or discharging (towards a logical 0) is significantly slower. Note that the process of reading is
crucial. It is often on the critical path because there are instructions waiting for the value read by a load
instruction. Hence, we need to find ways to accelerate the process.
A standard method of doing this is called precharging. We set both the bit lines to a voltage
that is midway between the voltages corresponding to logical 0 and logical 1. Since we have assumed that the voltage
corresponding to a logical 1 is 1 V, the precharge voltage is 0.5 V. We use strong precharge drivers to
precharge both the bit lines to 0.5 V. Akin to the case with write drivers, this process can be done very
quickly. Once the lines are precharged, we enable a row of memory cells. Each memory cell starts setting
the values of the bit lines that it is connected to. For one bit line the voltage starts moving to 1 V and for
the other the voltage starts moving towards 0 V. We monitor the difference in voltage between the two bit
lines.
Assume the cell stores a logical 1. In this case, the voltage on the bit line BL will try to move towards
1 V, and the voltage on the bit line B̄L̄ will try to move towards 0 V. The difference between the voltages
of BL and B̄L̄ will start at 0 V and gradually increase to 1 V. However, the key idea is that we do not
have to wait till the difference reaches 1 V (or -1 V when the cell stores a logical 0). Once the difference crosses a
certain threshold, we can infer the final direction in which the voltages on both the bit lines are expected to
progress. Thus, much before the voltages on the bit lines reach their final values, we can come to the correct
conclusion.
Let us represent this symbolically. Let us define the function V to represent the voltage. For example,
V (BL) represents the instantaneous voltage of the bit line BL. Here are the rules for inferring a logical 0
or 1 after we enable the cell for reading.
\[
\text{value} = \begin{cases} 1 & V(BL) - V(\overline{BL}) > \Delta \\ 0 & V(\overline{BL}) - V(BL) > \Delta \end{cases}
\]
In this case, we define a threshold ∆ that is typically of the order of tens of millivolts. An astute reader
might ask a question about the need for the threshold, ∆. One of the reasons is that long copper wires such
as bit lines can often accumulate an EMF due to impinging electromagnetic radiation. In fact, unbeknownst
to us a long copper wire can act as a miniature antenna and pick up electric fields. This might cause a
potential to build up along the copper wire. In addition, we might have crosstalk between copper wires,
where adjacent wires get charged because of some degree of inductive and capacitive coupling. Due
to such types of noise, it is possible that we might initially see the voltage on a bit line swaying in a certain
direction. To completely discount such effects, we need to wait till the absolute value of the voltage difference
exceeds a threshold. This threshold is large enough to make us sure that the voltage difference between the
bit lines is not arising because of transient effects such as picking up random electric fields. We should be
sure that the difference in voltages is because one bit line is moving towards logical 1 and the other towards
logical 0.
At this point, we can confidently declare the value contained in the memory cell. We do not have to wait
for the voltages to reach their final values. This is thus a much faster process. The lower we set ∆, the faster
our circuit becomes. However, we are limited by the amount of noise.
Note that there is one hidden advantage of SRAM arrays. Both BL and BL are spaced close together.
Given their spatial proximity, the effects of noise will be similar, and thus if we consider the difference in
voltages, we shall see that the effects of electromagnetic or crosstalk noise will mostly get cancelled out. This
is known as common mode rejection and works in our favour. Had we had a single bit line, this would not
have happened.
In our SRAM array (shown in Figure 7.22) we enable the entire row of cells. However, we might not
be interested in all of this data. For example, if the entire row contains 64 bytes of data, and we are only
interested in 16 bytes, then we need to choose the component of the row that we are interested in. This
is done using the column multiplexers that read in only those columns that we are interested in. The
column address is the input to these column multiplexers. For example, in this case since there are four 16
byte chunks in a 64 byte row, there are four possible choices for the set of columns. We thus need 2 bits to
encode this set of choices. Hence, the column address is 2 bits wide.
Column Multiplexers
Let us look at the design of the column multiplexers. Figure 7.23 shows a simple example. We have two
pairs of bit lines: (B1 , B1 ) and (B2 , B2 ). We need to choose one of the pairs. We connect each wire to
an NMOS transistor, which is known as a pass transistor. If the input at its gate is a logical 1, then the
transistor conducts, and the voltage at the drain is reflected at the source. However, if the voltage at the
gate is a logical 0, then the transistor behaves as an open circuit.
Figure 7.23: A column multiplexer
In this case, we need a single column address bit because there are two choices. We send this bit to a
decoder and derive two outputs: S0 and S1 . Only one of them can be true. Depending upon the output that
is true, the corresponding pass transistors get enabled. We connect B1 and B2 to the same wire, and we do
the same with B1 and B2 . Only one bit line from each pair will get enabled. The enabled signals are then
sent to the sense amplifier that senses the difference in the voltage and determines the logic level of the bit
stored in the SRAM cell.
Sense Amplifiers
After we have chosen the columns that we are interested in, we need to compare BL and BL for each cell,
and then ascertain which way the difference is going (positive or negative). We use a specialised circuit called
a sense amplifier for this purpose. The circuit diagram of a typical sense amplifier is shown in Figure 7.24.
Let us describe the operation of a sense amplifier as shown in Figure 7.24. A sense amplifier is made up
of three differential amplifiers. In Figure 7.24, each shaded box represents a differential amplifier. Let us
consider the amplifier at the left top and analyse it. We shall provide an informal treatment in this book.
For a better understanding of this circuit, readers can perform circuit simulation or analyse the circuit
mathematically. First, assume that V (BL) = V (BL). Transistors T1 and T2 form a circuit known as a
current mirror, where the current flowing through both the transistors is the same. Furthermore, transistor
T2 is in saturation (VSD > VSG − VT ), which fixes the current flowing through the transistor.
Now assume that V(BL) is slightly lower than V(B̄L̄); specifically, V(B̄L̄) − V(BL) = ∆, which is the
difference threshold for detecting a logic level. In this case transistor T4 will draw slightly more current
as compared to the equilibrium state. Let this current be Id mA. This means an additional current of
Id mA will pass through T2 . Because we have a current mirror, the same additional current Id will also
pass through T1 . However, the current through T5 will remain constant because it is set to operate in the
saturation region (the reader should verify this by considering possible values of Vx ). This means that since
the current through T4 has increased by Id , the current through T3 needs to decrease by Id to ensure that
the sum (flowing through T5 ) remains constant.
Figure 7.24: A sense amplifier made up of three differential amplifiers (AMP1, AMP2, and AMP3)
The summary of this discussion is that an additional Id mA flows through T1 and the current in T3
decreases by Id mA. There is thus a total shortfall of 2Id mA, which must flow from terminal P1 to P3 .
Terminal P3 is the gate of transistor T6 . This current will serve the purpose of increasing its gate voltage.
Ultimately, the current will fall off to zero once the operating conditions of transistors T1 . . . T5 change. Thus,
the net effect of a small change in the voltage between the bit lines is that the voltage at terminal P3 increases
significantly. Consider the reverse situation where V (BL) decreases as compared to V (BL). It is easy to
argue that we shall see a reverse effect at terminal P3 . Hence, we can conclude that transistors T1 . . . T5
make up a differential amplifier (AMP1 in the figure).
Similarly, we have another parallel differential amplifier AMP2, where the bit lines are connected to the amplifier
in a reverse fashion. Let us convince ourselves that the directions in which the voltages change at terminals
P1 and P2 are opposite: when one decreases, the other increases, and vice versa. The role of the parallel
differential amplifiers (AMP1 and AMP2) is to amplify the difference between the voltages of the bit lines;
this shows up at terminals P3 and P4 .
Terminals P3 and P4 are the inputs to another differential amplifier AMP3, which further amplifies the
voltage difference between the terminals. We thus have a two-stage differential amplifier. Note that as
V(B̄L̄) increases, the voltage at terminal P3 increases and this increases the voltage at terminal P5 (using
a similar logic). However, with an increase in V(B̄L̄) we expect the output to become 0, and vice versa.
To ensure that this happens, we need to connect an inverter to terminal P5. The final output of the sense
amplifier is the output of the inverter.
Sense amplifiers are typically connected to long copper wires that route the output to other functional
units. Sense amplifiers are typically very small circuits and are not powerful enough to charge long copper
wires. Hence, we need another circuit called the output driver that takes the output of a sense amplifier,
and stabilises it such that it can provide enough charge to set the potential of long copper wires to a logical
1 if there is a need. Note that this is a basic design. There are many modern power-efficient designs. We
shall discuss another variant of sense amplifiers that are used in DRAMs in Chapter 10.
CAM Cell
In Section 7.1.4 we had discussed the notion of a CAM (content-addressable memory) array, where we address
a row in the matrix of memory cells based on its contents and not on the basis of its index. Let us now
proceed to design such an array. The basic component of a CAM array is the CAM cell (defined on the same
lines as a 6-transistor SRAM cell).
Figure 7.25 shows the design of a CAM cell with 10 transistors. Let us divide the diagram into two halves:
above and below the match line. The top half looks the same as an SRAM cell. It contains 6 transistors
and stores a bit using a pair of inverters. The extra part is the 4 extra transistors at the bottom. The two
output nodes of the inverter pair are labelled Q and Q̄ respectively. In addition, we have two inputs, A and
Ā. Our job is to find out if the input bit A matches the value stored in the inverter pair, Q.
The four transistors at the bottom have two pairs of two NMOS transistors each connected in series. Let
us name the two transistors in the first pair T1 and T2 respectively. The drain terminal of T1 is connected
to the match line, which is initially precharged to the supply voltage. The first pair of NMOS transistors
is connected to Q and Ā respectively. Let us create a truth table based on these inputs and the states of the
transistors T1 and T2 .
Q Ā T1 T2
0 0 off off
0 1 off on
1 0 on off
1 1 on on
If the inputs Q and Ā are both 1, only then do both the transistors conduct. Otherwise, at least one of
the transistors does not conduct. In other words, if Q = 1 and A = 0, there is a straight conducting path
between the match line and the ground node. Thus, the voltage of the match line becomes zero. Otherwise,
the voltage of the match line will continue to remain the same as its precharged value because there is no
conducting path to the ground.
Let us now look at the other pair of transistors, T3 and T4, that are connected in the same way albeit to
different inputs: Q̄ and A (the complements of the inputs to transistors T1 and T2). Let us build a similar truth
table.
Q̄ A T3 T4
0 0 off off
0 1 off on
1 0 on off
1 1 on on
Here also, the only condition for a conducting path is Q̄ = A = 1. If we combine the results of both the
truth tables, then we have the following conditions for the match line to get set to 0: either Q = Ā = 1, or
Q̄ = A = 1. We should convince ourselves that this will only happen if A ≠ Q. If A = Q, then both of these
Figure 7.25: A CAM cell
conditions will always be false: one of the values in each pair will be 1 and the other will be 0. However, if
A ≠ Q, then exactly one of these conditions will hold.
Let us thus summarise. A CAM cell stores a value Q. In addition, it takes as input the bit A that is to
be compared with Q. If the values are equal, then the match line maintains its precharged value. However,
if they are not equal then a direct conducting path forms between the match line and ground. Its value gets
set to a logical 0. Thus, by looking at the potential of the match line, we can infer if there has been a match
or not.
CAM Array
Let us build a CAM array the same way we built the SRAM array. Figure 7.26 shows the design.
Typically, a CAM array has all the features of an SRAM array, and in addition it has more features such
as content-based addressability. The row address decoder, column multiplexers, sense amplifiers, precharge
circuits, write and output drivers are the parts that the CAM array inherits from the SRAM array. To use
the CAM array like a regular SRAM array, we simply enable the row decoder and proceed with a regular
array access. In this case, we do not set the match line or read its value.
Figure 7.26: A CAM array
Let us now focus on the extra components that are useful when we are searching for an entry based on
its contents. To keep the discussion simple, we assume that we wish to match an entire row. Extending this
idea to cases where we wish to match a part of the row is trivial. Now, observe that all the transistors in
the same row are connected to the same match line (see Figure 7.26). We provide a vector V as input such
that it can be compared with each row bit by bit. Thus, the ith bit in the vector is compared with the value
stored in the ith CAM cell in the row. In other words, at each cell we compare a pair of bits: one from the
vector V (bit Ai in the figure) and the value stored in the CAM cell. First, assume that all the bits match
pairwise. In this case, we observe the bits to be equal at each cell, and thus a conducting path between
the match line and ground does not form. This happens for all the cells in the row. Thus, the match line
continues to maintain its precharged value: a logical 1.
Now, consider the other case, where at least one pair of bits does not match. In this case at that CAM
cell, a conducting path forms to the ground, and the match line gets discharged; the voltage gets set to a
logical 0.
Thus, to summarise, we can infer if the entire row has matched the input vector V or not by basically
looking at the voltage of the match line after the compare operation. If it is the same as its precharged value
(logical 1), then there has been a full match. Otherwise, if the match line has been discharged, then we can
be sure that at least one pair of bits has not matched.
Hence, the process of addressing a CAM memory based on the contents of its cells is as follows. We first
enable all the word lines. Then, we set the bits of the input vector (bits A1 . . . An in the figure) and then
allow the comparisons to proceed. We always assume that we do not have any duplicates. There are thus
two possible choices: none of the match lines are at a logical 1 or only one of the match lines is set to 1.
This is easy to check. We can create an OR circuit that checks if any of the match lines is a logical 1 or
not. If the output of this circuit is 1, then it means that there is a match, otherwise there is no match. Note
that it is impractical to create a large OR gate using NMOS transistors. We can either create a tree of OR
gates with a limited fan-in (number of inputs), or we can use wired-OR logic, where all the match lines are
connected to a single output via diodes (as shown in Figure 7.27). If any of the match lines is 1, it sets the
output to 1. Because of the diodes current cannot flow from the output terminal to the match lines.
Figure 7.27: Wired-OR logic realised with diodes
Now, if we know that one of the match lines is set (equal to 1), we need to find the number of the row
that matches the contents. Note that our count starts from 0 in this case (similar to arrays in programming
languages). We can use an encoder for this that takes N inputs and has log2 (N ) outputs. The output gives
the id (binary encoding) of the input that is set to 1. For example, if input number 9 out of a set of 16 inputs
(numbered 0 to 15) is set to 1, its encoding will be 1001.
In a fully associative cache whose tag array is implemented as a CAM array, once we get the id of the
row whose match line is set to 1, we can access the corresponding entry of the data array and read or write
to the corresponding block. A CAM array is thus an efficient way of creating a hash table in hardware. The
tag array of a fully associative cache is typically implemented as a CAM array, whereas the data array is
implemented as a regular SRAM array.
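To make the search semantics concrete, the following C snippet is a purely behavioural sketch of a CAM lookup: every row is compared with the input key, and the index of the matching row (if any) is returned, which mimics the match lines and the priority encoder at a functional level. The array contents and sizes are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

#define NUM_ROWS 8

/* Behavioural model of a CAM search: every row is compared with the input
   key. A row matches only if all of its bits match (its match line stays at
   the precharged value). We return the index of the matching row, or -1 if
   no row matches. We assume that there are no duplicates. */
static int cam_search(const uint32_t rows[NUM_ROWS], uint32_t key) {
    for (int i = 0; i < NUM_ROWS; i++) {
        if (rows[i] == key)   /* all bits equal -> match line stays at 1    */
            return i;         /* encoder output: id of the matching row     */
    }
    return -1;                /* the OR of all the match lines is 0 -> miss */
}

int main(void) {
    uint32_t tag_array[NUM_ROWS] = {0x12, 0x47, 0x9A, 0x03, 0x77, 0x5C, 0xF0, 0x2B};
    printf("search 0x9A -> row %d\n", cam_search(tag_array, 0x9A)); /* row 2 */
    printf("search 0xFF -> row %d\n", cam_search(tag_array, 0xFF)); /* miss  */
    return 0;
}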
Field Description
A Associativity
B Block size (in bytes)
C Cache size (in bytes)
bo Width of the output (in bits)
baddr Width of the input address (in bits)
These are the most basic parameters for any set associative cache. Note that direct mapped caches and
fully associative caches are specific instances of set associative caches. Hence, we use the set associative
cache as a basis for all our subsequent discussion. Using the ABC parameters (associativity, block size, and
cache size), we can compute the size of the set index and the size of the tag (refer to Example 4).
Example 4
Given A (associativity), B (block size in bytes), W (width of the input address in bits), and C (cache
size), compute the number of bits in the set index and the size of the tag.
Answer: The number of blocks that can be stored in the cache is equal to C/B. The size of each set
is A. Thus, the number of sets is equal to C/(BA). Therefore, the number of bits required to index each
set is log2 (C/(BA)).
Let us now compute the size of the tag. We know that the number of bits in the memory address is
W . Furthermore, the block size is B bytes, and we thus require log2 (B) bits to specify the index of a
byte within a block (block address bits).
Hence, the number of bits left for the tag is as follows:
tag bits = address size − #set index bits − #block address bits
= W − log2 (C/(BA)) − log2 (B)
= W − log2 (C) + log2 (A)
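As a quick illustration of Example 4, the following C snippet computes the widths of the set index, block offset, and tag fields. It is a minimal sketch; the concrete cache parameters chosen here are arbitrary.

#include <stdio.h>

/* log2(x) for x that is a power of two */
static int log2_int(unsigned long x) {
    int bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

int main(void) {
    unsigned long A = 4;          /* associativity                      */
    unsigned long B = 64;         /* block size in bytes                */
    unsigned long C = 32 * 1024;  /* cache size in bytes                */
    int W = 32;                   /* width of the input address in bits */

    int block_offset_bits = log2_int(B);
    int set_index_bits    = log2_int(C / (B * A));
    int tag_bits          = W - set_index_bits - block_offset_bits;

    /* For a 32 KB, 4-way cache with 64-byte blocks and 32-bit addresses:
       set index = 7 bits, block offset = 6 bits, tag = 19 bits */
    printf("set index = %d, block offset = %d, tag = %d\n",
           set_index_bits, block_offset_bits, tag_bits);
    return 0;
}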
A naive organisation with the ABC parameters might result in a very skewed design. For example, if
we have a large cache, then we might end up with a lot of rows and very few columns. In this case, the
load on the row decoder will be very large and this will become a bottleneck. Additionally, the number of
devices connected to each bit line would increase and this will increase its capacitance, thus making it slower
because of the increased RC delay.
On the other hand, if we have a lot of columns then the load on the column multiplexers and the word
lines will increase. They will then become a bottleneck. In addition, placing a highly skewed structure on
the chip is difficult. It conflicts with other structures. Having an aspect ratio (width/length) that is close to
that of a square is almost always the best idea from the point of view of placing a component on the chip.
Hence, having a balance between the number of rows and columns is a desirable attribute.
It is thus necessary to break down a large array of SRAM cells into smaller arrays such that they
are faster and more manageable. We refer to the large original array as the undivided array and the smaller
arrays as subarrays. Cacti thus introduces two additional parameters: Ndwl and Ndbl . Ndwl indicates the
number of segments that we create by splitting each word line or alternatively the number of partitions that
we create by splitting the set of columns of the undivided array. On similar lines, Ndbl indicates the number
of segments that we create by splitting each bit line or the set of rows. After splitting, we create a set of
subarrays. In this case, the number of subarrays is equal to Ndwl × Ndbl .
Additionally, the Cacti tool introduces another parameter called Nspd , which basically sets the aspect
ratio of the undivided array. It indicates the number of sets that are mapped to a single word line. Let us
do some math using our ABC parameters. The size of a block is B, and if the associativity is A, then the
size of a set in bytes is A × B. Thus, the number of bytes that are stored in a row (for a single word line) is
A × B × Nspd .
Example 5 Compute the number of rows and columns in a subarray using Cacti’s parameters.
Answer: Let us compute the number of columns first. We have A × B × Nspd bytes per row in
the undivided cache, which is equal to 8 × A × B × Nspd bits. Now, if we divide the set of columns into
Ndwl parts, the number of columns in each subarray is equal to (8 × A × B × Nspd)/Ndwl.
Let us now compute the number of rows in a subarray. The number of bytes in each row of the
undivided cache is equal to A × B × Nspd. Thus, the number of rows in the undivided cache is equal to the size
of the cache C divided by this number, which is C/(A × B × Nspd). Now, if we divide this into Ndbl segments, we get
the number of rows in each subarray as C/(A × B × Nspd × Ndbl).
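The formulas of Example 5 can be coded up directly. The sketch below does this in C; the values chosen for the ABC parameters and for Ndwl, Ndbl, and Nspd are illustrative assumptions, not recommendations.

#include <stdio.h>

int main(void) {
    /* ABC parameters (illustrative values) */
    double A = 2;            /* associativity                     */
    double B = 64;           /* block size in bytes               */
    double C = 64 * 1024;    /* cache size in bytes               */

    /* Array partitioning parameters (illustrative values) */
    double Ndwl = 2;         /* number of word line segments      */
    double Ndbl = 4;         /* number of bit line segments       */
    double Nspd = 1;         /* sets mapped to a single word line */

    double cols_per_subarray = (8 * A * B * Nspd) / Ndwl;
    double rows_per_subarray = C / (A * B * Nspd * Ndbl);

    printf("columns per subarray = %.0f\n", cols_per_subarray);   /* 512 */
    printf("rows per subarray    = %.0f\n", rows_per_subarray);   /* 128 */
    printf("number of subarrays  = %.0f\n", Ndwl * Ndbl);         /* 8   */
    return 0;
}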
Thus, given a cache, the task is to compute these three parameters – Ndwl , Ndbl , and Nspd . We need to
first figure out a goal such as minimising the access time or the energy per access. Then, we need to compute
the optimal values of these parameters. These parameters were for the data array (d in the subscript). We
can define similar parameters for the tag array: Ntwl , Ntbl , and Ntspd respectively.
Let us now summarise our discussion. We started out with an array or rather a matrix of SRAM cells.
We quickly realised that we cannot have a skewed ratio (disproportionate number of rows or columns). In
one case, we will have very slow word lines, and in the other case we will have very slow bit lines. Both are
undesirable. Hence, to strike a balance we divide an array of memory cells into a series of subarrays. This
is graphically shown in Figure 7.28.
Figure 7.28: Dividing a large SRAM array into subarrays
Each subarray has its own decoder. Recall that the input to the decoder is the set index. If we have 4
subarrays then we can use the last two bits of the set index to index the proper subarray. Subsequently, we
expect to find a full set in the row of the subarray. However, in theory it is possible that the set of blocks
may be split across multiple subarrays. In this case, we need to read all the subarrays that contain the blocks
in the set. Given that we can divide a large SRAM array in this manner, we will end up accessing subarrays,
which are much smaller and faster.
Definition 47
A port is defined as an interface for accepting a read or write request. We can have a read port, a write
port, or a read/write port.
The traditional approach for creating a multi-ported structure is to connect each SRAM cell to an
additional pair of bit lines as shown in Figure 7.29. We thus introduce two additional word line transistors
W3 and W4 that are enabled by a different word line – this creates a 2-ported structure. Now, since we have
two pairs of bit lines, it means that we can make two parallel accesses to the SRAM array. One access will
use all the bit lines with subscript 1, and the other access will use all the bit lines with subscript 2. Each
pair of bit lines needs its separate set of column multiplexers, sense amplifiers, write, precharge, and output
drivers. Additionally, we need two decoders – one for each address. This increases the area of each array
significantly. A common rule of thumb is that the area of an array increases as the square of the
number of ports – proportional increase in the number of word/bit lines in both the axes.
Figure 7.29: A dual-ported SRAM cell
A better solution is a multi-banked cache. A bank is defined as an independent array with its own
subarrays and decoders. If a cache has 4 banks, then we split the physical address space between the 4
banks. This can be done by choosing 2 bits in the physical address and then using them to access the right
bank. Each bank may be organised as a cache with its own tag and data arrays, alternatively we can divide
the data and tag arrays into banks separately. For performance reasons, each bank typically has a single
port.
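One possible way of selecting the bank is sketched below in C: we use the 2 bits of the physical address just above the block offset, so that consecutive blocks map to different banks. The bit positions and the block size are assumptions made for illustration; real designs may choose other bits.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6   /* 64-byte blocks (assumed)      */
#define BANK_BITS         2   /* 4 banks -> 2 bank-select bits */

/* Select the bank using the 2 bits just above the block offset. Consecutive
   blocks thus map to different banks, which spreads accesses across them. */
static unsigned bank_of(uint64_t paddr) {
    return (paddr >> BLOCK_OFFSET_BITS) & ((1u << BANK_BITS) - 1);
}

int main(void) {
    uint64_t addrs[] = {0x1000, 0x1040, 0x1080, 0x10C0, 0x1100};
    for (int i = 0; i < 5; i++)
        printf("address 0x%llx -> bank %u\n",
               (unsigned long long)addrs[i], bank_of(addrs[i]));
    return 0;
}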
The advantage of dividing a cache into banks is that we can seamlessly support concurrent accesses to
different banks. Now, for 4 banks there is a 25% chance of two simultaneous accesses accessing the same
bank – assuming a uniformly random distribution of accesses across the banks. This is known as a bank
conflict, and in this case we need to serialise the accesses. This means that one memory access needs to wait
for the other. There is an associated performance penalty. However, this is often outweighed by the fast
access time of banks. Finally, note that each bank has its own set of subarrays. However, subarrays cannot
be accessed independently.
In Cacti 5.0, the authors propose to divide an array into multiple banks, and then further subdivide a bank into
subbanks. Banks can be accessed independently by concurrent requests. However, one bank can process
only one memory request at any given point of time. Each bank consists of multiple subbanks, and only one
of the subbanks can be enabled for a memory request. A subbank contains an entire data block, which is
typically either 64 bytes or 128 bytes. Following the maxim, “smaller is faster”, we divide a subbank into
multiple mats, and each mat into 4 subarrays. The structure is shown in Figure 7.30. The hierarchy is
Array → Bank → Subbank → Mat → Subarray.
Figure 7.30: The hierarchy: array, bank, subbank, mat, and subarray
The logic for such a deep hierarchy is as follows. If a subbank is one large array, it will be very big and
very slow. Hence, we divide a subbank into multiple mats, and we divide each mat into multiple subarrays.
We store a part of each block in each mat. For a read operation, each mat supplies the part of the block that
it contains and at the subbank level we join all the parts to get the block. We do the same for the subarrays
within mats. This process ensures that the mats and their constituent subarrays are small and hence fast.
This also parallelises the process of reading and writing. We can thus quickly read or write an entire data
block. Another advantage of this design is that different subarrays within a mat share their decoding logic,
which increases the overall speed of operation and minimises the area.
Routing messages between the cache controller, which is a small piece of logic in each bank, and the
subarrays can be complicated in large caches that have long wire delays. Let us outline an approach to solve
this problem.
H-Trees
The memory address needs to be sent to all the mats and subarrays. Because wire delays in large caches
can be of the order of a few cycles, it is possible that the request might not reach all the mats and subarrays
at the same time, particularly if there is a mismatch in the length of wires. If this happens, the responses
will not arrive at the same time. To ensure that all the requests reach the subarrays at the same time, we
create a network that looks like an H-tree as shown in Figure 7.31. The sender is at the centre of the figure.
Observe that it is located at the centre of the middle bar of a large ‘H’ shaped subnetwork. Each corner of
the large ‘H’ shaped subnetwork is the centre of another smaller ‘H’ shaped network. This process continues
till all the receivers are connected. The reader needs to convince herself that the distance from the centre to
each receiving node (dark circle) is the same. The address and data are sent along the edges of the H-tree.
They reach each subarray at exactly the same time.
Consider a long wire divided into small segments, where each segment has a resistance. In addition, we have a capacitance associated with each segment. This is because whenever we have two metallic conductors
in proximity, they will act like a classic capacitor – two parallel plates separated by a dielectric. Such a pair
of conductors will store some charge across a potential difference. We can model this as a capacitor between
the segment of the wire and ground. A wire can thus be visualised as a ladder network of such R and C
elements.
Similarly, we can replace a transistor with a set of resistors and capacitors as shown in Figure 7.33. We
have a gate capacitance because the gate consists of a conducting plate that is set to a given potential.
Thus it stores charge, and this can be modelled as the gate capacitance. Using a similar argument,
both the drain and source nodes will also have a capacitance associated with them. When the transistor
is in the linear region, the resistance across the channel will not be 0. For a given drain-source and gate
voltage, the drain current is given by the Id-Vg curve of the transistor. This relationship can be modelled
by placing a resistor between the drain and the source. When the transistor is in saturation it behaves as a
current source and the drain-source resistor can be replaced with a regular current source.
Figure 7.33: Equivalent RC circuit for an NMOS transistor (linear region and saturation)
Using such RC networks is a standard approach in the analysis of electronic circuits, particularly when
we want to leverage the power of fast circuit simulation tools to compute the voltage at a few given nodes
in the circuit. We sometimes need to add an inductance term if long wires are involved. Subsequently,
to compute the voltages and currents, it is necessary to perform circuit simulation on these simplified RC
networks.
The Cacti 1.0 [Wilton and Jouppi, 1993] model proposes to replace all the elements in a cache inclusive
of the wires, transistors, and specialised circuits with simple RC circuits. Once we have a circuit consisting
of just voltage sources, current sources, and RC elements, we can then use quick approximations to compute
the voltage at points of interest. In this section, we shall mainly present the results from Horowitz's
paper [Horowitz, 1983] on modelling the delay of MOS circuits. This paper in turn bases its key assumptions
on Elmore’s classic paper [Elmore, 1948] published in 1948. This approach is often referred to as the Elmore
delay model.
RC Trees
Let us consider an RC network, and try to compute the time it takes for a given output to either rise to a
certain voltage (rise time), or fall to a certain voltage (fall time). For example, our model should allow us
to compute how long it will take for the input of the sense amplifier to register a certain voltage after we
enable the word lines.
Let us make two assumptions. The first is that we consider an RC tree and not a general RC network.
This means that there are no cycles in our network. Most circuits can be modelled as RC trees and only in
rare cases where we have a feedback mechanism, we have cycles in our network. Hence, we are not losing
much by assuming only RC trees.
The second assumption that we make is that we consider only a single type of voltage source: one that
provides a step input (see Figure 7.34).
Figure 7.34: Step inputs: (a) a 0 → 1 step, (b) a 1 → 0 step
We consider two kinds of such inputs: a 0 → 1 transition and a 1 → 0 transition. In digital circuits we
typically have such transitions. We do not transition to any intermediate values. Thus, the usage of the step
function is considered to be standard practice. For the sake of simplicity, we assume that a logical 0 is at 0
V and a logical 1 is at 1 V.
Analysis of an RC Tree
Let us consider a generic RC tree as described by Horowitz [Horowitz, 1983]. Consider a single voltage source
that can be treated as the input. As discussed, it is a step input that can either make a 0 → 1 transition or
a 1 → 0 transition. Let us assume that it makes a 1 → 0 transition (the reverse case is analogous).
Let us draw an RC tree and number the resistors and capacitors (see Figure 7.35). Note that between
an output node and the voltage source we only have a series of resistors; we do not have any capacitors.
All the capacitors are between a node and ground.
Each capacitor can be represented as a current source. For a capacitor with capacitance C, the charge
that it stores is V (t)C, where V (t) is the voltage at time t. We assume that the input voltage makes
a transition at t = 0. Now, the current leaving the capacitor is equal to −CdV (t)/dt. Let us draw an
equivalent figure where our capacitors are replaced by current sources. This is shown in Figure 7.36.
The goal is to compute Vx (t), where x is the number of the output node (shown in an oval shaped box
in the figure). Let us show how to compute the voltage at node 3 using the principle of superposition. If
Figure 7.35: An example RC tree
we have n current sources, we consider one at a time. When we are considering the k th current source we
disconnect (replace with an open circuit) the rest of the n − 1 current sources. This reduced circuit has just
one current source. We then proceed to compute the voltage at node 3.
In this RC tree, only node 0 is connected to a voltage source, which makes a 1 → 0 transition at t = 0.
The rest of the nodes are floating. As a result the current will flow towards node 0 via a path consisting
exclusively of resistors.
Now, assume that the current source at node 4 is connected, and the rest of the current sources are
replaced with open circuits. The current produced by the current source is equal to −C4 dV44 /dt. The term
Vij refers to the voltage at terminal i because of the effect of the current source placed at terminal j using
our methodology. The voltage at node 1 is therefore −R1 C4 dV44 /dt. Since the rest of the nodes are floating,
this is also equal to the voltage at node 3. We thus have:
\[
V_{34} = -R_1 C_4 \frac{dV_{44}}{dt} \qquad (7.5)
\]
Let us do a similar analysis when the current source attached to node 2 is connected. In this case, the
voltage at node 2 is equal to the voltage at node 3. The voltage at node 2 or 3 is given by the following
equation:
\[
V_{22} = V_{32} = -(R_1 + R_2) C_2 \frac{dV_{22}}{dt} \qquad (7.6)
\]
Let us generalise these observations. Assume we want to compute the voltage at node i, when the current
source attached to node j is connected. Now consider the path between node 0 (voltage source) and node i.
Let the set of resistors on this path be P0i . Similarly, let the set of resistors in the path from node 0 to node
Figure 7.36: The RC tree with the capacitors replaced by current sources
j be P0j. Now, let us consider the intersection of these paths and find all the resistors that are in common.
These resistors are given by the set P0i ∩ P0j. Let Rij denote the sum of the resistances of the resistors in this set.
Now, please convince yourself that Equations 7.5 and 7.6 are special cases of the following equation.
Assume that we only consider the current source at node j.
\[
V_{ij} = -R_{ij} C_j \frac{dV_{jj}}{dt} \qquad (7.9)
\]
Now, if we consider all the capacitors one by one and use the principle of superposition, we compute Vi
to be a sum of the voltages at i computed by replacing each capacitor with a current source.
\[
V_i = \sum_j V_{ij} = -\sum_j R_{ij} C_j \frac{dV_{jj}}{dt} \qquad (7.10)
\]
Unfortunately, it is hard to solve such a system of simultaneous differential equations quickly and
accurately. It is therefore imperative that we make some approximations.
This is exactly where Elmore [Elmore, 1948] proposed his famous approximation. Let us assume that
dVjj /dt = αdVi /dt, where α is a constant. This is also referred to as the single pole approximation (refer to
the concept of poles and zeros in electrical networks). Using this approximation, we can compute the voltage
at node i to be
\[
V_i^* = -\alpha \sum_j R_{ij} C_j \frac{dV_i}{dt} \qquad (7.11)
\]
Here, Vi∗ is the voltage at node i computed using our approximation. Let us now consider the error
∫(Vi − Vi∗) dt. Assume α = 1 in the subsequent equation. We have
\[
\begin{aligned}
V_i - V_i^* &= -\sum_j R_{ij} C_j \frac{dV_{jj}}{dt} + \sum_j R_{ij} C_j \frac{dV_i}{dt} \\
&= -\sum_j R_{ij} C_j \left( \frac{dV_{jj}}{dt} - \frac{dV_i}{dt} \right) \\
\int (V_i - V_i^*)\, dt &= -\sum_j R_{ij} C_j \int \left( \frac{dV_{jj}}{dt} - \frac{dV_i}{dt} \right) dt \\
&= -\sum_j R_{ij} C_j \big[ V_{jj} - V_i \big]_0^{\infty} \\
&= 0
\end{aligned} \qquad (7.12)
\]
Note that the expression [Vjj − Vi] evaluated from 0 to ∞ is 0, because both Vi and Vjj start from the same
voltage (1 V in this case) and end at the same voltage (0 V in this case). Thus, we can conclude that the error
∫(Vi − Vi∗)dt = 0, when we assume that dVjj/dt = dVi/dt for all i and j. This is the least possible error,
and thus we can conclude that our approximation with α = 1 minimises the error as we have defined it
(difference of the two functions). Let us now try to solve the equations for any Vi using our approximation.
Let us henceforth not use the term Vi∗ . We shall use the term Vi (voltage as a function of time) to refer to
the voltage at node i computed using Elmore’s approximations.
We thus have
\[
\begin{aligned}
V_i &= -\sum_j R_{ij} C_j \frac{dV_i}{dt} \\
&= -\tau_i \frac{dV_i}{dt} \qquad \left( \tau_i = \sum_j R_{ij} C_j \right) \\
\Rightarrow \frac{dt}{\tau_i} &= -\frac{dV_i}{V_i} \\
\Rightarrow \frac{t}{\tau_i} - \ln(k) &= -\ln(V_i) \qquad \text{($\ln(k)$ is the constant of integration)} \\
\Rightarrow V_i &= k e^{-t/\tau_i} \\
\Rightarrow V_i &= V_0 e^{-t/\tau_i} \qquad \text{(at $t = 0$, $V_i = V_0 = k$)}
\end{aligned} \qquad (7.13)
\]
Vi thus reduces exponentially with time constant τi . Recall that this equation is similar to a capacitor
discharging in a simple RC network consisting of a single resistor and capacitor. Let us now use this formula
to compute the time it takes to discharge a long copper wire (see Example 6).
Example 6
Compute the delay of a long copper wire.
Answer: Let us divide a long copper wire into n short line segments. Each segment has an associated
resistance and capacitance, which are assumed to be the same for all the segments.
Let the total resistance of the wire be R and the total capacitance be C. Then the resistance and
capacitance of each line segment is R/n and C/n respectively. Let terminal i be the end point of segment
i. The time constant measured at terminal n follows from Equation 7.13. The voltage at terminal n satisfies
\[
V_n = -\sum_j R_{nj} C_j \frac{dV_n}{dt} \qquad (7.14)
\]
Any Rnj in this network is equal to the sum R1 + R2 + · · · + Rj, where Ri and Ci correspond to the resistance
and capacitance of the ith line segment respectively. We can assume that ∀i, Ri = R/n and ∀i, Ci = C/n.
Hence, Rnj = jR/n. The time constant of the wire is therefore equal to
\[
\begin{aligned}
\tau = \sum_{j=1}^{n} R_{nj} C_j &= \frac{C}{n} \sum_{j=1}^{n} \frac{R}{n} \times j \\
&= \frac{C}{n} \times \frac{R}{n} \times \frac{n(n+1)}{2} = \frac{C}{n} \times R \times \frac{n+1}{2} \\
&= RC \times \frac{n+1}{2n}
\end{aligned} \qquad (7.15)
\]
As n → ∞, τ → RC/2. We can assume that the time constant of a wire is equivalent to that of a
simple RC circuit that has the same capacitance and half the resistance of the wire (or vice versa).
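We can verify this limit numerically. The short C program below divides a wire into n equal segments and sums Rnj × Cj exactly as in Equation 7.15; the values of R and C are arbitrary, and the computed τ converges to RC/2 as n grows.

#include <stdio.h>

/* Elmore time constant at the far end of a wire that is divided into n equal
   segments, each with resistance R/n and capacitance C/n:
   tau = sum over j of R_nj * C_j, where R_nj = j * (R/n). */
static double wire_tau(double R, double C, int n) {
    double tau = 0.0;
    for (int j = 1; j <= n; j++)
        tau += (j * R / n) * (C / n);
    return tau;
}

int main(void) {
    double R = 1000.0;   /* total wire resistance in ohms (illustrative)    */
    double C = 1e-12;    /* total wire capacitance in farads (illustrative) */
    int n_values[] = {1, 10, 100, 10000};
    for (int i = 0; i < 4; i++)
        printf("n = %5d : tau = %.4e s\n", n_values[i], wire_tau(R, C, n_values[i]));
    /* As n grows, tau converges to R*C/2 = 5e-10 s */
    return 0;
}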
We can draw some interesting conclusions from Example 6. The first is that the time constant of a wire
is equal to RC/2, where R and C are the resistance and capacitance of the entire wire respectively. Let
the resistance and capacitance for a small segment of the wire be r and c respectively. Then, we have the
following relations.
\[
R = nr, \qquad C = nc \qquad (7.16)
\]
Hence, the time constant, τ, is equal to rcn^2/2. Recall that the time constant is the time it takes for the
input to rise to 63% of its final value (1 − 1/e), or the output to fall to 37% of the maximum value (1/e).
We can extend this further. A typical RC circuit charges or discharges by 98% after 4τ units of time. After
5τ units of time, the voltage is within 0.7% of its final value. If we set a given threshold for the voltage
for deciding whether it is a logical 0 or 1, then the time it takes to reach that threshold can be expressed in
terms of time constants. It is common to refer to the time a circuit takes to respond to an input in terms of
time constants.
Now, given that τ = rcn^2/2 for a long wire, we can quickly deduce that the delay is proportional
to the square of the wire’s length. This is bad news for us because it means that long wires are not
scalable and thus should not be used.
τ is the time constant assuming a step input, vth is the threshold voltage as a fraction of the supply
voltage, trise is the rise time of the input, and b is the fraction of the input’s swing at which the output
changes (Cacti 1 uses a value of b = 0.5). We have a similar equation for the time it takes for an input to
fall.
\[
delay_{fall} = \tau \sqrt{\left(\log(v_{th})\right)^2 + 2\, t_{fall}\, b\, (1 - v_{th})/\tau} \qquad (7.18)
\]
Equations 7.17 and 7.18 are primarily based on empirical models that describe the behaviour of transistors
in the linear and saturation regions. These equations can change with the transistor technology and are thus
not fundamental principles. Hence, it is necessary to change these equations appropriately if we are trying
to use a different kind of transistor.
Figure 7.37: Equivalent circuit of a bit line (adapted from [Wilton and Jouppi, 1993])
Rmem is the combined resistance of the word line transistor and the NMOS transistor in the memory cell
(via which the bit line discharges). These transistors connect the bit line to the ground. Cline is the effective
capacitance of the entire bit line. This includes the drain capacitance of the pass transistors (controlled
by the word lines), the capacitance that arises due to the metallic portion of the bit line, and the drain
capacitances of the precharge circuit and the column multiplexer.
Rcolmux and Ccolmux represent the resistance of the pass transistor in the column multiplexer and the
output capacitance of the column multiplexer respectively.
Rline needs some explanation. Refer to Example 6, where we had computed the time constant of a long
wire to be approximately RC/2, where R and C are its resistance and capacitance respectively. A model
that treats a large object as a small object with well defined parameters is known as a lumped model. In
this case, the lumped resistance of the entire bit line is computed as follows:
\[
R_{line} = \frac{\#rows}{2} \times R_{segment} \qquad (7.19)
\]
Here, Rsegment is the resistance of the segment of a bit line corresponding to one row of SRAM cells. We
divide it by 2 because the time constant in the lumped model of a wire is RC/2. We need to divide either
the total resistance or capacitance by 2.
Using the Elmore delay model, the time constant (τ) of the bit line is equal to Rmem × Cline + (Rmem + Rline + Rcolmux) × Ccolmux.
Figure 7.38: Equivalent circuit for a word line (adapted from [Wilton and Jouppi, 1993])
Figure 7.38 shows the equivalent circuit for a word line. In this circuit, Rpull−up is the pull-up resistance;
its value is determined by the internal resistance of the word line drivers.
Cequiv is the equivalent capacitance of the word line. It is given by the following equation:
\[
C_{equiv} = 2 \times \#cols \times C_{gatewl} + C_{metal}
\]
The parameter #cols refers to the number of columns in a row. Cgatewl is the gate capacitance of each
pass transistor that is controlled by the word line. Since there are two such transistors per memory cell, we
need to multiply this value by 2. Finally, Cmetal is the capacitance of just the metallic portion of the word
line. Cequiv is a sum of all these parameters.
The time constant in this case is simply Rpull−up × Cequiv .
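The two time constants can be combined into a tiny back-of-the-envelope delay estimator, sketched below in C. All the component values are illustrative placeholders (a tool such as Cacti derives them from the technology parameters), and the word line capacitance uses the expression for Cequiv given above.

#include <stdio.h>

int main(void) {
    /* Bit line model (Figure 7.37); all values are illustrative placeholders */
    double R_mem     = 20e3;    /* word line + cell transistor resistance (ohm) */
    double R_segment = 5.0;     /* bit line resistance per row (ohm)            */
    double C_line    = 100e-15; /* total bit line capacitance (F)               */
    double R_colmux  = 2e3;     /* column multiplexer pass transistor (ohm)     */
    double C_colmux  = 10e-15;  /* column multiplexer output capacitance (F)    */
    int    rows      = 128;
    int    cols      = 256;

    double R_line      = (rows / 2.0) * R_segment;             /* Equation 7.19 */
    double tau_bitline = R_mem * C_line
                       + (R_mem + R_line + R_colmux) * C_colmux;

    /* Word line model (Figure 7.38) */
    double R_pullup  = 5e3;     /* internal resistance of the word line driver */
    double C_gatewl  = 0.5e-15; /* gate capacitance of one pass transistor     */
    double C_metal   = 20e-15;  /* capacitance of the metallic word line       */
    double C_equiv   = 2.0 * cols * C_gatewl + C_metal;
    double tau_wordline = R_pullup * C_equiv;

    printf("bit line tau  = %.3e s\n", tau_bitline);
    printf("word line tau = %.3e s\n", tau_wordline);
    return 0;
}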
Figure 7.40: A pipelined cache: the tag array access, data array access, tag comparison, and data block selection stages separated by pipeline latches
We need to add pipeline latches or buffers between these stages (similar to what we had done in an
in-order pipeline) to create a pipelined cache. The resulting structure is shown in Figure 7.40. Of course,
we are making many simplistic assumptions in this process, notably that the time it takes to complete each
subtask (stage) is roughly the same. Sometimes we might wish to create more pipeline stages.
In pursuance of this goal, it is possible to break the SRAM array access process into two stages: address
decode and row access. If the decoder has N outputs, then we can create a large N -bit pipeline latch to
temporarily store its output. In the next stage we can access the target row of the SRAM array. This will
increase the depth of the pipeline to 4 stages. It is typically not possible to pipeline the row access process
because it is basically an analog circuit.
Even though the exact nature of pipelining may differ, the key idea here is that we need to pipeline the
cache to ensure that it does not lock up while processing a request. The process of pipelining ensures a much
larger throughput, which is mandatory in high performance systems.
Now, if we consider write accesses as well, then they can be pipelined in a similar fashion: first access
the tag array, then compare the tag part of the address with all the tags in the set, and finally perform the
write. If the subsequent access is a read, then we require forwarding; therefore forwarding paths similar to
in-order processors are required here as well. Working out the details of these forwarding paths is left as an
exercise for the reader.
MSHR
Figure 7.41: An MSHR and its miss queue
are called secondary misses. The method to handle secondary misses is conceptually similar to the way we
handled memory requests in the LSQ. Assume that a secondary miss is a write. We create an entry at the
tail of the miss queue, which contains the value that is to be written along with the address of the word
within the block. Now, assume that the secondary miss is a read. In this case, if we are reading a single
memory word, then we first check the earlier entries in the miss queue to see if there is a corresponding
write. If this is the case, then we can directly forward the value from the write to the read. There is no need
to queue the entry for the read. However, if such forwarding is not possible, then we create a new entry
at the tail of the miss queue and add the parameters of the read request to it. This includes the details of
the memory request such as the id of the destination register (in the case of the L1 cache) or the id of the
requesting cache (at other levels) – referred to as the tag in Figure 7.41.
The advantage of an MSHR is that instead of sending multiple miss requests to the lower level, we send
just one. In the meantime, the cache continues to serve other requests. Then, when the primary miss returns
with the data, we need to take a look at all the entries in the miss queue, and start applying them in order.
After this process, we can write the modified block to the cache, and return all the read/write requests to
the upper level.
Now let us account for the corner cases. We might have a lot of outstanding memory requests, which
might exhaust the number of entries in a miss queue, or we might run out of miss queues. In such a scenario,
the cache needs to lock up and stop accepting new requests.
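The following C sketch models a single MSHR and its miss queue at a purely functional level. The structure layout, field names, and queue size are our own illustrative assumptions; real hardware implements the same idea with CAM-style lookups rather than loops.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_QUEUE 8

typedef struct {
    bool     is_write;
    uint32_t word_offset;   /* address of the word within the block   */
    uint64_t data;          /* value to be written (writes only)      */
    int      tag;           /* destination register id / requester id */
} MissEntry;

typedef struct {
    uint64_t  block_addr;   /* address of the block being fetched     */
    int       count;
    MissEntry queue[MAX_QUEUE];
} MSHR;

/* Handle a secondary miss. A read first tries to forward the value from an
   earlier queued write to the same word; otherwise a new entry is queued at
   the tail. Returns false if the miss queue is full (the cache locks up). */
static bool secondary_miss(MSHR *m, MissEntry req, uint64_t *fwd_value) {
    if (!req.is_write) {
        for (int i = m->count - 1; i >= 0; i--) {      /* youngest entry first */
            if (m->queue[i].is_write &&
                m->queue[i].word_offset == req.word_offset) {
                *fwd_value = m->queue[i].data;         /* write -> read forwarding */
                return true;                           /* nothing is queued        */
            }
        }
    }
    if (m->count == MAX_QUEUE)
        return false;                                  /* structural stall */
    m->queue[m->count++] = req;
    return true;
}

int main(void) {
    MSHR m = { .block_addr = 0x4000, .count = 0 };
    uint64_t v = 0;
    MissEntry w = { .is_write = true,  .word_offset = 8, .data = 42, .tag = 3 };
    MissEntry r = { .is_write = false, .word_offset = 8, .tag = 5 };
    secondary_miss(&m, w, &v);                 /* queue the write             */
    if (secondary_miss(&m, r, &v))             /* read is forwarded the value */
        printf("read gets %llu via forwarding\n", (unsigned long long)v);
    return 0;
}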
To start the discussion, let us consider a traditional 2-way set associative cache first. Given a block, we
first find its set index, which is a subset of the bits in the block address. Let its set index be I. Then, the two
locations that can potentially contain a copy of the block are 2I and 2I + 1. If two blocks with addresses
A1 and A2 have the same set index, then both of them will vie for the same locations on the cache: 2I and
2I + 1. If we have three such blocks that are accessed frequently, then we will have a large number of misses.
Let us do something to minimise the contention by designing a skewed associative cache.
Let us partition a cache into two subcaches. Let us use two different functions – f1 and f2 – to map a
block to lines in the different subcaches. This means that if a block has address A, we need to search for
it in the lines f1 (A) and f2 (A) in the two subcaches, respectively. Let function fk refer to a line in the k th
subcache. The crux of the idea is to ensure that for two block addresses, A1 and A2 , if f1 (A1 ) = f1 (A2 )
(in subcache 1), then f2(A1) ≠ f2(A2) (in subcache 2). In simple terms, if two blocks have a conflict in one
subcache, they should have a very high likelihood of not conflicting in the other subcache. We can easily
extend this idea to a cache with k subcaches. We can create separate mapping functions for each subcache
to ensure that even if a set of blocks have conflicts in a few subcaches, they do not conflict in the rest of the
subcaches. To implement such a scheme, we can treat each bank as a subcache.
The operative part of the design is the choice of functions to map block addresses to lines in subcaches.
The main principle that needs to be followed is that if two blocks map to the same line in subcache 1, then
their probability of mapping to the same line in subcache 2 should be 1/N, where N is the number of lines in
a subcache. The functions f1 () and f2 () are known as skewing functions.
Figure 7.42: Dividing a block address into three parts: A3 (the remaining bits), A2 (n bits), and A1 (the lowest n bits)
Let us discuss the skewing functions described by Bodin and Seznec [Bodin and Seznec, 1997]. We divide
a block address into three parts as shown in Figure 7.42. Assume each subcache has 2^n cache lines. We
create three chunks of bits: A1 (lowest n bits), A2 (n bits after the bits in A1 ), and A3 (rest of the MSB
bits).
The authors use a function σ that shuffles the bits similar to shuffling a deck of playing cards. There are
several fast algorithms in hardware to shuffle a set of bits. Discussing them is beyond the scope of this book.
For a deeper understanding of this process, readers can refer to the seminal paper by Diaconis [Diaconis
et al., 1983]. For a 4-way skewed associative cache (see Figure 7.43), where each logical subcache is mapped
to a separate bank, the four mapping functions for block address A are as follows. The ⊕ sign refers to the
XOR function.
Bank 1: f1(A) = A1 ⊕ A2
Bank 2: f2(A) = σ(A1) ⊕ A2
Bank 3: f3(A) = σ(σ(A1)) ⊕ A2
Bank 4: f4(A) = σ(σ(σ(A1))) ⊕ A2
These functions can be computed easily in hardware, and we can thus reduce the probability of conflicts
to a large extent as observed by Bodin and Seznec. In a skewed associative cache, let us refer to the locations
at which a block can be stored as its set. A set is distributed across subcaches.
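The mapping functions can be sketched in a few lines of C, as shown below. The shuffle function used here (a single riffle of the n index bits) is only an illustrative stand-in for the σ function of Bodin and Seznec, and the number of index bits is an assumption.

#include <stdint.h>
#include <stdio.h>

#define N_BITS 10                       /* each subcache has 2^10 lines (assumed) */
#define MASK   ((1u << N_BITS) - 1)

/* A simple riffle shuffle of the lower N_BITS bits: the two halves of the bit
   vector are interleaved. This is only a stand-in for the sigma function. */
static uint32_t sigma(uint32_t x) {
    uint32_t lo = x & ((1u << (N_BITS / 2)) - 1);
    uint32_t hi = (x >> (N_BITS / 2)) & ((1u << (N_BITS / 2)) - 1);
    uint32_t out = 0;
    for (int i = 0; i < N_BITS / 2; i++) {
        out |= ((hi >> i) & 1u) << (2 * i);        /* even bit positions */
        out |= ((lo >> i) & 1u) << (2 * i + 1);    /* odd bit positions  */
    }
    return out & MASK;
}

/* Line indices in the four banks for a block address A (block offset removed) */
static void skewed_map(uint32_t A, uint32_t f[4]) {
    uint32_t A1 = A & MASK;                        /* lowest n bits */
    uint32_t A2 = (A >> N_BITS) & MASK;            /* next n bits   */
    f[0] = (A1 ^ A2) & MASK;
    f[1] = (sigma(A1) ^ A2) & MASK;
    f[2] = (sigma(sigma(A1)) ^ A2) & MASK;
    f[3] = (sigma(sigma(sigma(A1))) ^ A2) & MASK;
}

int main(void) {
    uint32_t f[4];
    skewed_map(0x3ABCD, f);
    for (int i = 0; i < 4; i++)
        printf("bank %d -> line %u\n", i + 1, f[i]);
    return 0;
}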
Figure 7.43: Skewed associative cache
The last piece remaining is the replacement policy. Unlike a traditional cache, where we can keep LRU
timestamps for each set, here we need a different mechanism. We can opt for a very simple pseudo-LRU
policy. Assume we want to insert a new block, and all the cache lines in its set are non-empty. We ideally
need to find the line that has been accessed the least. One approach to almost achieve this is to have a bit
along with each cache line. When the cache line is accessed this bit is set to 1. Periodically, we clear all such
bits. Now, when we need to evict a line out of the set of k lines in a k-way skewed associative cache, we can
choose that line whose bit is 0. This means that it has not been accessed in the recent past. However, if the
bits for all the lines where the given block can be inserted are set to 1, then we can randomly pick a block
as a candidate for replacement.
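The victim selection logic of this pseudo-LRU scheme is simple enough to sketch directly; the snippet below assumes a 4-way skewed associative cache and a per-line accessed bit that is cleared periodically by separate logic.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define K 4   /* k-way skewed associative cache: one candidate line per subcache */

/* Pick a victim among the k candidate lines of a block's set. accessed[i] is
   the per-line "recently accessed" bit; the bits are assumed to be cleared
   periodically by separate logic. */
static int pick_victim(const bool accessed[K]) {
    for (int i = 0; i < K; i++)
        if (!accessed[i])      /* not touched since the last clearing            */
            return i;
    return rand() % K;         /* all lines recently accessed: pick one randomly */
}

int main(void) {
    bool accessed[K] = {true, true, false, true};
    printf("victim line: %d\n", pick_victim(accessed));   /* prints 2 */
    return 0;
}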
We can always do something more sophisticated. This includes associating a counter with each line, which
is decremented periodically, and incremented when the line is accessed (similar to classical pseudo-LRU). We
can also implement Cuckoo hashing. Assume that for address A, all the lines in its set are non-empty. Let
a block with address A0 be present in its set (in subcache 2). It is possible that the line f1 (A0 ) in subcache
1 is empty. Then the block with address A0 can be moved to subcache 1. This will create an empty line
in the set, and the new block with address A can be inserted there. This process can be made a cascaded
process, where we remove one block, move it to one of its alternative locations, remove the existing block in
the alternative location, and try to place it in another location in its set, until we find an empty line.
The idea of way prediction is to predict the way that contains the data and access it first. If there is a miss, then we access the rest of the ways using the regular
method. The efficacy of this approach completely depends on the accuracy of the predictor.
Let us describe two simple way predictors (see [Powell et al., 2001]). The main idea is to predict the
way in which we expect to find the contents of a block before we proceed to access the cache. If after
computing the memory address, it will take a few cycles to translate it and check the LSQ for a potential
forwarding opportunity, then we can use the memory address itself for way prediction. Let us thus consider
the more difficult case where we can access the cache almost immediately because address translation and
LSQ accesses are not on the critical path either because of speculation or because they are very fast. In this
case, we need to use information other than the memory address to predict the way.
For predicting the way in advance, the only piece of information that we have at our disposal is the PC
of the load or the store. This is known well in advance and thus we can use a similar table as we had used
for branch prediction or value prediction to predict the way. For a k-way set associative cache, we need to
store log2 (k) bits per entry. Whenever we access the cache, we first access the predicted way. If we do not
find the entry there, then we check the rest of the ways using the conventional approach. Let us take a look
at the best case and the worst case. The best case is that we have a 100% hit rate with the way predictor.
In this case, our k-way set associative cache behaves as a direct mapped cache. We access only a single way.
The energy is also commensurately lower.
Let us consider the worst case at the other end of the spectrum, where the hit rate with the predicted way
is 0%. In this case, we first access the predicted way, and then realise that the block is not contained in that
way. Subsequently, we proceed to access the rest of the ways using the conventional cache access mechanism.
This is both a waste of time and a waste of energy. We unnecessarily lost a few cycles in accessing the
predicted way, which proved to be absolutely futile. The decision of whether to use a way predictor or not
is thus dependent on its accuracy and the resultant performance gains (or penalties).
Let us now discuss another method that has the potential to be more accurate than the PC based predictor.
The main problem with the PC based predictor is that the computed memory addresses can keep changing.
For example, with an array access, the memory address keeps getting incremented, and thus predicting the
way becomes more difficult. It is much easier to do this with the memory address of the load or store.
However, we do not want to introduce another way prediction stage between the stage that computes the
value of the memory address and the cache access because this is on the critical path.
Let us find a solution in the middle, where we can compute the prediction just before the cache access.
Recall that after we read the registers, the values are sent to the execution unit. There is thus at least a
single cycle gap between the time at which the value of the base register is available, and the time at which
the address reaches the L1 cache. We have the execute stage in the middle that adds the offset to the address
stored in the base register. Let us use this time productively. Once we have the value of the base register,
let us compute a XOR between this value and the offset.
Consider an example: a load instruction ld r2, 12[r1]. In this case, the base register is r1. We read
the value V of r1 in the register read stage. Subsequently, we add 12 to V and this becomes the memory
address of the load instruction. Next, we need to update the corresponding entry in the LSQ and check
for store→load forwarding. Meanwhile, let us compute the XOR of V and 12. We compute R = V ⊕ 12.
Similar to what we had done in the GShare predictor, the result R of the XOR operation in some sense
captures the value of the base register and the offset. Computing a XOR is much faster than doing an
addition and possibly checking for forwarding in the LSQ. Thus, in the remaining part of the clock cycle we
can access a 2n -entry way prediction cache that is organised in a manner similar to a last value predictor
(see Section 5.1.5) using n bits from the result, R. Once the address has been computed, we can access
the cache using the predicted way. This approach uses a different source of information, which is not the
memory address, but is a quantity that is similar to it in terms of information content.
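A rough C sketch of this XOR-based scheme is shown below; the table size and the exact hash are assumptions made only for illustration.

#include <stdint.h>

#define WAYPRED_ENTRIES 1024   /* 2^n entries; here n = 10 (assumed) */

static uint8_t xor_way_pred[WAYPRED_ENTRIES];

/* Index the way prediction table with (base register value XOR offset),
   in the spirit of the GShare predictor. This can be computed in the
   register-read stage, before the actual address (base + offset) is ready. */
static inline int predict_way_xor(uint64_t base_val, int64_t offset) {
    uint64_t r = base_val ^ (uint64_t) offset;     /* R = V xor offset    */
    return xor_way_pred[r % WAYPRED_ENTRIES];      /* use n bits of R     */
}

static inline void update_way_xor(uint64_t base_val, int64_t offset, int actual_way) {
    uint64_t r = base_val ^ (uint64_t) offset;
    xor_way_pred[r % WAYPRED_ENTRIES] = (uint8_t) actual_way;
}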
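The discussion that follows analyses the cache behaviour of the classic three-loop matrix multiplication kernel. For reference, a minimal sketch of that kernel is given below; the loop order and the assumption that C is zero-initialised are consistent with the analysis, but the original listing may be formatted differently.

/* Classic three-loop matrix multiplication: C = A * B.
   Assumes that C has been initialised to zero. */
void matmul(int N, double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}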
This is the classic matrix multiplication algorithm – nice and simple. However, this code is not efficient
from the point of view of cache accesses for large values of N , which is most often the case. Let us understand
why. Assume that N is a very large number; hence, none of the matrices fit within the L1 cache. In the case
of this algorithm, we multiply a row in A with a column in B, element by element. Subsequently, we move
to the next column in B till we reach the end of the matrix. Even though we have temporal locality for the
elements of the row in A, we essentially touch all the elements in B, column by column. The N^2 accesses
to the elements of B do not exhibit any temporal locality, and if N is large, we shall have a lot of
capacity misses. Thus the cache performance of this code is expected to be very poor.
Let us now see what happens in the subsequent iteration of the outermost loop. We choose the next
row of A and then again scan through the entire matrix B. We do not expect any elements of B to have
remained in the cache after the last iteration because these entries would have been displaced from the cache
given B’s size. Thus there is a need to read the elements of the entire matrix (B in this case) again. We
can thus conclude that the main reason for poor temporal locality, and consequently poor cache hit rates, is
that in every iteration of the outermost loop, we need to read the entire matrix B from the lower levels
of the memory hierarchy. This is because it does not fit in the higher level caches. If we can somehow increase the degree
of temporal locality, then we can improve the cache hit rates as well as the overall performance.
The key insight here is to not read the entire matrix B in every iteration. We need to consider small
regions of A and small regions of B, process them, and then move on to other regions. We do not have the
luxury of reading large amounts of data every iteration. Instead, we need to look at small regions of both
the matrices simultaneously. Such regions are also called tiles, and thus our algorithm is called
loop tiling.
Let us start by looking at matrix multiplication graphically as depicted in Figure 7.44. In traditional
matrix multiplication (Figure 7.44(a)), we take a row of matrix A and multiply it with a column of matrix
B. If the size of each row or column is large, then we shall have a lot of cache misses. In comparison, the
approach with tiling is significantly different. We consider a b × b tile in matrix A, and a same-sized tile in
matrix B. Then we multiply them using our conventional matrix multiplication technique to produce a b × b
tile (see Figure 7.44(b)). The advantage of this approach is that at any point of time, we are only considering
three b × b tiles with b^2 elements each. Thus, the total amount of working memory that we require is 3b^2 elements. If
this data fits in the cache, then we can have a great degree of temporal locality in our computations.
Figure 7.44: Matrix multiplication: (a) normal (b) with tiling
Now that we have looked at the insight, let us look at the code of an algorithm that uses this kind
of loop tiling or blocking (refer to Listing 7.2). Assume that the result matrix C is initialised to all zeros.
Additionally, assume that both the input matrices, A and B, are N × N matrices, where N is divisible by
the tile size b.
There are many implementations of tiling. There are variants that have 5 nested loops. We show a
simpler implementation with 6 nested loops. First we consider the three matrices consisting of arrays of
tiles, where each tile has b × b elements. Similar to traditional matrix multiplication, we iterate through all
combinations of tiles in the three outermost loops; we essentially choose two tiles, one each from matrices A and
B (see the sketch below). The first tile starts at (ii, kk); it is b elements deep and b elements wide.
Similarly, the second tile starts at (kk, jj) – its dimensions are also b × b.
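The following C sketch reconstructs the six-loop tiled implementation from the description in the text; the variable names ii, jj, kk, i, j, and k follow the text, but the exact form of Listing 7.2 may differ.

/* Tiled (blocked) matrix multiplication: C = A * B.
   Assumes N is divisible by the tile size b, and that C is zero-initialised. */
void matmul_tiled(int N, int b, double A[N][N], double B[N][N], double C[N][N]) {
    /* Three outermost loops: choose a tile of A starting at (ii, kk)
       and a tile of B starting at (kk, jj). */
    for (int ii = 0; ii < N; ii += b)
        for (int jj = 0; jj < N; jj += b)
            for (int kk = 0; kk < N; kk += b)
                /* Three innermost loops: multiply the two b x b tiles and
                   accumulate into the b x b tile of C starting at (ii, jj). */
                for (int i = ii; i < ii + b; i++)
                    for (int j = jj; j < jj + b; j++)
                        for (int k = kk; k < kk + b; k++)
                            C[i][j] += A[i][k] * B[k][j];
}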
Next, let us turn our attention to the three innermost loops. This part is similar to traditional matrix
multiplication: we iterate through every individual element in these tiles, multiply the corresponding
elements from the input matrices, and add the product to the value of the result element
C[i][j]. Let us convince ourselves that this algorithm is correct, and that it is equivalent to the traditional matrix
multiplication algorithm.
This is easy to prove. Consider the traditional matrix multiplication algorithm. We consider all combinations
of i, j, and k. For each combination we multiply A[i][k] and B[k][j], and add the result to the current
value of the result element C[i][j]. There are N^3 such combinations, and that is the reason
we need three loops.
In this case, we simply need to prove that the same thing is happening. We need to show that we are
considering all combinations of i, j, and k, and the result is being computed in the same manner. To prove
this, let us start out by observing that the three outermost loops ensure that we consider all combinations
of b × b tiles across matrices A and B. The three innermost loops ensure that for each pair of input tiles,
we consider all the values that the 3-tuple (i, j, k) can take. Combining both of these observations, we can
conclude that all the combinations of i, j, and k are being considered. Furthermore, the reader should
also convince herself that no combination is being considered twice. Finally, we perform the multiplication
between elements in the same way, and also compute the result matrix C in the same way. A formal proof
of correctness is left as an exercise for the reader.
The advantage of such techniques is that they confine the execution to small sets of tiles. Thus, we can
take advantage of temporal locality, and consequently reduce cache miss rates. Such techniques have a very
rich history, and are considered vitally important for designing commercial implementations of linear algebra
subroutines.
Figure 7.45: Breakup of a memory address for accessing a set associative cache (20-bit page/frame number and 12-bit offset). This example is for a 32-bit memory system with 4 KB pages.
What is common between the address of a byte within a page and within the corresponding frame? Here, the key insight is that in the process of address translation, the 12 LSB bits do
not change. They remain the same because the size of a page or a frame is 4 KB (as per our assumptions).
However, the remaining 20 MSB bits change according to the mapping between pages and frames. The
crucial insight is that the 12 LSB bits remain the same in the virtual and physical addresses. If we can find
the set index using these 12 bits, then it does not matter if we are using the physical address or the virtual
address. We can index the correct set before translating the address.
In this particular case, where we are using 12 bits to find the set index and block offset, we can use the
virtual address to access the set. Refer to the VIPT cache in Figure 7.46. When the memory address is
ready, and we are sure that there are no chances of store→load forwarding in the LSQ, we can proceed to
access the L1 cache. We first extract the 6 set index bits, and read out all the tags in the set. Simultaneously
(see the timing diagram in Figure 7.46) we perform the virtual to physical address translation by accessing
the TLB. The great advantage of the VIPT cache is that it allows us to overlap the tag accesses with the process of
address translation. In the next stage of the access, we have the physical address with us; we can then extract
its tag portion and compare it with the tags stored in the ways. The rest of the access
(read or write) proceeds as usual.
Note that the VIPT scheme has its limitations. If we have a large number of sets, then it is possible that
the set index bits are split between the page offset, and the page/frame number. Then, this approach is not
feasible. The only reason this approach works is because it is possible to access the set and read out all of
its constituent ways in parallel without translating the address.
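As a small illustration, the following C sketch extracts the block offset and set index from a virtual address, assuming 64-byte blocks and 64 sets so that both fields fit entirely within the 12-bit page offset; these sizes are assumptions consistent with the example above.

#include <stdint.h>

/* Assumed geometry: 64-byte blocks (6 offset bits) and 64 sets (6 index bits).
   6 + 6 = 12 bits, i.e., both fields lie entirely within the 4 KB page offset,
   so the set can be selected from the virtual address before translation. */
#define BLOCK_OFFSET_BITS 6
#define SET_INDEX_BITS    6

static inline uint32_t block_offset(uint64_t vaddr) {
    return (uint32_t)(vaddr & ((1u << BLOCK_OFFSET_BITS) - 1));
}

static inline uint32_t set_index(uint64_t vaddr) {
    return (uint32_t)((vaddr >> BLOCK_OFFSET_BITS) & ((1u << SET_INDEX_BITS) - 1));
}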
Let us now extend this idea. Assume that the number of block offset bits and set index bits adds up to
14. Since a page is 4 KB (12 bits), our VIPT scheme will not work. This is because we have two extra bits,
and they will not be the same after the mapping process. This is where we can get some help from either
software or hardware.
Let us discuss the software approach first. Assume that our process of translation is such that the
least significant 14 bits are always the same between physical and virtual addresses. This requires minimal
changes to the OS’s page mapping algorithms. However, the advantage is that we can then use the 14 LSB
bits to read out all the tags from the set (similar to the original VIPT scheme). We shall thus have all the
advantages of a virtually indexed physically tagged cache. However, there is a flip side: we are effectively
creating a super-page (larger than a regular page) of 16 KB (2^14 bytes). Frames
in memory need to be reserved at a granularity of 16 KB each. This might cause wastage of memory space.
Assume that in a frame, we are only using 8 KB; the remaining 8 KB will get wasted. We thus have a
trade-off between memory usage and performance. For some programs, such a trade-off might be justified.
This needs to be evaluated on a case-by-case basis.
Figure 7.46: Virtually indexed physically tagged (VIPT) cache: (a) structure of the cache, (b) timeline showing the address translation overlapped with the cache access
The other method of dealing with such cases is with a hardware trick. We start out by noting that
most of the time we have a good amount of temporal and spatial locality. Most consecutive accesses are
expected to be to the same page. We can thus keep the corresponding frame number in a small register.
We can speculatively read the translation from the register, create a physical address, and start accessing
the cache. This is a very fast operation as compared to a full TLB access. In parallel, we need to access
the TLB, and verify if the speculation is correct or not. If we have a high chance of success, then we have
effectively minimised the address translation overhead. The accuracy of this process can further be enhanced
by having a more sophisticated predictor that uses the PC, and possibly the memory address of the load.
The prediction however must be done before the request is ready to be sent to the L1 cache.
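A minimal C sketch of this hardware trick is shown below; the single-entry register that caches the last translation, and its fields, are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET_BITS 12

/* Single-entry register that remembers the last virtual-to-physical mapping. */
static uint64_t last_vpn;    /* last virtual page number       */
static uint64_t last_pfn;    /* corresponding frame number     */
static bool     last_valid;

/* Speculatively form a physical address; the TLB is consulted in parallel
   to verify the guess, and the access is squashed if the guess was wrong. */
static inline bool speculate_paddr(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn = vaddr >> PAGE_OFFSET_BITS;
    if (last_valid && vpn == last_vpn) {
        *paddr = (last_pfn << PAGE_OFFSET_BITS) |
                 (vaddr & ((1u << PAGE_OFFSET_BITS) - 1));
        return true;    /* speculative physical address available */
    }
    return false;       /* fall back to the TLB result */
}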
If we can somehow get rid of all of this circuitry, and manage to get decoded RISC micro-instructions directly
from the caches, then we can radically decrease the power consumption and improve performance. In other
words, we can completely skip the fetch and decode stages of the pipeline.
This does sound too good to be true. However, it is possible to get close. Designers at Intel tried to realise
this goal around the year 2000, when they designed a novel structure called a trace cache for the Intel® Pentium®
4 processor. In this section, we shall present the ideas contained in the patent filed by Krick et al. [Krick
et al., 2000]. We shall simplify some sections for the sake of readability. For all the details please refer to
the original patent.
Let us first explain the concept of a trace. A trace is a sequence of dynamic instructions that subsumes
branches and loop iterations. Let us explain with an example. Consider the following piece of C code.
int sum = 0;
for ( int i = 0; i < 3; i ++) {
    if ( i == 1) continue ; /* skip the body of the second iteration */
    sum = sum + arr [ i ];
}
We have a loop with 3 iterations and an if statement that skips the body of the second loop iteration. Let
us look at the sequence of instructions that the processor will execute. For the sake of readability we show
C statements instead of x86 assembly code. Let the label .loop point to the beginning of the loop, and the
label .exit point to the statement immediately after the loop. Note that we are not showing a well-formed
assembly program; we are instead just showing the dynamic sequence of instructions that the processor will
see (with simplifications made to increase readability).
/* initial part */
sum = 0;
i = 0;
/* first iteration */
if ( i >= 3) goto . exit ;
if ( i == 1) goto . temp ;
sum = sum + arr [ i ]; // i = 0
. temp : i = i + 1;
goto . loop ;
/* second iteration */
if ( i >= 3) goto . exit ;
if ( i == 1) goto . temp ; /* skip the body */
. temp : i = i + 1;
goto . loop ;

/* third iteration */
if ( i >= 3) goto . exit ;
if ( i == 1) goto . temp ;
sum = sum + arr [ i ]; // i = 2
. temp : i = i + 1;
goto . loop ;

/* fourth iteration */
if ( i >= 3) goto . exit ; /* exit the for loop */
The instructions in this unrolled loop form a trace. It is the sequence of instructions that the processor
is going to fetch to execute the code. If we can store the instructions corresponding to the entire trace in a
trace cache, then all that we need to do is to simply fetch the instructions in the trace and process them.
Furthermore, if we can also store them in their decoded format, then we can skip the power-hungry decode
stage. In the case of CISC processors, it is desirable if the trace cache stores micro-instructions instead of full
CISC instructions. If we observe a good hit rate in the trace cache, then we can save all the energy that would
have been consumed in the fetch and decode stages. In addition, the trace cache also serves as a branch
predictor: the sequence of subsequent trace segments implicitly encodes the predicted outcomes of the branches within the trace. Finally,
note that we still need an i-cache in a system with a trace cache. We always prefer reading instructions from
the trace cache; however, if we do not find an entry, we need to access the conventional i-cache.
The data array is a regular k-way set associative cache. Let us assume that it is 4-way set associative.
Instead of defining traces at the granularity of instructions, let us define a trace
as a sequence of cache lines. For example, it is possible that a trace may contain 5 cache lines. Each such
line is known as a trace segment. We can have three kinds of segments: head, body, and tail. Every trace is
organised as a linked list as shown in Figure 7.48. It starts with the head segment, and ends with the tail
segment.
Next, we need to store the trace in the trace cache. We start with the head of the trace. We create a tag
array entry with the head of the trace. The rest of the segments in the trace are organised as a linked list
(see Figure 7.48). Each segment is stored in a separate data line, and has a dedicated entry in the tag array.
The standard way of representing a linked list is by storing a pointer to the next node within the current
node. However, this is not space efficient. If a data block is 64 bytes, and a pointer is 64 bits (8 bytes), then
the space overhead is equal to 12.5%. This is significant. Hence, let us restrict the way a trace is stored.
Let us store a trace in contiguous cache sets. For example, if a trace has 5 segments, we can store the
segments in sets s, s + 1, . . ., s + 4, where s is the index of the set that stores the trace head. Consider a
4-way set associative cache. Each trace segment can be stored in any of the 4 ways of a set. Given that we
have stored a trace segment in a given set, we know that the next trace segment is stored in the next set.
The only information that we need to store is the index of the way in that set. This requires just 2 bits, and
thus the additional storage overhead is minimal. Figure 7.49 shows how we store multiple traces in the data
array. In this figure, each column is a way in a set. A trace cache can be visualised as a packet of noodles,
where each individual strand represents a trace.
Figure 7.49: Two traces (Trace 1 and Trace 2) stored in the data array; each trace consists of a Head segment, Body segments, and a Tail segment placed in consecutive sets
Traces cannot be arbitrarily long. Their length is limited by the number of sets in the cache. However,
there are a few additional conditions that govern the length of a trace.
Let us first look at the conditions for terminating the creation of a trace segment; we shall then move on
to the rules for terminating the creation of a trace. Let us henceforth refer to a microinstruction as a µOP
(micro-op).
1. If we encounter a complex CISC instruction that translates to more µOPs than a single data
line can store, then we store as many µOPs as fit in the data line, and then terminate the
trace segment. The remaining µOPs need to be generated by the decode unit by reading the microcode
memory. The microcode memory contains all the microinstructions for complex CISC instructions.
2. We allow a limited number of branch instructions per trace segment. If we encounter more than that,
we terminate the trace segment. This is to avoid structural hazards.
3. We never distribute the µOPs of a CISC instruction across trace segments. We terminate the segment
if we do not have enough space to store all the µOPs of the next CISC instruction.
Let us now look at the criteria to terminate the process of trace creation.
1. For an indirect branch, a call, or a return statement, the branch's target may be stored in a register.
Since the target is not based on a fixed PC-relative offset, the next CISC instruction tends to change
for every trace. Consider a function return statement. Depending on the caller function, we may return
to a possibly different address in each invocation. This is hard to capture in a trace. Hence, it is better
to terminate a trace after we encounter such instructions.
2. If we receive a branch misprediction or an interrupt alert, then we terminate the trace. This is
because the subsequent instructions will be discarded from the pipeline and thus will not be checked
for correctness.
3. The length of every trace is limited by the number of sets in the cache, and this is thus a hard limit
on the length of the trace.
Tag Array
Let us now look at the trace cache in greater detail. Each entry in the tag array contains the following fields:
address tag, valid bit, type (head or body or tail), next way, previous way, NLIP (next line’s instruction
pointer), and µIP. Let us describe them in sequence (also see Figure 7.50).
Figure 7.50: Entry in the tag array of the trace cache (fields: tag, valid bit, type, next way, previous way, NLIP, µIP)
When we are performing a lookup in the tag array to locate the head of the trace, we send the address
to the tag array. This is similar to a regular cache access, where the tag portion of the address needs to be
compared with the tag stored in the entry. Hence, the first field that we store is the tag. Subsequently, we
have a customary valid bit that indicates the validity of the entry.
For each data line that stores one trace segment, we need to store whether it is a trace head, body or
tail. This requires 2 bits (type field). Since we store consecutive trace segments in consecutive sets, the only
information that we need to store is the id of the next and previous ways such that we can create a doubly
linked list comprising trace segments. Previous pointers are required to delete the trace at a later point of
time starting from a body or a tail segment. The next field is NLIP, which stores the address of the next
CISC instruction; it is only required for the tail segment. The last field, µIP, is used to read the
microinstructions of a complex CISC instruction. We use it to index a table of microinstructions known as
the microcode memory.
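A possible C rendering of one tag array entry is shown below; the field widths and names are illustrative, and the actual design in the patent may differ.

#include <stdint.h>

enum segment_type { SEG_HEAD, SEG_BODY, SEG_TAIL };

/* One entry of the trace cache tag array (illustrative field sizes). */
struct trace_tag_entry {
    uint64_t tag;        /* tag portion of the address of the first instruction */
    uint8_t  valid;      /* valid bit                                           */
    uint8_t  type;       /* head, body, or tail (2 bits in hardware)            */
    uint8_t  next_way;   /* way of the next segment in the next set (2 bits)    */
    uint8_t  prev_way;   /* way of the previous segment in the previous set     */
    uint64_t nlip;       /* next line's instruction pointer (used by the tail)  */
    uint64_t uip;        /* index into the microcode memory (µIP)               */
};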
Data Array
Each line in the data array can store up to 6 microinstructions (µOPs). We have a valid bit for
each µOP. To skip the decode stage, we store the µOPs in a decoded format. In addition, each µOP also
stores its branch target such that it does not have to be computed; this saves us an addition operation
for every PC-relative branch.
7.5.2 Operation
To fetch a trace, the trace cache runs the state machine shown in Figure 7.51. The state machine starts
in the Head lookup state. Given the program counter of an instruction, we search for it in the trace cache.
Assume that we find an entry. We then transition to the Body lookup state, where we keep reading all
the trace segments in the body of the trace and supplying them to the pipeline. Once we reach the tail, we
transition to the Tail state. In the Body lookup state, if there is a branch misprediction, or we receive an
interrupt, we abort reading the trace, and move to the Head lookup state to start anew. Furthermore, if at
any point, we encounter a complex macroinstruction, we read all its constituent microinstructions from a
dedicated microcode memory (Read microinstructions state), and then continue executing the trace. If at
any point, we do not find a trace segment in the trace cache, we transition to the body miss state, which
means that our trace has snapped in the middle. This is because while building another trace we evicted a
data block in the current trace. Whenever we terminate the execution of a trace (because of an unanticipated
event such as an external interrupt, or because we reached the Tail state or the Body miss state), we start
from the Head lookup state once again.
Figure 7.51: FSM (finite state machine) used for reading a trace (states: Head lookup, Body lookup, Body miss, Tail, and Read microinstructions)
Let us now look at the process of creating a trace. The flowchart is shown in Figure 7.52. We trigger
such an operation when we find that a fetched instruction is not a part of any trace. We treat it as the
head of a trace, and try to build a trace. The first step is to issue a fetch request to the i-cache. Then the
state changes to the Wait for µOPs state, where we wait for the decoder to produce a list of decoded µOPs.
Once the instruction is decoded, the µOPs are sent to the fill buffer. Then we transition to the Bypass µOPs
state, where we send the µOPs to the rest of the pipeline. This continues till we encounter a trace segment
terminating condition. There are two common cases that can disrupt the flow of events.
The first is that we encounter a complex instruction, where we need a list of microinstructions from
microcode memory. In this case, we transition to the Read microinstructions state. The second is a normal
trace segment terminating condition. In this case, we move to the Transfer state where the data line created
in the fill buffer is transferred to the tag and data arrays. Then we transition back to the Wait for µOPs
state, if we have not reached the end of the trace. However, if we have encountered a condition to end the
trace, then we mark the data line as the tail of the trace, and finish the process of creating the trace.
We subsequently fetch the next instruction, and check if it is the head of a trace. If it is not the head of
any trace, then we start building a new trace.
Figure 7.52: Flowchart used for creating a trace (states: Head lookup, Fetch from i-cache, Wait for µOPs, Bypass µOPs, Read microinstructions, and Transfer)
Most prefetchers actually base their decisions on the miss sequence and not the access sequence. This
means that a prefetcher for the L1 cache takes a look at the L1 misses, but not at the L1 accesses. This is
because we are primarily concerned with L1 misses, and there is no point in considering accesses for blocks
that are already there in the cache. It is not power efficient, and this information is not particularly useful.
Hence, we shall assume from now on that all our prefetchers consider the miss sequence only while computing
their prefetching decisions. Furthermore, we shall only consider block addresses while discussing prefetchers
– the bits that specify the addresses of words in a block are not important while considering cache misses.
Important Point 14
Prefetchers operate on the miss sequence of a cache, and not on the access sequence. This means that
if a prefetcher is associated with a cache, it only takes a look at the misses that are happening in the
cache; it does not look at all the accesses. This is because most cache accesses are typically hits, and
thus there is no need to consider them for the sake of prefetching data/instructions into the cache. We
only need to prefetch data/instructions for those block addresses that may record misses in the cache.
Additionally, operating on the access sequence will consume a lot of power.
Hence, in the case of a next line or next block prefetcher, we look at misses. If we record a miss for block
address X, we predict a subsequent miss for block address X + 1 – we thus prefetch it.
This is a simple approach for instruction prefetching and often works well in practice. Even if we have
branches, this method can still work. Note that most branches such as if-else statements, or the branches
in for loops have targets that are nearby (in terms of memory addresses). Hence, fetching additional cache
blocks, as we are doing in this scheme, is helpful.
There can be an issue with fetching blocks too late. This can be fixed by prefetching the block with
address X + k, when we record a miss for block address X. If we record a miss for a new block every n
cycles, and the latency to access the lower level memory is L cycles, k should be equal to L/n. This means
that the block X + k will arrive just before we need to access it.
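For example, if L = 200 cycles and a new miss occurs roughly every n = 20 cycles, then k = 10. A toy C sketch of such a prefetcher is shown below; the function names are placeholders, not a real API.

#include <stdint.h>

/* Placeholder: in hardware this would enqueue a prefetch request for the
   given block address to the lower level of the memory hierarchy. */
static void issue_prefetch(uint64_t block_addr) { (void) block_addr; }

/* Hypothetical hook called on every miss for block address X. */
void on_miss(uint64_t X, uint64_t L_latency, uint64_t n_miss_interval) {
    uint64_t k = L_latency / n_miss_interval;   /* prefetch distance k = L / n           */
    issue_prefetch(X + k);                      /* block X + k should arrive just in time */
}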
We associate a saturating counter with each column of the miss table; this counter indicates the miss count. To ensure that
the information stays fresh, we periodically decrement these counters. For example, the row for block X may
contain two entries, (Y, 4) and (Z, 6), where the second number in each pair is the miss count.
Now, when we have a cache miss, we look up the address in this table. Assume we suffer a miss for block
X. In the table we find two entries: one each for blocks Y and Z respectively. There are several choices.
The most trivial option is to issue prefetch requests for both Y and Z. However, this is not power efficient
and might increase the pressure on the lower level cache. Sometimes it might be necessary to adopt a better
solution to conserve bandwidth. We can compare the miss counts of Y and Z and choose the block that has
a higher miss count.
After issuing the prefetch request, we continue with normal operation. Assume that for some reason the
prefetch request suffers from an error. This could be because the virtual memory region corresponding to
the address in the request has not been allocated. Then we will get an “illegal address” error. Such errors
for prefetch requests can be ignored. Now, for the next cache miss, we need to record its block address. If
the miss happened for block Y or block Z, we increment the corresponding count in the table. For a new
block we have several options.
The most intuitive option is to find the entry with the lowest miss count in the table among the blocks in
the row corresponding to the cache miss, and replace that entry with an entry for the new block. However,
this can prove to be a bad choice, particularly, if we are disturbing a stable pattern. In such cases, we can
replace the entry probabilistically. This provides some hysteresis to entries that are already there in the
table; however, it also allows a new entry to come into the miss table with a finite probability. The choice
of the probability depends on the nature of the target workload and the architecture.
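A simplified C sketch of such a miss table is given below; the table geometry, the counter width, and the replacement probability are assumptions chosen only for illustration.

#include <stdint.h>
#include <stdlib.h>

#define ROWS         256    /* rows indexed by a hash of the miss address   */
#define COLS         2      /* candidate next-miss blocks stored per row    */
#define CTR_MAX      15     /* 4-bit saturating miss counters               */
#define REPLACE_PROB 0.25   /* probability of displacing an existing entry  */

struct miss_entry { uint64_t block; uint8_t count; };
static struct miss_entry miss_table[ROWS][COLS];

/* Record that a miss for 'next_miss' followed a miss for 'prev_miss'. */
static void record_next_miss(uint64_t prev_miss, uint64_t next_miss) {
    struct miss_entry *row = miss_table[prev_miss % ROWS];
    int victim = 0;
    for (int i = 0; i < COLS; i++) {
        if (row[i].block == next_miss && row[i].count > 0) {
            if (row[i].count < CTR_MAX) row[i].count++;   /* strengthen the pattern */
            return;
        }
        if (row[i].count < row[victim].count) victim = i;
    }
    /* New block: use an empty slot if available; otherwise displace the weakest
       entry only with some probability, so that stable patterns survive. */
    if (row[victim].count == 0 || (double) rand() / RAND_MAX < REPLACE_PROB) {
        row[victim].block = next_miss;
        row[victim].count = 1;
    }
}

/* On a miss for block X, return the candidate with the higher miss count. */
static uint64_t predict_next_miss(uint64_t X) {
    struct miss_entry *row = miss_table[X % ROWS];
    return (row[0].count >= row[1].count) ? row[0].block : row[1].block;
}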
Software Approach
We first start out by creating a call graph of a program. A call graph is created as follows. We run the
program in profiling mode, which is defined as a test run of the program before the actual run, where we collect
important statistics regarding the program’s execution. These statistics are known as the program’s profile,
and this process is known as profiling. In the profiling phase, we create a graph (defined in Section 2.3.2)
in which each node represents a function. If function A calls function B, then we add an edge between the
nodes representing A and B respectively. Note that it is possible for function A to call different functions across
its invocations. One option is to only consider the first invocation of function A; in this case, we do not
collect any data for subsequent invocations. Based on this information, we can create a graph of function
calls, which is referred to as the call graph. From the call graph, we can create a list of ⟨caller, callee⟩
function pairs, and write them to a file. Note that if function A calls function B, then A is the caller and B
is the callee.
Let us explain this process with an example. Consider the set of function invocations shown in Fig-
ure 7.54(a). The associated call graph is shown in Figure 7.54(b). In addition, we label the edges based on
the order in which the parent function invokes the child functions. For example, in Figure 7.54(a), foo2 is
called after foo1, and thus we have labelled the edges to indicate this fact.
(a) Code:

void foo () {
    foo1 () ;
    ...
    foo2 () ;
    ...
    foo3 () ;
    ...
    foo4 () ;
}

(b) Call graph: foo has edges to foo1, foo2, foo3, and foo4, labelled 1, 2, 3, and 4 respectively (the order of invocation).

Figure 7.54: Example of a call graph
Subsequently, we can use a binary instrumentation engine, or even a compiler to generate code that
prefetches instructions. The algorithm to insert prefetch statements is as follows. Assume function A calls
function B and then function C. We insert prefetch code for function B at the beginning of function A.
We assume that it will take some time to set up the arguments for B before we invoke the function.
During this time, the memory system can fetch the instructions of function B.
After the call instruction that calls function B, we insert prefetch code for function C. Again the logic
is the same. We need some time to prepare the arguments for function C after B returns. During this time,
the memory system can in parallel prefetch the instructions for C. If after C, we would have invoked another
function D, then we would have continued the same process.
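Schematically, the transformed code might look as follows; prefetch_code() is a hypothetical intrinsic that prefetches the instruction cache blocks starting at a function's first address, and B and C are assumed to be defined elsewhere.

void B(void);
void C(void);

/* Hypothetical intrinsic: prefetch the i-cache blocks starting at fn's first instruction. */
static void prefetch_code(void (*fn)(void)) { (void) fn; }

void A(void) {
    prefetch_code(B);   /* fetch B's instructions while its arguments are being set up */
    /* ... set up arguments for B ... */
    B();

    prefetch_code(C);   /* fetch C's instructions while its arguments are being set up */
    /* ... set up arguments for C ... */
    C();
}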
The software approach is effective and generic. However, it necessitates a profiling run, which is an
additional overhead. Moreover, the inputs need not remain the same for every run of the
program; whenever the input changes substantially, we need to repeat the profiling process.
In addition, for large programs, we can end up generating large files to store these profiles, which represents a
significant storage overhead. Finally, the profiling run need not be representative. A function might be called
many times and its behaviour might vary significantly, and all of these might not be effectively captured in
the profile. Hence, whenever we have the luxury, a hardware based approach is preferable.
Figure 7.55 shows the structure of a call graph history cache (CGHC). It consists of a tag array and a data array
similar to a normal cache. The tag array stores the tags corresponding to the PCs of the first (starting)
instructions of the functions. Along with each entry in the tag array, we store an integer called an index,
which is initialised to 1. The data array can contain up to N entries, where each entry is the starting PC of
a function that can be invoked by the current function.
Assume function A invokes the functions B, C, and D, in sequence. We store the starting PCs of B, C,
and D, respectively, in the data array row. Note that we store the starting PCs of functions in the same
order in which they are invoked. This order is also captured by the index. If the index is k, then it means
that we are referring to the kth function in the data array row.
Similar to software prefetching, whenever we invoke function A, we start prefetching the instructions of
function B. When B returns, we prefetch the instructions of function C, and so on. We use the index field
in the tag array for this purpose. Initially, the index is 1, hence, we prefetch the first function in the data
array row. On every subsequent return, we increment the index field, and this is how we identify the next
function to prefetch: if the index is i, then we prefetch the instructions for the ith function in the row. When
A returns, we reset its index field to 1.
Whenever A invokes a new function, we have two cases. If the function is invoked after the last function
in the data array row, then we need to just create a new entry in the row, and store the address of the first
instruction of the newly invoked function. However, if we invoke a new function that needs to be in the
middle of the row, then there is a problem: it means that the control flow path has changed. We have two
options. The first is that we adopt a more complicated structure for the data array row. In the first few
bytes, we can store a mapping between the function’s index and its position in the data array row. This
table can be updated at run time such that it accurately tracks the control flow. The other option is to
discard the rest of the entries in the data array row, and then insert the new entry. The former is a better
approach because it allows changes to propagate faster; however, the latter is less complex.
In addition, we can borrow ideas from the notion of saturating counters to keep track of the
functions that are a part of the call sequence in the current phase of the program. Whenever A calls B, we
can increment the saturating counter for B in A’s row of the CGHC. This indicates that the entry is still
fresh. Periodically, we can decrement the counters to indicate the fact that the information that we have
stored is ageing. Once a counter becomes zero, we can remove the entry from the row.
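To summarise the structure, here is a C sketch of one CGHC row; the sizes and field names are assumptions, and a real implementation would store partial tags rather than full PCs.

#include <stdint.h>

#define MAX_CALLEES 8    /* assumed maximum number of callees per row (N) */

/* One row of the call graph history cache (CGHC); sizes are illustrative. */
struct cghc_row {
    uint64_t caller_pc;               /* starting PC of the caller (used as the tag)     */
    uint8_t  index;                   /* which callee to prefetch next; starts at 1      */
    uint64_t callee_pc[MAX_CALLEES];  /* starting PCs of the callees, in call order      */
    uint8_t  fresh[MAX_CALLEES];      /* saturating counters used to age out old entries */
};

/* Hypothetical hook: prefetch the instructions that start at this PC. */
static void prefetch_code_at(uint64_t pc) { (void) pc; }

/* Called when the caller is invoked, and again on every return back into it:
   prefetch the callee indicated by 'index' and advance the index.
   When the caller itself returns, 'index' is reset to 1 (not shown). */
static void cghc_prefetch_next(struct cghc_row *row) {
    if (row->index >= 1 && row->index <= MAX_CALLEES) {
        prefetch_code_at(row->callee_pc[row->index - 1]);
        row->index++;
    }
}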
Here, A is an array. Depending on the data type of the array, the memory address of the element A[i] is
calculated for the statement sum += A[i];. If A is an array of integers, then in each iteration the address
increases by 4; if it is an array of double precision numbers, it increases by 8. Let us now look at the
assembly code for the statement sum += A[i];.
Consider the load instruction (Line 5). Every time it is invoked, it will have a different address. Let
us define the difference in the memory addresses between consecutive calls to a memory instruction (load or
store) as the stride. In this case, the stride depends on the data type stored in array A. Alternatively, it is
possible that the enclosing for loop does not visit array locations consecutively; for instance, it may only
traverse the even indices of the array. The value of the stride will double in this case.
Definition 48
Let us define the difference in the memory addresses between consecutive calls to a memory instruction
(load or store) as the stride.
We thus observe that the value of the stride is dependent on two factors: the data type and the array
access pattern. However, the only thing that we require is to know if a given stride is relatively stable and
predictable. This means that the stride, irrespective of its value, should not change often. Otherwise we
will not be able to predict the addresses of future memory accesses. Let us design a stride predictor, and a
stride based prefetcher.
Stride Predictor
Let us design a stride predictor along the lines of the predictors that we have seen up till now. We
create an array indexed by the least significant bits of the PC. In each row, we can optionally have a tag for
increasing the accuracy. In addition, we have the following fields (also see Figure 7.56): last address, stride,
confidence bits.
The last address field stores the value of the memory address that was computed by the last invocation
of the memory instruction. The stride field stores the current stride, and the confidence bits (implemented
using a saturating counter) show the confidence that we have in the value of the stride that is stored in the
entry.
Figure 7.56: Entry in the stride predictor (fields: tag, last address, stride, and confidence bits)
Let us now discuss the logic. Whenever we record a miss in the cache, we access the stride predictor.
First, we subtract the last address from the current address to compute the current stride. If this is equal
to the value of the stride stored in the table, then we increment the saturating counter that represents the
confidence bits. If the strides do not match, then we decrement the confidence bits. The confidence bits
provide a degree of hysteresis to the stride. Even if there is one irregular stride, we still maintain the old
value till we make a sufficient number of such observations. At the end, we set the last address field to the
memory address of the current instruction.
As long as the stride is being observed to be the same, this is good news. We keep incrementing the
confidence bits till they saturate. However, if the value of the stride changes, then ultimately the saturating
counter will reach 0, and once it does so, we replace the value of the stride stored in the entry with the
current stride. In other words, our predictor can dynamically learn the current stride, and adapt to changes
over time.
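The update logic can be summarised with the following C sketch; the table size and the 2-bit confidence counter are illustrative assumptions.

#include <stdint.h>

#define SP_ENTRIES 512
#define CONF_MAX   3      /* 2-bit saturating confidence counter */

struct stride_entry {
    uint64_t last_addr;   /* address computed by the last invocation */
    int64_t  stride;      /* currently learnt stride                 */
    uint8_t  conf;        /* confidence bits                         */
};
static struct stride_entry sp[SP_ENTRIES];

void stride_update(uint64_t pc, uint64_t addr) {
    struct stride_entry *e = &sp[(pc >> 2) % SP_ENTRIES];
    int64_t cur_stride = (int64_t)(addr - e->last_addr);

    if (cur_stride == e->stride) {
        if (e->conf < CONF_MAX) e->conf++;      /* same stride: gain confidence               */
    } else {
        if (e->conf > 0) e->conf--;             /* mismatch: lose confidence                  */
        else e->stride = cur_stride;            /* confidence exhausted: learn the new stride */
    }
    e->last_addr = addr;                        /* always remember the latest address         */
}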
Process of Prefetching
Now, the process of prefetching is as follows. Whenever we record a miss for a given block, we access its
corresponding entry in the stride predictor. If the value of the confidence is high (saturating counter above
a certain threshold), we decide to prefetch. If the value of the stride is S and the current address is A, we
issue a prefetch for the address A′ = A + κS. Note that κ in this case is a constant whose value can
either be set dynamically or at design time. The basic insight behind this parameter is that the prefetched
data should arrive just before it is actually required: neither too soon nor too late. If we do not
expect a huge variance in the nature of the workloads, then κ can be set at design time.
However, if we have a lot of variation, and we are not sure where the prefetched values are coming from,
then we are not in a position to predict how long it will take to prefetch the values. If they are coming from
the immediately lower level of memory, they will quickly arrive; however, if the values are coming from main
memory, then they will take hundreds of cycles. This can be dynamically estimated by maintaining a set of
counters. For example, we can have a counter that starts counting after a prefetch request for block A′ is
sent to the memory system. The counter increments every cycle. Once the data arrives, we note the value
of the counter. Let the count be Tprefetch – this gives us an estimate of the time it takes to prefetch a block.
We can have another counter to keep track of the duration between prefetching the block A′ and accessing
it. This counter starts when a prefetch request is sent to the memory system. The counter stops when
subsequently the first memory request for a word in the block A′ is sent to the memory system. Let this
duration be referred to as Taccess.
Ideally, Tprefetch should be equal to Taccess. If Tprefetch < Taccess, then it means that the data has
arrived too soon; we could have prefetched later, which can be done by decreasing κ. Similarly,
Tprefetch > Taccess means that the data arrived too late; in this case, we need to increase κ. We can
learn the relationship between κ and Taccess by dynamically changing the value of κ for different
blocks and measuring the corresponding values of Taccess. This approximate relationship can be used to tune
κ accordingly.
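A minimal sketch of this feedback loop is shown below; the step size of 1 and the initial value of κ are assumptions.

#include <stdint.h>

static int64_t kappa = 4;   /* prefetch distance multiplier (illustrative initial value) */

/* Called when we have measured, for some prefetched block, how long the prefetch
   took (t_prefetch) and how long it was until the demand access arrived (t_access). */
void tune_kappa(uint64_t t_prefetch, uint64_t t_access) {
    if (t_prefetch < t_access && kappa > 1)
        kappa--;            /* data arrived too soon: prefetch later (smaller kappa)  */
    else if (t_prefetch > t_access)
        kappa++;            /* data arrived too late: prefetch earlier (larger kappa) */
}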
As of today, stride based prediction is the norm in almost all high-end processors. This is a very simple
prefetching technique, and is very useful in codes that use arrays and matrices.
typedef struct node_t {
    int val ;
    struct node_t * next ;
} node ;

void foo () {
    ...
    /* traverse the linked list */
    node * temp = start_node ;
    while ( temp != NULL ) {
        process ( temp ) ;
        temp = temp -> next ;
    }
    ...
}
In this piece of code, we define a linked list node (struct node_t). To traverse the linked list, we keep
reading the next pointer, which gives us the address of the next node. The
addresses of subsequent nodes need not be arranged contiguously in memory, and thus standard prefetching
algorithms do not work. A simple way of solving this problem is to insert a prefetch instruction in software
in the linked list traversal code.
...
/* traverse the linked list */
node * temp = start_node ;
while ( temp != NULL ) {
    prefetch ( temp -> next ) ; /* prefetch the next node */
    process ( temp ) ;
    temp = temp -> next ;
}
...
We add a set of prefetch instructions for fetching the next node in the linked list before we process the
current node. If the code to process the current node is large enough, it gives us enough time to prefetch the
next node, and this will reduce the time we need to stall for data to come from memory. Such a prefetching
strategy is known as pointer chasing. We are literally chasing the next pointer and trying to prefetch it.
We can extend this scheme by traversing a few more nodes in the linked list and prefetching them. Since
prefetch instructions do not lead to exceptions, there is no possibility of having null pointer exceptions or
illegal memory access issues in such code.
The compiler and memory allocator can definitely help in this regard. If on a best-effort basis, the
memory allocator tries to allocate new nodes in the linked list in contiguous cache lines, then traditional
prefetchers will still work. Of course the situation can get very complicated if we have many insertions or
deletions in the linked list. However, parts of the linked list that are untouched will still maintain a fair
amount of spatial locality.
Keeping this in mind, let us try to do something useful when the processor is stalled. There are many
ideas in this space; we shall discuss some of the major proposals, which are collectively known
as pre-execution techniques.
Runahead Execution
Let us discuss one of the earliest ideas in this space known as runahead execution [Mutlu et al., 2003]. In
this case, whenever we have a high-latency L2 miss, we let the processor proceed with a possibly predicted
value of the data. This is known as the runahead mode. In this mode, we do not change the architectural
state. Once the data from the miss comes back, the processor exits the runahead mode, and enters the
normal mode of operation. All the changes made in the runahead mode are discarded. The advantage of the
runahead mode is that we still execute a lot of instructions with correct values, and in particular, we execute
many memory instructions with correct addresses. This effectively prefetches the data for those instructions
from the memory system. When we restart normal execution, we shall find many much-needed blocks in the
caches, and thus the overall performance is expected to increase. Let us elaborate further.
Whenever we have an L2 miss, we enter runahead mode. We take a checkpoint of the architectural register
file and the branch predictors. Similar to setting the poison bit in the delayed selective replay scheme (see
Section 5.2.4), we set the invalid bit for the destination register of the load that missed in the L2 cache. We
subsequently propagate this poison bit to all the instructions in the load’s forward slice. Recall that the
forward slice of an instruction consists of its consumers, its consumers’ consumers and so on. The invalid
bit is propagated via the bypass paths, the LSQ, and the register file. This ensures that all the consumers of
an instruction receive an operand marked as invalid. If any of the sources is invalid, the entire instruction,
including its result, is marked as invalid. An instruction that is not marked invalid is deemed to be valid.
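The propagation rule itself is simple; the following C sketch captures it (the two-source instruction format is, of course, a simplification of real rename and bypass hardware).

#include <stdbool.h>

struct uop {
    bool src_invalid[2];   /* invalid (poison) bits read along with the source operands */
    bool dest_invalid;     /* invalid bit written along with the result                 */
};

/* Runahead mode: an instruction is invalid if any of its sources is invalid.
   The bit travels with the value through the bypass network, LSQ, and register file. */
static inline void propagate_poison(struct uop *u) {
    u->dest_invalid = u->src_invalid[0] || u->src_invalid[1];
}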
We execute instructions in the runahead mode as we execute them in the normal mode, with two differences.
First, we always keep track of the valid/invalid status of instructions. Second, we do not update
the branch predictor when we resolve the direction of a branch that is invalid.
Runahead execution introduces the notion of pseudo-retirement, which means retirement in runahead
mode. Once an instruction reaches the head of the ROB, we inspect it. If it is invalid, we can remove it
immediately, otherwise we wait for it to complete. We never let stores in the runahead mode write their
values to the normal cache. Instead, we keep a small runahead L1 cache, where the stores in runahead mode
write their data. All the loads in the runahead mode first access the runahead cache, and if there is a miss,
they are sent to the normal cache. Furthermore, whenever we evict a line from the runahead cache, we never
write it to the lower level.
Once the value of the load that missed in the L2 cache arrives, we exit the runahead mode. This is
accompanied by flushing the pipeline and cleaning up the runahead cache. We reset all the invalid bits, and
restore the state to the checkpointed state that was collected before we entered the runahead mode.
There are many advantages of this scheme. The first is that we keep track of the forward slice of the
load that has missed in the L2 cache. The entire forward slice is marked as invalid, and we do not allow
instructions in the forward slice to corrupt the state of the branch predictor and other predictors that are
used in a pipeline with aggressive speculation. This ensures that the branch prediction accuracy does not
drop after we resume normal execution. The other advantage is that we use the addresses of valid memory
instructions in runahead mode to fetch data from the memory system. This in effect prefetches data for the
normal mode, which is what we want.
Helper Threads
Let us now look at a different method of doing what runahead execution does for us. This method uses
helper threads. Recall that a thread is defined as a lightweight process. A process can typically create
many threads, where each thread is a subprocess. Two threads share the virtual memory address space, and
can thus communicate using virtual memory. They however have a separate stack and program counter.
Programs that use multiple threads are known as multithreaded programs.
Definition 49
A thread is a lightweight process. A parent process typically creates many threads that are themselves
instances of small running programs. However, in this case, the threads can communicate amongst each
other via their shared virtual address space. Each thread has its dedicated stack, architectural register
state, and program counter.
The basic idea of a helper thread is as follows. We have the original program, which is the parent
process. In parallel, we run a set of threads known as helper threads that can run on other cores of a
multicore processor. Their job is to prefetch data for the parent process. They typically run small programs
that compute the values of memory addresses that will be used in the future. Then they issue prefetch
requests to memory. In this manner, we try to ensure that the data that the parent process will access in
the future is already there in the memory system. Let us elaborate.
First, let us define a backward slice. It is the set of all the instructions that determine the value of the
source operands of an instruction. Consider the following set of instructions.
1 add r1 , r2 , r3
2 add r4 , r1 , r1
3 add r5 , r6 , r7
4 add r8 , r4 , r9
The backward slice of instruction 4 comprises instructions 1 and 2. It does not include instruction 3.
Of course, the backward slice of an instruction can be very large. Nevertheless, if we consider the backward
slice in a small window of instructions, it is limited to a few instructions.
Definition 50
The backward slice of an instruction comprises all those instructions that determine the values of its
source operands. It consists of the producer instruction of each operand, the producers of the operands
of those instructions, and so on.
We can create small subprograms for loads that may most likely miss in the L2 cache, and launch them
as helper threads way before the load gets executed. To figure out which loads have a high likelihood of
missing in the L2 cache, we can use an approach based on profiling, or prediction based on misses in the
past. Each helper thread runs the backward slice of such a load instruction, computes its address, and sends
it to the memory system for prefetching the data.
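As an illustration, consider a delinquent load whose address involves an indirect access such as A[B[i]] (this specific pattern is an assumed example, not taken from the discussion above). The helper thread executes just the backward slice that computes the address and issues a non-faulting prefetch; prefetch() below is a stand-in for such an instruction.

#include <stdint.h>
#include <stddef.h>

/* Stand-in for a non-faulting software prefetch instruction. */
static inline void prefetch(const void *addr) { (void) addr; }

/* Assume the main thread executes a loop containing the delinquent load
   sum += A[ B[i] ]; whose address a stride prefetcher cannot predict.
   The backward slice of that load is: read B[i], then form the address &A[B[i]].
   The helper thread runs only this slice, a few iterations ahead of the main thread. */
void helper_thread(const int32_t *A, const int32_t *B, size_t start, size_t end) {
    for (size_t i = start; i < end; i++) {
        prefetch(&A[B[i]]);   /* compute and prefetch the address the main thread will use */
    }
}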
Summary 6
2. Given that large memories are too slow and too inefficient in terms of power consumption, we need
to create a memory hierarchy.
(a) A typical memory hierarchy has 3-5 levels: L1 caches (i-cache and d-cache), L2 cache, L3
and L4 caches (optional), and the main memory.
(b) The cache hierarchy is typically inclusive. This means that all the blocks in the L1 cache are
also contained in the L2 cache, and so on.
(c) The performance of a cache depends on its size, latency, and replacement policy.
(a) We replace each element in the cache by an equivalent RC circuit. This helps us estimate the
latency and power.
(b) We calculate the time constant of each element using the Elmore delay model.
(c) Subsequently, we divide large arrays into banks, subbanks, mats, and subarrays based on the
objective function: we can either minimise power, minimise latency, or minimise area.
6. Modern caches are pipelined and non-blocking. They have miss status handling registers
(MSHRs) that ensure that secondary misses are not sent to the lower levels of the memory hierarchy.
We record a secondary miss when, at the point of detecting a miss, we find that we
have already sent a request to the lower level for a copy of the block. Such misses are queued in
the MSHR.
7. We can use skewed associative caches, way prediction, loop tiling, and VIPT (virtually indexed,
physically tagged) caches to further increase the performance of a cache.
8. The trace cache stores traces, which are commonly executed sequences of code. We can read decoded
instructions directly from it, and skip the fetch and decode stages altogether.
9. There are three kinds of misses: compulsory, conflict, and capacity. For all these types of misses,
prefetching is helpful.
10. We can prefetch either instructions or data. We learn patterns from the miss sequence of a cache,
and then leverage them to prefetch data or code blocks.
11. For instruction prefetching, next line prefetching is often very effective. However, modern ap-
proaches prefetch at the level of functions or groups of functions. They take the high-level structure
of the code into account.
12. For prefetching data, we studied stride based prefetching for regular memory accesses and pre-
execution based methods for irregular memory accesses. The latter class of techniques is very
important for code that uses linked lists and trees.
Exercises
Ex. 1 — A cache has block size b, associativity k, and size n (in bytes). What is the size of the tag in
bits? Assume a 64-bit memory system.
Ex. 2 — Does pseudo-LRU approximate the LRU replacement scheme all the time?
Ex. 3 — From the point of view of performance, is an i-cache miss more important or is a d-cache miss
more important? Justify your answer.
Ex. 6 — Why is it necessary to read in the entire block before even writing a single byte to it?
* Ex. 7 — Assume the following scenario. An array can store 10 integers. The user deliberately enters
15 integers, and the program (without checking) tries to write them to successive locations in the array. It
will cross the bounds of the array and overwrite other memory locations. It is possible that if the array is
stored on the stack, then the return address (stored on the stack) might get overwritten. Is it possible to
hack a program using this trick? In other words, can we direct the program counter to a region of code that
it should not be executing? How can we stop this attack using virtual memory based techniques?
Ex. 8 — Assume we have an unlimited amount of physical memory, do we still need virtual memory?
* Ex. 9 — Does the load-store queue store physical addresses or virtual addresses? What are the trade-offs?
Explain your answer, and describe how the load-store queue needs to take this choice (physical vs virtual
addresses) into account.
Ex. 10 — In a set associative cache, why do we read the tags of all the lines in a set?
Ex. 11 — What is the key approximation in the Elmore delay model? Why do we need to make this
assumption?
Ex. 12 — Show the design of an MSHR where every load checks all the previous stores. If there is a
match, then it immediately returns with the store value (similar to an LSQ).
Ex. 13 — Does the VIPT scheme place limits on the size of the cache?
Ex. 14 — Why was it necessary to store a trace in consecutive sets? What did we gain by doing this?
** Ex. 15 — Consider an application that makes a lot of system calls. A typical execution is as follows.
The application executes for some time, then it makes a system call. Subsequently, the OS kernel starts to
execute, and then after some time, it switches back to the application. This causes a lot of i-cache misses.
Suggest some optimisations to reduce the i-cache miss rate.
* Ex. 16 — Can you design a piece of hardware that can detect if an OOO processor is traversing a linked
list?
** Ex. 17 — Suggest an efficient hardware mechanism to prefetch a linked list. Justify your answer.
Extend the mechanism to also prefetch binary trees.
Design Questions
Ex. 18 — Understand the working of the CACTI tool. Create a web interface for it.
Ex. 19 — Implement way prediction in an architectural simulator such as the Tejas Simulator.