Advances in Microprocessor Cache Architectures Over The Last 25 Years
Over the last 25 years, the use of caches has advanced significantly in mainstream
microprocessors to address the memory wall challenge. As we transformed
microprocessors from single-core to multicore to manycore, innovations in the
architecture, design, and management of on-die cache hierarchy were critical to
enabling scaling in performance and efficiency. In addition, at the system level, as
input/output (I/O) devices (e.g., networking) and accelerators (domain-specific)
started to interact with general-purpose cores across shared memory,
advancements in caching became important as a way of minimizing data
movement and enabling faster communication. In this article, we cover some of the
major advancements in cache research and development that have improved the
performance and efficiency of microprocessor servers over the last 25 years. We will
reflect upon several techniques including shared and distributed last-level caches
(including data placement and coherence), cache Quality of Service (addressing
interference between workloads), direct cache access (placing I/O data directly into
CPU caches), and extending caching to off-die accelerators (CXL.cache). We will
also outline potential future directions for cache research and development over
the next 25 years.
CACHING CHALLENGES OVER THE LAST 25 YEARS

Over the last 25 years, we have seen significant advances in microprocessors, including substantial improvements in core frequency and performance, multicore, manycore, and, more recently, heterogeneous compute architectures [diverse central processing unit (CPU) cores and tightly coupled accelerators/devices]. These have enabled applications to grow rapidly from single-threaded to multithreaded on client and server platforms, and furthermore to multitenant and service-oriented microservices scenarios in virtualized cloud infrastructure. All these advancements in compute performance and increases in application demands required improvements in access to data (both latency and bandwidth). With a slower pace of advancements in dynamic random access memory (DRAM), architects had to innovate and advance caching techniques to facilitate high-bandwidth, low-latency data access from the core as well as from input/output (I/O) devices and accelerators. Figure 1 illustrates the compute growth and the memory wall challenge and highlights some caching innovations that we will cover in the article. The advancements in caching, as a result, are best described by an illustrative example (see Table 1). About 25 years ago, the Intel Pentium Pro was launched into the market for client and, eventually, server platforms. The Intel Pentium Pro was a single-core processor running at 150–200 MHz and featured an off-die but on-package nonblocking L2 cache (256 KB at introduction) connected to the core using a backside bus to address memory latency and enable concurrent access to cache and memory. Fast forward 25 years to our current generation of server microprocessors, which have tens of cores (each multithreaded and capable of running at well over 3 GHz) with on-die cache capacity of almost 100 MB or more (including both L2 and L3) that is physically distributed across an on-die interconnect.
FIGURE 1. Growing compute versus memory gap necessitates advances in on-die cache hierarchy and novel features.
Table 1 presents the number of cores, frequency, and cache sizes for the latest third-generation Intel Xeon Scalable server processor (formerly Icelake-SP)1 in comparison with the Intel Pentium Pro.

TABLE 1. Intel Pentium Pro (1996) versus Third Gen Intel Xeon Scalable (2021).

                            Intel Pentium Pro (1996)   Third Gen Intel Xeon Scalable (2021)
Cores (C) and Threads (T)   1C, 1T on 0.35 μm          Up to 40C, 2T per core on 10 nm
Core Frequency              150–200 MHz                Up to 3.9 GHz base frequency
L1 Cache (code, data)       8 KB Code, 8 KB Data       32 KB Code, 48 KB Data
L2 Cache (unified)          256 KB to 1 MB             1.25 MB per core on-die
L3 Cache                    None                       Up to 60 MB shared by cores; on-die
                                                       distributed layout w/1.5 MB per core

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries.

In this article, we will describe the last 25 years of on-die caching advancements by discussing the challenges that they addressed as well as the innovations that enabled data access and data movement efficiency within the microprocessor as well as with tightly coupled devices and accelerators. It should be noted that excellent advances in caching occurred in other areas of the platform architecture as well (e.g., the use of DRAM as a cache with the emergence of Intel Optane persistent memory), but we limit our discussion in this article to on-die static random access memory (SRAM) caches in the microprocessor and, later, in devices/accelerators on server platforms. Figure 2 illustrates the on-die caching challenges and advancements that we will cover in more detail in subsequent sections. The first of these challenges arose at the advent of multicore processors, where the growing last-level cache (initially L2 and later L3) could remain private or be shared across a subset or all of the cores. The second challenge emerged as the number of cores on-die was scaled, which made it difficult to have a single monolithic shared last-level cache (L3) due to the latency increase and physical placement considerations. Furthermore, as virtualization and cloud computing emerged, there was also a question of how resources should be shared, as well as performance isolated, across multiple applications or tenant virtual machines (VMs) running simultaneously on the microprocessor and sharing the L3 cache in the cloud infrastructure. These challenges necessitated the physical distribution of L3 slices across cores and introduced additional questions such as 1) how should data be placed across the L3 slices and across the hierarchy? and 2) should the caches enable some partitioning to facilitate resource monitoring and allocation for quality of service? Last but not least, as I/O devices (e.g., networking and storage) started playing a critical role in cloud deployments, improving the data movement between the device and the host platform processing the content became increasingly important to enable faster network speeds. Moreover, with the emergence of domain-specific accelerators [primarily off-die but also integrated graphics processing units (GPUs) on-die], the placement and sharing of data across the CPU, devices, and accelerators needed to be considered carefully.
In this article, we will delve into the architectural solutions that have emerged and been integrated into microprocessors over the years to address the above questions. We will also discuss the design and technology advancements that improved cache density, reliability, yield, and latency. These foundational improvements ensured that we continue to scale and grow our cache subsystems in microprocessors, keeping up with the ever-increasing application demands.

SHARED CACHES IN MULTI/MANY-CORE CPUS

As commercial multicore x86 CPUs started emerging about 15 years ago, the traditional approach was to just have individual cache hierarchies (L1, L2) on each core. However, there was an opportunity to further improve sharing and communication between the cores with a shared last-level cache. Initially, as dual-core CPUs started emerging, the focus was on a shared L2 cache. But soon, the use of a shared L3 as the last-level cache started to become more common with quad-core and larger multicore CPUs in order to share code as well as data amongst parallel applications. The shared L3 in quad-core CPUs was initially designed as a monolithic cache, and as the core count in the die increased, the size and latency of the L3 became a challenging architecture/design tradeoff. With more cores on die, more sophisticated on-die interconnects (rings, mesh) started to be considered for scalability and intercore communication. This also led to the consideration of architecting the shared L3 as multiple distributed cache slices across the interconnect, with each slice potentially collocated with a core but accessible by all cores.

With a distributed shared L3, while it was easier to distribute the cache space and reduce latency on a per-slice basis, there were other considerations that became important: 1) how do we place data across the slices? 2) should the L3 cache be inclusive? and 3) how do we maintain coherence with respect to the lower (L1/L2) levels of cache that are on each core? To avoid hotspots in the interconnect, the placement of data was based on hashing techniques that distributed the data evenly across the L3 slices.
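The idea behind such address hashing can be illustrated with a short sketch. The slice count, field widths, and XOR-fold below are assumptions for illustration only; the hash functions used in actual products are more elaborate and undocumented.

    #include <stdint.h>

    /* Illustrative sketch: map a physical address to one of NUM_SLICES
     * distributed L3 slices so that consecutive and strided lines spread
     * evenly across the interconnect (not a real product hash). */
    #define NUM_SLICES 8u      /* assumed slice count for this example */
    #define LINE_BITS  6u      /* 64-byte cache lines */

    static unsigned l3_slice_for_address(uint64_t paddr)
    {
        uint64_t line = paddr >> LINE_BITS;   /* drop the offset within the line */
        /* XOR-fold upper address bits so regular strides do not repeatedly
         * target the same slice. */
        uint64_t hash = line ^ (line >> 13) ^ (line >> 27);
        return (unsigned)(hash % NUM_SLICES);
    }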
Since the L1/L2 caches were small (<256 KB) initially, the large L3 slices (2–3 MB) could be kept inclusive and thereby keep the protocol complexity low. To maintain coherence, the L3 cache slices started supporting valid bits to keep track of lines in lower levels of cache.

As additional cores continued to be integrated and the interconnect became larger (ring to mesh, for example), the latency to the last-level caches started to increase as well. There were three considerations that emerged with the growing core count and interconnect on-die: 1) the number of integrated memory controllers on-die started to increase to enable higher bandwidth; 2) the bits needed in the last-level cache to track coherence state (of each cache line in the lower level caches) started to increase; and 3) the question of whether to grow the L2 cache, scale the L3 cache, or optimize the data placement to reduce latency. In order to enable flexibility in the cache hierarchy and data placement for locality, enhancing the L3 architecture and policies was desirable. As a result, techniques such as noninclusive caches but inclusive directories (NCID)2 emerged to provide flexibility and efficiency. In the first-generation Intel Xeon Scalable server processors (previously known as Skylake SP), with up to 28 cores per socket, the interconnect became a scalable mesh, the L2 cache per core grew in size (from 256 KB to 1 MB), and the L3 cache changed from inclusive to noninclusive to enable flexible placement of data in L2 and L3. Although a line in L2 did not require a data copy in the (noninclusive) L3, the L3 cache slice supported (inclusive) snoop filters to maintain coherence by tracking the lines in lower levels of the cache.
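Conceptually, each snoop-filter (directory) entry records which cores may hold a given line in their private caches. The sketch below is a simplified illustration with assumed field sizes, not the compressed formats used in real designs.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified sketch of a snoop-filter entry in an L3 slice (up to 64
     * cores assumed; real implementations compress this state). */
    struct snoop_filter_entry {
        uint64_t tag;         /* identifies the cache line being tracked        */
        uint64_t core_valid;  /* bit i set => core i may hold the line in L1/L2 */
        bool     exclusive;   /* line may be held modifiable by a single core   */
    };

    /* On a read by 'core', record that its private caches may now hold the line. */
    static void on_core_read(struct snoop_filter_entry *e, unsigned core)
    {
        e->core_valid |= 1ull << core;
    }

    /* On a write by 'core', every other tracked core must be snooped/invalidated. */
    static uint64_t cores_to_snoop(const struct snoop_filter_entry *e, unsigned core)
    {
        return e->core_valid & ~(1ull << core);
    }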
Last but not least, as the number of memory controllers and the number of L3 cache slices grew on die, the placement of data for locality was also a key consideration. In Skylake, sub-nonuniform memory access (NUMA) clustering was introduced to enable multiple domains within a die, where each domain consisted of a subset of cores, L3 slices, and integrated memory controllers. The sub-NUMA clustering approach enabled data coming from one memory controller to be placed into the L3 slices closest to it, and data from another memory controller into the L3 slices closest to that controller. This reduced data access latency by extending NUMA affinity into the on-die architecture, while still ensuring that no data blocks are replicated across the L3 slices.
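Because each sub-NUMA cluster is exposed to the operating system as a NUMA node, ordinary NUMA-aware allocation keeps a thread and its data within one cluster's cores, L3 slices, and memory controller. A minimal sketch using libnuma follows; the node number and buffer size are assumptions for illustration.

    #include <numa.h>       /* link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA support not available\n");
            return 1;
        }

        /* With sub-NUMA clustering enabled, each cluster appears as its own
         * NUMA node. Node 0 is an example; real code would query topology. */
        int node = 0;
        size_t bytes = 64 * 1024 * 1024;

        numa_run_on_node(node);                      /* run on that cluster */
        void *buf = numa_alloc_onnode(bytes, node);  /* memory local to it  */
        if (!buf)
            return 1;

        /* ... process data in buf with local cache and memory affinity ... */

        numa_free(buf, bytes);
        return 0;
    }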
Figure 3 shows how the first-generation Intel Xeon Scalable Family server microprocessor (Skylake SP)3 enables all of the above capabilities in the on-die last-level (L3) cache. In addition to the hardware capabilities, it soon also became important to enable software hints to modulate placement in different levels of the cache hierarchy. For example, prefetch instructions pull data closer in the cache hierarchy, and instructions such as CLDEMOTE17 push data further away from the originating core so that it is accessible by another core, which may then find it at the right level in the cache hierarchy, reducing coherency penalties.
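As a sketch of how these hints look to software, the intrinsics below prefetch upcoming lines toward the consuming core and demote freshly written lines toward the shared L3 on the producing core. The buffer handling is illustrative, and CLDEMOTE is only effective on processors and compilers that support it.

    #include <immintrin.h>   /* _mm_prefetch, _mm_cldemote (build with -mcldemote) */
    #include <stddef.h>
    #include <stdint.h>

    #define LINE 64

    /* Producer: fill a buffer, then demote the written lines toward the shared
     * L3 so a consumer on another core can find them there instead of pulling
     * them out of the producer's private caches. */
    void produce(uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            buf[i] = (uint8_t)i;

        for (size_t off = 0; off < len; off += LINE)
            _mm_cldemote(buf + off);      /* hint: push the line out of L1/L2 */
    }

    /* Consumer: pull the next line closer before it is needed. */
    uint64_t consume(const uint8_t *buf, size_t len)
    {
        uint64_t sum = 0;
        for (size_t off = 0; off < len; off += LINE) {
            if (off + LINE < len)
                _mm_prefetch((const char *)(buf + off + LINE), _MM_HINT_T0);
            for (size_t i = 0; i < LINE && off + i < len; i++)
                sum += buf[off + i];
        }
        return sum;
    }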
Beyond such software control, additional software guidance of cache allocation and monitoring of cache usage is discussed below to enable QoS capabilities that address shared resource contention.

QOS IN SHARED CACHES

As chip multiprocessors grew in the number of cores and deployments became popular in cloud environments, the number of workloads running simultaneously on these processors also grew rapidly. This was fueled by virtualization and multitenant hosting (many VMs) and has since evolved into the use of containers and microservices as well. With many simultaneous and disparate workloads running on the cores, contention for shared resources such as the last-level cache led to performance variability, long tail latencies to complete certain operations, or steady-state performance imbalance. This QoS problem4 was highlighted initially over 15 years ago and led to innovative research in managing shared resource contention.5

An example of the shared resource contention effects is shown in Figure 4, where a high priority workload (BZIP2 running simultaneously with many other low priority workloads) can slow down by as much as 4X due to shared cache contention in certain microarchitectures. To address these concerns in mainstream server microprocessors, the first QoS techniques in hardware were integrated into Intel Xeon E5/E7 v3 and v4 processors,6 enabling last-level cache monitoring (CMT), memory bandwidth monitoring (MBM), and cache allocation controls (CAT). With CMT and MBM, execution environments could dynamically monitor the LLC capacity and memory bandwidth consumed by software threads using resource monitoring IDs (RMIDs). With CAT, execution environments could allocate different amounts of last-level cache space using different classes of service (CLOS) and associated bit vectors to represent cache space needs. Subsequent generations expanded these controls to include memory bandwidth allocation. These QoS capabilities and the architecture made possible by Intel Resource Director Technology (RDT) enabled dynamic monitoring and control of contention effects, independent scaling of architectural parameters (such as RMIDs and CLOS), and a scalable framework.
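On Linux, these capabilities are exposed through the resctrl filesystem, where directories correspond to classes of service and bitmasks select portions of the L3. The sketch below is illustrative: the group name, capacity bitmask, and PID are assumptions, the mask format is platform dependent, and it must run with root privileges.

    #include <stdio.h>
    #include <sys/stat.h>

    /* Write a short string to a resctrl control file. */
    static int write_file(const char *path, const char *text)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int rc = (fputs(text, f) >= 0) ? 0 : -1;
        fclose(f);
        return rc;
    }

    int main(void)
    {
        /* Create a class of service for the high priority workload
         * (assumes resctrl is mounted at /sys/fs/resctrl). */
        mkdir("/sys/fs/resctrl/high_prio", 0755);

        /* Dedicate a portion of the L3 ways on cache domain 0; each bit in
         * the mask corresponds to a slice of L3 capacity. */
        write_file("/sys/fs/resctrl/high_prio/schemata", "L3:0=0xff0\n");

        /* Move the workload's PID (illustrative value) into this class. */
        write_file("/sys/fs/resctrl/high_prio/tasks", "12345\n");

        return 0;
    }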
Resource contention impact can vary considerably depending on the other simultaneously running workloads and the core/cache hierarchy configuration. As shown in Figure 4 (for BZIP2), cache/memory resource contention on Intel Xeon E5 v4 may result in a performance slowdown of up to 4X, whereas the same workload on today's third-generation Intel Xeon Scalable processors suffers up to a 2X slowdown. Employing cache QoS (CAT) through Intel RDT, available in these platforms, enables prioritization and isolation of high priority workloads while they run simultaneously with low priority workloads. As shown in Figure 4, the effect of contention from low priority workloads on the performance of a high priority workload (BZIP2) can be reduced from over 2X (without CAT) down to negligible slowdowns (with CAT) by prioritizing cache space for the high priority application and limiting resource allocation for low priority applications.
TABLE 2. Cache QoS: Example reduction in slowdown, average to max (SPEC CPU2006 BZIP2 versus background apps).
           Intel Xeon E5 v4                 Third Gen Intel Xeon Scalable
           Full Contention    With CAT      Full Contention    With CAT
Average    1.71               1.01          1.24               1.02
Geomean    1.54               1.01          1.21               1.02
Max        4.20               1.17          2.09               1.15
Min        1.00               1.00          1.00               1.00
Table 2 presents the benefits for the BZIP2 example in more detail, indicating that the average, geomean, and maximum performance degradation can be addressed using QoS techniques in shared caches.

CACHE ADVANCEMENTS FOR I/O DEVICES AND ACCELERATORS
Traditionally, I/O devices have interfaced with the host using links such as PCIe. Over the years, these I/O devices have increased in data/processing rates (e.g., networking from <100 Mb/s to 1 Gb/s to 10 Gb/s to hundreds of gigabits per second) and new domain-specific processing units have emerged (deep learning, infrastructure processing, etc.) to work closely with host compute platforms. In the former case (increasing data rates for networking), a significant challenge was that incoming packets were initially placed in memory (typical direct memory access operations from the device) and the cores had to read the data from memory for protocol processing and eventually hand the data off to applications. This overhead significantly limited the amount of network processing that could be done by the host efficiently.
In order to address this issue, researchers proposed direct cache access7 that allows devices to place data directly into processor caches (e.g., the last-level cache) so that the data is closer to the core that processes the packets. The Intel Data Direct I/O (Intel DDIO) feature was introduced in Intel Xeon E5 processors and enables server network interface cards (NICs) to place data directly in processor caches without the detour to memory. This capability improves latency, reduces memory bandwidth consumption, and improves overall packet processing efficiency on the server platform. More recently, the article by Yuan et al.8 has shown that, by carefully managing the amount of cache space allocated to DDIO versus the tenants running on the CPU cores, any performance interference between DDIO placement and core usage can also be managed using RDT/QoS features (CAT, as explained in the previous section). In addition to DDIO, new transactions on the coherent fabric (e.g., RdCur) enable I/O devices to read data from the CPU caches without requiring state changes, thereby improving performance by avoiding snoops to I/O controllers on subsequent reads/writes from cores.

Next, we turn our attention to more sophisticated accelerator devices like deep learning and infrastructure processing. As the performance requirements and usages have evolved, many of these accelerators have incorporated local (device-attached) memory and caches to minimize latency and improve throughput, but they still rely on software for bulk data transfers to copy or flush data at synchronization points. In such scenarios, a high-speed link that provides coherency between host and accelerator has become important to minimize latency, enable new usage models, and simplify the software stack. One of the leading efforts to standardize the interfaces of coherently attached accelerators is driven by the Compute Express Link (CXL) consortium.9 CXL implements three logical protocols (CXL.io, CXL.cache, and CXL.mem). Unlike symmetric coherency protocols that require all agents to implement a home agent, CXL implements an asymmetric CXL.cache protocol that relies on the host home agent to maintain coherency between CPU and accelerator. This greatly simplifies the accelerator design since it does not need to orchestrate coherency and can instead focus on providing the core functionality of the device and a simple set of abstracted commands (reads, writebacks, streaming writes, snoops). In addition, the host is responsible for managing coherency for all device-attached memory that is exposed to the host using CXL.mem.

CXL.cache operates exclusively on physical addresses. Devices rely on existing PCIe address translation services to obtain virtual-to-physical address translations and typically use a device TLB to store the translations. CXL.cache uses 64 B as the coherency granularity and implements a MESI protocol (a cache coherence protocol based on four states: modified, exclusive, shared, and invalid). In addition to using CXL.cache as a low-latency link, coherently attached devices can also improve performance by implementing custom operations like atomics.
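The four MESI states mentioned above can be summarized as a small state machine. The sketch below is a textbook simplification for illustration; it does not represent the actual CXL.cache transaction flows.

    /* Textbook MESI simplification: next state of one cache's copy of a line,
     * given a local access or an observed (snooped) access by another agent. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;
    typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } mesi_event_t;

    static mesi_state_t mesi_next(mesi_state_t s, mesi_event_t e, int others_have_copy)
    {
        switch (e) {
        case LOCAL_READ:
            if (s == INVALID)            /* miss: fill as E if alone, else S */
                return others_have_copy ? SHARED : EXCLUSIVE;
            return s;                    /* hits do not change the state */
        case LOCAL_WRITE:
            return MODIFIED;             /* requires ownership; others are invalidated */
        case SNOOP_READ:
            return (s == INVALID) ? INVALID : SHARED;   /* M/E degrade to S */
        case SNOOP_WRITE:
            return INVALID;              /* another agent takes ownership */
        }
        return s;
    }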
CXL.cache transactions are used to maintain coherency between the CPU and the accelerator, ensuring that the data remains coherent across components and memory. As is the case with the CPU, accelerators use these transactions to obtain a copy of a line, get ownership of a line, and so on. However, to enable an accelerator direct access to device-attached memory at high performance, CXL introduced a bias-based coherency model.
FIGURE 6. (a) Cache array Vmin. (b) Vmin versus density. (c) Multivoltage design.
voltage domains. More complex dynamic multivoltage read and write assist techniques that modulate individual transistor strengths during read and write operations are used to improve Vmin for cache arrays in 22-nm FinFET technology and beyond.10,11

Cache leakage power dominates the total power when operating at low voltage and frequency, as well as under low activity and idle conditions. Therefore, fine-grain sleep transistor techniques are used to manage cache leakage in different power states and sleep modes. When the data need to be retained for faster wake-up, an active clamping circuit is used with the sleep transistor to ensure that the voltage does not fall below the data retention Vmin across process and temperature.12–14
NEXT 25 YEARS IN CACHES

As described above, the last 25 years saw tremendous advances in caching, especially at the L3 cache, enabling distributed but shared caches, QoS awareness, guiding I/O data placement, enabling tighter coupling with accelerator memory and caching, and more. Over the next 25 years, while difficult to predict, there are some major research challenges that will likely be important to address.
1) Larger L4 caches and memory side caches: It is clear that larger on-die cache capacity continues to be of importance as workloads continue to grow in working sets (e.g., machine learning models growing rapidly). As a result, understanding the benefits and viability (cost) of building larger caches (integrated on-die, in cache tiles, or three-dimensional stacked) is likely to be an area of focus. In addition, finding ways to reduce the on-die overhead of tag structures, snoop filters, or directories for larger caches also becomes critical to address. Memory disaggregation (expansion and pooling) at the system level is also an active exploration topic to improve the capacity and dynamic access of larger memory working sets and even better total cost of ownership (TCO) by pooling memory across multiple nodes. Using protocols like CXL.mem, device-attached memory can be part of standard OS-managed system memory, thus enabling disaggregated memory architectures. The remote memory controller receives the memory reads and writes from the CPU over CXL.mem and translates them into memory-technology-specific requests (e.g., requests to DDR memory connected to the controller itself or transactions sent over to network-attached memory). The memory latency addition due to extending the memory hierarchy can potentially be mitigated by implementing memory-side caching schemes that avoid the trip to the memory (locally or across a fabric). This memory-side caching hierarchy may not be visible to the CPU, but it becomes important to understand when and how to cache hot pages in the memory-side cache for effectiveness.

2) New techniques for deeper cache hierarchies: As cache hierarchies become deeper, the typical replacement policies start to become ineffective since recency information gets filtered through each level in the cache hierarchy. Finding better techniques to address this will become important to retain the efficiency of the caches. Deeper multilevel cache hierarchies also create challenges for system software to express locality/affinity domains, manage data persistence, contain errors when a failure occurs, create checkpoints for moving VM contexts, save and restore data while transitioning between power states, and avoid performance interference and variability in the presence of diverse workloads sharing these resources. All of these areas will require continued research focus.

3) Compute in/near cache approaches seem attractive for disrupting the traditional computing paradigm, and research into the right primitives and techniques needed to make them mainstream would be very interesting. An example of compute in cache was recently explored to accelerate deep learning in the paper by Eckert et al.15 Fueled by such work, further research is well underway in the architecture community to find the most efficient approach while providing a balance between specialization and general-purpose use.

4) Managing memory shared by accelerators and cores: The use of scratchpads, buffers, and caches continues to remain independent between cores and accelerators. However, the growing presence of accelerators and the tight coupling of these with CPU cores have the potential to introduce new techniques for managing these caches as hybrid buffer/cache architectures. Furthermore, determining what to cache using hardware techniques versus what to expose for software management will continue to remain important as domain-specific processing (e.g., XPUs for infrastructure processing and machine learning) emerges at larger scale.

5) Going beyond SRAM for on-die caches has been of interest for several years, but the mainstream use of such technologies has remained constrained to specific platforms. In the next 25 years, with the ability to develop modular designs, heterogeneous caches and memories will emerge to enable different data flows and data types to employ the right cache/memory technology for the right task. Methods to identify how to utilize these heterogeneous caches and memories will continue to be explored.

6) AI-based techniques for caches are being explored by researchers to further improve replacement policy decisions as well as other (QoS) placement and isolation decisions in the hierarchy. We expect this to help identify intelligent approaches, especially as workloads become more dynamic, and efficient AI-based management of resources can provide better performance and QoS.

7) Higher level caching techniques at scale: As large-scale computing (i.e., warehouse scale) focuses on the use of microservices and RPCs to develop services and applications, higher level approaches like memoization may emerge for caching in future platforms. Such techniques require careful investigation since the interface for such caching becomes equally important as the identification of which aspects to cache and how to enable such caching (perhaps using AI techniques).