Performance Considerations for Cache Memory Design in a Multi-core Processor ⋆
Divya Ravindran, dxr150630
Ilango Jeyasubramanian, ixj150230
Kavitha Thiagarajan, kxt132230
Susmitha Gogineni, sxg155930
University of Texas at Dallas, Richardson, TX 75030 USA
Abstract: In recent times, multi-core processors have gained importance over traditional
uniprocessors as the performance growth of uniprocessors has saturated. Multi-core processors
make use of multiple cores, and to improve their performance there is a strong need to reduce
memory access time, improve power efficiency, and maintain the coherence of data among the
cores. To address the efficiency of multiple cores, a filter cache is designed with an efficient
Segmented Least Recently Used replacement policy. This technique effectively reduces the energy
consumed by 11%. Finally, to address the coherence of the caches, a modified MOESI-based
snooping protocol for the ring topology was used. This improved the performance of the processor
by increasing the hit rate by 7%.
Keywords: multi-core; filter cache; energy efficient; hit ratio; coherence; LRU; ring-order
1. INTRODUCTION
As the number of transistors on a chip doubles every 18 months following Moore's law,
processor speed has improved at roughly the same rate, but memory latency has not
progressed at the same pace. Because of this difference in growth, memory access time
becomes relatively larger as processor speed improves further. Caches were built to
overcome this memory wall.
A cache is a small, fast memory with a much shorter access time than main memory. These
properties have made the cache essential in providing efficiency to the processor [1].
This project concentrates on how caches can be modified to make a multi-core processor
work efficiently, so that overall speedup is improved, the energy consumed by the
processor is reduced, and its performance improves. The newly implemented cache designs
are analyzed using some of the SPEC2006 benchmarks. In this experiment, the size and
associativity of the caches are fixed for simplicity of analysis, and the instruction
set architecture (ISA) is x86-64.
The first modification was the introduction of a filter cache, a tiny cache assumed to
run at the speed of the core. It holds the most frequently used instructions, and access
to data in the filter cache is very fast, but its hit rate is low [2]. This is improved
by implementing a prediction technique that chooses the memory level to be accessed so
as to reduce misses [3] [4].

⋆ This project paper is edited in the format of International Federation of Automatic
Control conference papers in LaTeX 2ε as part of EEDG 6304 Computer Architecture
coursework.
To gain a further improvement in hit rate, a Segmented Least Recently Used (SLRU) block
replacement policy is implemented and analyzed along with the filter cache. The SLRU
consists of two segments and uses access probability to perform cache block replacement.
The coherence of multi-core processors was then analyzed with the help of various
topologies. The idea was to introduce a modified MOESI-based snooping protocol for the
ring topology, which helps improve the coherence of data in a multi-core processor. This
modification makes use of the round-robin order of the ring to provide fast and stable
performance.
2. FILTER CACHE
2.1 Idea of Filter Cache
The cache is a very important component of a modern processor and can effectively
alleviate the speed gap between the CPU and the off-chip memory system. Multi-core
processors have become the main development trend because of their high performance,
but power dissipation is a major issue given the large number of memory accesses from
multiple cores. Therefore, an energy-efficient cache design is required.
The filter cache is used to improve performance and power efficiency. It is a
small-capacity cache that stores the data and instructions frequently accessed by the
cores. The filter cache acts as the first instruction source, consuming less energy for
the most-used instructions and data. The filter is assumed to have almost the same speed
as the core and to consume less energy than the normal cache. Fig. 1 shows the basic
idea of the filter cache.
Fig. 1. Filter Cache: The basic idea
Fig. 2. Filter Cache: How prediction works
The improvement in performance and energy saving is achieved by accessing the filter
instead of the normal cache. The CPU accesses the filter first, and only on a filter
miss is the normal cache visited.
2.2 Prediction Filter Cache
For any instruction or data, the processor first accesses the filter cache. If the
filter hits, the fetch completes at very low cost, without the extra loss of performance
and energy that a miss would cause. Past studies have shown that the hit ratio is
extremely important for the filter cache [5]. Therefore, to ensure a good hit ratio, a
prediction algorithm is incorporated to improve the hit ratio of the public filter.

In the prediction cache, the CPU accesses either the filter or the normal cache
depending on a prediction signal. The prediction algorithm is designed to eliminate
unnecessary accesses to the filter cache [6]. If the prediction in favor of the filter
fails, the CPU re-fetches the instruction through the normal cache, which causes an
extra loss of performance and energy.
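As a rough illustration of this flow, the sketch below models a prediction-based fetch
with a single last-outcome prediction bit. The names (TinyCache, FilterPredictor) and
the one-bit predictor itself are assumptions for illustration, not the prediction
algorithm of [6].

    #include <cstdint>
    #include <unordered_set>

    // Toy stand-ins for the filter cache and the normal L1 cache: each just
    // records which addresses it currently holds.
    struct TinyCache {
        std::unordered_set<uint64_t> lines;
        bool lookup(uint64_t addr) const { return lines.count(addr) != 0; }
    };

    // Hypothetical single-bit (last-outcome) predictor: after a filter hit,
    // predict "filter" next time; after a filter miss, predict "normal cache".
    struct FilterPredictor {
        bool predictFilter = true;

        // Returns true when the fetch is served by the filter.
        bool fetch(uint64_t addr, const TinyCache &filter, const TinyCache &l1) {
            if (predictFilter) {
                if (filter.lookup(addr))       // correct prediction: cheap hit
                    return true;
                l1.lookup(addr);               // mispredicted: re-fetch via L1,
                predictFilter = false;         // paying extra latency and energy
                return false;
            }
            l1.lookup(addr);                   // predicted normal cache directly
            predictFilter = true;              // retry the filter next access
            return false;
        }
    };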
2.3 Architecture of an Energy-Efficient Multi-core Cache System: Public Filter
Each core has separate level 1 instruction and data caches. A public filter cache unit
is shared by all cores in the system, and all cores also share the level 2 LLC. The
public filter is introduced as the first shared cache for all cores. Fig. 3 shows how
the architecture has been modified to accommodate the filter cache [7].

For each instruction fetch, every core accesses the public filter first. If the public
filter hits, the instruction is returned to the core directly [8]. Otherwise, the next
memory level, the L2 cache, is accessed until the right instruction is returned, and
the public filter is updated with the new cache block containing the missed instruction.
Fig. 3. Filter Cache: Architectural Change made to the
Baseline Cache
Algorithm 1 Algorithm for the proposed cache design
CPU sends the request;
while resolving the public filter for the data do
    visit the public filter;
    if data was hit then
        return the instruction;
    else
        visit the LLC;
        if hit then
            return the instruction and update the filter;
        else
            visit the main memory and update the filter;
        end
    end
end
A dynamic replacement method, Segmented LRU (SLRU), is used to maintain a good hit
ratio, and dynamic memory management methods are used to distribute the hit ratio
equally among all cores.
3. SEGMENTED LRU POLICY
3.1 Existing Segmented LRU
An SLRU cache is divided into two segments, a probationary segment and a protected
segment. Lines in each segment are ordered from the most to the least recently accessed.
Fig. 4 explains how the block is segmented. Data fetched from memory on a miss is added
to the cache at the most recently accessed end of the probationary segment. Cache hits
are removed from wherever they currently reside and added to the most recently accessed
end of the protected segment. Lines in the protected segment have thus been accessed at
least twice, giving each such line another chance to be accessed before being replaced.
Lines to be discarded for replacement are taken from the LRU end of the probationary
segment [9].
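The following minimal, self-contained sketch models this two-segment policy with fixed
segment capacities (the dynamic sizing of Section 3.2 is not modeled). The class name
SlruSet and the demotion-on-overflow behavior of the protected segment are illustrative
assumptions.

    #include <cstddef>
    #include <cstdint>
    #include <list>

    // Minimal SLRU set model. The front of each list is the most recently
    // accessed (MRU) end.
    class SlruSet {
        std::list<uint64_t> probationary, protectedSeg;
        std::size_t probCap, protCap;
    public:
        SlruSet(std::size_t pb, std::size_t pr) : probCap(pb), protCap(pr) {}

        void access(uint64_t tag) {
            // Hit in protected: move to the MRU end of the protected segment.
            for (auto it = protectedSeg.begin(); it != protectedSeg.end(); ++it)
                if (*it == tag) {
                    protectedSeg.splice(protectedSeg.begin(), protectedSeg, it);
                    return;
                }
            // Hit in probationary: promote to the MRU end of protected.
            for (auto it = probationary.begin(); it != probationary.end(); ++it)
                if (*it == tag) {
                    probationary.erase(it);
                    promote(tag);
                    return;
                }
            // Miss: insert at probationary MRU; discard from probationary LRU.
            probationary.push_front(tag);
            if (probationary.size() > probCap) probationary.pop_back();
        }
    private:
        void promote(uint64_t tag) {
            protectedSeg.push_front(tag);
            if (protectedSeg.size() > protCap) {
                // Demote the protected LRU line back to the probationary MRU,
                // giving it one more chance to be accessed before eviction.
                uint64_t demoted = protectedSeg.back();
                protectedSeg.pop_back();
                probationary.push_front(demoted);
                if (probationary.size() > probCap) probationary.pop_back();
            }
        }
    };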
3.2 Dynamic Segmented LRU
Based on our observation with the existing SLRU algo-
rithm, we found that often, they always use a constant
number of protected and probationary ways. The proposed
scheme handles the dynamic sizing of the two segments
based on access probability in each cache line of the set
[10].
The access probability is summed up each time from
the first line and the selection of new cache line for the
insertion of new cache miss data from the memory is
done at the cache line where the summed up probability
is around “0.5”. This helps in dynamically adjusting the
segmentation size by access probability.
3.3 Code Snippet for LRU Changes
void LRU::insertBlock(PacketPtr pkt, BlkType *blk)
{
    BaseSetAssoc::insertBlock(pkt, blk);
    int set = extractSet(pkt->getAddr());
    int Tot = 0;
    // Calculating the total number of accesses in the set
    for (int i = 0; i <= assoc - 1; i++)
    {
        BlkType *b1 = sets[set].blks[i];
        Tot = Tot + b1->refCount; // fixed: re-declaring Tot here shadowed the total
    }
    double add = 0;
    int start = 0;
    if (Tot != 0)
    {
        for (int i = 0; i <= assoc - 1; i++)
        {
            BlkType *b2 = sets[set].blks[i];
            // Calculating the access probability of each line
            // (floating point, since integer division would truncate to 0)
            double prob = (double)b2->refCount / Tot;
            add = add + prob;
            // Selecting the line where the cumulative probability reaches 0.5
            if (add >= 0.5)
            {
                start = i; // fixed: store the line index, not the probability sum
                break;
            }
        }
    }
    // Setting the head of the probationary segment for the new data
    sets[set].moveToHead1(blk, start);
}
3.4 Code Snippet for CacheSet
template <class Blktype>
void
CacheSet<Blktype>::moveToHead1(Blktype *blk, int start)
{
    // nothing to do if block is already head
    if (blks[0] == blk)
        return;

    /* write 'next' block into blks[i],
       moving up from MRU toward LRU
       until we overwrite the block we moved to head,
       setting the head of the probationary segment */
    int i = start;
    Blktype *next = blk;
    do {
        assert(i < assoc);
        // swap blks[i] and next
        Blktype *tmp = blks[i];
        blks[i] = next;
        next = tmp;
        ++i;
    } while (next != blk);
}
Fig. 4. LRU Segmentation: The probationary vs protected
segments
3.5 Dynamic SLRU with Random Promotion and Aging
Traditional implementations of SLRU has shown benefit
by making selected random promotions as well. The ran-
dom promotion in the SLRU algorithm allows to randomly
pick a cache line from the probationary segment and pro-
mote it to the promoted segment. This random promotion
is also added with the Dynamic segmented LRU policy to
see further performance improvements.
In contrast to random promotion, we also made “Cache
line aging mechanism” to bring down aged cache line with
lowest access probability from the protected to probation-
ary segment to see further performance improvements.
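As a rough sketch of these two mechanisms, the fragment below operates on the two
segment lists of the SlruSet sketch above, with a per-line access counter standing in
for the access statistics. The promotion rate and the exact aging rule are illustrative
assumptions.

    #include <cstddef>
    #include <cstdint>
    #include <iterator>
    #include <list>
    #include <random>
    #include <unordered_map>

    // 'probationary' and 'protectedSeg' are the two segment lists (MRU at the
    // front); 'refCount' is a per-line access counter.
    void randomPromoteAndAge(std::list<uint64_t> &probationary,
                             std::list<uint64_t> &protectedSeg,
                             const std::unordered_map<uint64_t, int> &refCount,
                             std::mt19937 &rng)
    {
        // Random promotion: pick one probationary line at random and move it
        // to the MRU end of the protected segment.
        if (!probationary.empty()) {
            std::uniform_int_distribution<std::size_t>
                pick(0, probationary.size() - 1);
            auto it = std::next(probationary.begin(), pick(rng));
            protectedSeg.push_front(*it);
            probationary.erase(it);
        }

        // Aging: demote the protected line with the lowest access count to
        // the MRU end of the probationary segment.
        if (!protectedSeg.empty()) {
            auto coldest = protectedSeg.begin();
            for (auto it = protectedSeg.begin(); it != protectedSeg.end(); ++it) {
                int c  = refCount.count(*it)      ? refCount.at(*it)      : 0;
                int cc = refCount.count(*coldest) ? refCount.at(*coldest) : 0;
                if (c < cc) coldest = it;
            }
            probationary.push_front(*coldest);
            protectedSeg.erase(coldest);
        }
    }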
3.6 Dynamic SLRU With Adaptive Bypassing
Cache bypassing helps avoid invalidating a cache line with high access probability for
just one or two misses. The new data is accessed directly from memory, with no update
to the cache line where the miss occurred. This improves the hit rate by keeping highly
accessed cache lines in the set a little longer.

Initially, our bypass algorithm arbitrarily picks an access probability for implementing
adaptive bypassing [11] [12]. The probability of bypassing is then adapted dynamically
according to how effective past bypass decisions have been, as measured by the hit rate.
Each effective bypass doubles the probability that a future bypass will occur; for
example, if the current probability is 0.25, it doubles to 0.5. Similarly, each
ineffective bypass halves the probability of a future bypass, for example cutting a
current probability of 0.5 to 0.25. To turn off adaptive bypassing, the bypass
probability is set to 0, which prevents any bypassing and allocates all missed lines.
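A minimal sketch of this feedback rule is shown below, assuming an upper cap of 1.0 on
the probability (the paper does not state one); the class name and initial value are
illustrative.

    #include <algorithm>
    #include <random>

    // Minimal model of the doubling/halving bypass controller described above.
    class AdaptiveBypass {
        double prob;
        std::mt19937 rng{12345};
    public:
        explicit AdaptiveBypass(double initial = 0.25) : prob(initial) {}

        // Decide whether to bypass this miss; prob == 0 disables bypassing.
        bool shouldBypass() {
            std::bernoulli_distribution coin(prob);
            return prob > 0.0 && coin(rng);
        }

        // Feedback from the hit-rate measurement: double on an effective
        // bypass (capped at 1.0), halve on an ineffective one.
        void feedback(bool effective) {
            prob = effective ? std::min(1.0, prob * 2.0) : prob / 2.0;
        }
    };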
3.7 Miss Status Holding Register (MSHR)
Adaptive bypassing is implemented with a Miss Status Holding Register (MSHR), which
stores cache miss information without invalidating the corresponding cache line. This in
turn improves the hit rate by supplying cache hits even under a miss.
When the data becomes available from memory, the pending miss is resolved with the new
data inside the cache line. However, adaptive bypassing cannot proceed if the MSHR
becomes full; stalls are required until pending misses are resolved and enough space is
freed to store the new pending miss and continue the adaptive bypassing mechanism.
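The fragment below sketches just the stall condition described here: a fixed-capacity
MSHR table keyed by block address, where a full table forces the caller to stall until
an outstanding miss resolves. The capacity and interface are assumptions for
illustration.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // Toy MSHR model: tracks outstanding misses per block address without
    // touching the cached line itself.
    class Mshr {
        std::unordered_map<uint64_t, int> pending;  // block addr -> waiters
        std::size_t capacity;
    public:
        explicit Mshr(std::size_t cap = 8) : capacity(cap) {}

        // Returns false (caller must stall) when the table is full and the
        // address is not already being tracked.
        bool recordMiss(uint64_t blockAddr) {
            auto it = pending.find(blockAddr);
            if (it != pending.end()) { ++it->second; return true; }  // merge
            if (pending.size() >= capacity) return false;            // stall
            pending.emplace(blockAddr, 1);
            return true;
        }

        // Called when memory returns the data: the pending miss is resolved.
        void resolve(uint64_t blockAddr) { pending.erase(blockAddr); }
    };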
This adaptive bypassing with the MSHR is also implemented with dynamic SLRU to see
further performance improvements.
4. COHERENCE POLICY
4.1 The Coherence Problem and Ring Interconnects
In multi-core processors, data transactions between several processors and their
respective caches give rise to a coherence problem. It occurs when two processors access
the same physical address space [13]. Shared-memory models should therefore be designed,
together with their cache hierarchies, from a performance-sensitive standpoint. In this
paper, the cache coherence problem is addressed for the ring interconnect model. Rings
were chosen because they are proven to address the coherence problem quite well: they
have an exploitable coherence ordering, simple and distributed arbitration as opposed to
bus topologies, and short, fast point-to-point links with fewer ports [14] [15].

The order of the ring is not the order of the bus, since a bus has a centralized arbiter
[16]. To initiate a request, a core must first access a centralized arbiter and then
send its request to a queue. The queue creates the total order of requests and resends
each request on a separate set of snoop links [17] [18]. Caches snoop the requests and
send the results on another set of links, where the snoop results are collected at
another queue. Finally, the snoop queue resends the final snoop results to all cores
[19]. This type of logical bus incurs a significant performance loss in recreating the
ordering of an atomic bus [20], which is a drawback of crossbar interconnects. We
therefore choose the ring interconnect for the efficiency of its wires. In a way, the
topology is analogous to a traffic roundabout, and this is the idea on which the
snooping was implemented. Rings offer distributed access by the method of the token
ring [20-23].
There have been several proposals to implement coherence on the ring topology. The
Greedy-Order scheme uses unbounded re-entries of cache requests onto the ring to handle
contention; this improves latency but hurts bandwidth. The Ordering-Point scheme uses a
performance-costly ordering point, which hurts latency.
The Ring-Order consistency used in this paper is fast and stable in performance, and it
exploits the round-robin order of the ring. Ring-Order uses a token-counting approach,
passing tokens to the ring nodes in order to ensure coherence safety [22]. A program was
designed to simulate an LRU cache with a write-back and write-allocate policy. The MOESI
snooping protocol was modified for the ring topology so that initial requests succeed
every time; as a result there are no re-entries and no ordering point.
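The safety rule behind token counting [18] [22] is compact: a node may write a block
only while holding all of its tokens, and may read it while holding at least one. The
sketch below encodes only that invariant; the structure and names are illustrative and
do not model the modified ring protocol itself.

    #include <cassert>

    // Token-counting safety check for one cache block: writing requires all
    // tokens, reading requires at least one.
    struct TokenBlock {
        static constexpr int totalTokens = 4;   // one per core in a 4-core ring
        int tokensHeld = 0;

        bool canRead()  const { return tokensHeld >= 1; }
        bool canWrite() const { return tokensHeld == totalTokens; }

        // Tokens circulate around the ring; a node collects them from passing
        // messages and releases them when another node requests the block.
        void acquire(int n) { tokensHeld += n; assert(tokensHeld <= totalTokens); }
        void release(int n) { tokensHeld -= n; assert(tokensHeld >= 0); }
    };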
5. SIMULATION
In order to evaluate the effectiveness of the energy-efficient cache design for
multi-core processors, the improved cache protocols were simulated on the gem5 simulator
[24]. The baseline was an x86 processor with 4 cores. Some of the SPEC 2006 benchmark
programs were used for the simulation.

Fig. 5. CPI for various cache designs vs. benchmarks used
Table 1 below gives the configuration of our baseline system.
Table 1. Baseline System Settings
System Configuration
PRIVATE L1 CACHES Split-I&D, 4kB, 4-way set Assoc
SHARED FILTER CACHE Unified-I&D, 8kB, 8-way set Assoc
SHARED L2 CACHE Unified-I&D, 64kB, 16-way set Assoc
MAIN MEMORY 1GB of DRAM
RING INTERCONNECT 80-byte unidirectional
5.1 Results
The energy-efficient cache design was integrated into gem5, and comparative experiments
were run against the baseline 4-core cache system and against a filter cache with a
fixed distribution of the public filter using a crossbar interconnect [25]. The
public-filter associativity is 16, fixed for simplicity, meaning each core has 4 filter
lines in the fixed filter cache. The dynamic management method (SLRU) is activated every
1000 instructions.
In the experiments, the performance and energy consumption of each benchmark were
observed. Performance is evaluated by CPI (cycles per instruction): the smaller the CPI,
the higher the performance of the system. The results obtained from the experiment were
fed into CACTI to observe power and energy consumption.
Figure 5 shows the improvement in CPI for every benchmark and each modified system. On
average, there is about a 7.68% improvement in the CPI of the fully enhanced system
compared to the baseline system.

Figure 6 shows the reduced energy consumption of each proposed cache system for the
benchmarks. Energy consumption is improved by about 11%.
Fig. 6. Energy consumption for the cache implementations
The coherence policy was simulated using gem5 as well as the SMPCache simulator. Write
transactions were recorded against the normalized traffic (the total L2 cache
misses/transactions on a scale of 0 to 1). Figure 7 shows the improvements.

Fig. 7. Write Transactions vs. L2 Misses
Hit rates were also recorded, and for a 128 KB cache across all 4 cores the performance
was quite good for the given workload, with hit rates ranging from 89% to 97%. The hit
rate for Ring-Order was found to be higher than that of Ordering-Point and Greedy-Order.
The snoops per cycle for the ring also show an improvement over the bus for these
benchmarks.
6. FUTURE WORK
The proposed cache system can be integrated with the coherence policy discussed in this
paper: instead of a crossbar interconnect, the filter-cache and SLRU design would be
implemented on a system whose cores are connected in a ring topology.

Cache power and performance can be further improved using deterministic naps and early
miss detection [26]. Dynamic power can be reduced by 2% through a hash-based mechanism
that minimizes cache line accesses, and there is a 92% improvement in performance due to
skipping a few cache pipeline stages on guaranteed misses. Static power savings of about
17% are achieved by using cache accesses to deterministically lower the power state of
cache lines that are guaranteed not to be accessed in the immediate future [26]. If this
were implemented in the proposed cache design, there would be better results in terms of
performance and power.
7. CONCLUSION
In this paper, an energy-efficient cache design for multi-core processors was proposed.
The baseline cache is implemented with a filter cache structure on the multi-core
I-cache in the form of a public filter, which is the shared first instruction source for
all cores. A dynamic LRU policy for the public filter is also applied. Together they
improve the power and performance of the cache.

The experimental results show that the presented method can save about 11% energy and
also yields a significant improvement in performance. The coherence policy for a ring
topology was also discussed, and the results showed improvement compared with the bus
topology.
ACKNOWLEDGEMENTS
We profoundly thank Professor Dr. Bhanu Kapoor for
providing us guidance, support and encouragement. We
also thank Jiacong He, whose PhD qualifier presentation
inspired us to work on this research.
REFERENCES
[1] Hennessy, J. L. and Patterson, D. A. (2012). "Computer Architecture: A Quantitative
Approach". Elsevier.
[2] W. Tang, R. Gupta, and A. Nicolau, "A Design of
a Predictive Filter Cache for Energy Savings in High
Performance Processor Architectures” Proceedings of the
International Conference on Computer Design, 2001: 68-
73.
[3] Brooks, D., Tiwari, V., & Martonosi, M. (2000).
“Wattch: a framework for architectural-level power anal-
ysis and optimizations” (Vol. 28, No. 2, pp. 83-94). ACM.
[4] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feed-
back directed prefetching: “Improving the performance and
bandwidth-efficiency of hardware prefetchers". In Proc. of
the 13th International Symposium on High Performance
Computer Architecture, 2007.
[5] X. Cao and Z. Xiaolin, "An Energy Efficient Cache Design
for Multi-core Processors”, In IEEE International Confer-
ence on Green Computing and Communications, 2013.
[6] Advanced Micro Devices, Inc., AMD64 Architecture Programmer's Manual Volume 3:
"General-Purpose and System Instructions", May 2013, revision 3.20.
[7] Johnson Kin, Munish Gupta, and William H. Mangione-Smith, "The Filter Cache: An
Energy Efficient Memory Structure", Proceedings of the Thirtieth Annual IEEE/ACM
International Symposium on Microarchitecture, 1997: 184-193.
[8] J. Kin, M. Gupta, and W. H. Mangione-Smith, "Filtering memory references to increase
energy efficiency", IEEE Trans. Comput., vol. 49, no. 1, pp. 1-15, Jan. 2000.
[9] H. Gao and C. Wilkerson,“A dueling segmented LRU
replacement algorithm with adaptive bypassing,” 1st JILP:
Cache Replacement Championship, France, 2010.
[10] K. Morales and B. K. Lee, “Fixed Segmented LRU
cache replacement scheme with selective caching,” 2012
IEEE 31st International Performance Computing and
Communications Conference (IPCCC), Austin, TX, 2012.
[11] H. Gao and C. Wilkerson. “A dueling segmented
LRU replacement algorithm with adaptive bypassing.” In
Proceedings of the 1st JILP Workshop on Computer
Architecture Competitions, 2010
[12] Jayesh Gaur et al. “Bypass and Insertion Algorithms
for Exclusive Last-level Caches.” In ISCA 2011.
[13] Hongil Yoon and Gurindar S. Sohi, “Reducing Coher-
ence Overheads with Multi-line Invalidation (MLI) Mes-
sages”, Computer Sciences Department at University of
Wisconsin-Madison
[14] Daniel J. Sorin, Mark D. Hill, and David A. Wood. “A
Primer on Memory Consistency and Cache Coherence”,
Synthesis Lectures in Computer Architecture, 2011 Mor-
gan & Claypool Publishers.
[15] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor, and T. M. Aamodt, "Cache
coherence for GPU architectures", in HPCA, 2013, pp. 578-590.
[16] R. Kumar, V. Zyuban, and D. Tullsen. “Interconnec-
tions in multi-core architectures: Understanding Mecha-
nisms, Overheads and Scaling”. In Proceedings of the 32nd
Annual International Symposium on Computer Architec-
ture, June 2005.
[17] M. M. K. Martin, P. J. Harper, D. J. Sorin, M. D. Hill,
and D. A. Wood, “Using destination-set prediction to im-
prove the latency/bandwidth trade-off in shared-memory
multiprocessors,” in Proceedings of the 30th ISCA, June
2003.
[18] M. M. K. Martin, M. D. Hill, and D. A. Wood, “Token
coherence: Decoupling performance and correctness,” in
ISCA-30, 2003.
[19] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A.
Wood, “Bandwidth adaptive snooping,” in HPCA-8, 2002.
[20] M. R. Marty, “Cache coherence techniques for multi-
core processors,” in PhD Dissertation, University of Wis-
consin - Madison, 2008.
[21] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu,
M. M. K. Martin, and D. A. Wood, "Improving multiple-CMP
systems using token coherence", in HPCA, February
2005.
[22] M. R. Marty and M. D. Hill, “Coherence ordering for
ring-based chip multiprocessors,” in MICRO-39, December
2006.
[23] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation",
in ISCA-34, 2007.
[24] N. Binkert et al., "The gem5 simulator", SIGARCH Comput. Archit. News, 2011.
[25] “gem5-gpu.cs.wisc.edu”
[26] Oluleye Olorode and Mehrdad Nourani, “Improving
Cache Power and Performance Using Deterministic Naps
and Early Miss Detection”, IEEE Trans. Multi-Scale Com-
puting Systems, Vol 1, No 3, Pages 150–158, 2015.
Ad

More Related Content

What's hot (20)

TRACK B: Multicores & Network On Chip Architectures/ Oren Hollander
TRACK B: Multicores & Network On Chip Architectures/ Oren HollanderTRACK B: Multicores & Network On Chip Architectures/ Oren Hollander
TRACK B: Multicores & Network On Chip Architectures/ Oren Hollander
chiportal
 
Secure remote protocol for fpga reconfiguration
Secure remote protocol for fpga reconfigurationSecure remote protocol for fpga reconfiguration
Secure remote protocol for fpga reconfiguration
eSAT Publishing House
 
MTE104-L2: Overview of Microcontrollers
MTE104-L2: Overview of MicrocontrollersMTE104-L2: Overview of Microcontrollers
MTE104-L2: Overview of Microcontrollers
Abdalla Ahmed
 
Placement and algorithm.
Placement and algorithm.Placement and algorithm.
Placement and algorithm.
Ashish Singh
 
Implementation strategies for digital ics
Implementation strategies for digital icsImplementation strategies for digital ics
Implementation strategies for digital ics
aroosa khan
 
Evaluation of Branch Predictors
Evaluation of Branch PredictorsEvaluation of Branch Predictors
Evaluation of Branch Predictors
Bharat Biyani
 
Design and implementation of 15 4 compressor using 1-bit semi domino full add...
Design and implementation of 15 4 compressor using 1-bit semi domino full add...Design and implementation of 15 4 compressor using 1-bit semi domino full add...
Design and implementation of 15 4 compressor using 1-bit semi domino full add...
eSAT Journals
 
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
IOSRJVSP
 
Distributed Traffic management framework
Distributed Traffic management frameworkDistributed Traffic management framework
Distributed Traffic management framework
Saurabh Nambiar
 
Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...
Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...
Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...
IRJET Journal
 
Ijecet 06 08_003
Ijecet 06 08_003Ijecet 06 08_003
Ijecet 06 08_003
IAEME Publication
 
Implementation of switching controller for the internet router
Implementation of switching controller for the internet routerImplementation of switching controller for the internet router
Implementation of switching controller for the internet router
IAEME Publication
 
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
IJECEIAES
 
Publication
PublicationPublication
Publication
Pranjal Jain
 
Vlsi physical design-notes
Vlsi physical design-notesVlsi physical design-notes
Vlsi physical design-notes
Dr.YNM
 
Iaetsd design and simulation of high speed cmos full adder (2)
Iaetsd design and simulation of high speed cmos full adder (2)Iaetsd design and simulation of high speed cmos full adder (2)
Iaetsd design and simulation of high speed cmos full adder (2)
Iaetsd Iaetsd
 
Dp32725728
Dp32725728Dp32725728
Dp32725728
IJERA Editor
 
A novel mrp so c processor for dispatch time curtailment
A novel mrp so c processor for dispatch time curtailmentA novel mrp so c processor for dispatch time curtailment
A novel mrp so c processor for dispatch time curtailment
eSAT Publishing House
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Fisnik Kraja
 
M.Tech: Advanced Computer Architecture Assignment II
M.Tech: Advanced Computer Architecture Assignment IIM.Tech: Advanced Computer Architecture Assignment II
M.Tech: Advanced Computer Architecture Assignment II
Vijayananda Mohire
 
TRACK B: Multicores & Network On Chip Architectures/ Oren Hollander
TRACK B: Multicores & Network On Chip Architectures/ Oren HollanderTRACK B: Multicores & Network On Chip Architectures/ Oren Hollander
TRACK B: Multicores & Network On Chip Architectures/ Oren Hollander
chiportal
 
Secure remote protocol for fpga reconfiguration
Secure remote protocol for fpga reconfigurationSecure remote protocol for fpga reconfiguration
Secure remote protocol for fpga reconfiguration
eSAT Publishing House
 
MTE104-L2: Overview of Microcontrollers
MTE104-L2: Overview of MicrocontrollersMTE104-L2: Overview of Microcontrollers
MTE104-L2: Overview of Microcontrollers
Abdalla Ahmed
 
Placement and algorithm.
Placement and algorithm.Placement and algorithm.
Placement and algorithm.
Ashish Singh
 
Implementation strategies for digital ics
Implementation strategies for digital icsImplementation strategies for digital ics
Implementation strategies for digital ics
aroosa khan
 
Evaluation of Branch Predictors
Evaluation of Branch PredictorsEvaluation of Branch Predictors
Evaluation of Branch Predictors
Bharat Biyani
 
Design and implementation of 15 4 compressor using 1-bit semi domino full add...
Design and implementation of 15 4 compressor using 1-bit semi domino full add...Design and implementation of 15 4 compressor using 1-bit semi domino full add...
Design and implementation of 15 4 compressor using 1-bit semi domino full add...
eSAT Journals
 
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
IOSRJVSP
 
Distributed Traffic management framework
Distributed Traffic management frameworkDistributed Traffic management framework
Distributed Traffic management framework
Saurabh Nambiar
 
Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...
Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...
Adaptive Neuro-Fuzzy Inference System (ANFIS) for segmentation of image ROI a...
IRJET Journal
 
Implementation of switching controller for the internet router
Implementation of switching controller for the internet routerImplementation of switching controller for the internet router
Implementation of switching controller for the internet router
IAEME Publication
 
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
IJECEIAES
 
Vlsi physical design-notes
Vlsi physical design-notesVlsi physical design-notes
Vlsi physical design-notes
Dr.YNM
 
Iaetsd design and simulation of high speed cmos full adder (2)
Iaetsd design and simulation of high speed cmos full adder (2)Iaetsd design and simulation of high speed cmos full adder (2)
Iaetsd design and simulation of high speed cmos full adder (2)
Iaetsd Iaetsd
 
A novel mrp so c processor for dispatch time curtailment
A novel mrp so c processor for dispatch time curtailmentA novel mrp so c processor for dispatch time curtailment
A novel mrp so c processor for dispatch time curtailment
eSAT Publishing House
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Fisnik Kraja
 
M.Tech: Advanced Computer Architecture Assignment II
M.Tech: Advanced Computer Architecture Assignment IIM.Tech: Advanced Computer Architecture Assignment II
M.Tech: Advanced Computer Architecture Assignment II
Vijayananda Mohire
 

Similar to DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED FILTER CACHES USING GEM5 (20)

Cache memory
Cache memoryCache memory
Cache memory
Eklavya Gupta
 
shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
Shashank Nemawarkar
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
palani kumar
 
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORESWRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
journalijdps
 
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORESWRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
ijdpsjournal
 
Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...
Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...
Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...
IJIR JOURNALS IJIRUSA
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
eSAT Publishing House
 
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
Vijay Prime
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
ijesajournal
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
ijesajournal
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
eSAT Journals
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
Dhritiman Halder
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
ijdpsjournal
 
An efficient multi-level cache system for geometrically interconnected many-...
An efficient multi-level cache system for geometrically  interconnected many-...An efficient multi-level cache system for geometrically  interconnected many-...
An efficient multi-level cache system for geometrically interconnected many-...
International Journal of Reconfigurable and Embedded Systems
 
Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3
Yulong Bai
 
Cmp cache architectures a survey
Cmp cache architectures   a surveyCmp cache architectures   a survey
Cmp cache architectures a survey
eSAT Publishing House
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
IJTET Journal
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
ashishmulchandani
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
csandit
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
palani kumar
 
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORESWRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
journalijdps
 
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORESWRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
ijdpsjournal
 
Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...
Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...
Ijiret archana-kv-increasing-memory-performance-using-cache-optimizations-in-...
IJIR JOURNALS IJIRUSA
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
eSAT Publishing House
 
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
Vijay Prime
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
ijesajournal
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
ijesajournal
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
eSAT Journals
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
Dhritiman Halder
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
ijdpsjournal
 
Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3Tdt4260 miniproject report_group_3
Tdt4260 miniproject report_group_3
Yulong Bai
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
IJTET Journal
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
csandit
 
Ad

More from Ilango Jeyasubramanian (6)

DESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIER
DESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIERDESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIER
DESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIER
Ilango Jeyasubramanian
 
PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...
PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...
PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...
Ilango Jeyasubramanian
 
RELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGY
RELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGYRELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGY
RELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGY
Ilango Jeyasubramanian
 
ACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS USING THE 3DB BANDWIDTH
ACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS  USING THE 3DB BANDWIDTHACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS  USING THE 3DB BANDWIDTH
ACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS USING THE 3DB BANDWIDTH
Ilango Jeyasubramanian
 
PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI...
 PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI... PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI...
PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI...
Ilango Jeyasubramanian
 
STANDARD CELL LIBRARY DESIGN
STANDARD CELL LIBRARY DESIGNSTANDARD CELL LIBRARY DESIGN
STANDARD CELL LIBRARY DESIGN
Ilango Jeyasubramanian
 
DESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIER
DESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIERDESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIER
DESIGNED A 350NM TWO STAGE OPERATIONAL AMPLIFIER
Ilango Jeyasubramanian
 
PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...
PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...
PARASITIC-AWARE FULL PHYSICAL CHIP DESIGN OF LNA RFIC AT 2.45GHZ USING IBM 13...
Ilango Jeyasubramanian
 
RELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGY
RELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGYRELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGY
RELIABLE NoC ROUTER ARCHITECTURE DESIGN USING IBM 130NM TECHNOLOGY
Ilango Jeyasubramanian
 
ACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS USING THE 3DB BANDWIDTH
ACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS  USING THE 3DB BANDWIDTHACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS  USING THE 3DB BANDWIDTH
ACCURATE Q-PREDICTION FOR RFIC SPIRAL INDUCTORS USING THE 3DB BANDWIDTH
Ilango Jeyasubramanian
 
PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI...
 PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI... PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI...
PARASITICS REDUCTION FOR RFIC CMOS LAYOUT AND IIP3 VS Q-BASED DESIGN ANALYSI...
Ilango Jeyasubramanian
 
Ad

Recently uploaded (20)

Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Journal of Soft Computing in Civil Engineering
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
Artificial Intelligence introduction.pptx
Artificial Intelligence introduction.pptxArtificial Intelligence introduction.pptx
Artificial Intelligence introduction.pptx
DrMarwaElsherif
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxbMain cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
SunilSingh610661
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Data Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptxData Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptx
RushaliDeshmukh2
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
New Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdfNew Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdf
mohamedezzat18803
 
LECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's usesLECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's uses
CLokeshBehera123
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
Artificial Intelligence introduction.pptx
Artificial Intelligence introduction.pptxArtificial Intelligence introduction.pptx
Artificial Intelligence introduction.pptx
DrMarwaElsherif
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxbMain cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
SunilSingh610661
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Data Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptxData Structures_Linear data structures Linked Lists.pptx
Data Structures_Linear data structures Linked Lists.pptx
RushaliDeshmukh2
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
New Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdfNew Microsoft PowerPoint Presentation.pdf
New Microsoft PowerPoint Presentation.pdf
mohamedezzat18803
 
LECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's usesLECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's uses
CLokeshBehera123
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 

DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED FILTER CACHES USING GEM5

  • 1. Performance Considerations For Cache Memory Design in a Multi-core Processor ? Divya Ravindran, dxr150630 Ilango Jeyasubramanian, ixj150230 Kavitha Thiagarajan, kxt132230 Susmitha Gogineni, sxg155930 University of Texas at Dallas, Richardson, TX 75030 USA Abstract: In the recent times multi-core processors have gained importance over the traditional uniprocessors as there is a saturated growth in the performance improvements of the uniproces- sors. Multi-core processors make use of multiple cores and in order to improve their performance, there is a high necessity to reduce the memory access time, improve power e ciency and also maintain the coherence of data among the cores. To address to the e ciency of multiple cores, a filter cache is designed with an e cient Segmented Least Recently Used replacement policy. This technique e↵ectively reduces the energy consumed by 11%. Finally, to address the coherence of the caches, a modified MOESI based snooping protocol for the ring topology was used. This improved the performance of the processor by increasing the hit rate by 7%. Keywords: multi-core; filter cache; energy e cient; hit ratio; coherence; LRU; ring-order 1. INTRODUCTION As the number of transistors on the chip is doubling every 18 months following the Moore‘s law, it is observed that the processor speed is also improving at the same rate, but the memory latency has not progressed at the same rate as the processor. Due to this di↵erence in the growth, the time to access the memory becomes larger as the processor speed improves further. In order to overcome this memory wall, caches were built. Cache is a tiny and fast memory and has a smaller access time than the main memory. The beneficial properties of cache has made it desirable in providing e ciency to the processor [1]. This project concentrates on how the Caches can be modified to make the multi-core processor work in an e cient way such that the overall speed up is improved, the energy consumed by the processor is reduced and there is an improvement in its performance. The analysis of the newly implemented cache designs is done using some of the SPEC2006 Benchmarks, In this experiment, the size and associativity of the caches are fixed in order to provide simplicity in analysis, the instruction set architecture(ISA) is built for X86-64 processors. The first modification performed was introducing a filter cache, which is a tiny cache assumed to run at the speed of the core. It consists of the most frequently used instructions and the access time of the data in the filter cache is very short, but the hit rate of the filter cache is low [2]. This is improved by implementing a prediction technique which ? This project paper is edited in the format of International Feder- ation of Automatic Control Conference Papers in LATEX 2"as part of EEDG 6304 Computer Architecture coursework. chooses the memory level to be accessed to reduce misses [3] [4]. In order to see further improvement in hit rate, a Segmented Least recently used(LRU) Block Replacement Policy along with the filter cache is implemented and analyzed. The SLRU consists of two segments and it uses the principle of probability to perform the cache block replacement. The coherence of the multi-core processors were analyzed later with the help of various topologies. The idea was to introduce a modified MOESI based snooping protocol f or the ring topology which helps in improving the coherence of data in a multi-core processor. 
This modification makes use of the round-robin order of ring to provide a fast and stable performance. 2. FILTER CACHE 2.1 Idea of Filter Cache Cache is a very important component of modern processor which can e↵ectively alleviate the speed gap between the CPU and o↵-chip memory system. Multi-core processors have become the main development trend of processors, due to their high performance but power dissipation is a major issue with the large memory accesses of multiple cores. Therefore, an energy e cient cache design is required for energy e ciency. Filter cache is used to improve performance and power e ciency. Filter cache is a small capacity cache which is used to store frequently accessed data and instructions by the cores. Filter cache acts as the first instruction source which will consume less energy for most used instructions and data. The filter is assumed to have almost the same speed as the core and consume less energy than the normal cache. Fig 1. shows the basic idea of the filter cache.
Fig. 2. Filter Cache: how prediction works

The improvement in performance and energy comes from accessing the filter instead of the normal cache: the CPU accesses the filter first, and only when the filter misses is the normal cache visited.

2.2 Prediction Filter Cache

For any instruction or data, the processor first accesses the filter cache. If the filter hits, the instruction fetch completes at very low cost, with no extra loss of performance or energy. Past studies have shown that the hit ratio of the filter is extremely important for a filter cache [5]. Therefore, a prediction algorithm is incorporated to improve the hit ratio of the public filter.

With the prediction cache, the CPU accesses either the filter or the normal cache depending on a prediction signal. The prediction algorithm is designed to eliminate unnecessary accesses to the filter cache [6]. If the prediction in favor of the filter fails, the CPU must re-fetch the instruction through the normal cache, which causes an extra loss of performance and energy.

2.3 Architecture of an Energy Efficient Multi-core Cache System: Public Filter

Each core has separate level-1 instruction and data caches, and all cores share the level-2 LLC. In addition, a public filter cache unit, shared by all cores, is introduced as the first shared cache in the hierarchy. Fig. 3 shows how the architecture is modified to accommodate the filter cache [7].

Fig. 3. Filter Cache: architectural change made to the baseline cache

For each instruction fetch, every core accesses the public filter first. If the public filter hits, the instruction is returned to the core directly [8]. Otherwise, the next memory level, the L2 cache, is accessed until the right instruction is returned, and the public filter is updated with the new cache block containing the missed instruction.

Algorithm 1. Algorithm for the proposed cache design

    CPU sends a request for data;
    while resolving the public filter for the data do
        visit the public filter;
        if the data hits then
            return the instruction;
        else
            visit the LLC;
            if hit then
                return the instruction and update the filter;
            else
                visit main memory and update the filter;
            end
        end
    end

A dynamic replacement method, Segmented LRU (SLRU), is used to maintain a good hit ratio, and dynamic memory management methods are used to distribute the hit ratio equally among all cores.

3. SEGMENTED LRU POLICY

3.1 Existing Segmented LRU

An SLRU cache is divided into two segments, a probationary segment and a protected segment. Lines in each segment are ordered from the most to the least recently accessed; Fig. 4 shows how a block is segmented. Data fetched from memory on a miss is added to the cache at the most recently accessed end of the probationary segment. Cache hits are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Lines in the protected segment have thus been accessed at least twice, which gives each such line another chance to be accessed before being replaced. Victims for replacement are taken from the LRU end of the probationary segment [9]; the sketch below illustrates this two-segment behavior.

Fig. 4. LRU Segmentation: the probationary vs. protected segments
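To make the segment mechanics concrete, here is a small C++ sketch of the classic two-segment policy of [9] (our own simplification, not the gem5 implementation; a real SLRU also demotes lines when the protected segment overflows, which is omitted here):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <list>

    // Simplified SLRU set: the front of each list is the MRU end.
    struct SlruSet {
        std::list<uint64_t> probationary, protectedSeg;
        std::size_t capacity;                 // total lines in the set

        explicit SlruSet(std::size_t cap) : capacity(cap) {}

        void access(uint64_t tag) {
            // Hit: promote the line to the MRU end of the protected segment.
            auto p = std::find(probationary.begin(), probationary.end(), tag);
            if (p != probationary.end()) {
                probationary.erase(p);
                protectedSeg.push_front(tag);
                return;
            }
            auto q = std::find(protectedSeg.begin(), protectedSeg.end(), tag);
            if (q != protectedSeg.end()) {
                protectedSeg.erase(q);
                protectedSeg.push_front(tag);
                return;
            }
            // Miss: evict from the LRU end of the probationary segment if
            // the set is full, then insert the new line at its MRU end.
            if (probationary.size() + protectedSeg.size() >= capacity &&
                !probationary.empty())
                probationary.pop_back();
            probationary.push_front(tag);
        }
    };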
3.2 Dynamic Segmented LRU

Based on our observations of the existing SLRU algorithm, implementations typically use a constant number of protected and probationary ways. The proposed scheme instead sizes the two segments dynamically, based on the access probability of each cache line in the set [10].
The access probabilities are summed line by line, starting from the first line of the set, and the cache line chosen for inserting new miss data from memory is the line at which the summed probability reaches about 0.5. This dynamically adjusts the segment sizes according to access probability.

3.3 Code Snippet for LRU Changes

    void
    LRU::insertBlock(PacketPtr pkt, BlkType *blk)
    {
        BaseSetAssoc::insertBlock(pkt, blk);

        int set = extractSet(pkt->getAddr());

        // Total number of accesses recorded in this set.
        int tot = 0;
        for (int i = 0; i < assoc; i++) {
            BlkType *b1 = sets[set].blks[i];
            tot += b1->refCount;
        }

        int start = 0;
        if (tot != 0) {
            double add = 0.0;
            for (int i = 0; i < assoc; i++) {
                BlkType *b2 = sets[set].blks[i];
                // Access probability of each line (floating point, so the
                // division is not truncated to zero).
                double prob = (double)b2->refCount / tot;
                add += prob;
                // Select the line where the summed probability reaches 0.5;
                // its index becomes the head of the probationary segment.
                if (add >= 0.5) {
                    start = i;
                    break;
                }
            }
        }

        // Set the head of the probationary segment for the new data.
        sets[set].moveToHead1(blk, start);
    }

3.4 Code Snippet for CacheSet

    template <class Blktype>
    void
    CacheSet<Blktype>::moveToHead1(Blktype *blk, int start)
    {
        // Nothing to do if the block already heads the probationary segment.
        if (blks[start] == blk)
            return;

        // Write the 'next' block into blks[i], moving from the segment head
        // toward the LRU end until we overwrite the block being moved; this
        // places blk at the head of the probationary segment.
        int i = start;
        Blktype *next = blk;

        do {
            assert(i < assoc);
            // Swap blks[i] and next.
            Blktype *tmp = blks[i];
            blks[i] = next;
            next = tmp;
            ++i;
        } while (next != blk);
    }

3.5 Dynamic SLRU with Random Promotion and Aging

Traditional implementations of SLRU have shown additional benefit from making selected random promotions. Random promotion allows the algorithm to pick a cache line from the probationary segment at random and promote it to the protected segment. This random promotion is added on top of the dynamic segmented LRU policy to look for further performance improvements.

As a counterpart to random promotion, we also add a cache-line aging mechanism, which demotes the aged cache line with the lowest access probability from the protected to the probationary segment. Both maintenance steps are sketched below.
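The two steps can be expressed as follows, extending the SlruSet structure from the earlier sketch (again our own illustration; the real aging policy uses per-line access probabilities, which this sketch approximates by LRU position within the protected segment):

    #include <cstdlib>    // std::rand
    #include <iterator>   // std::advance

    // Randomly promote one probationary line to the protected segment.
    void randomPromotion(SlruSet &s) {
        if (s.probationary.empty())
            return;
        auto it = s.probationary.begin();
        std::advance(it, std::rand() % s.probationary.size());
        uint64_t tag = *it;
        s.probationary.erase(it);
        s.protectedSeg.push_front(tag);    // the promoted line becomes MRU
    }

    // Age the protected segment: demote its LRU line (the line that has
    // gone longest without a hit) back to the probationary segment, where
    // it gets one more chance before eviction.
    void ageProtected(SlruSet &s) {
        if (s.protectedSeg.empty())
            return;
        uint64_t victim = s.protectedSeg.back();
        s.protectedSeg.pop_back();
        s.probationary.push_front(victim);
    }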
3.6 Dynamic SLRU with Adaptive Bypassing

Cache bypassing avoids invalidating a cache line with high access probability for the sake of just one or two misses. The new data is accessed directly from memory, with no update to the cache line where the miss occurred. This improves the hit rate by keeping highly accessed cache lines in the set a little longer.

Initially, our bypass algorithm picks an arbitrary access probability for implementing adaptive bypassing [11] [12]. The probability of bypassing is then adapted dynamically according to how effective past bypass decisions have been, as measured by the hit rate. Each effective bypass doubles the probability that a future bypass will occur; for example, a current probability of 0.25 doubles to 0.5. Similarly, each ineffective bypass halves the probability of a future bypass, for example cutting 0.5 to 0.25. To turn adaptive bypassing off, the bypass probability is set to 0, which prevents any bypassing and allocates all missed lines. The doubling/halving rule is sketched below.
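The rule fits in a few lines of C++ (a sketch under our own assumptions: the initial probability and the random source are arbitrary choices for illustration, not values fixed by the design):

    #include <algorithm>  // std::min
    #include <random>

    struct BypassController {
        double p = 0.25;                  // bypass probability; 0 disables
        std::mt19937 gen{12345};          // any seed works for the sketch
        std::uniform_real_distribution<double> uni{0.0, 1.0};

        // Decide whether the current miss should bypass the cache.
        bool shouldBypass() { return p > 0.0 && uni(gen) < p; }

        // Feedback from the hit rate: an effective bypass (the preserved
        // line was reused) doubles p, e.g. 0.25 -> 0.5; an ineffective one
        // halves it, e.g. 0.5 -> 0.25.
        void feedback(bool effective) {
            p = effective ? std::min(1.0, p * 2.0) : p / 2.0;
        }
    };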
3.7 Miss Status Holding Register (MSHR)

Adaptive bypassing is implemented with Miss Status Holding Registers, which store cache-miss information without invalidating the corresponding cache line. This in turn improves the hit rate, because the cache can keep supplying hits even while a miss is outstanding. When the data becomes available from memory, the pending miss is resolved and the new data is placed in the cache line. However, adaptive bypassing cannot proceed if the MSHR becomes full: the cache must stall until a pending miss is resolved, freeing enough space to record the new miss and continue the adaptive bypassing mechanism. This adaptive bypassing with MSHRs is likewise combined with dynamic SLRU to look for further performance improvements. A minimal MSHR sketch follows.
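The sketch below is illustrative only (the entry fields and names are our own, not gem5's MSHR class); the point is that allocation fails when all registers are busy, which is exactly the condition that forces the stall described above:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One register records one outstanding miss.
    struct MshrEntry {
        uint64_t blockAddr;
        // Merged target list, requester ids, etc. omitted for brevity.
    };

    struct Mshr {
        std::vector<MshrEntry> entries;
        std::size_t capacity;

        explicit Mshr(std::size_t cap) : capacity(cap) {}

        bool full() const { return entries.size() >= capacity; }

        // Record a new pending miss; returns false (caller must stall)
        // when every register is occupied.
        bool allocate(uint64_t blockAddr) {
            if (full())
                return false;
            entries.push_back({blockAddr});
            return true;
        }

        // Called when the data returns from memory: the pending miss is
        // resolved and its register is freed.
        void resolve(uint64_t blockAddr) {
            entries.erase(std::remove_if(entries.begin(), entries.end(),
                [&](const MshrEntry &e) { return e.blockAddr == blockAddr; }),
                entries.end());
        }
    };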
4. COHERENCE POLICY

4.1 Coherence on a Ring Interconnect

In multi-core processors, data transactions between several processors and their respective caches create a coherence problem, which occurs when two processors access the same physical address space [13]. Shared-memory models and their cache hierarchies should therefore be designed from a performance-sensitive standpoint.

In this paper, the cache coherence problem is addressed for the ring interconnect model. Rings were chosen because they are proven to address the coherence problem well: they offer an exploitable coherence ordering, simple and distributed arbitration (as opposed to bus topologies), and short, few-ported, fast point-to-point links [14] [15].

The order of a ring is not the order of a bus, since a bus has a centralized arbiter [16]. To initiate a request on a logical bus, a core must first access the centralized arbiter and then send its request to a queue. The queue creates the total order of requests and resends each request on a separate set of snoop links [17] [18]. Caches snoop the requests and send the results on yet another set of links, where the snoop results are collected at another queue; finally, the snoop queue resends the final snoop results to all cores [19]. This type of logical bus suffers a significant performance loss in recreating the ordering of an atomic bus [20], which is a drawback of crossbar interconnects. We therefore adopt the ring interconnect for the efficiency of its wires; the topology is analogous to a traffic roundabout, and this is the setting in which the snooping was implemented. Rings offer distributed access by the "token ring" method [20-23].

There have been several proposals for implementing coherence on a ring topology. Greedy-Order handles contention with unbounded re-entries of cache requests onto the ring, which improves latency but hurts bandwidth. Ordering-Point uses a performance-costly ordering point, which hurts latency. The Ring-Order consistency used in this paper is fast and stable in performance, and it exploits the round-robin order of the ring. Ring-Order takes a token-counting approach, passing tokens to ring nodes to ensure coherence safety [22]; a simplified sketch of the token idea is given below.

A program was designed to simulate an LRU cache with a write-back, write-allocate policy. The MOESI snooping protocol was modified for the ring topology so that initial requests succeed every time; as a result there are no re-entries and no ordering point.
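To show the flavor of the token idea, the following deliberately simplified C++ sketch (our own illustration of token counting in the spirit of [18] and [22]; the real Ring-Order protocol also manages a priority token and the data transfer itself, which we omit) lets a reader acquire one token and a writer acquire all of them in a single traversal of the unidirectional ring:

    #include <vector>

    constexpr int kTotalTokens = 4;   // e.g. one token per core

    struct NodeBlockState {
        int tokens = 0;               // tokens this node holds for one block
        bool canRead()  const { return tokens >= 1; }
        bool canWrite() const { return tokens == kTotalTokens; }
    };

    // A request circulates the unidirectional ring once; each node it passes
    // surrenders the tokens the requester still needs. Because the ring fixes
    // a round-robin order, the request is satisfied in one traversal, with no
    // retries and no separate ordering point.
    void serviceRequest(std::vector<NodeBlockState> &ring,
                        int requester, bool isWrite) {
        const int n = static_cast<int>(ring.size());
        const int needed = isWrite ? kTotalTokens : 1;
        for (int i = (requester + 1) % n;
             ring[requester].tokens < needed && i != requester;
             i = (i + 1) % n) {
            // A writer takes every token it passes; a reader needs just one.
            int take = isWrite ? ring[i].tokens
                               : (ring[i].tokens > 0 ? 1 : 0);
            ring[i].tokens -= take;
            ring[requester].tokens += take;
        }
    }

Because a writer must hold all kTotalTokens tokens before writing, no reader can hold a token at the same time; this is the safety property that token counting provides.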
5. SIMULATION

To evaluate the effectiveness of the energy-efficient cache design for a multi-core processor, the improved cache protocols were simulated on the gem5 simulator [24]. The baseline was an X86 processor with 4 cores, and some of the SPEC 2006 benchmark programs were used for the simulation. Table 1 lists the configuration of the baseline system.

Table 1. Baseline System Settings

    System                 Configuration
    PRIVATE L1 CACHES      Split I&D, 4 kB, 4-way set associative
    SHARED FILTER CACHE    Unified I&D, 8 kB, 8-way set associative
    SHARED L2 CACHE        Unified I&D, 64 kB, 16-way set associative
    MAIN MEMORY            1 GB of DRAM
    RING INTERCONNECT      80-byte unidirectional

5.1 Results

The energy-efficient cache design was integrated into gem5, and comparative experiments were run against the baseline 4-core cache system and against the filter cache with a fixed distribution of the public filter using a crossbar interconnect [25]. The public-filter associativity was fixed at 16 for simplicity, which gives each core 4 filter lines in the fixed filter cache. The dynamic management method (SLRU) is activated every 1000 instructions.

In the experiments, the performance and energy consumption of each benchmark were observed. Performance is evaluated by CPI (Cycles Per Instruction): the smaller the CPI, the higher the performance of the system. The results of the experiment were fed into CACTI to observe power and energy consumption.

Fig. 5. CPI for various cache designs vs. the benchmarks used

Figure 5 shows the CPI improvement for every benchmark on the modified system. On average, the fully enhanced system improves CPI by about 7.68% compared to the baseline system. Figure 6 shows the reduced energy consumption of each proposed cache system across the benchmarks; energy consumption is improved by about 11%.

Fig. 6. Energy consumption for the cache implementations

The coherence policy was simulated using gem5 as well as the SMPCache simulator. Write transactions were recorded against normalized traffic (total L2 cache misses/transactions on a scale of 0 to 1); Figure 7 shows the improvements.

Fig. 7. Write transactions vs. L2 misses

The hit rates were also recorded: for a 128 KB cache across all 4 cores, performance was good for the given workload, with hit rates ranging from 89% to 97%. The hit rate for Ring-Order was found to be higher than that of Ordering-Point and Greedy-Order, and the snoops per cycle for rings show improvement over the bus for these benchmarks.

6. FUTURE WORK

The proposed cache system can be integrated with the coherence policy discussed in this paper: instead of a crossbar interconnect, the filter-cache and SLRU design would be implemented on a system whose cores are connected in a ring topology.

Cache power and performance can also be improved using deterministic naps and early miss detection [26]. In that scheme, dynamic power is reduced by 2% through a hash-based mechanism that minimizes the cache lines accessed, performance improves by 92% by skipping some cache pipeline stages for guaranteed misses, and static power savings of about 17% are achieved by using cache accesses to deterministically lower the power state of cache lines that are guaranteed not to be accessed in the immediate future. Implementing this in the proposed cache design should yield further gains in performance and power.

7. CONCLUSION

In this paper, an energy-efficient cache design for multi-core processors was proposed. The baseline cache is augmented with a filter cache on the multi-core I-cache in the form of a public filter, which serves as the shared first instruction source for all cores, and a dynamic LRU policy for the public filter is applied as well. Together they improve the power and performance of the cache: the experimental results show that the presented method saves about 11% energy and also delivers a significant improvement in performance. A coherence policy for a ring topology was also discussed, and its results showed improvement over the bus topology.

ACKNOWLEDGEMENTS

We profoundly thank Professor Bhanu Kapoor for his guidance, support and encouragement. We also thank Jiacong He, whose PhD qualifier presentation inspired us to work on this research.

REFERENCES

[1] J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach", Elsevier, 2012.
[2] W. Tang, R. Gupta, and A. Nicolau, "A Design of a Predictive Filter Cache for Energy Savings in High Performance Processor Architectures", in Proceedings of the International Conference on Computer Design, 2001, pp. 68-73.
[3] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: a framework for architectural-level power analysis and optimizations", ACM SIGARCH Computer Architecture News, vol. 28, no. 2, pp. 83-94, 2000.
[4] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback directed prefetching: improving the performance and bandwidth-efficiency of hardware prefetchers", in Proc. 13th International Symposium on High Performance Computer Architecture, 2007.
[5] X. Cao and X. Zhang, "An Energy Efficient Cache Design for Multi-core Processors", in IEEE International Conference on Green Computing and Communications, 2013.
[6] Advanced Micro Devices, Inc., "AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions", revision 3.20, May 2013.
[7] J. Kin, M. Gupta, and W. H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure", in Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997, pp. 184-193.
[8] J. Kin, M. Gupta, and W. H. Mangione-Smith, "Filtering memory references to increase energy efficiency", IEEE Trans. Comput., vol. 49, no. 1, pp. 1-15, Jan. 2000.
[9] H. Gao and C. Wilkerson, "A dueling segmented LRU replacement algorithm with adaptive bypassing", 1st JILP Workshop: Cache Replacement Championship, France, 2010.
[10] K. Morales and B. K. Lee, "Fixed Segmented LRU cache replacement scheme with selective caching", in 2012 IEEE 31st International Performance Computing and Communications Conference (IPCCC), Austin, TX, 2012.
[11] H. Gao and C. Wilkerson, "A dueling segmented LRU replacement algorithm with adaptive bypassing", in Proceedings of the 1st JILP Workshop on Computer Architecture Competitions, 2010.
[12] J. Gaur et al., "Bypass and Insertion Algorithms for Exclusive Last-level Caches", in ISCA, 2011.
[13] H. Yoon and G. S. Sohi, "Reducing Coherence Overheads with Multi-line Invalidation (MLI) Messages", Computer Sciences Department, University of Wisconsin-Madison.
[14] D. J. Sorin, M. D. Hill, and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence", Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2011.
[15] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor, and T. M. Aamodt, "Cache coherence for GPU architectures", in HPCA, 2013, pp. 578-590.
[16] R. Kumar, V. Zyuban, and D. Tullsen, "Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, June 2005.
[17] M. M. K. Martin, P. J. Harper, D. J. Sorin, M. D. Hill, and D. A. Wood, "Using destination-set prediction to improve the latency/bandwidth trade-off in shared-memory multiprocessors", in Proceedings of the 30th ISCA, June 2003.
[18] M. M. K. Martin, M. D. Hill, and D. A. Wood, "Token coherence: decoupling performance and correctness", in ISCA-30, 2003.
[19] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A. Wood, "Bandwidth adaptive snooping", in HPCA-8, 2002.
[20] M. R. Marty, "Cache coherence techniques for multicore processors", PhD dissertation, University of Wisconsin-Madison, 2008.
[21] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu, M. M. K. Martin, and D. A. Wood, "Improving multiple-CMP systems using token coherence", in HPCA, February 2005.
[22] M. R. Marty and M. D. Hill, "Coherence ordering for ring-based chip multiprocessors", in MICRO-39, December 2006.
[23] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation", in ISCA-34, 2007.
[24] N. Binkert et al., "The gem5 simulator", SIGARCH Computer Architecture News, 2011.
[25] gem5-gpu.cs.wisc.edu
[26] O. Olorode and M. Nourani, "Improving Cache Power and Performance Using Deterministic Naps and Early Miss Detection", IEEE Trans. Multi-Scale Computing Systems, vol. 1, no. 3, pp. 150-158, 2015.