An Efficient Multi-Level Cache System For Geometrically Interconnected Many-Core Chip Multiprocessor
Corresponding Author:
Khalid Abed
Department of Electrical & Computer Engineering and Computer Science, Jackson State University
1400 John R Lynch Street (JSU Box 17098), Jackson, MS. 39217, USA
Email: [email protected]
1. INTRODUCTION
In recent years, many-core chips are trending as an on-chip computing platform [1]–[3] that can provide
massive computational power for heterogeneous computing environments for big data [4] and other compute-
intensive embedded artificial intelligence applications [5]. Some recent work [6]–[9] on high performance
computing for big data has focused on processing frameworks, architecture synthesis, and the utilization of
multiple cores. With increased very large-scale integration (VLSI) density, it may still be manageable to
provide heterogeneous computing using a cost-effective on-chip interconnection and cache memory system.
From past research on bus-based interconnection for large parallel processing systems [10], it was
determined that a regular multiple-bus interconnection using a number of buses equal to one-half of the
cores or memory modules gives comparable memory bandwidth. However, such a reduced bus
interconnection is costly for a chip multiprocessor (CMP) due to the large number of bus-core/memory connections.
In our earlier research, we proposed a cost-effective interconnection using geometrical patterns for bus-
core/memory connections [11] with a reduced number of buses. The approach in [11] was extended to a system-
level configuration defined with three geometrical system configurations, termed geometrical bus
interconnection (GBI) [12], for bus-memory connections using a rhombic connection pattern as the base. We
achieved cost savings of 1.8x to 2.4x with GBI compared to the regular reduced bus interconnection.
However, as the overall throughput of a many-core CMP is also determined by cache system
performance, achieving high overall CMP throughput with a cost- and performance-efficient interconnection and
cache system is highly desirable today.
Providing adequate and sustained many-core CMP throughput becomes more challenging because it
also requires an efficient cache system solution. Toward this challenge, our focus is to present a cost-effective
multi-level cache system that improves the overall many-core CMP throughput using the comparable memory
bandwidth results from the cost-effective GBI [12]. A typical multi-level cache hierarchy for
multi-core systems, as shown in Figure 1, has private L1 and L2 caches per core at levels 1 and 2, and a shared
L3 cache as the last level cache (LLC) at level 3. For example, current mainstream commercial
multi-core processors such as the Intel® Core™ i5 have three levels of cache: a per-core L1 with
separate instruction and data caches, a per-core unified (instruction/data) L2 cache, and a shared L3 cache as
the LLC (shared by all cores).
Figure 1. Traditional multi-level cache system with L1, L2 and L3 for multi-core CMP
Adding a large number of fast on-chip private L1 and L2 caches per core along with a shared L3 may
increase the cache system cost. As a result, we propose an alternative solution that combines L1 with a relatively
slower shared bus cache (SBC) as the LLC, added to every bus line of the GBI [12], so that the data requests of all
cores are shared via the GBI. In addition, our proposed cache system may also provide the ability to
increase the cache levels and sizes within the hierarchy through cache reconfiguration in order to optimize
the system for cost, performance, and power consumption.
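To make this parameterization concrete, the following is a minimal sketch of how such a reconfigurable hierarchy could be described; the class and field names (L1SBCConfig, CacheLevelConfig) and the example sizes are illustrative assumptions, not the actual design used in this work.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CacheLevelConfig:
    """One level of the proposed hierarchy (per-core L1 or per-bus SBC)."""
    name: str            # e.g., "L1" or "SBC"
    blocks: int          # number of cache blocks at this level (placeholder sizes below)
    access_cycles: int   # cache access time in cycles

@dataclass
class L1SBCConfig:
    """Illustrative parameterization of an L1-SBC system on the GBI."""
    cores: int                        # n cores, each with a private L1
    buses: int                        # b GBI bus lines, one SBC per bus line
    memory_modules: int               # m off-chip memory modules
    levels: List[CacheLevelConfig] = field(default_factory=list)

    def reconfigure(self, level_name: str, blocks: int) -> None:
        """Resize one level, e.g., to trade cache cost against hit concurrency."""
        for level in self.levels:
            if level.name == level_name:
                level.blocks = blocks

# Example: 16 cores with n/2 = 8 buses; all cache sizes are placeholders.
cfg = L1SBCConfig(cores=16, buses=8, memory_modules=8,
                  levels=[CacheLevelConfig("L1", blocks=64, access_cycles=1),
                          CacheLevelConfig("SBC", blocks=256, access_cycles=4)])
cfg.reconfigure("SBC", blocks=512)
```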
Some earlier research [13]–[16] has addressed various cache system architectures, issues, and
solutions for improved performance. In [13], the authors analyzed memory performance for tiled
many-core CMPs. Lin et al. [14] suggested a hybrid cache system that adds cache layers between memory and
the database to improve performance for specific relational database queries in big data
applications. Charles et al. [15] looked at cache reconfiguration for network-on-chip (NoC) based many-core
CMPs. Safayenikoo et al. [16] suggested an energy-efficient cache architecture to address the problem of
increased leakage power resulting from the large area of the LLC (as much as 50% of the chip area) due to its
increased size. Most of the work reported in [13]–[16] may require a complex cache design process. Our
proposed cache system solution is simple and does not add any extra or difficult cache design steps. Our
main contributions in this paper are: i) propose a shared bus cache (SBC) within a multi-level cache
system; ii) present a least recently used (LRU) multi-level cache system simulation to extract hit and miss
concurrencies; iii) apply concurrent average memory access time (C-AMAT) [17] to accurately determine
the system throughput performance and present our results; and iv) provide conclusions and present some
insight into future research.
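As a rough illustration of contribution ii), the sketch below shows a minimal fully associative LRU cache model that could be composed into a two-level L1 plus SBC lookup to count hits and misses; it is not the simulator used in this paper, and the fully associative mapping and block-level granularity are simplifying assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal fully associative LRU cache model (illustrative only)."""
    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()   # block address -> None, kept in LRU order
        self.hits = 0
        self.misses = 0

    def access(self, block_addr: int) -> bool:
        """Return True on a hit, False on a miss; update LRU order and counters."""
        if block_addr in self.blocks:
            self.blocks.move_to_end(block_addr)   # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.blocks) >= self.num_blocks:
            self.blocks.popitem(last=False)       # evict the least recently used block
        self.blocks[block_addr] = None
        return False

def lookup(l1: LRUCache, sbc: LRUCache, block_addr: int) -> str:
    """Two-level lookup: per-core L1 first, then the shared bus cache on an L1 miss."""
    if l1.access(block_addr):
        return "L1 hit"
    if sbc.access(block_addr):
        return "SBC hit"
    return "global memory access"
```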
Figure 2. L1-SBC cache system: n cores with n L1 caches, b shared bus caches (SBC) on the GBI, and m off-chip memory modules
Where $t_1$ and $t_2$ are the cache access times for the level 1 and level 2 caches, $h_1$ and $h_2$ are the cache hit ratios
for the level 1 and level 2 caches, and $t_m$ is the global memory access time. In our approach, we exploit hit
concurrency for the core and SBC caches and miss concurrency for the SBC supported by the GBI, and apply C-AMAT for
performance evaluation. Hit concurrency improves performance, while a cache miss may degrade the
memory system performance, depending on the hit concurrency. By taking advantage of multiple buses with miss
concurrency, higher system performance can be achieved. However, the application of C-AMAT needs to
ensure that the miss concurrency does not exceed the interconnection bandwidth of the reduced number of buses.
Thus, we re-write (1) and (2) as (3) and (4).
$C\text{-}AMAT_1 = \frac{t_1 h_1}{c_{h1}} + (1 - h_1)\frac{t_m}{c_m}$ (3)

$C\text{-}AMAT_2 = \frac{t_1 h_1}{c_{h1}} + (1 - h_1)\left(\frac{t_2 h_2}{c_{h2}} + (1 - h_2)\frac{t_m}{c_m}\right)$ (4)
Where $c_{h1}$ and $c_{h2}$ are the average hit cycle concurrencies at levels 1 and 2, and $c_m$ is the average
miss cycle concurrency. In this paper, we evaluate the L1, L12, and L1-SBC systems. We selected the minimum
number of L1 and SBC cache blocks that meets the following criterion for hit and miss concurrency, given as
(5).
$c_h \le n, \quad c_m \le n/2$ (5)
Since the GBI interconnection provides a memory bandwidth of $n/2$, we can also approximate (4)
using the miss concurrency supported by the GBI memory bandwidth, as given in (6).
$C\text{-}AMAT = \frac{t_1 h_1}{c_{h1}} + (1 - h_1)\left(\frac{t_2 h_2}{c_{h2}} + (1 - h_2)\frac{2 t_m}{n}\right)$ (6)
When $c_m$ is less than $n/2$, the interconnection bandwidth is not fully utilized. The C-AMAT
given in (6) is smaller than that obtained with the conservative miss concurrency in (4). The percentage deviation
from (4) to (6) varies from 4% to 30% across all cache systems. We see a higher deviation for the L1-SBC system,
which is attributed to the fact that the miss concurrency decreases as a result of the higher hit concurrency of the
bus cache during read cycles. In this paper, we include only the conservative results from (3) and (4) for the L1 and
L12 cache systems, respectively, while ensuring criterion (5).
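To illustrate how (3), (4), and (6) are evaluated together with criterion (5), the sketch below codes the formulas directly; the hit ratios and access times in the example call are placeholder values (only the concurrency figures are taken from Table 4), so the output is illustrative rather than a result of this paper.

```python
def c_amat1(t1, h1, tm, ch1, cm):
    """Equation (3): single-level (L1) C-AMAT."""
    return t1 * h1 / ch1 + (1 - h1) * tm / cm

def c_amat2(t1, h1, t2, h2, tm, ch1, ch2, cm):
    """Equation (4): two-level C-AMAT with the conservative miss concurrency cm."""
    return t1 * h1 / ch1 + (1 - h1) * (t2 * h2 / ch2 + (1 - h2) * tm / cm)

def c_amat_gbi(t1, h1, t2, h2, tm, ch1, ch2, n):
    """Equation (6): miss concurrency approximated by the GBI bandwidth n/2."""
    return t1 * h1 / ch1 + (1 - h1) * (t2 * h2 / ch2 + (1 - h2) * 2 * tm / n)

def meets_criterion(ch, cm, n):
    """Criterion (5): hit concurrency bounded by n, miss concurrency by n/2."""
    return ch <= n and cm <= n / 2

# 64-core L1-SBC at 50% reads: concurrencies from Table 4; t1, t2, tm, h1, h2 are
# placeholder values, not the parameters used in this paper.
n = 64
print(c_amat2(t1=1, h1=0.90, t2=4, h2=0.80, tm=100, ch1=16.7, ch2=31.5, cm=16.1))
print(c_amat_gbi(t1=1, h1=0.90, t2=4, h2=0.80, tm=100, ch1=16.7, ch2=31.5, n=n))
print(meets_criterion(ch=31.5, cm=16.1, n=n))
```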
Table 1. Normalized average GBI cost compared to fully reduced bus system [10]
No. of Cores Normalized Cost
16 0.69
32 0.66
64 0.64
Table 4. Cache hit and miss concurrency with 50% read requests
System size          16                      32                      64
             c_h1   c_h2   c_m      c_h1   c_h2   c_m      c_h1   c_h2   c_m
L1           0.4    -      8.3      0.4    -      16.1     0.5    -      31.9
L12          4.5    4.2    8.0      8.5    8.2    15.7     16.8   16.1   31.2
L1-SBC       4.4    5.2    7.4      8.5    11.9   12.8     16.7   31.5   16.1
Table 5. Cache hit and miss concurrency with 80% read requests
System size          16                      32                      64
             c_h1   c_h2   c_m      c_h1   c_h2   c_m      c_h1   c_h2   c_m
L1           0.4    -      8.3      0.4    -      16.1     0.5    -      31.9
L12          1.9    6.8    7.1      3.7    13.1   14.3     7.3    25.9   29.2
L1-SBC       2.0    8.2    6.4      3.7    19.3   10.6     6.9    50.9   16.1
Figures 5 and 6 show the hit and miss concurrency for L1-SBC with 50% and 80% read requests,
respectively. For the same number of cores, the miss concurrency decreases for L1-SBC compared to L1
due to the higher hit concurrency in the SBC. The miss concurrency utilization in L1-SBC is about 50% for larger
numbers of cores. This is attributed to the fact that the SBC offers higher hit concurrency, yielding reduced
memory traffic over the interconnection. Even though the low miss concurrency utilization may suggest that the
number of buses for a higher number of cores could be reduced further, doing so would invariably decrease the hit rate
of the SBC due to lower bandwidth availability, thus nullifying any overall advantage. As there are more data reads
than data writes (80% versus 50% read requests), the SBC hit concurrency increases by approximately 1.5x for the same system size.
Figure 5. Cache hit and miss concurrency for L1-SBC with 50 % read requests
Figure 6. Cache hit and miss concurrency for L1-SBC with 80 % read requests
As a result of the increased SBC hit concurrency, the C-AMAT decreases with the number of cores.
Figure 7 shows the C-AMAT for 50% and 80% read requests. A further reduction in C-AMAT is
seen for 80% read requests due to the increase in SBC hit concurrency.
Figure 7. C-AMAT (cycles) versus number of cores for 50 % and 80 % read requests
Where b is the number of buses and the bus data width is 2 bytes. We assumed a GBI bus arbitration and
bus allocation reconfiguration time (t_r) of 1 cycle and a clock cycle time of 0.5 ns. Table 8 summarizes our
throughput results in GB per second. We used the normalized unit cost from Table 3 and the C-AMAT from (3) and
(4). As shown in Table 8, the throughput increases with the number of cores and with the read request percentage,
indicating a clear advantage.
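Because the throughput expression referenced above falls outside this excerpt, the sketch below uses an assumed form, throughput ≈ (b × bus width) / ((C-AMAT + t_r) × clock period), purely to illustrate the unit conversion to GB/s under the stated assumptions (2-byte bus width, t_r = 1 cycle, 0.5 ns clock); it should not be read as the paper's actual throughput equation.

```python
def throughput_gb_per_s(b, bus_width_bytes, c_amat_cycles, tr_cycles, clock_ns):
    """Assumed throughput model (NOT the paper's equation): bytes carried per
    memory transaction across b buses, divided by the time per transaction
    (C-AMAT plus reconfiguration overhead t_r) in nanoseconds. 1 byte/ns = 1 GB/s."""
    bytes_per_transaction = b * bus_width_bytes
    time_ns = (c_amat_cycles + tr_cycles) * clock_ns
    return bytes_per_transaction / time_ns

# Stated assumptions from the text: 2-byte bus width, t_r = 1 cycle, 0.5 ns clock.
# b = n/2 = 32 buses and the C-AMAT value are placeholders for a 64-core system.
print(throughput_gb_per_s(b=32, bus_width_bytes=2, c_amat_cycles=0.34,
                          tr_cycles=1, clock_ns=0.5))
```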
Figure 8 shows the throughput for 50% and 80% read requests. Figure 9 shows the
average throughput improvement factor of the L12 and L1-SBC cache systems over the L1 cache system. We found
that the average throughput improvement factor of the L12 cache system across all system sizes is 1.5x for 50%
read requests and 1.8x for 80% read requests compared to L1. We determined that the average throughput
improvement for the L1-SBC memory system is 2.5x for 50% read requests and 2.4x for 80% read requests
compared to the L1 system. As the cost increase of L1-SBC over L1 is negligible (0.5%), we conclude
that the L1-SBC cache is both cost- and performance-efficient compared to the L1 and L12 cache systems; L1-SBC
offers a 30% to 60% larger throughput improvement factor over L1 than L12 does.
Figure 8. Throughput (GB/s) versus number of cores for 50 % and 80 % read requests
Figure 9. Average throughput improvement factor over L1 for the L12 and L1-SBC cache systems at 50 % and 80 % read requests
Figure 10. Throughput degradation with a single bus fault for 50 % read requests
ACKNOWLEDGEMENTS
This work was supported in part by Army Research Office HBCU/MSI contract number W911NF-
13-1-0133 entitled: “Exploring High Performance Heterogeneous Computing via Hardware/Software Co-
Design”.
REFERENCES
[1] S. Le Beux, P. V. Gratz, and I. O’Connor, “Guest editorial: emerging technologies and architectures for manycore computing part
1: hardware techniques,” IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 2, pp. 97–98, Apr. 2018, doi:
10.1109/TMSCS.2018.2826758.
[2] S. Savas, Z. Ul-Abdin, and T. Nordström, “A framework to generate domain-specific manycore architectures from dataflow
programs,” Microprocessors and Microsystems, vol. 72, p. 102908, Feb. 2020, doi: 10.1016/j.micpro.2019.102908.
[3] J. Ax et al., “CoreVA-MPSoC: a many-core architecture with tightly coupled shared and local data memories,” IEEE
Transactions on Parallel and Distributed Systems, vol. 29, no. 5, pp. 1030–1043, May 2018, doi: 10.1109/TPDS.2017.2785799.
[4] H. Homayoun, “Heterogeneous chip multiprocessor architectures for big data applications,” in Proceedings of the ACM
International Conference on Computing Frontiers, May 2016, pp. 400–405, doi: 10.1145/2903150.2908078.
[5] A. Parashar, A. Abraham, D. Chaudhary, and V. N. Rajendiran, “Processor pipelining method for efficient deep neural network
inference on embedded devices,” in Proceedings - 2020 IEEE 27th International Conference on High Performance Computing,
Data, and Analytics, HiPC 2020, Dec. 2020, pp. 82–90, doi: 10.1109/HiPC50609.2020.00022.
[6] L. Cheng et al., “A tensor processing framework for CPU-manycore heterogeneous systems,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, pp. 1–1, 2021, doi: 10.1109/tcad.2021.3103825.
[7] M. Goudarzi, “Heterogeneous architectures for Big Data batch processing in MapReduce paradigm,” IEEE Transactions on Big
Data, vol. 5, no. 1, pp. 18–33, Mar. 2019, doi: 10.1109/TBDATA.2017.2736557.
[8] E. Alareqi, T. Ramesh, and K. Abed, “Functional heterogeneous processor affinity characterization to Big Data: towards machine
learning approach,” in 2017 International Conference on Computational Science and Computational Intelligence (CSCI), Dec.
2017, pp. 1432–1436, doi: 10.1109/CSCI.2017.250.
[9] C. Lai, X. Shi, and M. Huang, “Efficient utilization of multi-core processors and many-core co-processors on supercomputer
beacon for scalable geocomputation and geo-simulation over big earth data,” Big Earth Data, vol. 2, no. 1, pp. 65–85, Jan. 2018,
doi: 10.1080/20964471.2018.1434265.
[10] T. N. Mudge, J. P. Hayes, G. D. Buzzard, and D. C. Winsor, “Analysis of multiple-bus interconnection networks,” Journal of
Parallel and Distributed Computing, vol. 3, no. 3, pp. 328–343, 1986, doi: 10.1016/0743-7315(86)90019-5.
[11] T. Ramesh and K. Abed, “Reconfigurable many-core embedded computing platform with Geometrical bus interconnection,” in
Proceedings - 2020 International Conference on Computational Science and Computational Intelligence, CSCI 2020, Dec. 2020,
pp. 1256–1259, doi: 10.1109/CSCI51800.2020.00234.
[12] T. Ramesh and K. Abed, “Cost-efficient reconfigurable geometrical bus interconnection system for many-core platforms,”
International Journal of Reconfigurable and Embedded Systems (IJRES), vol. 10, no. 2, pp. 77–89, Jul. 2021, doi:
10.11591/ijres.v10.i2.pp77-89.
[13] Y. Liu, S. Kato, and M. Edahiro, “Analysis of Memory System of Tiled Many-Core Processors,” IEEE Access, vol. 7, pp. 18964–
18974, 2019, doi: 10.1109/ACCESS.2019.2895701.
[14] Y. Te Lin, Y. H. Hsiao, F. P. Lin, and C. M. Wang, “A hybrid cache architecture of shared memory and meta-table used in big
multimedia query,” in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science, ICIS 2016 -
Proceedings, Jun. 2016, pp. 1–6, doi: 10.1109/ICIS.2016.7550809.
[15] S. Charles, A. Ahmed, U. Y. Ogras, and P. Mishra, “Efficient cache reconfiguration using machine learning in NoC-based many-
core CMPs,” ACM Transactions on Design Automation of Electronic Systems, 2019, doi: 10.1145/1122445.1122456.
[16] P. Safayenikoo, A. Asad, and F. Mohammadi, “An Energy-Efficient Cache Architecture for Chip-Multiprocessors Based on Non-
Uniformity Accesses,” in 2018 IEEE Canadian Conference on Electrical & Computer Engineering (CCECE), May 2018, pp. 1–4,
doi: 10.1109/CCECE.2018.8447736.
[17] X. H. Sun and D. Wang, “Concurrent average memory access time,” Computer, vol. 47, no. 5, pp. 74–80, May 2014, doi:
10.1109/MC.2013.227.
[18] J. H. Lee, S. W. Jeong, S. D. Kim, and C. C. Weems, “An intelligent cache system with hardware prefetching for high
performance,” IEEE Transactions on Computers, vol. 52, no. 5, pp. 607–616, May 2003, doi: 10.1109/TC.2003.1197127.
[19] Y. Niranjan, S. Tiwari, and R. Gupta, “Average memory access time reduction in multilevel cache of proxy server,” in
Proceedings of the 2013 3rd IEEE International Advance Computing Conference, IACC 2013, Feb. 2013, vol. 2013-Febru, pp.
44–47, doi: 10.1109/IAdCC.2013.6506813.
[20] D. Chen, H. Jin, X. Liao, H. Liu, R. Guo, and D. Liu, “MALRU: Miss-penalty aware LRU-based cache replacement for hybrid
memory systems,” in Proceedings of the 2017 Design, Automation and Test in Europe, DATE 2017, Mar. 2017, pp. 1086–1091,
doi: 10.23919/DATE.2017.7927151.
[21] A. K. Singh, K. Geetha, S. Vollala, and N. Ramasubramanian, “Efficient Utilization of Shared Caches in Multicore
Architectures,” Arabian Journal for Science and Engineering, vol. 41, no. 12, pp. 5169–5179, Dec. 2016, doi: 10.1007/s13369-
016-2197-0.
[22] M. D. Hill and A. J. Smith, “Evaluating Associativity in CPU Caches,” IEEE Transactions on Computers, vol. 38, no. 12, pp.
1612–1630, 1989, doi: 10.1109/12.40842.
[23] D. Ramtake, N. Singh, S. Kumar, and V. K. Patle, “Cache Associativity Analysis of Multicore Systems,” in 2020 International
Conference on Computer Science, Engineering and Applications, ICCSEA 2020, Mar. 2020, pp. 1–4, doi:
10.1109/ICCSEA49143.2020.9132884.
[24] S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutluy, and D. A. Jimenezz, “Improving cache performance using read-write
partitioning,” in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2014, pp.
452–463, doi: 10.1109/HPCA.2014.6835954.
[25] S. Chandak et al., “Improved read/write cost tradeoff in DNA-based data storage using LDPC codes,” in 2019 57th Annual
Allerton Conference on Communication, Control, and Computing, Allerton 2019, Sep. 2019, pp. 147–156, doi:
10.1109/ALLERTON.2019.8919890.
BIOGRAPHIES OF AUTHORS