Academic Supervisors
Prof. Fabrizio RIENTE
Prof. Marco VACCA
Ing. Andrea COLUCCIO
Candidate
Gianluca GOTI
October 2022
Abstract
In recent years, many research efforts have been devoted to emerging technologies applied to
memories. Many different devices are on the market, each with its own peculiarities, but as a
general rule of thumb, as storage capacity increases, speed decreases. Furthermore, standard
technologies suffer from physical and technological limitations.
Novel technologies try to overcome such limitations, and Racetrack technology seems to be a good
candidate to satisfy the requirements of speed and storage capacity at the same time. A Racetrack
is essentially a ferromagnetic wire in which bits are retained by exploiting the magnetization
direction. Bits are accessed and modified only through dedicated ports, so bits must be aligned to
the access ports by means of shift operations along the ferromagnetic structure. Studies on this
technology have shown interesting performance in both access latency and storage capacity.
Another very important issue in modern electronics is the so-called memory wall. Today's
architectures are capable of astonishing performance, but the exchange of data with the memory slows
down the overall execution. It is in this context that the Logic-in-Memory paradigm takes
place. The idea is to create completely new architectures able to perform computations partially or
even totally in memory, by embedding logic elements within the memory. This limits the
exchange of data back and forth, increasing system performance.
This Thesis work focuses on the application of the Logic-in-Memory paradigm to the emerging
Racetrack technology. The new architecture was applied to an open computing system named
RI5CY, already provided with a LiM architecture.
The proposed architecture aims to offer an open and configurable Logic-in-Memory platform
in which multiple types of memory can be tested. In addition, this work proposes a working
RTL model of a Racetrack memory with LiM capabilities, with performance comparable to the
original LiM system. Furthermore, this Thesis gives some ideas on possible new internal
organizations for the Racetrack memory.
Simulations showed that the new Racetrack memory is a good candidate to replace
the standard technology. Combining Racetrack with the Logic-in-Memory paradigm could be
an interesting solution to overcome the memory-wall problem and the limitations of standard
memory technologies.
4 Simulations
4.1 Tools
4.2 Simulation of custom programs
4.2.1 Bitwise
4.2.2 Inverted-bitwise
4.3 Simulation with standard programs
4.3.1 Database search with Bitmap Indexes algorithm
4.3.2 AES Addroundkey algorithm
4.3.3 Binary Neural Network
4.4 Simulation Results Analysis
4.4.1 Original version
4.4.2 Parallel version
4.4.3 Core compliant version
4.5 Racetrack organization analysis
4.5.1 Bitwise & inverted_bitwise
4.5.2 Bitmap algorithm
4.5.3 AES_128 Addroundkey algorithm
4.5.4 Xnor net algorithm
Chapter 1
Motivations and background
Modern electronics is based on a common framework: data and instructions are stored in external
memories and CPUs retrieve information from them. Astonishing performance has been reached
over the years, but one of the main problems is the so-called Von Neumann bottleneck, and research
efforts are concentrated on finding valid solutions to it. As architectures' computing speed
increases, the continuous exchange of information back and forth with the memory implies a
huge waste of time and resources.
One of the most promising candidates to solve this problem is the concept of Logic-in-Memory
(LiM), where the memory is enhanced with internal (and/or external) logic, making it possible to
perform simple or even complex in-place computations. This solution could provide a performance
speed-up and a reduction of power consumption by reducing data transfers on memory links.
The first goal of this Thesis is to develop a fully customizable LiM framework for the Pulpino RISC-V;
this framework is then adapted to a novel memory based on an emerging technology called Racetrack.
The last part of this work focuses on the analysis of several algorithms to understand the benefits
and drawbacks of this architecture.
The principle of memory hierarchy is based on the way in which software is written; this
topic is widely treated in [16].
Designers tend to solve problems step by step, and this reflects directly on memory accesses, which
tend to be non-random and predictable. Starting from this fact it is possible to define the concept
of Locality of Reference, which is divided into:
• Temporal Locality: during the execution of a program, it is very likely to access the same
resource more than once;
• Spatial Locality: during the execution of a program, it is very likely that the next accesses
will be "physically" in proximity to the last one;
These two properties have shaped the design of modern memory systems because they suggest that
it is not necessary to have the whole memory content at hand to run a program and, furthermore,
"in proximity" accesses are very common during the execution of tasks.
Thus, modern computing systems embed three different memory types: a small and fast memory
(i.e. the Cache), an intermediate one, large enough to run several tasks but not as fast as the first one
(i.e. the Main memory), and a final memory stage capable of storing the whole data set (i.e. the HDD).
1.2 Memory Systems
In the following, an overview of the main memory systems is given; the main features and
properties are highlighted, as well as their controllers, which are essential to ensure correct
operation. This review follows an ideal path starting from the memory stage closest to the
CPU and moving farther and farther away, up to the non-volatile memory stage.
It is possible to find a common structure among all the different types of memory. For
technological and fabrication reasons, memory arrays are in general square or rectangular, which
improves the array regularity. Assuming n address bits, a fully generic memory array is composed
of 2^(n−k) rows, each made of one or more words and containing 2^k columns. In general the word
is addressed by means of Row and Column Decoders; the Column Circuitry is used to read stored data
and for other functions, depending on the specific memory. Bit-line conditioning may or may not be
present, depending on the mechanism on which the memory relies. Figure 1.2 depicts a
generic memory array.
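As a purely illustrative example of this decomposition (the numbers are not taken from any specific device in this work): with n = 16 address bits and k = 4 column-address bits, the array has 2^(16−4) = 4096 rows and 2^4 = 16 columns, and the full address splits into a 12-bit row address for the Row Decoder and a 4-bit column address for the Column Decoder.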
Starting from the computing unit, the first memory system encountered is the Cache; "cache" is
a generic name indicating a memory that masks the access latency of another one. As a
matter of fact, the main memory acts as a cache for the backing store and the "true" cache acts
as a cache for the main memory, but in modern computer architectures the cache is the memory
placed between the main memory and the CPU. The main task of this component is to be at least
as fast as the processor, so as not to slow down the execution flow; for this reason caches are
commonly designed using SRAM technology.
1.2.1 SRAM
SRAM stands for Static Random Access Memory. It is an asynchronous memory adopted whenever
speed is the main requirement; in fact it is used for cache systems, register files and CPUs' internal
registers. The words Static and Random highlight two main features of this technology: the memory
is able to retain data without the need for a refresh system, and it is possible to address every single
cell within the array.
Over the years, many different SRAM implementations have been proposed, but the most adopted
and reliable one is the 6T (6-Transistor) cell. Information is retained in two cross-coupled inverters
that can be accessed through two pass-transistors activated by a word-line. Two complementary bit-lines
are exploited for read and write operations. Figure 1.3 depicts the schematic of a 6T cell; the read
and write circuitry is also shown.
The write operation consists of two main steps:
• Force the complementary bit-lines to the value to be written;
• Activate the word-lines;
Once the word-lines are activated, if the applied voltage is correct, the internal bit of the cell is forced
to the value imposed by the user.
The complete block diagram of an SRAM memory is composed of one or more arrays, in addition
to the circuitry presented in Section 1.2. Figure 1.4 shows a complete SRAM architecture: besides
the already mentioned building blocks such as sense amplifiers, column and row decoders, and bit-line
conditioning circuitry, the scheme shows a block organization of the memory array, activated
by a specific decoder. Additional circuitry is required for the correct behaviour of the memory, such as
the start-up circuitry, which generates a start signal, and the timing circuitry, which
coordinates all the signals within the memory.
Figure 1.5 and Table 1.1 show respectively a generic SRAM interface and its truth table; it is clearly
visible how the memory array acts like a combinational circuit, where setting specific signal combinations
leads to different results.
CE WE OE Access Type
0 0 0 Write
0 1 0 Read
1 X X Disable
Figure 1.6 depicts the timing diagrams of the read and write operations.
There is no specific order for applying the signals; a reliable solution is to adopt CE as a
control signal to define a safe window in which the other signals can be asserted.
The read operation starts by enabling CE; then, once the address is stable, asserting OE presents
the requested data on the output, while WE is a don't care. The write operation is similar: once
the input data and the address are stable, CE is activated, WE is active and OE is a don't care; after
this sequence of signals is set, the input data is eventually sampled. Since in this case CE is
used to create a "window", the input data is sampled on the low-to-high transition of this signal.
As said previously, this memory acts like a combinational circuit, so by definition it is asyn-
chronous. As the execution speed increases, it becomes more and more difficult to properly generate
all the internal signals with the correct delays, and timing tolerances lead to errors; a solution to this
problem is to rely on a synchronous architecture.
SSRAM, Synchronous Static Random Access Memory, employs standard SRAM technology improved
with some pipeline stages. Different types of SSRAMs have been presented over the years; one
example is the pipelined SSRAM, which implements an input and an output pipeline stage to
improve performance. Figure 1.7 depicts a possible block diagram. Of course the drawback of
this solution is the increase in latency and a higher cost in terms of power and area due to the
additional circuitry.
Memories are devices controlled by external entities: in general the computing unit does not
communicate directly with the memory, and an intermediate device acts as a mediator between
the two, the Memory Controller. These devices provide an easy interface to memories: the
CPU sends the address, the operation type (read or write) and possibly the data to store; the
controller then sets the specific signals for the memory core and retrieves data from
memory.
A possible SRAM controller is depicted in Figure 1.8. It has a very simple structure: it contains an
FSM to control all the different states, a decoder to decode input addresses and several registers
to store information and to provide a synchronous interface to the microprocessor.
1.2.2 Cache
Caches are small and fast memories placed between the computing unit and the main memory;
they try to mask the access penalty by storing often-accessed or likely-to-be-accessed data, relying
heavily on the well-known concepts of temporal locality and spatial locality.
Generally, a Cache is transparent, meaning that the processor operates normally and is not aware of
the Cache system. When data is present in the Cache there is a HIT and the execution speeds up;
otherwise, when data is not present, there is a MISS and a normal memory access is performed,
paying the corresponding access penalty.
The line is the basic Cache memory unit; its counterpart in the Main memory is called a block. In
this overview only transparently-addressed Caches are investigated; this means, as said before, that
the Cache is transparent to the system, so the same addresses as in the main memory are used.
From the technological point of view, Caches are generally based on SRAM technology; this choice
has several advantages, such as high transfer speed and compatibility with standard CPU fabrication
processes. The architecture is very similar to a standard SRAM: the memory core is identical to
the one shown in Section 1.2.1, while additional circuitry is required (i.e. Tag Comparison circuitry)
in addition to a different organization depending on the mapping policy. Possibly, a Control Unit is
needed to control the replacement and writing policies.
The first parameter that distinguishes the different Cache structures is the mapping, which represents
how data is mapped within the cache.
There are three possible solutions:
• Direct mapped: data is stored only in a unique location identified by means of the modulo
operation. Since the Cache is much smaller than the main memory, many different addresses
can be mapped to the same Cache line, which leads to contention. The address is divided into
three fields: Tag, Set index and Offset. The Set index is basically the modulo operation between
the block number and the overall number of lines; the result gives the line number where the
data is mapped. The Offset field (LSBs) indexes the words within the line, while the Tag field is
used to check whether a line is present or not in the Cache. The Tag field is basically the set of
MSBs of the address that uniquely identifies the data mapped within the cache (a small
address-decomposition sketch is shown after this list).
This is a very simple and effective implementation; the main limitation is line contention,
since multiple blocks can be mapped to the same cache line. This problem is called thrashing
and produces cache misses;
• Fully Associative: data can be stored in any line of the cache; this overcomes the thrashing
problem of the previous implementation.
The tag circuitry is replicated for every cache line, and the behaviour is similar to a CAM plus a
RAM, with a fully parallel tag check. The main limitations of this implementation are the
high dynamic power and the very high cost due to the redundancy of the tag circuitry; for
this reason, to keep these costs low, the storage capacity is limited;
• Set-Associative: this is a hybrid solution that combines characteristics of the previous
two implementations. The Cache is organized in sets or ways; synonyms (cache lines with the
same set index) can be mapped to any of the available ways. The Tag field allows identifying the
requested data among the different ways. The tag circuitry is replicated for each way; this
solution is less expensive than the fully associative one, and it reduces the cache misses due
to contention on the same cache line;
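To make the direct-mapped address split concrete, the following C sketch decomposes a 32-bit address into Tag, Set index and Offset. The cache geometry (64-byte lines, 256 lines) is purely hypothetical and only serves to illustrate the bit-field arithmetic.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical direct-mapped geometry: 64-byte lines, 256 lines.   */
/* 6 offset bits, 8 index bits, the remaining 18 bits form the tag. */
#define OFFSET_BITS 6
#define INDEX_BITS  8

static void decompose(uint32_t addr)
{
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr=0x%08x  tag=0x%05x  index=%3u  offset=%2u\n", addr, tag, index, offset);
}

int main(void)
{
    decompose(0x0001A2C4u); /* these two addresses share the same index...             */
    decompose(0x0005A2C4u); /* ...but have different tags, so they contend (thrashing) */
    return 0;
}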
In the following, Figures 1.10, 1.11 and 1.12 show respectively the architecture of a Direct-Mapped
Cache, a Set-Associative Cache and a Fully-Associative Cache.
Figure 1.10. Direct-mapped cache [16]
Figure 1.11. Set-associative cache [16]
When the cache is full, a replacement policy decides which line to evict; common policies are:
• LRU (Least Recently Used): replaces the least recently used line;
• FIFO (First In First Out): replaces the line that entered the cache first;
• LFU (Least Frequently Used): replaces the least frequently used line;
Write operations are handled according to one of two policies:
• Write-through: data is written both in the Cache and in the main memory;
• Write-back: data is written or updated only in the Cache and is copied to main memory only at
specific moments, i.e. during replacement;
The literature does not provide a clear view of how Cache controllers are designed, but the book
Computer Organization and Design, RISC-V Edition [22] offers a possible design solution. There, a
behavioural description of the controller's states is proposed; Figure 1.13 shows a possible
implementation of the state diagram. All the main features are covered: tag comparison,
replacement and write policies are handled by a simple FSM that controls all the tasks of the cache.
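As a rough illustration of such a controller, the sketch below models the four states commonly used in the textbook solution cited above (Idle, Compare Tag, Write-Back, Allocate); the state names follow [22], while the transition code is a simplified assumption that ignores details such as burst transfers.

#include <stdbool.h>

/* Simplified four-state cache controller, in the spirit of [22]. */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cache_state_t;

static cache_state_t next_state(cache_state_t s, bool cpu_req, bool hit,
                                bool dirty, bool mem_ready)
{
    switch (s) {
    case IDLE:        return cpu_req ? COMPARE_TAG : IDLE;
    case COMPARE_TAG: /* hit: serve the CPU; miss: evict (if dirty) or refill */
                      if (hit) return IDLE;
                      return dirty ? WRITE_BACK : ALLOCATE;
    case WRITE_BACK:  return mem_ready ? ALLOCATE : WRITE_BACK;  /* old block to memory   */
    case ALLOCATE:    return mem_ready ? COMPARE_TAG : ALLOCATE; /* new block from memory */
    }
    return IDLE;
}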
1.2.3 DRAM
DRAM is what is usually called the main memory of a computing system. The retention
mechanism is completely different from the cross-coupled inverters of the SRAM: data
is stored in a parasitic capacitance of a MOS transistor, which also acts as the pass-transistor of the
memory cell; Figure 1.14 shows the basic 1-bit cell. This mechanism gives a very high storage
density at a very low cost; as a drawback, the charge stored in the capacitor slowly leaks through
the pass transistor. This slow discharge corrupts the retained data, so the DRAM needs
to be refreshed periodically, and this type of memory is therefore referred to as dynamic. In this
technology, read and write operations are typically much slower than in SRAMs.
For historical reasons, the address bus of these structures is multiplexed: the row address and the
column address are set in two different phases.
The very first version of this memory was asynchronous; progress then pushed the architecture
towards the synchronous world, and today only synchronous DDR DRAMs are used.
The internal architecture strongly recalls the SRAM structure: basically it embeds the same main
components plus additional circuitry to handle the refresh phase. Recalling Figure 1.14, data is
stored in the parasitic capacitor C_S, and access is granted by the pass transistor M1, which connects
the inner memory cell to the global bit-line. The latter connects all the memory cells along its
path, thus it has a huge parasitic capacitance named C_BL. The presence of this bit-line capacitance
makes the read operation on the memory cell more difficult.
In the following, the three main cell operations are described:
• Write operation: the bit-line is set to '1' or '0', then the word-line is activated and the voltage
V_BL across C_BL is transferred into C_S; the other cells along the bit-line are simply refreshed;
• Read operation: the bit-line is pre-charged at Vdd/2, then the word-line is activated and C_S and
C_BL start sharing charge. This has two effects: on one side, the voltage across C_BL changes,
making it possible to read the stored data; on the other side, the voltage across C_S is completely
corrupted, since the storage capacitance is much smaller than the bit-line one.
The voltage perturbation ΔV across C_BL is sensed by a Sense Amplifier that speeds up the
read operation. The memory cell is then refreshed with a positive-feedback mechanism;
• Refresh operation: this task is periodically carried out with a simple dummy read operation;
At the system level, the access to a DRAM cell requires very precise steps that the designer should
follow to ensure correct accesses; they involve two dedicated commands, Activate and Pre-charge.
Figure 1.15 depicts the different phases involved in a read/write operation.
The cell is initially in the quiescent state (1): the pass-transistor is open and the bit-lines
are at Vdd/2. Then, the Activate command closes the pass-transistor and connects the cell capacitor
to the bit-line capacitor, leading to charge sharing (2). Sense amplifiers at the end of the
bit-line sense the voltage perturbation across the bit-line capacitor (3); at this point a Read or
Write command can be issued. Eventually, in the case of a read operation, the original value is
restored (4). Finally, the Pre-charge command (5) closes the access cycle by restoring the original
quiescent condition, thus pre-charging the bit-lines back to Vdd/2. These are the main steps to follow
during an access cycle; every read or write operation should follow this scheme.
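The command sequence above can be summarized in a short behavioural sketch. The ordering (Activate, Read/Write, Pre-charge) follows the description above, while the function names and the absence of any timing constraints (tRCD, tRP, etc.) are simplifications made here for illustration only.

/* Illustrative DRAM access sequence; issue_cmd() is a placeholder for the
 * controller logic that drives RAS/CAS/WE on the actual memory interface. */
typedef enum { CMD_ACTIVATE, CMD_READ, CMD_WRITE, CMD_PRECHARGE } dram_cmd_t;

void issue_cmd(dram_cmd_t cmd, unsigned row, unsigned col); /* provided elsewhere */

void dram_read_word(unsigned row, unsigned col)
{
    issue_cmd(CMD_ACTIVATE, row, 0);   /* (2) open the row: charge sharing on the bit-lines  */
    issue_cmd(CMD_READ, row, col);     /* (3) sense amplifiers resolve the perturbation      */
                                       /* (4) the row buffer restores the cell content       */
    issue_cmd(CMD_PRECHARGE, row, 0);  /* (5) bit-lines back to Vdd/2, ready for a new cycle */
}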
DRAMs have a hierarchical organization: they are divided into several structures. Unfortunately,
the literature does not provide a common nomenclature; the one adopted here is reported in the following
(Figure 1.16):
• Row: group of memory cells connected by the same row, also called Page;
• Bank: collection of rows forming an independently accessible memory array;
• Rank: collection of one or more banks; it can cooperate with other ranks;
• Channel: collection of ranks sharing the same physical link; it can operate independently;
DRAMs differ from SRAMs in that they are not just a piece of combinational logic: since the address
bus is multiplexed, column and row addresses need to be saved in internal latches, and this deeply
affects the memory access interface. The first versions of this memory were still asynchronous; in
the following, a set of possible interface signals is reported (Figure 1.17):
• WR: write enable;
The internal architecture, Figure 1.18, is very similar to the SRAM one. The core of the
memory array is composed of building blocks common to other memories as well, such as row
and column decoders, write and sense amplifiers, bit-line conditioning circuits and many others.
Since DRAM is volatile, it needs specific circuitry that periodically refreshes the
entire array; this is carried out automatically by an internal FSM designed specifically for this
task. As said before, addresses are multiplexed: this design choice was made for historical reasons,
to save pins, because pins had a very large impact on the final cost of the memory. Addresses are
given in two different phases and need to be stored somewhere; generally internal latches are
adopted for this task.
The timing diagrams of read and write operations are very similar: the two operations share the same
access phase and differ only in their final steps. Figure 1.19 shows a typical read and write
access; in both cases the access starts with an activation phase that consists of asserting RAS
once the Row Address is stable.
The first memory cycle is a read operation: after the RAS assertion, WE must not be asserted,
to signal that it is a read operation. Then, once the column address is stable, CAS is
activated to sample it. RAS and CAS should be kept stable for the whole cycle; once OE is
asserted, after a while the required data is available on the output port.
The write operation is slightly different: once the row address is sampled with RAS, WE should be
set before CAS is asserted. Then, data is written into the memory after the CAS assertion.
In both cases a memory cycle terminates once RAS and CAS return to their original positions;
before a new access cycle, a precharge cycle should be carried out.
The evolution of DRAMs follows the one described for SRAMs: memory cycles require very
precise timing constraints and control signals should be generated with proper delays. This can
be achieved easily at low operating frequencies, but as the clock frequency increases it becomes far more
difficult to provide the signals with the correct timing, thus a synchronous system is required.
The internal organization becomes more complex: an FSM controls all the internal signals and the
user can program it by means of the mode register, which should be programmed at power-on.
Internal latches are no longer controlled directly by RAS and CAS; these signals do not act as strobes
anymore but are sampled on the rising edge of the clock, Figure 1.21. A new Chip Select (CS) signal
controls the communication with the memory chip; Figure 1.20 shows the external interface and
the internal architecture of a DRAM array.
In SDRAMs, RAS, CAS and WE compose a command word that controls the
operation of the memory through the FSM. Modern DRAMs implement a Double Data Rate
(DDR) feature: data is sent on both the rising and falling edge of the clock. In this condition the
memory also provides a strobe named DQS, used by the receiver to sample data with the correct
timing, Figure 1.22.
This complex structure requires a memory controller able to manage all the different operations.
First of all, it has to carry out the initial configuration of the memory by programming
the mode register. Then, it must issue the specific commands to perform read and write
operations, i.e. assert the Activate and Pre-charge commands. It should also provide all the features
of a standard memory controller, such as address decoding, data and memory request buffering and
many others. Since all these operations are very complex, the memory controller can be implemented
with an FSM that controls all the different memory states and phases. In [3] a general DRAM
controller is proposed; Figure 1.23 shows its schematic. The proposed structure is provided with a memory
mapping unit used to translate addresses into block, row and column numbers; an arbiter
is also present and schedules memory transactions across all the available banks. This controller
also implements a set of input and output buffers to speed up the execution.
In [28], Xilinx provides a DDR SDRAM controller, Figure 1.24. The main building blocks
are similar to the previous example: a controller manages all the different phases of the memory,
latches sample the input addresses, a DLL provides a stable clock
reference to the memory, and some counters (i.e. burst counter, latency counter, etc.) are
implemented.
1.2.4 MRAM
Magnetoresistive Random Access Memories (MRAMs) are the first non-volatile memories of this
overview. Their peculiarity is that they are based on emerging technologies such as magnetic
materials and Magnetic Tunneling Junctions (MTJs). They store data as a stable state of magnetic
devices; the information is then read by measuring the resistance to estimate the magnetic state. MRAMs
behave like resistive memories during read operations, while they differ in the write operation,
depending on the mechanism adopted by the specific MRAM type.
The main features of these emerging technologies are:
• scalability;
• integration density;
All these features make MRAM technology a good candidate to replace standard memories; studies
are still ongoing.
The fundamental component of a 1-bit MRAM cell is the MTJ. In its most straightforward
configuration, Figure 1.25, it is composed of three layers: a Free Layer (FL), whose magnetization can
be switched, a thin insulating tunnel barrier, and a fixed Reference Layer.
Data is stored as a stable magnetic state obtained through the relative orientation of the
magnetization of the two ferromagnetic layers. This condition also determines the resistive
behaviour of the device: this is the so-called Tunneling MagnetoResistance (TMR) effect. For most
materials, the resistance is low when the layers' magnetizations are parallel; conversely, the resistance
is high when the magnetizations are anti-parallel. These two states are exploited to encode logic '0s' and
'1s'.
Read operations exploit a specific circuitry which compares the resistance of the cell with a reference
value provided by the memory array, to determine the cell's state. The Tunneling MagnetoResistance
ratio is an important parameter of MTJs; it is defined as follows:
TMR = (R_AP − R_P) / R_P
It basically expresses the relative resistance change, where R_P and R_AP are respectively the resistance in
the parallel state and the resistance in the anti-parallel state.
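As a purely numerical illustration (the resistance values are hypothetical, not taken from any device in this work): with R_P = 2 kΩ and R_AP = 5 kΩ, TMR = (5 − 2)/2 = 1.5, i.e. a 150% relative resistance change, which gives the sense amplifier a comfortable margin to distinguish the two states.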
Different types of MRAMs rely on the same reading mechanism; what distinguishes them is the
writing mechanism. Two of the most promising MRAM types are:
• STT-MRAM
• SOT-MRAM
STT-MRAMs take their name from the Spin Transfer Torque effect: when a current is passed
through the MTJ, it exerts a torque on the FL's magnetization. If the current is large enough,
this results in the switching of the magnetization state of the FL, and consequently of the resistive
value. Furthermore, the current polarity determines whether the parallel or the anti-parallel
magnetization state is written. The basic 1-bit STT-MRAM cell is a two-terminal structure, Figure 1.26,
in which a word-line-activated transistor enables or blocks the current flow. A read operation implies
the activation of the MOS and the application of a read voltage across the two-terminal device; the read
current is then sensed and compared with a reference value by a Sense Amplifier, which determines the
resistive value of the cell.
Multiple cells can be arranged together to create arrays, as in standard CMOS memories.
The array's internal architecture is similar to a DRAM one; [8] proposes a simple STT-MRAM
internal schematic, reported in Figure 1.27. Differently from DRAMs, in this case data does
not need to be written back after read operations, so the Sense Amplifiers can be shared among multiple
bit-lines, resulting in lower energy consumption and area occupation. A row buffer stores the output
data, disconnecting the SAs; this limits the power wasted after data sensing. The architecture then
contains the building blocks common to every memory, such as SAs, column and row decoders,
registers/latches, timing circuitry and others. As mentioned, read operations are carried out by comparing
the current flowing through the MTJ with a reference value; this requires a current generator to provide a
stable reference source for the SAs.
In 2018, Everspin Technologies proposed a 256MB DDR3 STT-MRAM [1]; as stated previously,
the overall architecture closely resembles a DRAM, as shown in Figure 1.28.
Everspin's datasheet also provides several timing diagrams; Figure 1.29 depicts the read cycle
timing diagram, which is very similar to the DRAM one. Before any read or write operation, a
row should be activated through the Activate command (not shown in the diagram), then the
read/write command can be issued; notice that addresses are issued together with the read/write
command.
STT technology implies that read and write currents flow along the same path, which leads to
some drawbacks. Fast switching requires a large current through the MTJ insulating layer,
which accelerates the aging of the barrier and lowers reliability. This is why, over the years,
researchers have moved to other MRAM types such as SOT-based MRAMs.
SOT-MRAMs are based on the so-called Spin-Orbit Torque effect, from which they take their name. The
1-bit cell architecture, Figure 1.30, is modified to decouple the read and write current paths; this solution
fixes one of the most limiting issues of STT-MRAMs.
The MTJ's Free Layer is placed on a channel layer composed of a heavy metal. An in-plane
current flowing in this layer induces a spin torque on the Free Layer, which is able to switch
the magnetization state of the cell. Thus, read and write current paths are separated; this reduces
the aging of the isolation layer, increasing the reliability of the cell. Furthermore, SOT-MRAMs require
a lower writing current than STT ones.
The drawback of this technology is the larger area overhead due to the additional transistor
involved in the write operation. In any case, SOT technology is a good candidate to overcome
STT-based MRAMs.
Regarding MRAM controllers, in 2015 Northwest Logic and Everspin Technologies Inc. an-
nounced an "MRAM Controller IP compatible with Everspin's STT-MRAM" [21]; unfortunately,
no further information is provided by the two companies. In any case, the main modules required
by the controller are similar to the ones involved in the DRAM controller (i.e. FSM, decoders,
memory mapping modules, buffers and additional logic), so it is safe to suppose that the controller's
architecture is not so different from the previous ones.
1.2.5 FLASH
FLASH memories are non-volatile devices; they adopt a more classical mechanism with respect to
MRAM technology and are used in modern SSDs, USB drives and many other devices.
Unlike standard HDDs, which use a mechanical head to read and write magnetic disks, these
memories adopt a fully electronic mechanism for read and write operations. This type of memory
relies on the Floating Gate MOS transistor, Figure 1.31, a particular device provided with two
different gates.
The first one, named Control gate, acts as in a standard MOS transistor; the second one, called
Floating gate, is completely surrounded by the gate dielectric. Thanks to this particular
configuration, the MOS can be programmed: it is possible to inject electrons into the floating
gate, which changes the electrical characteristics and thus the behaviour of the transistor. In the
standard situation the floating gate is empty and the device acts as a normal MOS, with the
standard current-voltage characteristic; this state is called Not-programmed. The programmed state
is basically the opposite: by applying specific voltages it is possible to trap negative charges in the
floating gate. This modifies the MOS characteristic, raising the threshold voltage, as depicted in
Figure 1.32. By programming the MOS it is therefore possible to set a higher threshold voltage
required to switch on the transistor; in this way logic '1s' and '0s' can be encoded.
In FLASH memories the page is the minimum readable unit, while the block is the minimum
erasable unit. Over the years two different types of FLASH memories have been proposed, based on
NAND and NOR technology respectively; the differences between the two architectures are
shown in Figure 1.33. The NOR architecture is based on a NOR-type logic where each cell can drive
the bit-line, while in the NAND case the bit-line can be driven only through a chain of transistors.
These two structures have their own advantages and drawbacks:
• NOR
– Random accesses
– Slow write and erase operations
– Suitable to store instructions
– No multiplexed bus
• NAND
– Requires ECC
For technological and economic reasons, NAND Flash memories have become the standard
technology for non-volatile memory devices. Modern memories can achieve a very high storage
density, and this comes directly from the NAND architecture: bits are stored in long transistor chains,
which results in a smaller area occupation since some metallizations are shared; thus, for
the same number of cells, the area occupation is minimized. Unfortunately, the drawback is a more
complex read operation, since data is degraded along the transistor chain, so ECC is required.
The internal architecture of these memories is similar to the previous ones, but at the same time it
requires additional complex building blocks. There are three main operations: write, read and
erase. The read operation is similar to that of the previous memories: once bit-lines and word-lines are
activated, data is read by the Sense Amplifiers (SAs). Write and erase operations require more complex
tasks, because the floating gate needs to be filled with or emptied of negative charges. Very specific
voltages are required for such operations, so the internal architecture is provided with dedicated FSMs to
handle all the different steps and with several modules to generate and apply such voltages.
The first versions of these memories were asynchronous; Figure 1.34 depicts the interfaces of NOR
and NAND Flash memories, whose interface signals are listed in the following:
• NAND:
– CLE: Command Latch Enable
– ALE: Address Latch Enable
– CE: Chip enable
– WE: Write enable
– WP: Write protect
– RE: Read enable
– RD/BY: Ready/Busy flag
– I/O: Input/Output bus
• NOR:
– ADD: Address bus
– WE: Write enable
– WP: Write protect
Modern Flash memories are based on NAND technology thanks to its higher density capability, while
NOR Flash has very few applications; for this reason, the following focuses on the NAND type.
NAND memories are divided into blocks, each made of several pages. They can be programmed page
by page and erased block by block; random reads are slower than in NOR Flashes. As shown
in the previous schematic, this configuration leads to a higher integration density, but additional pass
transistors are required (i.e. the top and bottom ones). As for the program operation, the page is the
minimum programmable entity, while the string (a column of transistors) is the basic entity of the
memory array.
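The block/page granularity can be made concrete with a tiny addressing sketch. The geometry below (128 pages per block, 4 KiB pages) is hypothetical and only illustrates why updating a single page in place requires erasing a whole block.

#include <stdint.h>

/* Hypothetical NAND geometry: 4 KiB pages, 128 pages per block. */
#define PAGE_SIZE       4096u
#define PAGES_PER_BLOCK 128u

/* Map a linear page number to its (block, page-in-block) coordinates. */
static void nand_locate(uint32_t page_no, uint32_t *block, uint32_t *page_in_block)
{
    *block         = page_no / PAGES_PER_BLOCK;   /* erase granularity   */
    *page_in_block = page_no % PAGES_PER_BLOCK;   /* program granularity */
}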
As in NAND-based logic, read operations are carried out through chains of transistors; this
degrades the read data, and indeed ECC is required to check data integrity. Furthermore, for historical
reasons this memory is provided with a multiplexed I/O bus on which addresses, commands and
data are transferred.
The internal architecture can change depending on size, manufacturer and design choices; one possible
internal schematic is the one depicted in Figure 1.35. Addresses and commands (read, write
or erase) are sampled by internal latches, controlled by the CLE and ALE signals. Since
the internal mechanisms are very complex, an FSM is required to control all the tasks. The memory
is provided with a command register, as in DRAMs, with which it is possible to program its
behaviour. Due to the small I/O bandwidth, read pages are placed in a page buffer and then data is
shifted out as in a shift register, which increases the throughput of the memory. Vice versa, during a
write operation, data is written into the page buffer and the internal circuitry then takes care of
writing the correct page. The erase operation is applied to a whole block.
All these operations are represented in Figure 1.36; notice that part of each page is reserved for
ECC bits. In NAND Flash memories, different operations are characterized by different timings:
program and erase are in the range of hundreds of µs, a page read is in the range of tens of µs, while
the communication with the output buffer is in the range of tens of ns.
Figures 1.37 and 1.38 depict the read and write timings. In the read operation, the address
acquisition begins by sampling the initial command with the CLE assertion.
Then, since the I/O bus is multiplexed, multiple address cycles need to be issued to send the
complete address; the ALE signal is used to sample them, and WE is used as a strobe to sample the
input data. Finally, CLE is activated again so that the read command is sampled, and after a while the
data is provided on the output.
The write operation is similar to the read one: the initial transaction starts with the assertion of
CLE, then the addresses are sent to the memory in multiple cycles. The input data is then sent through
the I/O bus, again using WE as a strobe to sample it. At the end, the write command is
sampled with the CLE assertion.
The erase operation is very similar to a write, but in this case input data is not required. At the
end of both operations it is necessary to read the Status Register, which contains useful information
about the outcome of the operation.
As for the previous memories, an asynchronous interface represents a bottleneck for data exchanges,
so also in this case there has been an evolution towards the synchronous world. Over the years, Flash
memory manufacturers defined a standard named ONFI, nowadays the common standard for
modern Flash systems, which implies some modifications to the memory structure.
Controllers for these memories share a common structure with the ones analyzed before: they provide
a synchronous interface for command and data exchange, they embed decoders and they implement
multiple FSM routines for handling the different memory operations. As a case study, a
NAND Flash controller by Lattice Semiconductor Corporation [20] is reported in the following.
Figure 1.40 depicts the proposed architecture; it is possible to identify all the main modules, such
as the ECC logic, buffers, control FSM, timing FSM and others.
The initial goal of this Thesis work was to develop a generic memory controller able to adapt
its functionality to a set of different memory types and technologies. Thus, it is important to
highlight the building blocks common to the memories and to their controllers, in order to find an
initial basic structure from which the design can start.
In the following, Table 1.2 summarizes the internal building blocks of each memory.
It is possible to notice that a large set of blocks is common to all the memories.
In the following, the building blocks inside the different memory controllers are summarized.
As clearly visible from Table 1.3, registers, FSMs, decoders, memory mapping units and buffers are
common to all the controllers.
In the perspective of designing a generic memory controller able to adapt itself to a large variety
of memories, this study helped to understand the different needs of each of them.
1.3 L2 Cache
Having highlighted a set of possible building blocks that represent the starting point of the memory
controller design, the next task was to decide which memory level is the best candidate to be
enhanced with the Logic-in-Memory paradigm. It was decided to focus on the L2 cache memory
system, because the possibility of performing logic operations at this level could bring a huge
improvement in terms of computational speed.
As said at the beginning of this overview, modern architectures exploit the concept of memory
hierarchy; this concept can be pushed further, leading to multi-level caches.
Figure 1.41 represents a typical computing system organized with a multi-level cache system. This
is a direct application of the memory hierarchy concept to cache systems: the goal is to reduce the
miss penalty while masking the generated overhead.
L1 caches are very fast and are placed very close to the microprocessor; this ensures small
latencies, with the big drawback of a small size. L2 caches are instead placed a bit farther away and
act as real cache systems for the L1 ones. Their behaviour is quite simple: when a miss occurs in the
L1 cache, the required data is searched in the L2 cache and, if present, it is provided to the system;
if a miss occurs also at this level, the data is retrieved from main memory and loaded into the L2 cache.
As a consequence, L2 caches are at least one order of magnitude larger, in order to contain also data
not present in the L1 caches; furthermore, their data transfer speed is slower for three main
reasons [17]:
• Longer critical path: a larger memory array and a more complex circuitry have a direct
impact on the critical path;
• Off-chip accesses: unlike on-chip accesses, these are slower due to physical limitations;
• Bandwidth: the number of I/O pins is limited due to size, cost and design choices;
To better describe the improvements brought by a multi-level cache system, it is possible to
introduce the concepts of Global Miss Rate and Local Miss Rate [9]:
• Global Miss Rate (GMR): cache misses divided by CPU accesses;
• Local Miss Rate (LMR): cache misses divided by cache accesses;
In the following, the GMRs and LMRs for the L1 and L2 caches are reported:
• L1 only:
– GMR_L1 = LMR_L1 = MR_L1
• L1 + L2:
– GMR_L1 = MR_L1
– GMR_L2 = MR_L1 · MR_L2
– LMR_L2 = MR_L2
Then, it is possible to define the Average Memory Access Time (AMAT), which represents the
average time to access memory, taking into account the improvements brought by the memory
hierarchy. This parameter is defined as follows:
AMAT = HT + MR · MP    (1.1)
Considering a system provided with only an L1 cache, the AMAT can be written as:
AMAT = HT_L1 + MR_L1 · MP_L1
Assuming a system provided with both L1 and L2 caches, the AMAT becomes:
AMAT = HT_L1 + MR_L1 · (HT_L2 + MR_L2 · MP_L2)
Legend:
• MR=Miss Rate
• HT=Hit time
• MP=Miss Penalty
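As a numerical illustration with made-up values (not taken from this work): assume HT_L1 = 1 cycle, MR_L1 = 10%, HT_L2 = 10 cycles, MR_L2 = 20% and MP_L2 = 100 cycles. Without the L2 cache, AMAT = 1 + 0.1 · 100 = 11 cycles; with the L2 cache, AMAT = 1 + 0.1 · (10 + 0.2 · 100) = 4 cycles, which shows how the L2 level reduces the effective L1 miss penalty.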
As clearly visible from the previous equations, the L1 miss penalty can be re-defined in terms of the L2
properties (HT, MR and MP); this leads to a reduction of MP_L1 and thus to a speed-up of the
system. L2 Caches can be characterized by several properties; in the following, a brief
overview of the main characteristics is given [31],[17]:
• Inclusion vs Exclusion:
– Multi-level inclusion: L1 data is always present in L2; an eviction in L2 also affects L1, but
an eviction in L1 does not affect L2. When an L1 miss occurs, if the data is present in L2, it
is fetched into L1. In general the degree of associativity of L2 is larger than or at least
equal to the L1 one, and the same holds for the number of sets;
– Multi-level exclusion: L1 data is never present in L2; if a required datum is present in L2,
it is moved to L1, so L2 is populated only with data evicted from L1. On an L1 and L2
miss, the new data is stored only in L1. For all these reasons, L2 behaves as a victim cache
for the system;
• Split vs Unified: this property refers to the data types contained in the cache. A cache is said
to be unified if it contains both data and instructions; on the contrary, it is called split if they are
stored in two different caches. A split cache can give a higher bandwidth, but unified caches
give more flexibility. In [17] it is suggested to adopt an L1 cache combined with one or more L2
caches for good performance.
• Write policy: as in a standard cache, different writing policies can affect the system
behaviour. In [17] it is suggested to adopt a write-back policy in addition to a write-allocate
one.
• Associativity: as explained in the previous section, there are several policies that can affect
system performance.
As briefly described, handling an L2 cache system is very complex and, unfortunately, the literature does
not provide enough information to design a reliable model within the scope of this Thesis. For
this reason, it was decided to abandon the L2 cache design.
Since this Thesis relies on a previous work based on a RISC-V framework, the new objective
is to find a RISC-V framework already provided with an L1 cache system; once found, the
L1 cache will be enhanced with the LiM paradigm.
The starting point of this Thesis is [5], where the RI5CY "Pulpino" Core developed by
the PULP Platform (now maintained by the OpenHW Group) was adopted; unfortunately, it is
not provided with a Cache system. The objective is therefore to find a similar Core provided with a
Cache system; based on the properties of RI5CY, two different Cores have been selected, the
ORCA and Taiga processors.
• ORCA: this is an FPGA-optimized RISC-V core implementing the RV32I ISA with optional
AXI3/4 data and instruction caches.
Other possible solutions have been found in the literature; various examples are reported in the
following. The OpenHW Group also supports the CVA6 core [13], a 6-stage pipelined CPU that
implements a 64-bit RISC-V instruction set and has separate Data and Instruction L1 caches.
In [33], a novel interleaved LiM architecture named MISK is proposed. This work is based
on the OpenRISC CPU [14], an open-source CPU implementing a 32-bit RISC architecture with 5
pipeline stages and Data and Instruction L1 caches. Many different RISC-V frameworks
implementing L1 Cache systems are available; unfortunately, these solutions are very different
from the adopted RI5CY core and they would have required an intensive study of their internal
architecture before adoption.
For this reason, the search for a RISC-V framework implementing an L1 cache system was
abandoned, and it was decided to adopt the same core and memory system used in [5].
1.5 Configurable memory controller
In this Thesis the word configurable is used multiple times with different meanings. In this Section
it indicates the capability of the memory controller to instantiate different types
of memory (in terms of type, technology and LiM functionalities). In the next Chapter the
word configurable is linked to the fact that the LiM functionalities supported by the Core can be
expanded and supported with multiple hardware implementations.
In [5] the two main limitations were the insertion of new LiM instructions by hand, directly in the
.hex file, and the limited number of bits available for programming the memory. The next Chapter will
show how the RISC-V GNU/GCC compiler and the system have been modified to support new
LiM features, which can be implemented in multiple ways, even different from the current
implementation. Thus, configurable stands both for the capability to select the wanted memory
configuration and for the possibility to add new LiM features to the system.
In Chapter 2 a brief description of the current LiM memory array model is given; furthermore,
all the hardware and software modifications required to extend the LiM functionalities are
described. Chapter 3 presents the Racetrack memory array design and its integration with the
RISC-V Core. Chapter 4 focuses on the comparison of the different structures supported by the
configurable memory controller.
Chapter 2
LiM RISC-V configurable Framework
• Standard load and store: as in a classic RISC-V architecture, these instructions serve all the
memory requests issued by the core;
• Bit-wise operations: each cell of the memory array has a set of built-in logic gates which
enable in-place logic operations such as AND, OR and XOR in just one clock cycle. In the case
of a logic store, the memory content is overwritten by the newly computed values, while for a logic
load the selected value is not corrupted but is served to the Core after the logic
operation has been computed;
• Range operations: external hardware logic supports range operations during logic store
instructions and during max/min search;
• Max/Min search: a special instruction triggers the max/min logic, which computes, in a
fully parallel way, the max or min value among a set of values specified by the user in
just 33 clock cycles;
Figure 2.2 depicts the 1-bit LiM cell architecture: the output of the memory cell is used to
compute the three logic operations and the wired-OR output necessary for the max/min computation.
The input value is chosen with an external signal, which selects whether the bit written into the
memory cell comes from outside or from one of the feedback logic outputs. All the
logic operations are performed against a mask provided by the user, so this structure resembles a
vector processor.
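The behaviour of a masked logic store over a range can be summarized with the following software-equivalent sketch; the function name and signature are illustrative only, and the loop emulates what the LiM array performs in parallel on all selected rows.

#include <stdint.h>
#include <stddef.h>

/* Software-equivalent model of an in-memory "logic AND store" over a range:
 * every selected word is combined in place with the user-provided mask.    */
static void lim_and_store_range(uint32_t *mem, size_t first, size_t size, uint32_t mask)
{
    for (size_t i = first; i < first + size; i++)
        mem[i] &= mask;          /* done in one memory access by the LiM array */
}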
2.2 LiM functionalities state of the art
• Hybrid-SIMD [11]: this architecture is composed of a stack of standard memory rows and
enhanced rows called Smart rows. The standard rows are fundamental: they hold the data used
by the smart rows during computations. Each smart row is composed of multiple cells, each
containing a storage element, two XOR gates and a full adder.
In this Single Instruction Multiple Data (SIMD) architecture, each smart-row bit-cell
supports different logic operations such as OR, AND, XOR, XNOR and even sum operations by
means of the full adder. Furthermore, a row interface offers support for complex operations
such as multiplications, tailored to the mapped algorithm;
• CLiMA [27]: this PhD Thesis proposes a configurable hybrid in-memory and near-memory
architecture. The structure is composed of LiM cells, optionally surrounded
by near-memory logic to enhance its capabilities and overcome the memory
bottleneck problem. This architecture can exploit the LiM approach but is also able to support
other degrees of in-memory computing if necessary. Simple logic operations, such
as configurable AND, OR, XOR or even inter-row addition, are computed directly in each
bit-cell, while more complex operations are carried out by means of additional external logic.
Furthermore, thanks to the built-in full adder, by fixing one or more inputs it is possible to
obtain more complex logic operations;
• DRC2 [4]: this work presents a peripheral SRAM accelerator circuitry which offers ALU-like
capabilities. This architecture exploits the concept of logic-near-memory, because the memory
block is fabricated with standard 10T SRAM technology. The peripheral ALU-like circuitry
supports logic operations such as XOR, OR, AND (also their inverted forms) and more
complex operations like addition/subtraction, shift, increment and decrement by 1 and
greater/less-than comparison;
• MISK [33]: standard 6T SRAM cells are interleaved with logic layers; at the end of
each standard-logic stack a latch layer is placed to store intermediate and final results. Each
logic layer is capable of manipulating bits of the surrounding data layers; this architecture
supports the XOR operation and a programmable 2-bit LUT;
Table 2.1 reports all the analyzed architectures with their supported instructions.
The Table shows how these structures support a wide range of logic operations, ranging from
simple ones (i.e. bit-wise operations) to complex ones (i.e. multiplications and divisions) which
require expensive additional logic. In order to keep the structure similar to the original one, and for
the sake of simplicity, it was decided to implement the following operations:
• NAND
• NOR
• XNOR
This implies small modifications to the already provided LiM RAM model, but at the same time
this could bring several improvements in terms of logic functionalities.
• rsN: contains the address of the Register File’s location where the range information is
stored;
This instruction has the task of programming the memory: once the range size is retrieved from
the Register File, it is merged with the funct3 bits to create the 32-bit program word.
Then, the RISC-V performs a store to the special programming address, writing the program word
there; Figure 2.4 represents a schematic view of the instruction's work-flow.
Once store_active_logic has completed, the LiM memory is able to perform logic operations.
• imm: immediate field; together with the content pointed to by rs1 it builds the target address;
2.3.3 Store
This is not a new instruction: once the memory has been programmed, a store instruction acts
differently from a standard RISC-V store (sw). The instruction exploits the programming of the LiM
structure and performs the corresponding logic operation directly in memory; the value specified by
the source register rs2 is interpreted in this case as the mask for the logic operation.
2.4 LiM instruction set extension
The 32-bit LiM program word (Figure 2.7) issued by store_active_logic is a combination of two
pieces of information:
• funct3: 3-bit field that specifies the LiM operation (AND, OR, XOR, MAX/MIN, NONE);
• operation size: 29-bit field coming from the Register File, which gives the size of
the range operation to be performed in memory;
Table 2.2 reports the encoding of the funct3 field. As clearly visible, only one combination (3'b111)
is free, which narrows down the possibility of expanding the set of instructions supported by the LiM
structure. This is clearly a limitation in terms of new functionalities and, with a view
to turning this structure into a LiM sandbox, the available function field has to be expanded.
Function Code
NONE 000
XOR 001
AND 010
OR 011
MIN 101
MAX 110
It was decided to redesign the store_active_logic instruction to expand the funct field of the
program word. The new instruction format is reported in Figure 2.8, in which the following fields were
modified:
• imm: immediate field reduced from 12 bits to 7 bits, as in the load_mask instruction;
• funct: new 5-bit function field, which expands the codes available for new instructions;
The new program word structure is depicted in Figure 2.9: here the function field is composed
of 8 bits, enabling the support of up to 2^8 − 1 different LiM operations. The operand size
field is reduced to 24 bits; this is not a limitation, because the RISC-V's RAM address width is
22 bits, so a range operation over the whole memory is still fully supported.
The required modifications are the following:
• RISC-V Core: modify the ID-stage and decoder module to support the new store_active_logic;
• LiM RAM: modify the program word decoding and add support for the inverted logic operations;
This choice has several benefits: first of all, a large set of new function encodings is unlocked, which
dramatically expands the LiM capabilities. The program word decoding undergoes only
small modifications, since the new function field is simply the concatenation of funct3 and the newly
defined funct field, so only a resize of some signals is required. Considering also the inverted versions
of the already available logic instructions, the new function codes are the ones reported in
Table 2.3.
Function Code
NONE 0000000
XOR 0000001
AND 0000010
OR 0000011
XNOR 0001001
NAND 0001010
NOR 0001011
MIN 0000101
MAX 0000110
These modifications are necessary to build the new LiM program word according to the new
store_active_logic format.
To support the new inverting operations (NAND, NOR, XNOR), additional logic has been inserted.
The feedback write mux has been enlarged to host the inverting logic operations, which are generated
simply by adopting NOT gates, as shown in Figure 2.11. The output logic has been modified in
the same way, as depicted in Figure 2.12.
Hardware modifications are necessary to support this new extended LiM framework, but software
modifications are required as well. In the original work, LiM instructions (i.e. store_active_logic
and load_mask) were inserted by hand directly in the .hex file. This is not easy and requires a
deep understanding of the assembly code; for this reason it was decided to modify the RISC-V
GNU-GCC Toolchain to define the new LiM custom instructions.
According to [26], the modifications have been applied to the RISC-V GNU-GCC Binutils; two files
have been modified (an illustrative sketch of such entries is given after this list):
• riscv-opc.h: the #define (Figure 2.13) and DECLARE_INSN() (Figure 2.14) entries of the new
instructions have been inserted;
• riscv-opc.c: all the new instructions have been declared with their parameters (Figure 2.15);
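Purely as an illustration of the kind of entries involved, one custom instruction would look roughly like the snippet below. The MATCH/MASK values and the operand string are placeholders, not the encodings actually adopted (those are the ones in Figures 2.13-2.15), and the exact layout of the opcode table entry depends on the Binutils version.

/* riscv-opc.h: match/mask pattern and declaration of the new instruction */
#define MATCH_SW_ACTIVE_OR 0x0000303b   /* placeholder encoding */
#define MASK_SW_ACTIVE_OR  0x0000707f   /* placeholder mask     */
DECLARE_INSN(sw_active_or, MATCH_SW_ACTIVE_OR, MASK_SW_ACTIVE_OR)

/* riscv-opc.c: entry added to the riscv_opcodes[] table; the operand string
   "d,s,j" stands for destination register, source register, immediate        */
{"sw_active_or", 0, INSN_CLASS_I, "d,s,j",
 MATCH_SW_ACTIVE_OR, MASK_SW_ACTIVE_OR, match_opcode, 0},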
After applying these modifications and re-building the compiler, the new Toolchain is able to
support the LiM instructions. Differently from the original work, it is now possible to write in-line
assembly code and exploit the new custom instructions.
Chapter 3
LiM Racetrack memory model
perpendicular to the magnetic material. The peculiarity of NML technology is the possibility to
arrange multiple magnetic elements and exploit the interaction of their magnetization direction to
create logic functions.
Applying the Racetrack technology in a LiM context implies adopting special pNML cells able
to perform in-place computations. In this regard, a programmable NAND-NOR pNML gate is
proposed in [6], and it was the starting element for the LiM Racetrack design. This gate, as
shown in Figure 3.1, has a multi-layer structure. The top layer hosts the track of control gates
which program the bottom central cells, turning the structure into a 2-input NAND gate or a
2-input NOR gate. The bottom layer is composed of multiple cells: the central ones are arranged
as a Racetrack and generate the logic values, while the side cells represent the inputs of the logic
gate. These latter cells can also be arranged in a Racetrack fashion, which makes it possible
to compute bit-wise operations directly in memory. Thus, the final Racetrack is arranged as a
multi-layer structure composed of multiple Racetracks with different tasks.
The model is very simple: in the upper part, a NAND and a NOR gate compute the logic
operation, then a mux forwards the result based on the cell's programming. The sign of the shift
current selects the right or the left value during a shift operation, while the presence or absence of
the shift current pulse acts as a control signal for the central mux. In this way, if a shift operation
is in progress, the bits coming from the left or right cells are forwarded; if the current pulse is not
present, the computed logical result is selected. The storage part of the cell is modeled as a
Flip-Flop triggered by a signal generated by several logic gates; the switching pulse is generated
only in the situations described previously.
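To make the cell's behaviour concrete, a small behavioural sketch in plain C is given below; it only mirrors the HDL description above, and all signal names are illustrative.

#include <stdbool.h>

/* Next value presented to the cell's storage Flip-Flop; the Flip-Flop itself is
   updated only when the switching pulse described above is generated.          */
static bool lim_cell_next(bool in_a, bool in_b,        /* side input cells         */
                          bool left, bool right,       /* neighbouring domains     */
                          bool program_nand,           /* set by Program Racetrack */
                          bool shift_pulse, bool shift_right)
{
    if (shift_pulse)                            /* shift in progress: forward a neighbour */
        return shift_right ? left : right;
    return program_nand ? !(in_a && in_b)       /* programmed as NAND */
                        : !(in_a || in_b);      /* programmed as NOR  */
}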
As will be clear in the next Sections, the cell presented above is not responsible for read and write
operations: the access cells are based on a different technology named SOT. Actually, the current
cell model is not capable of writing external values, but this is not a problem since the central
Racetrack is only responsible for the logic computations. The only modification with respect to a
standard pNML NAND-NOR gate is the ability to read the stored value; this feature has been
modeled using an additional Flip-Flop which takes as input the output of the storage element and
is triggered by an externally provided read signal. These cells are used to model the read ports of
the Racetrack; the resulting HDL model is reported in Figure 3.3.
Many different solutions are suggested; based on [34], it was decided to adopt a configuration
suitable for low-power and low-area applications such as IoT ones: the MU-32-08-04 layout, which
includes 32 domains, 8 access ports and 4 Racetracks per MU.
Access ports require more energy with respect to standard cells. For this reason the number of
these elements is limited, which implies shift operations to align the required data to the access
ports. Head management policies have a huge impact on the power wasted during accesses, so it is
important to find a good trade-off between efficiency and wasted power.
There are two main aspects to take into account [29]:
• Head selection: this policy selects the appropriate "head" for data access, the selection could
be static or dynamic. In the first case each part of the Racetrack is assigned statically to an
access port, in the second one the selected port is the nearest to the required data;
• Head update: this property decides what to do with a port after an access. There are three
possibilities: Eager, Lazy and Pre-shifting. In the first one, access ports are restored in their
initial positions, this simplifies the shift protocol and also the shift logic. In the second case
the access ports are not restored, this policy tries to exploit the spatial locality of access
to minimize the required shift for the next access. In this case the logic involved in the
computation of the number of required shifts is much more complex because it is necessary
to take into account the current position of each port. The last possibility tries to guess the
next likely accessed data by performing a certain number of shifts in advance; this requires a
prediction algorithm and an appropriate control logic, and it is the most complex option.
For this work it was decided to adopt the combination of Static Selection and Eager Update,
identified with the name Static-Eager (SE), to reduce the design complexity.
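A minimal behavioural sketch of the Static-Eager policy is the following (C is used only for illustration; the domain-to-port mapping is an assumption made for the example, not the exact RTL).

/* Static selection: with Nb = 32 domains and Np = 8 ports per Racetrack, each
   port statically serves a group of Nb/Np = 4 domains. Eager update: after the
   access, the same number of shifts (opposite direction) restores the ports.   */
enum { NB = 32, NP = 8, GROUP = NB / NP };

static int se_port(int domain)         { return domain / GROUP; }          /* static head selection    */
static int se_set_shifts(int domain)   { return domain % GROUP; }          /* shifts to align the port */
static int se_reset_shifts(int domain) { return se_set_shifts(domain); }   /* eager: same count back   */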
The Figure shows the multi-layer structure described in Section 3.2: the red top Racetrack,
named Program Racetrack, is responsible for the programming of the whole structure (it selects
the NAND or NOR operation), while the bottom layer is composed of three separate Racetracks.
The central blue Racetrack is the Logic Racetrack, the logic part which computes the NAND/NOR
operation, while the peripheral Racetracks are respectively the Mask Racetrack (green) and the
Data Racetrack (orange).
During an access (read or write), data must be shifted to correctly align the accessed location to
the ports. Each Racetrack can be thought of as a long shift register where bits are shifted back
and forth; to prevent data loss during shift operations, some overhead cells are placed at the
beginning of each Racetrack (gray cells). These are built with standard Racetrack cells able only
to shift data.
Data parallelism differs between the operational modes. During the LiM Mode (Figures 3.7
and 3.8), a single Data bit or Logic bit is stored in a single LiM Racetrack, so the data parallelism
is 1 bit.
Considering the Memory mode, the data parallelism is completely different, because the Data,
Program and Mask Racetracks are all used to store bits. Thus, a single LiM Racetrack actually
stores a single bit of three different words; Figure 3.8 gives an idea of the configuration, where
thick lines of different colors highlight the bits of the three words. In this Mode the Logic Racetrack
is not exploited to store data: further studies are required to understand whether it could be
exploited for data storage as well, so in this work a conservative solution was adopted and the
Logic Racetrack is not used in Memory Mode. The Data Racetrack, Program Racetrack and Mask
Racetrack can be controlled separately, each of them storing a bit of a different word.
Note that, to simplify the figures, the Program Racetrack is not shown in Figures 3.7-3.8 and the
Logic Racetrack is not shown in Figure 3.8.
• Nb = 32;
• Nr = 4;
• Np = 8;
The resulting Macro Unit structure is depicted in Figure 3.10; all four LiM Racetracks work
together. The data parallelism is 4 bits in LiM Mode (Logic Data and Data), while it is 12 bits in
Memory Mode (4 bits for each of the 3 different words).
The whole memory array is composed of multiple modules; additional surrounding logic is required
to complete the Racetrack memory and to ensure the correct behaviour. Figure 3.12 shows the
high-level architecture of the Racetrack memory array. This schematic represents only the memory;
all the additional logic, such as word-line activation logic, hand-shaking logic and other important
components, will be highlighted in the next Section.
The memory array is very simple: the Racetrack array is controlled by the FSM, which acts as a
memory controller, activating all the required control signals. The shift generator sends the correct
number of shift pulses based on the location of the accesses. Additional logic is then required to
compute logic store and load operations; muxes are exploited to choose the correct computation in
both read and write operations.
A more detailed view of the building blocks is given in the following.
This simple logic module, shown in Figure 3.13, activates the correct Block based on the memory
access position. It simply groups a set of 32 word-lines together with an OR operation, because
each Block contains 32 words.
Since the Racetrack memory does not natively have the capability to perform byte-writes, additional
logic is adopted to forward the not-selected bytes unchanged in output and to overwrite the selected
bytes with the new input data. In a standard write operation, data is first read, then the selected
bytes are overwritten using the external data input. As mentioned before, during LiM store
operations the data computed by the LiM computation block is reused.
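The merging step of this read-modify-write can be sketched as follows (plain C, names are illustrative).

#include <stdint.h>

/* Overwrite only the bytes selected by the 4-bit byte-enable, forwarding the
   other bytes of the word read from the Racetrack unchanged.                  */
static uint32_t byte_write_merge(uint32_t read_word, uint32_t wdata, unsigned be)
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++)
        if (be & (1u << i))
            mask |= 0xFFu << (8 * i);
    return (read_word & ~mask) | (wdata & mask);
}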
3.5.4 Shifter
This component is fundamental for the correct behaviour of the Racetrack memory. As explained
before, the access ports are displaced along each Racetrack, so data must be shifted and aligned to
the correct access port depending on the adopted head management policy. The shift generator
produces the correct number of shift pulses based on the access performed in memory. Once the
generation is completed, the module freezes, sends a "done" signal to the FSM and waits for a
new shift request. This component is reset by a control signal issued by the FSM, and this
constrains the system to start a new transaction only after the port reset. To speed up the memory
accesses, two shift generators work as a pair (Figure 3.15): one serves the set shifts (port alignment)
and the other serves the reset shifts (port reset). This configuration allows a new memory
access to start right after the port reset completion. A register samples the shift number for the
reset shift generator, so that the address can change even before the port reset operation has
completed.
Then, the programming magnetic field B_z is applied and the FSM waits a number of clock cycles
(which can be set by the user) for the end of the programming phase; finally, a read or a write
operation is performed. The access cycle terminates with a port reset: when this task is completed,
and if a new memory request is present, it is possible to jump directly to a new port set operation.
Every time the Racetrack completes an access, the valid signal is used as an enable to increment
the address value. A comparator checks the current address against the final range address and
generates a done signal to stop the range routine; Figure 3.19 depicts the high-level schematic of
the range serializer.
All the other memory locations are filled in the same way; Figure 3.20 gives an idea of the new
organization.
Since the Racetrack commands arrive at every single Block, they are exploited to perform a
parallel memory access on B words, where B is the number of available Blocks. Considering the
selected size for the Racetrack memory, the available Blocks are 2000, which is therefore also the
number of words involved in a parallel logic store access.
This new organization required small design modifications: the range serializer was completely
removed and replaced with the range decoder explained in depth in [5]. This special decoder activates
multiple word-lines whenever a LiM range operation is required. Word-lines for LiM mode and
Memory mode are generated in parallel; the MEM_MODE parameter forwards the correct word-lines
to the routing block, whose only task is to route the word-lines to the correct Blocks with the
logic explained before.
A detailed analysis of Simulation results of this implementation is reported in the next Chapter.
Chapter 4
Simulations
4.1 Tools
The configurable LiM Framework, which supports three different types of memory, as well as the
RISC-V core, was designed using SystemVerilog. Simulations were performed adopting ModelSim/
Questa Sim 2020.4 and Synopsys VCS 2021.09.
All the following simulations refer to the final Racetrack version; the simulation results of the three
different implementations will be analyzed in Section 4.4.
4.2 Simulation of custom programs
4.2.1 Bitwise
In the following, Listing 4.1 shows the original bitwise.c code. In this example a vector of N
elements and a stand-alone variable are defined, then several logic operations are applied. Vector
computations are carried out with for loops; this dramatically increases the size of the code,
because the for loops are unrolled by the Compiler and the inner code is replicated N times.
Thanks to the Compiler modifications it is possible to write in-line assembly portions of code; the
new LiM bitwise program is reported in Listing 4.2. The Compiler recognizes these in-line pieces
of code and replaces them with the corresponding assembly instructions.
12 volatile int (*stand_alone);
13 volatile int (*final_result);
14
15 register unsigned int x0 asm ("x0");
16
17 // define variables' addresses
18 vector = (volatile int (*)[N]) 0x030000,
19 stand_alone = (volatile int (*)) 0x30040,
20 final_result = (volatile int (*)) 0x30044;
21
22 // configuration address, where the config of the memory is stored.
23 int cnfAddress = 0x1fffc;
24 // configure vector[N-1] address
25 int andAddress = 0x030010;
26 // configure vector[N-2] address
27 int xorAddress = 0x03000C;
28 // configure vector[N-3] address
29 int opAddress = 0x30008;
30
31
32 // initialize mask values
33 mask_and = 0x8F;
34 mask_or = 0xF1;
35 mask_xor = 0xF0;
36
37
38 /* fill vector */
39 for (i = 0; i < N; i++) {
40 (*vector)[i] = i*13467;
41 }
42
43 (*stand_alone) = (*vector)[1] + 0x768;
44
45
46 /* OR operation */
47
48 // program LiM for range operation
49 asm volatile ("sw_active_or %[result], %[input_i], 0"
50 : [result] "=r" (N)
51 : [input_i] "r" (cnfAddress), "[result]" (N)
52 );
53
54 // sw operation to activate OR LiM
55 (*vector)[0] = mask_or;
56
57 // program LiM for stand-alone operation
58 asm volatile ("sw_active_or %[result], %[input_i], 0"
59 : [result] "=r" (x0)
60 : [input_i] "r" (cnfAddress), "[result]" (x0)
61 );
62 (*stand_alone) = mask_or;
63
64 /* AND operation */
65
66 // program LiM for stand-alone operation
67 asm volatile ("sw_active_and %[result], %[input_i], 0"
68 : [result] "=r" (x0)
69 : [input_i] "r" (cnfAddress), "[result]" (x0)
70 );
71
72
73 // lw_mask operation for mask_and computation
74 asm volatile ("lw_mask %[result], %[input_s], %[input_t], 0"
75 : [result] "=r" (mask_and)
76 : [input_s] "r" (andAddress), [input_t] "r" (mask_and), "[result]" (mask_and)
77 );
78
79 // program LiM for range operation
80 asm volatile ("sw_active_and %[result], %[input_i], 0"
81 : [result] "=r" (N)
82 : [input_i] "r" (cnfAddress), "[result]" (N)
83 );
84
85 // sw operation to activate AND LiM
86 (*vector)[0] = mask_and;
87
88 // program LiM for stand-alone operation
89 asm volatile ("sw_active_and %[result], %[input_i], 0"
90 : [result] "=r" (x0)
91 : [input_i] "r" (cnfAddress), "[result]" (x0)
92 );
93
94 (*stand_alone) = mask_and;
95
96
97 /* XOR operation */
98
99 // program LiM for stand-alone operation
100 asm volatile ("sw_active_xor %[result], %[input_i], 0"
101 : [result] "=r" (x0)
102 : [input_i] "r" (cnfAddress), "[result]" (x0)
103 );
104
105
106 // lw_mask operation for mask_xor computation
107 asm volatile ("lw_mask %[result], %[input_s], %[input_t], 0"
108 : [result] "=r" (mask_xor)
109 : [input_s] "r" (xorAddress), [input_t] "r" (mask_xor), "[result]" (mask_xor)
110 );
111
112
113 // program LiM for range operation
114 asm volatile ("sw_active_xor %[result], %[input_i], 0"
115 : [result] "=r" (N)
116 : [input_i] "r" (cnfAddress), "[result]" (N)
117 );
118
119 (*vector)[0] = mask_xor;
120
121 // program LiM for stand-alone operation
122 asm volatile ("sw_active_xor %[result], %[input_i], 0"
123 : [result] "=r" (x0)
124 : [input_i] "r" (cnfAddress), "[result]" (x0)
125 );
126
127 (*stand_alone) = mask_xor;
128
129 // lw_mask operation for ~(*vector)[N-3] computation exploiting xor
Even if the new code appears longer (in terms of code lines), the resulting code is actually smaller,
since the for loops are avoided: the Compiler replaces each in-line assembly statement with a single
LiM range instruction (store_active_logic). In this way, the code replication caused by the unrolled
for loops is avoided. Furthermore, the logic operations are carried out directly in memory, which
avoids a lot of computations inside the Core. Whenever a logic operation needs to be performed in
memory, the memory must first be programmed with an appropriate store_active_logic. Then, it is
possible to perform a logic store simply by assigning the mask value to the target element: the
memory performs the logic operation on the elements specified by the operation size.
It is also possible to perform a logic load, where data is loaded from memory and at the same
time the logic operation specified by the LiM programming is applied. In this example, the lw_mask
instruction is exploited to compute the new mask values and the inverted operands necessary for the
computation of the final result.
To estimate the execution performance, it is possible to consider the Execution Time in terms of
number of clock cycles (cc), assuming N as the size of the vector. For this estimation, only the
meaningful parts of the code have been taken into account; the initialization of the vector and the
mask computations have not been included. The following equations show the Execution time
considering the standard memory and the Racetrack memory, both in normal and LiM configurations.
Equation 4.1 shows the Execution time in the case in which a standard memory is adopted, where
no kind of LiM operation can be carried out. The first part of the equation refers to the bitwise
computations; the factor 3 refers to the three different logic operations. For operations involving
the vector elements or the stand-alone element, 3 clock cycles are required, because data is loaded
in the Core, manipulated and finally stored back in memory. The computation of the final result
takes six clock cycles, because two different data need to be loaded in the Core, then they are
inverted, added together and, in the last clock cycle, the result is stored in memory.
This equation has a linear dependency on N.
In the case of a standard memory with LiM functionalities, the execution time can be expressed
as Equation 4.2. Here, logic store operations can be carried out in parallel: all the vector operations
are carried out in just two cycles, one for setting up the memory and another one for computing
and storing the results. The final result requires one clock cycle for setting up the memory and
four additional clock cycles: two for loading data in the Core while at the same time performing
the NOT operation, one for adding them together and the last one for storing the result in memory.
Here, there is no dependency on the parameter N because, thanks to the range logic operations, it
is possible to perform parallel in-memory computations.
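As a rough, back-of-the-envelope version of the two estimates just described (the exact expressions are Equations 4.1 and 4.2; the counts below are only one plausible reading of them):

Execution\_time_{std\_mem} \approx 3 \cdot \big[\, 3\,(N+1) \,\big] + 6 \;\; \mathrm{cc}

Execution\_time_{std\_LiM\_mem} \approx 3 \cdot 2 + (1 + 4) \;\; \mathrm{cc}

where, in the first case, each of the three logic operations spends 3 cc on each of the N vector elements plus the stand-alone variable, while in the LiM case each operation costs one memory set-up plus one logic store for the whole range (the stand-alone accesses add a couple of extra cycles per operation), and the final result adds one set-up plus four cycles.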
For the Racetrack memory case the estimation is quite different. Thanks to the FSM algorithm
shown in the previous Chapter it is possible to count the number of clock cycles for standard
memory accesses and LiM memory accesses:
• Standard type: 1 system clock cycle is required;
• LiM type: 2 clock cycles are required;
If the Racetrack is exploited as a standard memory, the performance is exactly equal to the starting
point: a normal memory access requires one clock cycle and the Execution time equation reported
in 4.3 is exactly equal to the standard memory one.
Equation 4.4 describes the Execution Time using the Racetrack with LiM functionalities. The first
part, with the factor 2, is the computation of the OR and AND operations, which use the built-in
logic computation (NAND/NOR). The XOR part has the same latency as a standard access, because
the logic computation is carried out by means of external logic as soon as data is available outside
the Racetrack memory array. The final part is the computation of the final result: the sum operands
are loaded from memory and, at the same time, a XOR operation is performed. Once the operands
are loaded in the Core, they are added and then stored back. The XOR operation is useful to invert
the operands directly in memory; for this operation a mask with all ones was adopted.
Figure 4.1 depicts the behaviour of the previous formulas for different values of N. For
the standard LiM memory case and the LiM Racetrack memory case, the behaviour is constant
because, regardless of the size of the vector, all the logic operations are carried out in parallel. For
the other cases, the curves have a linear behaviour. The Racetrack case has a higher offset due
to the larger number of clock cycles required for the built-in NAND/NOR computations.
Comparing the standard memory implementation and the Racetrack one, it is possible to observe a
similar behaviour in terms of performance. Without adopting any LiM functionality the Execution
time is exactly the same for both memories, while for the LiM case the Racetrack shows a small
performance degradation (≈ +23.5%).
Figure 4.1. Execution time estimation for bitwise.c with different vector size N
All the simulations have been carried out with Synopsys VCS; this tool produces a log file with
all the executed instructions. This is very useful to understand the differences of the execution in
all the cases. In the following some extracts are presented: only the meaningful parts are shown
and loops are highlighted with blank spaces.
With a non-LiM configuration, the system performs several for loops to compute all the bitwise
results; an example is given in Listing 4.3. Adopting a LiM configuration, still with a standard memory,
all the loops are converted into single range operations, as can be seen in Listing 4.4, where it is
also possible to see the insertion of the sw_active and lw_mask instructions. Results for the Racetrack
memory implementations are not shown since they are exactly equal to the ones presented here; the
only difference is the Execution time, which is much larger due to the initial memory initialization
and to the additional memory latency.
Listing 4.3. Extract of simulation log of bitwise.c - standard configuration
Time Cycles PC Instr Mnemonic
1896 ns 186 00000244 04e7a023 sw x14, 64(x15) x14:00003c03 x15:00030000 PA:00030040
1906 ns 187 00000248 0007a023 sw x0, 0(x15) x15:00030000 PA:00030000
1916 ns 188 0000024c 00d7a823 sw x13, 16(x15) x13:0000d26c x15:00030000 PA:00030010
1926 ns 189 00000250 00030737 lui x14, 0x30000 x14=00030000
1936 ns 190 00000254 01478613 addi x12, x15, 20 x12=00030014 x15:00030000
1946 ns 191 00000258 0007a683 lw x13, 0(x15) x13=00000000 x15:00030000 PA:00030000
1956 ns 192 0000025c 00478793 addi x15, x15, 4 x15=00030004 x15:00030000
1966 ns 193 00000260 0f16e693 ori x13, x13, 241 x13=000000f1 x13:00000000
1976 ns 194 00000264 fed7ae23 sw x13, -4(x15) x13:000000f1 x15:00030004 PA:00030000
1986 ns 195 00000268 fec798e3 bne x15, x12, -16 x15:00030004 x12:00030014
Listing 4.4. Extract of simulation log of bitwise.c - LiM configuration with new compiler
Time Cycles PC Instr Mnemonic
173821 ns 4341 00000244 00 e 7 a 823 sw x 14 , 16( x 15) x 14:0000 d 26 c x 15:00030000 PA :00030010
173861 ns 4342 00000248 0047 a 703 lw x 14 , 4( x 15) x 14=0000349 b x 15:00030000 PA :00030004
173901 ns 4343 0000024 c 04078593 addi x 11 , x 15 , 64 x 11=00030040 x 15:00030000
173941 ns 4344 00000250 00500693 addi x 13 , x 0 , 5 x 13=00000005
173981 ns 4345 00000254 76870713 addi x 14 , x 14 , 1896 x 14=00003 c 03 x 14:0000349 b
174021 ns 4346 00000258 04 e 7 a 023 sw x 14 , 64( x 15) x 14:00003 c 03 x 15:00030000 PA :00030040
174061 ns 4347 0000025 c 00020737 lui x 14 , 0 x 20000 x 14=00020000
174101 ns 4348 00000260 ffc 70713 addi x 14 , x 14 , -4 x 14=0001 fffc x 14:00020000
174141 ns 4349 00000264 000736 bb sw _ active _ or Nx 13 0( x 14) x 14:0001 fffc x 13:00000005 PA :0001 fffc
174181 ns 4350 00000268 0 f 100613 addi x 12 , x 0 , 241 x 12=000000 f 1
174221 ns 4351 0000026 c 00 c 7 a 023 sw x 12 , 0( x 15) x 12:000000 f 1 x 15:00030000 PA :00030000
174261 ns 4352 00000270 0007303 b sw _ active _ or Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
174341 ns 4354 00000274 04 c 7 a 023 sw x 12 , 64( x 15) x 12:000000 f 1 x 15:00030000 PA :00030040
174381 ns 4355 00000278 0007203 b sw _ active _ and Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
174461 ns 4357 0000027 c 08 f 00613 addi x 12 , x 0 , 143 x 12=0000008 f
174501 ns 4358 00000280 01078513 addi x 10 , x 15 , 16 x 10=00030010 x 15:00030000
174541 ns 4359 00000284 00 c 5261 b lw _ mask x 12 , x 12 , 0( x 10) x 12=0000008 d x 12:0000008 f x 10:00030010 PA :00030010
174581 ns 4360 00000288 000726 bb sw _ active _ and Nx 13 0( x 14) x 14:0001 fffc x 13:00000005 PA :0001 fffc
174661 ns 4362 0000028 c 00 c 7 a 023 sw x 12 , 0( x 15) x 12:0000008 d x 15:00030000 PA :00030000
174701 ns 4363 00000290 0007203 b sw _ active _ and Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
174781 ns 4365 00000294 04 c 7 a 023 sw x 12 , 64( x 15) x 12:0000008 d x 15:00030000 PA :00030040
174821 ns 4366 00000298 0007103 b sw _ active _ xor Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
174901 ns 4368 0000029 c 0 f 000613 addi x 12 , x 0 , 240 x 12=000000 f 0
174941 ns 4369 000002 a 0 00 c 78513 addi x 10 , x 15 , 12 x 10=0003000 c x 15:00030000
174981 ns 4370 000002 a 4 00 c 5261 b lw _ mask x 12 , x 12 , 0( x 10) x 12=00000071 x 12:000000 f 0 x 10:0003000 c PA :0003000 c
175021 ns 4371 000002 a 8 000716 bb sw _ active _ xor Nx 13 0( x 14) x 14:0001 fffc x 13:00000005 PA :0001 fffc
175061 ns 4372 000002 ac 00 c 7 a 023 sw x 12 , 0( x 15) x 12:00000071 x 15:00030000 PA :00030000
175101 ns 4373 000002 b 0 0007103 b sw _ active _ xor Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
175141 ns 4374 000002 b 4 fff 00693 addi x 13 , x 0 , -1 x 13= ffffffff
175181 ns 4375 000002 b 8 04 c 7 a 023 sw x 12 , 64( x 15) x 12:00000071 x 15:00030000 PA :00030040
175221 ns 4376 000002 bc 00878513 addi x 10 , x 15 , 8 x 10=00030008 x 15:00030000
175261 ns 4377 000002 c 0 00068613 addi x 12 , x 13 , 0 x 12= ffffffff x 13: ffffffff
175301 ns 4378 000002 c 4 00 d 5261 b lw _ mask x 12 , x 13 , 0( x 10) x 12= ffffff 0 b x 13: ffffffff x 10:00030008 PA :00030008
175341 ns 4379 000002 c 8 00 d 5 a 69 b lw _ mask x 13 , x 13 , 0( x 11) x 13= ffffff 0 f x 13: ffffffff x 11:00030040 PA :00030040
175381 ns 4380 000002 cc 0007003 b sw _ active _ none Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
175421 ns 4381 000002 d 0 00 d 60633 add x 12 , x 12 , x 13 x 12= fffffe 1 a x 12: ffffff 0 b x 13: ffffff 0 f
175461 ns 4382 000002 d 4 04 c 7 a 223 sw x 12 , 68( x 15) x 12: fffffe 1 a x 15:00030000 PA :00030044
175501 ns 4383 000002 d 8 00000513 addi x 10 , x 0 , 0 x 10=00000000
175541 ns 4384 000002 dc 00008067 jalr x0, x1, 0 x 1:000001 d 8
4.2.2 Inverted-bitwise
The example code bitwise_inv.c, reported in Listing 4.5, has basically the same structure as bitwise.c,
but the original logic operations are replaced with inverting ones (NAND, NOR, XNOR). This test
is interesting because it makes it possible to also analyze the new inverting logic operations; the
results are coherent with the expected ones. The modified C code with in-line assembly instructions
is reported in Listing 4.6.
Listing 4.5. bitwise_inv.c code
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 int main(int argc, char *argv[])
5 {
6 /* variable declaration */
7 int N = 5, i, mask_or, mask_and, mask_xor;
8 int *vector = 0x030000, *stand_alone = 0x30040, *final_result = 0x30080;
9
10 /* fill vector */
11 for (i = 0; i < N; i++) {
12 vector[i] = i*13467;
13 }
14 *stand_alone = vector[1] + 0x768;
15
16 /* OR operation */
17 mask_or = 0xF1;
18 for (i = 0; i < N; i++) {
19 vector[i] = ~(vector[i] | mask_or);
20 }
21 *stand_alone = ~(*stand_alone | mask_or);
22
23 /* AND operation */
24 mask_and = ~(vector[N-1] & 0x8F);
25 for (i = 0; i < N; i++) {
26 vector[i] = ~(vector[i] & mask_and);
27 }
28 *stand_alone = ~(*stand_alone & mask_and);
29
30 /* XOR operation */
31 mask_xor = ~(vector[N-2] ^ 0xF0);
32 for (i = 0; i < N; i++) {
33 vector[i] = ~(vector[i] ^ mask_xor);
34 }
35 *stand_alone = ~(*stand_alone ^ mask_xor);
36
37 *final_result = ~vector[N-3] + ~(*stand_alone);
38
39 return EXIT_SUCCESS;
40 }
81 : [input_s] "r" (andAddress), [input_t] "r" (mask_nand), "[result]" (mask_nand)
82 );
83
84 // program LiM for range operation
85 asm volatile ("sw_active_nand %[result], %[input_i], 0"
86 : [result] "=r" (N)
87 : [input_i] "r" (cnfAddress), "[result]" (N)
88 );
89
90 // sw operation to activate NAND LiM
91 (*vector)[0] = mask_nand;
92
93 // program LiM for stand-alone operation
94 asm volatile ("sw_active_nand %[result], %[input_i], 0"
95 : [result] "=r" (x0)
96 : [input_i] "r" (cnfAddress), "[result]" (x0)
97 );
98
99 (*stand_alone) = mask_nand;
100
101
102 /* XNOR operation */
103
104 // program LiM for stand-alone operation
105 asm volatile ("sw_active_xnor %[result], %[input_i], 0"
106 : [result] "=r" (x0)
107 : [input_i] "r" (cnfAddress), "[result]" (x0)
108 );
109
110
111 // lw_mask operation for mask_xnor computation
112 asm volatile ("lw_mask %[result], %[input_s], %[input_t], 0"
113 : [result] "=r" (mask_xnor)
114 : [input_s] "r" (xorAddress), [input_t] "r" (mask_xnor), "[result]" (mask_xnor)
115 );
116
117
118 // program LiM for range operation
119 asm volatile ("sw_active_xnor %[result], %[input_i], 0"
120 : [result] "=r" (N)
121 : [input_i] "r" (cnfAddress), "[result]" (N)
122 );
123
124 (*vector)[0] = mask_xnor;
125
126 // program LiM for stand-alone operation
127 asm volatile ("sw_active_xnor %[result], %[input_i], 0"
128 : [result] "=r" (x0)
129 : [input_i] "r" (cnfAddress), "[result]" (x0)
130 );
131
132 (*stand_alone) = mask_xnor;
133
134
135 // lw_mask operation for ~(*vector)[N-3] computation exploiting xnor
136 asm volatile ("lw_mask %[result], %[input_s], %[input_t], 0"
137 : [result] "=r" (sum_a)
138 : [input_s] "r" (opAddress), [input_t] "r" (sum_a), "[result]" (sum_a)
139 );
140
141 // lw_mask operation for ~(*stand_alone) computation exploiting xnor
142 asm volatile ("lw_mask %[result], %[input_s], %[input_t], 0"
143 : [result] "=r" (sum_b)
144 : [input_s] "r" (&(*stand_alone)), [input_t] "r" (sum_b), "[result]" (sum_b)
145 );
146
147
148 // restore standard operations
149 asm volatile ("sw_active_none %[result], %[input_i], 0"
150 : [result] "=r" (x0)
151 : [input_i] "r" (cnfAddress), "[result]" (x0)
152 );
153
154 (*final_result) = sum_a + sum_b;
155
156
157
158 return EXIT_SUCCESS;
159 }
In the following, the Equations for the Execution Time estimation are reported; as in the previous
code, the variable N is assumed as the vector size.
This program exhibits the same results in terms of Execution time because the program's structure
is identical to bitwise.c except for the inverted operations. Figure 4.2 reports the plot of the
Execution time for the three different cases. The curves are the same because only inverting
operations have been added to the computations, and these do not require additional clock cycles
because they are performed directly in memory; thus the same results are obtained.
Figure 4.2. Execution time estimation for bitwise_inv.c with different vector size N
In Listing 4.7, the most interesting part of the simulation log is reported. The opcodes and funct
fields of the new inverting logic operations are very similar to the old ones, and unfortunately the
Compiler is not capable of recognizing the new instructions. For this reason the simulation log
reports wrong names for the logic store instructions (e.g. sw_active_or instead of sw_active_nor);
this is an intrinsic limitation of the compiler, but in any case the results are correct.
Listing 4.7. Extract of simulation log of bitwise_inv.c - LiM configuration with new compiler
Time Cycles PC Instr Mnemonic
173821 ns 4341 00000244 00 e 7 a 823 sw x 14 , 16( x 15) x 14:0000 d 26 c x 15:00030000 PA :00030010
173861 ns 4342 00000248 0047 a 703 lw x 14 , 4( x 15) x 14=0000349 b x 15:00030000 PA :00030004
173901 ns 4343 0000024 c 04078593 addi x 11 , x 15 , 64 x 11=00030040 x 15:00030000
173941 ns 4344 00000250 00500693 addi x 13 , x 0 , 5 x 13=00000005
173981 ns 4345 00000254 76870713 addi x 14 , x 14 , 1896 x 14=00003 c 03 x 14:0000349 b
174021 ns 4346 00000258 04 e 7 a 023 sw x 14 , 64( x 15) x 14:00003 c 03 x 15:00030000 PA :00030040
174061 ns 4347 0000025 c 00020737 lui x 14 , 0 x 20000 x 14=00020000
174101 ns 4348 00000260 ffc 70713 addi x 14 , x 14 , -4 x 14=0001 fffc x 14:00020000
174141 ns 4349 00000264 001736 bb sw _ active _ or Nx 13 1( x 14) x 14:0001 fffc x 13:00000005 PA :0001 fffc
174181 ns 4350 00000268 0 f 100613 addi x 12 , x 0 , 241 x 12=000000 f 1
174221 ns 4351 0000026 c 00 c 7 a 023 sw x 12 , 0( x 15) x 12:000000 f 1 x 15:00030000 PA :00030000
174261 ns 4352 00000270 0017303 b sw _ active _ or Nx 0 1( x 14) x 14:0001 fffc PA :0001 fffc
174341 ns 4354 00000274 04 c 7 a 023 sw x 12 , 64( x 15) x 12:000000 f 1 x 15:00030000 PA :00030040
174381 ns 4355 00000278 0017203 b sw _ active _ and Nx 0 1( x 14) x 14:0001 fffc PA :0001 fffc
174461 ns 4357 0000027 c 08 f 00613 addi x 12 , x 0 , 143 x 12=0000008 f
174501 ns 4358 00000280 01078513 addi x 10 , x 15 , 16 x 10=00030010 x 15:00030000
174541 ns 4359 00000284 00 c 5261 b lw _ mask x 12 , x 12 , 0( x 10) x 12= fffffffd x 12:0000008 f x 10:00030010 PA :00030010
174581 ns 4360 00000288 001726 bb sw _ active _ and Nx 13 1( x 14) x 14:0001 fffc x 13:00000005 PA :0001 fffc
174661 ns 4362 0000028 c 00 c 7 a 023 sw x 12 , 0( x 15) x 12: fffffffd x 15:00030000 PA :00030000
174701 ns 4363 00000290 0017203 b sw _ active _ and Nx 0 1( x 14) x 14:0001 fffc PA :0001 fffc
174781 ns 4365 00000294 04 c 7 a 023 sw x 12 , 64( x 15) x 12: fffffffd x 15:00030000 PA :00030040
174821 ns 4366 00000298 0017103 b sw _ active _ xor Nx 0 1( x 14) x 14:0001 fffc PA :0001 fffc
174901 ns 4368 0000029 c 0 f 000613 addi x 12 , x 0 , 240 x 12=000000 f 0
174941 ns 4369 000002 a 0 00 c 78513 addi x 10 , x 15 , 12 x 10=0003000 c x 15:00030000
174981 ns 4370 000002 a 4 00 c 5261 b lw _ mask x 12 , x 12 , 0( x 10) x 12= ffff 62 fc x 12:000000 f 0 x 10:0003000 c PA :0003000 c
175021 ns 4371 000002 a 8 001716 bb sw _ active _ xor Nx 13 1( x 14) x 14:0001 fffc x 13:00000005 PA :0001 fffc
175061 ns 4372 000002 ac 00 c 7 a 023 sw x 12 , 0( x 15) x 12: ffff 62 fc x 15:00030000 PA :00030000
175101 ns 4373 000002 b 0 0017103 b sw _ active _ xor Nx 0 1( x 14) x 14:0001 fffc PA :0001 fffc
175141 ns 4374 000002 b 4 00000693 addi x 13 , x 0 , 0 x 13=00000000
175181 ns 4375 000002 b 8 04 c 7 a 023 sw x 12 , 64( x 15) x 12: ffff 62 fc x 15:00030000 PA :00030040
175221 ns 4376 000002 bc 00878513 addi x 10 , x 15 , 8 x 10=00030008 x 15:00030000
175261 ns 4377 000002 c 0 00068613 addi x 12 , x 13 , 0 x 12=00000000 x 13:00000000
175301 ns 4378 000002 c 4 00 d 5261 b lw _ mask x 12 , x 13 , 0( x 10) x 12= ffff 0 b 0 b x 13:00000000 x 10:00030008 PA :00030008
175341 ns 4379 000002 c 8 00 d 5 a 69 b lw _ mask x 13 , x 13 , 0( x 11) x 13= ffff 5 e 0 f x 13:00000000 x 11:00030040 PA :00030040
175381 ns 4380 000002 cc 0007003 b sw _ active _ none Nx 0 0( x 14) x 14:0001 fffc PA :0001 fffc
175421 ns 4381 000002 d 0 00 d 60633 add x 12 , x 12 , x 13 x 12= fffe 691 a x 12: ffff 0 b 0 b x 13: ffff 5 e 0 f
175461 ns 4382 000002 d 4 04 c 7 a 223 sw x 12 , 68( x 15) x 12: fffe 691 a x 15:00030000 PA :00030044
4.3 Simulation with standard programs
The program was modified to integrate the new Compiler capabilities; Listing 4.9 shows the code
with the new LiM operations (store_active and load_mask). To optimize the LiM functionalities, it
was decided to adopt a single in-memory operation: in this case only OR operations are performed
in memory. Since, once the memory is programmed, it behaves differently from standard
situations, the mask value was selected as 0; this allows variables to be loaded into the Core without
changing their values. Also the memory locations which will host the final results should be zeroed,
so that the logic store does not corrupt the results: in fact the result of the query is used
as mask value, so by means of the OR operation the final result can be stored in the final
memory destination.
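Only to make these two tricks concrete, a tiny sketch in plain C follows; lim_load_or and lim_store_or are hypothetical stand-ins for the lw_mask and logic-store accesses of Listing 4.9, and the bitmap names are invented for the example.

#include <stdint.h>

/* Emulation of what the memory does once programmed for OR operations. */
static uint32_t lim_load_or (volatile uint32_t *p, uint32_t mask) { return *p | mask; }
static void     lim_store_or(volatile uint32_t *p, uint32_t mask) { *p |= mask; }

/* One 32-student chunk of the first query ("male and older than 19").  */
static void query1_chunk(volatile uint32_t *male, volatile uint32_t *age_gt19,
                         volatile uint32_t *result, int i)
{
    uint32_t m = lim_load_or(&male[i],     0x0);   /* zero mask: the load is neutral */
    uint32_t a = lim_load_or(&age_gt19[i], 0x0);
    /* result[i] was zeroed beforehand, so the in-memory OR simply deposits m & a */
    lim_store_or(&result[i], m & a);
}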
Note that in the second Query the well-known De Morgan's Law is adopted to simplify the
calculation, since (∼A)&(∼B) = ∼(A|B). In this specific algorithm, range operations cannot be used,
because the operands needed to compute the final result have to be manipulated and loaded
into the Core. This limits the possible improvement brought by the LiM approach. In the following,
Listing 4.9 reports the modified C program.
Listing 4.9. bitmap_search.c code with new compiler
1 /* Bitmap search program */
2 // This program emulates the bitmap search algorithm, students' features are distributed over 6 integer vectors.
3 // In this program two queries are performed
4 // 1. Which students are male and older than 19?
5 // 2. Which students are older than 18?
6
7 #include <stdio.h>
8 #include <stdlib.h>
9
10 int main(int argc, char *argv[])
11 {
12 /* variable declaration */
13 int i, N = 0, mask = 0, partial, res, operand;
In the following the Execution time is estimated for all the four memory configurations. In these
formulas, N is considered as $N = N_{indexes}/32$ because operations are executed on 32 bits.
The factor 6 in Equation 4.9 comes from the fact that three clock cycles are required to load
all the operands in the Core, two clock cycles are needed to perform the logic computations
and the last one is used to store back the final result. The considerations are similar for
the second query. Equation 4.10 shows the possible improvement brought by the LiM approach:
two clock cycles are spent to program the memory, but the query computations are performed in
fewer clock cycles. The standard Racetrack case, Equation 4.11, has the same Execution time as
the standard memory case, while the LiM Racetrack case, reported in Equation 4.12, embeds
a higher latency due to the exploitation of the internal built-in NOR feature. Remember that all the
memory operations are performed keeping the memory programmed to compute in-memory OR
operations, and that neutral memory accesses are treated as in-place computations using zero as
mask. For this reason, this implementation results in worse performance with respect to all the
other cases, even worse than not adopting the LiM feature at all.
Figure 4.3 shows the Execution time estimation for different numbers of indexes; as anticipated, the
LiM Racetrack case has a much steeper curve due to the additional latency brought by the internal
NOR computations.
Figure 4.3. Execution time estimation for bitmap_search.c with different vector size N
Listing 4.11 shows how some operations are partially replaced by in-memory computations.
Simulations were run adopting N=6 (N_indexes=192).
Listing 4.10. Extract of simulation log of bitmap_search.c - standard configuration with new compiler
Time Cycles PC Instr Mnemonic
2366 ns 233 00000300 09078793 addi x15, x15, 144 x15=00030090 x15:00030000
2376 ns 234 00000304 fe872603 lw x12, -24(x14) x12=ffffffff x14:00030078 PA:00030060
2386 ns 235 00000308 fd072683 lw x13, -48(x14) x13=00000000 x14:00030078 PA:00030048
2396 ns 236 0000030c 00470713 addi x14, x14, 4 x14=0003007c x14:00030078
2406 ns 237 00000310 00c6e6b3 or x13, x13, x12 x13=ffffffff x13:00000000 x12:ffffffff
2416 ns 238 00000314 ffc72603 lw x12, -4(x14) x12=00000000 x14:0003007c PA:00030078
2436 ns 240 00000318 00c6f6b3 and x13, x13, x12 x13=00000000 x13:ffffffff x12:00000000
2446 ns 241 0000031c 02d72a23 sw x13, 52(x14) x13:00000000 x14:0003007c PA:000300b0
2456 ns 242 00000320 fef712e3 bne x14, x15, -28 x14:0003007c x15:00030090
3486 ns 345 0000032c 0187a703 lw x14, 24(x15) x14=ffff0000 x15:00030014 PA:0003002c
3496 ns 346 00000330 0007a603 lw x12, 0(x15) x12=0000ffff x15:00030014 PA:00030014
3506 ns 347 00000334 00478793 addi x15, x15, 4 x15=00030018 x15:00030014
3516 ns 348 00000338 00c76733 or x14, x14, x12 x14=ffffffff x14:ffff0000 x12:0000ffff
3526 ns 349 0000033c fff74713 xori x14, x14, -1 x14=00000000 x14:ffffffff
3536 ns 350 00000340 0ce7a623 sw x14, 204(x15) x14:00000000 x15:00030018 PA:000300e4
3546 ns 351 00000344 fed794e3 bne x15, x13, -24 x15:00030018 x13:00030018
3556 ns 352 00000348 00000513 addi x10, x0, 0 x10=00000000
3566 ns 353 0000034c 00008067 jalr x0, x1, 0 x1:000001d8
Listing 4.11. Extract of simulation log of bitmap_search.c - LiM configuration with new compiler
Time Cycles PC Instr Mnemonic
177501 ns 4433 00000300 ffc68693 addi x13, x13, -4 x13=0001fffc x13:00020000
177541 ns 4434 00000304 0006b03b sw_active_or Nx0 0(x13) x13:0001fffc PA:0001fffc
177581 ns 4435 00000308 000306b7 lui x13, 0x30000 x13=00030000
177621 ns 4436 0000030c 06078793 addi x15, x15, 96 x15=00030060 x15:00030000
177661 ns 4437 00000310 00000613 addi x12, x0, 0 x12=00000000
12
13 int (*key)[4][4] = 0x30200;
14 (*key)[0][0]=0x00; (*key)[0][1]=0xA5; (*key)[0][2]=0xA8; (*key)[0][3]=0xA0;
15 (*key)[1][0]=0xE9; (*key)[1][1]=0x09; (*key)[1][2]=0xBB; (*key)[1][3]=0x2A;
16 (*key)[2][0]=0xC9; (*key)[2][1]=0xD4; (*key)[2][2]=0xB7; (*key)[2][3]=0xAB;
17 (*key)[3][0]=0xF2; (*key)[3][1]=0xE8; (*key)[3][2]=0x60; (*key)[3][3]=0x08;
18
19 /* Others */
20 int i, j;
21
22 /* Add around key */
23 for (i = 0; i < 4; i++) {
24 for (j = 0; j < 4; j++) {
25 (*states)[i][j] = (*states)[i][j] ^ (*key)[i][j];
26 }
27 }
28
29 return EXIT_SUCCESS;
30 }
Listing 4.13 shows the code modified with the new LiM instructions. Once the memory is programmed
for the XOR operation, the load of the first operand requires a mask equal to 0, so that the data
loaded in the Core is not corrupted. Once the key data is loaded, it can be used as mask for the
logic store, computing the in-memory XOR operation.
Listing 4.13. aes128_addroundkey.c code with new compiler
1 /* AES128 Addroundkey program */
2 // Compute AES128 Addroundkey, the algorithm encrypts chunks of 128-bit data organized in a 4x4 matrix named 'states'.
3 // Data is transformed with a XOR operation using a 4x4 matrix named 'key'.
4
5 #include <stdio.h>
6 #include <stdlib.h>
7
8 int main(int argc, char *argv[])
9 {
10
11 /* input variables declaration */
12 volatile int (*states)[4][4];
13 volatile int (*key)[4][4];
14
15 states = (volatile int (*)[4][4]) 0x30000; // define states matrix starting address
16 key = (volatile int (*)[4][4]) 0x30200; // define key matrix starting address
17
18 // configuration address, where the config of the memory is stored.
19 int cnfAddress = 0x1fffc;
20
21 register unsigned int x0 asm ("x0");
22
23 // Initialize states matrix
24 (*states)[0][0]=0x32; (*states)[0][1]=0x88; (*states)[0][2]=0x31; (*states)[0][3]=0xE0;
25 (*states)[1][0]=0x43; (*states)[1][1]=0x54; (*states)[1][2]=0x31; (*states)[1][3]=0x37;
26 (*states)[2][0]=0xF6; (*states)[2][1]=0x30; (*states)[2][2]=0x98; (*states)[2][3]=0x07;
27 (*states)[3][0]=0xA8; (*states)[3][1]=0x8D; (*states)[3][2]=0xA2; (*states)[3][3]=0x34;
28
29 // Initialize key matrix
30 (*key)[0][0]=0x00; (*key)[0][1]=0xA5; (*key)[0][2]=0xA8; (*key)[0][3]=0xA0;
31 (*key)[1][0]=0xE9; (*key)[1][1]=0x09; (*key)[1][2]=0xBB; (*key)[1][3]=0x2A;
32 (*key)[2][0]=0xC9; (*key)[2][1]=0xD4; (*key)[2][2]=0xB7; (*key)[2][3]=0xAB;
33 (*key)[3][0]=0xF2; (*key)[3][1]=0xE8; (*key)[3][2]=0x60; (*key)[3][3]=0x08;
34
35 /* Other variables */
36 int i, j, N = 1, opK;
37
38 // Program memory for XOR operations
39 asm volatile ("sw_active_xor %[result], %[input_i], 0"
40 : [result] "=r" (x0)
41 : [input_i] "r" (cnfAddress), "[result]" (N)
42 );
43
44
45 /* Add around key */
46 for (i = 0; i < 4; i++) {
47 for (j = 0; j < 4; j++) {
48
49 // lw key[i][j] inside the core
50 asm volatile ("lw_mask %[result], %[input_s], %[input_t], 0"
51 : [result] "=r" (opK)
52 : [input_s] "r" (&(*key)[i][j]), [input_t] "r" (x0), "[result]" (opK)
53 );
54
55 // sw operation to activate sw_xor, compute XOR operation between states[i][j] and key[i][j]
56 (*states)[i][j] = opK; // use key as mask
57 }
58 }
59
60 // restore standard operations
61 asm volatile ("sw_active_none %[result], %[input_i], 0"
62 : [result] "=r" (x0)
63 : [input_i] "r" (cnfAddress), "[result]" (x0)
64 );
65
66
67 return EXIT_SUCCESS;
68 }
The Execution time estimation for this algorithm does not depend on any parameter, because it
involves a fixed number of elements equal to N = 16; all the following computations are performed
assuming this value.
The factor 4 in front of Equation 4.13 comes from the fact that the key and states bits have to be
loaded in the Core, then the XOR operation is performed and finally the result is stored back.
Adopting the LiM configuration, the logic computation takes two clock cycles, because the key
element is loaded from memory and then used as mask for the logic store; note that two additional
cycles are required for the memory programming. Note also that in this algorithm it is not
possible to exploit logic range operations, because every iteration requires a different set of key
and state bits. The Racetrack implementations, Equations 4.15 and 4.16, show the same behaviour
in terms of Execution time with respect to their counterparts.
The Racetrack memory with LiM functionalities computes XOR operations with no additional
latency required; for this reason the performance is equal to the standard memory LiM implementation.
For both types of memory the speed-up improvements are remarkable, being near 50%; Figure
4.4 shows a histogram of the Execution time.
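With N = 16 and the cycle counts just described, a rough numerical check (the exact expressions are Equations 4.13-4.16) is:

Execution\_time_{std\_mem} \approx 4 \cdot 16 = 64 \;\mathrm{cc}, \qquad Execution\_time_{LiM\_mem} \approx 2 + 2 \cdot 16 = 34 \;\mathrm{cc}

which is indeed a reduction of roughly one half, consistent with the histogram of Figure 4.4.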
Simulation results are reported in Listings 4.14 and 4.15; it is clearly visible how the result
computation requires fewer instructions than in the normal configuration.
Listing 4.14. Extract of simulation log of aes128_addroundkey.c - standard configuration with new compiler
Time Cycles PC Instr Mnemonic
2486 ns 245 00000310 000308b7 lui x17, 0x30000 x17=00030000
2496 ns 246 00000314 00400313 addi x6, x0, 4 x6=00000004
2506 ns 247 00000318 00000693 addi x13, x0, 0 x13=00000000
2516 ns 248 0000031c 00261813 slli x16, x12, 0x2 x16=00000000 x12:00000000
2526 ns 249 00000320 00d807b3 add x15, x16, x13 x15=00000000 x16:00000000 x13:00000000
2536 ns 250 00000324 00279793 slli x15, x15, 0x2 x15=00000000 x15:00000000
2546 ns 251 00000328 00f88533 add x10, x17, x15 x10=00030000 x17:00030000 x15:00000000
A Neural Network (NN) is a structure capable of very complex tasks. It is composed of neurons [2]:
these basic building blocks cooperate to take decisions. As shown in Figure 4.5, a neuron is
composed of two parts: the net part performs the weighted computations, while f(net) is an
activation function applied to the output, in general a non-linear function.
The formula of the net part is the following:
$$net = \sum_{i=0}^{N} X_i \times W_i + Bias \qquad (4.17)$$
Each input is multiplied by the corresponding weight, and possibly an additional bias term is
applied. The weights are adjusted during the training procedure; in this way the network's behaviour
is tailored to the required goals.
A Neural Network is composed of multiple layers formed by multiple neurons; a very common
structure is the Multi-Layer Perceptron, which is the one used for this test.
The network is composed of multiple layers with different tasks:
• Convolutional layer: performs the convolution between the input values and the weights;
• Pooling layer: similar to a convolutional layer, but computes the maximum or the average of the
selected inputs, returning only one value; it performs the sub-sampling operation;
• FC layer: composed of fully-interconnected neurons, it performs the classification
operations.
Since NNs are very complex, a Binary approximation was introduced; this leads to a reduction of
complexity as well as of the energy consumption of the algorithm. The approximation
used for this test is the XNOR-Net, where all weights and inputs are in binary format [24].
The XNOR-Net is used as a case study; the main part of the algorithm is the computation of the
XNOR products and the pop-counting [10]. The proposed architecture computes the XNOR
products internally, while the pop-counting is performed outside the memory. In the proposed test
program, shown in Listing 4.16, an input feature map (IFMAP) is convolved with a set of weights
called kernel. When the first convolution is finished, the kernel window is moved by a position
defined by the stride parameter [10].
In the xnor_net.c code of Listing 4.16, the XNOR products are computed: several for-loops binarize
the input, and then for each element of the ofmap matrix a XNOR operation is performed between
the selected ofmap element and the computed bWeight. Note that the program implementation is
slightly different from the theory: in fact XOR products are computed, and the pop-counting is then
performed outside on the zeros. This version is completely equivalent to the original algorithm,
with the advantage that XOR operations require less time to be executed inside the core; thus this
approach was adopted.
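The equivalence exploited here can be shown with a few lines of C (the helper names are illustrative, not part of xnor_net.c): counting the ones of an XNOR product is the same as counting the zeros of the corresponding XOR product.

#include <stdint.h>

/* Portable 32-bit pop-count (a compiler builtin could be used instead). */
static int popcount32(uint32_t x)
{
    int c = 0;
    for (; x; x &= x - 1) c++;
    return c;
}

/* XNOR-Net dot product over a binarized window of valid_bits bits:
   ones of (a XNOR w) == zeros of (a XOR w), so the XOR form used in
   xnor_net.c gives the same result while being cheaper for the Core.   */
static int binary_dot(uint32_t activations, uint32_t weights, int valid_bits)
{
    uint32_t mask = (valid_bits >= 32) ? 0xFFFFFFFFu : ((1u << valid_bits) - 1u);
    uint32_t x    = (activations ^ weights) & mask;   /* XOR product on the window   */
    return valid_bits - popcount32(x);                /* zeros of XOR = ones of XNOR */
}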
A LiM architecture can speed up this computation: in Listing 4.17 the code was modified to exploit
a range XNOR operation, so that the final matrix is computed with a single logic store operation.
The final computation is performed as many times as the number of channels, which is a single
one in this case. In this way a lot of instructions are avoided, because the computation is performed
directly in memory; this also leads to a reduction of the code size, because the XNOR computation
is not replicated in the inner for-loops but only n_channels times.
Listing 4.16. xnor_net.c code
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <time.h>
4 #define N 28
5 #define W_F 2
6
7 int sign_function ( int x )
8 {
9 if ( x > 0)
10 {
11 return 1;
12 }
13 else
14 {
15 return 0;
16 }
17 }
18
19
20
21 int main ()
22 {
23 // // initialize srand function
24 // srand ( time ( NULL ) ) ;
25 // weight matrix
26 volatile int (*weight)[W_F][W_F];
27 // image matrix
28 volatile int (*image)[N][N];
29 // store location
30 weight = (volatile int (*)[W_F][W_F]) 0x3000;
31 image = (volatile int (*)[N][N]) 0x30800;
32 // the of-map is stored from 0x20000 address and so on.
33 volatile int (*ofmap)[N][N];
34 ofmap = (volatile int (*)[N][N]) 0x20000;
35 int zero = 0;
36 // configuration address , where the config of the memory is stored .
37 int cnfAddress = 0x1fffc;
38
39 // indexes definition .
40 int i ,j ,c ,m , t ;
41
42 for ( i = 0; i < W_F ; i ++)
43 {
44 for ( j = 0; j < W_F ; j ++)
45 {
46 (* weight ) [ i ][ j ] = sign_function (0) ;
47 }
48 }
49 for ( i = 0; i < N ; i ++)
50 {
51 for ( j = 0; j < N ; j ++)
52 {
53 (* image ) [ i ][ j ] = sign_function (0) ;
54 (* ofmap ) [ i ][ j ] = 0;
55 }
56 }
57
58 // number of channels
59 int n_channels = 1;
60 // stride
61 int stride = 1;
62 // size of the kernel
63 int wf = W_F ;
64 // dimension of the output
65 int w_out = (N - wf ) / stride + 1;
66 // dimension squared of the output
67 int w_out2 = w_out * w_out + 1;
68 // index A and B
69 int A = 0;
70 int B = 0;
71 // flag indicating if the weight has been already binarized or not .
72 int flag = 0;
73 // binarized weight
74 unsigned int bWeight = 0;
75 // counting zeros
76 int countZeros ;
77 for ( c = 0; c < n_channels ; c ++)
78 {
79 for ( j = 0; j < w_out ; j ++)
80 {
81 for ( i = 0; i < w_out ; i ++)
82 {
83 for ( m = 0; m < wf ; m ++)
84 {
85 for ( t = 0; t < wf ; t ++)
86 {
87 A = j + m + j *( stride -1) ;
88 B = i + t + i *( stride -1) ;
89 (* ofmap ) [ j ][ i ] = ((* image ) [ A ][ B ]) | ((* ofmap ) [ j ][ i ] << 1) ;
90 if ( flag == 0)
91 {
92 bWeight = ( (* weight ) [ m ][ t ] ) | ( bWeight << 1) ;
93 }
94 }
95 }
96 // xor bitwise between the ofmap content and the binary weight
97 (* ofmap ) [ j ][ i ] = (* ofmap ) [ j ][ i ] ^ bWeight ;
98 flag = 1;
99 }
100 }
101
102 }
103
104 return EXIT_SUCCESS ;
105 }
21 int main ()
22 {
23 // // initialize srand function
24 // srand ( time ( NULL ) ) ;
25 // weight matrix
26 volatile int (*weight)[W_F][W_F];
27 // image matrix
28 volatile int (*image)[N][N];
29 // store location
30 weight = (volatile int (*)[W_F][W_F]) 0x3000;
31 image = (volatile int (*)[N][N]) 0x30800;
32 // the of-map is stored from 0x20000 address and so on.
33 volatile int (*ofmap)[N][N];
34 ofmap = (volatile int (*)[N][N]) 0x20000;
35 int zero = 0;
36 // configuration address , where the config of the memory is stored .
37 int cnfAddress = 0 x1fffc ;
38
39 // indexes definition .
40 int i ,j ,c ,m , t ;
41
42 for ( i = 0; i < W_F ; i ++)
43 {
44 for ( j = 0; j < W_F ; j ++)
45 {
46 (* weight ) [ i ][ j ] = sign_function (0) ;
47 }
48 }
49 for ( i = 0; i < N ; i ++)
50 {
51 for ( j = 0; j < N ; j ++)
52 {
53 (* image ) [ i ][ j ] = sign_function (0) ;
54 (* ofmap ) [ i ][ j ] = 0;
55 }
56 }
57
58 // number of channels
59 int n_channels = 1;
60 // stride
61 int stride = 1;
62 // size of the kernel
63 int wf = W_F ;
64 // dimension of the output
65 int w_out = (N - wf ) / stride + 1;
66 // dimension squared of the output
67 int w_out2 = w_out * w_out + 1;
68 // index A and B
69 int A = 0;
70 int B = 0;
71 // flag indicating if the weight has been already binarized or not .
72 int flag = 0;
73 // binarized weight
74 u n s i g n e d int bWeight = 0;
75 // counting zeros
76 int countZeros ;
77 for ( c = 0; c < n_channels ; c ++)
78 {
79 for ( j = 0; j < w_out ; j ++)
80 {
90
4.3 – Simulation with standard programs
In the following, a tentative Execution Time estimation is proposed. Only the operations and operands involved in the LiM XNOR computation are taken into account; all the other computations are considered as an offset. Multiple parameters appear in the Equations; for the sake of simplicity only N is assumed to be variable, while all the others are fixed:
• wf = 2;
• n_channel = 1;
Note that parameter $w_{out}$ is a function of N and it is expressed as $w_{out} = \frac{N - w_f}{stride} + 1$. After the weights binarization, the line of code regarding bWeight is no longer executed; this is taken into account in the equations with a factor $w_f^{2}$.
$Execution\_time_{std\_mem} \approx n_{channel} \cdot w_{out}^{2}\,[\,w_f^{2}(fixed) + 3(ofmap)\,] + w_f^{2}(fixed\_bWeight)$   (4.18)

$Execution\_time_{std\_LiM\_mem} \approx n_{channel} \cdot w_{out}^{2}\,[\,w_f^{2}(fixed)\,] + n_{channel}\,[\,1(mem\_active) + 1(ofmap) + 1(mem\_active)\,] + w_f^{2}(fixed\_bWeight)$   (4.19)

$Execution\_time_{RT\_mem} \approx n_{channel} \cdot w_{out}^{2}\,[\,w_f^{2}(fixed) + 3(ofmap)\,] + w_f^{2}(fixed\_bWeight)$   (4.20)

$Execution\_time_{RT\_LiM\_mem} \approx n_{channel} \cdot w_{out}^{2}\,[\,w_f^{2}(fixed)\,] + n_{channel}\,[\,1(mem\_active) + 1(ofmap) + 1(mem\_active)\,] + w_f^{2}(fixed\_bWeight)$   (4.21)
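As a quick numerical check of the constant-versus-quadratic behaviour (counting only the $ofmap$ and $mem\_active$ terms, which are the only ones that differ between the versions, and using the fixed parameters listed above), for the simulated case $N = 4$: $w_{out} = (4-2)/1 + 1 = 3$, so the non-LiM memories spend $3 \cdot w_{out}^{2} = 27$ cycles on the output computation, while the LiM versions spend $n_{channel}\,[1+1+1] = 3$ cycles regardless of N; for $N = 28$ the non-LiM term grows to $3 \cdot 27^{2} = 2187$ cycles, while the LiM term is still 3.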
Equations 4.19 and 4.21 show that the logic computation is carried out in a fully parallel way; this results in a constant Execution time for any value of N. Note that two additional clock cycles are required for memory programming. In this case the Racetrack memory shows the same behaviour, because the involved operation is an XOR one which, as explained, is performed with the same latency as a standard memory access.
The plot shows that the Execution times of the standard and Racetrack memories overlap in both the LiM and non-LiM versions, thus the behaviour in terms of performance is exactly the same. The standard memory and the Racetrack memory without LiM functionalities show a quadratic behaviour due to the factor $w_{out}^{2}$ involved in the output computation.
Code that does not change between the four memory versions is considered as fixed and it is not included as an offset in the plot reported in Figure 4.6.
Figure 4.6. Execution time estimation for xnor_net.c with different vector size N
Simulation logs were taken with N=4 and adopting sign_function(1) for the weight variables; there is a good improvement with the LiM configuration. The final XNOR operation is performed with a single logic store instruction repeated for the number of channels, whereas without LiM instructions the corresponding line of code is repeated as many times as the for-loop requires.
Listing 4.18. Extract of simulation log of xnor_net.c - standard configuration with new compiler
4.4 Simulation Results Analysis
Listing 4.19. Extract of simulation log of xnor_net_lim.c - LiM configuration with new compiler
Time     Cycles  PC        Instr     Mnemonic
2736 ns  270     000002dc  00082403  lw x8, 0(x16)        x8=00000000   x16:00030800  PA:00030800
2746 ns  271     000002e0  0006a803  lw x16, 0(x13)       x16=00000000  x13:00020000  PA:00020000
2766 ns  273     000002e4  00181813  slli x16, x16, 0x1   x16=00000000  x16:00000000
2776 ns  274     000002e8  00886833  or x16, x16, x8      x16=00000000  x16:00000000  x8:00000000
2786 ns  275     000002ec  0106a023  sw x16, 0(x13)       x16:00000000  x13:00020000  PA:00020000
2796 ns  276     000002f0  00031c63  bne x6, x0, 24       x6:00000000
2806 ns  277     000002f4  00361813  slli x16, x12, 0x3   x16=00000000  x12:00000000
2816 ns  278     000002f8  010e8833  add x16, x29, x16    x16=00003000  x29:00003000  x16:00000000
2826 ns  279     000002fc  00082803  lw x16, 0(x16)       x16=00000001  x16:00003000  PA:00003000
2836 ns  280     00000300  00171713  slli x14, x14, 0x1   x14=00000000  x14:00000000
2846 ns  281     00000304  00e86733  or x14, x16, x14     x14=00000001  x16:00000001  x14:00000000
2856 ns  282     00000308  01c787b3  add x15, x15, x28    x15=00000001  x15:00000000  x28:00000001
2866 ns  283     0000030c  00279793  slli x15, x15, 0x2   x15=00000004  x15:00000001
Considering first the adoption of a standard memory technology, the improvement is around ≃ −21% for bitwise.c: exploiting the range operations of the LiM architecture gives a quite interesting result in terms of performance. The improvement is even higher for bitwise_inv.c, where it reaches nearly ≃ −25%; inverting operations require more internal instructions than normal ones, and this additional latency is avoided using the LiM structure.
For the Racetrack case there is still some improvement when adopting a LiM architecture (≃ −9% and ≃ −12% respectively), but it is smaller than for the other memory type. One of the reasons is the additional latency of built-in operations, such as NAND/NOR and AND/OR, which are performed with additional cycles with respect to XOR/XNOR ones. In addition, when adopting a Racetrack memory it is not possible to perform parallel operations; in fact they are serialized.
In any case, these programs are tailored to the LiM architecture, thus the importance of these results is marginal. Table 4.2 reports the results for the Standard programs, which give a better idea of the improvements in real-case scenarios.
The result for bitmap_search.c reported in the original work [5] showed a marginal improvement in terms of execution time, limited to −2% (estimated in clock cycles).
The overall Execution time for the bitmap_search.c program in this Thesis shows a completely different trend; in fact, the adoption of a LiM architecture degrades the performance. Execution time increases by ≃ +22% adopting a standard technology memory, and performance decreases dramatically using the Racetrack memory (≃ +30%); these results are of course unacceptable.
The reason for such bad results could lie in the code design. Remember that in [5] the LiM programs were manually modified by adding the required assembly instructions; this of course increases the software efficiency because register-level optimization is possible. Adopting in-line assembly pieces of code is much easier, but at the same time the code optimization is left to the Compiler, which in this case is probably not able to optimize the code. Looking at the original assembly trace reported in [5], between one loop iteration and the next only a few instructions are executed (i.e. an addi for the address update used by the lw_mask, the logic operation, the sw of the result and a bne to test whether the loop has ended). In the current implementation many other instructions are executed between one iteration and the next, which shows that the Compiler is not able to optimize the code. Another reason could be the code itself: designing standard C code is more restrictive than directly introducing modifications in the .hex file, so it is possible that the designed LiM code is not very efficient.
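To make the contrast concrete, a LiM access written from C with in-line assembly looks roughly like the sketch below. Only the instruction sequence (addi, lw_mask, logic operation, sw, bne) is taken from the description of [5] above; the operand format of lw_mask, the asm constraints and the surrounding register allocation are assumptions, and the register allocation is exactly the part left to the Compiler, where the optimization gap originates.

/* Sketch only: requires the modified toolchain that knows the lw_mask
 * mnemonic; its operand format is assumed here for illustration.         */
static void bitmap_or_loop(volatile unsigned int *src,
                           volatile unsigned int *dst,
                           unsigned int n_words)
{
    for (unsigned int i = 0; i < n_words; i++)
    {
        unsigned int word;
        /* masked LiM load: one dedicated instruction instead of lw + and */
        __asm__ volatile ("lw_mask %0, 0(%1)"
                          : "=r"(word)
                          : "r"(&src[i])
                          : "memory");
        /* logic operation and store of the result (or + sw in the ideal
         * trace, plus whatever extra instructions the Compiler emits)    */
        dst[i] |= word;
    }
}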
Software-related issues aside, the worse result of the Racetrack implementation is not unexpected, because this program exploits only OR computations. This memory carries out that operation with the built-in NOR operation, which requires more clock cycles than a standard access. This was partially anticipated by the Execution time estimations: Figure 4.3 clearly shows how the built-in NOR operation has a huge impact on performance.
Results for the aes128_addroundkey.c program are very interesting. For the standard technology memory case, adopting a LiM configuration achieves a remarkable improvement (≃ −22%); in the case of a Racetrack memory it is also possible to obtain a quite reasonable result (≃ −19%), which is very close to the previous one and even higher than the Custom programs results. This program exploits only LiM XOR operations and, as explained previously, in the Racetrack this computation is carried out with a latency equal to a standard memory access; this is the reason for this result. Remember that a NAND or NOR logic operation would have required more clock cycles.
For the xnor_net.c program the Execution time reduction is quite small: it reaches ≃ −5% in the standard technology memory case and ≃ −4% in the Racetrack case. These results were obtained with a small number of elements (N=4) for simulation time reasons; the improvement could possibly be even higher with a larger value of N. Also in this case the program uses XOR operations, which are executed with the same latency as standard accesses in the Racetrack memory; this is the reason for the similar improvement.
It is clearly visible that different memories can lead to different results and satisfy different needs. In a standard technology memory configuration, the adoption of the LiM paradigm can achieve, for specific programs, interesting improvements in terms of Execution time. Of course, area and power are necessarily higher because additional logic is required to support the LiM operations.
Simulation results show how the adoption of a Racetrack memory leads to a deterioration of the Execution time; Table 4.3 shows the increase, with and without LiM functionalities, with respect to the standard technology memory case.
Data show that the Execution time increases on average in the range of +40% to +80%. This analysis has not taken Area and Power into account; in [7] it is highlighted that Racetrack technology has a smaller cell size (≤ 2 F²) than classic SRAM (100–200 F²) and DRAM cells (4–8 F²). Furthermore, Read energy, Write energy and Leakage Power are all lower with respect to SRAM and DRAM.
In addition, many operations are carried out by means of magnetic interactions (i.e. data shifting and NAND/NOR computations), thus it is reasonable to expect a lower power consumption.
It is clearly visible that the Execution time for the Racetrack LiM case is reduced and that the percentage improvements are similar to the Std-LiM-mem ones. The Execution time is still higher due to the multiple clock cycles required for memory accesses.
Table 4.5 shows the results for the standard programs.
The Execution time is reduced only for the xnor_net.c program, because it is the only one involving range operations; in this case the percentage improvement is even higher than in the standard memory case. For the bitmap_search.c program the Racetrack memory shows a worse behaviour due to the high number of OR operations involved.
Table 4.6 reports the comparison between adopting and not adopting LiM functionalities for both memory types.
Results suggest that programs which heavily use range operations obtain a better improvement in terms of Execution time: the bitwise.c and bitwise_inv.c programs, considering the Racetrack case, go from +50/52% to +58.79% in the LiM case, a remarkable difference. For the xnor_net.c program the improvement is marginal because the number of elements used for the simulation is small; a higher improvement is expected using a larger value of N.
Table 4.7 reports the exact Execution times derived from VCS simulations for custom programs.
Now, for both programs, the performance is similar to the original one; the Execution time is slightly higher for the Racetrack memory case due to the built-in NAND/NOR operations performed in two clock cycles. In absolute terms the Execution time increases in all cases due to the longer system clock period.
In Table 4.8 all the results for standard programs are shown.
Also in this case the performance is basically the same for all the analyzed cases. The considerations for the bitmap_search.c program are the same as in the previous Subsection.
Table 4.9 shows the comparison between adopting and not adopting LiM functionalities.
Thanks to the new system clock frequency and the modifications applied to the Racetrack memory, in most of the programs the overhead due to the Racetrack memory is almost negligible (below 8% in the worst case). The Racetrack memory suffers when multiple NAND/NOR (AND/OR) operations are involved in program execution; in this case the latency is double with respect to a standard access or another logic operation, so it is important to also take into account the types of LiM operations implemented in the software.
4.5 Racetrack organization analysis
Different design solutions were explored, each focusing on a different design aspect:
• MU parameters (Type 1): in this category the MU design parameters are modified to achieve a larger logic store parallelism. The number of bits per Racetrack ranges from 1 to 32, but the ports are arranged so that the port alignment is always performed in a single clock cycle. In fact, for some configurations, the total number of ports remains stable as the number of bits per Racetrack increases. For example, the configuration 1 × 8_1024_2048 − 32 corresponds to a MU with 8 bits per Racetrack, 1024 Racetracks, 2048 ports and a store parallelism equal to 32 words. The number of ports is doubled because in this way it is possible to perform a port alignment in the second half of the Racetrack with the same latency.
These configurations are quite different from the other two: here a single MU is used and the concept of Block differs from the one explained previously, because the Block can be considered as the whole MU;
• Active ports (Type 2): here a single 32-bit word is accessed within a Block; the difference is the possibility to access multiple words in parallel by activating the same word-line in all the available Blocks. This is the solution adopted in the final design of this Thesis and it requires a different word organization. The store parallelism depends on the available Blocks: the larger the memory, the larger the store parallelism. In all these configurations the Block structure is assumed to be the one implemented in this Thesis, thus it contains 32 words. For this category the case in which a larger memory size is used was also analyzed;
• Block parallelism (Type 3): the idea is to exploit the ports already available within each Block. Instead of accessing a single word within a Block, it is possible to access multiple words inside the same Block; this feature is combined with the possibility to access multiple Blocks in parallel. For example, a memory composed of 8 Blocks in which it is possible to access 2 words per Block would result in a logic store parallelism equal to 16 words.
Also in this case a different word organization is required, and the Block structure is again assumed to be the one implemented in this Thesis.
All the configurations are represented with a code in the following format: Mtype-Type-N×Nb_Nr_Np-NLSP.
• Mtype = Memory type
• Type = Test type
• N = Number of Blocks
• Nb = Number of bits per Racetrack
• Nr = Number of Racetracks
• Np = Number of ports per Racetrack
• NLSP = Logic store parallelism expressed in terms of accessed words
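As a reading aid, the Type 1 example above, 1 × 8_1024_2048 − 32, can be decoded into the fields below; the struct is only a sketch used to spell out the naming, and the memory-type label is a placeholder:

/* Fields of a configuration code Mtype-Type-N x Nb_Nr_Np-NLSP.            */
struct rt_config {
    const char *mtype;       /* Mtype: memory type (label is a placeholder) */
    int type;                /* Type:  test type (1, 2 or 3)                */
    int n_blocks;            /* N:     number of Blocks                     */
    int bits_per_racetrack;  /* Nb:    bits per Racetrack                   */
    int n_racetracks;        /* Nr:    number of Racetracks                 */
    int n_ports;             /* Np:    ports (doubled in Type 1 so that     */
                             /*        alignment always takes one cycle)    */
    int store_parallelism;   /* NLSP:  logic store parallelism, in words    */
};

/* The Type 1 example from the text: 1 x 8_1024_2048 - 32.                  */
static const struct rt_config example_type1 = { "RT", 1, 1, 8, 1024, 2048, 32 };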
In the following, a brief analysis of the three design solutions is given. For all the analyses the same FSM algorithm described in the previous section is assumed, together with the configuration in which the Core works at 25 MHz and the Racetrack at 100 MHz; thus accesses require one or at most two clock cycles.
Multiple analyses are shown below; parameter N ranges from 5 to 256 and refers to the number of words accessed in parallel during a parallel logic store. Execution Time (in terms of clock cycles) is used as the comparison parameter between the different configurations; this metric was estimated with the same methodology adopted in the previous analyses, taking into account the different logic store parallelism.
Configurations with a lower NLSP are expected to have a higher Execution time because fewer words can be processed in parallel. In all cases the first plot is taken with N equal to the value used in the real tests; then N is increased starting from 32 up to 256 (an operation on the whole memory). Using N equal to the real parameter gives an idea of the actual configuration, while increasing N helps to understand the performance of each configuration.
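This expectation can be summarized with a simple counting sketch: assuming that a range logic store over N words is split into ⌈N/NLSP⌉ accesses and that each access costs one or, at most, two Racetrack clock cycles (the per-access cost below is a placeholder, not the exact FSM latency):

/* Rough cycle-count model: a range logic store over n_words words needs
 * ceil(n_words / nlsp) accesses on a configuration with logic store
 * parallelism nlsp; cycles_per_access is 1 or 2 in the assumed setup.     */
static unsigned int range_store_cycles(unsigned int n_words,
                                       unsigned int nlsp,
                                       unsigned int cycles_per_access)
{
    unsigned int accesses = (n_words + nlsp - 1u) / nlsp;  /* ceil division */
    return accesses * cycles_per_access;
}

/* For example, with two cycles per access:
 *   range_store_cycles(256,  8, 2) = 64 clock cycles
 *   range_store_cycles(256, 32, 2) = 16 clock cycles                      */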
In Table 4.10 a summary of all the analyzed configurations is given.
The following Figures show the Execution time, expressed in terms of clock cycles, as parameter N changes. Results show that for N equal to 5, as in the original program, the performance is almost the same in all cases. The LiM architecture presented in [5] has the best performance because all the operations are performed with the same latency. The worst result for the LiM approach is the serial one (as in the first Racetrack design), which is reasonable because parallel accesses are serialized.
On the right y-axis the memory size is shown; for configurations of Type 2 the size increases due to the higher number of Blocks. Thus, for a limited N, in general lower than 8 (the minimum logic store parallelism, with the exception of the serial configuration), Execution times are more or less the same as in the reference LiM case, with only a small overhead brought by the well-known operations which are performed directly in memory.
As parameter N increases, the performance depends strongly on parameter NLSP. The behaviour is the same in all the plots: configurations with a higher NLSP show a lower Execution time because they can perform accesses on a wider range of words, while the other configurations require multiple accesses to complete the range operations.
Chapter 5
The aim of this Thesis was to implement an open and configurable Logic-in-Memory framework capable of supporting different types of memory implemented with standard and novel technologies. The Thesis work followed two tasks: the first was to expand the already available LiM structure integrated in a Microprocessor context, the second was to apply the concept of Logic-in-Memory to a novel technology such as Racetrack.
In the first part of the Thesis, the architecture was improved to execute a larger number of LiM functions; furthermore, the RISC-V GNU GCC compiler was modified to support the definition of new LiM instructions. This makes the architecture expandable and configurable for future improvements and decouples the framework from the hardware necessary to implement the LiM functionalities.
The second part of the Thesis focused on the design of a LiM architecture based on Racetrack technology exploiting the concept of pNML logic. Results are comparable to the original LiM system implemented with standard technology, even if the different access latency of bit-wise LiM operations should be taken into account during algorithm selection and implementation in order to speed up the execution.
Future work should focus on the study of new architectures based on the Racetrack and LiM concepts. This technology allows a high degree of flexibility, and this aspect should be investigated to find the best internal organization.
The expansion of the available LiM instruction set is another point to be taken into account. This would require further work on the RISC-V compiler in order to support new and more complex LiM operations.
Furthermore, another focus should be the analysis and comparison of new and more complex algorithms; this would make it possible to assess the effectiveness of the Racetrack technology in real-case scenarios.
Bibliography
[15] Roland Höller et al. «Open-Source RISC-V Processor IP Cores for FPGAs – Overview and
Evaluation». In: 2019 8th Mediterranean Conference on Embedded Computing (MECO).
2019, pp. 1–6. doi: 10.1109/MECO.2019.8760205.
[16] Bruce Jacob, Spencer Ng, and David Wang. Memory Systems: Cache, DRAM, Disk. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, pp. 1–4. isbn: 0123797519.
[17] Philip Koopman. Multi-Level Strategies. [Online; accessed 2022-04-04]. url: https://users.ece.cmu.edu/~koopman/ece548/handouts/10levels.pdf.
[18] Donghyuk Lee et al. «Tiered-latency DRAM: A low latency and low cost DRAM architecture».
In: 2013 IEEE 19th International Symposium on High Performance Computer Architecture
(HPCA). 2013, pp. 615–626. doi: 10.1109/HPCA.2013.6522354.
[19] Prerna Mahajan and Abhishek Sachdeva. «A study of encryption algorithms AES, DES and
RSA for security». In: Global Journal of Computer Science and Technology (2013).
[20] NAND Flash Controller. Reference Design RD1055. rd1055_01.2. Lattice Semiconductor Corporation. 2010. url: https://www.latticesemi.com/-/media/LatticeSemi/Documents/ReferenceDesigns/NR/NANDFlashControllerDesign-Documentation.ashx?document_id=34185.
[21] Northwest Logic Offers MRAM Controller IP compatible with Everspin’s ST-MRAM. https://www.everspin.com/sites/default/files/pressdocs/Northwest_Logic_and_Everspin_PR.pdf. [Online; accessed 2022-03-20].
[22] David A. Patterson and John L. Hennessy. Computer Organization and Design RISC-V
Edition: The Hardware Software Interface. 2nd. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 2020, p. 474.
[23] Andrei Pavlov and Manoj Sachdev. CMOS SRAM circuit design and parametric test in
nano-scaled technologies: process-aware SRAM design and test. Vol. 40. Springer Science &
Business Media, 2008, p. 14.
[24] Mohammad Rastegari et al. «Xnor-net: Imagenet classification using binary convolutional
neural networks». In: European conference on computer vision. Springer. 2016, pp. 525–542.
[25] Fabrizio Riente et al. «Parallel Computation in the Racetrack Memory». In: IEEE Transac-
tions on Emerging Topics in Computing 10.2 (2022), pp. 1216–1221. doi: 10.1109/TETC.
2021.3078061.
[26] Hadi R. Sandid. Adding Custom Instructions to the RISC-V GNU-GCC toolchain. [Online; accessed 2022-06-02]. url: https://hsandid.github.io/posts/risc-v-custom-instruction/.
[27] Giulia Santoro. «Exploring New Computing Paradigms for Data-Intensive Applications». PhD thesis. Politecnico di Torino, 2019.
[28] Jennifer Tran. Synthesizable DDR SDRAM Controller. XAPP200. v2.4. XILINX. 2002. url: http://www.cisl.columbia.edu/courses/spring-2004/ee4340/restricted_handouts/xapp200.pdf.
[29] Rangharajan Venkatesan et al. «Cache Design with Domain Wall Memory». In: IEEE
Transactions on Computers 65.4 (2016), pp. 1010–1024. doi: 10.1109/TC.2015.2506581.
[30] Jingcheng Wang et al. «A 28-nm Compute SRAM With Bit-Serial Logic/Arithmetic Opera-
tions for Programmable In-Memory Vector Computing». In: IEEE Journal of Solid-State
Circuits 55.1 (2020), pp. 76–86. doi: 10.1109/JSSC.2019.2939682.
[31] Wikipedia contributors. Cache inclusion policy. [Online; accessed 2022-04-05]. 2021. url: https://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.
[32] Ming-Chuan Wu and A.P. Buchmann. «Encoded bitmap indexing for data warehouses». In:
Proceedings 14th International Conference on Data Engineering. 1998, pp. 220–230. doi:
10.1109/ICDE.1998.655780.
[33] Kai Yang, Robert Karam, and Swarup Bhunia. «Interleaved logic-in-memory architecture
for energy-efficient fine-grained data processing». In: 2017 IEEE 60th International Midwest
Symposium on Circuits and Systems (MWSCAS). 2017, pp. 409–412. doi: 10.1109/MWSCAS.
2017.8052947.
[34] Hongbin Zhang et al. «Performance analysis on structure of racetrack memory». In: 2018
23rd Asia and South Pacific Design Automation Conference (ASP-DAC). 2018, pp. 367–374.
doi: 10.1109/ASPDAC.2018.8297351.
[35] Xinmiao Zhang and K.K. Parhi. «High-speed VLSI architectures for the AES algorithm». In:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12.9 (2004), pp. 957–967.
doi: 10.1109/TVLSI.2004.832943.