An Efficient SRAM-based Reconfigurable Architecture For Embedded
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2018.2812118, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
embedded systems. In this paper, we present an efficient reconfigurable architecture to implement soft-core embedded processors in SRAM-based FPGAs by using characteristics such as low utilization and fragmented accessibility of comprising units. To this end, we integrate the low utilized functional units into efficiently designed Look-Up Table (LUT) based Reconfigurable Units (RUs). To further improve the efficiency of the proposed architecture, we used a set of efficient Configurable Hard Logics (CHLs) that implement frequent Boolean functions while the other functions will still be employed by LUTs. We have evaluated effectiveness of the proposed architecture by implementing the Berkeley RISC-V

[Figure: soft-core processor stages (Fetch, Decode, Execution, Write Back) shown with FPGA logic blocks and switch matrices]
0278-0070 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
are power inefficient. These shortcomings obstruct the use of such architectures in the majority of embedded systems.

In this paper, we propose an area- and power-efficient reconfigurable architecture for soft-core embedded processors by exploiting the low utilization and fragmented accesses of functional units. The proposed architecture is motivated by the large area and high power consumption of the low-utilized functional units in SRAM-based soft-core processors. Such an architecture, however, necessitates comprehensive profiling to determine the accessibility of processor functional units across a wide range of applications in order to (a) specify which functional units should be replaced with a reconfigurable component and (b) determine how to efficiently integrate these units. For this purpose, we first profile a wide range of benchmarks to determine infrequently-used and fragmentally-accessed units that are the major contributors to the logic static power and, hence, are promising candidates to be merged into an individual Reconfigurable Unit (RU). Afterwards, to further improve the area and power efficiency of the proposed RU, we propose an algorithm to efficiently integrate these components as shared RUs. The proposed algorithm aims to minimize the number of configuration cells and logic resources in the RU by exploiting the similarities along with the non-uniform distribution of Boolean functions in the components. These homomorphic-structure functions are implemented as a set of small area-size Configurable Hard Logics (CHLs). To our knowledge, this is the first study that, in a fine-grained manner, improves the area and static power efficiency of FPGA-based soft-core processors using the concept of spatiality in component utilization. The proposed architecture can be employed in tandem with other methods that intend to architecturally improve soft-core processors, e.g., multi-threading [26], power gating [27]–[29], or ISA customization [30], as they are orthogonal to our proposed architecture.

To evaluate the efficiency of the proposed architecture, we first execute MiBench benchmarks [31] on the Berkeley open-source RISC-V processor [8] to obtain the access patterns and utilization rate of the RISC-V units. Afterwards, using an in-house script, we identified the candidate units to be integrated in an optimal number of RUs. Berkeley ABC [32] and Altera Quartus [33] are exploited for synthesizing the selected units and integrating them into a shared RU. Thereafter, the latest version of COFFE [34] is exploited to generate efficient transistor sizing used in HSPICE. Finally, we use Verilog-To-Routing (VTR) 7.0 [35] fed with the parameters obtained from HSPICE and Design Compiler to place and route the processors and obtain the design characteristics. Furthermore, we examine the generality of the proposed architecture on an alternative open-source processor (LEON2) as well. Experimental results demonstrate that our proposed architecture improves the area, static power, energy, critical path delay, and total execution time of RISC-V by 30.7%, 32.5%, 36.9%, 9.2%, and 6.3%, respectively, compared to the conventional modular LUT-based soft-core processors. Examining the applicability of the proposed architecture on LEON2 revealed that the area, static power consumption, and energy of LEON2 are improved by 23.1%, 23.2%, and 23.2%, respectively.

The novel contributions of this work are as follows:
• We present a comprehensive design space exploration among prevalent infrastructures for datapath computational units of a processor to determine the proper infrastructure for embedded systems. This suggests integration of low utilized and fragmentally-accessed components into shared RUs.
• For the first time, we propose a metric for the efficient integration of units in elaborately designed RUs based on component area, utilization ratio, and configuration time.
• By comprehensive characterization of the most frequent and homomorphic-structure functions, we propose an area- and power-efficient synthesis algorithm which exploits efficient CHLs for implementing a significant portion of functions in the proposed RUs. To the best of our knowledge, this is the first effort which investigates the applicability of CHLs in either soft or hard processors.
• We investigate the generality of the proposed architecture on an alternative embedded processor (LEON2) as well. In addition, to examine the insensitivity of the shared RU, we use both academic and commercial synthesis tools.

The rest of this paper is as follows. The related work is reviewed in Section II. The motivation behind the proposed architecture is explained in Section III. We articulate the proposed architecture in Section IV. Details of the experimental setup and results are elaborated in Section V. Finally, Section VII concludes the paper.

II. RELATED WORK

Different approaches have been proposed to reduce the area and power of soft-core processors as well as to increase their performance, which can be classified in two categories. Approaches in the first category employ three techniques: (a) static or (b) dynamic power gating of components of the execution stage [27]–[29], [36], and (c) dual threshold voltage to reduce power. Static power gating is applied offline at configuration time. Therefore, the power efficiency of this technique is restricted to permanently-off components, which depend on the target application. Dynamic power gating is applied online, i.e., during the application runtime, so it may have a higher opportunity for power reduction. In practice, however, dynamic power gating encounters several shortcomings that hinder the practicality of this technique. We will discuss advantages of the proposed architecture over the dynamic power gating method in Section V.

As the second category, using reconfigurable devices to implement contemporary processors is proposed. Previous studies in this category can be classified as coarse-grained and fine-grained architectures. The latter allows bit-level operations, while the former enables relatively more complicated operations. In the coarse-grained architectures [20]–[23], the reconfigurable devices are used as a co-processor to speed up computations and enhance the parallelism degree and energy efficiency by loop unrolling. The role of compiling algorithms, i.e., hardware-software partitioning, considering today's complex many-core architectures and multi-programming, can adversely affect the efficiency of such architectures [37]. Thus, the aforementioned architectures impose the considerable area overhead of a profiler and force compilers to depart from their straightforward flow to support branch prediction, which may not always be correct in interactive and time-dependent embedded applications [23].

In fine-grained architectures, one or several types of instructions are selected to be executed on the reconfigurable platform. The main challenge in such architectures is choosing computation-intensive instructions. In this regard, CHIMAERA [24] has proposed integrating an array of Reconfigurable Functional Units (RFUs), which are capable of performing 9-input functions, into the pipeline of a superscalar processor. The authors also proposed a C compiler that attempts to map the instructions into RFUs [24]. The OneChip architecture, which integrates an RFU into the pipeline of a superscalar RISC processor, has been proposed in [25]. The RFU is implemented in parallel with the Execute and Memory Access stages and includes one or more FPGA devices and a controller. Thus, it can be internally pipelined and/or parallelized. In addition to significant area and power overheads imposed by multiple reconfiguration memories and additional circuitries, design complexity (e.g., modifying the compiler) is another drawback of the OneChip architecture. A heterogeneous architecture exploiting RFUs based on Spin-Transfer Torque Random Access Memory (STT-RAM) LUTs for the hard-core general-purpose
TABLE I
SUMMARY OF EXPLOITING RECONFIGURABLE FABRICS IN PROCESSORS

Architecture | Granularity | Modification | Structure (Topology) | Improvement (✓) / Design Overhead (✗)
MorphoSys [20] | Coarse-grained | Not mentioned | 8x8 array of 32-bit ALU/multiplier, register file, modified TinyRISC processor, and memory unit | ✓ Performance; ✗ No compiler support, no area and power evaluation
ADRES [21] | Coarse-grained | Developing specific compiler framework | VLIW processor with reconfigurable matrix and register file | ✓ Performance; ✗ Large configuration overhead, limited benchmarks, no area and power evaluation
PipeRench [22] | Coarse-grained | Developing specific compiler framework | Array of Processing Elements (PEs) contains 8-bit ALU (LUT and carry chain) and register file | ✓ Reconfiguration time; ✗ Performance (PEs communicate through global I/O), large bitstream, no area and power evaluation
Warp [23] | Coarse-grained | Not mentioned | ARM7 microprocessor with an FPGA contains 32-bit MAC, loop control hardware, and data address | ✓ Performance and energy due to loop unrolling; ✗ 3-4X area overhead
CHIMAERA [24] | Fine-grained | Modified GCC compiler | Dynamically scheduled out-of-order superscalar processor contains 32 rows of 4-LUTs, carry chain, shadow register file, cache, and control unit | ✓ Performance; ✗ No area and power evaluation
OneChip [25] | Fine-grained | Developing specific simulator framework | Superscalar pipeline processor integrated with RFU (parallel to EXE and MEM stages) | ✓ Performance; ✗ No power evaluation
[18], [19] | Fine-grained | Reconfiguration and task migration algorithm | Replacement of ASIC-based low utilized functional units with STT-RAM 3-LUTs in IBM PowerPC | ✓ Performance and power; ✗ Area overhead
Proposed | Fine-grained | No modification required | Integrating low utilized units in shared reconfigurable units in embedded processors (RISC-V and LEON2) | ✓ Area, power, performance, and energy; ✗ Reconfiguration overhead (slight)
been reported, considerable area and power overhead is expected due to larger area and higher power consumption of a reconfigurable functional unit compared with custom ASIC or standard cell based components. More importantly, substantial modifications in both compiler and processor architecture, especially bus lines and datapath, are required which are neglected in this study. A similar approach along with thermal-aware reconfiguration of functional units has been proposed in [19], wherein computations of thermally hotspot functional units are migrated to cooler units in order to balance the

[Figure block labels: Instruction Fetch, Decoder, Write Back, FPU Unit (FP-Single [MUL/ADD/SUB], FP-Double [MUL/ADD/SUB], FP-DIV/SQRT, FP-FP2FP, FP-INT2FP, FP-FP2INT), Cache Controller, icache, Memory Controller, Page Table Word, Interrupt Manager, DMA, Outer Memory System Interface]
Fig. 2. RISC-V processor (area of each block is normalized to the chip area)

III. MOTIVATION
The main prerequisite for integrating several functional units into a single shared RU is low utilization and fragmented access of each unit. Low utilization is a crucial constraint, since implementing even a few highly utilized components as a single RU is inefficient: frequent access to such components leads to substantial execution time and power penalties. Fragmented access is another important criterion, which facilitates integration of several functional components into a single RU. To comprehend this, assume two components C1 and C2 wherein an arrow (→) denotes a need for reconfiguration. While both scenarios result in the same utilization rate, obviously, the former requires fewer reconfigurations.

TABLE II
COMPONENTS UTILIZATION AND AREA COMPARISON IN RISC-V

Group | Computational Unit | Operation Description | # 4-LUTs (area ratio in processor core without memories) | Utilization
Integer Unit | ALU | Integer Addition and Subtraction | 1150 (2.4%) | 44.8%
Integer Unit | IU-MUL/DIV | Integer Multiplication and Division | 4573 (9.5%) | 1.8%
Floating Point Unit | FP-Double [MUL/ADD/SUB] | Double Precision Multiplication, Addition, and Subtraction | 10918 (22.9%) | 0.16%
Floating Point Unit | | 32/64-bit Integer Numbers to [...] (IEEE 754-2008 standard) | |
Floating Point Unit | FP-DIV/SQRT | Single/Double Precision Division and Square Root Operations | 9269 (4.9%) | 0.001%
Other Pipeline Stages | Fetch | Instruction Fetch Unit | 11140 (23.3%) | -
Other Pipeline Stages | Decode | Instruction Decode Unit | 1974 (4.1%) | -
Other Pipeline Stages | Write Back | Write Back Unit | 571 (1.2%) | -
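The point that two access patterns with equal utilization can differ in reconfiguration count can be illustrated with a small sketch. The two access sequences below are hypothetical (the scenarios from the text are not reproduced here), and the sketch assumes that the first accessed component is already configured in the shared RU.

```python
def count_reconfigurations(accesses):
    """Count how often a shared RU must be reloaded for an access sequence.

    Assumption: the first accessed component is already configured, so only
    a change of component between consecutive accesses costs a reload.
    """
    return sum(1 for prev, cur in zip(accesses, accesses[1:]) if prev != cur)


def utilization(accesses, component):
    """Fraction of accesses that target the given component."""
    return accesses.count(component) / len(accesses)


# Hypothetical access sequences for two components sharing one RU.
bursty = ["C1", "C1", "C1", "C2", "C2", "C2"]       # C1 C1 C1 -> C2 C2 C2
interleaved = ["C1", "C2", "C1", "C2", "C1", "C2"]  # C1 -> C2 -> C1 -> ...

for name, seq in (("bursty", bursty), ("interleaved", interleaved)):
    print(name, utilization(seq, "C1"), count_reconfigurations(seq))
```

Both sequences give each component a utilization of 0.5, yet the bursty sequence needs a single reconfiguration while the interleaved one needs five.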
A. RISC-V Processor
We have examined these conditions by running MiBench benchmarks on the RISC-V processor. RISC-V is a 64-bit, in-order, open-source embedded processor suited for FPGA devices [38], [39], which consists of six pipeline stages, wherein its execution stage is equipped with both integer and floating point units, as summarized in Table II. Fig. 2 illustrates RISC-V components proportional to their area footprint on the silicon die. As shown in this figure, functional units occupy a substantial portion (more than 44%) of total processor area (including caches). The area of caches and memory cells is calculated using CACTI (version 6.5) [40]. We modelled the default RISC-V cache configuration in CACTI for both the I-cache and the D-cache, which both have 16 KB size with 16-byte block size, 4-way associativity, 32-bit input/output address and data bus, and tag size of 20 for I-cache and 22 for D-cache. Other parts of the processor such as fetch, decode,
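As a quick consistency check on the cache parameters quoted above, the following sketch derives the tag width of a set-associative cache from the address width, capacity, block size, and associativity. It uses the textbook address split (tag, set index, block offset), which is an assumption about how the 20-bit figure was obtained; the 22-bit D-cache tag is not reproduced by this simple formula, and the text does not state what the additional bits cover.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, returned as an integer."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def tag_bits(address_bits, cache_bytes, block_bytes, ways):
    """Tag width under the textbook split: tag | set index | block offset."""
    sets = cache_bytes // (block_bytes * ways)
    return address_bits - log2_exact(sets) - log2_exact(block_bytes)


# 16 KB, 16-byte blocks, 4-way, 32-bit address: 256 sets, 8 index bits, 4 offset bits.
print(tag_bits(32, 16 * 1024, 16, 4))  # 20
```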
[Figure labels: high-frequent instruction types 1 to K-1, low-frequent instruction types K to N, LUT components #1 to #M1, CHL-LUT components #1 to #M1, CHL-LUT-RU]
Fig. 9. The proposed CHL-LUT-RU architecture overview

CHLs instead of conventional LUTs. Previous studies have shown that the 4-LUT based FPGA architecture has the minimum area among different LUT sizes [42]. Nonetheless, even in such a small-input LUT, the majority of the LUT structure remains underutilized in a broad range of circuits [43]–[45]. For example, a considerable portion of functions do not necessarily have 4 inputs, so half of the LUT configuration bits and multiplexer structure, or more, is wasted. Even if all inputs are used, a small set of specific functions is repeated significantly more often than others. Therefore, there is no need to allocate the generous flexibility of 16 SRAM cells to implement them. The high number of SRAM cells in LUT-based logic blocks can exacerbate the area and power consumption in FPGAs and soft-core processors. In recent studies, several fine-grained architectures have been proposed that use small reconfigurable logic cells solely or along with conventional LUTs [43]–[45]. CHLs are logic blocks that can implement the majority of functions with considerably smaller area and less delay as compared to their LUT counterparts. However, the main goal of these studies is enhancing design constraints in FPGA platforms, while their applicability in soft-core processors is not evaluated.

Accordingly, we have proposed an area-power optimization approach, i.e., CHL-LUT-RU, where for each RU, we synthesize each component (which is assigned to the target RU) into 4-input LUTs by means of the Berkeley ABC synthesis tool [32] to obtain 4-input functions and then extract their Negation-Permutation-Negation (NPN) class representation. To elaborate, F = AB̄ + CD and G = BC + AD are NPN-equivalent since each function can be obtained from the other by negating B and permuting A and C. Therefore, such functions can be implemented with the same homomorphic logic structure augmented with configurable inverters at its inputs and output. Our investigation, reported in Table IV, has revealed that a significant portion (66.8%) of functions in the proposed RUs

our investigation has revealed that CHL1 and CHL2 can implement 68% of functions in OR1200 in the VTR benchmark suite and 59% of functions in the TV80 benchmark in IWLS'05. It is also noteworthy that any generic function, even one unidentified at design time, can be implemented using pure CHLs. However, in such non-LUT designs there will be an overhead in the number of used logic blocks (CHLs) due to the elimination of 4-LUTs and the exclusive use of CHLs, which are less flexible than LUTs in one-to-one implementation of some functions.

Fig. 10 illustrates the proposed Configurable Logic Block (CLB), wherein the cluster size (N) is equal to 10 and the CHL1:CHL2:LUT ratio is equal to 4:3:3, based on observations obtained from Table IV. Each proposed CLB reduces area by 28.6% compared to a homogeneous LUT-based CLB. The proposed CHL-LUT-RU approach iterates over the functions of components mapped to an RU and examines whether each function could be substituted with either one of the CHLs. Otherwise, it will remain as a 4-LUT.

D. Insensitivity of Proposed Architecture to Different Synthesis Tools
In addition to an academic synthesis tool (Berkeley ABC), we repeat the mapping flow with a commercial tool (Altera Quartus Integrated Synthesis (QIS) tool [33]) to investigate the sensitivity of the results to the CAD flow. In this regard, QUIP targeting a Stratix IV device is exploited, which directly synthesizes a VHDL, Verilog, or SystemVerilog circuit to hierarchical BLIF. Our investigation reveals that the coverage ratio of CHLs is almost the same as that obtained by the academic synthesis tool (Berkeley ABC). The comparison of the CHL coverage ratio in both Berkeley ABC and Altera Quartus is detailed in Table IV. It is noteworthy that the default synthesis constraint of ABC is area optimization, while Quartus aims to reduce the delay, which results in larger area (number of LUTs) and therefore power consumption [33]. To be consistent with other studies, we evaluate our proposed architecture using netlists of Berkeley ABC.

V. EXPERIMENTAL SETUP AND RESULTS
Here, we elaborate the implementation and evaluation flow of the proposed and baseline architectures. According to the overall evaluation flow illustrated in Fig. 11, we first examine the access pattern to each of the datapath components to identify low-utilized and

TABLE IV
COVERAGE RATIO OF 4-INPUT NPN-CLASSES IN LOW-UTILIZED COMPONENT FUNCTIONS IN RISC-V PROCESSOR
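The NPN equivalence of F = AB̄ + CD and G = BC + AD, on which the class counts of Table IV rely, can be checked with a brute-force sketch. It enumerates every input permutation, input negation, and output negation of a 4-input truth table and takes the smallest resulting table as a class representative. This exhaustive search is only an illustration; it is not the extraction procedure of ABC or of the authors, and the function names are hypothetical.

```python
from itertools import permutations

N = 4  # number of LUT inputs


def truth_table(func):
    """Pack func(a, b, c, d) into a 16-bit integer; bit i holds the output for input pattern i."""
    tt = 0
    for i in range(1 << N):
        bits = [(i >> k) & 1 for k in range(N)]
        if func(*bits):
            tt |= 1 << i
    return tt


def transform(tt, perm, in_neg, out_neg):
    """Truth table of g(x) = out_neg XOR f(y), where y[k] = x[perm[k]] XOR bit k of in_neg."""
    result = 0
    for i in range(1 << N):
        j = 0
        for k in range(N):
            bit = ((i >> perm[k]) & 1) ^ ((in_neg >> k) & 1)
            j |= bit << k
        result |= (((tt >> j) & 1) ^ out_neg) << i
    return result


def npn_canonical(tt):
    """Smallest truth table in the NPN orbit of tt (4! * 2^4 * 2 = 768 candidates)."""
    return min(
        transform(tt, perm, in_neg, out_neg)
        for perm in permutations(range(N))
        for in_neg in range(1 << N)
        for out_neg in (0, 1)
    )


F = truth_table(lambda a, b, c, d: (a and not b) or (c and d))  # F = A.B' + C.D
G = truth_table(lambda a, b, c, d: (b and c) or (a and d))      # G = B.C + A.D
print(npn_canonical(F) == npn_canonical(G))  # True: same NPN class
```

Functions that share a representative could, in principle, share one hard logic structure with configurable inverters on its inputs and output, which is the property CHLs exploit.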
[Figure: logic block diagram with BLEs (BLE 1, BLE 5), D-FFs, and multiplexers]

[Figure: evaluation flow. Block labels: Processor HDL Description; MiBench Application; GCC Cross-Compiler (cross-compiled to the RISC-V ISA); RISC-V Full System Simulator (Spike); Component Utilization Extraction and Active and Leisure Periods; Computational Unit Decomposition (Xilinx ISE); Synthesis to 4-Input Functions (Berkeley ABC / Altera Quartus); Resource Utilization Rate and Analysis; Candidate Components; Component Mapping Algorithm; Automated Transistor Sizing Tool (COFFE 2.1); HSPICE (45nm PTM); Capacitance, Resistance, and Delay; Architecture File Development; Area, Delay, and Power Estimation]
D. Application Execution Time
To calculate the execution time, we multiplied the execution time of each instruction by its corresponding component delay. In addition, for the proposed architectures, we added the reconfiguration time by calculating the write latency of the total number of configuration bits (sum of logic and routing) with the method proposed in [18] (see footnote 2). Accordingly, we assume a 128-bit parallel data bus [18] and a write latency of 0.2 ns for a single SRAM cell [51]. Thus, the reconfiguration times of RU1, RU2, and RU3 in the LUT-RU architecture will be $\frac{424619 \times 0.2\,\mathrm{ns}}{128} = 0.66\,\mu\mathrm{s}$, $\frac{1481035 \times 0.2\,\mathrm{ns}}{128} = 0.23\,\mu\mathrm{s}$, and $\frac{337531 \times 0.2\,\mathrm{ns}}{128} = 0.52\,\mu\mathrm{s}$, respectively, in which 424619, 1481035, and 337531 are the configuration bit counts of the largest component in each RU. Due to the fact that the configuration time is prorated

Footnote 2: It is noteworthy that configuring one component to another in [18] has been calculated based on the configuration cells of the smallest component, which is not correct.

Fig. 14. Comparing the power and energy consumption of different architectures implementing the RISC-V processor

It is noteworthy that the proposed architecture improves on the power-gating approach in different aspects. First, the most challenging issue of power gating is the so-called inrush current, i.e., a substantial wake-up current that occurs whenever a power-gated module turns on abruptly, which may cause violation scenarios such as instability of the register contents and functional errors resulting from dynamic voltage (IR) drop [52]. It can also increase the wake-up time and wake-up energy. Previous studies have reported that inrush power contributes to 20% of the static power in FPGAs with 45nm technology size [53]. Second, power gating requires a complicated controller and compiling techniques to (a) anticipate the idle intervals and evaluate whether the power gating is advantageous in spite of associated overheads, (b) predict data arrival in order to initiate the wake-up phase and assure that only a small fraction of resources is switched each time to preclude timing (and thereby, functional) errors, and (c) route signals
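The reconfiguration-time arithmetic of Section V-D can be reproduced with a short sketch that evaluates bits × write latency / bus width for the three quoted bit counts. The constant and function names are illustrative. Note that, as printed, the RU2 operands evaluate to roughly 2.31 µs rather than the stated 0.23 µs, so either the bit count or the result appears to carry a typo in the source; the sketch simply evaluates the formula on the printed values.

```python
BUS_WIDTH_BITS = 128          # parallel configuration bus width assumed in the text
SRAM_WRITE_LATENCY_NS = 0.2   # write latency of a single SRAM cell assumed in the text


def reconfiguration_time_us(config_bits):
    """Reconfiguration time in microseconds: bits x latency / bus width."""
    return config_bits * SRAM_WRITE_LATENCY_NS / BUS_WIDTH_BITS / 1000.0


# Configuration bit counts of the largest component in each RU, as printed in the text.
largest_component_bits = {"RU1": 424619, "RU2": 1481035, "RU3": 337531}

for ru, bits in largest_component_bits.items():
    print(f"{ru}: {reconfiguration_time_us(bits):.3f} us")
```

Using the largest component per RU gives a worst-case bound on a single reconfiguration, which is the correction that footnote 2 makes relative to [18].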
TABLE VII
UTILIZATION, AREA, AND CHLS COVERAGE RATIO IN LEON2 PROCESSOR FUNCTIONAL UNITS

Functional Unit | Utilization | Area in RC/module (4-LUT) | Area in LUT-RU | Area in CHL-LUT-RU
ADD | 15.42% | 176 | 176 4-LUT | 176 4-LUT
MUL | 3.4 × 10^-4 % | 485 | 2498 4-LUT (shared with DIV) | 414 CHL1, 61 CHL2, and 2023 4-LUT (shared with DIV)
DIV | 9.49% | 2498 | (shared with MUL) | (shared with MUL)

[Figure block labels: TAP, Peri, Mem ctrl, Cache, Cache RAM 1 to RAM 4, DRAM 1, DRAM 2, FPU, IU, Adder, Multiplier, Divider]
Fig. 15. LEON2 processor microarchitecture (the area of each block is normalized to the chip area.)

Fig. 16. Utilization ratio of LEON2 components in MiBench benchmarks
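The area figures of Table VII can be cross-checked with a few lines of arithmetic. The sketch below takes the RC/module 4-LUT counts and the CHL-LUT-RU cell counts as they appear in the table; the variable names are illustrative, and the derived CHL fraction is only a cell-count ratio, not an area or power estimate.

```python
# 4-LUT counts of the LEON2 integer functional units in the RC/module baseline (Table VII).
rc_module_area = {"ADD": 176, "MUL": 485, "DIV": 2498}

total_area = sum(rc_module_area.values())
div_share_percent = 100.0 * rc_module_area["DIV"] / total_area

# Cell mix of the shared MUL/DIV reconfigurable unit in CHL-LUT-RU (Table VII).
shared_ru_cells = {"CHL1": 414, "CHL2": 61, "4-LUT": 2023}
shared_ru_total = sum(shared_ru_cells.values())
chl_fraction_percent = 100.0 * (shared_ru_cells["CHL1"] + shared_ru_cells["CHL2"]) / shared_ru_total

print(f"DIV share of functional-unit area: {div_share_percent:.1f}%")
print(f"Cells in shared RU: {shared_ru_total}, of which CHLs: {chl_fraction_percent:.1f}%")
```

The divider share evaluates to about 79.1% of the functional-unit LUTs, and the CHL1, CHL2, and 4-LUT counts sum to the 2498 cells of the largest merged component.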
F. Alternative Processor
We have examined the applicability of the proposed architecture on an alternative open-source soft-core processor. We choose LEON2, which is SPARC-V8-compliant, as an alternative processor for several reasons. First, it is popular among a wide range of embedded processors and also compatible and extensible to be customized for a diverse range of performance/cost trade-offs, e.g., LEON2-FT. Second, the full HDL source code of total LEON2 (including FPU) is license-free, while the source code of the FPU is not available in LEON3 and LEON4, which causes problems in running applications with floating point instructions. Finally, in the RISC-V case study, the integrated functional units all belong to the FPU, but here in LEON2, as we will demonstrate in the following, all functional units are separate and thereby, integration of low-utilized units is more challenging.

Fig. 15 illustrates the synthesized floorplan of LEON2. As shown in this figure, LEON2 functional units include a 32-bit adder, a 32-bit multiplier (that can be configured to 32x32, 32x16, 32x8, and 16x16), a 32-bit divider, and an optional FPU. The FPU as a co-processor is licensed and therefore is not available in the processor source code. Therefore, we have omitted the FPU unit from our evaluations and focused on the integer functional units, which demonstrates the generality of the proposed architectures beyond the FPU. Accordingly, in the rest of this section, the reported results for the LEON2 processor are in the scope of the Integer Unit (IU).

To identify the utilization and access pattern of the functional units, each MiBench benchmark is cross-compiled to the SPARC-V8 architecture using RTEMS [55]. Afterwards, the Simics full system simulator [56] is used to obtain the utilization and access patterns of the functional units, which are shown in Table VII and Fig. 16. The utilization ratio of the integer divider (DIV) is approximately 3.48 × 10^-6 % while it occupies a significant area (79.1%) of total functional units. Therefore, it can be efficiently integrated with the integer multiplier (MUL) into a shared RU. Hence, the switching ratio of the proposed RU is 2.81 × 10^-6 on average, as demonstrated in Fig. 17. Subsequently, the reconfiguration overhead of the proposed RU will be small and prorated in application execution time, leaving negligible impact on the total execution time of applications. The rest of the evaluation flow for the proposed and baseline architectures in the LEON2 processor is the same as for the RISC-V processor.

1) Area: Fig. 18 demonstrates the number of LUTs and configuration bits in the proposed and baseline architectures. As shown in this figure, the number of LUTs and configuration bits in LUT-RU has been reduced by 23.1%, compared to the baseline RC/module architecture. The total area for LUT-RU is also reduced by 23.1%. The detailed results for each architecture are shown in Fig. 18.

Fig. 18. Comparing the number of LUTs and area of LEON2 components in the baseline and proposed architectures

2) Power and Energy Consumption: As demonstrated in Fig. 19, static power of the LUT-RU architecture is reduced by 23.2% as compared to the RC/module architecture. In addition, compared to
TABLE VIII
RESULTS FOR CHL-LUT-RU IN RISC-V AND LEON2 PROCESSORS NORMALIZED TO RC/MODULE (CONSIDERING LOGIC RESOURCES)
TABLE IX
S UMMARY OF THE RESULTS FOR VARIOUS SYNTHESIS TOOLS AND PROCESSORS ACROSS ALL BENCHMARK ( NORMALIZED TO RC/ MODULE ).
Parameter           | Area                                      | Critical Path                             | Execution Time                            | Energy                                    | Static Power
Processor           | RISC-V        RISC-V        LEON2         | RISC-V        RISC-V        LEON2         | RISC-V        RISC-V        LEON2         | RISC-V        RISC-V        LEON2         | RISC-V        RISC-V        LEON2
Synthesis Tool      | Berkeley      Altera        Berkeley      | Berkeley      Altera        Berkeley      | Berkeley      Altera        Berkeley      | Berkeley      Altera        Berkeley      | Berkeley      Altera        Berkeley
                    | ABC           Quartus       ABC           | ABC           Quartus       ABC           | ABC           Quartus       ABC           | ABC           Quartus       ABC           | ABC           Quartus       ABC
ASIC/inst (logic)   | 0.06          0.06          0.05          | 0.16          0.16          0.12          | 0.01          0.01          0.26          | 0.01          4.18 × 10^-3  0.3           | 0.41          0.26          0.25
ASIC/module (logic) | 0.01          0.01          0.01          | 1.8 × 10^-3   0.16          0.57          | 0.17          0.16          1.18          | 0.01          0.01          0.11          | 0.07          0.07          0.09
RC/all              | 0.29          0.24          0.64          | 0.93          1.02          0.99          | 41.48         29.57         46.81         | 0.1           0.08          0.64          | 4.2 × 10^-3   2.7 × 10^-3   0.03
RC/module           | 1             1             1             | 1             1             1             | 1             1             1             | 1             1             1             | 1             1             1
LUT-RU              | 0.69          0.71          0.77          | 0.91          0.97          1             | 0.94          0.99          1             | 0.63          0.69          0.77          | 0.68          0.69          0.77
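As a reading aid for the normalized values in Table IX, the following minimal Python sketch converts a normalized entry into a percentage reduction relative to the RC/module baseline (which is 1 by definition). The numbers are copied from the LUT-RU row for RISC-V synthesized with Berkeley ABC; the names `reduction_percent` and `summarize` are illustrative assumptions and are not part of the evaluation flow described in the paper.

```python
# Normalized LUT-RU results for RISC-V synthesized with Berkeley ABC,
# copied from the LUT-RU row of Table IX (RC/module baseline = 1).
LUT_RU_RISCV_ABC = {
    "area": 0.69,
    "critical_path": 0.91,
    "execution_time": 0.94,
    "energy": 0.63,
    "static_power": 0.68,
}


def reduction_percent(normalized: float) -> float:
    """Percentage reduction relative to the baseline (baseline = 1.0).

    A negative result means the value is larger than the baseline.
    """
    return round((1.0 - normalized) * 100.0, 1)


def summarize(row: dict) -> dict:
    """Map every parameter of a table row to its percentage reduction."""
    return {name: reduction_percent(value) for name, value in row.items()}


if __name__ == "__main__":
    for name, pct in summarize(LUT_RU_RISCV_ABC).items():
        print(f"{name}: {pct}% lower than RC/module")
```

Read this way, a normalized area of 0.69 corresponds to a 31% smaller area than RC/module, while an entry above 1 (for example, the RC/all execution time) yields a negative reduction, i.e., an increase over the baseline.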
[15] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, [36] M. Hosseinabady and J. L. Nunez-Yanez, “Run-time power gating in hy-
J. Wawrzynek, and K. Asanović, “Chisel: constructing hardware in a brid arm-fpga devices,” in Field Programmable Logic and Applications
scala embedded language,” in Proceedings of the 49th Annual Design (FPL), 24th International Conference on. IEEE, 2014, pp. 1–6.
Automation Conference. ACM, 2012, pp. 1216–1225. [37] J. Teich, “Hardware/software codesign: The past, the present, and
[16] A. A. Bsoul and S. J. Wilton, “An FPGA architecture supporting dy- predicting the future,” Proceedings of the IEEE, vol. 100, no. Special
namically controlled power gating,” in Field-Programmable Technology Centennial Issue, pp. 1411–1430, 2012.
(FPT), 2010 International Conference on. IEEE, 2010, pp. 1–8. [38] VectorBlox/risc-v. VectorBlox Computing Inc. [Online]. Available:
[17] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and https://github.com/VectorBlox/orca
D. Burger, “Dark silicon and the end of multicore scaling,” in ACM [39] Y. Lee, A. Waterman, R. Avizienis, H. Cook, C. Sun, V. Stojanović,
SIGARCH Computer Architecture News, vol. 39, no. 3. ACM, 2011, and K. Asanović, “A 45nm 1.3 ghz 16.7 double-precision gflops/w risc-
pp. 365–376. v processor with vector accelerators,” in European Solid State Circuits
[18] A. R. Ashammagari, H. Mahmoodi, and H. Homayoun, “Exploiting Conference (ESSCIRC), 2014-40th. IEEE, 2014, pp. 199–202.
STT-NV technology for reconfigurable, high performance, low power, [40] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0:
and low temperature functional unit design,” in Proceedings of the A tool to model large caches,” HP Laboratories, pp. 22–31, 2009.
conference on Design, Automation & Test in Europe. European Design [41] H. Wong, V. Betz, and J. Rose, “Comparing FPGA vs. custom CMOS
and Automation Association, 2014, p. 335. and the impact on processor microarchitecture,” in Proceedings of the
[19] A. R. Ashammagari, H. Mahmoodi, T. Mohsenin, and H. Homayoun, 19th ACM/SIGDA international symposium on Field programmable gate
“Reconfigurable STT-NV LUT-based functional units to improve perfor- arrays. ACM, 2011, pp. 5–14.
mance in general-purpose processors,” in Proceedings of the 24th edition [42] E. Ahmed and J. Rose, “The effect of lut and cluster size on deep-
of the great lakes symposium on VLSI. ACM, 2014, pp. 249–254. submicron fpga performance and density,” IEEE Transactions on Very
[20] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and Large Scale Integration (VLSI) Systems, vol. 12, no. 3, pp. 288–298,
M. C. Eliseu Filho, “Morphosys: an integrated reconfigurable system for 2004.
data-parallel and computation-intensive applications,” Computers, IEEE [43] I. Ahmadpour, B. Khaleghi, and H. Asadi, “An efficient reconfigurable
Transactions on, vol. 49, no. 5, pp. 465–481, 2000. architecture by characterizing most frequent logic functions,” in Field
[21] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, Programmable Logic and Applications (FPL), 2015 25th International
“ADRES: An architecture with tightly coupled VLIW processor and Conference on. IEEE, 2015, pp. 1–6.
coarse-grained reconfigurable matrix,” in Field Programmable Logic and [44] A. Ahari, B. Khaleghi, Z. Ebrahimi, H. Asadi, and M. B. Tahoori,
Application. Springer, 2003, pp. 61–70. “Towards dark silicon era in fpgas using complementary hard logic
[22] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and design,” in Field Programmable Logic and Applications (FPL), 2014
R. R. Taylor, “Piperench: A reconfigurable architecture and compiler,” 24th International Conference on. IEEE, 2014, pp. 1–6.
Computer, vol. 33, no. 4, pp. 70–77, 2000. [45] Z. Ebrahimi, B. Khaleghi, and H. Asadi, “PEAF: A Power-Efficient
[23] R. Lysecky, G. Stitt, and F. Vahid, “Warp processors,” in ACM Transac- Architecture for SRAM-Based FPGAs Using Reconfigurable Hard Logic
tions on Design Automation of Electronic Systems (TODAES), vol. 11, Design in Dark Silicon Era,” Computers, IEEE Transactions on, In press,
no. 3. ACM, 2004, pp. 659–681. 2017.
[46] (Accessed August, 2016) Behavioural simulation (spike). [Online].
[24] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, CHIMAERA:
Available: www.lowrisc.org/docs/untether-v0.2/spike
a high-performance architecture with a tightly-coupled reconfigurable
[47] D. Chai and A. Kuehlmann, “Building a better boolean matcher and sym-
functional unit. ACM, 2000, vol. 28, no. 2.
metry detector,” in Proceedings of the conference on Design, automation
[25] J. E. Carrillo and P. Chow, “The effect of reconfigurable units in
and test in Europe: Proceedings. European Design and Automation
superscalar processors,” in Proceedings of the 2001 ACM/SIGDA ninth
Association, 2006, pp. 1079–1084.
international symposium on Field programmable gate arrays. ACM,
[48] (Accessed March, 2016) Nangate Open Cell Library. [Online].
2001, pp. 141–150.
Available: www.nangate.com/
[26] M. Labrecque and J. G. Steffan, “Improving pipelined soft processors
[49] “Stratix-2 platform FPGA hand book,” Altera, April 2011.
with multithreading,” in Field Programmable Logic and Applications,
[50] “Virtex-4 platform FPGA user guide,” Xilinx, December 2008.
FPL. International Conference on. IEEE, 2007, pp. 210–215.
[51] A. Chen, J. Hutchby, V. Zhirnov, and G. Bourianoff, Emerging nano-
[27] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and electronic devices. John Wiley & Sons, 2014.
P. Bose, “Microarchitectural techniques for power gating of execution [52] A. A. Bsoul and S. J. Wilton, “A configurable architecture to limit
units,” in Proceedings of the 2004 international symposium on Low wakeup current in dynamically-controlled power-gated fpgas,” in Pro-
power electronics and design. ACM, 2004, pp. 32–37. ceedings of the ACM/SIGDA international symposium on Field Pro-
[28] S. Roy, N. Ranganathan, and S. Katkoori, “A framework for power- grammable Gate Arrays. ACM, 2012, pp. 245–254.
gating functional units in embedded microprocessors,” IEEE transactions [53] S. Sharp. Power management solution guide - xilinx. [On-
on very large scale integration (VLSI) systems, vol. 17, no. 11, pp. 1640– line]. Available: https://www.xilinx.com/publications/archives/solution
1649, 2009. guides/power management.pdf
[29] A. A. Bsoul, S. J. Wilton, K. H. Tsoi, and W. Luk, “An fpga architecture [54] Z. Seifoori, B. Khaleghi, and H. Asadi, “A power gating switch box
and cad flow supporting dynamically controlled power gating,” IEEE architecture in routing network of sram-based fpgas in dark silicon
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, era,” in Proceedings of the Conference on Design, Automation & Test
no. 1, pp. 178–191, 2016. in Europe. European Design and Automation Association, 2017, pp.
[30] P. Yiannacouras, J. G. Steffan, and J. Rose, “Exploration and cus- 1342–1347.
tomization of FPGA-based soft processors,” Computer-Aided Design of [55] (Accessed August, 2016) RTEMS Cross Compilation System (RCC).
Integrated Circuits and Systems, IEEE Transactions on, vol. 26, no. 2, [Online]. Available: www.rtems.org
pp. 266–277, 2007. [56] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg,
[31] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full
R. B. Brown, “MiBench: A free, commercially representative embedded system simulation platform,” Computer, vol. 35, no. 2, pp. 50–58, 2002.
benchmark suite,” in Workload Characterization, 2001. WWC-4. IEEE [57] S. Yazdanshenas, H. Asadi, and B. Khaleghi, “A scalable dependability
International Workshop on. IEEE, 2001, pp. 3–14. scheme for routing fabric of sram-based reconfigurable devices,” IEEE
[32] BLSG, “ABC: A system for sequential synthesis and verification,” Trans. VLSI Syst., vol. 23, no. 9, pp. 1868–1878, 2015.
Berkeley Logic Synthesis and Verification Group, 2011. [58] H. Asadi and M. B. Tahoori, “Analytical techniques for soft error rate
[33] J. Pistorius, M. Hutton, A. Mishchenko, and R. Brayton, “Benchmarking modeling and mitigation of fpga-based designs,” IEEE Trans. VLSI Syst.,
method and designs targeting logic synthesis for fpgas,” in Proc. IWLS, vol. 15, no. 12, pp. 1320–1331, 2007.
vol. 7, 2007, pp. 230–237. [59] W. Zhang, N. K. Jha, and L. Shang, “Low-power 3D nano/CMOS hybrid
[34] S. Yazdanshenas and V. Betz, “Automatic circuit design and modelling dynamically reconfigurable architecture,” ACM Journal on Emerging
for heterogeneous fpgas,” in Field-Programmable Technology (FPT), Technologies in Computing Systems (JETC), vol. 6, no. 3, p. 10, 2010.
2017 International Conference on. IEEE, 2017, pp. 9–16. [60] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K.
[35] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, Kim, and H. Esmaeilzadeh, “Tabla: A unified template-based framework
M. Nasr, S. Wang, Liu et al., “VTR 7.0: Next generation architecture for accelerating statistical machine learning,” in High Performance
and CAD system for FPGAs,” ACM Transactions on Reconfigurable Computer Architecture (HPCA), 2016 IEEE International Symposium
Technology and Systems (TRETS), vol. 7, no. 2, p. 6, 2014. on. IEEE, 2016, pp. 14–26.
Sajjad Tamimi received the B.Sc. and M.Sc. degrees in computer engineering from Iran University of
Science and Technology (IUST) and Sharif University of Technology (SUT), Tehran, Iran, in 2014 and
2016, respectively. He has been with the Data Storage, Processing, and Networks (DSN) Laboratory at
the Department of Computer Engineering, SUT, as a research assistant for two years. His current
research interests include reconfigurable computing, embedded system design, and computer
architecture.

Hossein Asadi (M'08, SM'14) received the B.Sc. and M.Sc. degrees in computer engineering from
SUT, Tehran, Iran, in 2000 and 2002, respectively, and the Ph.D. degree in electrical and computer
engineering from Northeastern University, Boston, MA, USA, in 2007.

He was with EMC Corporation, Hopkinton, MA, USA, as a Research Scientist and Senior Hardware
Engineer, from 2006 to 2009. From 2002 to 2003, he was a member of the Dependable Systems
Laboratory, SUT, where he researched hardware verification techniques. From 2001 to 2002, he was a
member of the Sharif Rescue Robots Group. He has been with the Department of Computer Engineering,
SUT, since 2009, where he is currently a tenured Associate Professor. He is the Founder and Director
of the Data Storage, Networks, and Processing (DSN) Laboratory, the Director of Sharif
High-Performance Computing (HPC) Center, the Director of Sharif Information and Communications
Technology Center (ICTC), and the President of Sharif ICT Innovation Center. He spent three months
in the summer of 2015 as a Visiting Professor at the School of Computer and Communication Sciences
at the École Polytechnique Fédérale de Lausanne (EPFL). He is also the co-founder of HPDS corp.,
designing and fabricating midrange and high-end data storage systems. He has authored and
co-authored more than eighty technical papers in reputed journals and conference proceedings. His
current research interests include data storage systems and networks, solid-state drives, operating
system support for I/O and memory management, and reconfigurable and dependable computing.

Dr. Asadi was a recipient of the Technical Award for the Best Robot Design from the International
RoboCup Rescue Competition, organized by AAAI and RoboCup, a recipient of the Best Paper Award at
the 15th CSI International Symposium on Computer Architecture & Digital Systems (CADS), the
Distinguished Lecturer Award from SUT in 2010, the Distinguished Researcher Award and the
Distinguished Research Institute Award from SUT in 2016, and the Distinguished Technology Award
from SUT in 2017. He is also a recipient of the Extraordinary Ability in Science visa from US
Citizenship and Immigration Services in 2008. He has also served as the publication chair of
several national and international conferences including CNDS2013, AISP2013, and CSSE2013 during
the past four years. Most recently, he has served as a Guest Editor of IEEE Transactions on
Computers, an Associate Editor of Microelectronics Reliability, a Program Co-Chair of CADS2015, and
the Program Chair of CSI National Computer Conference (CSICC2017).