A Proposed RISC Instruction Set Architecture for the MAC Unit of 2014
Abstract—A multiplier-accumulator (MAC) is a dedicated hardware unit that performs a common operation: computing the product of two numbers and adding that product to an accumulator. In digital signal processing applications, which consist largely of convolution operations, the MAC unit contributes greatly to system performance. This work presents an implementation of a MAC unit based on the proposed RISC instruction set architecture (ISA) of the 32-bit VLIW fixed-point DSP processor core presented in our previous work. The computational unit is designed to be flexible for 32-bit, 16-bit, and 8-bit data computations. The implementation is verified to function correctly both in Modelsim simulation and on an Altera Cyclone II (2C35) FPGA board.

Keywords—Digital Signal Processors, Multiply, Accumulate, VLIW, RISC.

I. INTRODUCTION

Digital signal processing is increasingly important for real-world applications such as communications [1]-[2], medical imaging [3]-[4], radar and sonar [5], high-fidelity music reproduction [6], and oil prospecting [7]. As applications become more complex, processing digital signals efficiently makes a system more attractive. A digital signal processor (DSP) is a specialized microprocessor with an architecture optimized for the operational needs of digital signal processing. DSP algorithms and functions determine the appropriate architecture for the processor. Although DSP processors have changed considerably over the past few decades, most of them still share common features: multiple memory banks with independent buses, specialized instruction sets, addressing modes, control, and peripherals. Modern DSP architectures can be divided into three or four categories (generations) [8].

Conventional DSP processors issue and execute one instruction per clock cycle. They use complex, multi-operation instructions. These processors typically include a single multiplier (MAC unit) and an ALU, but few additional execution units. Typical processors in this category include Analog Devices' ADSP-21xx family, Texas Instruments' TMS320C2xx family, and Motorola's DSP560xx family. DSP processors like the Motorola DSP563xx and Texas Instruments TMS320C54x operate at higher clock speeds and often include a modest amount of additional hardware (barrel shifter, instruction cache) to improve performance in common DSP algorithms. These processors also tend to have deeper pipelines.

Another DSP generation was built by extending conventional DSP architectures, for instance by adding parallel execution units such as a second multiplier and adder. The hardware extensions are typically accompanied by an extended instruction set that allows multiple operations to be encoded in a single instruction and executed in parallel. DSP processors in this category often have wider data buses, allowing them to fetch more data words per clock cycle. They can also use wider instruction words to integrate parallel operations within a single instruction. The downside of these DSP processors is the difficulty of assembly language programming.

Multi-issue processors use very simple instructions that typically encode a single operation. These processors achieve a high level of parallelism by issuing and executing instructions in parallel groups rather than one at a time. Using simple instructions simplifies instruction decoding and execution, allowing multi-issue processors to execute at higher clock rates than conventional or enhanced conventional DSP processors. The two sub-categories of this architecture are VLIW (Very Long Instruction Word) and superscalar. The biggest difference between them is how instructions are grouped for parallel execution.

Recently, an implementation of a 16-bit RISC-based DSP processor was proposed in [10]. In this design, all logical and arithmetic operations are carried out by a single ALU, which consists of three sub-units: MAC, LOGIC, and ARITH. This design is not effective in terms of parallel computation: when the MAC unit is busy, the ARITH unit is idle, and vice versa.

Therefore, in this paper, an implementation of a separate MAC unit is proposed. This unit supports only MAC and related multiplication operations; other arithmetic operations are handled by the ALU. Moreover, the novelty of this design is that the MAC unit supports multiple data widths: in a single operation, it can compute one 32-bit MAC, two independent 16-bit MACs, or four independent 8-bit MACs.
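The multi-width operation described in the introduction (one 32-bit, two 16-bit, or four 8-bit MAC operations) can be illustrated with a small behavioural model. This is only a sketch and not the hardware design of this work: it assumes unsigned lanes, one accumulator per lane, and wrap-around at twice the lane width, none of which is specified in this section. The names split_lanes and packed_mac are hypothetical.

```python
def split_lanes(word, lane_bits):
    """Split a 32-bit word into unsigned lanes, least significant lane first."""
    mask = (1 << lane_bits) - 1
    return [(word >> shift) & mask for shift in range(0, 32, lane_bits)]


def packed_mac(acc, src1, src2, lane_bits):
    """Behavioural model of one multi-width MAC step (hypothetical helper).

    acc       -- list of per-lane accumulators (1, 2 or 4 entries)
    src1/src2 -- 32-bit packed operands
    lane_bits -- 32, 16 or 8
    Assumption: lanes are unsigned and each accumulator wraps at
    2 * lane_bits bits.
    """
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    a = split_lanes(src1 & 0xFFFFFFFF, lane_bits)
    b = split_lanes(src2 & 0xFFFFFFFF, lane_bits)
    if len(acc) != len(a):
        raise ValueError("need one accumulator per lane")
    acc_mask = (1 << (2 * lane_bits)) - 1
    # Each lane multiplies and accumulates independently of the others.
    return [(s + x * y) & acc_mask for s, x, y in zip(acc, a, b)]


# Two 16-bit lanes: low lane 3 * 5, high lane 2 * 4.
print(packed_mac([0, 0], 0x00020003, 0x00040005, 16))  # [15, 8]
```

The same two operand words are reinterpreted by changing only lane_bits, which mirrors the idea that one unit serves all three data widths.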
3) Parallel Execution

Four instructions are fetched at a time, forming a 128-bit fetch packet. The execution of each instruction in a packet is determined by its p-bit. The p-bit (bit 0) determines whether an instruction is executed in parallel with another instruction. If the p-bit of instruction i is 1, then instruction i+1 is executed in parallel with (in the same cycle as) instruction i. If the p-bit of instruction i is 0, then instruction i+1 is executed in the cycle after instruction i. Therefore, the last p-bit in a fetch packet is always set to 0. In assembly language, parallel execution is denoted by the || symbol before an instruction, which specifies that it executes in parallel with the previous one.

Instr. A
|| Instr. B
|| Instr. C
|| Instr. D

An execute packet consists of all instructions executing in parallel. Each instruction in the execute packet must be handled by a different functional unit. The p-bit pattern of the four instructions in a fetch packet can result in an execution sequence that is fully parallel, fully serial, or partially serial.

With SLD/SST and MAC instructions executing in parallel, some signal processing computations can achieve a throughput of one instruction cycle. For example, in a convolution, SLD can load two operands while MAC performs the multiplication and accumulation for the previous ones. The following example shows the advantage of SLD, MAC, and parallel execution support in convolution.

Convolution in C language:

conv = conv + x[i] * h[i];

Convolution in assembly language without super load, MAC, and parallel execution requires four sequential instructions as follows:

LDW A0, A7[i]
LDW B0, B7[i]
MPY A0, A0, B0
ADD A1, A1, A0

However, only two concurrent instructions are necessary with MAC, super load, and parallel execution support:

SLDW A0, B0, i
|| MAC A0, B0

4) Instruction to Functional Unit Mapping

TABLE I. MAPPING BETWEEN INSTRUCTIONS AND FUNCTIONAL UNITS

FALU MAC BALU LSU
ABS AVG2 ADD ADD
ABS2 AVGU4 ADDK ADD2
ADD BITC4 ADD2 ADDB(H/W)
ADDU BITR ADDKPC ADDAD
ADD2 DEAL AND AND
ADD4 ROTL ANDN ANDN
AND SHFL B disp LDB(U)
ANDN XPND2 B reg LDH(U)
COS XPND4 BDEC LDW
COSC MAC(U) BNOP LDB(U) offset
MAX MAC(S) BPOS LDH(U) offset
MAX2 MAC2(U) CLR LDW offset
MAXU4 MAC2(S) CMPEQ(2/4) MVK
MIN MAC4(U) CMPGT(U/2/4) OR
MIN2 MAC4(S) CMPLT(U/2/4) STB(U)
MINU4 MACN2 EXT(U) STH(U)
MVK MMAC MVK STW
NEG MPY(U) OR STB(U) offset
NOT MPY(S) NEG STH(U) offset
OR NOT STW offset
SIN SADD(SU/US/2) SUB2
SINC SADDU4 XOR
SADD SHLMB SLDB(U)
SAT SHR2 SLDH(U)
SUB(U) SHRMB SLDW
SUB2 SHRU2 STB(U)
SUB4 SUB2 STH(U)
SUBABS4 SWAP2 STW
SWAP XOR
SWAP4
XOR

TABLE I lists all of the proposed instructions according to functional unit. The opcode map and the proposed data path design for the MAC unit are presented in the next subsections.

5) Instruction to MAC unit

Fig. 4. Opcode for MAC unit.

Fig. 4 illustrates the overall opcode map for all operations on the MAC unit. The notations used in the opcode map are described in TABLE II.

TABLE II. INSTRUCTION OPERATION & EXECUTION NOTATIONS

Symbol      Description
cr          conditional registers: instruction executed based on z value
z           zero or non-zero
dst         destination operand
src2        second source operand
src1        first source operand
const       constant operand
rsv         reserved
func. code  functional code: contains up to 64 instructions
00000       opcode: identifies MAC instructions
p           parallel execution

The MAC unit is aimed at multiply and accumulate instructions. Specifically, it covers a multiplication group (MPY and its extensions) and a multiplication and accumulation group (MAC(2/4), MMAC, etc.). Additionally, the unit includes a functional block for a bit-oriented group (BITC4, BITR, etc.). The common addition and subtraction instructions are also supported. The single-cycle data path design for these instruction groups is depicted in Fig. 5.

The MAC block in Fig. 5 can be implemented in the top-level architecture illustrated in Fig. 6. The interesting point in this architecture is the barrel shifter, which allows multiplication by a power of two (2^x) without using the heavy multiplier.
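Returning to the p-bit rule of the Parallel Execution subsection, the way a fetch packet breaks into execute packets can be sketched in a few lines. This is a behavioural illustration only, assuming instructions are given as (mnemonic, p-bit) pairs; the function name execute_packets is hypothetical and the check that each instruction in a packet uses a different functional unit is left out.

```python
def execute_packets(fetch_packet):
    """Group a fetch packet into execute packets using the p-bit rule.

    fetch_packet -- list of (mnemonic, p_bit) pairs in program order.
    If the p-bit of instruction i is 1, instruction i+1 joins the same
    execute packet; if it is 0, instruction i+1 starts a new one.
    """
    if fetch_packet and fetch_packet[-1][1] != 0:
        raise ValueError("the last p-bit in a fetch packet must be 0")
    packets, current = [], []
    for mnemonic, p_bit in fetch_packet:
        current.append(mnemonic)
        if p_bit == 0:  # the next instruction runs in the following cycle
            packets.append(current)
            current = []
    return packets


# Fully parallel, fully serial, and partially serial p-bit patterns.
print(execute_packets([("A", 1), ("B", 1), ("C", 1), ("D", 0)]))
print(execute_packets([("A", 0), ("B", 0), ("C", 0), ("D", 0)]))
print(execute_packets([("A", 1), ("B", 0), ("C", 1), ("D", 0)]))
```

Each inner list corresponds to one cycle, so the number of inner lists is the number of cycles the fetch packet takes to issue in this simplified model.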
Fig. 5. Single cycle data path design for MAC unit.

In general, the MAC unit is controlled by the signals listed in TABLE IV. The Instruction Decoder is in charge of generating those signals.

TABLE IV. CONTROL SIGNAL DESCRIPTION

Name      Type    Bit Width  Description
iClk      input   1          System clock
iReset_n  input   1          Reset signal
iMac      input   1          MAC/Bit-oriented unit selection signal
iOp       input   5          Operation signal
iFunc     input   5          Function signal
iSource1  input   32         First operand
iSource2  input   32         Second operand
oResult   output  64         Result

Fig. 7 describes the top-level architecture of the proposed MAC unit. The four multiply-accumulate instructions (MAC, MAC2, MAC4, and MACN2) feed the MAC Register block, and their results are stored in the 64-bit MACReg. As the proposed DSP processor is based on a 32-bit architecture, the MMAC instruction is needed to transfer the MACReg value to dst1 and dst2 in memory, according to TABLE III. Except for MPY and the multiply-accumulate instructions, oResult_High of all other instructions is masked with "0" bits. The 5-bit iOp and iFunc signals are used to design the signal mapping for the MAC unit, as described in the following subsections.

        MPY  MAC  AVG  ROTL
iOp[3]  1    0    0    0
iOp[2]  0    1    0    0
iOp[1]  0    0    1    0
iOp[0]  0    0    0    1
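The role of the 64-bit MACReg and the masking of oResult_High can be made concrete with a small numeric model. The sketch below is an assumption-laden illustration, not the RTL of this work: it treats operands as unsigned, lets the accumulator wrap at 64 bits, and does not say which half is written to dst1 or dst2, since TABLE III is not reproduced here. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mac_accumulate(mac_reg, src1, src2):
    """Add the unsigned 32x32 product to a 64-bit accumulator (wraps at 64 bits)."""
    return (mac_reg + (src1 & MASK32) * (src2 & MASK32)) & MASK64


def split_result(value64):
    """Return the (high, low) 32-bit halves of a 64-bit value."""
    return (value64 >> 32) & MASK32, value64 & MASK32


def narrow_result(value32):
    """Model a non-multiply instruction: the high half is forced to zero."""
    return 0, value32 & MASK32


# The largest unsigned 32-bit product needs both halves of the 64-bit result.
acc = mac_accumulate(0, 0xFFFFFFFF, 0xFFFFFFFF)
high, low = split_result(acc)
print(hex(high), hex(low))  # 0xfffffffe 0x1
```

The example shows why a 32-bit destination alone cannot hold a MAC result: the product of two 32-bit operands already occupies the upper half.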
According to this design, the 5-bit iOp signal indicates which instruction is executed. For example, the MPY instruction is executed only if the corresponding bit iOp[3] = 1. If an instruction includes sub-operations, such as 8-bit/16-bit/32-bit computation, iFunc is used to select among them, as described in TABLE VI.

cases for those operations. Notations used in Fig. 8 are explained in TABLE IX.
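The iOp selection rule (one bit per instruction group, for example iOp[3] = 1 for MPY) can be sketched as a simple decoder. This is a hedged illustration: the bit assignments for MPY, MAC, AVG, and ROTL come from the iOp mapping table, while the treatment of iOp[4] and of values with several bits set is an assumption, because this section does not describe them. The names IOP_GROUPS and decode_iop are hypothetical.

```python
# Bit position -> instruction group, taken from the iOp mapping table.
# iOp[4] is not listed in this section, so it is deliberately left out.
IOP_GROUPS = {3: "MPY", 2: "MAC", 1: "AVG", 0: "ROTL"}


def decode_iop(i_op):
    """Return the instruction group selected by a 5-bit iOp value.

    Assumption: iOp is one-hot, so exactly one bit may be set.
    """
    if not 0 <= i_op < 32:
        raise ValueError("iOp is a 5-bit signal")
    set_bits = [bit for bit in range(5) if (i_op >> bit) & 1]
    if len(set_bits) != 1:
        raise ValueError("assumed one-hot: exactly one iOp bit must be set")
    bit = set_bits[0]
    if bit not in IOP_GROUPS:
        raise ValueError("iOp[4] is not described in this section")
    return IOP_GROUPS[bit]


print(decode_iop(0b01000))  # MPY
print(decode_iop(0b00100))  # MAC
```

A one-hot encoding of this kind lets each functional block be enabled by a single wire, at the cost of using more bits than a binary code would need.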
Fig. 9. Input data and Golden output data.

Fig. 10. Simulation of MAC unit on Modelsim tool.

After the successful simulation in Modelsim, the design is synthesized with Altera Quartus II, targeting the Cyclone II EP2C35 FPGA device. The compilation report is shown in TABLE X. Verification on the FPGA for test cases 15 to 20 in Fig. 8 is completed by using the SignalTap II Logic Analyzer to capture the MAC signals, as presented in Fig. 11. In this test, the MAC unit runs at a clock of 40 MHz, and the captured waveform is identical to the simulation result.

TABLE X. COMPILATION REPORT

Resource  Logic Element  1633
          Register       203
Fmax                     42.17 MHz

Fig. 11. Verification on SignalTap II Logic Analyzer.

V. CONCLUSION

In this work, we have presented an implementation of the MAC unit according to the ISA proposed in our previous work [12]. To achieve a throughput of one instruction per clock cycle, pipelining will be addressed in our future work. In addition, the three remaining functional blocks (FALU, BALU, and LSU) should also be implemented to complete the design of the 32-bit VLIW DSP processor core. Moreover, the compiler and assembler need careful consideration, as they contribute directly to the effectiveness of the proposed instruction set.

ACKNOWLEDGMENT

This work was supported under Project 39/2013/HĐ-SKHCN by the Department of Science and Technology of HCM City.

REFERENCES

[1] Gatherer A., Stetzler T., McMahan M., and Auslander E., "DSP-based Architectures for Mobile Communications: Past, Present, and Future", IEEE Communications Magazine, Vol. 38, Issue 1, pp. 84-90, Jan. 2000.
[2] Xuan-Thuan Nguyen, QM-Dang Do, Hoang-Dat Tran, Huu-Thuan Huynh, and Cong-Kha Pham, "A PCIe-based FFT Implementation for High-speed Spectrum Analysis", Proc. 3rd IEICE Int. Conf. Integrated Circuits and Devices in Vietnam, pp. 126-131, Danang, Vietnam, Aug. 13-15, 2013.
[3] Yagi M. and Shibata T., "An Image Representation Algorithm Compatible with Neural-Associative-Processor-Based Hardware Recognition Systems", IEEE Trans. Neural Networks, Vol. 14, No. 5, pp. 1144-1161, Sep. 2003.
[4] Greenberg J.E., Delgutte B., and Gray M.L., "Hands-on Learning in Biomedical Signal Processing", IEEE Engineering in Medicine and Biology Magazine, Vol. 22, Issue 4, pp. 71-79, Aug. 2003.
[5] Titlebaum E.L., "Frequency- and time-hop coded signals for use in radar and sonar systems and multiple access communications systems", in Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993.
[6] Olswang B.S. and Cvetkovic Z., "Separation of Audio Signals Into Direct and Diffuse Soundfields for Surround Sound", in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006.
[7] Mottl V., Dvoenko S., Levyant V., and Muchnik I., "Pattern recognition in spatial data: a new method of seismic explorations for oil and gas in crystalline basement rocks", in Proc. of 15th International Conference on Pattern Recognition, 2000.
[8] Edwin J. Tan and Wendi B. Heinzelman, "DSP architectures: past, present and futures", SIGARCH Comput. Archit. News, Vol. 31, No. 3, pp. 6-19, June 2003.
[9] Donghoon Lee, Chanwon Ryu, Jusung Park, Kyunsoo Kwon, and Wontae Choi, "Design and implementation of 16-bit fixed point digital signal processor", IEEE International SoC Design Conference (ISOCC), Vol. 2, pp. II-61 - II-64, 2008.
[10] Xuan-Thuan Nguyen, Trong-Tu Bui, Huu-Thuan Huynh, Cong-Kha Pham, and Duc-Hung Le, "An ASIC Implementation of 16-Bit Fixed-Point Digital Signal Processor", International Conference on Advanced Computing and Applications (ACOMP), 2013.
[11] Khoi-Nguyen Le-Huu, Thanh T. Vu, Diem N. Ho, and Anh-Vu Dinh-Duc, "Towards a VLIW Architecture for the 32-bit Digital Signal Processor Core", in Proc. of the 5th FTRA Int. Conf. on Computer Science and its Applications (CSA-13), 2013.
[12] Khoi-Nguyen Le-Huu, Thanh T. Vu, Diem N. Ho, and Anh-Vu Dinh-Duc, "Towards a RISC Instruction Set Architecture for the 32-bit VLIW DSP Processor Core", to appear in Proc. of the IEEE Region 10 Technical Symposium (TENSYMP 2014), 2014.