Practical usage of Configurable DSPs in High Performance Systems
Marcus Binning, Application Engineering Director
High Performance Digital Systems Event, Sept 13th 2017
University of Warwick
The Case for Configurable, Custom Cores

• People pay for features / capabilities
– Rarely for implementations
• Easiest, fastest, most flexible way to deploy features – Software
– E.g. advanced imaging in cell phone, video stabilisation, voice control – valuable feature$
• We live in an energy-constrained world – the software must run on extremely power efficient platforms (cores, memory systems)
– From always-on triggers to highest performance datacentre…
• We live in a time-constrained world – rate of design has increased dramatically in the last few years
• One size does NOT fit all – what is optimal for large applications is not optimal for audio, communications, imaging, classification, voice triggering …
– It’s a HETEROGENEOUS, “OFFLOAD-ENABLED” world …
– Oh, and we need it NOW ..

2 © 2017 Cadence Design Systems, Inc.


What affects the efficiency of Software?

• Cycle counts – the higher the count, the longer the software takes to complete
– Targeting the right instructions and number representations is crucially important
• Sufficient local register storage
– Access to registers has a very low energy cost. Insufficient registers can be a major efficiency loss
• Efficient Memory System
– This varies greatly from system to system
– With large software sets, a sophisticated cache-based or DMA-based system may be appropriate
– Support for prefetch of instructions and/or data may be beneficial
– With self-contained, very small, low-power systems the complete opposite may apply
– Local tightly coupled memory may be appropriate
• Connecting accelerators – place on the bus or go point-to-point?
• Efficient Architecture must span all these
– (and more, of course – debug, trace ..)
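
As a rough illustration of why number representation matters for cycle counts, the sketch below multiplies and accumulates values in Q15 fixed point using only integer operations. The Q15 format and the names q15_t, q15_mul and q15_dot are assumptions chosen for illustration; the slides do not name a specific format.

```c
#include <stddef.h>
#include <stdint.h>

/* Q15 (assumed format): signed 16-bit value with 15 fractional bits,
 * representing a fraction in the range [-1, 1). */
typedef int16_t q15_t;

/* Multiply two Q15 values with round-to-nearest and saturation.
 * Note: right-shifting a negative value is implementation-defined in C;
 * mainstream compilers perform an arithmetic shift, which is assumed here. */
static inline q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product fits in 32 bits */
    p += 1 << 14;                         /* add half an LSB to round */
    p >>= 15;                             /* scale back to Q15 */
    if (p > INT16_MAX)                    /* only (-1) * (-1) can overflow */
        p = INT16_MAX;
    return (q15_t)p;
}

/* Dot product of two Q15 arrays; the Q30 result is kept in a wide
 * accumulator so intermediate sums cannot overflow. */
static inline int64_t q15_dot(const q15_t *restrict x,
                              const q15_t *restrict y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)y[i];
    return acc;
}
```

Each multiply here is one integer multiply, an add and a shift. On a core without a floating point unit, the equivalent float operation would typically be emulated in software at a much higher cycle cost.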



Comparative System View
Transferring Data from a HW engine to DSP

• Classic DMA based scheme, using AXI Slave port on the DSP
– Note - could also use integrated DMA on P5 - same performance
– Background task - should not interfere with current processing
– Can be limited by AXI width and clock rate (128b)
• Utilise Input Queue and FIFO to interface to external engine
– Queue input can be wider - e.g. same width as register file (512b)
– Minimum 4x raw bandwidth compared to "full rate" AXI fabric
– Queue can be accessed every cycle by instructions - greater bandwidth

[Figure: two block diagrams of a Vision P6 with iCache, AXI Master, AXI Slave and interleaved data memories (DataRam0, DataRam1). One connects the HW engine through a DMA, the other through an input queue (Queue In).]
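
The 4x figure follows from the two interface widths quoted on the slide. The sketch below works through that arithmetic, assuming both interfaces run at the same clock and complete one transfer per cycle; the function names are illustrative only.

```c
#include <stdint.h>

/* Peak bytes moved per cycle by an interface of the given width,
 * assuming one transfer completes every cycle. */
static inline uint32_t bytes_per_cycle(uint32_t width_bits)
{
    return width_bits / 8u;
}

/* Ratio of queue peak bandwidth to AXI peak bandwidth,
 * assuming both interfaces run at the same clock rate. */
static inline uint32_t bandwidth_ratio(uint32_t queue_bits, uint32_t axi_bits)
{
    return bytes_per_cycle(queue_bits) / bytes_per_cycle(axi_bits);
}
```

With a 512-bit queue and a 128-bit AXI port this gives 64 bytes per cycle against 16 bytes per cycle, a ratio of 4. Real AXI traffic rarely sustains one transfer per cycle, which is presumably why the slide calls 4x a minimum.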
Application Targeted Processor Portfolio from Cadence

Broad Range of Application Specific DSPs, plus Custom ISAs

• HiFi DSPs
– Audio Pre and Post-Processing
– Voice trigger
– Noise Reduction
– Audio Encode & Decode
• Fusion DSPs
– Auto Radar
– Always-alert Sensor processing
– Low-end Imaging
– Audio, Video and Speech
• ConnX DSPs
– Narrow to wide band Wireless
– LTE/LTE-A/5G
– WiFi, Smart Grid
– Infrastructure & Terminals
• Vision DSPs
– Image Pre-/Post-Processing
– Convolutional Neural Networks (CNN)
– AR/VR
– ADAS
• Custom ISAs
– High Performance DSPs, NPUs, CPUs
– Application specific data types
– Custom ISA
– Special Functions

Automated User-Defined Customization


Interfaces, Instructions, State & Registers, Unique and Secure features
Xtensa® Processor Generator
Configurable and Extensible Common Foundation Technology of all Tensilica Processors
Example – Computer Vision



Emerging use cases: Cameras everywhere
• Mobile: HDR, Video Stabilizer, Face Detection
• Automotive: Traffic Sign Recognition, Gesture Control
• Drone: 3D Vision
• Security: People Detection
• Wearables and IoT



The basics of real-time neural networks
Training: Runs once per database, server-based, very compute intensive
• Server farm, working from a labeled dataset
• Selection of layered network
• Iterative derivation of coefficients by stochastic descent error minimization
• 10^16 to 10^22 MACs/dataset

Deployment (“Inference”): Runs on every image, device based, compute intensive
• Embedded
• Single-pass evaluation of input image
• Result: most probable label
• 10^8 to 10^12 MACs/image
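
To give a feel for where a per-image MAC count comes from, the sketch below counts the multiply-accumulates in a single convolutional layer. The formula is the standard one for a dense convolution; the function name and the example layer dimensions are hypothetical and not taken from the slides.

```c
#include <stdint.h>

/* MACs for one dense convolutional layer: every output element
 * (out_h x out_w x out_ch) needs kernel x kernel x in_ch multiply-accumulates. */
static inline uint64_t conv_layer_macs(uint64_t out_h, uint64_t out_w,
                                       uint64_t out_ch, uint64_t kernel,
                                       uint64_t in_ch)
{
    return out_h * out_w * out_ch * kernel * kernel * in_ch;
}
```

For a hypothetical layer with a 56 x 56 output, 64 input channels, 64 output channels and a 3 x 3 kernel, conv_layer_macs(56, 56, 64, 3, 64) returns 115,605,504, so one such layer is already on the order of 10^8 MACs. A full network sums this over all of its layers.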


Tensilica® Vision P6 DSP for Computer vision/imaging/CNN



Tensilica® Vision C5 DSP for Neural Networks



Bringing Efficient Programmability to any Design

Tensilica Processor
• CPU Strengths: General Purpose Control, Easy C Programming, Real-time Operating System Support
• Custom Strengths: Task-specific, Virtually unlimited I/O, Arbitrary computations, Differentiation
• DSP Strengths: SIMD, VLIW, Vectorization, N-way programming, Fixed and Floating Point

Optimized Instruction Set, Data Path and I/O Interfaces

One development flow with C programmability and full debug



Example – high performance programming



Source Code - popcount
Counts the number of '1's in a series of 64bit vectors
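
The slide's own source listing is not reproduced in this text. As a stand-in, here is a minimal pure C reference of the kind a vectorized version could be tested against. It treats each vector as one 64-bit word and returns a running total; both choices, and the function names, are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 64-bit word by clearing the lowest
 * set bit on each pass until the word is zero. */
static inline unsigned popcount64_ref(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Total number of set bits across a series of n 64-bit words. */
static inline uint64_t popcount_series_ref(const uint64_t *restrict v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount64_ref(v[i]);
    return total;
}
```

A reference like this is slow but easy to trust, which makes it a useful baseline when checking an optimized implementation on random inputs.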



Source Code - Explained (part 1)

• Declaration uses standard 'C' types
– Makes it easy to test with pure 'C' implementations - any cast to machine specific types done inside function
– Use of __restrict same as normal 'C' (== no pointer aliasing)
• New vector types specific to Vision P5
– Type xb_vec2Nx8U is a vector of 8bit unsigned quantities
– In this machine 'N' == 32 so vbP is a pointer to 64Byte items
• Compiler is 'taught' through TIE how to do pointer casting, arithmetic etc
• Declare some local variables
– Scope rules exactly the same as 'C'
– Type xb_vecNx16U is a 32-way vector of 16bit "unsigned" values
– In this machine there are 32 vector registers, so having lots of "live" variables will not cause spills
– Compiler "knows" how to align these vector variables correctly
• Vector instructions inferred by the compiler
– In this case vec0 is an array of 8bit unsigned values
– Therefore in this case compiler will infer a vector LOGICAL right shift (not ARITHMETIC).
• The "+" operator in this case means "Vector add"
– Because both "types" are unsigned 8bit vectors, a 64-way unsigned add will be inferred by the compiler
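
The behaviour described above can be modelled in portable C, one lane at a time. The sketch below is a scalar stand-in, not the Xtensa vector type: vec64x8u_model and the model_* functions are hypothetical names, and wrapping modulo 256 on the add is an assumption about the hardware operation.

```c
#include <stdint.h>

#define LANES 64  /* 2N lanes with N == 32, as stated on the slide */

/* Scalar stand-in for a 64-lane vector of unsigned 8-bit values. */
typedef struct { uint8_t lane[LANES]; } vec64x8u_model;

/* Fill every lane with the same value. */
static inline vec64x8u_model model_splat(uint8_t value)
{
    vec64x8u_model r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = value;
    return r;
}

/* Lane-wise unsigned add; each lane wraps modulo 256 (assumed). */
static inline vec64x8u_model model_add(vec64x8u_model a, vec64x8u_model b)
{
    vec64x8u_model r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* Lane-wise LOGICAL right shift: zeros shift in from the top because
 * the lanes are unsigned. Intended for shift counts 0..7. */
static inline vec64x8u_model model_shr(vec64x8u_model a, unsigned shift)
{
    vec64x8u_model r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (uint8_t)(a.lane[i] >> shift);
    return r;
}
```

Shifting a lane holding 0xF0 right by 4 yields 0x0F in this model. An arithmetic shift on a signed lane would instead replicate the sign bit, which is why the element type determines which instruction the compiler infers.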



Source Code - Explained (part 2)

• Pointer math understood by compiler
– vbP points to 64-Byte entities so compiler knows to increment by 64 automatically
• "DSEL" intrinsic: a very powerful shuffle instruction
– Complex operation takes 5 vector operands - 3 input, 2 output
– Can do arbitrary byte-oriented shuffling of 128 Bytes input to 128 Bytes output
• We don't explicitly use loads or stores
– Compiler knows how to compute array offsets for vbP and rP which are pointers to xb_vec2Nx8U. It will automatically schedule loads according to use in the loop, number of unrolls, register pressure etc
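
Two of the points above can be sketched in portable C. The first function shows that pointer arithmetic on a 64-byte type advances by 64 bytes, exactly as with any other C type. The second is a generic byte shuffle from 128 bytes to 128 bytes; it is only a conceptual model and does not reproduce the operand layout or select encoding of the actual DSEL intrinsic. All names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for a 64-byte vector type (not the Xtensa type). */
typedef struct { uint8_t lane[64]; } vec64_model;

/* Distance in bytes between p and p + 1: the compiler scales pointer
 * increments by sizeof(*p). p must point to a valid object. */
static inline ptrdiff_t stride_bytes(const vec64_model *p)
{
    return (const char *)(p + 1) - (const char *)p;
}

/* Generic byte shuffle: out[i] = in[sel[i]], with each index wrapped
 * into the range 0..127. out must not overlap in or sel. */
static inline void shuffle128_model(const uint8_t in[128],
                                    const uint8_t sel[128],
                                    uint8_t out[128])
{
    for (int i = 0; i < 128; i++)
        out[i] = in[sel[i] & 127u];
}
```

Choosing sel[i] = 127 - i reverses the input, and other select patterns give interleaves, de-interleaves or broadcasts from the same primitive.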



Benchmark and Analyze
Use Xplorer to Interactively Check Performance As You Build
• Find your critical loops and performance bottlenecks
• Check boxes to select pre-designed options
• Profile your software to see the critical loops and what operations are used most
• Analyze performance bottlenecks with pipeline view
• Instantly view PPA as you select options



Debugging

[Figure: annotated screenshot of the debugger. Callouts:]
• Debug Action Buttons: Resume, Pause, Terminate, HW Disconnect, HW Sync, Step Into, Step Over, Step Return, Instruction Stepping Mode
• Displays Processes & Call Stacks
• Source File
• Breakpoint Set/Clear
• Views for: Variables, Expressions, Registers, Breakpoints, TIE wires
• Views for: Console, Problems, Memory
Automation – the key to making it happen



Automated Tool, ISS, Model, RTL, and EDA Script Generation...

Inputs (Tensilica IP plus Customer IP):
• Base Processor
– Dozens of Templates for many common applications
• Pre-Verified Options
– Off-the-shelf DSPs, interfaces, peripherals, debug, etc.
• Optional Customization
– Create your own instructions, data types, registers, interfaces

Iterate in minutes!

Outputs:
• Complete Hardware Design
– Pre-verified Synthesizable RTL
– EDA Scripts
– Test suite…
• Advanced Software Tools
– IDE
– C/C++ Compiler
– Debugger
– ISS Simulator
– SystemC Models
– DSP code libraries



Simple Click Box Configurability

• From fine tuning of performance, power & area…


– Size, type, width & latency of memories
– Optional pre-fetch unit, extra registers…
– Load/Store unit characteristics
– Number of general purpose registers
– Number and priority levels of interrupts

• High-level application specific blocks


– Scalar or Vector Floating point, multiplier, divider…
– FLIX: Flexible length Instruction Extensions

• Select DSP ISAs..


– HiFi DSPs for Audio, Voice, and Speech
– ConnX BBE16/32/64EP DSPs for Baseband
– Vision DSPs for Imaging and Computer Vision and CNN
– Fusion DSPs for multi-purpose DSP, IoT, and Wearables
Extensibility: Customize To Your Task

• Simple Verilog-like language you can define…
– Input/output queues and ports
– Custom register files
– Fast lookup tables and local memories
– Simple single-cycle instructions
– Multi-cycle instructions
– SIMD for vectorization
– FLIX for grouping parallel operations into one instruction

Example: Define I/O Queues
Create three 256 bit queues (inA, inB, outC) and an “add” operation:

    queue inA 256 in
    queue inB 256 in
    queue outC 256 out
    operation ADD_XFER {} {in inA, in inB, out outC} {
        assign outC = inA + inB;
    }

High throughput without using system bus: 64 bytes in and 32 bytes out per operation

Example: Define a single-cycle instruction
pop_count counts the “1”s in a 32-bit register by adding the bits together. This simple Verilog-like code is all it takes to create both the pre-verified adder RTL (175 gates) and the instruction.

    operation pop_count {out AR co, in AR ci}{}{
        wire [3:0] a0 = ci[0] + ci[1] + ci[2] + ci[3] + ci[4] + ci[5] + ci[6] + ci[7];
        wire [3:0] a1 = ci[8] + ci[9] + ci[10] + ci[11] + ci[12] + ci[13] + ci[14] + ci[15];
        wire [3:0] a2 = ci[16] + ci[17] + ci[18] + ci[19] + ci[20] + ci[21] + ci[22] + ci[23];
        wire [3:0] a3 = ci[24] + ci[25] + ci[26] + ci[27] + ci[28] + ci[29] + ci[30] + ci[31];
        wire [5:0] sum = a0 + a1 + a2 + a3;
        assign co = {26'b0, sum};
    }

10x speedup: this simple instruction takes it down to just one cycle;
Best hand-coded ASM using standard instructions takes >10 cycles
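
For comparison, here is a C sketch of the same pop_count computation done two ways. pop_count_model mirrors the grouping in the TIE description (four per-byte sums, then a total) and can serve as a bit-exact check. pop_count_swar is a well-known shift, mask and add sequence that uses only standard operations; it is offered as one plausible software fallback, not as the hand-coded assembly the slide measured. Both function names are hypothetical.

```c
#include <stdint.h>

/* Mirrors the TIE description: sum the bits of each byte (a0..a3),
 * then add the four partial sums. The result fits in 6 bits. */
static inline uint32_t pop_count_model(uint32_t ci)
{
    uint32_t sum = 0;
    for (int byte = 0; byte < 4; byte++) {
        uint32_t a = 0;
        for (int bit = 0; bit < 8; bit++)
            a += (ci >> (8 * byte + bit)) & 1u;
        sum += a;
    }
    return sum;
}

/* Software fallback built from standard shift, mask, add and
 * multiply operations only. */
static inline uint32_t pop_count_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                    /* add the four bytes */
}
```

The fallback needs roughly a dozen dependent ALU operations per word, which is at least consistent with the slide's figure of more than 10 cycles for standard instructions.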
Xtensa Fully Automated Hardware and Software Tools Generation

[Figure: flow diagram of the Xtensa Processor Generator. Recoverable content:]

Inputs to the Xtensa Processor Generator:
• Choose processor template
• Set configuration options (optional)
• Custom Instructions (optional)

Xtensa Processor Generator Outputs:
• Hardware (leading to Fab / FPGA)
– RTL and EDA scripts
– Synthesis
– Block Place & Route
– Verification
– Chip Integration / Co-verification
• System Modeling / Design (System Development)
– Instruction Set Simulator (ISS)
– Fast Function Simulator (TurboXim)
– XTSC SystemC System Modeling
– XTMP C-based System Modeling
– Pin Level co-simulation
• Software Tools (Software Development)
– Xplorer IDE: Graphical User Interface to all tools
– GNU Software Toolkit (Assembler, Linker, Debugger, Profiler)
– Xtensa C/C++ (XCC) Compiler
– C Software Libraries
– Operating Systems

Application loop:
• Application Source C/C++
• Compile to Executable
• Profile using ISS
• Optimize using configuration options - or - Develop custom instructions
Comprehensive System Software
[Figure: layered software stack, top to bottom. Recoverable content:]
• Customer Application, alongside Customer libraries
• APIs: Register Init and Control, Interrupt Handlers, Loadable Libraries, Program Overlays, Remote Debug Access & Control
• Xtensa Software
– XTOS: Single-Thread Executive
– XOS: Multi-Thread Real Time Kernel
– Library Loader
– AXOM: Automatic Overlay Manager
– XMON: Debug Monitor
• Hardware Abstraction Layer (HAL)
• Customer Processor System
Summary

• Configurable Processor Technology allows you to get from concept …
• … to implementation …
• Rapidly
• With a rich set of Software tools
• And Models
• In a short amount of time
• With a low engineering effort.
