
VLSI Programming 2016: Lecture 1

Course: 2IMN35

Teachers: Kees van Berkel [email protected]
          Rudolf Mak [email protected]

Lab: Kees van Berkel, Rudolf Mak, Alok Lele

www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/

Lecture 1: Introduction

1 19/04/16
Introduction to VLSI Programming: goals

•  to acquire insight in the description, design, and optimization of fine-grained parallel computations;
•  to acquire insight in the (future) capabilities of VLSI as an implementation medium of parallel computations;
•  to acquire skills in the design of parallel computations and in their implementation on FPGAs.
Contents

Massive parallelism is needed to exploit the huge and still increasing computational capabilities of Very Large Scale Integrated (VLSI) circuits:

•  we focus on fine-grained parallelism (not on networks of computers);
•  we assume that parallelism is by design (not by compilation);
•  we draw inspiration from consumer applications, such as digital TV, 3D TV, image processing, mobile phones, etc.;
•  we will use Field Programmable Gate Arrays (FPGAs) as a fine-grained abstraction of VLSI for practical implementation.
FPGA IC on a Xilinx XUP Board (Atlys)

[Photo: Xilinx Spartan 6 FPGA]

Atlys board, based on Xilinx Spartan 6

[Photo: Atlys board with the Xilinx Spartan 6 FPGA]
Lab work prerequisites

•  Laptop, running Windows
•  Exceed (can be obtained through the TU/e software distribution)
•  Access to UNIX server Dept. W&I (can be obtained through BCF)
•  Lab work is by teams of two students, with at least 1 Windows laptop.
•  Have FPGA tools (SW) installed on your machine by Tuesday April 26.
•  Check website 2IMN35.
VLSI Programming (2IMN35): time table 2016

Tuesdays: h5-h8, MF.07; Thursdays: h1-h4, Gemini-Z3A-08/10/13.
("in" = assignment handed in, "out" = assignment handed out; T = theory assignment, L = lab assignment.)

19-Apr  introduction, DSP graphs, bounds, …
21-Apr  pipelining, retiming, transposition, J-slow, unfolding (out: T1 + T2)
26-Apr  tools installed; introductions to FPGA and Verilog simulation; L1: audio filter (out: L1, L2)
28-Apr  (in: T1 + T2) unfolding, look-ahead, strength reduction; L1 cntd (out: T3 + T4)
3-May   folding; L2: audio filter on XUP board
5-May
10-May  (in: T3 + T4) DSP processors; L2 cntd (out: L3)
12-May  L3: sequential FIR + strength-reduced FIR
17-May  L3 cntd
19-May  L3 cntd (out: L4)
24-May  systolic computation (out: T5)
26-May  L4
31-May  (in: T5) L4: audio sample rate convertor
2-Jun   (in: L3) L4 cntd (out: L5)
7-Jun   L5: 1024x audio sample rate convertor
9-Jun   (in: L4) L5 cntd
14-Jun
16-Jun  (in: L5) deadline report L5
Course grading (provisional)

Your course grade is based on:

•  the quality of your programs/designs [30%];
•  your final report on the design and evaluation of these programs (guidelines will follow) [30%];
•  a concluding discussion with you on the programs, the report, and the lecture notes [20%];
•  intermediate assignments [20%].

Credits: 5 points, based on 140 hours of work from your side.
Note on course literature

The VLSI programming lectures are loosely based on:

•  Keshab K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley-Interscience, 1999.
•  This book is recommended, but not mandatory.

Accompanying slides can be found on:

•  http://www.ece.umn.edu/users/parhi/slides.html
•  http://www.win.tue.nl/~wsinmak/Education/2IMN35/

Mandatory reading:

•  Keshab K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, 9, 121-143 (1995), Kluwer Academic Publishers.
Introduction

•  Some inspiration from the technology side
   •  VLSI
   •  FPGAs
•  Some inspiration from the application side
   •  Machine Intelligence
   •  BEE, SKA, SETI
   •  Digital Signal Processing (Software Defined Radio)
•  Parhi, Chapters 1, 2
   •  DSP representation methods
   •  Iteration bounds
Some inspiration
from the technology side

Vertical cut through VLSI circuit

[Cross-section micrograph of a VLSI circuit]
Intel 4004 processor [1970]

•  1970
•  4-bit
•  2300 transistors
Apple A9 SoC (System on Chip)

•  2015
•  Production: Samsung/TSMC
•  14/16 nm FinFET
•  96/104.5 mm²
•  > 2B transistors
•  Assuming $0.1/mm² production costs
•  ⇒ 5 nano-$ per transistor
Flash memory

•  32 GB = 256 Gb
•  ≈ 100G transistors ⇒ << 1 n$ per transistor
Xilinx Kintex7 FPGA

•  2G transistors
•  1920 DSP slices
•  165 mm²
Stratix 10 FPGA from Altera (Intel)

•  > 10,000 FLOPs per clock cycle
•  @ nearly 1 GHz
Exa-scale computing: 10^18 FLOPs/sec

A scenario (year 2021):

•  10^18 FLOPs/sec = 10^9 arithmetic units running at 10^9 Hz
•  10^9 arithmetic units = 10^4.5 nodes × 10^4.5 arithmetic units
•  1 node = 32 TFLOPs/s “X” + 1 TB DRAM + “CPU”
•  @ 10 MW

Today (2016: “petaflop” era):

•  #1: Tianhe-2 (China): 34 × 10^15 FLOPs/sec, 10^4.5 nodes @ 24 MW
•  GPU (Nvidia GM200): 6 TFLOPs/sec
•  FPGA (Altera Stratix 10, GX2800): 9 TFLOPs/sec
A 2016 “node”

[Figure; source: Samsung]

[Figure; source: NVidia]
Moore’s Law: 50th anniversary in 2015!

Cost per Transistor over Time for Intel MPUs

[Chart: US$ per transistor over time, falling ×0.5 every 2 years; will the trend continue?]
Rule of two [Hu, 1993]

Every 2 generations of IC technology (6 years):

•  device feature size 0.5×
•  chip size 2×
•  clock frequency 2× (no longer true)
•  number of I/O pins 2×
•  DRAM capacity 16×
•  logic-gate density 4×
ITRS: International Technology Roadmap for Semiconductors

•  The overall objective of the ITRS is to present industry-wide consensus on the “best current estimate” of the industry’s research and development needs out to a 15-year horizon.
•  As such, it provides a guide to the efforts of companies, universities, governments, and other research providers/funders.
•  The ITRS has improved the quality of R&D investment decisions made at all levels and has helped channel research efforts to areas that most need research breakthroughs.
•  Involves over 1000 technical experts, world-wide.
•  A self-fulfilling prophecy? … or wishful thinking?


ITRS 2013

[Figure: ITRS 2013 roadmap overview]

2013 ITRS: MPU/ASIC Half Pitch and Gate Length Trends

[Chart: half-pitch and gate-length scaling trends]


Virtex 4 FPGA: 4VSX55

FPGA = Field Programmable Gate Array

•  Flexible logic: 6,144 CLBs, 500 MHz clock
•  Programmable multi-port RAM: 320 × 18 kbit
•  512 DSP slices
•  PowerPC™ @ 450 MHz
•  Differential I/O @ 1 Gbps
•  Serial transceivers @ 0.6-11.1 Gbps
Some inspiration
from the application side

All things grand and small [Moravec ‘98]

[Figure from Moravec 1998]
Chess Machine Performance [Moravec ‘98]

[Chart: chess machine performance over time]
Evolution computer power/cost [Moravec ‘98]

[Chart: brain-power equivalent per $1000 of computer, over time]
The Square Kilometer Array (SKA)

... the ultimate exploration tool
... and the ultimate software defined radio
The Square Kilometer Array (SKA)

•  antenna surface: 1 km² (sensitivity 50×)
•  large physical extent (3000+ km)
•  wide frequency range: 50 MHz – 30 GHz
•  full design by 2016; phase 1: 2021; phase 2: 2026
•  phase 1: 250 dishes (12 m) in the central 5 km
•  + dense and/or sparse aperture arrays
•  connected to a massive data processor by an optical fibre network
•  Software Defined Radio Astronomy
•  computational load ≈ 1 exa FLOPs/sec (10^18 FLOPs/s)
•  power budget = 20 MW (≈ 20 pJ/FLOP “all-in”)
References

•  Chip photos:
   •  http://www-vlsi.stanford.edu/group/chips.html
•  ITRS Roadmap:
   •  http://www.itrs.net/Links/2005ITRS/ExecSum2005.pdf
•  When will computer hardware match the human brain?
   •  http://www.jetpress.org/volume1/moravec.htm
•  BEE & Square Kilometer Array:
   •  http://bwrc.eecs.berkeley.edu/Research/BEE/
   •  http://seti.berkeley.edu/casper/papers/BEE2_ska2004_poster.pdf
   •  http://www.skatelescope.org/
VLSI Digital Signal Processing
Systems

Parhi, Chapters 1&2

DSP applications classes

[Chart: sample rate (Hz, log scale, 1 Hz to 10 GHz) versus complexity (# operations/sample, log scale). Low-rate, high-complexity applications: seismic modeling, control, speech; mid-range: audio, modems, video; high-rate: HDTV, radio, radar.]
Typical DSP algorithms

•  speech (de-)coding
•  speech recognition
•  speech synthesis
•  speaker identification
•  Hi-fi audio en/decoding
•  noise cancellation
•  audio equalization
•  ambient acoustic emulation
•  sound synthesis
•  echo cancellation
•  modem (de-)modulation
•  vision
•  image (de-)compression
•  image composition
•  beam forming
•  spectral estimation
•  etc.
Typical DSP kernels: FIR filters

•  Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
•  Finite Impulse Response (FIR) filters compute y(i):

   y(i) = Σ_{k=0}^{N-1} h(k) x(i-k) = (h * x)(i)

•  where
   •  x is the input sequence
   •  y is the output sequence
   •  h is the impulse response (filter coefficients)
   •  N is the number of taps (coefficients) in the filter
•  The output sequence depends only on the input sequence and the impulse response.
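The FIR sum above can be checked in a few lines of Python. This is an illustrative sketch (the name `fir` and the zero-initial-state convention are choices made here, not from the lecture):

```python
def fir(h, x):
    """Direct-form FIR: y[i] = sum_{k=0}^{N-1} h[k]*x[i-k]; x is taken as 0 before index 0."""
    N = len(h)
    return [sum(h[k] * x[i - k] for k in range(N) if i - k >= 0)
            for i in range(len(x))]

# An impulse input reproduces the impulse response h, padded with zeros:
print(fir([1, 2, 3], [1, 0, 0, 0]))  # -> [1, 2, 3, 0]
```

As the name "impulse response" promises, feeding a unit impulse through the filter returns the coefficients themselves.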
Typical DSP kernels: IIR filters

•  Infinite Impulse Response (IIR) filters compute:

   y(i) = Σ_{k=1}^{M-1} a(k) y(i-k) + Σ_{k=0}^{N-1} b(k) x(i-k)

•  The output sequence depends on the input sequence and the impulse response, as well as on previous outputs.

•  Adaptive filters (FIR and IIR) update their coefficients to minimize the distance between the filter output and the desired signal.
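A direct transcription of the recurrence, again as a sketch (names and the zero initial state are assumptions, not from the slide; note the feedback sum starts at k = 1, so `a[0]` is unused):

```python
def iir(a, b, x):
    """y[i] = sum_{k=1}^{M-1} a[k]*y[i-k] + sum_{k=0}^{N-1} b[k]*x[i-k].
    a[0] is a placeholder: the feedback sum starts at k = 1, as on the slide."""
    y = []
    for i in range(len(x)):
        acc = sum(b[k] * x[i - k] for k in range(len(b)) if i - k >= 0)
        acc += sum(a[k] * y[i - k] for k in range(1, len(a)) if i - k >= 0)
        y.append(acc)
    return y

# One-pole filter y[i] = 0.5*y[i-1] + x[i]: the impulse response decays
# geometrically and never becomes exactly zero, hence "infinite" impulse response.
print(iir([0, 0.5], [1], [1, 0, 0, 0]))  # -> [1, 0.5, 0.25, 0.125]
```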
Typical DSP kernels: DFT and FFT

The Discrete Fourier Transform (DFT) supports frequency-domain (“spectral”) analysis:

   y(k) = Σ_{n=0}^{N-1} W_N^{nk} x(n),   W_N = e^{-2jπ/N},   j = √-1

for k = 0, 1, …, N-1, where
•  x is the input sequence in the time domain (real or complex)
•  y is an output sequence in the frequency domain (complex)

The Inverse Discrete Fourier Transform (IDFT) is computed as

   x(n) = (1/N) Σ_{k=0}^{N-1} W_N^{-nk} y(k),   for n = 0, 1, …, N-1

The Fast Fourier Transform (FFT) and its inverse (IFFT) provide an efficient method for computing the DFT and IDFT.
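A naive transcription of the two sums makes the O(N²) cost visible (an FFT computes the same result in O(N log N)). This is a sketch with names chosen here; the 1/N scale factor in `idft` follows the conventional normalization so that the round trip recovers x:

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT: y[k] = sum_n W_N^(nk) * x[n], with W_N = exp(-2j*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(W ** (n * k) * x[n] for n in range(N)) for k in range(N)]

def idft(y):
    """IDFT with the conventional 1/N scale factor, so idft(dft(x)) == x."""
    N = len(y)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(W ** (-n * k) * y[k] for k in range(N)) / N for n in range(N)]

x = [1, 2, 3, 4]
assert max(abs(a - b) for a, b in zip(idft(dft(x)), x)) < 1e-9  # round trip
```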
Typical DSP kernels: DCT

The Discrete Cosine Transform (DCT) and its inverse (IDCT) are frequently used in video (de-)compression (e.g., MPEG-2):

   y(k) = e(k) Σ_{n=0}^{N-1} cos[ (2n+1)kπ / 2N ] x(n),   for k = 0, 1, …, N-1

   x(n) = (2/N) Σ_{k=0}^{N-1} e(k) cos[ (2n+1)kπ / 2N ] y(k),   for n = 0, 1, …, N-1

where e(k) = 1/√2 if k = 0; otherwise e(k) = 1.

An N-point 1D-DCT requires N² MAC operations.
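The forward/inverse pair above can be transcribed directly; the round trip confirms that the e(k) weighting and the 2/N factor make the two sums exact inverses (function names are chosen here for illustration):

```python
import math

def e(k):
    """Scale factor from the slide: 1/sqrt(2) for k = 0, else 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct(x):
    """y[k] = e(k) * sum_n cos((2n+1)k*pi/(2N)) * x[n]."""
    N = len(x)
    return [e(k) * sum(math.cos((2 * n + 1) * k * math.pi / (2 * N)) * x[n]
                       for n in range(N)) for k in range(N)]

def idct(y):
    """x[n] = (2/N) * sum_k e(k) * cos((2n+1)k*pi/(2N)) * y[k]."""
    N = len(y)
    return [(2 / N) * sum(e(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N)) * y[k]
                          for k in range(N)) for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
assert max(abs(a - b) for a, b in zip(idct(dct(x)), x)) < 1e-9  # round trip
```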
Typical DSP kernels: distance calculation

•  Distance calculations are typically used in pattern recognition, motion estimation, and coding.

•  Problem: choose the vector r_k whose distance (see below) from the input vector x is minimum.

Mean Absolute Difference (MAD):   d = (1/N) Σ_{i=0}^{N-1} | x(i) - r_k(i) |

Mean Square Error (MSE):          d = (1/N) Σ_{i=0}^{N-1} [ x(i) - r_k(i) ]²
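Both metrics and the "pick the closest candidate" step fit in a few lines (a sketch; the example vectors are made up here):

```python
def mad(x, r):
    """Mean Absolute Difference."""
    return sum(abs(a - b) for a, b in zip(x, r)) / len(x)

def mse(x, r):
    """Mean Square Error."""
    return sum((a - b) ** 2 for a, b in zip(x, r)) / len(x)

# Choose the candidate r_k closest to x, as in motion estimation:
x = [1, 2, 3]
candidates = [[0, 0, 0], [1, 2, 4], [5, 5, 5]]
best = min(candidates, key=lambda r: mad(x, r))
print(best)  # -> [1, 2, 4]
```

MAD is popular in hardware because it needs no multiplications, while MSE penalizes large individual errors more heavily.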
Typical DSP kernels: matrix computations

Matrix computations are typically used to estimate parameters in DSP systems:

•  matrix-vector multiplication
•  matrix-matrix multiplication
•  matrix inversion
•  matrix triangularization

Matrices may be dense, sparse, band-structured, ….
Computation Rates

•  To estimate the hardware resources required, we can use the equation:

   R_C = R_S · N_S

•  where
   •  R_C is the computation rate
   •  R_S is the sampling rate
   •  N_S is the (average) number of operations per sample

•  For example, a 1-D FIR has N_S = 2N, and a 2-D FIR has N_S = 2N².
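The equation is a one-liner; the example below assumes N_S = 2N for a 1-D FIR, as stated above (note the table on the next slide may count operations differently):

```python
def computation_rate(sample_rate_hz, ops_per_sample):
    """R_C = R_S * N_S: required operations per second."""
    return sample_rate_hz * ops_per_sample

# 256-tap 1-D FIR on 48 kHz audio, assuming N_S = 2N ops/sample:
N = 256
print(computation_rate(48_000, 2 * N))  # -> 24576000 ops/sec, i.e. ~24.6 MOPs
```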
Computational Rates for FIR Filtering

Signal type    Frequency   # taps       Performance
Speech         8 kHz       N = 128      20 MOPs
Music          48 kHz      N = 256      240 MOPs
Video phone    6.75 MHz    N×N = 81     1,090 MOPs
TV             27 MHz      N×N = 81     4,370 MOPs
HDTV           144 MHz     N×N = 81     23,300 MOPs
DSP systems and programs

   x(n) → [ DSP System ] → y(n)

•  infinite input stream (samples): x(0), x(1), x(2), …
•  infinite output stream (samples): y(0), y(1), y(2), …
•  (there may be multiple input and/or output streams)
•  non-terminating program, e.g.:

   for n = 1 to ∞
       y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
   end
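The non-terminating program above maps naturally onto a Python generator: one output sample per input sample, with two explicit delay registers. A sketch, assuming the delay line starts at zero (names chosen here):

```python
import itertools

def fir3(a, b, c, xs):
    """Streaming y(n) = a*x(n) + b*x(n-1) + c*x(n-2); xs may be an infinite iterator."""
    x1 = x2 = 0  # delay registers (the two D elements), assumed to hold 0 initially
    for x0 in xs:
        yield a * x0 + b * x1 + c * x2
        x1, x2 = x0, x1  # shift the delay line

ys = fir3(1, 2, 3, itertools.count(1))   # input stream x = 1, 2, 3, 4, ...
print(list(itertools.islice(ys, 4)))     # -> [1, 4, 10, 16]
```

The generator never terminates on its own; the consumer decides how many samples to draw, which mirrors the stream semantics of the block diagrams that follow.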
DSP SYSTEMS

GRAPHICAL REPRESENTATIONS
DSP systems: 3 graphical representations

•  Block diagram:
   •  general block diagram
   •  loose semantics
•  Data-flow graph:
   •  used for signal processing
   •  formal definition
   •  powerful tools, lots of theory
•  Signal-flow graph:
   •  linear time-invariant (LTI) systems
   •  formal definition, still more theory

[Diagram: nesting of model classes, from general block diagrams, via data-flow graphs (signal processing), down to signal-flow graphs (LTI systems).]
DSP system: block diagram

•  Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)

[Block diagram: x(n) passes through two delay elements D, giving x(n-1) and x(n-2); the three signals are multiplied by constants a, b, c and summed by two adders to produce y(n).]

•  D : delay element = memory element = register
•  a× : multiply with constant a
•  + : adder; output value = sum of input values


DSP system: data-flow graph (DFG)

•  Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)

[Data-flow graph: x(n) feeds multiplier nodes a, b, c through edges carrying delays D; adder nodes combine the products into y(n).]

•  D is a (non-negative) number of delays on an edge
•  multiplier node: output value = (constant a) × input value
•  adder node: output value = sum of input values


Data-flow graph (DFG)

•  Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)

[Same data-flow graph as on the previous slide.]

Each edge describes a precedence constraint between two nodes:

•  D = 0: intra-iteration precedence constraint
•  D > 0: inter-iteration precedence constraint


Data-flow graphs

Tokens can represent numbers, vectors (blocks), matrices, ….
Nodes may be complex (coarse-grained) functions.

Single-rate data flow: each node:

•  consumes one token from each input edge;
•  performs its function (in T time units);
•  produces one token onto each output edge.
Data-flow graphs

Multi-rate data flow: each node:

•  consumes a fixed number of tokens from each input edge;
•  performs its function (in T time units);
•  produces a fixed number of tokens onto each output edge.
Signal-flow graph (representation method 3)

•  A join-node denotes an adder.
•  A label a next to an edge denotes multiplication by constant a.
•  z^-k denotes k units of delay.
•  Signal-flow graphs are used to represent Linear Time-Invariant (LTI) systems.
•  A signal-flow graph represents a so-called Z-transform (Laplace), a powerful LTI system theory.

(outside the scope of 2IMN35)
Linear Systems

Input x, output y.

Discrete system:
•  x(n) results in y(n)

Linear system:
•  x1(n) + x2(n) results in y1(n) + y2(n)
•  c1·x1(n) + c2·x2(n) results in c1·y1(n) + c2·y2(n), for arbitrary c1 and c2

Most of our examples will be linear systems.
Linear Time-Invariant Systems

Input x, output y.

•  x(n+k) = x(n) shifted by integer k sample periods

Time-invariant system:
•  x’(n) = x(n+k) results in y’(n) = y(n+k)

Most of our examples will be linear time-invariant systems, or LTI systems.
Commutativity of LTI systems

   x(n) → [ LTI System A ] → f(n) → [ LTI System B ] → y(n)

is equivalent to

   x(n) → [ LTI System B ] → g(n) → [ LTI System A ] → y(n)
LOOP BOUNDS AND ITERATION BOUNDS
Iteration of a Synchronous Flow Graph

•  In one iteration, each actor fires the minimum number of times needed to return the graph to a particular state.
•  Example of a multi-rate DFG (production/consumption rates on the edges):

   →(1) A (2)→(2) B (3)→(2) C (1)→

   # firings for 1 iteration:
   A: 2   B: 2   C: 3

   # tokens per edge for 1 iteration:
   →A: 2   A→B: 4   B→C: 6   C→: 3
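The firing counts are the smallest integer solution of the balance equations (tokens produced = tokens consumed per edge). A sketch of that calculation, assuming a connected, rate-consistent graph (names are chosen here; requires Python 3.9+ for `math.lcm`):

```python
from fractions import Fraction
from math import lcm

def repetitions(edges):
    """Minimal firing counts q solving q[u]*prod == q[v]*cons
    for every edge (u, prod, cons, v)."""
    q = {}
    def visit(node, rate):
        if node in q:
            return
        q[node] = rate
        for (u, p, c, v) in edges:
            if u == node:
                visit(v, rate * p / c)   # balance: q[v] = q[u]*p/c
            if v == node:
                visit(u, rate * c / p)
    visit(edges[0][0], Fraction(1))
    scale = lcm(*(r.denominator for r in q.values()))  # clear fractions
    return {n: int(r * scale) for n, r in q.items()}

# The slide's graph: A produces 2, B consumes 2; B produces 3, C consumes 2.
print(repetitions([("A", 2, 2, "B"), ("B", 3, 2, "C")]))  # -> {'A': 2, 'B': 2, 'C': 3}
```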
Iteration period

Iteration period = the time required for the execution of one iteration of the SFG.

Example (a loop: x(n) and the fed-back y(n-1) enter an adder, whose output passes through a multiplier a and a delay D), let
•  Tm = 10 = multiplication time
•  Ta = 4 = addition time

Iteration period = Tm + Ta = 14 [e.g. nsec]
= minimum sample period Ts; that is: Ts ≥ Tm + Ta

Iteration rate = (iteration period)^-1 [e.g. GHz]
Loop and loop bound

•  A loop (cycle) in a DFG is a directed path that begins and ends at the same node.
•  The loop bound of loop j is defined as Tj/Wj, where
   •  Tj is the loop computation time (sum of all Ti of loop nodes i),
   •  Wj is the number of delays (D-elements) in the loop.

•  Example (IIR filter whose loop contains a multiplier a, an adder, and 2D, feeding back y(n-2)):
   •  Tloop = Tm + Ta = 14 ns
   •  Wloop = 2
   •  Loop bound = Tloop / Wloop = 14 / 2 = 7 nsec
Critical loop and Iteration bound

•  The critical loop of a DFG is the loop with the maximum loop bound.
•  The iteration bound T∞ of a DFG is the loop bound of the critical loop:

   T∞ = max_{j ∈ L} ( Tj / Wj )

where
•  L is the set of loops of the DFG,
•  Tj is the computation time of loop j,
•  Wj is the weight of loop j, i.e. the number of delays D.
Iteration bound cntd

Example:
•  TL1 = (10+2)/1 = 12
•  TL2 = (2+3+5)/2 = 5
•  TL3 = (10+2+3)/2 = 7.5
•  Iteration bound = max(12, 5, 7.5) = 12

Notes:
•  Delays are non-negative (a negative delay would imply non-causality).
•  If the loop weight equals 0 (no delay elements in the loop), then TL/0 = ∞ (deadlock).
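Once the loops have been enumerated, the iteration bound itself is a one-line maximum. A sketch over (loop time, delay count) pairs, using the example above:

```python
def iteration_bound(loops):
    """T_inf = max over loops of T_j / W_j; W_j == 0 means deadlock (bound = infinity)."""
    return max(t / w if w > 0 else float("inf") for (t, w) in loops)

# The example above: (loop computation time, # delays) per loop.
loops = [(12, 1), (10, 2), (15, 2)]
print(iteration_bound(loops))  # -> 12.0
```

Enumerating all loops of a DFG can itself be expensive; algorithms that avoid explicit enumeration exist but are beyond this slide.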
4 types of delay paths; critical path

•  Redraw the block diagram by partitioning the nodes into D-elements and combinational functions (“FSM view”): inputs and state (the delay elements) feed the combinational functions, which produce the outputs and the next state.

   path   from     to
   1      inputs   state
   2      state    outputs
   3      inputs   outputs
   4      state    state

•  Paths do not contain delay elements.
•  The critical path is the path with the longest computation time and is a lower bound for the clock period.
Critical path cntd

Example (FIR filter):
•  Tm = 10 ns
•  Ta = 4 ns
•  No loops!

1.  1 path from input to state: 0 ns
2.  4 paths from state to outputs: 26, 22, 18, 14 ns
3.  1 path from input to output: 26 ns
4.  3 paths from state to state: 0, 0, 0 ns

The critical path is 26 ns. (It can be reduced by pipelining and parallel processing.)
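With the delay elements cut, what remains is a DAG, and the critical path is its longest weighted path. A sketch: the graph below models a 5-tap FIR (multipliers m0..m4 at 10 ns feeding an adder chain a1..a4 at 4 ns each, consistent with the path lengths 26, 22, 18, 14 ns above); the node names are chosen here:

```python
import functools

def critical_path(graph, delay):
    """Longest combinational path in a DAG whose delay elements were removed.
    graph: node -> list of successor nodes; delay: node -> computation time."""
    @functools.lru_cache(maxsize=None)
    def longest_from(n):
        return delay[n] + max((longest_from(s) for s in graph[n]), default=0)
    return max(longest_from(n) for n in graph)

graph = {"m0": ["a1"], "m1": ["a1"], "m2": ["a2"], "m3": ["a3"], "m4": ["a4"],
         "a1": ["a2"], "a2": ["a3"], "a3": ["a4"], "a4": []}
delay = {n: 10 if n.startswith("m") else 4 for n in graph}
print(critical_path(graph, delay))  # -> 26
```

The memoized recursion visits each node once, so the cost is linear in the size of the graph.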
DSP references

•  Keshab K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley-Interscience, 1999.
•  Richard G. Lyons. Understanding Digital Signal Processing (2nd edition). Prentice Hall, 2004.
•  John G. Proakis and Dimitris K. Manolakis. Digital Signal Processing (4th edition). Prentice Hall, 2006.
•  Simon Haykin. Neural Networks, a Comprehensive Foundation (2nd edition). Prentice Hall, 1999.
Computer Architecture and DSP references

•  Hennessy and Patterson. Computer Architecture: A Quantitative Approach (3rd edition). Morgan Kaufmann, 2002.
•  Phil Lapsley, Jeff Bier, Amit Sholam, Edward Lee. DSP Processor Fundamentals. Berkeley Design Technology, Inc., 1994-199
•  Jennifer Eyre, Jeff Bier. The Evolution of DSP Processors. IEEE Signal Processing Magazine, 2000.
•  Kees van Berkel et al. Vector Processing as an Enabler for Software-Defined Radio in Handheld Devices. EURASIP Journal on Applied Signal Processing 2005:16, 2613-2625.
VLSI Programming:

Preparations for lab work, before Tuesday April 26:

•  team up (2 students/team), and
•  install FPGA tools.
VLSI Programming: Thursday April 21

Transformations:
•  Transposition
•  Pipelining
•  Retiming
•  K-slow transformation
•  Parallel processing

(Parhi, Chapters 2, 3)
THANK YOU
