Exm Opencl Tdfir Optimization Guide
Overview
Introduction to FIR Filters
FIR Filter
[Figure: digital filter block]
FIR Filter Structure
[Figure: 8-tap FIR filter. The input x(n) feeds a chain of seven unit delays (z^-1); each tap is multiplied by one of the coefficients C0..C7 and the products are summed to produce y(n).]
Example (Time Step 1)
[Figure: input samples X[0]..X[6] plotted against time t with sample period Ts; X[0] occupies the first tap.]
y(n) = X[0]*C0
Example (Time Step 2)
[Figure: X[1] and X[0] occupy the first two taps.]
y(n) = X[1]*C0 + X[0]*C1
Example (Time Step 3)
[Figure: X[2], X[1] and X[0] occupy the first three taps.]
y(n) = X[2]*C0 + X[1]*C1 + X[0]*C2
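The three time steps are instances of the general direct-form FIR equation. The block below is a sketch of that relationship, assuming the delay line starts empty (all samples before X[0] are zero) and counting Time Step 1 as n = 0; the symbol C_k stands for the coefficients C0..C7 in the figure.

```latex
\[
y(n) = \sum_{k=0}^{7} C_k \, x(n-k), \qquad x(m) = 0 \ \text{for } m < 0
\]
\[
\begin{aligned}
y(0) &= C_0 X[0] \\
y(1) &= C_0 X[1] + C_1 X[0] \\
y(2) &= C_0 X[2] + C_1 X[1] + C_2 X[0]
\end{aligned}
\]
```

From n = 7 onward all eight taps hold valid samples, so every output is a full eight-term sum.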
Relationship between Time and Frequency
What is a Frequency?
Quick Changes in Time = High Frequencies
Slow Changes in Time = Low Frequencies
[Figure: two example signals, each shown as signal versus time and as signal magnitude versus frequency]
Example: Types of Filters
High Pass Filters
Pass all frequencies above a limit frequency
Reject frequencies less than the limit frequency
[Figure: high-pass filter response plotted against frequency]
Designing Filters
Graphical Representation of the TDFIR
[Figure: multiplication of the filter with the input at each position from i=0 to i=N-1, producing OutputArray y[i]]
Benchmark Description
% TDFIR comes with a file containing separate input data for each filter.
% The input data provided for each filter is exactly the same.
% The benchmark implementation ignores that fact and reads in all the copies.
% For this illustration we assume the data for each filter was read into rows,
% with a separate row used for each filter.
for f = 1:M
    OutputStream(f,:) = conv(InputStream(f,:), FilterStream(f,:));
end
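For readers who do not use MATLAB, the loop above can be sketched in plain C. This is an illustrative reference only, not the benchmark's code: the function names `conv_full` and `tdfir_reference` are hypothetical, the data is treated as real-valued (the benchmark uses complex samples), and rows are assumed to be stored back to back with no padding.

```c
#include <stddef.h>

/* Full linear convolution of one input row with one filter row.
 * `out` must have room for n + k - 1 elements, the same length
 * that MATLAB's conv() returns. */
void conv_full(const float *in, size_t n,
               const float *filt, size_t k, float *out)
{
    for (size_t i = 0; i < n + k - 1; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < k; j++) {
            /* Skip terms that would read outside the input row. */
            if (i >= j && i - j < n)
                acc += filt[j] * in[i - j];
        }
        out[i] = acc;
    }
}

/* Apply m independent filters, one per row (the `for f=1:M` loop). */
void tdfir_reference(const float *in, const float *filt, float *out,
                     size_t m, size_t n, size_t k)
{
    for (size_t f = 0; f < m; f++)
        conv_full(in + f * n, n, filt + f * k, k, out + f * (n + k - 1));
}
```

Each filter is independent of the others, which is what makes the per-filter work easy to distribute across parallel hardware.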
Key Factors in Improving FPGA Throughput
Optimization #1: Data Restructuring
Alignment of input and output arrays
InputLength = 4096
FilterLength = 128
ResultLength = InputLength + FilterLength - 1 = 4096 + 128 - 1 = 4223
[Figure: output array layout, with the results for Filter 0 followed by those for Filter 1]
Loading the Filter Coefficients
[Figure: output array, Filter 0 followed by Filter 1]
Optimization #2: Using Local Memory
[Figure: InputArray and FilterArray loaded into local on-chip memory]
Simplified Code Structure
// Hardcode these for efficiency
#define N 4096
#define K 128

// Each work-group handles one filter: advance the pointers to its data
InputArray  += get_group_id(0) * N;
FilterArray += get_group_id(0) * K;
OutputArray += get_group_id(0) * (N+K);

// Perform compute (ignoring complex math for simplicity)
float result = 0.0f;
for (int i = 0; i < K; i++) {
    result += local_copy_of_filter_array[K-1-i] * local_copy_of_input_array[get_local_id(0)+i];
}
OutputArray[get_local_id(0)] = result;
}
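The indexing in the compute loop can be checked on a CPU by emulating one work-group in plain C. The sketch below is not the kernel itself: `tdfir_workgroup` is a hypothetical name, the outer loop stands in for the work items (`lid` plays the role of `get_local_id(0)`), and the zero padding of the local input copy is an assumption chosen so that the `[K-1-i]` / `[lid+i]` pattern yields a full convolution.

```c
#include <stddef.h>

/* Emulates one work-group on the host.
 * local_input: scratch buffer of n + 2*(k-1) floats (the "local memory").
 * output:      n + k - 1 floats. */
void tdfir_workgroup(const float *input, size_t n,
                     const float *filter, size_t k,
                     float *local_input, float *output)
{
    size_t pad = k - 1;

    /* Stage 1: copy the input into the local buffer, with k-1 zeros
     * on each side (assumed padding, not taken from the slides). */
    for (size_t i = 0; i < n + 2 * pad; i++)
        local_input[i] = (i >= pad && i < pad + n) ? input[i - pad] : 0.0f;

    /* Stage 2: every iteration does the work of one work item. */
    for (size_t lid = 0; lid < n + k - 1; lid++) {
        float result = 0.0f;
        for (size_t i = 0; i < k; i++)
            result += filter[k - 1 - i] * local_input[lid + i];
        output[lid] = result;
    }
}
```

Reading the filter backwards (`k - 1 - i`) while reading the input forwards (`lid + i`) is what turns the sliding dot product into a convolution rather than a correlation.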
Not the most efficient for FPGA
Reading from a banked Memory
[Figure: reads (Read #1 through Read #127 are labeled) pass through an arbitration network to reach memory banks Bank0 through Bank7]
Using Banking to service all Reads
[Figure: local memory divided into banks Bank_0, Bank_1, ..., Bank_127; the drawing also labels Element 4096]
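One way to see why 128 banks can service all reads at once is to model the bank mapping in C. The sketch below assumes an interleaved layout in which element e lives in bank e mod 128; the slides do not state the mapping, so `bank_of` and `reads_are_conflict_free` are hypothetical helpers for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_BANKS 128  /* assumed: one bank per filter tap (K = 128) */

/* Assumed interleaved mapping: consecutive elements sit in
 * consecutive banks, wrapping around after NUM_BANKS elements. */
size_t bank_of(size_t element)
{
    return element % NUM_BANKS;
}

/* Returns true when `count` consecutive reads starting at element
 * `first` all land in different banks, so none has to wait. */
bool reads_are_conflict_free(size_t first, size_t count)
{
    bool used[NUM_BANKS] = { false };

    for (size_t i = 0; i < count; i++) {
        size_t b = bank_of(first + i);
        if (used[b])
            return false;  /* two reads need the same bank */
        used[b] = true;
    }
    return true;
}
```

Under this mapping any window of 128 consecutive elements touches each bank exactly once, whatever the starting offset, which matches the access pattern of the 128-tap inner loop.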
Optimization #3: Implementing Single Work Item Execution
// Hardcode these for efficiency
#define N 4096
#define K 128
...
        OutputArray[i] = result;
    }
}
Consider the unrolled loops … Again
Shift Register Implementation
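A shift-register formulation of the single work-item loop might look like the C sketch below. It is a host-side illustration rather than the kernel from the slides: `tdfir_shift_register` is a hypothetical name, K is set to 4 instead of 128 so the behavior is easy to trace by hand, and the data is real-valued.

```c
#include <stddef.h>

#define K 4  /* filter taps; the guide uses 128 */

/* One loop walks over every output. The last K input samples are kept
 * in a small array that shifts by one position per iteration, so each
 * input element is read from memory only once. */
void tdfir_shift_register(const float *input, size_t n,
                          const float filter[K], float *output)
{
    float sr[K] = { 0.0f };  /* sr[0] holds the newest sample */

    for (size_t i = 0; i < n + K - 1; i++) {
        /* Shift by one tap. K is a compile-time constant, so this
         * loop can be fully unrolled. */
        for (size_t j = K - 1; j > 0; j--)
            sr[j] = sr[j - 1];

        /* Feed zeros once the input is exhausted to flush the filter. */
        sr[0] = (i < n) ? input[i] : 0.0f;

        /* Multiply every tap by its coefficient and accumulate. */
        float result = 0.0f;
        for (size_t j = 0; j < K; j++)
            result += filter[j] * sr[j];
        output[i] = result;
    }
}
```

Because every tap is read at a fixed position on every iteration, the taps can be held in registers instead of a banked memory, which removes the need for arbitration between reads.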
We Handle Floating Point!
[Figure: floating-point number layout, beginning with the sign bit]
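Example: Floating Point Addition
[Figure: floating-point layout: sign bit, exponent (8 bits), mantissa (23 bits)]
Operands (sign, exponent, mantissa):
0 0x19 0x12345
0 0x15 0x54321
Subtractor: exponent difference 0x19 - 0x15 = 4
Barrel shifter: the mantissa of the smaller operand is shifted right by 4 bits, 0x54321 -> 0x05432
Aligned operands:
0 0x19 0x12345
0 0x19 0x05432
Sum:
0 0x19 0x17777
Only a subset of the circuitry is shown.
Additional circuitry is required for rounding modes, normalization, etc.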
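The alignment steps in this example can be reproduced with a few lines of C. The sketch below models only what the slide shows: the `fp_fields` type and `fp_add_same_sign` function are hypothetical, both operands are assumed to have the same sign, and the hidden bit, rounding, and normalization are deliberately left out.

```c
#include <stdint.h>

/* Simplified (sign, exponent, mantissa) triple as drawn on the slide.
 * Hypothetical type for illustration; not a full IEEE 754 model. */
typedef struct {
    uint32_t sign;
    uint32_t exponent;
    uint32_t mantissa;
} fp_fields;

/* Adds two operands of equal sign:
 * 1. subtractor: find the exponent difference,
 * 2. barrel shifter: shift the smaller operand's mantissa right,
 * 3. adder: add the aligned mantissas. */
fp_fields fp_add_same_sign(fp_fields a, fp_fields b)
{
    /* Make `a` the operand with the larger exponent. */
    if (a.exponent < b.exponent) {
        fp_fields tmp = a;
        a = b;
        b = tmp;
    }

    uint32_t diff = a.exponent - b.exponent;

    /* Shifting a 32-bit value by 32 or more is undefined in C. */
    uint32_t aligned = (diff < 32) ? (b.mantissa >> diff) : 0;

    fp_fields sum = { a.sign, a.exponent, a.mantissa + aligned };
    return sum;
}
```

With the slide's operands, (0, 0x19, 0x12345) and (0, 0x15, 0x54321), the difference is 4, the shifted mantissa is 0x05432, and the sum is (0, 0x19, 0x17777).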
OpenCL builds on Altera’s Floating Point Technology
Function      ALUTs  Registers  Multipliers (27x27)  Latency  Performance
Add/Subtract  541    611        n/a                  14       497 MHz
TDFIR Results
Results of Task TDFIR
OpenCL Overhead
Comparisons
Kernel  Data Set  CPU Throughput (GFLOPS) *  GPU Throughput (GFLOPS) *  Speedup
TDFIR   Set 1     3.382                      97.506                     28.8
TDFIR   Set 2     3.326                      23.130                     6.9
FDFIR   Set 1     0.541                      61.681                     114.1
FDFIR   Set 2     0.542                      11.955                     22.1
CT      Set 1     1.194                      17.177                     14.3
CT      Set 2     0.501                      35.545                     70.9
PM      Set 1     0.871                      7.761                      8.9
PM      Set 2     0.281                      21.241                     75.6
CFAR    Set 1     1.154                      2.234                      1.9
CFAR    Set 2     1.314                      17.319                     13.1
CFAR    Set 3     1.313                      13.962                     10.6
CFAR    Set 4     1.261                      8.301                      6.6
GA      Set 1     0.562                      1.177                      2.1
GA      Set 2     0.683                      8.571                      12.5
GA      Set 3     0.441                      0.589                      1.4
GA      Set 4     0.373                      2.249                      6.0
QR      Set 1     1.704                      54.309                     31.8
QR      Set 2     0.901                      5.679                      6.3
QR      Set 3     0.904                      6.686                      7.4
SVD     Set 1     0.747                      4.175                      5.6
SVD     Set 2     0.791                      2.684                      3.4
DB      Set 1     112.3                      126.8                      1.13
DB      Set 2     5.794                      8.459                      1.46
Performance Comparison
[Figure: bar chart comparing CPU, GPU and DSP performance for TDFIR, FDFIR, CFAR, QR and SVD; vertical axis LOG10(MFLOPS), with ticks from 1 to 1000]