
Time Domain FIR Filter (TDFIR)

Optimization Guide
Overview

 Introduction to FIR Filters
 TDFIR HPEC Benchmark
 How to Optimize TDFIR for the FPGA
 Results + Comparison

2
Introduction to FIR Filters

3
FIR Filter

 The FIR filter is a basic building block for many market segments
  Wireless, video applications
  Military and medical fields

 It is a digital equivalent of the analog filter

 Purpose is to allow discrete signals in the time domain to be filtered (remove noise, high-frequency components, etc.)

[Diagram: input signal → DIGITAL FILTER → filtered output signal]

4
FIR Filter Structure

 Multiply a sequence of input samples by various coefficients and add their results to achieve an output

[Diagram: tapped delay line x(n) → z-1 → z-1 → ... → z-1; the taps are multiplied by coefficients C0..C7 and the products are summed to produce y(n)]

 This series of multiplying and adding will dictate how filter characteristics are shaped (stated as an equation below)
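
In equation form, the structure above computes the standard K-tap FIR difference equation (stated here for reference):

y(n) = C0*x(n) + C1*x(n-1) + ... + C(K-1)*x(n-(K-1))
     = sum over k = 0..K-1 of Ck * x(n-k)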
5
Example (Time Step 0)
[Diagram: sampled input waveform X[0]..X[6] with sampling interval Ts; at time step 0 only X[0] has entered the delay line, aligned with coefficient C0]

y(n) = X[0]*C0

6
Example (Time Step 1)
[Diagram: same input waveform; at time step 1 the delay line holds X[1] (at C0) and X[0] (at C1)]

y(n) = X[1]*C0 + X[0]*C1

7
Example (Time Step 2)
[Diagram: same input waveform; at time step 2 the delay line holds X[2] (at C0), X[1] (at C1) and X[0] (at C2)]

y(n) = X[2]*C0 + X[1]*C1 + X[0]*C2

8
Relationship between Time and Frequency

 What is a Frequency?
 Quick Changes in Time = High Frequency
 Slow Changes in Time = Low Frequencies

[Plots: a rapidly changing signal in time has its magnitude spectrum concentrated at high frequencies; a slowly changing signal has its magnitude spectrum concentrated at low frequencies]
Example: Types of Filters

 Low Pass Filters
  Pass All Frequencies Up to a Limit Frequency
  Reject Frequencies Greater Than Limit Frequency
 [Plot: Low Pass Filter Response vs. freq]

 High Pass Filters
  Pass All Frequencies Above a Limit Frequency
  Reject Frequencies Less Than Limit Frequency
 [Plot: High Pass Filter Response vs. freq]
Designing Filters

 Get a desired response in the frequency domain (e.g., a low pass filter)
 Modify the desired response to possess certain “desirable” characteristics
  E.g.: finite in length (truncate the response)

 Take the Inverse Fourier Transform to obtain the Impulse Response

 The Impulse Response IS the value of the coefficients


Filter Examples : Low Pass Filter

 Low Pass Filter Frequency Response
  Start with Ideal Low Pass Filter
  All Frequencies Up to desired Frequency are present
  Take the inverse discrete Fourier Transform
  Resulting Filter Coefficients (a C sketch of this flow follows below)
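
As a concrete illustration of this flow, the minimal C sketch below generates K low-pass coefficients by sampling the inverse transform of an ideal low-pass response (a sinc) and truncating it to K taps. The cutoff value and the Hamming window are assumed example choices, not taken from the slides.

#include <math.h>
#include <stdio.h>

#define K 128                          /* number of filter taps */

int main(void) {
    const float PI = 3.14159265f;
    const float cutoff = 0.125f;       /* normalized cutoff frequency (assumed example value) */
    float coeff[K];

    for (int n = 0; n < K; n++) {
        float m = n - (K - 1) / 2.0f;  /* center the truncated impulse response */
        /* Inverse transform of the ideal low-pass response is a sinc. */
        float h = (m == 0.0f) ? 2.0f * cutoff
                              : sinf(2.0f * PI * cutoff * m) / (PI * m);
        /* Hamming window softens the truncation (assumed design choice). */
        float w = 0.54f - 0.46f * cosf(2.0f * PI * n / (K - 1));
        coeff[n] = h * w;
    }

    for (int n = 0; n < K; n++)
        printf("C[%d] = %f\n", n, coeff[n]);
    return 0;
}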
TDFIR HPEC Benchmark

Slides courtesy of Craig Lund

13
Graphical Representation of the TDFIR

 Benchmark performs complex computation
  Each point has 2 components (real, imag)

[Diagram (K=3 shown): the FilterArray C[K-1]..C[1],C[0] slides along the InputArray x[i-K]..x[i-1],x[i] for i = 0..N-1; element-wise multiplication followed by summation produces OutputArray y[i], which contains K-1 extra elements]
Benchmark Description

Parameter  Description                                Data Set 1  Data Set 2
M          Number of filters to stream                64          20
N          Length of InputArray in complex elements   4096        1024
K          Length of FilterArray in complex elements  128         12
           Workload in MFLOP                          268.44      1.97

We need some real-world data sizes if we are going to show results.
For that we turn to the RADAR-oriented HPEC Challenge benchmark suite. See the TDFIR kernel:
http://www.ll.mit.edu/HPECchallenge/tdfir.html
Their benchmark offers two datasets. The timing data we present in later slides uses Data Set 1.
 Note that you won’t find direct convolution implementations optimized for giant arrays. It is faster to use an FFT algorithm in that case. Using an FFT for that is benchmarked using the HPEC Challenge’s FDFIR kernel.
Using HPEC Challenge Data

% TDFIR comes with a file containing separate input data for each filter.
% The input data provided for each filter is exactly the same.
% The benchmark implementation ignores that fact and reads in all the copies.

% For this illustration we assume the data for each filter was read into rows
% and a separate row used for each filter
for f=1:M
OutputStream(f,:) = conv (InputStream(f,:), FilterStream(f,:));
end

 This slide illustrates how the HPEC Challenge tdFIR benchmark uses the data that it supplies, expressed in MATLAB.
 In words, the benchmark presents a batch of M separate filters to compute.
Simple C Implementation

 The obvious implementation in C is very simple.
  Often “good enough.”
 This implementation exactly matches the fundamental equation of the FIR filter.
 Upcoming slides show alternative implementation methods that lead to an efficient FPGA realization.

// Loop assumes OutputArray starts out full of zeros.
for ( int FilterElement = 0; FilterElement < K; FilterElement++ ) {
    for ( int InputElement = 0; InputElement < N; InputElement++ ) {
        OutputArray[ InputElement + FilterElement ] +=
            InputArray[ InputElement ] * FilterArray[ FilterElement ];
    }
}
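
Because the benchmark data is complex (each point has real and imaginary components), the actual computation looks more like the sketch below, which assumes the arrays store interleaved real/imaginary floats; that layout is an assumption for illustration, not spelled out on this slide.

// Complex-valued variant of the loop above; arrays hold interleaved
// (real, imag) float pairs and OutputArray starts out full of zeros.
for ( int FilterElement = 0; FilterElement < K; FilterElement++ ) {
    for ( int InputElement = 0; InputElement < N; InputElement++ ) {
        float xr = InputArray[ 2*InputElement ],   xi = InputArray[ 2*InputElement + 1 ];
        float cr = FilterArray[ 2*FilterElement ], ci = FilterArray[ 2*FilterElement + 1 ];
        OutputArray[ 2*(InputElement + FilterElement) ]     += xr*cr - xi*ci;
        OutputArray[ 2*(InputElement + FilterElement) + 1 ] += xr*ci + xi*cr;
    }
}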
How to Optimize TDFIR for FPGA

18
Key Factors in Improving FPGA Throughput

 Restructuring data input and output
 Using local memory
 Implementing single work-item execution

19
Optimization #1: Data Restructuring

 DataSet #1 contains 64 sets of filter data to process
 Each set of data contains 4096 complex points
 Original implementation has a doubly nested for loop

for ( int FilterElement = 0; FilterElement < K; FilterElement++ ) {
    for ( int InputElement = 0; InputElement < N; InputElement++ ) {
        perform_computation(FilterElement, InputElement);
    }
}

 For simplicity, we combined this into a single for loop

for ( int ilen = 0; ilen < TotalInputLength; ilen++ ) {
    perform_computation(ilen);
}

 This simplifies control flow

20
Alignment of input and output arrays

 InputLength = 4096
 ResultLength = InputLength + FilterLength – 1
 4096 + 128 – 1 = 4223

 Thus, to maintain the expected behaviour, we PAD the input array by 127 complex points of zero

[Diagram: each filter's input segment is followed by 127*2 zeros so that the input and output segments (Filter 0, Filter 1, ...) line up]

21
Loading the Filter Coefficients

 Each new filter to process has different filter coefficients.
  Must load these in before we can perform computation
  We chose to load these constants 8 complex points at a time
  For the 128-tap filter, we need to spend 128/8 = 16 clock cycles to load in the filter coefficients
 To account for this, while at the same time maintaining our simplified control flow, we simply add more padding to the *beginning* of both the input and the output array (see the host-side sketch below)

[Diagram: each filter's input segment is prefixed with 16*2 zeros for coefficient loading and followed by 127*2 zeros; the output segment gets the same 16*2-zero prefix so the two stay aligned (Filter 0, Filter 1, ...)]
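
One plausible host-side layout consistent with these two slides is sketched below: each filter's input segment gets a 16-complex-point zero prefix (for coefficient loading) and a 127-complex-point zero tail, so every input segment has the same length as its output segment. The constant names, the helper function and the interleaved float layout are assumptions for illustration.

#include <string.h>

#define N        4096            /* complex input points per filter                    */
#define K        128             /* complex filter taps                                */
#define LOAD_PAD 16              /* complex zeros reserved for coefficient loading     */
#define TAIL_PAD (K - 1)         /* 127 complex zeros so input and output stay aligned */
#define SEG      (LOAD_PAD + N + TAIL_PAD)   /* padded segment length per filter       */

/* Copy M raw input sets (N complex points each, stored as interleaved floats)
   into one concatenated buffer of M*SEG complex points, zero-padded as above. */
static void pad_inputs(const float *raw, float *padded, int M)
{
    memset(padded, 0, (size_t)M * SEG * 2 * sizeof(float));
    for (int f = 0; f < M; f++) {
        memcpy(&padded[(f * SEG + LOAD_PAD) * 2],    /* skip the zero prefix */
               &raw[f * N * 2],
               (size_t)N * 2 * sizeof(float));
    }
}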

22
Optimization #2: Using Local Memory

 Accessing global memory is slow, so a more efficient implementation involves breaking the problem up into smaller segments that can fit in local memory
 The InputArray needs to be broken up into smaller pieces, and the convolution of each of these subregions can be computed independently

[Diagram: pieces of the InputArray and the FilterArray are loaded into local on-chip memory before computation]

23
Simplified Code Structure
// Hardcode these for efficiency
#define N 4096
#define K 128

__kernel void tdFIR
  ( __global const float *restrict InputArray,   // Length N
    __global const float *restrict FilterArray,  // Length K
    __global float *restrict OutputArray         // Length N+K-1
  )
{
    __local float local_copy_of_input_array[2*K+N];
    __local float local_copy_of_filter_array[K];

    InputArray  += get_group_id(0) * N;
    FilterArray += get_group_id(0) * K;
    OutputArray += get_group_id(0) * (N+K);

    // Copy from global to local
    local_copy_of_input_array[get_local_id(0)] = InputArray[get_local_id(0)];
    if (get_local_id(0) < K)
        local_copy_of_filter_array[get_local_id(0)] = FilterArray[get_local_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Perform compute (ignoring complex math for simplicity)
    float result = 0.0f;
    for (int i = 0; i < K; i++) {
        result += local_copy_of_filter_array[K-1-i] * local_copy_of_input_array[get_local_id(0)+i];
    }
    OutputArray[get_local_id(0)] = result;
}

24
Not the most efficient for FPGA

 Consider if we unrolled the compute loop


result = local_copy_of_filter_array[K-1]*local_copy_of_input_array[get_local_id(0)] +
local_copy_of_filter_array[K-2]*local_copy_of_input_array[get_local_id(0)+1] +
local_copy_of_filter_array[K-3]*local_copy_of_input_array[get_local_id(0)+2] +
...+
local_copy_of_filter_array[0]*local_copy_of_input_array[get_local_id(0)+K-1];

 We have something that looks somewhat like a good circuit for a FIR filter
 Several loads from memory fetch 128 coefficients from the local FilterArray and 128 elements from the local InputArray

25
Reading from a banked Memory

 On any given clock cycle, as long as all requests are asking for elements in different banks → we can service ALL requests!
 Altera’s OpenCL compiler automatically builds this structure

[Diagram: local memory built from M20K blocks arranged into 8 banks (Bank0..Bank7); an arbitration network routes Read #0 .. Read #127 to the banks]

26
Using Banking to service all Reads

 Banks are arranged as Vertical Strips

[Diagram: the local input array (Element 0 .. Element 4096) is striped across Bank_0, Bank_1, ..., Bank_127, so the 128 elements of any window (Element 0 .. Element 127) fall in different banks]


27
Disadvantages

 Banked local memory structures are an inefficient way to handle FIR data reuse.
  Consumes LOTS of area and on-chip logic and memory resources

 Notice that every thread accesses almost the same data, but shifted by one position
 We really need to create a shift register structure!
 It is the ultimate form of expressing the data reuse pattern for the FIR filter

28
Optimization #3: Implementing Single Work Item Execution
// Hardcode these for efficiency
#define N 4096
#define K 128

__kernel __attribute((task)) void tdFIR
  ( __global const float *restrict InputArray,   // Length N
    __global const float *restrict FilterArray,  // Length K
    __global float *restrict OutputArray,        // Length N+K-1
    unsigned int totalLengthOfInputsConcatenated
  )
{
    float data_shift_register[K];
    for (int i = 0; i < totalLengthOfInputsConcatenated; i++) {
        // Shift register implementation
        #pragma unroll
        for (int j = 0; j < K-1; j++)
            data_shift_register[j] = data_shift_register[j+1];
        data_shift_register[K-1] = InputArray[i];

        float result = 0.0f;
        #pragma unroll
        for (int j = 0; j < K; j++)
            result += data_shift_register[K-1-j] * FilterArray[j];

        OutputArray[i] = result;
    }
}

*For simplicity, assume the FilterArray doesn’t change
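
On the host side, a single work-item kernel like this is launched with clEnqueueTask rather than an NDRange. A minimal sketch, assuming the context, queue, kernel and buffers were created and written earlier, and that paddedSegmentLength is the per-filter padded length from the earlier slides (both names are illustrative assumptions):

/* Total concatenated, padded input length handed to the kernel. */
unsigned int totalLength = M * paddedSegmentLength;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &filterBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputBuf);
clSetKernelArg(kernel, 3, sizeof(unsigned int), &totalLength);

/* Single work-item execution: one "task" runs the entire pipelined loop. */
clEnqueueTask(queue, kernel, 0, NULL, NULL);
clFinish(queue);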

29
Consider the unrolled loops … Again

 The first unrolled loop looks like:


data_shift_register[0]=data_shift_register[1];
data_shift_register[1]=data_shift_register[2];
data_shift_register[2]=data_shift_register[3];
...
data_shift_register[K-1]=InputArray[i];

 The second unrolled loop:


result = data_shift_register[K-1]*FilterArray[0] +
data_shift_register[K-2]*FilterArray[1] +
data_shift_register[K-3]*FilterArray[2] +
...
data_shift_register[0]*FilterArray[K-1];

 The key observation is that all array accesses are from a constant address
  Altera’s OpenCL compiler will now build 128 floating-point registers instead of a complex memory system

30
Shift Register Implementation

 Pipelining iterations in this loop is very simple because the dependencies are trivial
 Essentially becomes a shift register in hardware.

[Diagram: registers data_shift_register[0..K-1] in iteration i feed data_shift_register[0..K-1] in iteration i+1, each value moving down one position per iteration]

31
We handle Floating-Point !

 Notice that the TDFIR is a Floating Point Application
  Floating-point formats represent a large range of real numbers with a finite number of bits

[Diagram: single-precision floating-point layout: sign bit, Exponent (8 bits), Mantissa (23 bits)]

 Extreme care is required to handle OpenCL compliant floating point on the FPGA
  Handling of Infs, NaNs
  Denormalized Numbers
  Rounding

32
Example: Floating Point Addition

[Diagram: floating-point adder datapath. Operands (sign 0, exponent 0x19, mantissa 0x12345) and (sign 0, exponent 0x15, mantissa 0x54321): a subtractor computes the exponent difference (0x19 - 0x15 = 4), a barrel shifter shifts the smaller mantissa right by 4 (0x54321 → 0x05432), and the aligned mantissas are added (0x12345 + 0x05432 = 0x17777), giving a result with exponent 0x19]

 Only a subset of circuitry shown
 Circuitry required for rounding modes, normalization, etc.
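
The toy C sketch below replays the slide's numbers through that datapath. It is an illustration only: positive operands, raw fraction fields (no implicit leading 1), and no rounding or renormalization.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t exp_a = 0x19, man_a = 0x12345;    /* operand A fields from the slide */
    uint32_t exp_b = 0x15, man_b = 0x54321;    /* operand B fields from the slide */

    uint32_t shift = exp_a - exp_b;            /* subtractor: exponent difference = 4 */
    uint32_t man_b_aligned = man_b >> shift;   /* barrel shifter: 0x54321 -> 0x05432  */
    uint32_t man_sum = man_a + man_b_aligned;  /* adder: 0x12345 + 0x05432 = 0x17777  */

    printf("result: exponent 0x%X, mantissa 0x%X\n", exp_a, man_sum);
    return 0;
}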

33
OpenCL builds on Altera’s Floating Point Technology

Function      ALUTs  Registers  Multipliers (27x27)  Latency  Performance
Add/Subtract  541    611        n/a                  14       497 MHz
Multiplier    150    391        1                    11       431 MHz
Divider       254    288        4                    14       316 MHz
Inverse       470    683        4                    20       401 MHz
SQRT          503    932        n/a                  28       478 MHz
Inverse SQRT  435    705        6                    26       401 MHz
EXP           626    533        5                    17       279 MHz
LOG           1,889  1,821      2                    21       394 MHz

We can implement the full range of OpenCL math functions

34
TDFIR Results

35
Results of Task TDFIR

 Data collected using the BittWare S5-PCIe-HQ (S5PH-Q) board
 Shipped .aocx yields 170 GFLOP/s

Sample Data Output


Loading tdfir_131_s5phq_d8.aocx...
tdFirVars: inputLength = 4096, resultLength = 4223, filterLen = 128
Done.
Latency: 0.001575 s.
Buffer Setup Time: 0.004548 s.
Throughput: 170.491 GFLOPs.
Verification: PASS

36
OpenCL Overhead

Dataset  M   N     K    Input Bytes  Filter Bytes  Output Bytes  Total MB transferred  Transfer Time (ms)
Large    64  4096  128  2097152      65536         2162176       4.35 MB               1.36
Small    20  1024  12   163840       1920          165600        0.36 MB               0.11

 OpenCL is usually associated with PCIe-attached acceleration hardware.
 The DMA of data over PCIe consumes time.
 Transfer Time is the amount of time to transfer all of the data back and forth to the FPGA using PCIe Gen2 x8.
Can overlap transfers and computation

 For the most common case in HPEC, we likely want to continuously process batches of filters as time progresses.
 Can overlap transfer and compute.

[Diagram: Transfer #1, #2, #3 pipelined against Compute #1, #2, #3, with each transfer overlapping the previous batch's compute]

 This pipelining approach can be used to hide transfer latency (sketched below)
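
A minimal host-side sketch of this overlap, assuming two pre-created device input buffers, separate in-order command queues for transfers and compute, and the remaining kernel arguments already set; the names and the double-buffering scheme are illustrative assumptions:

for (int b = 0; b < numBatches; b++) {
    int buf = b % 2;                        /* double-buffer the device inputs */
    cl_event transferDone;

    /* Non-blocking DMA of batch b into the buffer the kernel is not using. */
    clEnqueueWriteBuffer(transferQueue, inputBuf[buf], CL_FALSE, 0,
                         batchBytes, hostInput[b], 0, NULL, &transferDone);

    /* Compute for batch b waits only on its own transfer, so the transfer of
       batch b+1 can start while batch b is still being computed. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuf[buf]);
    clEnqueueTask(computeQueue, kernel, 1, &transferDone, NULL);
}
clFinish(computeQueue);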

38
Comparisons

39
Kernels  Data Set  CPU Throughput (GFLOPS) *  GPU Throughput (GFLOPS) *  Speedup
TDFIR    Set 1     3.382                      97.506                     28.8
TDFIR    Set 2     3.326                      23.130                     6.9
FDFIR    Set 1     0.541                      61.681                     114.1
FDFIR    Set 2     0.542                      11.955                     22.1
CT       Set 1     1.194                      17.177                     14.3
CT       Set 2     0.501                      35.545                     70.9
PM       Set 1     0.871                      7.761                      8.9
PM       Set 2     0.281                      21.241                     75.6
CFAR     Set 1     1.154                      2.234                      1.9
CFAR     Set 2     1.314                      17.319                     13.1
CFAR     Set 3     1.313                      13.962                     10.6
CFAR     Set 4     1.261                      8.301                      6.6
GA       Set 1     0.562                      1.177                      2.1
GA       Set 2     0.683                      8.571                      12.5
GA       Set 3     0.441                      0.589                      1.4
GA       Set 4     0.373                      2.249                      6.0
QR       Set 1     1.704                      54.309                     31.8
QR       Set 2     0.901                      5.679                      6.3
QR       Set 3     0.904                      6.686                      7.4
SVD      Set 1     0.747                      4.175                      5.6
SVD      Set 2     0.791                      2.684                      3.4
DB       Set 1     112.3                      126.8                      1.13
DB       Set 2     5.794                      8.459                      1.46
40
Performance Comparison

GPU: NVIDIA Fermi, CPU: Intel Core 2 Duo (3.33 GHz), DSP: AD TigerSHARC 101

[Bar chart: throughput on a LOG10(MFLOPS) axis (1 to 100000) for CPU, GPU and DSP on the TDFIR, FDFIR, CFAR, QR and SVD kernels, Set 1 and Set 2; the best results are on the order of ~100 GFlop]

 Our FPGA implementation achieves 170-190 GFlops depending on FMax
Thank You
