Najeeb Ahmad
Master Thesis Presentation
May, 2012
Supervisor: Dr. Sun Jinping
Design and Implementation of GPU-based SAR Image Processor
School of Electronic Information Engineering
Beihang University, Beijing, China
Contents
1. Introduction
2. GPU Computing
3. SAR Processing
4. Implementation
5. Conclusion & Future Work
1. Introduction
 Problem
 Motivation
 Objective
 Methodology
PROBLEM
Synthetic Aperture Radar (SAR) data processing is a
computationally intensive and time-consuming task
on conventional CPUs. Given the growing use of
GPUs for scientific computing, this work accelerates
the simplified range Doppler SAR processing
algorithm on a GPU using modern GPGPU
technology, aiming for real-time or near-real-time
performance, and evaluates the suitability of the
GPU for SAR processing.
MOTIVATION
 Computationally intensive and time consuming
nature of SAR processing algorithms.
 Inherent algorithm parallelism in most SAR
processing algorithms.
 Advent of modern GPGPU technology and
availability of commodity GPUs as general
purpose computation engines.
 Architectural parallelism and availability of
sufficient hardware resources in modern GPUs
rendering them especially useful for handling
large data quantities and parallel SAR algorithm
implementation.
OBJECTIVE
 To implement and accelerate the simplified range
Doppler SAR processing algorithm on a modern
NVIDIA TESLA GPU using CUDA and MATLAB-GPU
capabilities.
 The research explores areas such as:
 Algorithm adaptation for parallel implementation.
 Suitability of MATLAB for algorithm implementation.
 Suitability of CUDA for algorithm implementation.
 Comparison of CPU/CUDA/MATLAB-GPU
implementations.
 GPU as SAR processing platform.
METHODOLOGY
 Algorithm implementation and verification on Intel
Xeon CPU using MATLAB.
 Identification of parallelizable portions of
algorithm.
 Algorithm implementation on TESLA C1060 GPU
using MATLAB’s native GPU capabilities.
 Algorithm implementation on TESLA C1060 GPU
using CUDA.
 Analysis of CPU, MATLAB-GPU and CUDA
implementations.
2. GPU Computing
 Introduction to GPU Computing
 GPGPU: Brief History
 NVIDIA CUDA
 Writing efficient code
Introduction to GPU Computing
 Use of Graphics Processing Units (GPUs) for
general purpose computing applications.
 CPU: Single, four or eight cores. Capable of
handling few threads. Suitable for serial code.
 GPU: Hundreds of cores. Capable of handling
hundreds of threads. Suitable for parallel code.
Introduction to GPU Computing
 GPU Computing Model: Heterogeneous
computing model employing both CPU and GPU
with serial computing on CPU, parallel computing
on GPU.
GPGPU: Brief History
 First use of GPU as general purpose computing
device, around 1999-2000 using graphics APIs.
Huge performance boosts observed. Generally
unpopular due to tedious programming.
 Introduction of NVIDIA's “CUDA” and AMD's
“Stream Computing” in 2007. Beginning of the
modern GPGPU era. Other vendors introduced
their own GPGPU systems.
 NVIDIA's CUDA is gaining popularity due to its
maturity and performance.
NVIDIA CUDA
 Compute Unified Device Architecture.
 Comprises an Instruction Set Architecture (ISA)
and a parallel compute engine in the GPU,
programmable with high-level languages
extended for GPU computing.
 The CUDA framework has two parts:
hardware and software. From the software
perspective, CUDA means C/C++ and FORTRAN
extended to support GPU computing.
 CUDA is a “Single Instruction Multiple Thread”
(SIMT) architecture.
CUDA Hardware
 Streaming multiprocessor (SM): Basic computing unit
of the GPU. Comprises eight streaming processors
(SPs) and memory. GPUs differ in the number of
SMs and the SP clock frequency.
[Figure: streaming multiprocessor with eight SPs, two SFUs, an MT IU and shared memory]
CUDA Memory Architecture
 Understanding of memory architecture critical for
writing efficient CUDA programs.
 All CUDA-enabled hardware has the following types
of memory:
 Global memory
 Shared memory and registers.
 Texture memory and texture cache.
 Constant memory and constant cache.
 Local memory for register spilling.
[Figure: CUDA memory architecture. SM 1 to SM n each contain SPs, shared memory, a texture cache and a constant cache; all SMs share the GPU's global memory (RAM), which holds local, texture and constant memory.]
NVIDIA TESLA C1060 GPU
 PCI Express 2.0 compliant computing processor
board based on NVIDIA Tesla T10 graphics
processing unit targeted for HPC applications.
 Feature highlights
 30 SMs = 240 SPs.
 SP Clock = 1.296 GHz
 4 GB GDDR3 memory with 102
GB/s bandwidth.
 IEEE 754 single- and double-precision
floating point compliant.
 933 GFLOPS single- and 78
GFLOPS double-precision
performance.
 Compute capability: 1.3
 Supported by MATLAB for
GPU computing
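The headline figures for the C1060 can be roughly reproduced from the unit counts and clocks. The sketch below is a back-of-the-envelope check, not vendor data: the flops-per-clock factors, the single double-precision unit per SM, and the 800 MHz double-data-rate memory clock on a 512-bit bus are assumptions that do not appear on the slide.

```python
# Rough sanity check of the Tesla C1060 peak figures.
# Assumptions (not from the slides): each SP retires 3 single-precision
# flops per clock (multiply-add plus multiply), each SM has one
# double-precision unit doing a fused multiply-add (2 flops per clock),
# and memory runs at 800 MHz, double data rate, on a 512-bit bus.

SM_COUNT = 30
SP_PER_SM = 8
SP_CLOCK_GHZ = 1.296


def peak_single_gflops():
    """Peak single-precision throughput in GFLOPS."""
    return SM_COUNT * SP_PER_SM * SP_CLOCK_GHZ * 3


def peak_double_gflops():
    """Peak double-precision throughput in GFLOPS."""
    return SM_COUNT * 1 * SP_CLOCK_GHZ * 2


def memory_bandwidth_gb_s(bus_bits=512, mem_clock_mhz=800, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


if __name__ == "__main__":
    print(f"single precision: {peak_single_gflops():.2f} GFLOPS")
    print(f"double precision: {peak_double_gflops():.2f} GFLOPS")
    print(f"memory bandwidth: {memory_bandwidth_gb_s():.1f} GB/s")
```

Under these assumptions the estimates come out close to the quoted 933 and 78 GFLOPS, and to the 102 GB/s listed in the hardware resources table.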
CUDA Programming Model
 At its core are thread groups, shared memory and
barrier synchronization.
 Provides coarse-grained data and task
parallelism and fine-grained data and thread
parallelism providing expressivity and scalability.
 Thread hierarchy: Grid, blocks, threads.
 Kernels: Functions executed on device (GPU) in
parallel threads.
 CUDA provides APIs to launch kernels in
parallel threads and to synchronize them.
Processing Flow
 Copy input data from CPU to GPU memory.
 Load GPU program and execute, caching result
on the device.
 Copy results from GPU to CPU.
[Figure: host (CPU and RAM) exchanging data with device (GPU with global, constant and texture memory)]
Writing Efficient Code
 High priority considerations
 Minimum CPU-GPU transfers.
 Use of coalesced data transfers.
 Use of shared memory instead of global memory
whenever possible.
 Avoiding different execution paths within a warp.
 Medium priority considerations
 Access to shared memory should be planned to
avoid serialization.
 Redundant data transfers from global memory
should be avoided.
Writing Efficient Code
 Threads per block should be a multiple of 32.
 Use of the fast math library whenever possible.
 Low priority considerations
 Use of zero-copy operations.
 For kernels with long argument lists, some arguments
should be placed in constant memory.
 Expensive modulo, division operations should be
avoided in favor of shift operations whenever
possible.
 Automatic conversion of double to float should be
avoided.
 Loop unrolling should be used whenever possible.
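Two of the guidelines above are simple integer arithmetic: replacing modulo and division by a power of two with mask and shift operations, and rounding a thread count up to a multiple of the warp size. The sketch below illustrates the arithmetic only, in Python rather than in a kernel; the function names are hypothetical and the identities hold for non-negative integers.

```python
WARP_SIZE = 32   # threads per warp; a power of two
WARP_SHIFT = 5   # log2(WARP_SIZE)


def lane_and_warp_divmod(thread_id):
    """Lane and warp index using modulo and division."""
    return thread_id % WARP_SIZE, thread_id // WARP_SIZE


def lane_and_warp_bitwise(thread_id):
    """Same result using a mask and a shift (non-negative ids only)."""
    return thread_id & (WARP_SIZE - 1), thread_id >> WARP_SHIFT


def round_up_to_warp(thread_count):
    """Smallest multiple of WARP_SIZE that is >= thread_count."""
    return (thread_count + WARP_SIZE - 1) & ~(WARP_SIZE - 1)


if __name__ == "__main__":
    for tid in (0, 31, 32, 100):
        print(tid, lane_and_warp_divmod(tid), lane_and_warp_bitwise(tid))
    print(round_up_to_warp(100))
```

The same substitutions only apply when the divisor is a power of two; for other divisors the modulo or division has to stay.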
3. SAR Processing
 What is Synthetic Aperture Radar
 SAR Processing
 Processing Algorithms
 Basic RDA
 Simplified RDA
What is Synthetic Aperture Radar
 An active microwave remote sensing imaging
system.
 Employs long range propagation characteristics
of radar and complex signal processing
techniques to produce high resolution images.
 High resolution achieved by synthesizing long
antenna aperture through signal processing
techniques.
 Pros (in comparison with optical systems):
 All-weather, day-and-night operation.
 Unaffected by atmospheric constituents.
 Sensitivity to dielectric properties (can image ice,
biomass etc.)
 Sensitivity to surface roughness (oceans, wind
What is Synthetic Aperture Radar
 Accurate measurement of distance.
 Sensitivity to man made objects.
 Sensitivity to target structure.
 Subsurface penetration.
 Cons
 Complex interactions (difficult to visualize and
understand)
 Speckle effects (difficult in visual interpretation)
 Topographic effects
SAR Processing
 A set of procedures to obtain an interpretable image
from raw data that is spread in the azimuth and range
directions.
 In range, the data is spread over the duration of the
transmitted FM pulse.
 In azimuth, the data is spread over the time a point
target is illuminated by the radar beam.
 SAR processing compresses this data, taking into
account range cell migration, earth curvature,
earth rotation and air/spacecraft attitude noise, to
produce the final image.
 Given the nature of the SAR system and its signals,
signal processing rather than image processing provides
the appropriate tools for SAR processing.
SAR Processing Algorithms
 Mainstream SAR processing algorithms include:
 Range Doppler algorithm (RDA)
 High resolution images for low squint and for relatively
smaller aperture sizes. Very popular.
 Chirp scaling algorithm (CSA)
 Two-dimensional operations with range independence
followed by range corrections in range Doppler domain.
 Omega-K algorithm (ωKA)
 Efficient and accurate in two-dimensional frequency
domain.
 SPECAN algorithm
 Good for medium to low resolution requirements.
Range Doppler Algorithm
 Versions of range Doppler:
 Basic RDA
 RDA with accurate SRC
 RDA with approximate SRC
 Simplified range Doppler
Basic RDA
Raw data
 Range compression: range FFT, matched filter multiply, range IFFT.
 Azimuth FFT: data in range Doppler domain.
 RCMC: interpolation operation in range Doppler domain.
 Azimuth compression: azimuth matched filter multiply.
 Azimuth IFFT and lookup summation: brings the signal back into the time domain.
Final image
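The range compression step (FFT, matched filter multiply, IFFT) can be illustrated on a single range line. The sketch below is a toy example, not the thesis code: it uses a naive DFT instead of an FFT library, and the chirp rate, line length and delay are made-up values chosen only to keep the example small.

```python
import cmath


def dft(samples, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(samples)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += samples[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        result.append(acc / n if inverse else acc)
    return result


def chirp(num_samples, rate):
    """Linear FM pulse; rate is in cycles per sample squared (hypothetical)."""
    half = num_samples / 2
    return [cmath.exp(1j * cmath.pi * rate * (i - half) ** 2)
            for i in range(num_samples)]


def simulate_echo(replica, line_length, delay):
    """Range line holding one point-target echo starting at sample 'delay'."""
    echo = [0j] * line_length
    for i, sample in enumerate(replica):
        echo[delay + i] = sample
    return echo


def range_compress(echo, replica):
    """FFT, multiply by the conjugate replica spectrum, then IFFT."""
    padded = list(replica) + [0j] * (len(echo) - len(replica))
    echo_spectrum = dft(echo)
    filter_spectrum = [value.conjugate() for value in dft(padded)]
    product = [e * f for e, f in zip(echo_spectrum, filter_spectrum)]
    return dft(product, inverse=True)


def peak_index(samples):
    """Index of the sample with the largest magnitude."""
    return max(range(len(samples)), key=lambda i: abs(samples[i]))


if __name__ == "__main__":
    replica = chirp(num_samples=16, rate=0.03)
    line = simulate_echo(replica, line_length=64, delay=20)
    compressed = range_compress(line, replica)
    print("peak at sample", peak_index(compressed))
```

The spread-out pulse collapses to a single peak at the sample where the echo starts. On the GPU the same multiply would be applied to every range line in parallel, with a library FFT in place of the naive transform.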
Simplified RDA
 For narrower swath width and medium resolution
requirements, RCM can be assumed independent
of range.
Raw data
 Pre-filtering: removes the Doppler centroid.
 Range compression: range FFT, matched filter multiply (no range IFFT).
 Azimuth FFT: both range and azimuth in frequency domain.
 RCMC: RCM phase function multiplied with each range line.
 Range IFFT: data in range Doppler domain.
 Azimuth compression.
 Azimuth IFFT and lookup summation.
Final image
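In the simplified RDA, RCMC becomes a phase multiply in the frequency domain instead of an interpolation. The underlying idea is the Fourier shift theorem: multiplying a spectrum by a linear phase ramp shifts the signal in time. The sketch below demonstrates this on one toy range line with a naive DFT; the function names and the fixed shift are illustrative assumptions, whereas in the actual algorithm the shift would depend on azimuth frequency.

```python
import cmath


def dft(samples, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(samples)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += samples[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        result.append(acc / n if inverse else acc)
    return result


def shift_by_phase_ramp(line, shift):
    """Circularly delay 'line' by 'shift' samples via a spectral phase multiply."""
    n = len(line)
    spectrum = dft(line)
    shifted_spectrum = []
    for k, value in enumerate(spectrum):
        # Signed frequency index, so the ramp is symmetric around zero.
        freq = k if k < n // 2 else k - n
        phase = cmath.exp(-2j * cmath.pi * freq * shift / n)
        shifted_spectrum.append(value * phase)
    return dft(shifted_spectrum, inverse=True)


if __name__ == "__main__":
    line = [0j] * 16
    line[5] = 1 + 0j          # a compressed point target at sample 5
    moved = shift_by_phase_ramp(line, shift=3)
    peak = max(range(len(moved)), key=lambda i: abs(moved[i]))
    print("target moved to sample", peak)
```

Because the correction is a pointwise multiply, each range line can be handled by an independent group of GPU threads, which is part of why this variant maps well to the GPU.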
4. Implementation
 Hardware resources
 Software resources
 CPU Implementation
 MATLAB GPU Implementation
 CUDA Implementation
 Result Comparison
Hardware resources
CPU
Name                      Intel Xeon E5504
CPU clock                 2 GHz
Number of cores           4
System memory             4 GB
DDR3 clock                800 MHz
Maximum memory bandwidth  19.2 GB/s
Memory type               DDR3 PC3
PCI slot                  PCI Express
GPU
Name                      NVIDIA Tesla C1060
Number of cores           240
SP clock                  1.296 GHz
Memory                    4 GB GDDR3
Maximum memory bandwidth  102 GB/s
Memory interface          512 bit, PCI Express
GFLOPS                    933 single precision, 78 double precision
Software resources
CPU GPU
 Windows 7 Ultimate
64-bit
 MATLAB release
2010b
 Visual Studio 2008
SP1
 CUDA Toolkit 4.1
 MATLAB release
2010b
 NVIDIA Parallel
Nsight
 Visual Profiler
 CUDA MEMCHECK
 CUFFT library
RADARSAT – I Data
• CEOS Format
• Raw data is required to be
extracted from CEOS data
before SAR processing
algorithm can be applied.
Parameter                   Value     Units
Sampling rate               32.317    MHz
Range FM rate               0.72135   MHz/µs
Pulse duration              41.74     µs
Radar frequency             5.3       GHz
Radar wavelength            0.05657   m
Pulse repetition frequency  1256.98   Hz
Effective radar velocity    7062      m/s
Azimuth FM rate             1733      Hz/s
Table: RADARSAT – I data parameters
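Several quantities needed for range compression follow directly from the table. The sketch below is an illustrative calculation, not part of the original processing code: it derives the number of samples in the transmitted pulse and the chirp bandwidth, and cross-checks the wavelength against the radar frequency. The function names are hypothetical.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0

SAMPLING_RATE_MHZ = 32.317
RANGE_FM_RATE_MHZ_PER_US = 0.72135
PULSE_DURATION_US = 41.74
RADAR_FREQUENCY_GHZ = 5.3


def pulse_length_samples():
    """Number of samples spanned by the transmitted pulse (replica length)."""
    return round(SAMPLING_RATE_MHZ * PULSE_DURATION_US)


def chirp_bandwidth_mhz():
    """Bandwidth swept by the linear FM pulse."""
    return RANGE_FM_RATE_MHZ_PER_US * PULSE_DURATION_US


def wavelength_m():
    """Wavelength implied by the radar frequency."""
    return SPEED_OF_LIGHT_M_S / (RADAR_FREQUENCY_GHZ * 1e9)


if __name__ == "__main__":
    print("pulse length:", pulse_length_samples(), "samples")
    print(f"chirp bandwidth: {chirp_bandwidth_mhz():.2f} MHz")
    print(f"wavelength: {wavelength_m():.5f} m")
```

The chirp bandwidth of roughly 30 MHz sits just below the 32.317 MHz sampling rate, and the computed wavelength agrees with the tabulated value to within rounding.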
CEOS data → CEOS data extraction utility → Raw SAR data
SAR Processing GUI
Functions
• CEOS data extraction.
• MATLAB-CPU SAR processing.
• MATLAB-GPU SAR processing.
• CUDA input/output manipulation.
• CUDA program execution.
CPU Implementation
 Implemented using MATLAB
 FFT/IFFT using standard MATLAB functions
CPU Processed SAR image
A 2048 x 4096 SAR image using CPU based implementation
MATLAB-GPU Implementation
 MATLAB has supported GPU computing since
release 2010b.
 Implemented using native MATLAB-GPU functions only
(no CUDA kernel calls).
 Vectorization strategy employed to implement
vector-matrix multiplications on the GPU.
 All FFT/IFFTs performed using MATLAB-GPU FFT/IFFT
support functions.
[Figure: vectorization strategy applied across columns 1 to n]
MATLAB-GPU Implementation
 Limit on maximum image size that can be
calculated due to GPU memory constraints.
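The memory limit can be put in perspective with a rough footprint estimate. The sketch below assumes complex samples stored as two floating-point values and treats the 4 GB of GPU memory as 4 GiB; the slides do not state the sample precision used or how many full-size arrays are resident at once, so both are left as parameters.

```python
def image_bytes(rows, cols, bytes_per_component=4):
    """Bytes for one complex image: two components (real, imaginary) per sample."""
    return rows * cols * 2 * bytes_per_component


def copies_that_fit(rows, cols, gpu_bytes=4 * 1024**3, bytes_per_component=4):
    """How many full-size complex arrays fit in GPU memory (upper bound)."""
    return gpu_bytes // image_bytes(rows, cols, bytes_per_component)


if __name__ == "__main__":
    single = image_bytes(2048, 4096)
    double = image_bytes(2048, 4096, bytes_per_component=8)
    print(f"complex single: {single / 1024**2:.0f} MiB")
    print(f"complex double: {double / 1024**2:.0f} MiB")
    print("single-precision copies in 4 GiB:", copies_that_fit(2048, 4096))
```

A vectorized implementation that expands filter vectors into full matrices holds several such arrays at once, so the usable image size is smaller than the single-array figure suggests.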
MATLAB-GPU Implementation
 Speedups as high as 21 times achieved compared with
the CPU implementation.
MATLAB-GPU Implementation
A 2048 x 4096 SAR image using MATLAB-GPU based implementation
MATLAB-GPU Implementation
 Advantages
 Quick and easy to implement
 Sufficient speedups obtained with little effort
 Little knowledge of GPU hardware and no
knowledge of optimization techniques required.
 Disadvantages
 Currently, only a limited number of MATLAB functions
are supported on the GPU.
 Not all overloads of a function are available for the GPU.
 Less control over hardware resources and memory.
 Few optimization options.
CUDA Implementation
 Strategy
 Signal data read as binary file
 Vectors, matched filters calculated on CPU
 Vectors/signal data transferred to GPU
 Following kernels executed in order on GPU
 Pre-filtering kernel
 Range compression kernel
 RCMC kernel
 Azimuth compression kernel
 Image pixel calculation kernel
 Data transferred from GPU to CPU and saved on
disk as image.
Optimization considerations
 Chosen block size = 8 × 8 = 64. Conforms with
memory coalescing requirements.
 Constant variables stored in constant memory
 Local variables and phase functions calculated in the
kernel whenever possible to reduce global memory
access.
 CPU-GPU data transfer kept to a minimum by
transferring data CPU→GPU at the beginning
and GPU→CPU at the end of the algorithm.
 Using CUFFT's cufftPlanMany() plan for
FFT/IFFTs along data columns.
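The launch configuration implied by an 8 × 8 block can be worked out for the 2048 x 4096 image used in the results. The sketch below shows only the arithmetic, in Python; it assumes one thread per pixel, assumes 4096 is the x extent, and uses the 65535-per-dimension grid limit of compute capability 1.x devices. The helper names are hypothetical.

```python
BLOCK_X, BLOCK_Y = 8, 8
WARP_SIZE = 32
MAX_GRID_DIM = 65535  # per-dimension grid limit on compute capability 1.x


def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def launch_config(width, height):
    """Grid size and per-block figures for one thread per pixel."""
    grid_x = ceil_div(width, BLOCK_X)
    grid_y = ceil_div(height, BLOCK_Y)
    if grid_x > MAX_GRID_DIM or grid_y > MAX_GRID_DIM:
        raise ValueError("grid exceeds device limit; use larger blocks or tiling")
    threads_per_block = BLOCK_X * BLOCK_Y
    return {
        "grid": (grid_x, grid_y),
        "blocks": grid_x * grid_y,
        "threads_per_block": threads_per_block,
        "warps_per_block": ceil_div(threads_per_block, WARP_SIZE),
    }


if __name__ == "__main__":
    print(launch_config(width=4096, height=2048))
```

With 64 threads per block, each block is exactly two warps, which satisfies the multiple-of-32 guideline without leaving partially filled warps.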
CUDA Implementation Results
A 2048 x 4096 SAR image using CUDA based implementation
CUDA Implementation Results
CUDA Implementation Results
CUDA/MATLAB-GPU/MATLAB-CPU computation time comparison
MATLAB-GPU/CUDA computation time comparison
MATLAB-GPU/CUDA speedup comparison
 Speedups as high as 53 times achieved with CUDA,
compared with a maximum speedup of 21 times
with MATLAB-GPU.
5. Conclusions & Future Work
Conclusions
 Feasibility of GPU for SAR processing
 The amount of data, computational effort and inherent
algorithm parallelism make SAR processing
well suited to the GPU.
 TESLA C1060 GPU offers enough memory to
handle various common SAR image sizes.
 Cooling the GPU may be a challenge in some
environments.
 Scalability of CUDA will be an advantage when
porting existing SAR code to newer GPUs.
 GPUs might not be suitable where customizable
hardware is required or military hardware standards
must be adhered to.
Conclusions
 MATLAB-GPU based SAR Processing
 Significant speedups compared with CPU.
 Quick and easy to implement.
 Has some limitations:
 Currently has limited function support for GPU. Expected to
improve with future MATLAB releases.
 Vectorization strategy needs more memory. Future releases
promise to remove the need for vectorization (e.g. bsxfun in
release 2012a).
 Less control over GPU resources (memory etc.).
 CUDA SAR Processing
 CUDA: Flexible and scalable with a small learning curve.
 More control over GPU resources.
 Optimization strategies can be applied.
 Faster and more memory efficient than the MATLAB
implementation.
Conclusions
 Downsides of GPU
 Significant testing/verification effort might be
required if GPU hardware has to be upgraded (due
to the old one becoming obsolete).
 Proprietary nature of CUDA might be problematic if
the company discontinues CUDA or its support.
Future work
 CUDA kernels can be called in MATLAB code
using MATLAB’s CUDA kernel calling support.
 MATLAB GPU implementation can be improved
as newer and better functions become available.
 C/C++ based CPU implementation can be
developed to better judge MATLAB-CPU/CUDA
performance.
 Other SAR processing algorithms can be
implemented using framework laid out in this
project.
Q & A
Thank You
Design and implementation of GPU-based SAR image processor

  • 1. Najeeb Ahmad Master Thesis Presentation May, 2012 Supervisor: Dr. Sun Jinping Design and Implementation of GPU based SAR Image Processor School of Electronic Information Engineering Beihang University, Beijing China.
  • 2. Contents 1. Introduction 2. GPU Computing 3. SAR Processing 4. Implementation 5. Conclusion & Future Work
  • 4. PROBLEM: Synthetic aperture radar (SAR) data processing is computationally intensive and time-consuming on conventional CPUs. Given the growing use of GPUs for scientific computing, the task is to accelerate the simplified range Doppler SAR processing algorithm on a GPU using modern GPGPU technology, to achieve real-time or near-real-time performance, and to evaluate the GPU's suitability for SAR processing.
  • 5. MOTIVATION
      • Computationally intensive and time-consuming nature of SAR processing algorithms.
      • Inherent parallelism in most SAR processing algorithms.
      • Advent of modern GPGPU technology and availability of commodity GPUs as general-purpose computation engines.
      • Architectural parallelism and ample hardware resources in modern GPUs, which make them especially useful for handling large data volumes and for parallel SAR algorithm implementation.
  • 6. OBJECTIVE
      • To implement and accelerate the simplified range Doppler SAR processing algorithm on a modern NVIDIA Tesla GPU using CUDA and MATLAB-GPU capabilities.
      • The research explores areas such as:
          • Algorithm adaptation for parallel implementation.
          • Suitability of MATLAB for algorithm implementation.
          • Suitability of CUDA for algorithm implementation.
          • Comparison of CPU, CUDA, and MATLAB-GPU implementations.
          • The GPU as a SAR processing platform.
  • 7. METHODOLOGY
      • Algorithm implementation and verification on an Intel Xeon CPU using MATLAB.
      • Identification of the parallelizable portions of the algorithm.
      • Algorithm implementation on a Tesla C1060 GPU using MATLAB's native GPU capabilities.
      • Algorithm implementation on a Tesla C1060 GPU using CUDA.
      • Analysis of the CPU, MATLAB-GPU, and CUDA implementations.
  • 8. 2. GPU Computing
      • Introduction to GPU Computing
      • GPGPU: Brief History
      • NVIDIA CUDA
      • Writing efficient code
  • 9. Introduction to GPU Computing
      • Use of graphics processing units (GPUs) for general-purpose computing applications.
      • CPU: one, four, or eight cores. Capable of handling a few threads. Suitable for serial code.
      • GPU: hundreds of cores. Capable of handling hundreds of threads. Suitable for parallel code.
  • 10. Introduction to GPU Computing
      • GPU computing model: a heterogeneous computing model that employs both CPU and GPU, with serial computing on the CPU and parallel computing on the GPU.
  • 11. GPGPU: Brief History
      • First use of the GPU as a general-purpose computing device around 1999-2000, through graphics APIs. Huge performance boosts observed. Generally unpopular due to tedious programming.
      • Introduction of NVIDIA's "CUDA" and AMD's "Stream Computing" in 2007. Beginning of the modern GPGPU era. Other vendors introduced their own GPGPU systems.
      • NVIDIA's CUDA is gaining popularity due to its maturity and performance.
  • 12. NVIDIA CUDA
      • Compute Unified Device Architecture.
      • Comprises an instruction set architecture (ISA) and a parallel compute engine in the GPU, programmable with high-level languages extended for GPU computing.
      • The CUDA framework has two parts: hardware and software. From the software perspective, CUDA means C/C++ and Fortran extended to support GPU computing.
      • CUDA is a "Single Instruction, Multiple Thread" (SIMT) architecture.
  • 13. CUDA Hardware
      • Streaming multiprocessor (SM): basic computing unit of the GPU. Comprises eight streaming processors (SPs) and memory. GPUs differ in the number of SMs and in SP clock frequency.
      • [Figure: SM block diagram showing eight SPs, two SFUs, the MT IU, and shared memory]
  • 14. CUDA Memory Architecture
      • Understanding the memory architecture is critical for writing efficient CUDA programs.
      • All CUDA-enabled hardware has the following types of memory:
          • Global memory.
          • Shared memory and registers.
          • Texture memory and texture cache.
          • Constant memory and constant cache.
          • Local memory for register spilling.
      • [Figure: GPU with SM 1 to SM n, each containing SPs, shared memory, a texture cache, and a constant cache, above the device RAM that holds global, local, texture, and constant memory]
  • 15. NVIDIA Tesla C1060 GPU
      • PCI Express 2.0 compliant computing processor board based on the NVIDIA Tesla T10 graphics processing unit, targeted at HPC applications.
      • Feature highlights:
          • 30 SMs = 240 SPs.
          • SP clock = 1.296 GHz.
          • 4 GB GDDR3 memory with 102 GB/s bandwidth.
          • IEEE 754 single- and double-precision floating point compliant.
          • 933 GFLOPS single-precision and 78 GFLOPS double-precision performance.
          • Compute capability: 1.3.
          • Supported by MATLAB for GPU computing.
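The peak throughput figures on this slide can be reproduced from the SM count and the SP clock. The sketch below is only a back-of-the-envelope check. The per-clock issue rates are assumptions that the slide does not state: three single-precision flops per SP per clock (a multiply-add plus a multiply), and one double-precision fused multiply-add unit per SM.

```python
# Figures quoted on the slide.
SM_COUNT = 30
SP_PER_SM = 8
SP_CLOCK_GHZ = 1.296

# Assumptions (not on the slide): per-clock issue rates of the T10 design.
SP_FLOPS_PER_CLOCK = 3   # multiply-add (2 flops) + multiply (1 flop) per SP
DP_UNITS_PER_SM = 1      # one double-precision unit per SM
DP_FLOPS_PER_CLOCK = 2   # fused multiply-add counts as 2 flops


def peak_gflops(units, flops_per_clock, clock_ghz):
    """Peak rate = number of units x flops per clock x clock in GHz."""
    return units * flops_per_clock * clock_ghz


single = peak_gflops(SM_COUNT * SP_PER_SM, SP_FLOPS_PER_CLOCK, SP_CLOCK_GHZ)
double = peak_gflops(SM_COUNT * DP_UNITS_PER_SM, DP_FLOPS_PER_CLOCK, SP_CLOCK_GHZ)

print(f"single precision: {single:.1f} GFLOPS")
print(f"double precision: {double:.1f} GFLOPS")
```

Under these assumptions the results round to the 933 and 78 GFLOPS quoted above. The roughly 12:1 ratio between them is worth keeping in mind when choosing the precision of a SAR processing chain on this card.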
  • 16. CUDA Programming Model
      • At its core are thread groups, shared memory, and barrier synchronization.
      • Provides coarse-grained data and task parallelism and fine-grained data and thread parallelism, giving expressivity and scalability.
      • Thread hierarchy: grid, blocks, threads.
      • Kernels: functions executed on the device (GPU) in parallel threads.
      • CUDA provides APIs to launch kernels in parallel threads and to synchronize them.
  • 17. Processing Flow
      • Copy input data from CPU to GPU memory.
      • Load the GPU program and execute it, caching results on the device.
      • Copy results from GPU to CPU.
      • [Figure: host (CPU and RAM) connected to device (GPU with global, constant, and texture memory)]
  • 18. Writing Efficient Code
      • High-priority considerations:
          • Minimum CPU-GPU transfers.
          • Use of coalesced data transfers.
          • Use of shared memory instead of global memory whenever possible.
          • Avoiding different execution paths within a warp.
      • Medium-priority considerations:
          • Access to shared memory should be planned to avoid serialization.
          • Redundant data transfers from global memory should be avoided.
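To make the coalescing item concrete, the sketch below counts how many memory segments one half-warp touches under two access patterns. It is a deliberately simplified model, not a description of the hardware: it assumes 16 threads are serviced together, that accesses fall into aligned 128-byte segments, and that each sample is an 8-byte single-precision complex value. The row length of 4096 samples and the function name `transactions` are illustrative.

```python
SEGMENT_BYTES = 128   # assumed aligned segment size for 8-byte words
HALF_WARP = 16        # assumed number of threads serviced together
ELEM_BYTES = 8        # single-precision complex sample (2 x 4 bytes)


def transactions(first_index, stride_elems, elem_bytes=ELEM_BYTES):
    """Count the distinct aligned segments touched by one half-warp.

    Thread t reads element first_index + t * stride_elems.
    """
    segments = set()
    for t in range(HALF_WARP):
        address = (first_index + t * stride_elems) * elem_bytes
        segments.add(address // SEGMENT_BYTES)
    return len(segments)


COLS = 4096  # assumed row length of a row-major image

row_walk = transactions(0, 1)      # neighbouring threads read neighbouring samples
col_walk = transactions(0, COLS)   # neighbouring threads walk down one column
misaligned = transactions(8, 1)    # contiguous, but starting mid-segment

print(row_walk, col_walk, misaligned)
```

In this model the contiguous read needs one segment, the column walk needs one per thread, and a misaligned contiguous read needs two. That is the reason column-wise operations on row-major data are usually restructured or handed to a batched library call.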
  • 19. Writing Efficient Code
      • Medium-priority considerations (continued):
          • Threads per block should be a multiple of 32.
          • Use of the fast math library whenever possible.
      • Low-priority considerations:
          • Use of zero-copy operations.
          • For kernels with long argument lists, some arguments should be placed in constant memory.
          • Expensive modulo and division operations should be avoided in favor of shift operations whenever possible.
          • Automatic conversion of double to float should be avoided.
          • Loop unrolling should be used whenever possible.
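The shift-for-division advice applies when the divisor is a power of two, which is the case for the warp size of 32. A minimal sketch follows; the helper name `warp_and_lane` is hypothetical, and the identity holds for non-negative indices only.

```python
WARP_SIZE = 32                              # a power of two
WARP_SHIFT = WARP_SIZE.bit_length() - 1     # log2(32) = 5
WARP_MASK = WARP_SIZE - 1                   # 0b11111


def warp_and_lane(thread_id):
    """Split a non-negative thread index into (warp number, lane in warp)."""
    return thread_id >> WARP_SHIFT, thread_id & WARP_MASK


# The shift/mask pair agrees with division and modulo for every index checked.
for tid in range(1024):
    assert warp_and_lane(tid) == divmod(tid, WARP_SIZE)

print(warp_and_lane(70))
```

In a kernel the same pattern would be written with `>>` and `&` on the thread index. Compilers often make this substitution themselves when the divisor is a compile-time constant, so the manual form matters mostly for run-time divisors known to be powers of two.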
  • 20. 3. SAR Processing
      • What is Synthetic Aperture Radar
      • SAR Processing
      • Processing Algorithms
      • Basic RDA
      • Simplified RDA
  • 21. What is Synthetic Aperture Radar
      • An active microwave remote sensing imaging system.
      • Employs the long-range propagation characteristics of radar and complex signal processing techniques to produce high-resolution images.
      • High resolution is achieved by synthesizing a long antenna aperture through signal processing.
      • Pros (in comparison with optical systems):
          • All-weather, day-and-night operation.
          • Unaffected by atmospheric constituents.
          • Sensitivity to dielectric properties (can image ice, biomass, etc.).
          • Sensitivity to surface roughness (oceans, wind
  • 22. What is Synthetic Aperture Radar
      • Pros (continued):
          • Accurate measurement of distance.
          • Sensitivity to man-made objects.
          • Sensitivity to target structure.
          • Subsurface penetration.
      • Cons:
          • Complex interactions (difficult to visualize and understand).
          • Speckle effects (difficult visual interpretation).
          • Topographic effects.
  • 23. SAR Processing
      • A set of procedures to obtain an interpretable image from raw data that is spread in the azimuth and range directions.
      • In range, the data is spread over the duration of the transmitted FM pulse.
      • In azimuth, the data is spread over the time a point target is illuminated by the radar beam.
      • SAR processing compresses this data, taking into account range cell migration, earth curvature, earth rotation, and aircraft or spacecraft attitude noise, to produce the final image.
      • Given the nature of the SAR system and its signals, signal processing rather than image processing provides the appropriate tools.
  • 24. SAR Processing Algorithms
      • Mainstream SAR processing algorithms include:
          • Range Doppler algorithm (RDA): high-resolution images for low squint and relatively small aperture sizes. Very popular.
          • Chirp scaling algorithm (CSA): two-dimensional operations with range independence, followed by range corrections in the range Doppler domain.
          • Omega-K algorithm (ωKA): efficient and accurate in the two-dimensional frequency domain.
          • SPECAN algorithm: good for medium- to low-resolution requirements.
  • 25. Range Doppler Algorithm
      • Versions of range Doppler:
          • Basic RDA
          • RDA with accurate SRC
          • RDA with approximate SRC
          • Simplified range Doppler
  • 26. Basic RDA (processing chain)
      • Raw data.
      • Range compression: range FFT, matched filter multiply, range IFFT.
      • Azimuth FFT: data is now in the range Doppler domain.
      • RCMC: interpolation operation in the range Doppler domain.
      • Azimuth compression: azimuth matched filter multiply.
      • Azimuth IFFT and lookup summation: brings the signal back into the time domain.
      • Final image.
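The range compression step (range FFT, matched filter multiply, range IFFT) can be illustrated on a single simulated range line. The sketch below uses the sampling rate, FM rate, and pulse duration from the RADARSAT-1 parameter table in this deck. The 2048-sample line, the 300-sample echo delay, and the pure-Python FFT are illustrative assumptions; the thesis itself uses MATLAB and CUFFT transforms.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT via conjugation of the forward transform."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


FS = 32.317e6    # sampling rate, Hz
K = 0.72135e12   # range FM rate, Hz/s (0.72135 MHz/us)
T = 41.74e-6     # pulse duration, s
N = 2048         # assumed range line length (power of two)
DELAY = 300      # assumed echo delay in samples

pulse_len = int(T * FS)
# Linear FM pulse centred on t = 0.
chirp = [cmath.exp(1j * cmath.pi * K * (n / FS - T / 2) ** 2)
         for n in range(pulse_len)]
replica = chirp + [0j] * (N - pulse_len)
echo = [0j] * DELAY + chirp + [0j] * (N - DELAY - pulse_len)

# Matched filter = conjugate of the replica spectrum.
matched = [r.conjugate() for r in fft(replica)]
compressed = ifft([e * m for e, m in zip(fft(echo), matched)])

peak = max(range(N), key=lambda i: abs(compressed[i]))
print(peak, round(abs(compressed[peak])))
```

The energy that was spread over the whole pulse collapses to a peak at the echo delay, with a magnitude equal to the number of pulse samples. On the GPU, each range line of the image goes through the same three operations independently, which is what makes the step easy to parallelize.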
  • 27. Simplified RDA
      • For narrower swath widths and medium resolution requirements, RCM can be assumed independent of range.
      • Processing chain:
          • Raw data.
          • Pre-filtering: to remove the Doppler centroid.
          • Range compression: range FFT, matched filter multiply (no range IFFT).
          • Azimuth FFT: both range and azimuth are now in the frequency domain.
          • RCMC: RCM phase function multiply with each range line.
          • Range IFFT: data is now in the range Doppler domain.
          • Azimuth compression.
          • Azimuth IFFT and lookup summation.
          • Final image.
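Because the simplified RDA treats RCM as independent of range, the correction for each azimuth-frequency line reduces to a shift of that whole line, which in the range-frequency domain is a multiplication by a linear phase. The sketch below demonstrates this on three toy range lines. The migration amounts, the 64-sample line length, and the function names are hypothetical, and a direct DFT stands in for the FFT to keep the example short.

```python
import cmath


def dft(x, sign=-1):
    """Direct O(n^2) DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                for m in range(n))
            for k in range(n)]


def correct_migration(range_line, migration_samples):
    """Shift a range line earlier by multiplying its spectrum by a linear phase."""
    n = len(range_line)
    corrected = []
    for k, value in enumerate(dft(range_line)):
        f = k if k < n // 2 else k - n   # signed range-frequency bin
        corrected.append(value * cmath.exp(2j * cmath.pi * f * migration_samples / n))
    return [v / n for v in dft(corrected, sign=+1)]


def peak_bin(line):
    """Index of the strongest sample in a range line."""
    return max(range(len(line)), key=lambda i: abs(line[i]))


N = 64
TRUE_BIN = 10            # range bin where the target belongs
MIGRATIONS = [0, 1, 3]   # hypothetical migration per azimuth-frequency line

peaks = []
for migration in MIGRATIONS:
    line = [0j] * N
    line[TRUE_BIN + migration] = 1 + 0j   # target displaced by the migration
    peaks.append(peak_bin(correct_migration(line, migration)))

print(peaks)
```

After the phase multiply, every line has its target back in the same range bin. The phase form also handles fractional shifts, which is why it can replace the interpolation used in the basic RDA.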
  • 28. 4. Implementation
      • Hardware resources
      • Software resources
      • CPU Implementation
      • MATLAB GPU Implementation
      • CUDA Implementation
      • Result Comparison
  • 29. Hardware resources
      • GPU:
          • Name: NVIDIA Tesla C1060
          • Number of cores: 240
          • SP clock: 1.296 GHz
          • Memory: 4 GB GDDR3
          • Maximum memory bandwidth: 102 GB/s
          • Memory interface: 512 bit – PCI Express
          • GFLOPS: 933 single precision, 78 double precision
      • CPU:
          • Name: Intel Xeon E5504
          • CPU clock: 2 GHz
          • Number of cores: 4
          • System memory: 4 GB DDR3
          • Clock: 800 MHz
          • Maximum memory bandwidth: 19.2 GB/s
          • Memory type: DDR3 PC3
          • PCI slot: PCI Express
  • 30. Software resources (CPU and GPU columns, in slide order)
      • Windows 7 Ultimate 64-bit
      • MATLAB release 2010b
      • Visual Studio 2008 SP1
      • CUDA Toolkit 4.1
      • MATLAB release 2010b
      • NVIDIA Parallel Nsight
      • Visual Profiler
      • CUDA MEMCHECK
      • CUFFT library
  • 31. RADARSAT-1 Data
      • CEOS format.
      • Raw data must be extracted from the CEOS data before the SAR processing algorithm can be applied.
      • Flow: CEOS data → CEOS data extraction utility → raw SAR data.
      • Table: RADARSAT-1 data parameters
          • Sampling rate: 32.317 MHz
          • Range FM rate: 0.72135 MHz/µs
          • Pulse duration: 41.74 µs
          • Radar frequency: 5.3 GHz
          • Radar wavelength: 0.05657 m
          • Pulse repetition frequency: 1256.98 Hz
          • Effective radar velocity: 7062 m/s
          • Azimuth FM rate: 1733 Hz/s
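Several quantities that a processor needs follow directly from the table. The sketch below derives a few of them as a consistency check. The variable names are illustrative, and the speed of light is the only value not taken from the slide.

```python
C = 299_792_458.0              # speed of light, m/s (not from the slide)

# Values from the RADARSAT-1 parameter table.
radar_frequency_hz = 5.3e9
sampling_rate_hz = 32.317e6
range_fm_rate_hz_per_s = 0.72135e12   # 0.72135 MHz/us
pulse_duration_s = 41.74e-6

# Wavelength from the carrier frequency.
wavelength_m = C / radar_frequency_hz

# Bandwidth swept by the linear FM pulse.
chirp_bandwidth_hz = range_fm_rate_hz_per_s * pulse_duration_s

# Ratio of sampling rate to signal bandwidth (must exceed 1 for complex data).
oversampling = sampling_rate_hz / chirp_bandwidth_hz

# Length of the range matched filter in samples.
samples_per_pulse = pulse_duration_s * sampling_rate_hz

# Slant-range distance between consecutive samples (two-way travel).
range_sample_spacing_m = C / (2.0 * sampling_rate_hz)

print(f"wavelength           {wavelength_m:.5f} m")
print(f"chirp bandwidth      {chirp_bandwidth_hz / 1e6:.3f} MHz")
print(f"oversampling ratio   {oversampling:.3f}")
print(f"samples per pulse    {samples_per_pulse:.1f}")
print(f"range sample spacing {range_sample_spacing_m:.3f} m")
```

The computed wavelength agrees with the tabulated 0.05657 m to within about 0.01 mm, and the chirp bandwidth of roughly 30.1 MHz sits just below the 32.317 MHz sampling rate. The matched filter spans about 1349 samples, which fits comfortably inside a 2048-point range FFT.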
  • 32. SAR Processing GUI
      • Functions:
          • CEOS data extraction.
          • MATLAB-CPU SAR processing.
          • MATLAB-GPU SAR processing.
          • CUDA input/output manipulation.
          • CUDA program execution.
  • 33. CPU Implementation
      • Implemented using MATLAB.
      • FFT/IFFT using standard MATLAB functions.
  • 34. CPU Processed SAR image
      • [Figure: a 2048 x 4096 SAR image produced by the CPU-based implementation]
  • 35. MATLAB-GPU Implementation
      • MATLAB has supported GPU computing since release 2010b.
      • Implemented using native MATLAB-GPU functions only (no CUDA kernel calls).
      • A vectorization strategy is employed to implement vector-matrix multiplications on the GPU.
      • All FFT/IFFTs are performed using MATLAB-GPU FFT/IFFT support functions.
      • [Figure: vectorization diagram showing Column 1 through Column n, repeated three times]
  • 36. MATLAB-GPU Implementation
      • GPU memory constraints limit the maximum image size that can be processed.
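The memory limit is easy to estimate for the 2048 x 4096 image size used in this deck. The sketch below counts how many full-size complex buffers fit on the card. It assumes the 4 GB is 4 GiB and is entirely available, and it treats both precisions as possibilities since the slides do not say which one the MATLAB-GPU code used (MATLAB's default numeric type is double).

```python
ROWS, COLS = 2048, 4096                 # image size quoted in the slides
GPU_MEMORY_BYTES = 4 * 1024 ** 3        # assumption: 4 GB treated as 4 GiB, all usable

# Bytes per complex sample (real + imaginary part).
BYTES_PER_SAMPLE = {"single": 8, "double": 16}


def image_bytes(rows, cols, precision):
    """Storage needed for one complex image of the given precision."""
    return rows * cols * BYTES_PER_SAMPLE[precision]


def buffers_that_fit(rows, cols, precision):
    """How many full-size complex buffers fit in GPU memory."""
    return GPU_MEMORY_BYTES // image_bytes(rows, cols, precision)


for precision in ("single", "double"):
    size_mib = image_bytes(ROWS, COLS, precision) / 1024 ** 2
    count = buffers_that_fit(ROWS, COLS, precision)
    print(f"{precision}: {size_mib:.0f} MiB per buffer, {count} buffers fit")
```

A vectorized vector-matrix multiply replicates each filter vector into a matrix as large as the image, so every such operand costs one extra buffer on top of the data and any FFT workspace. This is why the usable image size runs out well before a single image would fill the card.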
  • 37. MATLAB-GPU Implementation
      • Speedups of up to 21 times achieved compared with the CPU implementation.
  • 38. MATLAB-GPU Implementation
      • [Figure: a 2048 x 4096 SAR image produced by the MATLAB-GPU based implementation]
  • 39. MATLAB-GPU Implementation
      • Advantages:
          • Quick and easy to implement.
          • Sufficient speedups obtained with little effort.
          • Little knowledge of GPU hardware and no knowledge of optimization techniques required.
      • Disadvantages:
          • Currently, a limited number of MATLAB functions are supported on the GPU.
          • Not all overloads of a function are available for the GPU.
          • Less control over hardware resources and memory.
          • Few optimization options.
  • 40. CUDA Implementation
      • Strategy:
          • Signal data is read as a binary file.
          • Vectors and matched filters are calculated on the CPU.
          • Vectors and signal data are transferred to the GPU.
          • The following kernels are executed in order on the GPU:
              • Pre-filtering kernel
              • Range compression kernel
              • RCMC kernel
              • Azimuth compression kernel
              • Image pixel calculation kernel
          • Data is transferred from GPU to CPU and saved on disk as an image.
  • 41. Optimization considerations
      • Chosen block size = 8 × 8 = 64. Conforms with memory coalescing requirements.
      • Constant variables stored in constant memory.
      • Local variables and phase function calculation used whenever possible to reduce global memory access.
      • CPU-GPU data transfer kept to a minimum: data is transferred CPU→GPU at the beginning and GPU→CPU at the end of the algorithm.
      • CUFFT's cufftPlanMany() plan used for FFT/IFFTs along data columns.
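The launch configuration implied by the 8 × 8 block choice can be worked out for the 2048 x 4096 image used in this deck. The sketch below mirrors the usual CUDA index arithmetic in plain Python. It assumes 2048 rows by 4096 columns with x mapped to columns, and it assumes the 65535-per-dimension grid limit of compute capability 1.x. The helper names are hypothetical.

```python
BLOCK = (8, 8)            # (x, y) threads per block, from the slide
ROWS, COLS = 2048, 4096   # assumed orientation of the 2048 x 4096 image
WARP_SIZE = 32
MAX_GRID_DIM = 65535      # assumed per-dimension grid limit for compute capability 1.x


def ceil_div(a, b):
    """Smallest integer >= a / b, used to cover partial tiles."""
    return -(-a // b)


# One thread per pixel: x covers columns, y covers rows.
grid = (ceil_div(COLS, BLOCK[0]), ceil_div(ROWS, BLOCK[1]))
threads_per_block = BLOCK[0] * BLOCK[1]
warps_per_block = threads_per_block // WARP_SIZE


def global_index(block_idx, thread_idx):
    """Map (blockIdx, threadIdx) to the (row, col) pixel a thread handles."""
    col = block_idx[0] * BLOCK[0] + thread_idx[0]
    row = block_idx[1] * BLOCK[1] + thread_idx[1]
    return row, col


assert max(grid) <= MAX_GRID_DIM
print(grid, threads_per_block, warps_per_block)
print(global_index((511, 255), (7, 7)))
```

With these assumptions the grid is 512 × 256 blocks of 64 threads, which is two full warps per block and satisfies the multiple-of-32 guideline from the efficiency slides. The last thread of the last block lands exactly on the final pixel, so no bounds check is wasted on padding for this image size.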
  • 42. CUDA Implementation Results
      • [Figure: a 2048 x 4096 SAR image produced by the CUDA-based implementation]
  • 47. MATLAB-GPU/CUDA speedup comparison
      • CUDA achieved speedups of up to 53 times, compared with a maximum speedup of 21 times for MATLAB-GPU.
  • 48. 5. Conclusions & Future Work
  • 49. Conclusions
      • Feasibility of the GPU for SAR processing:
          • The amount of data, the computational effort, and the inherent algorithm parallelism make SAR processing well suited to the GPU.
          • The Tesla C1060 GPU offers enough memory to handle various common SAR image sizes.
          • Cooling the GPU may be a challenge in some environments.
          • The scalability of CUDA will be an advantage when porting existing SAR code to newer GPUs.
          • GPUs might not be suitable where customizable hardware is required or military hardware standards must be adhered to.
  • 50. Conclusions
      • MATLAB-GPU based SAR processing:
          • Significant speedups compared with the CPU.
          • Quick and easy to implement.
          • Has some limitations:
              • Fewer functions are currently supported on the GPU. Expected to improve with future MATLAB releases.
              • The vectorization strategy needs more memory. Future releases promise to remove the need for vectorization (e.g. bsxfun in release 2012a).
              • Less control over GPU resources (memory etc.).
      • CUDA SAR processing:
          • CUDA: flexible and scalable, with a small learning curve.
          • More control over GPU resources.
          • Optimization strategies can be applied.
          • Faster and more memory-efficient than the MATLAB implementation.
  • 51. Conclusions
      • Downsides of the GPU:
          • Significant testing and verification effort might be required if the GPU hardware has to be upgraded (because the old hardware becomes obsolete).
          • The proprietary nature of CUDA might be problematic if the company discontinues CUDA or its support.
  • 52. Future work
      • CUDA kernels can be called from MATLAB code using MATLAB's CUDA kernel calling support.
      • The MATLAB-GPU implementation can be improved as newer and better functions become available.
      • A C/C++ based CPU implementation can be developed to better judge MATLAB-CPU/CUDA performance.
      • Other SAR processing algorithms can be implemented using the framework laid out in this project.
  • 53. Q & A