Parallel Implementation of Compressive Sensing Based SAR Imaging with GPU

¹Tian Jihua, ²Sun Jinping, ³Zhang Yuxi, ⁴Najeeb Ahmad, ⁵Zhang Bingchen

¹ School of Electronic and Information Engineering, Beihang University, [email protected]
²,³,⁴ School of Electronic and Information Engineering, Beihang University
⁵ Nat. Key Lab of MW Imaging Tech., Institute of Electronics, CAS

Abstract
This paper proposes a new scheme for the parallel implementation of compressive sensing based SAR
imaging on a GPU using the Iterative Shrinkage/Thresholding (IST) algorithm. To achieve a faster
recovery speed, we modify the structure of the existing IST algorithm and realize a fast
implementation on the GPU. The experimental results show that the parallel implementation on the GPU
achieves a significant speedup over the CPU implementation.

Keywords: Compressive Sensing, Synthetic Aperture Radar, Graphics Processing Unit, CUDA

1. Introduction

As a major remote sensing sensor, Synthetic aperture radar (SAR) can produce high resolution
images from a moving platform, such as an airplane or a satellite. A SAR system produces 2D (range
and azimuth) terrain reflectivity images by emitting a sequence of closely spaced radio frequency
pulses and by sampling the echoes scattered from the ground targets [1]. The main advantage of SAR is
that images of the illuminated area can be obtained independent of time-of-day or weather conditions
(e.g., fog, cloud level, rain, and snow). Modern airborne and spaceborne SAR systems can produce
very high resolution images and are being widely used in many civilian and military applications [1, 2].
Compressive sensing (CS), proposed by Donoho [3], Emmanuel Candès [4], Michael Elad [5] et al.,
is a recently developed theory that enables the recovery of signals and data from what appear to
be highly sub-Nyquist-rate samples. CS states that an unknown signal that is sparse (or sparse in a
certain basis) can be recovered exactly, with high probability, from a very limited number of
measurements by solving a convex $\ell_1$ optimization problem. Built on rigorous mathematics, CS has
attracted much attention in image processing, multi-sensor data fusion, radar applications and so on.
So far, several papers have addressed the use of CS in radar applications, including SAR and Inverse
Synthetic Aperture Radar (ISAR) [6-9].
However, the reconstruction of a sparse signal requires numerous matrix-vector multiplications,
which impose a heavy numerical burden, especially when the sensing matrix is large and dense.
Meanwhile, the computational load of compressive sensing based SAR imaging keeps growing with the
increasing demand for high resolution SAR images. As a result, reconstructing a SAR image on a CPU
takes a long time and cannot be done in real time. Recently, the fast processing performance of the
graphics processing unit (GPU) has offered an alternative for the fast reconstruction of sparse
signals. This paper realizes the fast reconstruction of compressive sensing based SAR images by
taking advantage of the efficient parallel computing capabilities of the GPU.

2. GPU Architecture and Software Framework


Recently, driven by the insatiable market demand for real-time, high-definition 3D graphics, the
programmable GPU has evolved into a highly parallel, multithreaded many core processor with
tremendous computational horsepower and very high memory bandwidth [10].
NVIDIA released the Compute Unified Device Architecture (CUDA) in November 2006, a development
environment and software framework for GPU computing. It is an extension of the standard C language
that allows users to program the GPU as a computational device without going through a graphics API.

Journal of Convergence Information Technology (JCIT), Volume 6, Number 12, December 2011
doi:10.4156/jcit.vol6.issue12.16

In the CUDA architecture, a task is split into a grid of thread blocks, each of which consists of a
number of threads. The thread blocks are distributed across the streaming multiprocessors of the GPU.
CUDA adopts the single instruction multiple thread (SIMT) model, in which all the threads in a block
share the same instruction code, operate on different data, and may be in different execution states.
Every thread has its own dedicated registers, and communication between the threads of a block is
realized through shared memory and synchronization. This design helps minimize the cost of context
switching on the GPU.
The GPU is designed to realize parallel computation through a large number of threads, which makes it
suitable for large scale parallel computing tasks that are intensive in computation and simple in logic.
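
The grid, block and thread hierarchy described above can be emulated in a few lines of Python. This is only an illustrative sketch: the names `launch`, `scale_kernel` and `scale_on_grid` are hypothetical, and the sequential loops stand in for threads that a GPU would run in parallel.

```python
def launch(grid_dim, block_dim, kernel, *args):
    """Emulate a 1D grid launch: visit every (block, thread) pair in turn."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, factor, out):
    """Same code for every thread; only the data index differs."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(data):                       # threads past the end do nothing
        out[i] = factor * data[i]


def scale_on_grid(data, factor, block_dim=4):
    out = [0.0] * len(data)
    grid_dim = (len(data) + block_dim - 1) // block_dim  # enough blocks to cover data
    launch(grid_dim, block_dim, scale_kernel, data, factor, out)
    return out
```

With five elements and four threads per block, two blocks are launched and three threads of the second block fall outside the data and stay idle.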

3. Theory of Compressive Sensing

The theory of CS reveals that exact recovery of an unknown sparse signal is possible from very
limited samples by solving an inverse problem through either a linear program or a greedy pursuit.
Suppose that a signal $s \in \mathbb{R}^N$ is $K$-sparse on a dictionary $\Psi$, i.e.

$$s = \Psi x \tag{1}$$

where $\Psi = \{\psi_1, \psi_2, \cdots, \psi_N\}$ is an $N \times N$ matrix constructed from a sparse
basis $\{\psi_i\}$, and $x$ is a vector in which all but $K$ entries are zero. Various expansions,
including wavelets, the DCT, and Gabor frames, are widely used for the representation and compression
of natural signals, images, and other data. The matrix $\Psi$ is constructed according to the selected
expansion. In order to reconstruct the signal $s$, a set of $M$ measurements ($M < N$) is acquired,
each of which is a linear combination of the points within $s$. More precisely

$$y = \Phi s = \Phi\Psi x \tag{2}$$

where $\Phi$ is an $M \times N$ matrix, hereinafter called the measurement matrix. Since $M < N$,
recovery of the signal $s$ from the measurements $y$ is ill-posed in general. CS theory reveals that
when the matrix $A = \Phi\Psi$ has the Restricted Isometry Property (RIP) [3-5], $x$, and hence the
signal $s$, can be recovered with high probability from a set of only $M = O(K \log(N/K))$
measurements $y$. The RIP is closely related to an incoherence property between $\Phi$ and $\Psi$,
which means that the rows of $\Phi$ cannot provide a sparse representation of the columns of $\Psi$
and vice versa. The RIP and incoherence hold for many pairs of bases, such as delta spikes and Fourier
sinusoids, or sinusoids and wavelets. It can be proved that a (pseudo) random noise-like matrix, such
as one constructed from Bernoulli or Gaussian random variables, performs very well as $\Phi$. Another
choice for the measurement matrix $\Phi$ that offers good performance in many cases is a causal,
quasi-Toeplitz matrix [3-6]. When the RIP holds, $x$, and hence the signal $s$, can be recovered from
the solution of a convex optimization problem. Formally, with high probability, $x$ is the unique
solution to

$$\min_x \|x\|_1 \quad \text{s.t.} \quad y = \Phi\Psi x \tag{3}$$

which can be solved efficiently with linear programming techniques. Current reconstruction methods
include iterative greedy algorithms such as Matching Pursuit (MP) and Orthogonal Matching Pursuit
(OMP), and convex relaxation algorithms such as Basis Pursuit (BP) and Iterative
Shrinkage/Thresholding (IST).
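
As a small illustration of Equations 1 and 2, the sketch below builds a sparse coefficient vector, a Gaussian measurement matrix and the resulting measurements. It is only a toy under stated assumptions: the sizes N = 16 and M = 8, the identity matrix used for Ψ, the 1/sqrt(M) scaling and all function names are choices made here for illustration, not taken from the paper.

```python
import math
import random


def matvec(mat, vec):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def gaussian_matrix(m, n, seed=0):
    """M x N matrix with i.i.d. Gaussian entries (scaling is an assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def measure(phi, psi, x):
    s = matvec(psi, x)     # Equation 1: s = Psi x
    return matvec(phi, s)  # Equation 2: y = Phi s


N, M = 16, 8                 # M < N: fewer measurements than unknowns
x = [0.0] * N
x[3], x[10] = 1.5, -2.0      # K = 2 nonzero coefficients
phi = gaussian_matrix(M, N)
y = measure(phi, identity(N), x)
```

Because x has only two nonzero entries, every measurement in y is a combination of just two columns of Φ, which is the structure the recovery algorithms exploit.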

4. Iterative Shrinkage/Thresholding Algorithm and GPU Realization

4.1 Iterative Shrinkage/Thresholding Algorithm

In practice, the measurements are always disturbed by noise or other interference, so it is no longer
appropriate to enforce $y = \Phi\Psi x$ exactly as in the constraint of Equation 3. Generally, the
problem is solved by transforming the constrained convex optimization problem into the following
unconstrained convex optimization problem

$$\min_x \frac{1}{2}\|y - Ax\|_2^2 + \tau\|x\|_1 \tag{4}$$

where $\|\cdot\|_2$ denotes the Euclidean norm and $\tau$ is the regularization parameter, which
provides a tradeoff between fidelity to the measurements and noise sensitivity. Iterative
Shrinkage/Thresholding (IST) [11, 12] is a state-of-the-art algorithm for solving this unconstrained
convex optimization problem, with the following iterative scheme

$$x_{k+1} = \mathrm{soft}(x_k + A^H(y - Ax_k), \tau) \tag{5}$$

where $\mathrm{soft}(x, \tau) = \mathrm{sgn}(x)\max(|x| - \tau, 0)$ is the shrinkage operator.
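
A minimal sketch of the shrinkage operator follows. It assumes that for complex entries sgn(x) is read as x/|x|, which is a common convention but is not stated in the text; the function names are illustrative.

```python
def soft(x, tau):
    """Shrink the magnitude of x by tau; values with |x| <= tau become zero."""
    mag = abs(x)
    if mag <= tau:
        return 0.0
    # sgn(x) * (|x| - tau), written so that it also works for complex x
    return x * (mag - tau) / mag


def soft_vec(vec, tau):
    """Apply the shrinkage operator element by element."""
    return [soft(v, tau) for v in vec]
```

For example, with a threshold of 1 the values 3 and -3 shrink to 2 and -2, while 0.5 is set to zero, which is how the operator promotes sparsity.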


The IST algorithm has already been applied extensively to the unconstrained convex optimization
problems arising in sparse signal recovery, image restoration and other linear inverse problems. Each
iteration of the IST algorithm only requires matrix-vector multiplications and additions, which makes
it well suited to the efficient parallel computing capabilities of the GPU. This paper therefore
realizes the fast reconstruction of compressive sensing based SAR images using the IST algorithm on
the GPU. The detailed procedure of the IST algorithm is as follows
1. Initialize $x_0 = 0$ and the residual vector $r_0 = y$, and set iteration step $k = 1$;
2. Compute the correlation of $A$ with the current residual, and the next estimate according to the
current one
$$x_{k+1} = x_k + A^H r_k \tag{6}$$
3. Process the new estimate with the shrinkage operator, using the regularization parameter $\tau$
$$x_{k+1} = \mathrm{soft}(x_{k+1}, \tau) \tag{7}$$
4. Update the residual vector
$$r_{k+1} = y - Ax_{k+1} \tag{8}$$
5. Compute the objective function value $f_{k+1} = 0.5\|r_{k+1}\|_2^2 + \tau\|x_{k+1}\|_1$ and the
relative change $\Delta f = |f_{k+1} - f_k| / f_k$. If $\Delta f$ is smaller than the stopping
threshold, terminate the iteration and output the estimate $\hat{x}$. Otherwise, go to step 2 for the
next iteration.
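
The five steps above can be sketched in plain Python as follows. This is a CPU-side toy, not the authors' implementation: it assumes a real-valued matrix (so the conjugate transpose is the ordinary transpose), uses a 0-based loop instead of the step counter in the text, and uses a hypothetical 2 x 4 matrix with orthonormal rows so that the iteration is well behaved.

```python
import math


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def soft(v, tau):
    # sgn(v) * max(|v| - tau, 0) for real v
    return math.copysign(max(abs(v) - tau, 0.0), v)


def objective(r, x, tau):
    return 0.5 * sum(ri * ri for ri in r) + tau * sum(abs(xi) for xi in x)


def ist(a, y, tau, tol=1e-10, max_iter=1000):
    a_h = transpose(a)                 # real A assumed, so A^H = A^T
    x = [0.0] * len(a[0])              # step 1: x_0 = 0
    r = list(y)                        #         r_0 = y
    f_prev = objective(r, x, tau)
    for _ in range(max_iter):
        corr = matvec(a_h, r)                                # A^H r_k
        x = [soft(xi + ci, tau) for xi, ci in zip(x, corr)]  # steps 2 and 3
        r = [yi - v for yi, v in zip(y, matvec(a, x))]       # step 4
        f = objective(r, x, tau)                             # step 5
        if abs(f - f_prev) <= tol * f_prev:  # relative change, without dividing
            break
        f_prev = f
    return x


# Hypothetical test problem: y is generated by the sparse vector [0, 0, 2, 0].
A = [[0.6, 0.0, 0.8, 0.0],
     [0.0, 0.8, 0.0, -0.6]]
y = [1.6, 0.0]
x_hat = ist(A, y, tau=0.1)
```

In this toy problem the estimate settles on the third coefficient only, with an amplitude slightly below 2 because the penalty term shrinks the solution. Each pass through the loop performs the two matrix-vector products, one with A^H and one with A, that the following discussion sets out to reduce.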
In the IST algorithm, the most computationally expensive part is the matrix-vector multiplication
involving $A$ and $A^H$, each with a computational complexity of $O(MN)$. Moreover, each iteration
requires two such multiplications, so the computational load of the whole recovery process is very
large, especially when the matrix is large and dense. However, if we rewrite Equation 5 as Equation 9,
$A^H y$ enters the computation as a constant vector, and the two multiplications reduce to a single
multiplication involving only $A^H A$ in each iteration. Although the computational complexity of
$A^H A x_k$ is $O(N^2)$, which is larger than that of a matrix-vector multiplication involving $A$ or
$A^H$ on the CPU, it does reduce the overall computation when implemented in parallel.

$$x_{k+1} = \mathrm{soft}(x_k + A^H y - A^H A x_k, \tau) \tag{9}$$
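
The rewriting from Equation 5 to Equation 9 rests on the identity A^H(y - A x) = A^H y - A^H A x. The sketch below checks it numerically and counts the scalar multiplications per iteration for both forms. It assumes a real-valued matrix, and the matrix, vectors and function names are illustrative only.

```python
def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]


def correlation_two_products(a, y, x):
    """A^H (y - A x): one product with A, one with A^H (Equation 5)."""
    r = [yi - v for yi, v in zip(y, matvec(a, x))]
    return matvec(transpose(a), r)


def correlation_one_product(gram, rhs, x):
    """A^H y - (A^H A) x: a single product with the precomputed N x N matrix (Equation 9)."""
    return [b - g for b, g in zip(rhs, matvec(gram, x))]


def mults_per_iteration(m, n):
    """Scalar multiplications per iteration: (two-product form, one-product form)."""
    return 2 * m * n, n * n


A = [[0.6, 0.0, 0.8, 0.0],
     [0.0, 0.8, 0.0, -0.6]]
y = [1.6, 0.0]
x = [0.5, -1.0, 2.0, 0.25]
gram = matmul(transpose(A), A)   # A^H A, computed once before the iteration
rhs = matvec(transpose(A), y)    # A^H y, computed once before the iteration
two = correlation_two_products(A, y, x)
one = correlation_one_product(gram, rhs, x)
```

Both forms give the same vector up to rounding. For M = N/2 the two multiplication counts coincide (2MN = N^2), so in that case any gain would have to come from issuing one product per iteration instead of two rather than from fewer multiplications.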

Based on the above analysis, we propose a new scheme: $A^H A$ and $A^H y$ are precomputed before the
iteration, so that only one matrix-vector multiplication is required in each iteration step. This not
only reduces the cost of the matrix-vector multiplication, but also saves the time spent on data
transfer between the two multiplications. In addition, the residual vector has to be available when
the objective function value is calculated. However, calculating the residual vector involves the
matrix $A$, which conflicts with the proposed method, which requires only $A^H A$ and $A^H y$. The
calculation of the residual vector therefore has to be changed to fit the proposal. Fortunately, we
find that if the objective function value is computed with Equation 10 instead, the two methods behave
in the same way when judging whether the termination criterion is satisfied based on the relative
change between two consecutive objective function values, although $\tilde{f}$ is different from $f$.
In addition, SAR images are not sparse in all range gates, so a constraint is needed to avoid
recovering the non-sparse ones. If the objective function value in one step is no less than the one
obtained in the previous step, the scene is considered not sparse and the recovery is terminated.

$$\tilde{f} = 0.5\|A^H y - A^H A x\|_2^2 + \tau\|x\|_1 \tag{10}$$

The basic procedure of the modified version of the IST algorithm is

1. Initialize $x_0 = 0$, compute $A^H y$ and $A^H A$, and set iteration step $k = 1$;
2. Compute the correlation of $A$ with the current residual
$$x_{temp} = A^H y - A^H A x_k \tag{11}$$
3. Compute the new estimate, using the regularization parameter $\tau$
$$x_{k+1} = \mathrm{soft}(x_k + x_{temp}, \tau) \tag{12}$$
4. Compute the objective function value $\tilde{f}_{k+1} = 0.5\|x_{temp}\|_2^2 + \tau\|x_{k+1}\|_1$.
If $\tilde{f}_{k+1} > \tilde{f}_k$ and $k > 2$, terminate the iteration. Otherwise, compute the
relative change $\Delta\tilde{f} = |\tilde{f}_{k+1} - \tilde{f}_k| / \tilde{f}_k$. If
$\Delta\tilde{f}$ is smaller than the stopping threshold, terminate the iteration and output the
estimate $\hat{x}$. Otherwise, go to step 2 for the next iteration.
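
The modified procedure can be sketched in the same plain-Python style. Again this is a toy under stated assumptions rather than the authors' code: the matrix is real-valued and deliberately diagonal so that the behaviour is easy to follow, the loop index is 0-based (so the guard `k > 2` is one possible reading of step 4), the initial objective value is taken at x_0 = 0, and the status strings are made up for illustration.

```python
import math


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]


def soft(v, tau):
    return math.copysign(max(abs(v) - tau, 0.0), v)


def ist_gram(gram, rhs, tau, tol=1e-10, max_iter=1000):
    """IST using only the precomputed A^H A (gram) and A^H y (rhs)."""
    x = [0.0] * len(rhs)                              # step 1: x_0 = 0
    f_prev = 0.5 * sum(v * v for v in rhs)            # objective value at x_0 (assumed start)
    for k in range(max_iter):
        x_temp = [b - g for b, g in zip(rhs, matvec(gram, x))]  # Equation 11
        x = [soft(xi + ti, tau) for xi, ti in zip(x, x_temp)]   # Equation 12
        f = 0.5 * sum(t * t for t in x_temp) + tau * sum(abs(xi) for xi in x)
        if k > 2 and f > f_prev:
            return x, "not sparse"                    # objective increased: give up
        if abs(f - f_prev) <= tol * f_prev:
            return x, "converged"                     # small relative change
        f_prev = f
    return x, "max_iter"


# Hypothetical toy problem: y is generated by the sparse vector [2, 0, 0, 0].
A = [[0.8, 0.0, 0.0, 0.0],
     [0.0, 0.6, 0.0, 0.0]]
y = [1.6, 0.0]
gram = matmul(transpose(A), A)   # precomputed once
rhs = matvec(transpose(A), y)    # precomputed once
x_hat, status = ist_gram(gram, rhs, tau=0.1)
```

Inside the loop only one matrix-vector product remains, and neither A nor y is needed after the precomputation. In this toy run the objective value rises at the very first step and falls afterwards, which is why the increase test is skipped for the first few iterations.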

4.2 GPU implementation of IST

In the CUDA framework, communication between the host CPU and the GPU device is often time consuming,
so such communication should be kept to a minimum [13, 14]. In this paper, data transfer between the
host CPU and the GPU device only occurs at the start and at the end of the algorithm. At the start,
the precomputed $A^H y$ and $A^H A$, the regularization parameter $\tau$ and other necessary
parameters are transferred to the GPU, and the reconstructed results are transferred back to the CPU
at the end of the recovery. $A^H A$ is stored in global memory, while $A^H y$ is stored in constant
memory since it does not change during the recovery. To reduce the communication time further, only
the nonzero elements of the reconstructed signal and their indices are transferred back, instead of
all the elements [15]. Besides, when a large amount of data has to be reconstructed, a series of
streams can be created, each of which is responsible for the transfer and processing of a different
part of the data. Processing with two streams allows the memory copies of one stream to overlap with
the kernel execution of the other stream. The time spent on memory copies between the CPU and the GPU
can then be hidden efficiently, which improves performance.
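
The idea of sending back only the nonzero elements and their indices can be illustrated with a short sketch. The function names are hypothetical, and the packing is shown on the host side only; how it is done on the device is not described in the text.

```python
def pack_nonzeros(x):
    """Keep only the nonzero entries of x together with their positions."""
    indices = [i for i, v in enumerate(x) if v != 0]
    values = [x[i] for i in indices]
    return indices, values


def unpack_nonzeros(indices, values, n):
    """Rebuild the full length-n vector from the packed representation."""
    x = [0.0] * n
    for i, v in zip(indices, values):
        x[i] = v
    return x
```

For a sparse result such as `[0.0, 0.0, 1.5, 0.0, -2.0]`, only two indices and two values need to be transferred instead of five elements, and the saving grows with the sparsity of the reconstructed signal.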
The GPU device begins to execute the iterative recovery once it receives the data from the CPU. As
mentioned above, the dominant computation during the recovery is the matrix-vector multiplication.
Note that in the matrix-vector multiplication, the column vectors of the matrix are mutually
independent, which makes it well suited to parallel implementation. The matrix-vector multiplication
is realized with coarse-grained parallelism across thread blocks, which cannot communicate with each
other, together with fine-grained parallelism across threads. In detail, the multiplications of the
column vectors of $A^H A$ with $x_k$ are realized with coarse-grained parallelism, while the
element-by-element products inside each vector multiplication are realized with fine-grained
parallelism. The matrix $A^H A$ is stored in global memory, so global memory has to be accessed
whenever the matrix is needed. To limit the memory latency, we use shared memory, which can be as fast
as registers. For instance, the column vectors of $A^H A$ and $x_k$ are all stored in the shared
memory of each thread block.
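
The two levels of parallelism described above can be mimicked sequentially. In this sketch one emulated block handles one column vector of the matrix, and the emulated threads inside it share the element-by-element products. It assumes a real symmetric matrix, so that column i equals row i; the names and the block size are illustrative.

```python
def block_dot(column, x, block_dim):
    """One emulated thread block: inner product of one column vector with x."""
    partial = [0.0] * block_dim
    for t in range(block_dim):                  # fine-grained: one slot per thread
        for j in range(t, len(x), block_dim):   # thread t handles elements t, t + block_dim, ...
            partial[t] += column[j] * x[j]
    return sum(partial)                         # per-block summation of the thread results


def gram_matvec(gram_columns, x, block_dim=4):
    """Coarse-grained: one independent block per column vector of the matrix."""
    return [block_dot(col, x, block_dim) for col in gram_columns]
```

Each call to `block_dot` is independent of the others, which is why the blocks do not need to communicate; the final `sum` inside a block is the step that the parallel summation reduction discussed later is meant to speed up.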
During the IST recovery, some vectors have to be reduced to a single value, for example in the
calculation of the Euclidean norm and the $\ell_1$ norm. Take the calculation of the $\ell_1$ norm as
an example. Normally, all the elements are added up one by one. On the GPU, however, each block holds
at most 512 threads, which is fewer than the dimension of the vector, so the task is split into a
parallel accumulation over multiple thread blocks, where each block is responsible for adding up part
of the data and the partial sums are added up at the end. To make the most efficient use of the
computing power of the GPU, enough thread blocks should be used to keep the maximum number of active
thread blocks per multiprocessor. However, thread blocks can only communicate through global memory,
which is limited on the GPU and has a long access latency, so memory access becomes time consuming if
too many blocks are used. A tradeoff therefore has to be made in choosing the number of thread blocks.
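
The block-wise accumulation of the l1 norm described above can be sketched as follows. The loop over blocks stands in for blocks running in parallel, the default block size of 512 follows the limit mentioned in the text, and the function name is hypothetical.

```python
def l1_norm_blocks(x, block_dim=512):
    """Sum |x_i| in chunks of block_dim elements, then combine the partial sums."""
    grid_dim = (len(x) + block_dim - 1) // block_dim   # number of blocks needed
    partial = []
    for b in range(grid_dim):
        chunk = x[b * block_dim:(b + 1) * block_dim]   # data handled by block b
        partial.append(sum(abs(v) for v in chunk))     # partial sum of that block
    return sum(partial), grid_dim                      # final combination step
```

A vector of 1300 elements, for instance, needs three blocks of 512 threads, and only three partial sums have to be combined at the end.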
During the fine-grained realization of the vector multiplication and the reduction of a vector to a
single value, each thread block has to sum up many values. This paper adopts the parallel summation
reduction method to make the most efficient use of the parallel performance of the GPU. Figure 1 shows
the procedure of parallel summation reduction with 8 elements. The traditional serial summation method
requires $n$ steps to sum up $n$ elements, while the parallel summation reduction method only requires
$\log_2 n$ steps. Meanwhile, the parallel summation reduction works with sequential addressing, which
is free of bank conflicts and thus avoids a reduction of the effective access bandwidth. In addition,
the threads in each warp either all execute the summation or all skip it, which avoids the performance
degradation caused by divergence.
Figure 1. Parallel summation reduction with 8 elements
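
A sequential emulation of the parallel summation reduction with sequential addressing is sketched below. It assumes the number of elements is a power of two, as in the 8-element example of Figure 1; the function name is illustrative, and the inner loop stands in for threads that would run simultaneously on the GPU.

```python
def reduce_sequential(values):
    """Tree reduction: halve the active range until one sum remains."""
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    steps = 0
    stride = n // 2
    while stride > 0:
        # "Threads" 0 .. stride-1 are active; each adds the element one stride away.
        for tid in range(stride):
            data[tid] += data[tid + stride]
        stride //= 2
        steps += 1
    return data[0], steps
```

For the eight elements 0 to 7 the strides are 4, 2 and 1, so the sum 28 is obtained in three parallel steps, in line with the log2(n) count given in the text. The active threads always form a contiguous range starting at index 0, which is the sequential addressing pattern.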

5. Experiment
To validate the speedup of the parallel realization of compressive sensing based SAR imaging on the
GPU, we reconstructed the same SAR image using the IST algorithm on the CPU and on the GPU separately.
The CPU used in this paper is an Intel Core2 Quad 8400 at 2.66 GHz, and the GPU is a Tesla C1060. The
data used in the experiment are real airborne SAR data collected by an X-band SAR with a resolution of
2 m. For details of the compressive sensing based SAR imaging technique, please refer to [16]. We
implemented the CPU and GPU code in single precision floating point and computed the average
processing time over 100 repeated executions on the CPU and on the GPU separately. Figure 2(a) shows
the conventional SAR imaging result with full samples, while Figure 2(b) shows the compressive sensing
based SAR image reconstructed from 50% of the samples on the GPU. The times taken by the CPU and the
GPU are shown in Table 1. From Table 1, we can see that the GPU implementation is about 35 times
faster than the CPU implementation.

Figure 2. (a) Conventional SAR imaging result with full samples. (b) Compressive sensing based SAR
imaging result with 50% of the samples, implemented on GPU

Table 1. The average execution times on CPU and GPU

            CPU      GPU      Speedup
Time (s)    8.995    0.258    35

6. Conclusion
This paper realizes the parallel implementation of compressive sensing based SAR imaging on a GPU,
with the Iterative Shrinkage/Thresholding algorithm adopted to reconstruct the SAR image. To make the
most efficient use of the parallelism of the GPU, we modified the structure of the existing IST
algorithm and realized a fast implementation on the CUDA platform. The experimental results show that
the parallel implementation on the GPU achieves a significant speedup over the CPU implementation.

7. Acknowledgement

This work was supported by the “973” Program of China under Grant 2010CB731903 and the
National Natural Science Foundation of China (Grant No. 60901056).

8. References
[1] W. G. Carrara, R. S. Goodman, R. M. Majewski, “Spotlight Synthetic Aperture Radar: Signal
Processing Algorithms”, Norwood, MA: Artech House, 1995.
[2] I. G. Cumming and F. Wong, “Digital Processing of Synthetic Aperture Radar”, Norwood, MA:
Artech House, 2005.
[3] D. L. Donoho, “Compressed Sensing”, IEEE Trans. on Info. Theory, vol.52, no.4, pp.1289–1306,
2006.
[4] E. Candès, J. Romberg and T. Tao, “Robust uncertainty principles: Exact signal reconstruction
from highly incomplete frequency information”, IEEE Trans. on Info. Theory, vol.52, no.2,
pp.489–509, 2006.
[5] M. Elad, “Optimized Projections for Compressed Sensing”, IEEE Trans. on Signal Process., vol.55,
no.12, pp.5695–5702, 2007.
[6] R. Baraniuk, P. Steeghs, “Compressive radar imaging”, IEEE Radar Conference, pp.128-133,
2007.

[7] M. Herman, T. Strohmer, “Compressed sensing radar”, IEEE Radar Conference, pp.1-6, 2008.
[8] V. M. Patel, G. R. Easley, D. M. Healy and R. Chellappa, “Compressed Synthetic Aperture Radar”,
IEEE Journal of Selected Topics in Signal Processing, vol.4, no.2, pp.244–254, 2010.
[9] J. H. G. Ender, “On compressive sensing applied to radar”, Signal Processing, vol.90, no.5,
pp.1402-1414, 2010.
[10] NVIDIA, “CUDA Programming Guide”, Version 2.3.1, August 2009.
[11] M. A. T. Figueiredo and R. D. Nowak, “An EM algorithm for wavelet-based image restoration”,
IEEE Transactions on Image Processing, vol.12, no.8, pp.906-916, 2003.
[12] Ingrid Daubechies, Michel Defrise, Christine De Mol, “An iterative thresholding algorithm for
linear inverse problems with a sparsity constraint”, Communications on Pure and Applied
Mathematics, vol.57, pp.1413–1457, 2004.
[13] Zhiyong Yuan, Yuanyuan Zhang, Jianhui Zhao, Yihua Ding, Chengjiang Long, Lu Xiong, Dengyi
Zhang, Guozhong Liang, “Real-time Simulation for 3D Tissue Deformation with CUDA Based
GPU Computing”, JCIT: Journal of Convergence Information Technology, vol.5, no.4, pp.109-119,
2010.
[14] Xiangyun Liao, Zhiyong Yuan, Weixin Si, Zhaoliang Duan, Ruixue Mao, Jianhui Zhao, “Research
and Application of Parallel Computing Technologies based on CUDA and OpenCL”, Journal of
Convergence Information Technology, vol.6, no.6, 2011.
[15] Sangkyun Lee, Stephen J. Wright, “Implementing algorithms for signal and image reconstruction
on graphical processing units”, Computer Sciences Department, University of Wisconsin-Madison,
Tech. Rep., November, 2008.
[16] Jihua Tian, Jinping Sun, Xiao Han, Bingchen Zhang, “Motion Compensation for Compressive
Sensing SAR Imaging with Autofocus”, The 6th IEEE Conference on Industrial Electronics &
Applications, pp.1564-1567, 2011.
