Parallel Implementation of Compressive Sensing Based SAR Imaging with GPU
Tian Jihua, Sun Jinping, Zhang Yuxi, Najeeb Ahmad, Zhang Bingchen
Abstract
This paper proposes a new scheme for the parallel implementation of compressive sensing based SAR
imaging on GPU with the Iterative Shrinkage/Thresholding (IST) algorithm. To speed up recovery, we
modify the structure of the existing IST algorithm and implement it on GPU. The
experimental results show that the GPU implementation achieves a significant speedup over the
CPU implementation.
Keywords: Compressive Sensing, Synthetic Aperture Radar, Graphics Processing Unit, CUDA
1. Introduction
As a major remote sensing sensor, synthetic aperture radar (SAR) can produce high resolution
images from a moving platform, such as an airplane or a satellite. A SAR system produces 2D (range
and azimuth) terrain reflectivity images by emitting a sequence of closely spaced radio frequency
pulses and by sampling the echoes scattered from the ground targets [1]. The main advantage of SAR is
that images of the illuminated area can be obtained independent of time-of-day or weather conditions
(e.g., fog, cloud level, rain, and snow). Modern airborne and spaceborne SAR systems can produce
very high resolution images and are being widely used in many civilian and military applications [1, 2].
Compressive sensing (CS), proposed by Donoho [3], Emmanuel Candès [4], Michael Elad [5] and others,
is an emerging theory that enables recovery of signals and data from what appear to
be highly sub-Nyquist-rate samples. CS asserts that an unknown sparse signal (or a signal that is sparse under a certain
basis) can be exactly recovered with high probability from a very limited number of measurements
by solving a convex l1 optimization problem. Built on rigorous mathematics, CS has attracted much
attention in image processing, multi-sensor data fusion, radar applications and so on. Up to now,
several studies have addressed the use of CS in radar applications, including SAR and Inverse
Synthetic Aperture Radar (ISAR) [6-9].
However, the reconstruction of a sparse signal requires numerous matrix-vector multiplications,
which impose a heavy computational burden, especially when the sensing matrix is
large and dense. Meanwhile, the computational load of compressive sensing based SAR imaging
keeps growing with the increasing demand for high resolution SAR images. As a
result, it takes quite a long time to reconstruct a SAR image on CPU, which rules out
real-time processing. Recently, the fast processing performance of the graphics processing unit (GPU) has offered an
alternative for fast reconstruction of sparse signals. This paper realizes the fast reconstruction of
compressive sensing based SAR images by taking advantage of the efficient parallel computing
capabilities of GPU.
language that allows users to manage the GPU as a computational device without the help of a graphics
API.
In the CUDA architecture, a task is split into a grid of thread blocks, each of which consists of
a number of threads. The thread blocks are distributed across the streaming multiprocessors of the GPU. CUDA
adopts the single instruction multiple thread (SIMT) model, which means that all the threads in one block
share the same instruction code but operate on different data, and may be in different execution states. In CUDA,
every thread has its own dedicated registers, and communication between the threads of a block is realized through
shared memory and synchronization. This design helps minimize the cost of context
switching on the GPU.
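The grid, block and thread hierarchy described above can be illustrated with a small sequential emulation. This is only a sketch in plain Python, not CUDA code: the names `launch` and `scale_kernel` are hypothetical, and the loops run one after another where a GPU would run the threads concurrently. It shows how each thread derives its own global index from its block index and thread index while executing the same instructions as every other thread.

```python
def launch(grid_dim, block_dim, kernel, *args):
    """Emulate a 1-D kernel launch: every (block, thread) pair runs the same code."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, out, factor):
    """Same instructions for every thread; only the data index differs."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(data):                       # guard: the grid may overshoot the data
        out[i] = factor * data[i]


data = [1.0, 2.0, 3.0, 4.0, 5.0]
out = [0.0] * len(data)
launch(2, 4, scale_kernel, data, out, 2.0)  # 2 blocks of 4 threads cover 5 elements
```

The bounds check mirrors the usual practice of launching slightly more threads than there are data elements.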
The GPU is designed to carry out parallel computation with a large number of threads, which makes it
suitable for large-scale parallel computing tasks that are computationally intensive and logically simple.
The theory of CS reveals that exact recovery of an unknown sparse signal is possible from very
limited samples by solving an inverse problem through either a linear program or a greedy pursuit.
Suppose that a signal $s \in \mathbb{R}^N$ is $K$-sparse on an overcomplete dictionary $\Psi$, i.e.

$$s = \Psi x \qquad (1)$$

where the coefficient vector $x$ has at most $K$ nonzero entries, and that the measurements $y$ are given by

$$y = \Phi s = \Phi \Psi x \qquad (2)$$

where $\Phi$ is an $M \times N$ matrix, hereinafter called the measurement matrix. Since $M < N$, recovery of the
signal $s$ from the measurements $y$ is ill-posed in general. The CS theory reveals that when the
matrix $A = \Phi\Psi$ has the Restricted Isometry Property (RIP) [3-5], $x$ (and hence the signal $s$) can be recovered
from a similarly sized set of $M = O(K \log(N/K))$ measurements $y$ with high probability. The RIP is
closely related to an incoherence property between $\Phi$ and $\Psi$, which means that the rows of $\Phi$ cannot
provide a sparse representation of the columns of $\Psi$ and vice versa. The RIP and incoherence hold for
many pairs of bases, such as delta spikes and Fourier sinusoids, or sinusoids and wavelets. It can be
proved that a (pseudo) random noise-like matrix performs excellently as $\Phi$, such as a matrix
constructed from Bernoulli or Gaussian random variables. Another choice for the measurement matrix
$\Phi$ that offers good performance in many cases is a causal, quasi-Toeplitz matrix [3-6]. When the RIP
holds, $x$ (and hence the signal $s$) can be recovered from the solution of a convex optimization problem.
Formally, with high probability, $x$ is the unique solution to

$$\min_x \|x\|_1 \quad \text{s.t.} \quad y = \Phi\Psi x \qquad (3)$$

which can be solved efficiently with linear programming techniques. Current reconstruction methods
include iterative greedy algorithms such as Matching Pursuit (MP) and Orthogonal Matching Pursuit
(OMP), and convex relaxation algorithms such as Basis Pursuit (BP) and Iterative Shrinkage/Thresholding
(IST).
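As a concrete illustration of the measurement model in Equations 1 and 2, the following sketch builds a K-sparse coefficient vector and takes M < N random measurements of it. It is a toy setup under stated assumptions, not the SAR configuration of this paper: the dictionary is taken to be the identity (so s = x), the measurement matrix is Gaussian, and the sizes N, K and M are arbitrary illustrative values.

```python
import math
import random


def matvec(mat, vec):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


random.seed(0)
N, K, M = 64, 4, 24  # signal length, sparsity, number of measurements (M < N)

# K-sparse coefficient vector x: K randomly placed nonzero entries.
x = [0.0] * N
for idx in random.sample(range(N), K):
    x[idx] = random.gauss(0.0, 1.0)

# Assumption: Psi is the identity, so the signal s equals x (Equation 1).
s = list(x)

# Gaussian measurement matrix Phi of size M x N, then y = Phi s (Equation 2).
Phi = [[random.gauss(0.0, 1.0) / math.sqrt(M) for _ in range(N)] for _ in range(M)]
y = matvec(Phi, s)

# The bound M = O(K log(N/K)) hides a constant; this is only the K log(N/K) term.
scale = K * math.log(N / K)
```

Here `y` has only M = 24 entries although the signal has N = 64, which is why recovery needs the sparsity prior rather than a direct inversion.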
In practice, the measurements are always disturbed by noise or other interference, so it is no longer
suitable to enforce $y = \Phi\Psi x$ exactly as in the constraint of Equation 3. Generally, we solve the problem
by transforming the constrained convex optimization problem into the following unconstrained convex
optimization problem

$$\min_x \; \frac{1}{2}\|y - Ax\|_2^2 + t\|x\|_1 \qquad (4)$$

where $\|\cdot\|_2$ denotes the Euclidean norm and $t$ is the regularization parameter, which provides a tradeoff
between fidelity to the measurements and noise sensitivity. Iterative Shrinkage/Thresholding (IST)
[11, 12] is a state-of-the-art algorithm for solving this unconstrained convex optimization problem, with
the following iterative scheme
So we propose a new scheme based on the above analysis: precompute $A^H A$ and $A^H y$ before
the iteration, so that only one matrix-vector multiplication is required in each iteration step. This
not only reduces the cost of matrix-vector multiplication, but also removes the time spent on data
transfer between two multiplications. In addition, we notice that the residual vector must be
available when calculating the objective function value. However, the calculation of the residual vector
involves the matrix $A$, which conflicts with the proposed method requiring only $A^H A$ and $A^H y$. So we need
some changes to the calculation of the residual vector to fit the proposal. Fortunately, we find that
if we replace the formula for the objective function value with Equation 10, the two
methods have the same effect when judging whether the termination criterion is satisfied based on the
relative change between two consecutive objective function values, although $\tilde{f}$ differs from $f$. In addition,
SAR images are not sparse over all range gates, so we need to add a constraint to
avoid attempting to recover the non-sparse ones. If the objective function value in one step is no less than the
one obtained in the previous step, we regard the scene as not sparse and exit the recovery.

$$\tilde{f} = 0.5\,\|A^H y - A^H A x\|_2^2 + t\|x\|_1 \qquad (10)$$
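The precomputation scheme can be sketched as follows. This is a minimal illustration and not the authors' GPU code. It assumes the standard IST update from [12], x_{k+1} = soft(x_k + A^H(y - A x_k), t), with unit step size and A scaled so that its spectral norm is at most 1. The function names, the zero starting point, the tolerance and the tiny identity-matrix example are all assumptions made for illustration. Each iteration performs a single matrix-vector product with A^H A, reuses it for both the update and the surrogate objective of Equation 10, and stops either on a small relative change or when the objective fails to decrease.

```python
def soft(v, t):
    """Soft-thresholding (shrinkage); works for real or complex entries."""
    mag = abs(v)
    return v * (mag - t) / mag if mag > t else 0.0


def gram_and_proxy(A, y):
    """Precompute G = A^H A and b = A^H y once, before the iterations."""
    M, N = len(A), len(A[0])
    AH = [[A[m][n].conjugate() for m in range(M)] for n in range(N)]
    G = [[sum(AH[i][m] * A[m][j] for m in range(M)) for j in range(N)] for i in range(N)]
    b = [sum(AH[i][m] * y[m] for m in range(M)) for i in range(N)]
    return G, b


def ist_gram(G, b, t, tol=1e-6, max_iter=200):
    """IST using only G and b; returns (x, status, iterations)."""
    N = len(b)
    x = [0.0] * N
    f_prev = None
    for k in range(max_iter):
        # The only matrix-vector product of this iteration.
        Gx = [sum(G[i][j] * x[j] for j in range(N)) for i in range(N)]
        r = [b[i] - Gx[i] for i in range(N)]  # A^H y - A^H A x_k
        # Surrogate objective of Equation 10, evaluated at x_k.
        f_cur = 0.5 * sum(abs(v) ** 2 for v in r) + t * sum(abs(v) for v in x)
        if f_prev is not None:
            if abs(f_prev - f_cur) <= tol * f_prev:
                return x, "converged", k
            if f_cur >= f_prev:  # no decrease: treat the scene as not sparse
                return x, "not_sparse", k
        f_prev = f_cur
        x = [soft(x[i] + r[i], t) for i in range(N)]
    return x, "max_iter", max_iter


# Tiny check with A = I, where the minimizer is simply soft(y, t).
A = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [3.0, 0.0, -2.0, 0.1]
G, b = gram_and_proxy(A, y)
x_hat, status, iters = ist_gram(G, b, t=0.5)
```

With the identity matrix the result can be checked by hand, since the entries of `y` are simply shrunk toward zero by `t` and the small entry 0.1 is set to zero.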
In the CUDA framework, communication between the host CPU and the GPU device is often very
time-consuming, so it should be used as little as possible [13, 14]. In this paper, data
communication between the host CPU and the GPU device occurs only at the start and the end of the
algorithm. At the start, the precomputed $A^H y$ and $A^H A$, the regularization parameter $t$ and
other necessary parameters are transferred to the GPU, while the reconstructed results are transferred back
to the CPU at the end of the recovery. $A^H A$ is stored in global memory, and $A^H y$ is stored in
constant memory, as it does not change during the recovery. To reduce the communication
time further, we transfer back only the nonzero elements and their indices instead of all the
elements of the reconstructed signal [15]. Besides, when a large amount of data needs to be reconstructed, a
series of streams can be built, each of which is responsible for the transfer and execution of different
data. Processing with two streams allows the memory copies of one stream to overlap with the
kernel execution of the other stream. The time spent on memory copies between CPU and GPU can then
be efficiently hidden, which improves performance.
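The idea of returning only the nonzero elements and their indices can be sketched as below. This is a plain Python illustration of the packing and unpacking steps only; the function names are hypothetical and no actual device-to-host transfer is performed.

```python
def pack_nonzeros(x):
    """Keep only the nonzero entries and their positions (what would be sent back)."""
    indices = [i for i, v in enumerate(x) if v != 0]
    values = [x[i] for i in indices]
    return indices, values


def unpack_nonzeros(indices, values, n):
    """Rebuild the full-length vector on the receiving side."""
    x = [0.0] * n
    for i, v in zip(indices, values):
        x[i] = v
    return x


recovered = [0.0, 2.5, 0.0, 0.0, -1.5, 0.0]
indices, values = pack_nonzeros(recovered)
restored = unpack_nonzeros(indices, values, len(recovered))
```

For a sparse result the packed form is much smaller than the full vector, at the cost of one extra index per retained value.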
The GPU device begins to execute the iterative recovery once it receives the data from the CPU. As
mentioned above, the dominant computation during the recovery is the matrix-vector multiplication.
Note that in the matrix-vector multiplication, the column vectors of the matrix are mutually independent,
which makes the operation well suited to parallel implementation. The matrix-vector multiplication is realized with
coarse-grained parallelism across blocks, which cannot communicate with each other, together with
fine-grained parallelism across threads. In detail, the multiplications of the column vectors of $A^H A$ with $x_k$ are
realized in coarse-grained parallelism, while the element-wise products inside each vector
multiplication are realized in fine-grained parallelism. We store the matrix $A^H A$ in global memory, so
global memory has to be accessed whenever the matrix is needed. To limit the memory latency, we
use shared memory, which is nearly as fast as registers. For instance, the column vectors of $A^H A$ and $x_k$ are all
stored in the shared memory of each thread block.
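The two levels of parallelism can be illustrated with a sequential emulation. This is a sketch under assumptions, not the kernel used in the paper: each emulated block handles one output element (one row of the matrix, which for the Hermitian matrix A^H A is the conjugate of the corresponding column), the threads of a block form the element-wise products in a strided loop, and a list named `shared` stands in for per-block shared memory. The function names and the block size are hypothetical.

```python
def block_dot(row, x, block_dim):
    """One emulated thread block: threads form products, then the block sums them."""
    shared = [0.0] * block_dim  # stands in for per-block shared memory
    for tid in range(block_dim):  # fine-grained level: one pass per thread
        acc = 0.0
        for j in range(tid, len(row), block_dim):  # strided loop over the row
            acc += row[j] * x[j]
        shared[tid] = acc
    return sum(shared)  # block-level summation of the per-thread results


def matvec_blocks(G, x, block_dim=4):
    """Coarse-grained level: one independent block per output element."""
    return [block_dot(G[i], x, block_dim) for i in range(len(G))]


G = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [1.0, 0.0, 2.0]]
x = [1.0, 2.0, 3.0]
result = matvec_blocks(G, x)
```

Because no block reads another block's output, the outer loop could run in any order, which is what makes the block level safe to parallelize without inter-block communication.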
During the IST recovery, we have to reduce multidimensional data to a single value, as in
the calculation of the Euclidean norm and the $\ell_1$ norm. Take the calculation of the $\ell_1$ norm for instance. Normally
we add all the elements step by step. But on the GPU, there are at most 512 threads in each block, which is
smaller than the dimension of the vector, so we split the task into a parallel accumulation involving
multiple thread blocks, where each block is responsible for adding part of the data and the partial sums
are combined at the end. To make the most efficient use of the compute power of the GPU,
we should use enough thread blocks to keep the maximum number of active thread blocks per multiprocessor.
However, communication between thread blocks works only through global memory, which is limited on the
GPU and has a long access latency, so memory access becomes very time-consuming if too many
blocks are used. We therefore have to make a tradeoff when selecting the number of thread blocks.
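The multi-block accumulation can be sketched with the sequential emulation below. It is illustrative only: the function name and the number of blocks are assumptions, each emulated thread walks the vector with a stride equal to the total number of threads, and the list `partial` stands in for the per-block results that would be written to global memory before the final combination.

```python
def l1_norm_blocks(x, block_dim=512, num_blocks=4):
    """Emulated multi-block l1 norm: per-block partial sums, combined at the end."""
    partial = [0.0] * num_blocks     # one slot per block, standing in for global memory
    stride = block_dim * num_blocks  # total number of emulated threads
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            while i < len(x):        # each thread covers several elements
                partial[block_idx] += abs(x[i])
                i += stride
    return sum(partial), partial     # final combination of the partial sums


x = [(-1) ** i * i for i in range(5000)]
total, partial = l1_norm_blocks(x)
```

Changing `num_blocks` changes how many partial results have to pass through global memory, which is the tradeoff discussed above.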
Both the fine-grained vector multiplication and the reduction of
multidimensional data to a single value require each thread block to sum many values.
This paper adopts the parallel summation reduction method to make the most efficient use of the parallel
performance of the GPU. Figure 1 shows the procedure of parallel summation reduction with 8 elements.
The traditional serial summation method requires n steps to sum up n elements, while the parallel
summation reduction method requires only $\log_2 n$ steps. Meanwhile, the parallel summation reduction
works with sequential addressing, which is free of bank conflicts and so avoids a reduction in effective access
bandwidth. In addition, the threads in each warp either all execute the summation or all skip it, which
avoids the performance degradation caused by divergence.
Figure 1. Parallel summation reduction with 8 elements
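A sequential emulation of summation reduction with sequential addressing, for the 8-element case, might look as follows. The function name is hypothetical and the inner loop runs one emulated thread after another, whereas on a GPU the active threads of a step would run concurrently with a synchronization between steps.

```python
def reduce_sequential_addressing(values):
    """Tree reduction; the number of values must be a power of two."""
    data = list(values)
    n = len(data)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    steps = 0
    stride = n // 2
    while stride > 0:
        # Active threads are the contiguous range tid < stride, and each
        # thread adds the element that lies `stride` positions away.
        for tid in range(stride):
            data[tid] += data[tid + stride]
        stride //= 2
        steps += 1
    return data[0], steps


total, steps = reduce_sequential_addressing([0, 1, 2, 3, 4, 5, 6, 7])
```

In each step the writes go to indices below `stride` and the reads come from indices at or above it, so the additions of one step do not interfere with each other.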
5. Experiment
To validate the speedup of the parallel implementation of compressive sensing based SAR imaging on
GPU, we reconstructed the same SAR image with the IST algorithm on CPU and on GPU separately. The
CPU used in this paper is an Intel Core2 Quad 8400 at 2.66 GHz, and the GPU is a Tesla
C1060. The data used in the experiment are real airborne SAR data collected by
an X-band SAR with a resolution of 2 m. For details of the compressive sensing based SAR imaging
technique, please refer to [16]. We implemented the CPU and GPU code in single-precision
floating point and computed the average processing time over 100 repeated executions on CPU and GPU
separately. Figure 2(a) shows the conventional SAR imaging result with full samples, while Figure 2(b)
shows the compressive sensing based SAR image reconstructed from 50% of the samples on
GPU. The processing times on CPU and GPU are shown in Table 1. From Table 1, we can see that the GPU
implementation is 35 times faster than the CPU implementation.
Figure 2. (a) Conventional SAR imaging result with full samples. (b) Compressive sensing based SAR
imaging result with 50% samples implemented on GPU.
6. Conclusion
This paper realizes the parallel implementation of compressive sensing based SAR imaging on GPU,
with the Iterative Shrinkage/Thresholding algorithm adopted to reconstruct the SAR image. To make the
most efficient use of the parallelism of the GPU, we modified the structure of the existing IST algorithm
and implemented it on the CUDA platform. The experimental results show that the GPU
implementation achieves a significant speedup over the CPU
implementation.
7. Acknowledgement
This work was supported by the “973” Program of China under Grant 2010CB731903 and by the
National Natural Science Foundation of China (Grant No. 60901056).
8. References
[1] W. G. Carrara, R. S. Goodman, R. M. Majewski, “Spotlight Synthetic Aperture Radar: Signal
Processing Algorithms”, Norwood, MA: Artech House, 1995.
[2] I. G. Cumming and F. Wong, “Digital Processing of Synthetic Aperture Radar”, Norwood, MA:
Artech House, 2005.
[3] D. L. Donoho, “Compressed Sensing”, IEEE Trans. on Info. Theory, vol.52, no.4, pp.1289–1306,
2006.
[4] E. Candès, J. Romberg and T. Tao, “Robust uncertainty principles: Exact signal reconstruction
from highly incomplete frequency information”, IEEE Trans. on Info. Theory, vol.52, no.2,
pp.489–509, 2006.
[5] M. Elad, “Optimized Projections for Compressed Sensing”, IEEE Trans. on Signal Process., vol.55,
no.12, pp.5695–5702, 2007.
[6] R. Baraniuk, P. Steeghs, “Compressive radar imaging”, IEEE Radar Conference, pp.128-133,
2007.
[7] M. Herman, T. Strohmer, “Compressed sensing radar”, IEEE Radar Conference, pp.1-6, 2008.
[8] V. M. Patel, G. R. Easley, D. M. Healy and R. Chellappa, “Compressed Synthetic Aperture Radar”,
IEEE Journal of Selected Topics in Signal Processing, vol.4, no.2, pp.244–254, 2010.
[9] J. H. G. Ender, “On compressive sensing applied to radar”, Signal Processing, vol.90, no.5,
pp.1402-1414, 2010.
[10] NVIDIA, “CUDA Programming Guide”, Version 2.3.1, August 2009.
[11] M. A. T. Figueiredo and R. D. Nowak, “An EM algorithm for wavelet-based image restoration”,
IEEE Transactions on Image Processing, vol.12, no.8, pp.906-916, 2003.
[12] Ingrid Daubechies, Michel Defrise, Christine De Mol, “An iterative thresholding algorithm for
linear inverse problems with a sparsity constraint”, Communications on Pure and Applied
Mathematics, vol.57, pp.1413–1457, 2004.
[13] Zhiyong Yuan, Yuanyuan Zhang, Jianhui Zhao, Yihua Ding, Chengjiang Long, Lu Xiong, Dengyi
Zhang, Guozhong Liang, “Real-time Simulation for 3D Tissue Deformation with CUDA Based
GPU Computing”, JCIT: Journal of Convergence Information Technology, vol.5, no.4, pp.109-119,
2010.
[14] Xiangyun Liao, Zhiyong Yuan, Weixin Si, Zhaoliang Duan, Ruixue Mao, Jianhui Zhao, “Research
and Application of Parallel Computing Technologies based on CUDA and OpenCL”, Journal of
Convergence Information Technology, vol.6, no.6, 2011.
[15] Sangkyun Lee, Stephen J. Wright, “Implementing algorithms for signal and image reconstruction
on graphical processing units”, Computer Sciences Department, University of Wisconsin-Madison,
Tech. Rep., November, 2008.
[16] Jihua Tian, Jinping Sun, Xiao Han, Bingchen Zhang, “Motion Compensation for Compressive
Sensing SAR Imaging with Autofocus”, The 6th IEEE Conference on Industrial Electronics &
Applications, pp.1564-1567, 2011.