Accelerating GNSS Software Receivers
BIOGRAPHY

Dr. Carles Fernández–Prades holds the position of Senior Researcher and served as Head of the Communications Systems Division (2013-2016) and the Communication Subsystems Area (2006-2013) at the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC). He received a PhD degree in Electrical Engineering from Universitat Politècnica de Catalunya (UPC) in 2006. His primary areas of interest include statistical and multi-sensor signal processing, estimation and detection theory, and Bayesian filtering, with applications related to communication systems, GNSS and software-defined radio technology.

Dr. Javier Arribas holds the position of Senior Researcher at the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC). He received the BSc and MSc degrees in Telecommunication Engineering in 2002 and 2004, respectively, at La Salle University in Barcelona, Spain. His primary areas of interest include statistical signal processing, GNSS synchronization, detection and estimation theory, software-defined receivers, FPGA prototyping and the design of RF front-ends.

ABSTRACT

This paper addresses both the efficiency and the portability of a computer program in charge of the baseband signal processing of a GNSS receiver. Efficiency, in this context, refers to optimizing the speed and memory requirements of the software receiver. Specifically, the interest is focused on how fast the software receiver can process the incoming stream of raw signal samples and, in particular, on whether signal processing up to the position fix can be executed in real-time (and how many channels the host computer executing the receiver application can sustain in parallel). This is achieved by applying the concept of parallelization at different abstraction levels. The paper describes strategies based on task, data and instruction-level parallelism, as well as actual implementations released under an open source license and the results obtained with different commercially available computing platforms. At the same time, the proposed solution also addresses portability, understood as the usability of the same software in different computing environments.

1. INTRODUCTION
• Non-termination: understood as an infinitely running flow graph process without deadlock situations, and

• Strictly bounded: the number of data elements buffered on the communication channels remains bounded for all possible execution orders.

An analysis of the scheduling of such process networks was provided in [9]. By adopting GNU Radio's signal processing framework, GNSS-SDR bases its software architecture on a well-established design and an extensively proven implementation. Section 6.1 provides details on how this concept is applied in the context of a GNSS software-defined receiver.

4.1.1. GPU Offloading

GPU-accelerated computing consists in the use of a graphics processing unit (GPU) together with a CPU to accelerate the execution of a software application, by offloading computation-intensive portions of the application to the GPU while the remainder of the code still runs on the CPU. The key idea is to utilize the computation power of both CPU cores and GPU execution units in tandem for better utilization of the available computing power. Examples of GPU offloading in the context of GNSS receivers have been extensively reported in the literature (see, for instance, [10, 11, 12]).

4.1.2. FPGA Offloading

The commercial availability of system-on-chip (SoC) devices which integrate the software programmability of an ARM-based processor with the hardware programmability of an FPGA (e.g., Xilinx's Zynq-7000 family [13]) allows for systems in which the most computationally demanding operations of the GNSS receiver are executed in the programmable logic, whereas the rest of the software receiver is executed in the processing system. The implementation of FPGA-based accelerators and their communication with processes executed in the ARM processor are out of the scope of this paper.

4.2. Key operations and data types

In order to describe the most computationally demanding operations in the receiver chain, let us assume a generic GNSS signal, being $P_T$ the transmitting power, $d(u) \in \{-1, 1\}$ the navigation message data symbols, $T_b$ the bit period, $N_c$ the number of repetitions of a full codeword that spans a bit period, $T_{PRN} = \frac{T_b}{N_c}$ the codeword period, $c_i(l) \in \{-1, 1\}$ a chip of a spreading codeword $i$ of length $L_c$ chips, $g_T(t)$ the transmitting chip pulse shape, which is considered energy-normalized for notation clarity, and $T_c = \frac{T_b}{N_c L_c}$ the chip period.

The analytic representation of a signal received from a generic GNSS satellite $i$ can be generically expressed as

$$r_i(t) = \alpha_i(t)\, s_{i,T}\big(t - \tau_i(t)\big)\, e^{-j 2\pi f_{d_i}(t) t}\, e^{j 2\pi f_c t} + w(t)\,, \qquad (4)$$

where $\alpha_i(t)$ is the amplitude, $s_{i,T}(t)$ is the complex baseband transmitted signal, $\tau_i(t)$ is the time-varying delay, $f_{d_i}(t) = f_c\, \dot{\tau}_i(t)$ is the Doppler shift, $f_c$ is the carrier frequency, and $w(t)$ is a noise term.

Assuming $w(t)$ to be additive white Gaussian noise, at least in the band of interest, it is well known that the optimum receiver is the code matched filter (often referred to as correlator), expressed as

$$h_{MF_i}(t_k; \hat{\tau}_i, \hat{f}_{d_i}, \hat{\phi}_i) = \sum_{l=0}^{L_c-1} c_i^*(l)\, g_R^*(-t_k - l T_c + \hat{\tau}_i + L_c T_c) \cdot e^{-j \hat{\phi}_i} e^{-j 2\pi \hat{f}_{d_i} t_k} = q_R^*(-t_k + \hat{\tau}_i + L_c T_c)\, e^{-j \hat{\phi}_i} e^{-j 2\pi \hat{f}_{d_i} t_k}\,, \qquad (5)$$

where $c_i(l) \in \{-1, +1\}$ is the $l$-th chip of a spreading codeword (known as pseudorandom sequence) of length $L_c$, $g_R(t)$ is the receiving chip pulse shape, $T_c$ is the chip period, and $\hat{\tau}_i$, $\hat{f}_{d_i}$, $\hat{\phi}_i$ are local estimates of the time-delay, Doppler shift and carrier phase of the received signal, respectively. The code matched filter output can be written as a convolution of the form

$$y_i(t_k; \hat{\tau}_{i_{k-1}}, \hat{f}_{d_{i_{k-1}}}, \hat{\phi}_{i_{k-1}}) = r_i(t_k; \tau_{i_k}, f_{d_{i_k}}, \phi_{i_k}) * h_{MF_i}(t_k; \hat{\tau}_{i_{k-1}}, \hat{f}_{d_{i_{k-1}}}, \hat{\phi}_{i_{k-1}})\,. \qquad (6)$$

Notice that, in the matched filter, we have substituted the estimates $\hat{\tau}_{i_k}$, $\hat{f}_{d_{i_k}}$ and $\hat{\phi}_{i_k}$ by trial values obtained from previous (in time) estimates of these parameters, which we have defined as $\hat{\tau}_{i_{k-1}}$, $\hat{f}_{d_{i_{k-1}}}$ and $\hat{\phi}_{i_{k-1}}$, respectively. This is the usual procedure in GNSS receivers, since the estimates are not actually available, but are to be estimated after correlation. Since the correlators perform the accumulation of the sampled signal during a period $T_{int}$ and then release an output, we can write the discrete version of the signal as:

the coherent carrier phase four-quadrant arctangent discriminator

$$\Delta\hat{\phi}_{i_k} = \operatorname{atan2}\!\left( \Im\{P_{i_k}\},\, \Re\{P_{i_k}\} \right)\,, \qquad (12)$$

and the four-quadrant arctangent FLL discriminator as a measure of the frequency error
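As a plain illustration of the correlation and discriminator operations above, the following scalar C++ sketch computes a prompt correlator output and applies the four-quadrant arctangent discriminator of Eq. (12). This is illustrative code only, not the receiver's actual implementation; all function names and parameter values are ours.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

constexpr float kPi = 3.14159265358979f;

// Prompt correlator (the essence of Eq. (5)): wipe off the carrier with the
// local Doppler/phase estimates, despread with the local code replica, and
// accumulate over the integration period.
std::complex<float> correlate(const std::vector<std::complex<float>>& x,  // received samples
                              const std::vector<float>& code,             // local replica, +/-1 per sample
                              float doppler_hz, float phase_rad, float fs_hz) {
    std::complex<float> acc(0.0f, 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        const float wipeoff = -2.0f * kPi * doppler_hz * static_cast<float>(n) / fs_hz - phase_rad;
        acc += x[n] * std::polar(1.0f, wipeoff) * code[n];
    }
    return acc;
}

// Coherent carrier phase discriminator of Eq. (12): four-quadrant arctangent
// of the prompt correlator output P.
float phase_discriminator(std::complex<float> P) {
    return std::atan2(P.imag(), P.real());
}
```

Feeding the correlator a signal generated with a known Doppler and carrier phase, and despreading with the true code and Doppler, yields an accumulator whose angle is the residual phase error, which is exactly what the discriminator extracts.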
[Figure 1: block diagram of the tracking loops. The input sample stream feeds a carrier wipe-off rotator (16ic_32fc_x2_rotator_16ic) driven by a carrier NCO; a PRN code generator with a multiple-delay resampler (16ic_xn_resampler_16ic_xn) produces the local replicas; Early, Prompt and Late Integrate & Dump correlators (16ic_x2_dot_prod_16ic) feed the code and carrier discriminators and loop filters.]
Fig. 1. Diagram of typical code and carrier tracking loops in a GNSS receiver. Colored dotted-line boxes show functions
that have been implemented in SIMD technology. In this example, lanes with label “16ic” are data streams whose items are
complex numbers with real and imaginary components represented with 16-bit integers, whereas label “32fc” indicates lanes
whose items are complex numbers with real and imaginary components in 32-bit floating point representation.
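The kernels highlighted in Figure 1 ultimately reduce to a few processing patterns; the central one is the complex multiply-and-accumulate performed by the Integrate & Dump correlators. A scalar C++ sketch of a "16ic" dot product follows. This is illustrative only (the type mirrors VOLK's lv_16sc_t, i.e., std::complex<int16_t>); the actual VOLK_GNSSSDR proto-kernels implement the same computation with SSE/AVX/NEON intrinsics.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

// Generic (scalar) reference for a "16ic" dot product: a multiply-and-
// accumulate of complex samples whose real and imaginary parts are 16-bit
// integers. SIMD proto-kernels produce the same result while processing
// several complex samples per instruction.
std::complex<int16_t> dot_prod_16ic(const std::complex<int16_t>* in,
                                    const std::complex<int16_t>* code,
                                    std::size_t n_points) {
    int32_t re = 0;
    int32_t im = 0;  // wider accumulators postpone overflow of the 16-bit lanes
    for (std::size_t i = 0; i < n_points; ++i) {
        // (a + jb)(c + jd) = (ac - bd) + j(ad + bc)
        re += int32_t(in[i].real()) * code[i].real() - int32_t(in[i].imag()) * code[i].imag();
        im += int32_t(in[i].real()) * code[i].imag() + int32_t(in[i].imag()) * code[i].real();
    }
    return std::complex<int16_t>(static_cast<int16_t>(re), static_cast<int16_t>(im));
}
```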
or 64 bits per item), and interpreted either as integers (signed or unsigned) or floating-point values. Some of those specific formats for data items are summarized in Table 1. A conversion is then required from the sample bit length delivered by the analog-to-digital converter at the output of the front-end to the bit length and format of the data items feeding the software-defined receiver. Section 7 presents results for operations (shown in Figure 1) on data types labelled as "16ic" and "32fc" in Table 1.

Type name in VOLK | Definition | Sample stream
"8i"   | Signed integer, 8-bit two's complement number ranging from -128 to 127. C type name: int8_t | [S0], [S1], [S2], ...
"8u"   | Unsigned integer, 8 bits, ranging from 0 to 255. C type name: unsigned char | [S0], [S1], [S2], ...
"8ic"  | Complex samples, with real and imaginary parts of type int8_t. C type name: lv_8sc_t (*) | [S0I+jS0Q], [S1I+jS1Q], ...
"16i"  | Signed integer, 16-bit two's complement number ranging from -32768 to 32767. C type name: int16_t | [S0], [S1], [S2], ...
"16u"  | Unsigned integer, 16 bits, ranging from 0 to 65535. C++ type name: uint16_t | [S0], [S1], [S2], ...
"16ic" | Complex samples, with real and imaginary parts of type int16_t. C type name: lv_16sc_t (*) | [S0I+jS0Q], [S1I+jS1Q], ...
"32u"  | Unsigned integer, 32 bits, ranging from 0 to 4294967295. C type name: uint32_t | [S0], [S1], [S2], ...
"32f"  | Signed numbers with fractional parts, can represent values ranging from ≈ 3.4×10⁻³⁸ to 3.4×10³⁸ with a precision of 7 digits (32 bits). C type name: float | [S0], [S1], [S2], ...
"32fc" | Complex samples, with real and imaginary parts of type float. C++ type name: lv_32fc_t (*) | [S0I+jS0Q], [S1I+jS1Q], ...
"64u"  | Unsigned integer, 64 bits, ranging from 0 to 2⁶⁴−1. C type name: uint64_t | [S0], [S1], [S2], ...
"64f"  | Signed numbers with fractional parts, can represent values ranging from ≈ 1.7×10⁻³⁰⁸ to 1.7×10³⁰⁸ with a precision of 15 digits (64 bits). C type name: double | [S0], [S1], [S2], ...

Table 1. Data type names used in the VOLK library, which also provides the C programming language type name definitions marked with an asterisk (*).

5. DATA PARALLELIZATION

At a lower level of abstraction, some operations on incoming data can be further parallelized by applying the same operation to different data samples at a time (i.e., parallelizing in the temporal index n). This is the approach of SIMD processing, which has been embodied in the different technologies described below.

5.1. SSE technology

The family of Streaming SIMD Extensions (SSE) instruction sets is now present in all Intel and AMD processors of today's computers. In this technology, the same set of instructions is executed in parallel on different sets of data. This reduces the amount of hardware control logic needed by N times for the same amount of calculations, where N is the width of the SIMD unit. In the case of SSE, registers are 128 bits wide, so each one can hold two complex floating-point samples (denoted as "32fc" in Table 1) or four complex short integers (denoted as "16ic" in Table 1). Operations are then applied to those registers, thus executing the same instruction on multiple samples at a time and saving clock cycles. Hence, SIMD operations can only be applied to certain predefined processing patterns. In addition, it is important to take into account that Intel's and AMD's processors will transfer data to and from memory into registers faster if the data is aligned to 16-byte boundaries. While the compiler will take care of this alignment when using the basic 128-bit type, this means that data has to be stored in sets of four 32-bit floating point values in memory for optimal performance. If data is not stored in this fashion, then more costly unaligned scalar memory moves are needed instead of packed, 128-bit aligned moves. Effective SSE code will minimize the number of data movements between the memory subsystem and the CPU registers: data should be loaded into SSE registers only once, and the results moved back into memory only once, when they are no longer needed in that particular code block. Listing 1 provides a pseudocode example of SIMD programming.

5.2. AVX technology

Intel's extension to the SSE family is the Advanced Vector Extensions (AVX), which extend the 128-bit SSE registers into 256-bit AVX registers consisting of two 128-bit lanes. An AVX lane is an extension of SSE4.2 functionality, with each register holding eight samples of complex 16-bit integer type or four samples of complex 32-bit float type. AVX operates most efficiently when the same operations are performed on both lanes. On the contrary, cross-lane operations are limited and expensive, and not all bit-shuffling combinations are allowed. This leads to a higher shuffle overhead, since many operations now require both cross-lane and intra-lane shuffling [15], thus expending extra clock cycles. The same applies to AVX2 (which adds integer operations to the AVX instruction set) and to the most recent AVX-512, which introduces 512-bit wide registers.

5.3. NEON technology

SIMD technology is also present in ARM processors through the NEON instruction set. The NEON instructions support 8-bit, 16-bit, 32-bit, and 64-bit signed and unsigned integers, as well as 32-bit single-precision floating point elements. NEON technology includes support for unaligned data accesses and easy loading of interleaved data, so there is no need to account for alignment, in contrast to SSEx and AVX. A drawback is that the NEON floating point pipeline is not entirely IEEE 754 compliant. This is a problem for blocks processing a large number of floating point items, since the differences in results accumulate along the samples, making NEON and SSE results not directly comparable. Countermeasures should be taken where applicable.

6. IMPLEMENTATION

6.1. A multi-threaded GNSS receiver

Software-defined receivers can be represented as a flow graph of nodes. Each node represents a signal processing block, whereas links between nodes represent a flow of data. The concept of a flow graph can be viewed as an acyclic directed graph with one or more source blocks (to insert samples into the flow graph), one or more sink blocks (to terminate or export samples from the flow graph), and any signal processing blocks in between. The diagram of a processing block (that is, of a given node in the flow graph), as implemented by the GNU Radio framework, is shown in Figure 2. Each block can have an arbitrary number of input and output ports for data and for asynchronous message passing with other blocks in the flow graph. In all software applications based on the GNU Radio framework, the underlying process scheduler passes items (i.e., units of data) from sources to sinks. For each block, the number of items it can process in a single iteration depends on how much space it has in its output buffer(s) and how many items are available in the input buffer(s). The larger that number is, the better in terms of efficiency (since the majority of the processing time is taken up with processing samples), but also the larger the latency introduced by that block. On the contrary, the smaller the number of items per iteration, the larger the overhead introduced by the scheduler.

Thus, there are some constraints and requirements in terms of the number of available items in the input buffers and of available space in the output buffer in order to make the whole processing chain efficient. In GNU Radio, each block has a
runtime scheduler that dynamically performs all those computations, using algorithms that attempt to optimize throughput, implementing a process network scheduling that fulfills the requirements described in [9]. Each processing block executes in its own thread, which runs the procedure sketched in Algorithm 1. A detailed description of the GNU Radio internal scheduler implementation (memory management, re-

Algorithm 1. Simplified pseudocode for each block's thread in GNU Radio.
1: Set thread's processor affinity and thread priority.
2: Handle queued messages.
3: Compute items available on input buffer(s) as the difference between write and read pointers for all inputs.
4: Compute space on output buffer(s) as the difference between the write pointer and the first read pointer.
5: if all requirements are fulfilled then
6:    Get all pointers to input and output buffers.
7:    Execute the actual signal processing (call work()).
8: else
9:    Try again.
10: end if
11: Notify neighbors (tell previous and next blocks that there is input data and/or output buffer space).
12: Propagate to upstream and downstream blocks that the iteration has finalized.
13: Wait for data/space or a new message to handle.

being executed in its own, independent thread. This strategy results in a software receiver that always attempts to process the signal at the maximum processing capacity, since each block in the flow graph runs as fast as the processor, data flow and buffer space allow, regardless of its input data rate. Achieving real-time operation is then only a matter of executing the receiver's full processing chain on a processing system powerful enough to sustain the required processing load; this does not prevent executing exactly the same process at a slower pace, for example, by reading samples from a file on a less powerful platform.

Figure 3 shows the flow graph diagram used in GNSS-SDR. There is a signal source block (either a file or a radio-frequency front-end) writing samples into a memory buffer at a given sampling rate; some signal conditioning (possible data type adaptation, filtering, frequency downshifting to baseband, and resampling); a set of parallel channels, each one reading from the same upstream buffer and targeted to a different satellite; a block in charge of the formation of observables, collecting the output of each satellite channel after the despreading (and thus at a much slower rate); and a signal sink, responsible for computing the position-velocity-time solution from the obtained observables and providing outputs in standard formats (such as KML, GeoJSON, RINEX, RTCM and NMEA).
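The one-thread-per-block execution model of Algorithm 1 relies on strictly bounded, notifying buffers between blocks. A minimal, self-contained C++ analogue of such a buffer (illustrative only, not GNU Radio code) could look like this: a writer blocks when no output space is left, a reader blocks when no items are available, and each side notifies its neighbor after every operation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Illustrative analogue of the strictly bounded buffers connecting flow
// graph blocks: blocking push/pop with cross-notification, plus an
// end-of-stream signal from the source.
template <typename T>
class BoundedBuffer {
public:
    explicit BoundedBuffer(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {  // called by the upstream block's thread
        std::unique_lock<std::mutex> lk(m_);
        space_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push_back(std::move(item));
        items_.notify_one();  // "notify downstream": input data available
    }

    bool pop(T& item) {  // called by the downstream block's thread
        std::unique_lock<std::mutex> lk(m_);
        items_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;  // closed and fully drained
        item = std::move(q_.front());
        q_.pop_front();
        space_.notify_one();  // "notify upstream": output space available
        return true;
    }

    void close() {  // the source signals end-of-stream
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        items_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable items_, space_;
    std::deque<T> q_;
    const std::size_t capacity_;
    bool closed_ = false;
};
```

With a source thread pushing samples and a sink thread popping them, both run at the pace the slower side allows, which is exactly the backpressure behavior the scheduler provides (compile with -pthread on GNU/Linux).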
[Figure 3: flow graph with N parallel channels between the signal conditioner and the observables block.]

Fig. 3. Simplified GNSS-SDR flow graph diagram. Each blue box is a signal processing block sketched in Figure 2. Here, only the data stream layer is represented, where different data rates are indicated with different colors for the memory buffers.

[Figure 2: a runtime scheduler with handle_msg() and work() methods, upstream/downstream notifications (notify_upstream(), notify_downstream()), and circular input/output buffers with read and write pointers.]
Fig. 2. Diagram of a signal processing block, as implemented by GNU Radio. Each block has a completely independent scheduler running in its own execution thread and a messaging system for communication with other upstream and downstream blocks. The actual signal processing is performed in the work() method. Figure adapted from [17].

The flow graph in Figure 3 can be expanded to accommodate more GNSS signal definitions in the same band (for instance, a GPS L1 C/A and Galileo E1b receiver), and to accommodate more bands (thus defining a multi-band, multi-system GNSS receiver). In all cases, each of the processing blocks will execute in its own thread, defining a multi-threaded GNSS receiver that efficiently exploits task parallelization.
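As a toy illustration of this task parallelism across channels (hypothetical code, not GNSS-SDR's actual channel implementation), N channel threads can despread the same immutable sample buffer, each with its own satellite code:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Every channel thread reads the SAME immutable sample buffer and
// despreads it with its own code, writing its result to a distinct output
// slot, so no locking is needed on the shared input.
std::vector<int> run_channels(const std::vector<int>& samples,
                              const std::vector<std::vector<int>>& codes) {
    std::vector<int> out(codes.size(), 0);
    std::vector<std::thread> pool;
    pool.reserve(codes.size());
    for (std::size_t ch = 0; ch < codes.size(); ++ch) {
        pool.emplace_back([&samples, &codes, &out, ch] {
            int acc = 0;
            for (std::size_t n = 0; n < samples.size(); ++n) {
                acc += samples[n] * codes[ch][n];
            }
            out[ch] = acc;  // one slot per channel
        });
    }
    for (auto& t : pool) t.join();
    return out;
}
```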
• Platform #3 - Embedded development kit: NVIDIA's Jetson TK1 developer kit, equipped with a quad-core ARM Cortex-A15 CPU at 2.32 GHz and an NVIDIA Kepler GPU with 192 CUDA cores clocked at 950 MHz. The operating system used during the tests was GNU/Linux Ubuntu 14.04, 32 bits, using GCC 4.8.4.

• Platform #4 - Mini-computer: Raspberry Pi 3 Model B, equipped with a Broadcom BCM2837 CPU (64-bit ARMv8 quad-core ARM Cortex-A53) clocked at 1.2 GHz. The operating system used during the tests was Raspbian GNU/Linux 8 (jessie), 32 bits, using GCC 4.9.2.

This section reports the results obtained by some of the key kernels implemented in the VOLK_GNSSSDR library and the performance achieved by the full software receiver in the described computing environments. All these results were obtained using GNSS-SDR v0.0.7 [23]. The reader is free to reproduce the experiments on his/her own machine by building that specific source code snapshot (or any other more recent version) and executing the provided profiling application.

7.2. Integration in a GNSS software receiver

In order to measure the performance of the parallelization strategies in combination with the data parallelization techniques described in this paper, when applied to a full software receiver, both the VOLK_GNSSSDR library and the GPU-targeted implementations were integrated into GNSS-SDR. The results obtained on the aforementioned computing platforms are shown below.

7.2.1. Number of channels processed in real-time

Figure 8 shows the execution time for different correlation vector lengths (2048, 4096, and 8192 samples) and for different numbers of parallel channels targeting GPS L1 C/A signals and using three correlators per channel. This configuration mimics typical receiver configurations, and corresponds to the correlation lengths to be computed in 1 ms when sampling at 2.048, 4.096, and 8.192 Msps, respectively. On all platforms, volk_profile and volk_gnsssdr_profile were executed before the tests in order to enjoy the fastest available SIMD implementation for each specific processor. Remarkably, ARM-based platforms achieved real-time processing of four or more channels in the narrowest bandwidth configuration.

Table 2. Maximum number of real-time parallel channels for each platform using GPU accelerators.

Correlator length   2048   4096   8192
Platform #1           65     45     25
Platform #2           12     11     10
Platform #3            1      0      0

8. CONCLUSIONS

This paper described several parallelization techniques addressing computational efficiency, at different abstraction layers, and their specific application in the context of software-defined GNSS receivers. All those concepts were applied in a practical implementation available online under a free and open source software license. Building upon well-established open source frameworks and libraries, this paper showed that it is possible to achieve real-time operation in different computing environments. Portability was demonstrated by building and executing the same source code on a wide range of computing platforms, from high-end servers to tiny and affordable computers, using different operating systems and compilers, and showing notable acceleration factors of key operations in all of them.

As a practical outcome of the presented work, this paper introduced, to the best of the authors' knowledge, the first free and open source software-defined GNSS receiver able to sustain real-time processing and to provide position fixes (as well as other GNSS data in the form of RINEX files or RTCM messages streamed over a network) on ARM-based devices.

Future work will be related to the application of OpenMP for task parallelization; AVX-512 technology using 512-bit registers and 64-bit NEON for data parallelization; and FPGA offloading in order to target real-time operation with higher signal bandwidths in multi-band, multi-constellation configurations.

REFERENCES

[1] G. W. Heckler and J. L. Garrison, "SIMD correlator library for GNSS software receivers," GPS Solutions, vol. 10, no. 4, pp. 269–276, Nov. 2006, DOI: 10.1007/s10291-006-0037-5.

[2] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, The MIT Press, Cambridge, MA, 2008.

[3] N. P. Jouppi and D. Wall, "Available instruction-level parallelism for superscalar and superpipelined machines," in Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), Boston, MA, 1989, pp. 272–282.

[4] I. Bartunkova and B. Eissfeller, "Massive parallel algorithms for software GNSS signal simulation using GPU," in Proc. of the 25th International Technical Meeting of The Satellite Division of the Institute of Navigation, Nashville, TN, Sept. 2012, pp. 118–126.

[5] GNSS-SDR. An Open Source Global Navigation Satellite Systems Software Defined Receiver, Website: http://gnss-sdr.org. Accessed: January 30, 2017.

[6] GNU Radio. The Free & Open Software Radio Ecosystem, Website: http://gnuradio.org. Accessed: January 30, 2017.

[7] G. Kahn, "The semantics of a simple language for parallel programming," in Information Processing, J. L. Rosenfeld, Ed., Stockholm, Sweden, Aug. 1974, pp. 471–475, North Holland.

[8] G. Kahn and D. B. MacQueen, "Coroutines and networks of parallel processes," in Information Processing, B. Gilchrist, Ed., Amsterdam, NE, 1977, pp. 993–998, North Holland.

[9] T. M. Parks, Bounded Scheduling of Process Networks, Ph.D. thesis, University of California, Berkeley, CA, Dec. 1995.

[10] B. Huang, Z. Yao, F. Guo, S. Deng, X. Cui, and M. Lu, "STARx – A GPU based multi-system full-band real-time GNSS software receiver," in Proc. of the 26th International Technical Meeting of The Satellite Division of the Institute of Navigation, Nashville, TN, Sept. 2013, pp. 1549–1559.

[11] K. Karimi, A. G. Pamir, and H. Afzal, "Accelerating a cloud-based software GNSS receiver," Intl. Journal of Grid and High Performance Computing, vol. 6, no. 3, pp. 17–33, Jul./Sep. 2014, DOI: 10.4018/ijghpc.2014070102.

[12] L. Xu, N. I. Ziedan, X. Niu, and W. Guo, "Correlation acceleration in GNSS software receivers using a CUDA-enabled GPU," GPS Solutions, pp. 1–12, available online since Feb. 2016, DOI: 10.1007/s10291-016-0516-2.

[13] Xilinx, San Jose, CA, Zynq-7000 All Programmable SoC Overview DS190 (v1.9). Product Specification, Jan. 2016.

[14] E. D. Kaplan, Understanding GPS. Principles and Applications, Artech House Publishers, 1996.

[15] D. S. McFarlin, V. Arbatov, F. Franchetti, and M. Püschel, "Automatic SIMD vectorization of Fast Fourier Transforms for the Larrabee and AVX instruction sets," in Proc. 25th International Conference on Supercomputing, Tucson, AZ, May 31 – June 4, 2011, pp. 265–274.

[16] T. W. Rondeau, Explaining the GNU Radio Scheduler, Sep. 2013, slides published online at http://www.trondeau.com/blog. Accessed: January 30, 2017.

[17] J. Corgan, "GNU Radio runtime operation," in Proc. GNU Radio Conference, Washington, DC, Aug. 24–28, 2015, pp. 1–12.

[18] Vector-Optimized Library of Kernels, Website: http://libvolk.org. Accessed: January 30, 2017.

[19] T. W. Rondeau, N. McCarthy, and T. O'Shea, "SIMD programming in GNU Radio: Maintainable and user-friendly algorithm optimization with VOLK," in Proc. of the Wireless Innovation Forum Conference of Wireless Communication Technologies and Software Defined Radio, Washington, DC, Jan. 2013.

[20] N. West and D. Geiger, "Accelerating software radio on ARM: Adding NEON support to VOLK," in Proc. IEEE Radio and Wireless Symposium, Newport Beach, CA, Jan. 2014.

[21] CUDA Programming Guide, Website: http://docs.nvidia.com/cuda/cuda-c-programming-guide. Accessed: January 30, 2017.

[22] NVIDIA CUDA Technology, Website: http://www.nvidia.com/CUDA. Accessed: January 30, 2017.

[23] C. Fernández–Prades, J. Arribas, and L. Esteve, GNSS-SDR v0.0.7, Zenodo, May 2016, DOI: 10.5281/zenodo.51521. Available online at https://github.com/gnss-sdr/gnss-sdr/releases/tag/v0.0.7.
[Figure 4: bar charts of the acceleration factor with respect to plain C for the proto-kernels 16ic_x2_dot_prod_16ic, 16ic_x2_dot_prod_16ic_xn, 16ic_xn_resampler_16ic_xn, 16ic_32fc_x2_rotator_16ic, 16ic_rotator_dot_prod_16ic_xn and 32fc_rotator_dot_prod_32fc_xn, comparing the generic implementation with aligned ("a_") and unaligned ("u_") SSE2/SSE3/AVX/AVX2 proto-kernels.]

Fig. 4. Acceleration factor with respect to the generic implementation achieved by different proto-kernels in Platform #1. Operations were applied to vectors of 8111-item length, and the results were averaged over 1987 iterations.
[Figure 5: bar charts of the acceleration factor with respect to plain C for the same six proto-kernels, comparing the generic implementation with aligned and unaligned SSE2/SSE3 proto-kernels.]

Fig. 5. Acceleration factor with respect to the generic implementation achieved by different proto-kernels in Platform #2. Operations were applied to vectors of 8111-item length, and the results were averaged over 1987 iterations.
[Figure 6: bar charts of the acceleration factor with respect to plain C for the same six proto-kernels, comparing the generic implementation with NEON proto-kernels.]

Fig. 6. Acceleration factor with respect to the generic implementation achieved by different proto-kernels in Platform #3. Operations were applied to vectors of 8111-item length, and the results were averaged over 1987 iterations.
[Figure 7: bar charts of the acceleration factor with respect to plain C for the same six proto-kernels, comparing the generic implementation with NEON proto-kernels.]

Fig. 7. Acceleration factor with respect to the generic implementation achieved by different proto-kernels in Platform #4. Operations were applied to vectors of 8111-item length, and the results were averaged over 1987 iterations.
[Figure 8: plots of CPU execution time (in units of 10⁻³ s) versus number of parallel channels, for correlation sizes of 2048, 4096 and 8192 samples, on Platforms #1 to #4.]
Fig. 8. CPU execution times (averaged for 1000 independent realizations) for different number of parallel channels and cor-
relation lengths (2048, 4096 and 8192 samples of type “32fc”) and executing platforms. Each channel was configured with
3 correlators. The intersection of these plots with the dashed red line at 1 ms indicates the number of channels that a given
platform can sustain in real-time for GPS L1 C/A signals.
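The real-time criterion behind Figure 8 and Table 2 can be stated compactly: a platform sustains N channels when the measured execution time for N channels stays below the integration period (1 ms for these GPS L1 C/A configurations). A small sketch of that reading, with hypothetical execution-time values:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given measured execution times in seconds, where entry n-1 holds the time
// for n parallel channels, return the largest channel count whose execution
// time stays below the integration period (real-time capable).
int max_realtime_channels(const std::vector<double>& exec_time_s,
                          double t_int_s = 1e-3) {
    int channels = 0;
    for (std::size_t n = 1; n <= exec_time_s.size(); ++n) {
        if (exec_time_s[n - 1] < t_int_s) channels = static_cast<int>(n);
    }
    return channels;
}
```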
[Figure 9: plots of GPU execution time versus number of parallel channels, for correlation sizes of 2048 and 4096 samples, on Platforms #1 to #3.]
Fig. 9. GPU execution times (averaged for 1000 independent realizations) for different number of parallel channels and cor-
relation lengths (2048, 4096 and 8192 samples of type “32fc”) and executing platforms. Each channel was configured with
3 correlators. The intersection of these plots with the dashed red line at 1 ms indicates the number of channels that a given
platform can sustain in real-time for GPS L1 C/A signals.