Vectorization Strategies For Ant ColonyOptimization On Intel Architectures
1. Introduction
2. Fundamentals
The Traveling Salesman Problem (TSP) [2] consists of finding the shortest round-trip tour
that includes each city from a set of n cities exactly once. The TSP is a paradigmatic NP-hard
combinatorial optimization problem used as a standard test bed for new algorithms.
Ant System [1], the first ACO algorithm, was first applied to the symmetric TSP, in which
the distance between two cities i and j is the same in both directions (d_{ij} = d_{ji}).
Algorithm 1 shows the common structure of ACO algorithms. Firstly, all the data
structures (distance and pheromone matrices, ant colony, etc.) are initialized. Within the
loop, each iteration is composed of two stages: tour construction and pheromone update.
At the start of the tour construction stage, some criterion is used to choose the start
cities at which the ants are positioned. In Ant System, each ant is placed on a randomly
chosen initial city. At each construction step, each ant makes use of a probabilistic choice
rule in order to choose its next city to visit. The probability for ant k, currently placed at
city i, of selecting city j is specified in Eq. (1):
p_{ij}^{k} = \frac{[\tau_{ij}]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{l \in N_i^{k}} [\tau_{il}]^{\alpha} [\eta_{il}]^{\beta}}, \quad \text{if } j \in N_i^{k}, \qquad (1)
where τi j is the amount of pheromone associated with edge (i, j), ηi j = 1/di j is a
heuristic value computed a priori, α and β are two parameters (fixed at the beginning
of an execution) which determine the relative influence of the pheromone trail and the
heuristic information, and Nik is the feasible neighborhood of ant k when placed at city
i, that is, the set of cities that have not been visited yet by ant k. The probability of
choosing a city outside this latter set is 0, thus preventing an ant from visiting a city more
than once. Once the probabilities have been computed, a selection function is used for
choosing the next city according to these probabilities (see next subsection).
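As an illustration, the choice rule of Eq. (1) can be sketched in scalar C++ as follows. This is our own sketch, not the paper's code: tau[j] stands for the pheromone on edge (i, j), eta[j] for the heuristic value 1/d_ij, and feasible[j] marks membership in the feasible neighborhood N_i^k.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the probabilistic choice rule of Eq. (1). Names are ours:
// tau[j] is the pheromone on edge (i, j), eta[j] = 1/d_ij the heuristic value,
// and feasible[j] is 1 if city j belongs to the feasible neighborhood N_i^k.
std::vector<double> choice_probabilities(const std::vector<double>& tau,
                                         const std::vector<double>& eta,
                                         const std::vector<int>& feasible,
                                         double alpha, double beta) {
    std::vector<double> p(tau.size(), 0.0);
    double sum = 0.0;
    for (std::size_t j = 0; j < tau.size(); ++j) {
        if (feasible[j]) {
            p[j] = std::pow(tau[j], alpha) * std::pow(eta[j], beta);
            sum += p[j];
        }
    }
    if (sum > 0.0)
        for (double& v : p) v /= sum;  // normalize; infeasible cities keep probability 0
    return p;
}
```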
After all ants have finished constructing their tours, pheromone update takes place
in a different way for each ACO variant.
October 2017
The default selection function suggested for choosing the next city in the tour construction
stage was Roulette Wheel Selection [1]. The procedure operates like a roulette wheel in a
casino: each unvisited city is assigned a slice of a circular roulette wheel, with the size of
the slice proportional to the probability of visiting that city. Then, to simulate the roulette
being spun, a random number is generated, and the slice in which the number falls
determines the city selected for a given ant.
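A minimal scalar sketch of this procedure (our own illustrative code, not the paper's Algorithm 2) could look like:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of Roulette Wheel Selection: probs[j] holds the probability
// of visiting city j (0 for visited cities) and r is a uniform draw in
// [0, sum of probs). The scan over the running sum is the inherently
// sequential part referred to in the text.
int roulette_select(const std::vector<double>& probs, double r) {
    double cumulative = 0.0;
    for (std::size_t j = 0; j < probs.size(); ++j) {
        cumulative += probs[j];                 // slice j ends at the running sum
        if (r < cumulative)
            return static_cast<int>(j);         // the spin landed inside slice j
    }
    return static_cast<int>(probs.size()) - 1;  // guard against round-off at the end
}
```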
Algorithm 2 shows the pseudocode for Roulette Wheel. The choice info matrix
stores the probabilities of choosing each city without taking into account if the cities are
visited or not. This latter information is stored in the visited matrix. As Roulette Wheel
Selection has an inherently sequential part (lines 15-18) which cannot be vectorized,
two main alternative selection functions were proposed for GPUs in order to exploit
data level parallelism: I-Roulette and DS-Roulette.
I-Roulette (Independent Roulette) [3] was proposed as an alternative method for
removing the sequential part of Roulette Wheel. In this selection function, the probability
of each city which has not been visited yet is multiplied by a different random number in
[0, 1], obtaining a weight for each city. For visited cities, the weight equals zero.
After the weights have been computed for all cities, the city with the highest weight
is selected as the next city. Note that, because of the multiplication by different random
numbers, the weights are not proportional to the probabilities of visiting each city. In
addition, this selection method needs to generate n random numbers (where n is the
number of cities), which is costly, but this can be done in parallel.
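Under these definitions, I-Roulette can be sketched as follows. This is our own deterministic sketch: the per-city draws are passed in via rnd, and visited follows the paper's convention (0 for an already visited city, 1 otherwise).

```cpp
#include <cassert>
#include <vector>

// Sketch of I-Roulette: each city's probability is scaled by an independent random
// draw rnd[j] in [0, 1] and by visited[j] (0 = visited, 1 = not visited); the city
// with the largest resulting weight is selected.
int iroulette_select(const std::vector<double>& probs,
                     const std::vector<int>& visited,
                     const std::vector<double>& rnd) {
    int best = -1;
    double best_w = -1.0;
    for (std::size_t j = 0; j < probs.size(); ++j) {
        double w = probs[j] * visited[j] * rnd[j];  // weight; 0 for visited cities
        if (w > best_w) { best_w = w; best = static_cast<int>(j); }
    }
    return best;
}
```

Note that every iteration of the loop is independent, which is what makes this formulation amenable to data level parallelism.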
Intel Xeon Phi is based on the Many Integrated Core (MIC) [6] architecture. The first
generation of this many-core architecture, known as Knights Corner (KNC), has up to 61
cores and four hardware threads per core. It is also provided with a vector processing unit
(VPU) which operates on 512-bit wide registers. Each core executes instructions in order
and runs at a low clock speed (less than 1.3 GHz). Thus, although its cores are simple,
Intel many-core architectures present a better performance/power ratio than Intel
multi-core processors, which comprise a lower number of more complex cores. It is
worth pointing out that, in order to exploit all the hardware capabilities of this
architecture, it is paramount to make use of both thread and vector level parallelism.
The tour construction stage is inherently parallel, as each ant can construct its solution
individually. On Intel architectures, thread level parallelization is immediate. We use
#pragma omp parallel for to map ants to threads. The computation of the numerator
of Eq. (1), performed before the ants construct the solutions, is also parallelized using
the same pragma. However, vectorization is also necessary to obtain high performance.
In our vectorized implementations, we rely on the capability of the Intel C++ compiler to
automatically vectorize some loops, but we have to facilitate its task by means of hints
and changes to the code. Concretely, we ensure all the data structures are aligned
to 64 bytes and use pragmas that assist vectorization, such as #pragma ivdep, which
tells the compiler to ignore assumed vector dependences.
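A minimal sketch of these hints follows; the helper names are ours, not the paper's code. A 64-byte-aligned buffer matches the 512-bit vector registers, and #pragma ivdep lets the compiler vectorize a loop it would otherwise treat conservatively (on non-Intel compilers the pragma is simply ignored).

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative sketch of the vectorization hints described above (our own code).
// aligned_alloc requires the requested size to be a multiple of the alignment.
float* alloc_aligned(std::size_t n) {
    return static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
}

// #pragma ivdep asks the compiler to ignore assumed loop-carried dependences,
// enabling auto-vectorization of this simple element-wise loop.
void scale(float* __restrict out, const float* __restrict in, float factor, int n) {
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}
```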
Within the tour construction stage, 99.66% of the time is spent on choosing the next
city, that is, on the selection function. Thus, hereafter we focus on how to implement the
different selection functions so that they are vectorized on Intel architectures. The following
are some details of our implementation of the three selection algorithms previously described:
• Roulette Wheel (Algorithm 2) comprises two loops: the first one (lines 3-10) computes
the probability of selecting each city, and the second (lines 15-18) simulates the
roulette spinning. Looking at the vectorization report from the Intel compiler,
we notice that neither of these loops is vectorized. We discuss how to
address this issue in the next section.
• I-Roulette v1: In our first implementation of I-Roulette, the main loop is vectorized
and performs the weights computation as a multiplication of three elements:
the probability of choosing a city, the value representing whether the city has been
visited (zero) or not (one), and a random number in [0, 1]. Then, a serial reduction
is performed to find the city with the largest weight.
• DS-Roulette: Our implementation of DS-Roulette has 3 stages. In the first one,
selection probabilities are computed in a SIMD fashion for cities and blocks. The
size of the block (block_size) has been set to 16, as we obtained the best results
with this value. The number of blocks is ⌈n / block_size⌉, so the last block
contains the remaining cities when n is not a multiple of block_size. After
stage one, two roulette wheels are run sequentially: the first for choosing a block
and the second for choosing a city within that block.
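The three stages just listed can be sketched in scalar C++ as follows. This is a simplified sketch under our own assumptions: the two uniform draws r1 and r2 are passed in explicitly, and only stage 1 is the SIMD-friendly part.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified sketch of DS-Roulette (our own code). probs[j] is the selection
// probability of city j (0 for visited cities); r1, r2 are uniform draws in [0, 1).
int ds_roulette_select(const std::vector<double>& probs, int block_size,
                       double r1, double r2) {
    int n = static_cast<int>(probs.size());
    int num_blocks = (n + block_size - 1) / block_size;  // ceil(n / block_size)
    std::vector<double> block_sum(num_blocks, 0.0);
    double total = 0.0;
    for (int j = 0; j < n; ++j) {             // stage 1: blockwise sums (SIMD-friendly)
        block_sum[j / block_size] += probs[j];
        total += probs[j];
    }
    double acc = 0.0;                         // stage 2: first roulette, over blocks
    int b = num_blocks - 1;
    for (int i = 0; i < num_blocks; ++i) {
        acc += block_sum[i];
        if (r1 * total < acc) { b = i; break; }
    }
    acc = 0.0;                                // stage 3: second roulette, inside block b
    int last = std::min(n, (b + 1) * block_size);
    for (int j = b * block_size; j < last; ++j) {
        acc += probs[j];
        if (r2 * block_sum[b] < acc) return j;
    }
    return last - 1;
}
```

Only two random numbers are needed per selection, which is why random number generation barely affects DS-Roulette's execution time.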
Regarding the generation of random numbers, it does not affect the execution time
of Roulette Wheel or DS-Roulette, but it does affect that of I-Roulette, which generates
n times more random numbers. We selected Stützle's generator for I-Roulette because of
its speed. For the other functions, we used the same generator to keep the comparisons fair.
As ants are mapped to threads, the seed for generating random numbers is replicated
into a vector of seeds, so that each thread has its own. In addition, as I-Roulette
also generates random numbers in a vectorized way, a matrix of seeds is necessary,
with a row for each thread and as many columns as the architecture's vector length.
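A possible sketch of this seed replication follows. The multiplicative offset used to decorrelate the streams is our own assumption; the paper does not specify how the per-thread, per-lane seeds are derived.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of building the matrix of seeds described above: one row per thread,
// one column per vector lane. The offset scheme is ours, not the paper's.
std::vector<std::vector<std::uint32_t>>
make_seed_matrix(std::uint32_t seed, int num_threads, int vector_length) {
    std::vector<std::vector<std::uint32_t>> seeds(
        num_threads, std::vector<std::uint32_t>(vector_length));
    for (int t = 0; t < num_threads; ++t)
        for (int l = 0; l < vector_length; ++l)
            // every (thread, lane) pair gets a distinct seed derived from the base seed
            seeds[t][l] = seed + 2654435761u
                          * static_cast<std::uint32_t>(t * vector_length + l);
    return seeds;
}
```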
4. Improving Vectorization
4.1. V-Roulette
The main problem when trying to vectorize the first loop of Roulette Wheel Selection is
the if-else statement (lines 4-9 in Algorithm 2). As shown in Algorithm 3, this statement
has been replaced with a multiplication. The values stored in the visited matrix are set
in the following way: if an ant k has visited city i, visited[k][i] equals 0. Otherwise,
visited[k][i] equals 1. This way, when multiplying the probability of visiting a city (in
choice info) by its corresponding position of the visited matrix, a 0 is obtained if the city
is already visited, or the value stored in the choice info matrix if the city is not visited
yet. Moreover, the computation of the addition of probabilities (in prob sum) can be
carried out as a vectorized reduction operation that is performed automatically by the
Intel compiler. Thus, we maintain the meaning of the original Roulette Wheel algorithm.
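The branch-free loop just described can be sketched as follows (variable names are ours; the pattern is what matters):

```cpp
#include <cassert>

// Sketch of V-Roulette's branch-free first loop: the if-else of Algorithm 2 is
// replaced by a multiplication by the 0/1 entries of the visited matrix, so both
// the element-wise product and the prob_sum reduction can be auto-vectorized.
float vroulette_probs(float* __restrict prob,
                      const float* __restrict choice_info_row,
                      const float* __restrict visited_row, int n) {
    float prob_sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        prob[j] = choice_info_row[j] * visited_row[j];  // 0 if city j already visited
        prob_sum += prob[j];                            // vectorized reduction
    }
    return prob_sum;
}
```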
Regarding the second loop of Roulette Wheel, each iteration depends on the previous
one and the number of iterations is not known before executing the loop, so this part
remains sequential.
4.2. I-Roulette v2
5. Evaluation
For generating our base code, we have modified Stützle’s implementation [7] of the main
ACO algorithms so that it follows the original proposal of the metaheuristic [1]. The
evaluation platform is equipped with two Intel Xeon (E5-2650 v2, 16 cores; and E5-2698
v4, 40 cores) processors and an Intel Xeon Phi 7120P coprocessor (61 cores). The system
runs Linux CentOS 6.7 with kernel 2.6.32, and Intel MPSS 3.7.2. Codes are built using
Intel’s icpc compiler (version 17.0.2) with the optimization level -O3. Evaluations have
been carried out using the maximum number of threads per core available for each
architecture (2 on Intel Xeon v2 and v4, and 4 on Xeon Phi). Our different implementations
are tested using a set of instances from the TSPLIB benchmark library [8]. We set ACO
parameters as recommended in [1]: m = n (where m is the number of ants and n is the
number of cities), α = 1 and β = 5. Performance figures are given for single-precision
numbers and a single iteration averaged over 10 independent runs of 100 iterations.
Algorithm 4 I-Roulette v2
Input: Ant identifier (a), current city (current_city).
Output: Selected city.
1: I_max ← (−1, . . . , −1)
2: W_max ← (−1, . . . , −1)
3: {Weights Computation and Vectorized Reduction}
4: for i = 0 to num_vectors − 1 do
5:   j ← i ∗ VECTOR_LENGTH
6:   for k = 0 to VECTOR_LENGTH − 1 do
7:     weight ← choice_info[current_city][j] ∗ visited[a][j] ∗ random(0..1)
8:     if weight > W_max[k] then
9:       I_max[k] ← j
10:      W_max[k] ← weight
11:    end if
12:    j ← j + 1
13:  end for
14: end for
15: {Compute maximum of vector (Serial)}
16: i ← argmax(W_max)
17: city ← I_max[i]
18: max_weight ← W_max[i]
19: {Compute remainder weights and check if greater than current maximum (Serial)}
20: return city
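Algorithm 4 can be rendered in C++ roughly as follows. This is a deterministic sketch: the per-city draws are supplied via rnd so the result is checkable, and the inner loop over lanes is the part the compiler vectorizes.

```cpp
#include <cassert>
#include <vector>

// C++ sketch of I-Roulette v2 (Algorithm 4). choice_info_row is the row of
// choice_info for current_city; visited_row follows the paper's convention
// (1.0 = not yet visited, 0.0 = visited); rnd holds one uniform draw per city.
int iroulette_v2(const std::vector<float>& choice_info_row,
                 const std::vector<float>& visited_row,
                 const std::vector<float>& rnd, int vector_length) {
    int n = static_cast<int>(choice_info_row.size());
    int num_vectors = n / vector_length;
    std::vector<int>   imax(vector_length, -1);     // per-lane best city
    std::vector<float> wmax(vector_length, -1.0f);  // per-lane best weight
    for (int i = 0; i < num_vectors; ++i)           // vectorizable: lanes independent
        for (int k = 0; k < vector_length; ++k) {
            int j = i * vector_length + k;
            float w = choice_info_row[j] * visited_row[j] * rnd[j];
            if (w > wmax[k]) { wmax[k] = w; imax[k] = j; }
        }
    int best = 0;                                   // serial argmax over the lanes
    for (int k = 1; k < vector_length; ++k)
        if (wmax[k] > wmax[best]) best = k;
    int city = imax[best];
    float best_w = wmax[best];
    // serial tail: remainder cities when n is not a multiple of vector_length
    for (int j = num_vectors * vector_length; j < n; ++j) {
        float w = choice_info_row[j] * visited_row[j] * rnd[j];
        if (w > best_w) { best_w = w; city = j; }
    }
    return city;
}
```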
5.2. Scalability
Our first experiment deals with the parallel scalability of the tour construction stage.
As this stage does not have any synchronization problems and is compute bound, our
implementation for the Intel architectures should scale across the number of cores used
by the processor. We tested our parallel implementation of the tour construction stage
using the default selection function (Roulette Wheel).
In this experiment, speed up is measured against the code running on one core with
2 threads, for both Xeon v2 and v4. For larger numbers of cores, 2 threads per core are
always used, and the affinity parameter is set to compact so as to fill all the cores in
use. Figure 1 shows the thread scalability on Xeon v2 and Xeon v4.
As we can see, the tour construction stage scales well, achieving better scalability
with larger instances, which require more computations. However, we expect our
results to get closer to the theoretical limit for problems with unbalanced thread
load by adjusting the scheduling policy. Similar results for this stage
are obtained varying the selection function used.
Several conclusions can be drawn from Figure 2: a) vectorization pays off on both
architectures, obtaining speed up factors of up to 7.51x, with the speed up increasing
with the problem size; b) the best results on Xeon Phi are obtained for I-Roulette v2
and DS-Roulette, and on Xeon v4 for DS-Roulette; c) just partially vectorizing the
Roulette Wheel version (V-Roulette) already yields a speed up factor of 2x;
d) I-Roulette v1 gives worse results than V-Roulette, as it performs more computations:
an extra multiplication within the vectorized loop and a final serial reduction of n
iterations, whereas V-Roulette executes n iterations in its second loop only in the worst
case; e) it is necessary to completely vectorize I-Roulette (as done in I-Roulette v2) to
obtain high performance with this strategy on Xeon Phi; f) the overall improvements for
the different strategies are considerably higher on Xeon Phi, as the width of its VPU
doubles that of the Xeon v4. For this reason, the generation of random numbers in
I-Roulette v2 does not affect performance on Xeon Phi as much as on Xeon v4, since
the former offers more vector level parallelism.
Figure 2. Speed up for the tour construction stage with different selection functions (compared to parallel
version on each architecture).
Table 1 presents the execution times for instances of different size, and Figure 3 shows
the speed up obtained for each selection function on Xeon Phi against the sequential
version running on the Xeon v2 CPU. Just parallelizing the code using threads (Roulette
Wheel version) achieves a speed up factor of between 7.86x and 11.95x over the single-
threaded code on the CPU. Vectorization is essential for achieving higher performance,
as can be seen from the results of the three vectorized implementations. The partially
vectorized version of Roulette Wheel (V-Roulette) is between 16.91 and 27.23
times faster than the serial code on the CPU. Finally, the best results are obtained for
I-Roulette v2 and DS-Roulette, varying between 37.05x and 78.98x.
Table 1. Execution time (ms) for tour construction on Xeon v2 (sequential version) and Xeon Phi.
Instance CPU RW V-RW I-R v2 DS-R
lin318 77.8 9.9 4.6 2.1 2.1
rat783 1391.3 116.4 51.1 17.8 19.0
pr1002 2056.3 225.0 100.9 31.6 35.1
pr2392 26410.1 2510.0 1114.7 334.4 339.3
Figure 3. Speed up for the tour construction stage with different selection functions on Xeon Phi (compared
to sequential version on Xeon v2).
6. Related Work
Regarding ACO on GPUs, Cecilia et al. [3] noted that high performance cannot be
obtained by adopting a task parallelism approach, which maps each ant to a CUDA thread. In
contrast, they suggested a data parallel approach for a better use of the GPU architecture,
and introduced a new selection function, I-Roulette (Independent Roulette), reporting
speed ups of more than 20x over the sequential implementation on the CPU. They also
implemented a vectorized version of Roulette Wheel (in a similar way as we have done
for V-Roulette on Intel architectures), but in their case I-Roulette reached up to 2.36x
gain over this latter method on the GPU. On the other hand, and around the same time,
Dawson and Stewart [4] proposed DS-Roulette (Double Spin Roulette) as an alternative
that preserves the proportionality of Roulette Wheel while minimizing its inherently
sequential part. Speed ups of up to 82x against the sequential counterpart were reported.
A new line of research is porting ACO to Intel many-core architectures. Lloyd and
Amos [9] claimed that the previous attempts to implement ACO on Xeon Phi obtained
poor performance because they made no use of vectorization. As an alternative, they
proposed two vectorized selection functions. The first one, vRoulette-1, is a vectorized
version of I-Roulette [3], for which they report speed ups between 5.6x and 16.6x compared
to the CPU sequential version. Their second proposal, vRoulette-2, is a variant of DS-
Roulette [4] in which several roulette wheel selections take place as repeated binary trials
in which the winner of a trial accumulates the weight assigned to the loser. The reported
speed up for vRoulette-2 varies between 5.3x and 13x. However, these two proposals do
not scale well for large instances (over 1000 cities).
7. Conclusions
Acknowledgments
References
[1] M. Dorigo and T. Stützle. Ant Colony Optimization. A Bradford Book, The MIT Press, USA, 2004.
[2] E. Lawler, J. Lenstra, A. Rinnooy Kan, and D. Shmoys. The Traveling Salesman Problem. Wiley, New York, 1987.
[3] J. M. Cecilia, J. M. García, A. Nisbet, M. Amos, and M. Ujaldón. Enhancing data parallelism for Ant
Colony Optimization on GPUs. J. Parallel Distrib. Comput., vol. 73, no. 1, pp. 42–51, 2013.
[4] L. Dawson and I. Stewart. Improving Ant Colony Optimization performance on the GPU using CUDA.
IEEE Congress on Evolutionary Computation, Cancun, Mexico, pp. 1901–1908, 2013.
[5] H. Lloyd and M. Amos. Analysis of Independent Roulette Selection in Parallel Ant Colony Optimization.
Proceedings of the Genetic and Evolutionary Computation Conference, Berlin, Germany, 2017.
[6] A. Duran and M. Klemm. The Intel Many Integrated Core Architecture. High Performance Computing
and Simulation (HPCS), 2012 International Conference, pp. 365–366, 2012.
[7] T. Stützle. ACOTSP v1.03. Last accessed 2017-07-20. [Online]. URL: iridia.ulb.ac.be/~mdorigo/ACO/downloads/ACOTSP-1.03.tgz
[8] G. Reinelt. TSPLIB - A traveling salesman problem library. ORSA Journal on Computing, 3, 376-384,
1991.
[9] H. Lloyd and M. Amos. A Highly Parallelized and Vectorized Implementation of Max-Min Ant System
on Intel Xeon Phi. IEEE Symposium Series on Computational Intelligence, Athens, 2016.