
October 2017

Vectorization Strategies for Ant Colony Optimization on Intel Architectures
Victoriano Montesinos a,1 and José M. García a
a Computer Engineering Department, University of Murcia, 30100 Murcia, Spain
1 Corresponding author. Email addresses: [email protected] (V. Montesinos), [email protected] (J.M. García).

Abstract. This paper presents an efficient parallel and vectorized implementation of three different selection functions (Roulette Wheel, I-Roulette and DS-Roulette) for tour construction (the most time-consuming part of the Ant Colony Optimization bio-inspired metaheuristic), targeting two Intel multi-core processors and the Knights Corner Intel Xeon Phi coprocessor. The results show that our best implementation (with I-Roulette as the selection function) on a Xeon Phi 7120P runs up to 78.98x faster than its sequential counterpart on a Xeon v2 CPU.

Keywords. Ant Colony Optimization, Parallel Metaheuristics, Xeon Phi, Vectorization.

1. Introduction

Ant Colony Optimization (ACO) [1] is a well-known population-based metaheuristic which has been applied successfully to a wide range of NP-hard combinatorial optimization problems, including the Traveling Salesman Problem (TSP). ACO is a bio-inspired swarm intelligence method based on the foraging behavior of ants, first proposed by Marco Dorigo in 1992. The technique generates solutions in a constructive way, starting from an empty solution and iteratively adding new elements. When using ACO to solve the TSP, each solution is a tuple of cities. The metaheuristic consists of two main stages: tour construction and pheromone update. In the first stage, each ant builds a path by iteratively selecting the next city among the unvisited ones. After this, pheromone update is performed, comprising two phases: pheromone evaporation (to gradually forget bad tours) and pheromone deposit (to reinforce good-quality solutions).
This paper presents an efficient implementation of three different selection functions
for the tour construction stage (which is common to all ACO variants). Parallelization
of the pheromone update stage is left as future work, as tour construction takes over
99.82 % of the time in the sequential version and the pheromone update stage is different
for each ACO variant. Thus, our conclusions in this paper concern all ACO algorithms.
Firstly, we develop a parallel implementation of the default selection function (Roulette Wheel Selection) for tour construction, observing important limitations due to poor vectorization of the code. Then, we implement on Intel architectures two alternative algorithms that were introduced for GPUs: I-Roulette and DS-Roulette. Finally, we propose a partially vectorized implementation of Roulette Wheel Selection (named V-Roulette), and an enhanced, completely vectorized implementation of I-Roulette. Performance assessment of all the implementations is performed in terms of execution time
targeting two Intel multi-core processors (Xeon E5-2650 v2 and E5-2698 v4) and the
first generation of Intel Xeon Phi 7120P (KNC) coprocessor. Our experimental results
confirm that I-Roulette and DS-Roulette are the best strategies to use when parallelizing
ACO, as both selection functions on Xeon Phi achieve a speedup factor of up to 78.98x over the sequential counterpart on Xeon v2.

2. Fundamentals

2.1. Ant Colony Optimization for the Traveling Salesman Problem

The Traveling Salesman Problem (TSP) [2] consists of finding the shortest round-trip tour that includes each city from a set of n cities exactly once. The TSP is a paradigmatic NP-hard combinatorial optimization problem used as a standard test bed for new algorithms. Ant System [1], the first ACO algorithm, was first applied to the symmetric TSP, in which the distance between two cities i and j is the same in both directions (d_{ij} = d_{ji}).
Algorithm 1 shows the common structure of ACO algorithms. Firstly, all the data
structures (distance and pheromone matrices, ant colony, etc.) are initialized. Within the
loop, each iteration is composed of two stages: tour construction and pheromone update.

Algorithm 1 ACO algorithms general structure
1: Initialization()
2: while not TerminationCondition() do
3:   TourConstruction()
4:   PheromoneUpdate()
5: end while

At the start of the tour construction stage, some criterion is used to choose the start
cities at which the ants are positioned. In Ant System, each ant is placed on a randomly
chosen initial city. At each construction step, each ant makes use of a probabilistic choice
rule in order to choose its next city to visit. The probability for ant k, currently placed at
city i, of selecting city j is specified in Eq. (1):

p_{ij}^k = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}, \quad \text{if } j \in N_i^k,    (1)

where τ_{ij} is the amount of pheromone associated with edge (i, j), η_{ij} = 1/d_{ij} is a heuristic value computed a priori, α and β are two parameters (fixed at the beginning of an execution) which determine the relative influence of the pheromone trail and the heuristic information, and N_i^k is the feasible neighborhood of ant k when placed at city i, that is, the set of cities not yet visited by ant k. The probability of choosing a city outside this set is 0, thus preventing an ant from visiting a city more than once. Once the probabilities have been computed, a selection function is used to choose the next city according to these probabilities (see the next subsection).
After all ants have finished constructing their tours, pheromone update takes place
in a different way for each ACO variant.

2.2. Selection Functions for Tour Construction

The default selection function suggested for choosing the next city in the tour construction stage was Roulette Wheel Selection [1]. The procedure operates like a roulette wheel in a casino: each unvisited city is assigned a slice of a circular roulette wheel, with the size of the slice being proportional to the probability of visiting that city. Then, to simulate the roulette being spun, a random number is generated, and the slice in which the number falls determines the city selected for the given ant.

Algorithm 2 Roulette Wheel Selection
Input: Ant identifier (a), current city (current_city).
Output: Selected city.
1: {Selection Probabilities Computation}
2: prob_sum ← 0
3: for i = 1 to n do
4:   if visited[a][i] then
5:     prob[i] ← 0
6:   else
7:     prob[i] ← choice_info[current_city][i]
8:     prob_sum ← prob_sum + prob[i]
9:   end if
10: end for
11: {City Selection}
12: r ← random(0..prob_sum)
13: city ← 1
14: partial_sum ← prob[city]
15: while partial_sum < r do
16:   city ← city + 1
17:   partial_sum ← partial_sum + prob[city]
18: end while
19: return city

Algorithm 2 shows the pseudocode for Roulette Wheel. The choice_info matrix stores the probabilities of choosing each city without taking into account whether the cities have been visited or not; this latter information is stored in the visited matrix. As Roulette Wheel Selection has an inherently sequential part (lines 15-18) which cannot be vectorized, two main alternative selection functions were successfully proposed for GPUs in order to exploit data-level parallelism: I-Roulette and DS-Roulette.
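For concreteness, a direct C rendering of Algorithm 2 could look as follows. This is only a sketch: the flattened array layout, the scratch buffer and the rand_uniform() generator are illustrative assumptions, not the authors' code.

/* Roulette Wheel Selection (C sketch of Algorithm 2, 0-based indices).
 * choice_info: flattened n x n matrix with the numerator of Eq. (1);
 * visited: flattened ants x n matrix, nonzero if the city was already visited;
 * prob: scratch buffer of n floats; rand_uniform() returns a value in [0, 1). */
int roulette_wheel(int a, int current_city, int n,
                   const float *choice_info, const char *visited,
                   float *prob, float (*rand_uniform)(void))
{
    /* Selection probabilities computation */
    float prob_sum = 0.0f;
    for (int i = 0; i < n; i++) {
        prob[i] = visited[a * n + i] ? 0.0f : choice_info[current_city * n + i];
        prob_sum += prob[i];
    }

    /* City selection: spin the roulette */
    float r = rand_uniform() * prob_sum;
    int city = 0;
    float partial_sum = prob[0];
    while (partial_sum < r && city < n - 1) {
        city++;
        partial_sum += prob[city];
    }
    return city;
}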
I-Roulette (Independent Roulette) [3] was proposed as an alternative method for
removing the sequential part of Roulette Wheel. In this selection function, the probability
of each city which has not been visited yet is multiplied by a different random number in
[0, 1], obtaining a weight for each city. For visited cities, the value of the weight equals
zero. After the weights have been computed for all cities, the city with the highest one
is selected as the next city. Note that, because of the multiplication by different random
numbers, the weights are not proportional to the probabilities of visiting each city. In
addition, this selection method needs to generate n random numbers (where n is the
number of cities), which is costly, but this can be done in parallel.
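A minimal C sketch of this selection function follows, under the illustrative convention (used later in the paper for the vectorized variants) that a visited entry equal to 0 marks an already-visited city and 1 an unvisited one; rand_uniform() is again a placeholder.

/* I-Roulette selection (sketch): each unvisited city gets a weight equal to its
 * selection probability multiplied by an independent random number in [0, 1];
 * the city with the largest weight is chosen. */
int i_roulette(int a, int current_city, int n,
               const float *choice_info, const float *visited,
               float (*rand_uniform)(void))
{
    int best_city = 0;
    float best_weight = -1.0f;
    for (int i = 0; i < n; i++) {
        float w = choice_info[current_city * n + i]
                * visited[a * n + i]          /* 0 if visited, 1 otherwise */
                * rand_uniform();
        if (w > best_weight) {
            best_weight = w;
            best_city = i;
        }
    }
    return best_city;
}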

DS-Roulette (Double-Spin Roulette) [4] was introduced as a selection function that minimizes the sequential part of Roulette Wheel while preserving the proportionality between weights and probabilities. This method groups cities into blocks and computes each block's probability as the sum of the probabilities of the cities within that block. Then, two roulette wheel selections take place: one for choosing a block, and a second one for choosing a city within the selected block.
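The C sketch below illustrates these steps (per-block probability sums, block selection, city selection within the block). The block size of 16, the scratch buffers and rand_uniform() are illustrative choices, not the authors' implementation.

/* DS-Roulette selection (sketch). */
#define BLOCK_SIZE 16

int ds_roulette(int a, int current_city, int n,
                const float *choice_info, const float *visited,
                float *prob, float *block_sum,   /* scratch buffers */
                float (*rand_uniform)(void))
{
    int num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    /* Stage 1: per-city probabilities and per-block sums */
    for (int b = 0; b < num_blocks; b++) {
        block_sum[b] = 0.0f;
        int end = ((b + 1) * BLOCK_SIZE < n) ? (b + 1) * BLOCK_SIZE : n;
        for (int i = b * BLOCK_SIZE; i < end; i++) {
            prob[i] = choice_info[current_city * n + i] * visited[a * n + i];
            block_sum[b] += prob[i];
        }
    }

    /* Stage 2: roulette wheel over blocks */
    float total = 0.0f;
    for (int b = 0; b < num_blocks; b++) total += block_sum[b];
    float r = rand_uniform() * total;
    int blk = 0;
    float acc = block_sum[0];
    while (acc < r && blk < num_blocks - 1) { blk++; acc += block_sum[blk]; }

    /* Stage 3: roulette wheel over the cities of the chosen block */
    int start = blk * BLOCK_SIZE;
    int end = ((blk + 1) * BLOCK_SIZE < n) ? (blk + 1) * BLOCK_SIZE : n;
    r = rand_uniform() * block_sum[blk];
    int city = start;
    acc = prob[start];
    while (acc < r && city < end - 1) { city++; acc += prob[city]; }
    return city;
}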
Concerning the solution quality, Dawson and Stewart [4] show that DS-Roulette
matches the quality of the solutions generated using Roulette Wheel Selection, and argue
that, in I-Roulette, the multiplication by different random numbers could decrease the
influence of the heuristic and the pheromone information, leading to a reduction in the
quality of the tours. However, Lloyd and Amos [5] have recently shown that I-Roulette
does not affect the quality of the solutions obtained in comparison with Roulette Wheel,
but significantly accelerates convergence to a solution.
Regarding the random number generator, we have tested three options: the C library function rand(), Intel MKL's vsRngUniform, and the generator used by Stützle [7]. The same solution quality is obtained with all of them.

2.3. Intel High Performance Architecture

Intel Xeon Phi is based on the Many Integrated Core (MIC) architecture [6]. The first generation of this many-core architecture, known as Knights Corner (KNC), has up to 61 cores and four hardware threads per core. Each core is provided with a vector processing unit (VPU) that operates on 512-bit wide registers, executes instructions in order, and runs at a low clock speed (less than 1.3 GHz). Thus, although its cores are simple, Intel many-core architectures offer a better performance/power ratio than Intel multi-core processors, which comprise a smaller number of more complex cores. It is important to point out that, in order to exploit all the hardware capabilities of this architecture, it is paramount to make use of both thread- and vector-level parallelism.

3. Porting ACO to Intel Architectures

3.1. Thread Parallelism

The tour construction stage is inherently parallel, as each ant can construct its solution independently. On Intel architectures, thread-level parallelization is straightforward: we use #pragma omp parallel for to map ants to threads. The computation of the numerator of Eq. (1), performed before the ants construct their solutions, is also parallelized using the same pragma. However, vectorization is also necessary to obtain high performance.
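A minimal sketch of this mapping is shown below, assuming a construct_tour() routine (an illustrative placeholder) that builds the complete tour of one ant.

#include <omp.h>

extern void construct_tour(int ant);   /* builds the tour of one ant (assumed) */

/* Tour construction stage: one loop iteration per ant, mapped to threads. */
void tour_construction(int n_ants)
{
    #pragma omp parallel for
    for (int a = 0; a < n_ants; a++) {
        construct_tour(a);   /* each ant builds its solution independently */
    }
}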

3.2. Data Parallelism

In our vectorized implementations, we rely on the ability of the Intel C++ compiler to automatically vectorize some loops, but we have to facilitate its task by means of hints and changes to the code. Specifically, we ensure that all the data structures are aligned to 64 bytes and use pragmas to assist vectorization, such as #pragma ivdep to tell the compiler to ignore assumed vector dependences.
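A small sketch of these two kinds of hints, 64-byte alignment and #pragma ivdep, is given below; the row names are illustrative and the loop body is only an example of the kind of per-city computation involved.

#include <immintrin.h>   /* _mm_malloc / _mm_free */

/* Sketch: allocate a probability buffer aligned to 64 bytes and help the
 * compiler vectorize the per-city loop with an ivdep hint. */
void compute_probs(int n, const float *choice_row, const float *visited_row)
{
    float *prob = (float *) _mm_malloc(n * sizeof(float), 64);

    #pragma ivdep   /* ignore assumed dependences so the loop can be vectorized */
    for (int i = 0; i < n; i++)
        prob[i] = choice_row[i] * visited_row[i];

    /* ... use prob ... */
    _mm_free(prob);
}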

Within the tour construction stage, 99.66% of the time is spent on choosing the next city, that is, on the selection function. Thus, hereafter we focus on how to implement the different selection functions so that they vectorize on Intel architectures. The following are some details of our implementation of the three selection algorithms previously described:
• Roulette Wheel (Algorithm 2) comprises two loops: the first one (lines 3-10) computes the selection probability of each city, and the second one (lines 15-18) simulates the roulette spinning. Looking at the vectorization report from the Intel compiler, we notice that neither of these loops is vectorized. We discuss how to address this issue in the next section.
• I-Roulette v1: In our first implementation of I-Roulette, the main loop is vectorized and computes the weights as a multiplication of three elements: the probability of choosing a city, the value representing whether the city has been visited (zero) or not (one), and a random number in [0, 1]. Then, a serial reduction is performed to find the city with the largest weight.
• DS-Roulette: Our implementation of DS-Roulette has three stages. In the first one, selection probabilities are computed in a SIMD fashion for cities and blocks. The block size (block_size) has been set to 16, as we obtained the best results with this value. The number of blocks is ⌈n / block_size⌉, so the last block contains the remaining cities in case n is not a multiple of block_size. After stage one, two roulette wheels are run sequentially: the first for choosing a block and the second for choosing a city within that block.
Regarding the generation of random numbers, it does not affect the execution time of Roulette Wheel or DS-Roulette, but it does affect I-Roulette, which generates n times more random numbers. We have selected Stützle's generator because of its speed for I-Roulette; for the other methods, we use the same generator so that the comparison is fair. As ants are mapped to threads, the seed for generating random numbers needs to be replicated into a vector of seeds, so that each thread has its own. In addition, as I-Roulette also generates random numbers in a vectorized way, a matrix of seeds is necessary, with a row for each thread and as many columns as the architecture's vector length.
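One possible way to set up such a seed matrix is sketched below; the vector length of 16 and the seed perturbation scheme are purely illustrative assumptions.

#include <stdlib.h>

#define VECTOR_LENGTH 16   /* e.g. 16 floats per 512-bit vector on Xeon Phi */

/* Sketch: replicate an initial seed so that every thread, and every vector
 * lane within a thread, owns an independent random number stream. */
unsigned int *make_seed_matrix(unsigned int seed, int num_threads)
{
    unsigned int *seeds = malloc((size_t) num_threads * VECTOR_LENGTH * sizeof *seeds);
    for (int t = 0; t < num_threads; t++)
        for (int l = 0; l < VECTOR_LENGTH; l++)
            seeds[t * VECTOR_LENGTH + l] = seed + 12345u * (unsigned) (t * VECTOR_LENGTH + l);
    return seeds;
}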

4. Improving Vectorization

4.1. V-Roulette

The main problem when trying to vectorize the first loop of Roulette Wheel Selection is the if-else statement (lines 4-9 in Algorithm 2). As shown in Algorithm 3, this statement is replaced with a multiplication. The values stored in the visited matrix are set as follows: if an ant k has visited city i, visited[k][i] equals 0; otherwise, visited[k][i] equals 1. This way, when multiplying the probability of visiting a city (in choice_info) by its corresponding position of the visited matrix, a 0 is obtained if the city has already been visited, or the value stored in the choice_info matrix if it has not. Moreover, the accumulation of probabilities (in prob_sum) can be carried out as a vectorized reduction operation that is generated automatically by the Intel compiler. Thus, we preserve the meaning of the original Roulette Wheel algorithm.
Regarding the second loop of Roulette Wheel, each iteration depends on the previous one and the number of iterations is not known before executing the loop, so it cannot be vectorized. Therefore, our proposal is a partially vectorized Roulette Wheel algorithm, which we have called V-Roulette Wheel (Vectorized Roulette Wheel).

Algorithm 3 V-Roulette Wheel
Input: Ant identifier (a), current city (current_city).
Output: Selected city.
1: {Selection Probabilities Computation}
2: prob_sum ← 0
3: for i = 1 to n do
4:   prob[i] ← choice_info[current_city][i] ∗ visited[a][i]
5:   prob_sum ← prob_sum + prob[i]
6: end for
7: {City Selection (lines 11-19 in Algorithm 2)}
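A C sketch of the vectorizable first loop of V-Roulette follows. The paper relies on the Intel compiler's auto-vectorization of this loop; here the reduction is made explicit with an OpenMP simd pragma only for clarity, and the row-based interface is an illustrative assumption.

/* V-Roulette, selection probabilities computation (sketch).
 * visited_row stores 1.0f for unvisited cities and 0.0f for visited ones, so
 * the if-else of Algorithm 2 becomes a multiplication and the loop vectorizes. */
float compute_probabilities(int n, const float *choice_row,
                            const float *visited_row, float *prob)
{
    float prob_sum = 0.0f;
    #pragma omp simd reduction(+:prob_sum)
    for (int i = 0; i < n; i++) {
        prob[i] = choice_row[i] * visited_row[i];
        prob_sum += prob[i];
    }
    /* The city selection loop (lines 15-18 of Algorithm 2) remains scalar. */
    return prob_sum;
}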

4.2. I-Roulette v2

In this subsection, we present an enhanced vectorized implementation of I-Roulette (see Algorithm 4). Our improvement consists of a vectorized reduction to find the maximum weight. This is performed using explicit vectors, W_max and I_max, for storing the maximum weights and the corresponding city indices, respectively. The inner loop (lines 6-13) is vectorized. In the pseudocode, VECTOR_LENGTH is a parameter of the program that needs to be tuned for each architecture; we obtain the best results setting it according to the vector width of the architecture. For example, Xeon Phi (with 512-bit wide vectors) can operate on 16 floats at a time, so we have set VECTOR_LENGTH to 16 for this architecture. num_vectors is set to ⌊n / VECTOR_LENGTH⌋ and represents the number of complete groups (of size VECTOR_LENGTH) into which the number of cities (n) can be divided. Thus, the last n mod VECTOR_LENGTH cities do not belong to any group, and a final loop is needed to compute the weights associated with these remaining cities and check whether any of them is greater than the current maximum.
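The following C sketch outlines the same idea: VECTOR_LENGTH running maxima are updated in a vectorizable inner loop and reduced serially at the end, with the remainder cities handled in a final scalar loop. The per-lane rand01() generator and the simd pragma are illustrative assumptions, not the authors' code.

#define VECTOR_LENGTH 16   /* tuned per architecture, e.g. 16 floats on KNC */

/* I-Roulette v2 (sketch): keep VECTOR_LENGTH running maxima so the inner loop
 * maps onto one vector register. */
int i_roulette_v2(int a, int current_city, int n,
                  const float *choice_info, const float *visited,
                  float (*rand01)(int lane))
{
    float w_max[VECTOR_LENGTH];
    int   i_max[VECTOR_LENGTH];
    for (int k = 0; k < VECTOR_LENGTH; k++) { w_max[k] = -1.0f; i_max[k] = -1; }

    int num_vectors = n / VECTOR_LENGTH;
    for (int v = 0; v < num_vectors; v++) {
        int base = v * VECTOR_LENGTH;
        #pragma omp simd
        for (int k = 0; k < VECTOR_LENGTH; k++) {
            int j = base + k;
            float w = choice_info[current_city * n + j] * visited[a * n + j] * rand01(k);
            if (w > w_max[k]) { w_max[k] = w; i_max[k] = j; }
        }
    }

    /* Serial reduction of the VECTOR_LENGTH partial maxima */
    int best = 0;
    for (int k = 1; k < VECTOR_LENGTH; k++)
        if (w_max[k] > w_max[best]) best = k;
    int city = i_max[best];
    float max_w = w_max[best];

    /* Remainder cities (n mod VECTOR_LENGTH), handled serially */
    for (int j = num_vectors * VECTOR_LENGTH; j < n; j++) {
        float w = choice_info[current_city * n + j] * visited[a * n + j] * rand01(0);
        if (w > max_w) { max_w = w; city = j; }
    }
    return city;
}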

5. Evaluation

5.1. Test Bed

For generating our base code, we have modified Stützle’s implementation [7] of the main
ACO algorithms so that it follows the original proposal of the metaheuristic [1]. The
evaluation platform is equipped with two Intel Xeon (E5-2650 v2, 16 cores; and E5-2698
v4, 40 cores) processors and an Intel Xeon Phi 7120P coprocessor (61 cores). The system
runs Linux CentOS 6.7 with kernel 2.6.32, and Intel MPSS 3.7.2. Codes are built using
Intel’s icpc compiler (version 17.0.2) with the optimization level -O3. Evaluations have
been carried out using the maximum number of threads per core available for each archi-
tecture (2 on Intel Xeon v2 and v4, and 4 on Xeon Phi). Our different implementations
are tested using a set of instances from the TSPLIB benchmark library [8]. We set ACO
parameters as recommended in [1]: m = n (where m is the number of ants and n is the
number of cities), α = 1 and β = 5. Performance figures are given for single-precision arithmetic and correspond to a single iteration, averaged over 10 independent runs of 100 iterations.

Algorithm 4 I-Roulette v2
Input: Ant identifier (a), current city (current_city).
Output: Selected city.
1: I_max ← (−1 . . . −1)
2: W_max ← (−1 . . . −1)
3: {Weights Computation and Vectorized Reduction}
4: for i = 1 to num_vectors do
5:   j ← (i − 1) ∗ VECTOR_LENGTH + 1
6:   for k = 1 to VECTOR_LENGTH do
7:     weight ← choice_info[current_city][j] ∗ visited[a][j] ∗ random(0..1)
8:     if weight > W_max[k] then
9:       I_max[k] ← j
10:      W_max[k] ← weight
11:    end if
12:    j ← j + 1
13:  end for
14: end for
15: {Compute maximum of vector (Serial)}
16: i ← argmax(W_max)
17: city ← I_max[i]
18: max_weight ← W_max[i]
19: {Compute remainder weights and check if greater than current maximum (Serial)}
20: return city

5.2. Scalability

Our first experiment deals with the parallel scalability of the tour construction stage.
As this stage does not have any synchronization problems and is compute bound, our
implementation for the Intel architectures should scale with the number of cores used
by the processor. We tested our parallel implementation of the tour construction stage
using the default selection function (Roulette Wheel).
In this experiment, speedup is presented relative to the code running on a single core with 2 threads, for both Xeon v2 and v4. For higher core counts, 2 threads per core are always used, and the thread affinity is set to compact so that all the cores in use are filled. Figure 1 shows the thread scalability on Xeon v2 and Xeon v4.
As we can see, the tour construction stage scales well, achieving better scalability with larger instances, which require more computation. However, we expect our results to get closer to the theoretical limit once problems with unbalanced thread load are addressed by adjusting the scheduling policy. Similar results are obtained for this stage when varying the selection function.

5.3. Vectorization Strategies

Secondly, we assess the vectorization strategies presented in Sections 3 and 4. Figures 2a and 2b show the speedup obtained for each selection function against the parallel Roulette Wheel version on Xeon v4 and Xeon Phi, respectively.

Figure 1. Thread scalability for tour construction: (a) Xeon v2; (b) Xeon v4.

From these results, we can point out the following: a) the vectorization process on Xeon Phi is really effective, obtaining speedup factors of up to 7.51x, with the speedup increasing with the problem size; b) the best results on Xeon Phi are obtained for I-Roulette v2 and DS-Roulette, and on Xeon v4 for DS-Roulette; c) just by partially vectorizing Roulette Wheel (V-Roulette), a speedup factor of 2x is obtained; d) I-Roulette v1 gives worse results than V-Roulette, as it performs more computations: an extra multiplication within the vectorized loop and a final serial reduction of n iterations, whereas V-Roulette executes at most n iterations in its second loop; e) it is necessary to completely vectorize I-Roulette (as done in I-Roulette v2) to obtain high performance with this strategy on Xeon Phi; f) the overall improvements of the different strategies are considerably higher on Xeon Phi, as the width of its VPU is twice that of Xeon v4. For this reason, the generation of random numbers in I-Roulette v2 does not affect performance on Xeon Phi as much as on Xeon v4, as the former offers more vector-level parallelism.

Figure 2. Speedup for the tour construction stage with different selection functions, compared to the parallel version on each architecture: (a) Xeon v4; (b) Xeon Phi.

5.4. Speedup on Xeon Phi vs. Sequential Version on CPU

Table 1 presents the execution times for instances of different sizes, and Figure 3 shows the speedup obtained for each selection function on Xeon Phi against the sequential version running on the Xeon v2 CPU. Just parallelizing the code using threads (Roulette Wheel version) achieves a speedup factor of between 7.86x and 11.95x over the single-threaded code on the CPU. Vectorization is essential for achieving higher performance,
as can be seen from the results of the three vectorized implementations. The partially vectorized version of Roulette Wheel (V-Roulette Wheel) is between 16.91 and 27.23 times faster than the serial code on the CPU. Finally, the best results are obtained for I-Roulette v2 and DS-Roulette, with speedups varying between 37.05x and 78.98x.

Table 1. Execution time (ms) for tour construction on Xeon v2 (sequential version) and Xeon Phi.
Instance      CPU       RW      V-RW    I-R v2    DS-R
lin318       77.8      9.9       4.6       2.1     2.1
rat783     1391.3    116.4      51.1      17.8    19.0
pr1002     2056.3    225.0     100.9      31.6    35.1
pr2392    26410.1   2510.0    1114.7     334.4   339.3

Figure 3. Speedup for the tour construction stage with different selection functions on Xeon Phi (compared to the sequential version on Xeon v2).

6. Related Work

Regarding ACO on GPUs, Cecilia et al. [3] noted that high performance cannot be obtained by adopting a task-parallel approach, which maps each ant to a CUDA thread. Instead, they suggested a data-parallel approach to make better use of the GPU architecture, and introduced a new selection function, I-Roulette (Independent Roulette), reporting speedups of more than 20x over the sequential implementation on the CPU. They also implemented a vectorized version of Roulette Wheel (in a similar way to what we have done for V-Roulette on Intel architectures), but in their case I-Roulette reached up to a 2.36x gain over this latter method on the GPU. Around the same time, Dawson and Stewart [4] proposed DS-Roulette (Double-Spin Roulette) as an alternative that preserves the proportionality of Roulette Wheel while minimizing its inherently sequential part. Speedups of up to 82x over the sequential counterpart were reported.
A new line of research is porting ACO to Intel many-core architectures. Lloyd and Amos [9] claimed that previous attempts to implement ACO on Xeon Phi obtained poor performance because they made no use of vectorization. As an alternative, they proposed two vectorized selection functions. The first one, vRoulette-1, is a vectorized version of I-Roulette [3], for which they report speedups between 5.6x and 16.6x compared to the sequential CPU version. Their second proposal, vRoulette-2, is a variant of DS-Roulette [4] in which several roulette wheel selections take place as repeated binary trials, where the winner of a trial accumulates the weight assigned to the loser. The reported speedup for vRoulette-2 varies between 5.3x and 13x. However, these two proposals do not scale well for large instances (over 1000 cities).

7. Conclusions

In this paper we have presented, improved, and evaluated several implementations of different selection functions for the tour construction stage of the ACO metaheuristic on Intel multi-core and many-core architectures. Specifically, we have shown the limitations encountered when vectorizing the default selection function (Roulette Wheel), providing a partially vectorized implementation (V-Roulette) of this method. Additionally, we have implemented on Intel architectures the two main selection functions successfully proposed for GPUs: I-Roulette and DS-Roulette. Finally, we have presented an enhanced, completely vectorized implementation of I-Roulette. Our experimental results confirm that I-Roulette and DS-Roulette are the best strategies to use when parallelizing ACO, as our vectorized implementations of these selection functions on Xeon Phi achieve the highest speedup factor: up to 78.98x over the sequential counterpart on Xeon v2.

Acknowledgments

This work is supported by the Spanish MINECO, as well as European Commission FEDER funds, under grant TIN2015-66972-C5-3-R. Victor Montesinos has been supported by the Spanish MECD under a collaboration scholarship. We thank José A. Bernabé for maintaining the cluster on which we have obtained the results.

References

[1] M. Dorigo and T. Stützle. Ant Colony Optimization. A Bradford Book, The MIT Press, USA, 2004.
[2] E. Lawler, J. Lenstra, A. Kan and D. Shmoys. The Traveling Salesman Problem. Wiley, New York, 1987.
[3] J. M. Cecilia, J. M. García, A. Nisbet, M. Amos, and M. Ujaldón. Enhancing data parallelism for Ant Colony Optimization on GPUs. J. Parallel Distrib. Comput., vol. 73, no. 1, pp. 42–51, 2013.
[4] L. Dawson and I. Stewart. Improving Ant Colony Optimization performance on the GPU using CUDA.
IEEE Conference on Evolutionary Computation, vol. 1, Cancun, Mexico, pp. 1901–1908, 2013.
[5] H. Lloyd and M. Amos. Analysis of Independent Roulette Selection in Parallel Ant Colony Optimization.
Proceedings of the Genetic and Evolutionary Computation Conference, Berlin, Germany, 2017.
[6] A. Duran and M. Klemm. The Intel Many Integrated Core Architecture. High Performance Computing
and Simulation (HPCS), 2012 International Conference, pp. 365–366, 2012.
[7] T. Stützle. ACOTSP v1.03. Last accessed 2017-07-20. [Online]. URL: iridia.ulb.ac.be/~mdorigo/ACO/downloads/ACOTSP-1.03.tgz
[8] G. Reinelt. TSPLIB - A traveling salesman problem library. ORSA Journal on Computing, 3, 376-384,
1991.
[9] H. Lloyd and M. Amos. A Highly Parallelized and Vectorized Implementation of Max-Min Ant System
on Intel Xeon Phi. IEEE Symposium Series on Computational Intelligence, Athens, 2016.
