Report SPM
Abstract
This project focuses on the parallelization of a genetic algorithm to solve the Travelling Salesman Problem (TSP). A sequential version is developed, followed by an analysis to identify the components suitable for parallelization. Two parallel implementations are then created, one using standard threads and one using FastFlow. The performance gains from parallelization are evaluated, and the sequential and parallel versions are compared using metrics such as execution time and solution quality. The results provide insights into the benefits and tradeoffs of standard threads and FastFlow, showing how parallel optimization algorithms can be used to efficiently solve complex combinatorial problems.
Contents
1 Introduction
2 Problem Setting
2.1 Algorithm cost analysis
5 Conclusion
B Run configurations
1 Introduction
The Travelling Salesman Problem (TSP) is a complex combinatorial problem that involves finding the shortest possible route for a salesman to visit a set of cities and return to the starting point. Genetic algorithms, which simulate the process of natural selection to find optimal solutions, have proven to be effective in solving the TSP.
2 Problem Setting
This chapter formally defines the Travelling Salesman Problem, then provides an overview of how genetic algorithms can be exploited to solve it. Finally, by analyzing some high-level implementation details, it identifies which parts of the algorithm are best suited for parallelization.
Once the problem is defined, the preliminary analysis is concerned with ensuring that a solution is reached through the iterative process. Figure 1 shows an experiment performed to empirically demonstrate the convergence properties of the described (sequential) algorithm and its parallel implementations.
Figure (1) The results of 3,000 generations on a starting problem with 500 vertices (cities), edge weights in [1, 10], 2,000 initial chromosomes, a mutation rate of 0.4, and a crossover rate of 0.8. Each graph shows the solution quality on the y-axis and the number of iterations (generations) on the x-axis.
Considering that the edge weights lie in [1, 10] and that there are 500 vertices, the minimum cost potentially obtainable is 500 (reached only if the random generation happens to place an edge of weight 1 between every consecutive pair of nodes in the tour). The algorithm reaches a final solution of about 600 in each configuration. Although the true optimum of the randomly generated instance is not known, the results can be considered sufficiently close to this theoretical lower bound.
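Written out, the bound quoted above is simply the number of edges in a tour (equal to the number of vertices, here denoted $|V|$) multiplied by the minimum edge weight $w_{min}$:
\[ C_{lower} = |V| \cdot w_{min} = 500 \cdot 1 = 500. \]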
Figure (2) Each pie chart shows the percentage of time spent in each stage of the sequential algorithm during a single generation, calculated from the average time of three experiments per configuration, on starting problems of different sizes. The number of vertices is set to 5,000, 10,000 and 15,000 respectively; the starting population is 15,000 chromosomes and the crossover and mutation probabilities are set to 1 (100% probability of these operations occurring).
The experiments in Figure 2 show that, as the problem size increases, the operation that dominates the computation is the crossover, which exchanges genetic material between two parent individuals to create offspring. Specifically, the implemented PMX crossover [2] randomly chooses a starting crossover point in the parents. Starting from that point, for each position the element that the other parent holds at that position is located in the current chromosome and swapped into place (each swap takes place on the chromosome itself), and this is done symmetrically for both parents. Moving on to the next position, the process is repeated for roughly twenty percent of the string length, for both chromosomes. Since every position requires a linear search inside the chromosome, performing this operation on larger chromosomes becomes an increasingly expensive task. Note that in the experiment settings the probability of crossover occurring is set to 1 to force the algorithm to always perform it; in general this operation may be skipped for some chromosomes, thus taking a smaller percentage of the total execution time.
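To make the cost structure concrete, the following C++ sketch reproduces the segment exchange described above, adapted from the pseudocode of Algorithm 6 in Appendix A; the function name and the use of plain integer vectors for the chromosome paths are illustrative assumptions rather than the project's exact code.

#include <algorithm>
#include <random>
#include <vector>

// Swap a ~20% segment of genes between two parent chromosomes, in place.
// Each position requires a linear search (std::find) inside the chromosome,
// which is what makes this operation dominate on large instances.
void pmxSegmentSwap(std::vector<int>& vec1, std::vector<int>& vec2, std::mt19937& rng) {
    const std::size_t elementSize = vec1.size();
    const std::size_t segmentLen  = static_cast<std::size_t>(elementSize * 0.2);
    if (segmentLen == 0 || vec1.size() != vec2.size()) return;

    std::uniform_int_distribution<std::size_t> startDist(0, elementSize - segmentLen - 1);
    const std::size_t startIndex = startDist(rng);
    const std::size_t endIndex   = startIndex + segmentLen;

    for (std::size_t j = startIndex; j < endIndex; ++j) {
        // Locate, in each parent, the gene held by the other parent at position j.
        const auto pos1 = std::find(vec1.begin(), vec1.end(), vec2[j]) - vec1.begin();
        const auto pos2 = std::find(vec2.begin(), vec2.end(), vec1[j]) - vec2.begin();
        // Swap it into position j; both chromosomes remain valid permutations.
        std::swap(vec1[pos1], vec1[j]);
        std::swap(vec2[pos2], vec2[j]);
    }
}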
Initial generation and evaluation are also costly operations, depending heavily on the number of nodes in the graph and on the size of the population to be evaluated. The remaining operations, on the other hand, turn out to be fast even in the larger settings: parallelizing them would bring little performance improvement while adding overhead.
Based on these considerations, only the initial generation, evaluation and crossover operations are parallelized.
In the standard-threads implementation, parallelism is achieved by dividing the work into smaller portions assigned to workers.
Specifically, to explain how it works, consider the initial generation function generatePopulation, which takes three parameters: chromosomeNumber, numVertices, and workers, specifying the desired number of chromosomes, the number of vertices, and the number of parallel workers to be used.
A lambda function named generateChromosomes is defined inside generatePopulation
to generate a portion of the chromosomes. It takes a start and end index and produces
pairs consisting of fitness values and shuffled sequences of vertices.
The number of chromosomes per worker (chromosomesPerWorker) is determined by
dividing the total chromosome number by the number of workers, with any remaining
chromosomes assigned as extraChromosomes to be distributed.
A vector of std::future objects named futures is created to store the asynchronous
results of each worker’s computation.
A loop is used to create and launch worker tasks using std::async with the std::launch::async policy, ensuring asynchronous execution. Within the loop, start and end indices are calculated for each worker, with the last worker handling any extraChromosomes.
Each worker’s task is initiated using std::async and the generateChromosomes
lambda function, with the calculated start and end indices. After launching the tasks,
another loop collects the partial results from each std::future using future.get(),
appending them to the population vector.
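A minimal, self-contained sketch of this fork/join structure is shown below; the Chromosome representation (a fitness/sequence pair, as described above) and the exact signatures are assumptions made for illustration rather than the project's exact code.

#include <algorithm>
#include <future>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using Chromosome = std::pair<double, std::vector<int>>;   // fitness value, vertex sequence

std::vector<Chromosome> generatePopulation(int chromosomeNumber, int numVertices, int workers) {
    // Each asynchronous task builds its own chunk of shuffled vertex sequences.
    auto generateChromosomes = [numVertices](int start, int end) {
        std::vector<Chromosome> partial;
        std::vector<int> sequence(numVertices);
        std::iota(sequence.begin(), sequence.end(), 0);
        std::mt19937 rng(std::random_device{}());
        for (int i = start; i < end; ++i) {
            std::shuffle(sequence.begin(), sequence.end(), rng);
            partial.emplace_back(0.0, sequence);            // fitness is computed later
        }
        return partial;
    };

    const int chromosomesPerWorker = chromosomeNumber / workers;
    const int extraChromosomes     = chromosomeNumber % workers;

    std::vector<std::future<std::vector<Chromosome>>> futures;
    for (int w = 0; w < workers; ++w) {
        const int start = w * chromosomesPerWorker;
        const int end   = start + chromosomesPerWorker
                        + (w == workers - 1 ? extraChromosomes : 0); // last worker takes the remainder
        futures.push_back(std::async(std::launch::async, generateChromosomes, start, end));
    }

    std::vector<Chromosome> population;
    for (auto& future : futures) {                           // join phase: collect partial results
        auto part = future.get();
        population.insert(population.end(), part.begin(), part.end());
    }
    return population;
}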
The functions for evaluation and crossover are coded in the same way. Since chromosomes can be processed in independent blocks (one per thread), shared structures can be accessed without explicit lock mechanisms: each worker operates only on its own index range and never touches the ranges assigned to other workers. In addition, this fork/join model avoids exchanging information between workers and with a master, which could become a heavy operation as the problem size increases.
Moving to the details of the FastFlow implementation, let us again consider the initial generation function generatePopulation. In this case, several components are defined: the ChunksEmitter, the ChunksCollector, the worker node (ChromosomeGenerationWorker), and the setupComputation function for resource allocation.
The setupComputation method calculates the number of workers based on the available hardware concurrency and the desired number of chromosomes. Note that this method is called every time a parallel operation is started because, due to stochastic selection, the algorithm may generate an intermediate population of varying size, which is resized only after the evaluation process is performed.
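A possible shape of this setup step is sketched below; the contiguous [start, end) chunk layout follows the description above, while the exact capping policy on the worker count is an assumption.

#include <algorithm>
#include <thread>
#include <utility>
#include <vector>

// Decide how many workers to use and split the current population into
// one contiguous index range per worker; the last chunk absorbs the remainder.
std::vector<std::pair<int, int>> setupComputation(int chromosomeNumber, int requestedWorkers) {
    const int hw      = static_cast<int>(std::thread::hardware_concurrency());
    const int workers = std::max(1, std::min({requestedWorkers, hw, chromosomeNumber}));

    std::vector<std::pair<int, int>> chunks;
    const int perWorker = chromosomeNumber / workers;
    for (int w = 0; w < workers; ++w) {
        const int start = w * perWorker;
        const int end   = (w == workers - 1) ? chromosomeNumber : start + perWorker;
        chunks.emplace_back(start, end);
    }
    return chunks;
}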
ChunksEmitter is a FastFlow node (ff_node_t) responsible for emitting chunks of chromosome generation tasks to the worker nodes. It inherits from ff_node_t<std::pair<int, int>>, indicating that it processes and emits pairs of integers, and it holds a reference to a vector of std::pair<int, int> called chunks and an integer generationsNumber. The svc method is the main execution method: it decrements generationsNumber and sends out each chunk in the chunks vector using the ff_send_out method. When generationsNumber reaches zero, it returns the end-of-stream (EOS) signal.
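A compact sketch of such an emitter is shown below (assuming the FastFlow headers from [1] are available); the member names follow the description above, while the exact termination check is illustrative.

#include <ff/ff.hpp>
#include <utility>
#include <vector>
using namespace ff;

struct ChunksEmitter : ff_node_t<std::pair<int, int>> {
    ChunksEmitter(std::vector<std::pair<int, int>>& chunks, int generationsNumber)
        : chunks(chunks), generationsNumber(generationsNumber) {}

    std::pair<int, int>* svc(std::pair<int, int>*) override {
        if (generationsNumber-- <= 0) return EOS;   // no generations left: close the stream
        for (auto& chunk : chunks)
            ff_send_out(&chunk);                    // one task per worker chunk
        return GO_ON;
    }

    std::vector<std::pair<int, int>>& chunks;
    int generationsNumber;
};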
Correspondingly, the ChunksCollector is a FastFlow node with the same properties as the emitter and is responsible for collecting the generated chromosomes from the worker nodes. Its svc method processes each chunk received: once all chunks have been processed, it sends the chunk out to the next stage (if any), otherwise it deletes the chunk and uses the GO_ON signal to continue processing.
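The collector could look roughly as follows; the chunk counting and ownership details are simplified here and should be read as assumptions rather than the project's exact policy.

#include <ff/ff.hpp>
#include <cstddef>
#include <utility>
using namespace ff;

struct ChunksCollector : ff_node_t<std::pair<int, int>> {
    explicit ChunksCollector(std::size_t chunksNumber) : chunksNumber(chunksNumber) {}

    std::pair<int, int>* svc(std::pair<int, int>* chunk) override {
        if (++received == chunksNumber) {           // all chunks of this round processed
            received = 0;
            return chunk;                           // forward to the next stage, if any
        }
        return GO_ON;                               // keep collecting
    }

    std::size_t chunksNumber;
    std::size_t received = 0;
};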
Each chunk is processed by a ChromosomeGenerationWorker that performs the actual generation of chromosomes. It also inherits from ff_node_t<std::pair<int, int>> and has references to a graph and to the chromosome population vectors. The svc method receives a chunk (a pair of integers) indicating the range of chromosomes to generate, then generates random chromosomes and stores them in a local partialPopulation vector to avoid contended access to the shared structure. After generating the chromosomes, it transfers the values from partialPopulation to the main population vector. Once these operations are completed, it sends the processed chunk to the next stage using ff_send_out.
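A sketch of the worker node is shown below. The Chromosome type, the use of numVertices instead of a full graph reference, and the assumption that the shared population vector is pre-sized (so each worker writes only to its own index range) are illustrative simplifications.

#include <ff/ff.hpp>
#include <algorithm>
#include <numeric>
#include <random>
#include <utility>
#include <vector>
using namespace ff;

using Chromosome = std::pair<double, std::vector<int>>;   // fitness value, vertex sequence

struct ChromosomeGenerationWorker : ff_node_t<std::pair<int, int>> {
    ChromosomeGenerationWorker(int numVertices, std::vector<Chromosome>& population)
        : numVertices(numVertices), population(population) {}

    std::pair<int, int>* svc(std::pair<int, int>* chunk) override {
        // Generate the [start, end) range locally to avoid contention on the shared vector.
        std::vector<Chromosome> partialPopulation;
        std::vector<int> sequence(numVertices);
        std::iota(sequence.begin(), sequence.end(), 0);
        std::mt19937 rng(std::random_device{}());
        for (int i = chunk->first; i < chunk->second; ++i) {
            std::shuffle(sequence.begin(), sequence.end(), rng);
            partialPopulation.emplace_back(0.0, sequence);
        }
        // Disjoint index ranges per worker: this write needs no lock.
        std::copy(partialPopulation.begin(), partialPopulation.end(),
                  population.begin() + chunk->first);
        return chunk;                               // forward the processed chunk to the collector
    }

    int numVertices;
    std::vector<Chromosome>& population;
};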
Finally, the generatePopulation method sets up the FastFlow Farm and executes
the chromosome generation. It creates instances of ChunksEmitter, ChunksCollector,
and multiple ChromosomeGenerationWorker nodes and then wraps these nodes in an
ff_Farm object, which manages the execution and communication between the nodes.
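Putting the pieces together, the farm could be assembled roughly as follows, reusing the node classes and helpers sketched above; the wiring mirrors the usual FastFlow farm pattern rather than the project's exact code.

#include <ff/ff.hpp>
#include <ff/farm.hpp>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>
using namespace ff;

void generatePopulationFF(std::vector<Chromosome>& population,
                          int chromosomeNumber, int numVertices, int workersNumber) {
    population.resize(chromosomeNumber);            // workers write into disjoint ranges of this vector
    auto chunks = setupComputation(chromosomeNumber, workersNumber);

    std::vector<std::unique_ptr<ff_node>> workers;
    for (std::size_t w = 0; w < chunks.size(); ++w)
        workers.push_back(std::make_unique<ChromosomeGenerationWorker>(numVertices, population));

    ChunksEmitter   emitter(chunks, /*generationsNumber=*/1);
    ChunksCollector collector(chunks.size());

    ff_Farm<std::pair<int, int>> farm(std::move(workers));
    farm.add_emitter(emitter);
    farm.add_collector(collector);

    if (farm.run_and_wait_end() < 0)
        std::cerr << "error running the generation farm\n";
}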
The above logic is replicated in both the evaluation and crossover functions. The
use of farms offers several significant advantages, in addition to the parallelization of
processes, that contribute to the overall efficiency and effectiveness of the system.
A key advantage is scalability. As system requirements grow, farms can easily handle increased workloads by adding more worker nodes or by nesting the farm within other parallel structures. This scalability allows for continuous expansion of computational resources, enabling the system to efficiently handle larger and more complex tasks.
Then, by employing farms, the complexities associated with managing parallel execution and inter-node communication are abstracted away. This abstraction simplifies
the development process and enhances code readability. It is possible to focus on the
logic of each phase without having to deal with low-level details such as thread creation,
synchronization, or communication mechanisms.
Finally, error handling is another essential feature offered by farms. If an error occurs
during the execution of a worker node, the farm can handle the error and propagate it
appropriately. This built-in error handling capability ensures the integrity of the overall
computation and allows for graceful error recovery. By providing error resilience, farms
enhance the robustness of the system.
Speedup is a measurement that compares the execution time of the sequential program with the execution time of the parallel program. It quantifies how much faster the parallel program is in comparison to the sequential counterpart. The speedup is a function of the parallelism degree, denoted by n. The formula for calculating speedup is as follows:
\[ \text{speedup}(n) = \frac{T_{seq}}{T_{par}(n)} \tag{1} \]
In this formula, $T_{seq}$ represents the execution time of the sequential program, and $T_{par}(n)$ represents the execution time of the parallel program with a parallelism degree of n.
Scalability measures the efficiency of the parallel implementation when dealing with
increasing levels of parallelism. It quantifies the improvement in performance achieved
by increasing the parallelism degree from 1 to n. The formula for calculating scalability
is as follows:
\[ \text{scalability}(n) = \frac{T_{par}(1)}{T_{par}(n)} \tag{2} \]
Here, $T_{par}(1)$ denotes the execution time of the parallel program with a parallelism degree of 1, and $T_{par}(n)$ represents the execution time of the parallel program with a parallelism degree of n.
Both speedup and scalability are essential metrics for evaluating the performance of
parallel programs. Speedup assesses the overall improvement achieved through parallelization, while scalability focuses on the efficiency of parallelism with varying degrees.
These metrics help in understanding and modeling the behavior of parallel programs and
optimizing their performance.
Furthermore, for speedup an ideal reference time with $T_{par}(n) = T_{seq}/n$ is also computed; this implies that the parallel execution time is inversely proportional to the degree of parallelism n. This scenario represents perfect parallelism, in which an increase in the number of parallel resources or processors leads to a proportional decrease in execution time. This ideal curve is reported in the graphs as an upper bound on what can theoretically be achieved as the best parallel execution time.
4.2 Experiments
Considering the results of the experiments, it can be seen that increasing the number of
workers in parallelization problems can have both advantages and disadvantages.
Figure (3) The graphs depict the speedups of the FastFlow and standard threads implementations on different magnitudes of the starting problem. The number of vertices is set to 5,000, 10,000 and 15,000 respectively, with one generation, mutation and crossover probability set to 1 (100% probability of these operations being performed), and a starting population of 15,000 chromosomes. On the y-axis is the speedup factor, on the x-axis the number of workers (threads).
In Figure 3, the speedup shows that increasing the number of workers can lead to improved performance and faster execution times, especially in the case of TSP with genetic algorithms, where tasks can be executed independently without the need for frequent communication or synchronization. With more workers, the workload can be divided more finely, enabling better resource utilization and reduced computation time.
Figure (4) The graphs depict the scalability of the FastFlow and standard threads implementations on different magnitudes of the starting problem. The number of vertices is set to 5,000, 10,000 and 15,000 respectively, with one generation, mutation and crossover probability set to 1 (100% probability of these operations being performed), and a starting population of 15,000 chromosomes. On the y-axis is the scalability factor, on the x-axis the number of workers (threads).
Finally, the last experiment performed is the measurement of times in milliseconds,
on different problem sizes, with both sequential and parallel implementations.
As can be seen from Tables 1, 2 and 3, the goal stated in Section 2.1 can be considered reached. A clear improvement is achieved as the number of threads increases and the problem becomes larger, which implies lower execution times: for example, Table 3 shows a reduction by a factor of about 18 when going from the sequential version to the parallel version with 32 threads.
5 Conclusion
In conclusion, parallelization is highly advantageous for speeding up genetic algorithms
used to solve TSP problems, especially when dealing with large instances that are difficult
to solve sequentially. It offers improved computational efficiency, increased exploration
of the search space, faster convergence, and scalability.
On the implementation side, standard thread-based parallelization with std::async and futures already provides benefits, but using a specialized framework like FastFlow offers additional advantages: a high-level abstraction, efficient task scheduling, optimized communication, and extensibility of the parallelization structure.
A Genetic Algorithm Pseudocode
Algorithm 1 runGeneticAlgorithm
Require: generationNumber, chromosomeNumber, numVertices, crossoverRate, mutationRate
1: generatePopulation(chromosomeNumber, numVertices)
2: for generation = 1 to generationNumber do
3: evaluate(evaluationsAverage, chromosomeNumber)
4: fitness(evaluationsAverage)
5: selection(intermediatePopulation)
6: crossover(intermediatePopulation, crossoverRate)
7: mutation(intermediatePopulation, mutationRate)
8: evaluationsAverage = 0
9: population.swap(intermediatePopulation)
10: end for
Algorithm 2 generatePopulation
Require: chromosomeNumber, numVertices
1: sequence = createSequence(numVertices) {Create a vector with values from 0 to numVertices}
2: for i = 0 to chromosomeNumber do
3: shuffle(sequence) {Shuffle the sequence}
4: keyValue = 0
5: population.add(keyValue, sequence) {Add the keyValue and sequence to the population}
6: end for
Algorithm 3 evaluate
Require: evaluationsAverage, cutValue
1: chromosomeSize ← graph.getNumVertices()
2: for each individual in population do
3: chromosomeScoreCurrIt ← 0
4: for j = 0 to chromosomeSize − 2 do
5: chromosomeScoreCurrIt += graph.getWeight(individual.sequence[j],
individual.sequence[j + 1])
6: end for
7: chromosomeScoreCurrIt += graph.getWeight(individual.sequence[0],
individual.sequence[chromosomeSize − 1])
8: individual.keyValue ← chromosomeScoreCurrIt
9: end for
10: sort(population)
11: totalScore ← 0
12: numElements ← min(cutValue, size(population))
13: for i = 0 to numElements − 1 do
14: totalScore += population[i].keyValue
15: end for
16: evaluationsAverage ← totalScore / numElements
17: resize(population, numElements)
Algorithm 4 fitness
Require: evaluationsAverage
1: for i = 0 to size(population) − 1 do
2: individual ← population[i]
3: individual.keyValue /= evaluationsAverage
4: end for
Algorithm 5 selection
Require: intermediatePopulation
1: distribution ← createUniformDistribution(0.0, 1.0)
2: for each individual in population do
3: numCopies ← integerPart(individual.keyValue)
4: for i = 0 to numCopies − 1 do
5: intermediatePopulation.add({individual.keyValue, individual.sequence})
6: end for
7: fractionalPart ← individual.keyValue − numCopies
8: if getRandomValue(distribution) < fractionalPart then
9: intermediatePopulation.add({individual.keyValue, individual.sequence})
10: end if
11: end for
Algorithm 6 crossover
Require: intermediatePopulation, crossoverRate
1: distribution ← uniform distribution between 0.0 and 1.0
2: elementSize ← size of chromosome path in intermediatePopulation
3: twentyPercent ← integer value of (elementSize * 0.2)
4: if getRandomValue(distribution) < crossoverRate then
5: for i ← 0 to size of intermediatePopulation − 1 do
6: vec1 ← reference to the chromosome path of intermediatePopulation[i]
7: if i + 1 < size of intermediatePopulation then
8: vec2 ← reference to the chromosome path of intermediatePopulation[i+1]
9: startIndex ← random integer between 0 and (size of vec1 − twentyPercent)
10: endIndex ← startIndex + twentyPercent
11: for j ← startIndex to endIndex do
12: pos_1 ← find the position of vec2[j] in vec1
13: pos_2 ← find the position of vec1[j] in vec2
14: swap vec1[pos_1] with vec1[j]
15: swap vec2[pos_2] with vec2[j]
16: end for
17: end if
18: end for
19: end if
Algorithm 7 mutation
Require: intermediatePopulation, mutationRate
1: distribution ← createUniformDistribution(0.0, 1.0)
2: for each individual in intermediatePopulation do
3: if getRandomValue(distribution) < mutationRate then
4: size ← size(individual.sequence)
5: if size > 1 then
6: index1 ← getRandomIndex(0, size − 1)
7: index2 ← getRandomIndex(0, size − 1)
8: swap(individual.sequence[index1], individual.sequence[index2])
9: end if
10: end if
11: end for
B Run configurations
The code can be built (inside the folder genTSP, use the command: g++ -O3 -std=c++17 main.cpp -o genTSP) and executed by passing the correct parameters to the main function.
The following are the parameters, and their respective domains, to be passed as input to the program (main.cpp).
Parameter            Domain
numVertices          integer > 0
chromosomesNumber    integer > 0
generationNumber     integer > 0
crossoverRate        double ∈ [0, 1]
mutationRate         double ∈ [0, 1]
workersNumber        integer > 0
[seed]               integer (optional value)
Table (4) main.cpp parameters
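For example, assuming the parameters are passed positionally in the order of Table 4, a run with 500 vertices, 2,000 chromosomes, 3,000 generations, a crossover rate of 0.8, a mutation rate of 0.4 and 8 workers would look like: ./genTSP 500 2000 3000 0.8 0.4 8 (the positional order is inferred from the table; consult main.cpp for the exact argument parsing).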
References
[1] Marco Aldinucci et al. “FastFlow: High-Level and Efficient Streaming on Multi-
Core”. In: Proceedings of the 17th Euromicro International Conference on Parallel,
Distributed and Network-Based Processing (PDP). IEEE. 2009, pp. 173–180. doi:
10.1109/PDP.2009.26.
[2] Göktürk Üçoluk. “Genetic Algorithm Solution of the TSP Avoiding Special Crossover
and Mutation”. In: Intelligent Automation and Soft Computing 3 (Jan. 2002). doi:
10.1080/10798587.2000.10642829.
[3] Darrell Whitley. “A genetic algorithm tutorial”. In: Statistics and Computing 4.2 (June 1994), pp. 65–85. doi: 10.1007/BF00175354. url: https://doi.org/10.1007/BF00175354.