Analyzing convergence performance of evolutionary algorithms: A statistical approach
J. Derrac, S. García, S. Hui, P.N. Suganthan, F. Herrera
Information Sciences 289 (2014) 41–58
Article history: Received 21 November 2013; Received in revised form 21 April 2014; Accepted 1 June 2014; Available online 8 August 2014.
Keywords: Page's trend test; Nonparametric tests; Convergence-based algorithmic comparison; Evolutionary algorithms.

Abstract
The analysis of the performance of different approaches is a staple concern in the design of Computational Intelligence experiments. Any proper analysis of evolutionary optimization algorithms should incorporate a full set of benchmark problems and state-of-the-art comparison algorithms. For the sake of rigor, such an analysis may be completed with the use of statistical procedures, supporting the conclusions drawn.
In this paper, we point out that these conclusions are usually limited to the final results, whereas intermediate results are seldom considered. We propose a new methodology for comparing evolutionary algorithms' convergence capabilities, based on the use of Page's trend test. The methodology is presented with a case study, incorporating real results from selected techniques of a recent special issue. The possible applications of the method are highlighted, particularly in those cases in which the final results do not enable a clear evaluation of the differences among several evolutionary techniques.
© 2014 Published by Elsevier Inc.
1. Introduction
An analysis based on final results is the most popular way in which the performance of Computational Intelligence search
methods is assessed. For example, in the field of evolutionary optimization, algorithms are usually evaluated with respect to
the quality of the best result obtained, over a predefined set of benchmark functions. However, there are other traits of evo-
lutionary algorithms that are worthy of analysis, beyond the quality of the final solution reached: Efficiency, applicability to
different domains, diversity management and convergence [2].
Convergence is usually acknowledged to be a desirable capability for every new search algorithm designed today. In the
case of Evolutionary Algorithms (EAs), this is a staple concern in the sense that good convergence is a must-have for any new
technique to be accepted by the research community [4,7,30]. However, it is common to see convergence analyzed only as
the capability of the technique to reach the final result, regardless of how quickly such a result is reached.
In this sense, the development of a methodology to assess the convergence performance of several algorithms – that is,
which algorithm converges faster – is important, particularly in cases in which a benchmark problem is unable to differen-
tiate algorithms using the final results achieved.
The conclusions obtained after analyzing the final results of the algorithms are often backed up by using statistical tech-
niques. Nonparametric tests [8,22] are preferred for this task due to the absence of strong limitations regarding the kind of
data to analyze (in contrast with parametric tests, for which the assumptions of normality, independence and homoscedas-
ticity of the data are necessary for the sake of reliability) [18,15,31,41].
Throughout this paper, we show how Page’s trend statistical test [27] can be applied to the analysis of pairwise conver-
gence. It is a nonparametric test for multiple classification, which allows trends to be detected among the results of the treat-
ments if the null hypothesis of equality is rejected. In our case, if the treatments are chosen as the differences between the
fitness values of two algorithms, computed at several points of the run (cut-points), the test can be used to detect increasing
and decreasing trends in the differences as the search goes on. The study of these trends, representing the evolution of the
algorithms during the search, enables us to develop a new methodology for comparing algorithms’ convergence performance.
The description of our approach is completed with the inclusion of an alternative version for computing the ranks of
the test. This second version allows the test to be applied safely should one of the algorithms reach the optimum of some
of the benchmark functions before the end of the run (which would prevent it from progressing further, thereby preventing
the proper evaluation of its convergence in the last stages of the search).
To demonstrate the usefulness of both the basic and the alternative versions of the test, a full case study is presented. The
study compares the performance of several EAs for continuous optimization, namely advanced versions of the Differential Evo-
lution evolutionary technique [32,11]. It is based on the submissions accepted for the Special Issue on Scalability of Evolution-
ary Algorithms and other Metaheuristics for Large Scale Continuous Optimization Problems [21] in the Soft Computing journal.
As will be shown in the study, the use of Page’s trend test can be very useful when analyzing the performance of the algo-
rithms throughout the search. Its use provides the researchers with a new perspective for assessing how the algorithms
behave, considering intermediate results instead of just the final results in each function. This can reveal very illustrative
information when comparing the methods, particularly in cases where the final results are statistically similar.
A further contribution presented in this work is the development of a Java program to implement our approach. The pro-
gram processes the intermediate results of two or more algorithms. After that, Page’s trend test is carried out for every pair of
algorithms, and the results are output in TeX format. It can be downloaded at the following URL: http://sci2s.ugr.es/sicidm/
pageTest.zip.
The rest of this paper is organized as follows: Section 2 provides some background regarding the use of nonparametric tests
to contrast the results of evolutionary optimization experiments. Section 3 presents our approach, detailing how Page’s trend
test can be applied to compare the convergence performance of two algorithms. Section 4 describes the case study chosen to
illustrate the application of the test. Section 5 presents the results obtained and the related discussions. Section 6 concludes the
paper. Three appendices are also included, respectively providing a guide to obtaining and using the software used to run the
test (A), detailed final results of the case study (B) and the full results of the application of Page's trend test (C).
2. Background
The assessment of the performance of algorithms is an important task when performing experiments in Computational
Intelligence. When comparing EAs, it is necessary to consider the extent to which the No Free Lunch theorem [39] limits the
conclusions: Under no specific knowledge, any two algorithms are equivalent when their performance is averaged across all
possible problems.
Therefore, assuming that EAs take advantage of the available knowledge in one way or another, it is advisable to focus
interest on efficiency and/or effectiveness criteria. When theoretical developments are not available to check such criteria,
the analysis of empirical results can help to discern which techniques perform more favorably for a given set of problems.
In the literature, it is possible to find different viewpoints on how to improve the analysis of experiments [23]: The design of
test problems [13] (for example, the design of complex test functions for continuous optimization [14,38]), the use of advanced
experimental design methodologies (for example, methodologies for adjusting the parameters of the algorithms depending on
the settings used and results obtained [1,2] or for performing Exploratory Landscape Analysis [3,26]) or the analysis of the
results [9] (to determine whether the differences between algorithms’ performances are significant or not). Another example
is [35], where a method inspired by chess rating systems is adapted to rank the performance of evolutionary algorithms.
From the statistical analysis perspective, the use of statistical tests enhances the conclusions drawn, by determining
whether there is enough evidence to reject null hypotheses based on the results of the experiments. For this task, it is possible
to find applications of both parametric [29,10] and, more recently, nonparametric [18,24,12] statistical procedures.
Nonparametric tests are used to compare algorithms’ final results, represented as average values for each problem (using
the same criterion: average, median, etc. over the same number of runs for each algorithm and problem). This usually
enables practitioners to rank differences among algorithms and determine which ones are significant, thus leading to a char-
acterization of which algorithms behave better than the rest.
However, a drawback of this methodology is that it only takes into consideration the final results obtained at the end.
When analyzing EAs, this often overshadows interesting conclusions which could be drawn by analyzing the performance
of the algorithms during the whole run.
The rest of this section is devoted to the introduction of nonparametric tests and the classical definition of Page’s trend
test. This provides the necessary background to present our proposal on the use of nonparametric tests to analyze the con-
vergence performance of EAs, as an enhancement to the final-results oriented statistical analysis, in particular when final
results are statistically similar.
Nonparametric tests [19] are powerful tools for the analysis of results in Computational Intelligence. They can be used to
analyze both nominal and real data, through the use of rank-based measures. At the cost of some inference power (when
compared with their parametric counterparts), they offer safe and reliable procedures to contrast the differences between
different techniques, particularly in multiple-problem analysis (that is, for studies in which the results over multiple prob-
lems are analyzed jointly, instead of performing a single test per each problem).
To apply nonparametric tests to a multiple-problem set-up, a result per algorithm/problem pair must be provided. This is
often obtained as the average of a given performance measure over several runs – carried out on every single problem. A
typical example could be the average error on 50 runs of an algorithm over 25 different benchmark problems.
A null or no-effect hypothesis is to be formulated prior to the application of the test. It often supports the equality or absence
of differences among the results of the algorithms, and enables alternative hypotheses to be raised that support the opposite
[31]. The null hypothesis can be represented by $H_0$, and the alternative hypotheses by $H_1, \ldots, H_n$. The application of the tests
leads to the computation of a statistic, which can be used to reject the null hypothesis at a given level of significance α.
For a fine grained analysis, it is also possible to compute the smallest level of significance that results in the rejection of
the null hypothesis. This level is the p-value, which is the probability of obtaining a result at least as extreme as the one that
was actually observed, assuming that the null hypothesis is true. The use of p-values is often preferred over using only fixed
α levels since they provide cleaner measures of how significant the result is (the smaller the p-value, the stronger the evi-
dence against the null hypothesis is) [41].
The nonparametric tests can be classified by their capabilities to perform pairwise comparisons and multiple compari-
sons. It is important to note that the p-values obtained through pairwise comparisons are independent, and thus multiple
comparison procedures should be used instead when comparing more than two algorithms [17].
Several nonparametric tests can be used to compare the final results of EAs in continuous optimization problems: The
Sign test and the Wilcoxon Signed-ranks test can help in dealing with pairwise comparisons, whereas the Friedman, the Fried-
man Aligned-ranks and the Quade test can be used for performing multiple comparisons. Post hoc procedures, such as the
Holm test, can be introduced after the application of multiple comparisons, to characterize the existence of pairwise differ-
ences within a multiple comparisons set-up [16].
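As a small illustration (not taken from any particular study), a pairwise comparison of the final results of two algorithms over a set of benchmark problems can be carried out with the Wilcoxon signed-ranks test available in SciPy; the data below are randomly generated placeholders:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical average final errors of two algorithms over 25 benchmark problems.
rng = np.random.default_rng(3)
errors_A = rng.random(25)
errors_B = errors_A * rng.uniform(0.5, 1.1, size=25)   # B tends to obtain lower errors

# Two-sided Wilcoxon signed-ranks test on the paired per-problem results.
stat, p_value = wilcoxon(errors_A, errors_B)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
```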
Page’s trend test for ordered alternatives [27] can be classified in the family of tests for association in multiple classifi-
cations, similar to the Friedman test. Before detailing its application to the analysis of convergence performance (which will
be provided in the next section), it is necessary to provide its original definition.
This test defines the null hypothesis as the equality between the k treatments analyzed, which can be rejected in favor of an
ordered alternative (the ordered alternative is the main difference of this test with respect to the Friedman test, which only
defines the alternative hypothesis as the existence of differences between treatments).
The ordered alternative must be defined by the practitioner before starting the analysis. An order between the k treat-
ments has to be provided, and it should reflect the expected order for the populations. Hence, the treatments’ measures
should be numbered from 1 to k, where treatment 1 has the smallest sum of ranks, and treatment k has the largest.
Once such an order and the data (consisting of n samples of the k treatments) are provided, the n samples (data rows) can be
ranked from the best to the worst, giving a rank of 1 to the best measure in the sample, a rank of 2 to the second, . . ., and a rank of
k to the worst. If there are ties for a given sample, average ranks can be assigned (for example, a tie between the first and the
second result would produce an average rank of (1 + 2)/2 = 1.5, which would be assigned to both measures). If the data is con-
sistent with the initial ordering defined, then the sum of ranks’ values for each of the treatments will follow in increasing order.
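For instance, the midrank assignment can be reproduced with the rankdata routine of SciPy (one convenient implementation of this ranking rule; the sample values below are made up):

```python
from scipy.stats import rankdata

# A sample of k = 4 measures with a tie between the two smallest values:
# both receive the average of ranks 1 and 2, i.e. (1 + 2) / 2 = 1.5.
# Here the smallest value is treated as the best measure.
print(rankdata([3.2, 1.1, 1.1, 5.0]))   # -> [3.  1.5 1.5 4. ]
```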
After obtaining the ranks, the Page L statistic can be computed using the following expression:

$$L = \sum_{j=1}^{k} j\,R_j = R_1 + 2R_2 + \cdots + kR_k \qquad (1)$$

where $R_j = \sum_{i=1}^{n} r_{ji}$, and $r_{ji}$ is the rank of the j-th of the k measures on the i-th of the n samples.
The L statistic can be seen as a weighted version of Friedman’s test (as presented in [27]) by which average ranks are given
more weight the closer they are to the final treatments. L critical values can be computed for small values of k and n (see, for
example, Table Q in [19] for values up to k = 8 and n = 12). In the case that larger values are required, a normal approxima-
tion should be considered. The normal approximation for the L statistic is given by the following expression
$$Z = \frac{12(L - 0.5) - 3Nk(k+1)^2}{k(k+1)\sqrt{N(k-1)}} \qquad (2)$$

whose estimation, including a continuity correction, is approximately standard normal, with a rejection region on the right tail.
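To make the computation concrete, the following Python sketch (our own illustration, not the Java program distributed with this paper) implements the classical form of the test defined by Eqs. (1) and (2): each sample is ranked with midranks for ties, the column rank sums R_j are weighted by their positions to obtain L, and the continuity-corrected normal approximation yields a one-sided p-value. Rank 1 is given here to the smallest value of each row, following the classical definition above; for the convergence methodology of Section 3, the ranks would instead be assigned according to the scheme described there.

```python
import numpy as np
from scipy.stats import rankdata, norm

def page_trend_test(data):
    """Classical Page test for ordered alternatives.

    data: (n, k) array whose columns are already ordered according to the
    expected increasing trend (treatment 1 first, treatment k last).
    Returns the L statistic, the Z value of Eq. (2) and the one-sided p-value.
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    # Rank each sample (row); ties receive midranks, rank 1 goes to the smallest value.
    ranks = np.apply_along_axis(rankdata, 1, data)
    R = ranks.sum(axis=0)                    # column rank sums R_j
    L = np.sum(np.arange(1, k + 1) * R)      # Eq. (1)
    # Normal approximation with continuity correction, Eq. (2); right-tail rejection region.
    Z = (12.0 * (L - 0.5) - 3.0 * n * k * (k + 1) ** 2) / (k * (k + 1) * np.sqrt(n * (k - 1)))
    return L, Z, norm.sf(Z)

# Toy usage: 5 samples and 4 treatments with a built-in increasing trend.
rng = np.random.default_rng(0)
toy = np.cumsum(rng.random((5, 4)), axis=1)
print(page_trend_test(toy))
```

As a check, plugging the rank sums of Example 1 below (L = 6061.0, n = 19, k = 10) into Eq. (2) gives Z ≈ 2.61 and a one-sided p-value of about 0.0045, in agreement with the value reported in Fig. 2. For small values of k and n, the exact critical values (e.g., Table Q in [19]) should be preferred over the normal approximation.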
Table 1
Computation of ranks for Page’s trend test (Example 1).
Ranks C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
Function 1 1 2 3 10 6.5 6.5 6.5 6.5 6.5 6.5
Function 2 10 4 9 8 7 6 5 3 2 1
Function 3 1 2 3 4 5 6 10 9 8 7
Function 4 9 10 8 7 6 5 4 3 2 1
Function 5 1 2 3 10 6.5 6.5 6.5 6.5 6.5 6.5
Function 6 1 2 3 4 10 9 6.5 6.5 6.5 6.5
Function 7 10 9 8 7 6 5 2.5 2.5 2.5 2.5
Function 8 10 9 8 7 6 5 4 3 2 1
Function 9 1 2 3 4 5 6 10 9 8 7
Function 10 1 2 3 4 10 7 7 7 7 7
Function 11 1 2 3 4 5 6 10 9 8 7
Function 12 1 10 9 8 7 6 5 4 3 2
Function 13 1 2 3 4 5 6 10 9 8 7
Function 14 9 10 8 7 6 5 4 3 2 1
Function 15 1 2 3 4 5 6 8.5 8.5 8.5 8.5
Function 16 10 9 8 7 6 5 4 3 2 1
Function 17 1 2 3 4 5 6 10 9 8 7
Function 18 9 10 2 1 3 4 8 7 6 5
Function 19 1 2 3 4 5 8 8 8 8 8
3. Page's trend test for convergence analysis
In this section, the use of Page's trend test for convergence analysis is described. The test is applied under the assump-
tion that an algorithm with a good convergence performance will advance towards the optimum faster than another algo-
rithm with a worse performance. Thus, differences in the fitness values will increase as the search continues.
The application of Page’s trend test to this task is described in Section 3.1. A modification to the ranks assignment pro-
cedure of the test (and hence, to this proposal) is presented in Section 3.2. This modification, useful in cases where the algo-
rithms reach the optimum of some functions before the end of the experiments, may be of interest for many
common experimental studies on continuous optimization, in which the optima of some of the benchmark functions are
very likely to be reached.
3.1. Applying Page's trend test to the analysis of convergence
The original definition of Page's trend test focuses on detecting increasing trends in the rankings computed using the
input data. This means that decreasing trends in the data values will be detected, provided that ranks are computed as
described before.
The input data would represent the differences between each algorithm’s average best objective value reached, at differ-
ent steps of the search (cut-points). The best objective value reached at each cut-point has to be collected for every run of
each algorithm and function. These values should then be averaged over the runs, so that a single, aggregated value is
obtained per algorithm, function and cut-point. This will allow us to compute the differences between a pair of algo-
rithms, by subtracting the aggregated values.
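As an illustration of this aggregation step (a sketch under assumed array names and shapes, with random numbers standing in for real runs), the per-run best-so-far records are averaged over the runs and the aggregated values of the two algorithms are subtracted:

```python
import numpy as np

# Hypothetical raw results: best objective value reached so far, stored for each
# algorithm as an array of shape (n_functions, n_runs, n_cutpoints).
n_functions, n_runs, n_cutpoints = 19, 25, 10
rng = np.random.default_rng(1)
raw_A = np.minimum.accumulate(rng.random((n_functions, n_runs, n_cutpoints)), axis=2)
raw_B = np.minimum.accumulate(rng.random((n_functions, n_runs, n_cutpoints)), axis=2)

# Average along the runs: one aggregated value per algorithm, function and cut-point.
avg_A = raw_A.mean(axis=1)      # shape (n_functions, n_cutpoints)
avg_B = raw_B.mean(axis=1)

# Differences A - B: the n x c input matrix of the convergence comparison,
# with the cut-points (columns) kept in chronological order.
diff_AB = avg_A - avg_B
```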
Therefore, the input data of the test will represent the differences between the two algorithms, A and B, recorded at c
different points of the search, on n problems (functions).
The treatments (columns) will represent each of the c cut-points at which data is gathered (they should be taken at regular
intervals), whereas the samples (rows) will represent the n different functions used to test the algorithms. Fig. 1 shows an
example of the convergence of two algorithms and how c = 10 cut-points are tracked for each one.
The specific number of samples and treatments to consider would depend on the characteristics of each specific situation
and the available data, although a reasonable rule would be to have approximately twice the number of samples as treat-
ments, at least (see [19]). Also, treatments should always be ordered in increasing order, since we are interested in analyzing
the trends as the search progresses. That is, the first treatment should represent the first cut-point, the second treatment
should represent the second cut-point and so on.
Under these conditions, Page’s trend test may be used to detect increasing trends in the ranks that represent the differ-
ences (or decreasing trends, if the order of the algorithms is reversed). Assuming a minimization objective (the test could
easily be adapted to maximization objectives by reversing the sign of the differences), the outcome of the test can be
interpreted as follows:
Significant increasing trend: If a consistently increasing trend in the ranks is found, this means that either the fitness of A
is growing faster than the fitness of B or that the fitness of B is decreasing faster than the fitness of A.
Since the fitness is computed as the best value found throughout the search, the former case is impossible. Hence, if an
increasing trend is detected, this means that the fitness of B is decreasing faster, which means that it has a better con-
vergence performance.
Significant decreasing trend: Following the same reasoning as above, this could only mean that the fitness of A is decreas-
ing faster. Hence, a decreasing trend in the ranks means that A has a better convergence performance.
No significant trend: If no consistent trend is found, then nothing can be said about the relative convergence performance
of two algorithms.
Example 1. Let A and B be two algorithms to analyze, considering n = 19 different functions and c = 10 cut-points. Table 1
shows an example of the treatments' ranks ($R_j$) computed for the A–B differences in fitness values. Note that ranks are
assigned from 1 (greater absolute differences) to 10 (lower absolute differences), and that midranks are assigned when nec-
essary (hence, the ranks of each function always sum to 55).
Fig. 2 shows the sum of all the $R_j$ values per cut-point, and the Page L statistic computed from them. It shows the
associated p-value obtained. For completeness, the relevant data of the opposite comparison (B–A) is also included.
The comparison A–B shows an increasing trend in the ranks (as can be seen in the figure), which is confirmed by a very
low p-value. Moreover, the opposite comparison, B–A, clearly shows that the ranks are not increasing (in fact, they are
decreasing), which yields a p-value near 1.0. These results show that algorithm A is converging faster than
algorithm B.
Fig. 1. Example of the convergence of two algorithms at 10 different cut-points (fitness value vs. cut-points).
Fig. 2. Sum of ranks per cut-point for the comparisons A–B and B–A (Example 1), with the corresponding L statistics (6061.0 and 5434.0) and p-values (0.00451 and 0.99560).

3.2. An alternative procedure for computing the ranks

Although the aforementioned procedure should be correct in most cases, it should be used with caution if any of the algorithms is unable to progress in the search for some of the functions. The most typical case occurs when the absolute global optimum can be reached within the stipulated evaluations limit (something that is very likely to happen with most of the common benchmarks currently in use, such as that of the IEEE Congress on Evolutionary Computation 2005 Special Session on Real-Parameter Optimization [33]). Hence, if such an optimum is reached (and it is known), the computation procedure should be corrected in order to rank the differences properly.
Fig. 3 shows a graph depicting the convergence process of two different algorithms for a given function. As can be seen, algo-
rithm A is converging faster than algorithm B, reaching the optimum using half of the total evaluations allowed. In this situa-
tion, the test would be expected to report a positive response, showing that algorithm A has a better behavior in that problem.
However, this is not the case if the former ranking computation procedure is used. If fitness value differences between the
algorithms are computed, an increasing trend would be identified from cut-points 1 to 5 (positive for algorithm A, since the
differences are increasing as the algorithms proceed). However, a decreasing trend would also be detected, from cut-points 6
to 10. Clearly, this undesirable behavior is caused by the fact that the optimum has been reached too soon by algorithm A,
preventing it from progressing further for the rest of the fitness evaluations.
To tackle this problem, we propose a modification to the procedure used to compute the ranks. The aim of this alternative
version is to continue using the same ranking scheme as in the original approach, while fixing the ranks in those cases in
which the function's optimum is reached well before the maximum number of fitness evaluations is exhausted.
When analyzing the differences between two algorithms, A and B, as A–B, four different cases can be highlighted (by
considering ranks for 10 cut-points for each case):
1. No algorithm reaches the optimum before the end: No further changes are necessary.
Example: Using 10 cut-points, a possible ordering could be the following:
7, 3, 5, 2, 1, 4, 8, 6, 9, 10
(Ranks are computed in the standard way. That is, the difference at the first cut-point is the seventh largest absolute differ-
ence (rank 7), the difference at the second one is the third largest absolute difference (rank 3), and so forth. The largest
difference is found at the fifth cut-point (rank 1), whereas the smallest is found at the last cut-point (rank 10)).
2. Algorithm A reaches the optimum before the end: Ranks should be modified so an increasing trend is detected from the
point at which algorithm A reaches the optimum to the last cut-points of the comparison. The rest of the ranks could be
assigned as above.
Example: Using 10 cut-points and having algorithm A ending at the 6th cut-point, a possible ordering could be the
following:
3, 1, 2, 4, 5, 6, 7, 8, 9, 10
(Highest ranks are assigned increasingly starting from the sixth cut-point).
3. Algorithm B reaches the optimum before the end: Ranks should be modified so a decreasing trend is detected from the
point at which algorithm B reaches the optimum. The lowest ranks should be assigned decreasingly to the last cut-points
of the comparison. The rest of the ranks could be assigned as in the first case.
Example: Using 10 cut-points and having algorithm B converging to the global optimum in the 6th cut-point, a possible
ordering could be the following:
6, 10, 8, 9, 7, 5, 4, 3, 2, 1
(Lowest ranks are assigned decreasingly starting from the sixth cut-point).
4. Both algorithms reach the optimum at the same cut-point: In this case, the computation of the ranks can be performed
as with the original version of the test. Zero differences will be ranked using the median ranks, denoting that no trend is
detected from the point in which both algorithms reached the optimum.
Example: Using 10 cut-points and having algorithms A and B ending in the sixth cut-point, a possible ordering could be
the following:
1, 10, 9, 8, 2, 5, 5, 5, 5, 5
(Assigning midranks from the sixth cut-point).
As has been shown, the alternative version addresses those cases in which the global optimum is reached prior to
exhausting the maximum function evaluations, preventing the algorithm from continuing with the search. In cases 2 and
3, the scheme adopted will benefit the first algorithm to converge to the optimum, and this benefit will be higher the sooner
the said optimum is reached. Also, note that if case 4 is reached, the chances of not rejecting the equality hypothesis will
greatly increase (as would happen naturally if the basic ranking scheme is considered).
Example 2. Table 2 shows the cut-point at which algorithms A and B (from Example 1) reached the known optimum of the 19
different functions.
As can be seen in the table, algorithm B always reaches the optimum at the same time as or before algorithm A (except in
function F7). Hence, it would be expected that B would be found to have a better convergence behavior.
Fig. 3. Convergence behavior of two algorithms at 10 different cut-points. Algorithm A converges faster than algorithm B.
Fig. 4. Average ranks at each cut-point for the comparisons A–B and B–A (Example 2).
Table 2
Ending cut-points for algorithms A and B (Example 2). "–" denotes that the optimum was not reached.
Ending F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19
Algorithm A 5 – – 10 5 7 7 – – 6 – 10 – – 7 – – – 6
Algorithm B 4 – – 5 4 5 – – – 5 – 5 – – 7 3 – – 6
Table 3 shows the ranks, L statistic and p-values computed using this data, by both versions of Page’s trend test (standard
and alternative). Note that the results of the first two rows are the same as were shown in Example 1. These results are also
depicted graphically in Fig. 4.
The alternative computation of the ranks introduces a clear modification in the p-values computed. Such a modification
corrects the previous result, in the sense that a favorable trend is identified this time for algorithm B. This is in consonance
with the data shown in Table 2, clearly depicting algorithm B as the one with the best convergence performance.
4. Case study: On analyzing the performance of several Differential Evolution based approaches
The proposed methodology should make it possible to carry out pairwise comparisons among algorithms involved in an
experimental study. At this point, our objective is to show what could be the outcome of such comparisons and how to inter-
pret them.
Hence, in order to demonstrate the usefulness of the methodology proposed, the following sections will be devoted to
describing a case study focused on the analysis of several Differential Evolution [11] based techniques, and to analyzing
the results obtained. It is complemented by the description of the software developed to apply the test (Appendix A), which can be
obtained at http://sci2s.ugr.es/sicidm/pageTest.zip.
This case study is mainly based on the Soft Computing journal Special Issue on Scalability of Evolutionary Algorithms and other
Metaheuristics for Large Scale Continuous Optimization Problems [21] from which both the benchmarking functions and
Table 3
Original and alternative version for the computation of ranks (Example 2).
some of the participant algorithms have been taken. Note that, even in large case studies like this one, this methodology
requires considering only a fixed number of cut-points (dependent on the number of functions considered as benchmark).
This requirement is independent of other factors such as population sizes or the number of iterations performed, making the
test a suitable choice when analyzing complex experiments.
4.1. Functions
The benchmark proposed in the special issue consists of 19 functions [20]. The first six were taken from the CEC’2008
Special Session and Competition on Large Scale Global Optimization [34]. Functions F7–F11 were included as shifted versions
from other common benchmarks in continuous optimization. Finally, functions F12–F19 were built for this benchmark com-
bining two of the previous ones (at least one of the functions in each combination is non-separable). Table 4 shows the main
characteristics of each function: Name, Range and Optimum value.
These functions were presented [21] as a suitable benchmark for testing the capabilities of EAs and other metaheuristics.
By including unimodal/multimodal, separable/non-separable and shifted functions, this benchmark should pose a challenge
for modern optimization algorithms.
All the 19 functions will be considered for the study. The analysis of results will be carried out considering three different
set-ups: 50 dimensions, 100 dimensions and 200 dimensions. This will provide a clear picture of how the algorithms per-
form as the dimensionality of the problems increases.
4.2. Algorithms
Six different algorithms have been chosen for this study, five of which were originally accepted for the special issue [21].
All of them are advanced EAs based on differential evolution:
Table 4
The 19 test functions chosen as benchmark.
GODE [36]: A Generalized Opposition-based learning Differential Evolution algorithm. This technique is based on oppo-
sition-based learning, which is used to transform candidates from the current search region into new search regions.
These transformations are aimed at enabling the algorithm to have a greater chance of finding better solutions than when
searching without opposition based transformation.
SaDE-MMTS [43]: A Self-adaptive Differential Evolution algorithm hybridized with a Modified Multi-Trajectory Search
strategy (MMTS). This search strategy enhances the search performed by the original SaDE algorithm [28] by frequently
refining several diversely distributed solutions at different search stages by using MMTS, satisfying both global and local
search requirements.
SaEPSDE-MMTS [42]: An Ensemble of Parameters and mutation Strategies in Differential Evolution with Self-adaption
[25] improved with the MMTS, with the aim of enhancing the behavior of the original algorithm.
SOUPDE [37]: Shuffle Or Update Parallel Differential Evolution is a structured population algorithm characterized by sub-
populations employing a Differential evolution logic and two strategies: Shuffling, which consists of merging the sub-
populations and subsequently randomly dividing them again into sub-populations; and update, which consists of ran-
domly updating the values of the scale factors of each population.
GADE [40]: A Generalized Adaptive Differential Evolution algorithm, which is governed by a generalized parameter adap-
tation scheme. An auto-adaptive probability distribution, updated during the whole evolution process, is used to generate
suitable values for the most important parameters of the underlying Differential Evolution based search procedure.
jDElscop [6]: A self-adaptive Differential Evolution for large scale continuous optimization problems. This is an upgrade
of the original jDE algorithm [5], incorporating three different evolution strategies, a population size reduction mecha-
nism, and a mechanism for changing the sign of control parameters.
All methods have been used considering the default configuration provided by their authors in their original submissions
to the special issue. Hence, no explicit optimization of parameters was performed.
5. Results and analysis
The experimental study is split into two parts. The first one (Section 5.1) shows the results obtained after
carrying out 25 independent runs of each algorithm over each function. For each run, 5000 × D evaluations have been
allowed, where D is the number of dimensions of the function (50, 100 or 200).
After performing the analysis based on the results at the end of the run, the second part (Section 5.2) analyzes the
convergence behavior of the algorithms, using Page's trend test (both the original and the alternative version). Dif-
ferences between the conclusions drawn from the two studies will be pointed out, highlighting the role of the Page’s trend
test based approach to convergence analysis and the benefits of the alternative version proposed.
5.1. Analysis of the final results
Table 5 summarizes the final results obtained by the algorithms on the 50-, 100- and 200-dimensional functions, depicted as the
number of functions for which the average final error is lower than 1.00E−10 (that is, the number of those for which it can be
assumed that the algorithm has reached the optimum). For further reference, full results are included in Appendix B.
The final results can be contrasted by using tests for N × N comparisons [12]. In this case, we will use the Friedman test to
contrast the differences, and the Bergmann post hoc procedure for adjusting the results of the 1 × 1 pairwise comparisons.
Table 6 shows the results of the Friedman test, whereas Table 7 shows the results of the Bergmann post hoc procedure.
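As a hedged illustration of such an N × N analysis (not the exact software used in this study), the matrix of average final errors can be contrasted with the Friedman test implementation available in SciPy; the Bergmann (or Holm) post hoc adjustment would then be applied to the resulting pairwise comparisons:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical (n_functions, n_algorithms) matrix of average final errors,
# one column per algorithm; random placeholders are used here.
rng = np.random.default_rng(2)
final_errors = rng.random((19, 6))

# friedmanchisquare expects one sequence per treatment (algorithm).
stat, p_value = friedmanchisquare(*[final_errors[:, j] for j in range(final_errors.shape[1])])
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
```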
As shown by the tables, there are very few differences in the performance of the algorithms. If only the final results are
analyzed, only small differences can be found in favor of SOUPDE and jDElscop in every case. However, these differences are
not significant.
The p-values computed by the Friedman test show that there is no significant difference among the algorithms, even at an
α = 0.1 level of significance. The best (lowest) ranks are also obtained by SOUPDE and jDElscop, but this does not make the
differences between them and the rest significant. The Bergmann procedure supports these conclusions, pointing out that no
significant difference can be detected in any pairwise comparison.
Therefore, if only the final results were analyzed, the conclusion of the study would be that all the algorithms
exhibit a similar behavior. Perhaps it could be pointed out that the SOUPDE and jDElscop algorithms show some
differences when compared with the rest, but in every case the differences are not significant.
Table 5
Number of functions solved (reached an average error lower than 1.00E−10) per algorithm.
Table 6
Friedman test for the results obtained at 50, 100 and 200 dimensions.
Table 7
Pairwise hypotheses analyzed by the Bergmann post hoc procedure.
However, these conclusions might not be satisfactory, particularly if the behavior of algorithms over time is analyzed.
For example, Fig. 5 shows how important differences can be found between SOUPDE and jDElscop , even when both algo-
rithms reach the same final result. It is not difficult to find examples where one algorithm is converging much faster than
another, but, due to the fixed limit on the number of evaluations, this is not detected if only final results are analyzed.
Throughout the rest of the case study, we will show how this difficulty can be overcome by using Page's trend test to
analyze convergence.
5.2. Analysis of convergence using Page's trend test
The same experimental conditions have been considered for this second study: Algorithms and functions to study, eval-
uations limit and so forth. Considering that our framework consists of 19 different functions to optimize, the number of cut-
points has been fixed at 10, one after each 10% of fitness function evaluations (see Section 3.1).
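For reference, the recording schedule implied by this choice is straightforward to compute; the sketch below (with a hypothetical helper name) lists the evaluation counts at which the best-so-far error would be stored under the 5000 × D budget:

```python
# One cut-point after each 10% of the budget of 5000 * D fitness evaluations.
def cutpoint_schedule(D, n_cutpoints=10, budget_per_dim=5000):
    budget = budget_per_dim * D
    return [round(budget * (i + 1) / n_cutpoints) for i in range(n_cutpoints)]

print(cutpoint_schedule(50))    # 50-dimensional functions
print(cutpoint_schedule(200))   # 200-dimensional functions
```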
Tables 8–10 show a summarized version of the results obtained (an extended version, including ranks at every cut-point,
L statistics and p-values for each pairwise comparison, is provided in Appendix C). Results are provided for both the original and the
alternative version of the test. Each p-value of the tables is computed as the probability of rejection of the hypothesis of
equality of convergence, in favor of the alternative that the method in the row converges faster than the method in the col-
umn. Rejected hypotheses (at a significance level of α = 0.1) are highlighted in bold.
The very first fact to note here is the differences between the original and the alternative version of the test: Although in
the clearest cases the results seldom change, this is not the case for some of the comparisons, for which very different p-
values are obtained (particularly in the results for 100-dimensional functions). As shown in Example 2 (which actually
reflects the comparison between GODE and SaEPSDE-MMTS on the 100-dimensional functions), the interpretation of the results
might change dramatically if the alternative version is not considered.
Fig. 5. Convergence plots of SOUPDE vs jDElscop: Performances can be very different even when both algorithms reach the same result at the end. Some
algorithms could converge better for most of the run (left graph, 200-D, F18), or even finish using 10% fewer evaluations (right graph, 200-D, F5), and
traditional final results analysis would not be able to detect these differences in performance.
Table 8
Convergence results (p-values) for the experiments on 50 dimensional functions.
Table 9
Convergence results (p-values) for the experiments on 100 dimensional functions.
Table 10
Convergence results (p-values) for the experiments on 200 dimensional functions.
After studying the analysis performed with the alternative version, we can draw the following conclusions:
50 dimensions: SOUPDE shows the best convergence behavior, followed by GODE. SaDE-MMTS and jDElscop present the
worst convergence performance in this case.
100 dimensions: SOUPDE presents the best convergence in this scenario. GADE and SaDE-MMTS present a better perfor-
mance than SaEPSDE-MMTS, and GODE and jDElscop show the lowest convergence speed in this case.
200 dimensions: SaDE-MMTS and SOUPDE are the best methods with respect to convergence in this case. GODE is also
significantly better than SaEPSDE-MMTS and GADE, whereas jDElscop presents the worst convergence capabilities.
Although the results differ depending on the number of dimensions considered, it is safe to state that, in general, SOUPDE
shows the best behavior with respect to convergence capabilities, whereas jDElscop presents the worst. A further interesting
observation concerns the relationship between SaDE-MMTS and SaEPSDE-MMTS (the latter is better at 50 dimensions,
whereas the former is better at 100 dimensions and, particularly, at 200 dimensions).
The analysis of final results can now be refined with this convergence study. Considering both methodologies, SOUPDE
shows the best performance out of the 6 differential evolution methods analyzed. The most striking difference can be found
in the results of jDElscop, which has shown a marginal advantage with respect to the final results, but the worst convergence
performance. This could indicate that the convergence mechanisms of jDElscop enable it to avoid local optima in a better
way than the other methods, but at the cost of a low convergence speed, thus needing more function evaluations to fully
reach its best performance.
Other conclusions include the fact that, despite the similar results obtained by SaEPSDE-MMTS and SaDE-MMTS, the for-
mer should be preferred for low dimensional problems whereas SaDE-MMTS should be chosen when the number of dimen-
sions increases. Also, GODE and GADE show a poorer performance than the rest, although the former should still be
considered for low dimensional problems.
In summary, these conclusions reveal the fact that useful information about the performance of evolutionary methods in
continuous optimization can be drawn if the intermediate results are analyzed. By studying convergence in depth, analyzing
how the methods’ results evolve as the fitness function evaluations are consumed, new comparisons can be made depicting
other useful properties of the search methods rather than just the final results at a predefined point.
Page’s trend test has been shown to be a useful method to perform this analysis. Also, the alternative version developed
has helped us to mitigate the problem of performing proper comparisons of algorithms that reach the optimum before the
maximum fitness evaluation count, enabling us to draw meaningful conclusions about the performances of the methods.
6. Conclusions
In this paper we have presented a new way of analyzing the behavior of EAs in optimization problems, with respect to
their convergence performance. We have shown how Page’s trend test can be used to perform such an analysis. Also, we have
described how the ranks can be computed in an alternative way, in the event that the optimum value of the functions could
be reached by the algorithms before the end of the run.
As with other applications of nonparametric tests, the present one does not rely on the assumptions of normality, inde-
pendence and homoscedasticity. Hence, it is safe to assume that it can be used to analyze the convergence performance of
EAs, provided that intermediate results are gathered.
By analyzing such intermediate results, Page’s trend test is able to provide key information about algorithms’ behavior.
Such information can be decisive when establishing differences between algorithms which would otherwise be considered
to be equal, if only the final results were used. Therefore, Page’s trend test may be regarded as a way of enriching experimen-
tal analysis, incorporating statistical convergence analysis within the range of methodologies available.
Acknowledgment
This work was supported by the Spanish Ministry of Education and Science under Grant TIN2011-28488.
Appendix A. Software for applying Page's trend test
A Java implementation of our approach is available at the SCI2S thematic public website on Statistical Inference in Com-
putational Intelligence and Data Mining, http://sci2s.ugr.es/sicidm/. It can be downloaded at the following link:
http://sci2s.ugr.es/sicidm/pageTest.zip
Its main features are that it:
Includes both the basic version of Page’s trend test and the alternative version.
Allows us to perform multiple pairwise tests of several algorithms.
Accepts comma separated values files (CSV) as input data.
Obtains results as a full report in TeX and plain text formats.
The source code is also offered under the terms of the GNU General Public License.
The input data should consist of a CSV file per algorithm, containing the average results obtained at several cutpoints (col-
umns) in several functions (rows). For example, the following file:
2.08E+01,1.99E+01,1.86E+01,1.73E+01
3.58E+03,3.38E+03,2.40E+03,9.31E+00
1.11E+00,1.07E+00,1.04E+00,1.02E+00
2.43E−04,2.33E−04,7.85E−05,8.46E−07
0.00E+00,0.00E+00,0.00E+00,0.00E+00
represents the results of one algorithm over 5 different functions, taken at 4 different cutpoints (ordered from left to right).
All files created should share the same format (number of functions and cutpoints).
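A minimal sketch of how such files could be processed outside the provided program (the file names are hypothetical): each CSV is loaded as a functions × cut-points matrix, and the element-wise difference between two algorithms yields the input data of the test described in Section 3:

```python
import numpy as np

# Hypothetical input files following the format described above:
# one row per function, one comma-separated column per cut-point.
results_A = np.loadtxt("algorithmA.csv", delimiter=",")   # shape (n_functions, n_cutpoints)
results_B = np.loadtxt("algorithmB.csv", delimiter=",")

assert results_A.shape == results_B.shape, "all files must share the same format"

# Differences A - B at every function and cut-point, ready for the convergence comparison.
diff_AB = results_A - results_B
print(diff_AB.shape)
```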
Appendix B. Detailed final results of the case study
The following tables show the final results obtained by each technique in the case study (for 50, 100 and 200 dimen-
sions). For each pair function/technique, the tables report the average error obtained in 25 independent runs.
Note that if an algorithm reaches an average error lower than 1.00E−10, then it is assumed to have reached the optimum
(and thus its error is replaced by 0.00E+00). Optimum values in these tables are highlighted in bold, and the last row (# Solved)
shows the number of optima reached by every algorithm over the 19 functions (see Tables B.11, B.12 and B.13).
Table B.11
Final results at 50 dimensional functions.
Table B.12
Final results at 100 dimensional functions.
Table B.13
Final results at 200 dimensional functions.
Appendix C. Full results of the application of Page's trend test
The following tables show the full results obtained in each application of Page's trend test for the case study. For each
pairwise comparison, the average ranks at 10 cutpoints, the L statistic and the p-value computed are reported.
The results include all the possible pairwise comparisons at 50, 100 and 200 dimensions. Both versions of the test (the
original and the alternative) are considered (see Tables C.14, C.15, C.16, C.17, C.18 and C.19).
Table C.14
Full results of Page’s trend test on 50 dimensional functions.
Table C.15
Full results of Page’s trend test on 50 dimensional functions (alternative version).
Table C.16
Full results of Page’s trend test on 100 dimensional functions.
Table C.17
Full results of Page’s trend test on 100 dimensional functions (alternative version).
Table C.18
Full results of Page’s trend test on 200 dimensional functions.
Table C.19
Full results of Page’s trend test on 200 dimensional functions (alternative version).
References
[1] M.G. Arenas, N. Rico, A.M. Mora, P.A. Castillo, J.J. Merelo, Using statistical tools to determine the significance and relative importance of the main
parameters of an evolutionary algorithm, Intell. Data Anal. 17 (2013) 771–789.
[2] T. Bartz-Beielstein, Experimental Research in Evolutionary Computation: The New Experimentalism, Springer, New York, 2006.
[3] B. Bischl, O. Mersmann, H. Trautmann, M. Preuss, Algorithm selection based on exploratory landscape analysis and cost-sensitive learning, in:
Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, GECCO ’12, New York, USA, July 7–11.
[4] P.A.N. Bosman, On gradients and hybrid evolutionary algorithms for real-valued multiobjective optimization, IEEE Trans. Evol. Comput. 16 (2012) 51–
69.
[5] J. Brest, S. Greiner, B. Boskovic, M. Mernik, V. Zumer, Self-adapting control parameters in differential evolution: a comparative study on numerical
benchmark problems, IEEE Trans. Evol. Comput. 10 (2006) 646–657.
[6] J. Brest, M.S. Maucec, Self-adaptive differential evolution algorithm using population size reduction and three strategies, Soft Comput. 15 (2011) 2157–
2174.
[7] P. Chakraborty, S. Das, G.G. Roy, A. Abraham, On convergence of the multi-objective particle swarm optimizers, Inf. Sci. 181 (2011) 1411–1425.
[8] W.J. Conover, Practical Nonparametric Statistic, third ed., John Wiley & Sons, 1999.
[9] M. Crepinsek, S.H. Liu, M. Mernik, Replication and comparison of computational experiments in applied evolutionary computing: common pitfalls and
guidelines to avoid them, Appl. Soft Comput. 19 (2014) 161–170.
[10] A. Czarn, C. MacNish, K. Vijayan, R. Turlach, R. Gupta, Statistical exploratory analysis of genetic algorithms, IEEE Trans. Evol. Comput. 8 (2004) 405–421.
[11] S. Das, P.N. Suganthan, Differential evolution: a survey of the state-of-the-art, IEEE Trans. Evol. Comput. 15 (2011) 4–31.
[12] J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary
and swarm intelligence algorithms, Swarm Evol. Comput. 1 (2011) 3–18.
[13] E.A. Duéñez-Guzmán, M.D. Vose, No free lunch and benchmarks, Evolut. Comput. 21 (2013) 293–312.
[14] M. Gallagher, B. Yuan, A general-purpose tunable landscape generator, IEEE Trans. Evol. Comput. 10 (2006) 590–603.
[15] S. García, A. Fernández, J. Luengo, F. Herrera, A study of statistical techniques and performance measures for genetics-based machine learning:
accuracy and interpretability, Soft Comput. 13 (2009) 959–977.
[16] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational
intelligence and data mining: experimental analysis of power, Inform. Sci. 180 (2010) 2044–2064.
[17] S. García, F. Herrera, An extension on Statistical Comparisons of Classifiers over Multiple Data Sets for all pairwise comparisons, J. Mach. Learn. Res. 9
(2008) 2677–2694.
[18] S. García, D. Molina, M. Lozano, F. Herrera, A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case
study on the CEC’2005 special session on real parameter optimization, J. Heuristics 15 (2009) 617–644.
[19] J. Gibbons, S. Chakraborti, Nonparametric Statistical Inference, fifth ed., Chapman & Hall, 2010.
[20] F. Herrera, M. Lozano, D. Molina, Test Suite for the Special Issue of Soft Computing on Scalability of Evolutionary Algorithms and Other Metaheuristics
for Large Scale Continuous Optimization Problems, 2010. <http://sci2s.ugr.es/eamhco/cfp.php>.
[21] F. Herrera, M. Lozano, D. Molina, Editorial scalability of evolutionary algorithms and other metaheuristics for large-scale continuous optimization
problems, Soft Comput. 15 (2011) 2085–2087.
[22] J.J. Higgins, Introduction to Modern Nonparametric Statistics, Duxbury Press, 2003.
[23] J. Hooker, Testing heuristics: we have it all wrong, J. Heuristics 1 (1997) 33–42.
[24] J. Luengo, S. García, F. Herrera, A study on the use of statistical tests for experimentation with neural networks: analysis of parametric test conditions
and non-parametric tests, Expert Syst. Appl. 36 (2009) 7798–7808.
[25] R. Mallipeddi, P.N. Suganthan, Q.K. Pan, M.F. Tasgetiren, Differential evolution algorithm with ensemble of parameters and mutation strategies, Appl.
Soft Comput. 11 (2011) 1679–1696.
[26] O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, G. Rudolph, Exploratory landscape analysis, in: Proceedings of the 13th Annual Conference
on Genetic and Evolutionary Computation, GECCO ’11, New York, USA, July 12–16.
[27] E.B. Page, Ordered hypotheses for multiple treatments: a significance test for linear ranks, J. Am. Stat. Assoc. 58 (1963) 216–230.
[28] A.K. Qin, V.L. Huang, P.N. Suganthan, Differential evolution algorithm with strategy adaptation for global numerical optimization, IEEE Trans. Evol.
Comput. 13 (2009) 398–417.
[29] I. Rojas, J. González, H. Pomares, J.J. Merelo, P.A. Castillo, G. Romero, Statistical analysis of the main parameters involved in the design of a genetic
algorithm, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 32 (2002) 31–37.
[30] G. Rudolph, Convergence analysis of canonical genetic algorithms, IEEE Trans. Neural Networks 5 (1994) 96–101.
[31] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, fifth ed., Chapman & Hall/CRC, 2011.
[32] R. Storn, K. Price, Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces, J. Global Optim. Arch. 11
(1997) 341–359.
[33] P. Suganthan, N. Hansen, J. Liang, K. Deb, Y. Chen, A. Auger, S. Tiwari, Problem Definitions and Evaluation Criteria for the CEC’2005 Special Session on
Real Parameter Optimization. Nanyang Technological University, Technical Report, 2005. <www.ntu.edu.sg/home/epnsugan/index_files/cec-05/Tech-
Report-May-30-05.pdf>.
[34] K. Tang, X. Yao, P.N. Suganthan, C. MacNish, Y.P. Chen, C.M. Chen, Z. Yang, Benchmark Functions for the CEC’2008 Special Session and Competition on
Large Scale Global Optimization. Nature Inspired Computation and Applications Laboratory, USTC, China, Technical Report, 2007.
[35] N. Vecek, M. Mernik, M. Crepinšek, A chess rating system for evolutionary algorithms: a new method for the comparison and ranking of evolutionary
algorithms, Inform. Sci. 277 (2014) 656–679.
[36] H. Wang, Z. Wu, S. Rahnamayan, Enhanced opposition-based differential evolution for solving high-dimensional continuous optimization problems,
Soft Comput. 15 (2011) 2127–2140.
[37] M. Weber, F. Neri, V. Tirronen, Shuffle or update parallel differential evolution for large scale optimization, Soft Comput. 15 (2011) 2089–2107.
[38] D.L. Whitley, S. Rana, J. Dzubera, K.E. Mathias, Evaluating evolutionary algorithms, Artif. Intell. 85 (1996) 245–276.
[39] D. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1997) 67–82.
[40] Z. Yang, K. Tang, X. Yao, Scalability of generalized adaptive differential evolution for large-scale continuous optimization, Soft Comput. 15 (2011)
2141–2155.
[41] J.H. Zar, Biostatistical Analysis, fifth ed., Prentice Hall, 2009.
[42] S.Z. Zhao, P.N. Suganthan, Comprehensive comparison of convergence performance of optimization algorithms based on nonparametric statistical
tests, in: Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2012, Brisbane, Australia, June 10–15.
[43] S.Z. Zhao, P.N. Suganthan, S. Das, Self-adaptive differential evolution with multi-trajectory search for large scale optimization, Soft Comput. 15 (2011)
2175–2185.