A General Deep Reinforcement Learning Hyperheuristic Framework For Solving Combinatorial Optimization Problems

Article history: Received 11 October 2021; Accepted 11 January 2023; Available online 16 January 2023

Keywords: Heuristics; Hyperheuristic; Adaptive metaheuristic; Deep reinforcement learning; Combinatorial optimization

Abstract

Many problem-specific heuristic frameworks have been developed to solve combinatorial optimization problems, but these frameworks do not generalize well to other problem domains. Metaheuristic frameworks aim to be more generalizable compared to traditional heuristics; however, their performance suffers from poor selection of low-level heuristics (operators) during the search process. An example of heuristic selection in a metaheuristic framework is the adaptive layer of the popular framework of Adaptive Large Neighborhood Search (ALNS). Here, we propose a selection hyperheuristic framework that uses Deep Reinforcement Learning (Deep RL) as an alternative to the adaptive layer of ALNS. Unlike the adaptive layer, which only considers heuristics' past performance for future selection, a Deep RL agent is able to take into account additional information from the search process, e.g., the difference in objective value between iterations, to make better decisions. This is due to the representation power of Deep Learning methods and the decision making capability of the Deep RL agent, which can learn to adapt to different problems and instance characteristics. In this paper, by integrating the Deep RL agent into the ALNS framework, we introduce Deep Reinforcement Learning Hyperheuristic (DRLH), a general framework for solving a wide variety of combinatorial optimization problems, and show that our framework is better at selecting low-level heuristics at each step of the search process compared to ALNS and a Uniform Random Selection (URS). Our experiments also show that while ALNS can not properly handle a large pool of heuristics, DRLH is not negatively affected by increasing the number of heuristics.

© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

Corresponding author. E-mail addresses: [email protected] (J. Kallestad), [email protected] (R. Hasibi), [email protected] (A. Hemmati), [email protected] (K. Sörensen). https://doi.org/10.1016/j.ejor.2023.01.017
1. Introduction

A metaheuristic is an algorithmic framework that offers a coherent set of guidelines for the design of heuristic optimization methods. Classical frameworks such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and Simulated Annealing (SA) are examples of such frameworks (Dokeroglu, Sevinc, Kucukyilmaz, & Cosar, 2019). Moreover, there is a large body of literature that addresses solving combinatorial optimization problems using metaheuristics. Among these, Adaptive Large Neighbourhood Search (ALNS) (Ropke & Pisinger, 2006) is one of the most widely used metaheuristics. It is a general framework based on the principle of Large Neighbourhood Search (LNS) of Shaw (1998), where the objective value is iteratively improved by applying a set of "removal" and "insertion" operators on the solution. In ALNS, each of the removal and insertion operators has a weight associated with it that determines the probability of selecting it during the search. These weights are continuously updated after a certain number of iterations (called a segment) based on their recent effect on improving the quality of the solution during the segment. The ALNS framework was early on an approach specific to routing problems. However, in recent years, there has been a growing number of studies that employ this approach for other problem types, e.g., scheduling problems (Laborie & Godard, 2007). Its high quality of performance at finding solutions has made it a go-to approach in many recent studies on combinatorial optimization problems (Aksen, Kaya, Sibel Salman, & Özge Tüncel, 2014; Chen, Demir, & Huang, 2021; Demir, Bektaş, & Laporte, 2012; Friedrich & Elbert, 2022; Grangier, Gendreau, Lehuédé, & Rousseau, 2016; Gullhav, Cordeau, Hvattum, & Nygreen, 2017; Li, Chen, & Prins, 2016). The ALNS framework has several advantages. For most optimization problems, a number of
well-performing heuristics are already known which can be used as the operators in the ALNS framework. Due to the large size and diversity of the neighborhoods, the ALNS algorithm will explore huge chunks of the solution space in a structured way. As a result, ALNS is very robust as it can adapt to different characteristics of the individual instances, and is able to avoid being trapped in local optima (Pisinger & Ropke, 2019). According to Turkeš, Sörensen, & Hvattum (2021), the adaptive layer of ALNS has only minor impact on the objective function value of the solutions in the studies that have employed this framework. Moreover, the information that the adaptive layer uses for selecting heuristics is limited to the past performance of each heuristic. This limited data can make the adaptive layer naïve in terms of decision making capability, because it is not able to capture other (problem-independent) information about the current state of the search process, e.g., the difference in cost between past solutions, whether the current solution has been encountered before during the search, or the number of iterations since the solution was last changed, etc. We refer to the decision making capability of ALNS as performing on a "macro-level" in terms of adaptability, i.e., the weight of each heuristic is only updated at the end of each segment. This means that the heuristics selected within a segment are sampled according to the fixed probabilities of the segment. This limitation makes it impossible for ALNS to take advantage of any short-term dependencies that occur within a segment and that could help the heuristic selection process.

Another area where ALNS struggles is when faced with a large number of heuristics to choose from. In order to find the best set of available heuristics for ALNS for a specific setting, initial experiments are often required to identify and remove inefficient heuristics, and this can be both time consuming and computationally expensive (Hemmati & Hvattum, 2017). Furthermore, some heuristics are known to perform very well for specific problem variations or specific conditions during the search, but they may have a poor average performance. In this case, it might be beneficial to remove these from the pool of heuristics available to ALNS in order to increase the average performance of ALNS, but this results in a less powerful pool of heuristics that is unable to perform as well during these specific problem variations and conditions.

To address the issues in ALNS, one can use Reinforcement Learning (RL). RL is a subset of machine learning concerned with "learning how to make decisions", that is, how to map situations to actions so as to maximize a numerical reward signal. One of the main tasks in machine learning is to generalize a predictive model based on available training data to new unseen situations. An RL agent learns how to generalize a good policy through interaction with an environment which returns a reward in exchange for receiving an action from the agent. Therefore, through a trial-and-error search process, the agent is trained to achieve the maximum expected future reward at each step of decision making, conditioned on the current situation (state). Thus, training an RL agent (to achieve the best possible results in similar situations) makes the agent aware of the dynamics of the environment as well as adaptable to similar environments with slightly different settings. One of the more recent approaches in RL is Deep RL, which benefits from the powerful function approximation property of deep learning tools. In this approach, the different functions that are used to train and make decisions in an RL agent are implemented using Artificial Neural Networks (ANNs). Different Deep RL algorithms dictate the training mechanism and interaction of the ANNs in the decision making process of the agent (Sutton & Barto, 2018). Therefore, integration of Deep RL into the adaptive layer of ALNS can make the resulting framework much smarter at making decisions at each iteration and improve the overall performance of the framework.

In this paper, we propose Deep Reinforcement Learning Hyperheuristic (DRLH), a general selection hyperheuristic framework (definition in Section 2) for solving combinatorial optimization problems. In DRLH, we replace the adaptive layer of ALNS with a Deep RL agent responsible for selecting heuristics at each iteration of the search. Our Deep RL agent is trained using the Proximal Policy Optimization (PPO) method of Schulman, Wolski, Dhariwal, Radford, & Klimov (2017), which is a standard approach for stable training of Deep RL agents in different environments. The proposed DRLH utilizes a search state consisting of a problem-independent feature set from the search process and is trained with a problem-independent reward function that encourages better solutions. This approach makes the framework easily applicable to many combinatorial optimization problems without any change in the method, given a separate training step for each problem. The training process of DRLH makes it adaptable to different problem conditions and settings, and ensures that DRLH is able to learn good strategies of heuristic selection prior to testing, while also being effective when encountering new search states. In contrast to the macro-level decision making of ALNS, the proposed DRLH makes decisions on a "micro-level", meaning that only the current search state information affects the probabilities of choosing heuristics. This allows the probabilities of selecting heuristics to change quickly from one iteration to the next, helping DRLH adapt to new information of the search as soon as it becomes available. The Deep RL agent in DRLH is able to effectively leverage this search state information at each step of the search process in order to make better decisions for selecting heuristics compared to ALNS.

To evaluate the performance and generalizability of DRLH, we choose four different combinatorial optimization problems to benchmark against different baselines in terms of best objective found and the speed of convergence, as well as the time it takes to solve each problem. These problems include the Capacitated Vehicle Routing Problem (CVRP), the Parallel Job Scheduling Problem (PJSP), the Pickup and Delivery Problem (PDP), and the Pickup and Delivery Problem with Time Windows (PDPTW). These problems are commonly used for evaluation in the literature and are diverse in terms of the difficulty of finding good and feasible solutions. They additionally correspond to a broad scope of real world applications. For each problem, we create separate training and test datasets. In our experiments, we compare the performance of DRLH on different problem sizes and over an increasing number of iterations of the search, and demonstrate how the heuristic selection strategy of DRLH differs from other baselines throughout the search process.

Our experiments show the superiority of DRLH compared to the popular method of ALNS in terms of performance quality. For each of the problem sets, DRLH is able to consistently outperform other baselines when it comes to best objective value, specifically on larger instance sizes. Additionally, DRLH does not add any overhead to the instance solve time, and the performance gain is a result of the decision making capability of the Deep RL agent used. Further experiments also validate that, unlike other algorithms, the performance of DRLH is not negatively affected by increasing the number of available heuristics to choose from. In contrast to this, ALNS struggles when handling a large number of heuristics to choose from. This advantage of our framework makes the development process for DRLH very simple, as DRLH seems to be able to automatically discover the effectiveness of different heuristics during the training phase without the need for initial experiments in order to manually reduce the set of heuristics.

The remainder of this paper is organized as follows: In Section 2, related previous work in hyperheuristics and Deep RL is presented. In Section 3, we propose the overall algorithm of DRLH as well as the choice of heuristics and parameters. The
descriptions of the four combinatorial optimization problems used for benchmarking purposes are illustrated in Section 4. The experimental setup and the results of our evaluation are presented in Sections 5 and 6, respectively.

2. Related work

In this section, we first define the term "Hyperheuristic" and review some of the traditional work that falls into this category and point out their limitations. We also mention some of the methods that employ Deep RL for solving combinatorial problems and their shortcomings. In the end, we explain how we combine the best of the two domains (Hyperheuristic and Deep RL) to take advantage of both their methodologies.

The term hyperheuristic was first used in the context of combinatorial optimization by Cowling, Kendall, & Soubeiga (2001) and described as heuristics to choose heuristics. Burke et al. (2010) later extended the definition of hyperheuristic to "a search method or learning mechanism for selecting or generating heuristics to solve computational search problems". The most common classification of hyperheuristics makes the distinction between selection hyperheuristics and generation hyperheuristics. Selection hyperheuristics are concerned with creating a selection mechanism for heuristics at each step of the search, while generation hyperheuristics are concerned with generating new heuristics using basic components from already existing heuristic methods. This paper will focus on selection hyperheuristic methods.

Although it is possible to create highly effective problem-specific and heuristic-specific methods for heuristic selection, these methods do not always generalize well to other problem domains and different sets of heuristics. A primary motivation of hyperheuristic research is therefore the development of general-purpose, problem-independent methods that can deliver good quality solutions for many combinatorial optimization problems without having to make significant modifications to the methods. Thus, advancements done in hyperheuristic research aim to be easily applicable by experts and non-experts alike, to various problems and heuristic sets, without requiring extra effort such as domain knowledge about the specific problem to be solved.

A classic example of using RL in hyperheuristics is the work of Özcan, Misir, Ochoa, & Burke (2010), in which they propose a framework that uses a traditional RL method for solving examination timetabling. Performance is compared against a simple random hyperheuristic and some previous work, and results show that using RL obtains better results than simply selecting heuristics at random. The RL used here learns during the search process by adjusting the probabilities of choosing heuristics based on their recent performance during the search. This type of RL framework shares many similarities with the ALNS framework, and therefore suffers from the same limitations as those mentioned for ALNS.

Apart from RL, supervised learning, which is another machine learning technique, has also been utilized in hyperheuristic frameworks to improve performance. A hyperheuristic method for the Vehicle Routing Problem named Apprentice Learning-based Hyper-heuristic (ALHH) was proposed by Asta & Özcan (2014), in which an apprentice agent seeks to imitate the behavior of an expert agent through supervised learning. The training of the ALHH works by running the expert on a number of training instances and recording the selected actions of the expert together with a search state that consists of the previous action used and the change in objective function value for the past n steps. These recordings of search state and action pairs build up a training dataset on which a decision tree classifier is used in order to predict the action choice of the expert. This makes up a supervised classification problem in which the final accuracy of the model is reported to be around 65%. In the end, ALHH's performance is compared against the expert and is reported to perform very similarly to the expert, and even slightly outperforming the expert for some instances.

Tyasnurita, Özcan, Shahriar, & John (2015) further improved upon the apprentice learning approach by replacing the decision tree classifier with a multilayer perceptron (MLP) neural network, and named their approach MLP-ALHH. This change increased the representational power of the search state and resulted in a better performance that is reported to even outperform the expert. A limitation of ALHH and MLP-ALHH is their use of the supervised learning framework, which makes the performance of these approaches bounded by the expert algorithm's performance. A consequence of this is that the feedback used to train the predictive models of ALHH and MLP-ALHH is binary, i.e., it either matches that of the expert or not, leaving no room for alternative strategies that might perform even better than the expert. In contrast, DRLH uses a Deep RL framework that neither requires, nor is bounded by, an expert agent and therefore has more potential to outperform existing methods by coming up with new ways of selecting heuristics. The feedback used to train DRLH depends on the effect of the action on the solutions, and the amount received varies depending on several factors. Additionally, DRLH takes future iterations of the search into account, while ALHH and MLP-ALHH only consider the immediate effect of the action on the solution. Because of this, diversifying behavior is encouraged in DRLH when it gets stuck, as it will help improve the solution in future iterations. Another difference of DRLH compared to ALHH and MLP-ALHH is that the features of the search state used by DRLH contain more information compared to the search state of the other two methods, which ultimately makes the agent more aware of the search state and thus capable of making effective decisions.

In addition to hyperheuristic approaches, there have also recently been many attempts at solving popular routing problems using Deep RL by the machine learning community. A big limitation of these works is that they all rely on problem-dependent information, and are usually designed to solve a single problem or a small selection of related problems, often requiring significant changes to the approach in order to make them work for several problems. In the first versions of these studies, Deep RL is used as a constructive heuristic approach for solving the vehicle routing problem in which the agent, representing the vehicle, selects the next node to visit at each time step (Kool, van Hoof, & Welling, 2019; Nazari, Oroojlooy, Snyder, & Takac, 2018). Although this is very effective when compared to simple construction heuristics for solving routing problems, it lacks the quality of solutions provided by iterative metaheuristic approaches, as well as being unable to find feasible solutions in the case of more difficult routing problems that involve more advanced constraints, such as the pickup and delivery problem with time windows.

Another approach that leverages Deep RL for solving combinatorial optimization problems is to take advantage of the decision making ability of the agent in generating or selecting low-level heuristics to be applied on the solution. Hottung & Tierney (2019) have used a Deep RL agent to generate a heuristic for rebuilding partially destroyed routes in the CVRP using a large neighbourhood search framework. This method is an example of heuristic generation and is specifically designed to solve the CVRP. Thus, it can not easily be generalized to other problem domains. In Chen & Tian (2019), a framework is presented for using two Deep RL agents for finding a node in the solution and the best heuristic to apply on that node at each step. Although the authors claim that this method is generalizable to three different combinatorial optimization problems, the details of the problem representation and the type of ANNs used for the agents change a lot from one problem to another, depending on the nature of the problem. Additionally, one would have to come up with new inputs and representations when applying this method to other optimization problems that are not discussed in the study.
The heuristic generation process follows the steps in Algorithm 2. The set H consists of all possible heuristics that can be applied on the solution x at each iteration. The general method for obtaining these heuristics is to combine a removal and an insertion operator. Furthermore, additional heuristics can also be placed in H that do not share the characteristic of being a combination of removal and insertion operators. In the following, we present one example set of H for the problem types considered in this paper.

Table 1. List of all removal operators.

Table 2. List of all insertion operators.

3.2.3. Additional heuristics

Unlike in ALNS, where only removal and insertion operators are used, our framework can also make use of standalone heuristics that share neither of these types of characteristics. An example of one such additional heuristic, "Find_single_best", is responsible for generating the best possible new solution from the incumbent by changing one element. This heuristic calculates the cost of removing each element and re-inserting it with "Insert_greedy", and applies this procedure on the solution x for the element that achieves the minimum cost f(x). "Find_single_best" is the only additional heuristic that is used in the proposed sample set of heuristics, H.

3.3. Acceptance criteria and stopping condition

(Goodfellow, Bengio, & Courville, 2016). In this scenario, the aim is to obtain the optimal policy π* by tuning θ, which represents the weights of the MLP network.

The training process for an RL agent is illustrated in Algorithm 3. For training the weights of the MLP, we follow the
Table 3. A list of all features used for the state representation:

reduced_cost: The difference in cost between the previous and the current solution
cost_from_min: The difference in cost between the current and the best found solution
cost: The cost of the current solution
min_cost: The cost of the best found solution
temp: The current temperature
cs: The cooling schedule (α)
no_improvement: The number of iterations since the last improvement
index_step: The iteration number
was_changed: 1 if the solution was changed from the previous one, 0 otherwise
unseen: 1 if the solution has not previously been encountered in the search, 0 otherwise
last_action_sign: 1 if the previous step resulted in a better solution, 0 otherwise
last_action: The action in the previous iteration, one-hot encoded

training and should be designed in a way that helps the agent optimize the objective of the problem. In the following, we explain the choice for each of these properties.

as

π(h | s, θ) = Pr{A_t = h | S_t = s, θ_t = θ}.    (4)
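As an illustration of this state-to-policy mapping, the sketch below assembles the Table 3 features into a fixed-length vector and feeds it through a small MLP that outputs a probability for each heuristic in H, i.e., one possible realization of π(h | s, θ) from Eq. (4). The hidden-layer sizes and the bookkeeping interface of the hypothetical `search` object are illustrative assumptions; the exact architecture and hyperparameters used in the experiments are those of Table 4 and are not reproduced here.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class HeuristicPolicy(nn.Module):
    """MLP policy pi(h | s, theta): search-state features -> distribution over heuristics."""

    def __init__(self, state_dim: int, n_heuristics: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_heuristics),  # one logit per low-level heuristic in H
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(state))


def build_state(search, n_heuristics: int) -> torch.Tensor:
    """Assemble the problem-independent features of Table 3 into one vector.

    `search` is an assumed object exposing the bookkeeping values tracked during the search.
    """
    last_action_onehot = [0.0] * n_heuristics
    if search.last_action is not None:
        last_action_onehot[search.last_action] = 1.0
    features = [
        search.reduced_cost,             # cost difference: previous vs. current solution
        search.cost_from_min,            # cost difference: current vs. best found solution
        search.cost,                     # cost of the current solution
        search.min_cost,                 # cost of the best found solution
        search.temp,                     # current simulated-annealing temperature
        search.cooling_schedule,         # cooling schedule alpha
        search.no_improvement,           # iterations since the last improvement
        search.index_step,               # iteration number
        float(search.was_changed),       # 1 if the solution changed in the previous iteration
        float(search.unseen),            # 1 if the current solution was not seen before
        float(search.last_action_sign),  # 1 if the previous step improved the solution
    ] + last_action_onehot
    return torch.tensor(features, dtype=torch.float32)


# Usage at one iteration of the search (hypothetical heuristic pool H):
# policy = HeuristicPolicy(state_dim=11 + len(H), n_heuristics=len(H))
# dist = policy(build_state(search, len(H)))
# h = dist.sample().item()  # index of the heuristic to apply in this iteration
```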
4. Problem sets

We consider four sets of combinatorial optimization problems as examples of problems that can be solved using DRLH. These problems are the Capacitated Vehicle Routing Problem (CVRP), the Parallel Job Scheduling Problem (PJSP), the Pickup and Delivery Problem (PDP) and the Pickup and Delivery Problem with Time Windows (PDPTW).

4.1. CVRP

The Capacitated Vehicle Routing Problem is one of the most studied routing problems in the literature. It consists of a set of N orders that need to be served by any of M vehicles. Additionally, there is a depot from which the vehicles travel and to which they return when serving the orders. Following the previous work, the number of vehicles in this particular problem is not fixed, but is naturally limited to M ∈ {1, ..., N}, meaning that the maximum number of vehicles that can be utilized is N and the minimum number is 1. Usually the number of vehicles used will fall somewhere in between, depending on which number results in the best solution. Each order has a weight W_i associated with it, and the vehicles have a maximum capacity. The sequence of orders that a vehicle visits after leaving the depot and before returning to the depot is referred to as a tour. There needs to be a minimum of one tour and a maximum of N tours. The combined weight of the orders in a tour can not exceed the maximum capacity of the vehicle, and so several tours are often needed in order to solve the CVRP problem. The objective of this problem is to create a set of tours that minimizes the total distance travelled by all the vehicles that are serving at least one order.

being able to perform the pick up or delivery. A vehicle is never allowed to arrive at a node after the end of the time window. Additionally, a service time is considered for each time a call gets picked up or delivered, i.e., the time it takes a vehicle to load or deliver the goods at each node. For each call, a set of feasible vehicles is determined. Each vehicle has a predetermined maximum capacity of goods as well as a starting terminal where the duty of the vehicle starts. Moreover, a start time is assigned to each vehicle, indicating the time that the vehicle leaves its starting terminal. The vehicle must leave its start terminal at the starting time, even if a possible waiting time at the first node visited occurs. The goal is to construct valid routes for each vehicle, such that time windows and capacity constraints are satisfied along each route, each pickup is served before the corresponding delivery, pickup and delivery of each call are served on the same route, and each vehicle only serves calls it is allowed to serve. The construction of the routes should be in such a way that they minimize the cost function. There is also a compatibility constraint between the vehicles and the calls. Thus, not all vehicles are able to handle all the calls. If we are not able to handle all calls with our fleet, we have to outsource them and pay the cost of not transporting them. For more details, readers are referred to Hemmati, Hvattum, Fagerholt, & Norstad (2014).

5. Experimental setup

In this section, we explain the baseline methods, the process of hyperparameter selection, and the dataset generation methods used for evaluation of the DRLH framework.

5.1. Experimental environment
5.2.3. Tuned random selection (TRS)

We introduce another baseline to our experiments which is referred to as TRS. For this method, we tuned the probabilities of selecting heuristics using the method of IRace (López-Ibáñez, Dubois-Lacoste, Pérez Cáceres, Birattari, & Stützle, 2016). The package "IRace" applies iterative F-Race to tune a set of parameters in an optimization algorithm (the heuristic probabilities in our method) based on the performance on the training dataset.

5.2.4. Attention Module (AM) based Deep RL heuristic

We also consider the AM method of Kool et al. (2019), which achieved state-of-the-art results among the Deep RL based methods for solving combinatorial optimization problems. This method uses a Deep RL agent combined with deep attention representation learning to build the solution at each step in a constructive manner using problem-specific features from the environment. As a result, when applied to new problems, a new set of features as well as a problem-specific representation learning scheme need to be defined. For example, the time window and vehicle incompatibility constraints were not mentioned in the original paper, and for that reason we can not solve the difficult problem of PDPTW with this framework.

5.3. Hyperparameter selection

The hyperparameters for the Deep RL agent determine the speed and stability of the training process and also the final performance of the trained model. A small learning rate will cause training to take longer, but the smaller updates to the neural network also increase the chance of a better final performance once the model has been fully trained. Because the training process is done in advance of the testing stage, we opt for a slow and stable approach in order to train the best models possible. The hyperparameters of the Deep RL agent for the experiments are listed in Table 4.

Table 4. The hyperparameters used during training for the Deep RL agent of DRLH.

In order to decide on the hyperparameters for DRLH, some initial experiments were performed on the PDP problem (as a simple baseline problem compared to the others) on a separate validation set to see which combinations performed best. The resulting set of hyperparameters has been applied for all experiments in this paper. Our motivation for doing so is that we wanted to test the generalizability of the framework in terms of the hyperparameters as well as the performance on different problems. By tuning the hyperparameters on a simpler problem and applying them to all other problems of all sizes and variations, we tried to avoid overtuning DRLH for every separate problem, to keep the evaluation fair for the baseline methods, and to make sure that the advantage of our approach lies in the decision making approach and not in the choice of hyperparameters for each problem. Moreover, this adds to the generalizability trait of the framework, which does not require hyperparameter selection for each specific problem. Based on our experiments, we found that this set of hyperparameters works very well across all the problem variations that we tested. It is likely that these hyperparameters can work for any underlying combinatorial optimization problem, as the hyperparameters for DRLH are related to the high-level problem of heuristic selection, which stays the same regardless of what the underlying combinatorial optimization problem actually is. In the case of ALNS, we apply the same set of optimized hyperparameters that are suggested by Hemmati et al. (2014), which is optimized for solving the benchmark of PDPTW.

5.4. Dataset generation

For all the problem variations we generate a distinct training set consisting of 5000 instances, and a distinct testing set consisting of 100 instances. Additionally, for PDPTW we also utilize a known set of benchmark instances for testing (Hemmati et al., 2014).

5.4.1. CVRP

CVRP data instances are generated in accordance with the generation scheme of Nazari et al. (2018) and Kool et al. (2019), but we also add two bigger problem variations. Instances of sizes N = 20, N = 50, N = 100, N = 200 and N = 500 are generated, where N is the number of orders. For each instance the depot location and node locations are sampled uniformly at random from the unit square. Additionally, each order has a size associated with it defined as γ̂ = γ_i / D_N, where γ_i is sampled from the discrete set {1, ..., 9}, and the normalization factor D_N is set as D_20 = 30, D_50 = 40, D_100 = 50, D_200 = 50, D_500 = 50 for instances with N orders, respectively.

5.4.2. PJSP

For the PJSP we generate instances of sizes N = 20, N = 50, N = 100, N = 300 and N = 500, where N is the number of jobs, using M = N/4 machines. Job i's required processing steps PS_i are sampled from the discrete set {100, 101, ..., 1000}, and machine m's speed S_m, in processing steps per time unit, is sampled from N(μ, σ²) with μ = 10, σ = 30; the speed is rounded to the nearest integer and bounded to be at least 1. From there we get that the time required to process job i on machine m is calculated as PS_i / S_m.

5.4.3. PDP

For this problem, PDP data instances of sizes N = 20, N = 50, and N = 100 are generated, where N is the number of nodes, based on the generation scheme of Nazari et al. (2018) and Kool et al. (2019). For each instance the depot location and node locations are sampled uniformly at random in the unit square. Half of the nodes are pickup locations, whereas the other half are the corresponding delivery locations. Additionally, each call has a size associated with it defined as γ̂ = γ_i / D_N, where γ_i is sampled from the discrete set {1, ..., 9}, and the normalization factor D_N is set as D_20 = 15, D_50 = 20, D_100 = 25 for each problem with N nodes, respectively.

5.4.4. PDPTW

For the PDPTW we use instances with different combinations of the number of calls and the number of vehicles, see Table 5.

Table 5. Properties of different variations of the PDPTW instance types.

For generating the training set and the 100 test instances, we use the provided instance generator of Hemmati et al. (2014). Additionally, we
use benchmark instances of Hemmati et al. (2014) for the remaining results. The benchmark test set consists of some instances of each variation, which are solved 10 times during testing in order to calculate the average best objective for each instance. Previous work by Homsi, Martinelli, Vidal, & Fagerholt (2020) has found the global optimal objectives for these instances, and we use these optimal values in order to calculate the Min Gap (%) and Avg Gap (%) to the optimal values for instances with 18, 35, 80 and 130 calls. Additionally, we also generate and test on a much larger instance size of 300 calls, where we do not have the exact global optimal objectives, but instead use the best known values found by DRLH with 10,000 iterations to calculate the Min Gap (%) and Avg Gap (%).

6. Results

In this section, we present the results of different experiments on the performance of DRLH. In the first experiment (Section 6.1), we set the number of iterations of the search to 1000 to compare the quality of the best found objective by each algorithm over a limited number of iterations for different problem sizes in the test set. In the next experiment (Section 6.2), we increase the number of iterations for all the methods and compare their performance when enough iterations are provided to fully explore the problem space. We also report the results on the benchmark instances of Hemmati et al. (2014) (Section 6.3). In order to demonstrate another advantage of using DRLH, we conduct an experiment with an increased number of heuristics to illustrate the dependence of each framework on the performance of individual heuristics when the number of heuristics exceeds a certain number (Section 6.4). Additionally, we report the convergence speed and the training and inference time of each framework on instances of each problem (Sections 6.5 and 6.6). Next, to gain insight into the reason behind the superiority of DRLH compared to the state of the art, we provide some figures and discuss the difference in strategy behind choosing a heuristic between DRLH and ALNS (Section 6.7). Finally, we compare the performance of DRLH with a Deep RL heuristic approach (Section 6.8). Additional experiments and results regarding the reward function, convergence speed, and dependency of DRLH on the size of the problem can be found in the Appendix.

6.1. Experiment on generated test set

For this experiment, each method was evaluated on a test set of 100 generated instances for each of the problems introduced in Section 4. Figure 1(a) shows the improvement in percentage that using DRLH, ALNS, and TRS have over using URS on CVRP instances of different sizes. We see that DRLH is able to outperform all the baselines for all the instance sizes except for the smallest size. There is also a clear trend that shows how DRLH becomes increasingly better compared to other methods on larger instance sizes. Figure 1(b) shows a similar result for the PJSP problem. We see that DRLH is able to outperform the other methods for all of the instance sizes tested. Compared to the previous results, we see that the degree of improvement on larger instance sizes is less prominent for DRLH, but we also see that ALNS does not perform noticeably better on larger instance sizes at all. Because of that, we still see a clear separation in performance between DRLH and ALNS on larger instance sizes that seems to grow with larger instance sizes. Finally, we observe a similar trend for PDP and PDPTW as for the other problems, which can be seen in Fig. 1(c) and (d), respectively. From this figure we see that DRLH outperforms ALNS and URS on all instance sizes tested and that the performance difference tends to increase with larger instance sizes.

Table 6. Average results for PDPTW instances with mixed call sizes after 1000 iterations (#C, #V; Min Gap (%), Avg Gap (%) and Time (s) for each method).

Tables 6, 7 and 8 report results after 1000, 5000 and 10,000 iterations, respectively. We see from the tables that DRLH outperforms ALNS and URS on all of the tests on average, showing that it can find high quality solutions and has a robust average performance. Furthermore, we can see that the performance difference between DRLH and the baselines increases on bigger instances, meaning that DRLH scales favorably with the size of the problem, making it more viable for big industrial-sized problems compared to ALNS and URS.

We have also included the average time in seconds for optimizing the test instances. Note that the difference in time-usage is not directly dependent on the framework for selecting the heuristics (DRLH, ALNS, URS), but rather on the difference in time-usage of the heuristics themselves. This means that if all the heuristics used the same amount of time, then there would not be any time difference between the frameworks. However, because there is a relatively large variation in the time-usage between the different heuristics, we see a considerable variation between the frameworks.
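As a concrete reading of these metrics, the small helper below computes the Min Gap (%) and Avg Gap (%) for one instance from the best objectives of its repeated runs and a reference value (the optimum from Homsi et al. (2020), or the best known value for the 300-call instances). The function name and interface are illustrative and not taken from the paper's implementation.

```python
def gaps_percent(run_best_objectives, reference):
    """Min Gap (%) and Avg Gap (%) of the best objectives of repeated runs,
    measured relative to a reference (optimal or best known) objective value."""
    gaps = [100.0 * (obj - reference) / reference for obj in run_best_objectives]
    return min(gaps), sum(gaps) / len(gaps)


# Example: 10 runs on one benchmark instance.
# min_gap, avg_gap = gaps_percent(best_obj_per_run, optimal_value)
```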
Table 7. Average results for PDPTW instances with mixed call sizes after 5000 iterations (#C, #V; Min Gap (%), Avg Gap (%) and Time (s) for each method).

Table 8. Average results for PDPTW instances with mixed call sizes after 10,000 iterations (#C, #V; Min Gap (%), Avg Gap (%) and Time (s) for each method).
all of them during the search in order to identify the best heuristics and take advantage of them during the search.

An important conclusion from this result (albeit one that needs further empirical proof) is that when using DRLH, it is possible to supply it with a large number of heuristics and let DRLH identify the best ones to use. This is not possible for ALNS, and consequently it is often necessary to spend time carrying out prior experiments with the aim of finding a small set of the best performing heuristics to include in the final ALNS model. This also resonates with the conclusion of Turkeš et al. (2021), who argue that the performance of ALNS benefits more from a careful a priori selection of heuristics than from an elaborate adaptive layer. Considering that prior experiments can be quite time consuming, using DRLH can lead to a simpler development phase where heuristics can be added to DRLH without needing to establish their effectiveness beforehand, and without having to worry whether adding them will hurt the overall performance. Should a heuristic be unnecessary, then DRLH will learn to not use it during the training phase.

In addition to DRLH having a simpler development phase, an increased (or more nuanced) set of heuristics also has a larger potential to work well for a wide range of conditions, such as for different problems, instance sizes and specific situations encountered in the search. In other words, reducing the set of heuristics could negatively affect the performance of ALNS, but much less so for DRLH. Some heuristics work well only in specific situations, and so removing these "specialized" heuristics due to their poor average performance gives less potential for ALNS to be able to handle a diverse set of problem and instance variations compared to DRLH, which learns to take advantage of any heuristic that performs well in specific situations. Of course, these claims are based on a limited number of experiments and should be validated in a broad range of (future) experiments.
Fig. 4. Average performance of DRLH, ALNS and URS on each of the problems.
to reach a better cost after less than 500 iterations than what ALNS is able to reach after 1000 iterations. With the exception of the CVRP problem, DRLH is also extremely efficient in the beginning of the search, reaching costs in only 100 iterations that take ALNS approximately 500 iterations to match. We refer to Appendix C for a complete collection of performance plots for all the problems that we have tested.

6.6. Training and inference time needed for each problem

Tables 9 and 10 report the time needed for training and for solving the instances of each problem, respectively. The main difference between DRLH and the baselines is the approach to decision making when it comes to choosing the next heuristic. This decision making process, on its own, does not add much overhead to the computational time of the methods. The main difference in the speed of these methods is the speed of the operators that they choose. This means that in some cases DRLH chooses operators that are faster or slower compared to the baselines, which results in lower or higher computational time. Therefore, when it comes to computational time, there is not much difference between these methods. This can also be seen in Table 10, in which in some cases DRLH is faster than the other two baselines and in some cases it is slower. It should be noted that the execution time of the operators can be improved if implemented carefully or using a faster programming language, e.g., C. However, the main focus of the paper is to improve the hyperheuristic approach of choosing the next heuristic at each step, not the execution time.

Table 9. Training time for DRLH on different problems (Problem, Size, #Iterations, #Training Instances, Total training time (s), Average time per instance (s)).

Table 10. Average time (seconds) required for solving the test instances for each method (DRLH, ALNS and URS):

Problem  Size  #Iterations  DRLH    ALNS    URS
CVRP     20    1k           4.08    11.58   7.75
CVRP     50    1k           11.58   35.17   23.52
CVRP     100   1k           34.28   99.58   50.65
CVRP     200   1k           102.76  221.22  94.07
CVRP     500   1k           621.74  664.54  238.86
PJSP     20    1k           20.37   18.06   5.65
PJSP     50    1k           30.69   41.84   15.9
PJSP     100   1k           57.05   76.15   34.0
PJSP     300   1k           199.37  237.58  110.81
PJSP     500   1k           453.92  462.34  195.67
PDP      20    1k           3.89    4.17    1.85
PDP      50    1k           31.4    20.15   9.93
PDP      100   1k           159.86  79.58   45.86
PDPTW    18    1k           32.61   23.85   9.18
PDPTW    35    1k           10.67   29.75   21.18
PDPTW    80    1k           34.82   71.27   68.33
PDPTW    130   1k           110.67  139.45  132.32
PDPTW    300   1k           500.9   438.65  361.39

6.7. Comparison between heuristic selection strategies

Figure 5 demonstrates the probability of selecting heuristics at each step of the search for DRLH and ALNS, in which each line corresponds to the probability of one heuristic at every step of the search. The "micro-level" heuristic usage of DRLH means that DRLH is able to drastically change the probabilities of selecting heuristics from one iteration to the next by taking advantage of the information provided by the search state, see Fig. 5(a) and (b). This is in contrast to the "macro-level" heuristic usage of ALNS, where the probabilities of selecting operators are only updated at the beginning of each segment, meaning that the decision making of ALNS within a single segment is random according to the locked probabilities for that segment, see Fig. 5(c). Depending on the problem and the available heuristics to select, there might exist exploitable strategies and patterns for heuristic selection, such as heuristic(s) that work well when used together, work well for escaping local minima, or work well on solutions not previously encountered during the search. Using DRLH, these types of exploitable strategies can be automatically discovered without the need for specially tailored algorithms designed by human experts. We refer to one such exploitable strategy found by DRLH on our problems with our provided set of heuristics as minimizing "wasted actions". We define a wasted action as the selection of a deterministic heuristic (in our case Find_single_best) for two consecutive unsuccessful iterations. The reason that this action is "wasted" is because of the deterministic nature of the heuristic, which makes it so that if the solution did not change in the previous iteration, then it is guaranteed not to change in the following iteration as well. Even though we have not specifically programmed DRLH to utilize this strategy, it becomes clear by examining Table 11 that DRLH has picked up on this strategy when learning to optimize micro-level heuristic selection. Table 11 shows that the number of wasted actions for DRLH is almost non-existent for most problem variations. ALNS, on the other hand, ends up with far more wasted actions than DRLH, even though ALNS also uses Find_single_best much more seldom on average. Figure 5(c) shows how the heuristic probabilities for ALNS remain locked within the segments, making it impossible for ALNS to exploit strategies such as minimizing wasted actions, which relies on excellent micro-level heuristic selection such as what DRLH demonstrates.

Fig. 5. Example of the probability of selecting heuristics for DRLH and ALNS.

Table 11. The percentage of wasted actions out of the total number of deterministic heuristics selected, averaged over the test set for each problem.

6.8. Performance comparison with AM deep RL heuristic

For this experiment, we ran the AM method of Kool et al. (2019) on our test datasets for the CVRP problem. The trained models and the implementation needed to solve the problem have been provided publicly by the authors of that paper. The dataset generation procedure for both our work and the AM paper follows the work of Nazari et al. (2018). As a result, the models are well fit to be evaluated on our test set. For their method we considered three different approaches: Greedy, Sample_128 and Sample_1280. In the greedy approach, at each step the node with the highest probability is chosen. In the sampling approaches, 128 and 1280 different solutions are sampled based on the probability of each node at each step. We test these methods for sizes n = 20, 50, 100 of the CVRP problem. The time and resources (Graphical Processing Units) needed to train the AM method for sizes larger than 100 scale exponentially due to the heavy calculations needed for their representation learning method. Therefore, we only solve this problem for the mentioned instance sizes.

Figure 6 illustrates the comparison of the performance of our method with the AM method of Kool et al. (2019). As shown in
#C   #V   | DRLH with R_t^5310          | DRLH with R_t^MC
          | Min Gap (%)   Avg Gap (%)   | Min Gap (%)   Avg Gap (%)
18   5    | 0.00          0.00          | 0.00          0.13
35   7    | 0.67          2.02          | 0.42          2.32
80   20   | 1.80          2.95          | 2.55          3.87
130  40   | 1.93          2.84          | 2.20          3.04
300  100  | 0.00          0.64          | 1.12          1.88

A2. R_t^MC

R_t^MC = (f(x_best) − f(x)) / f(x_best)    (A.2)

The R_t^MC is a reward function that more directly correlates with the intended objective of minimizing the cost of the best found solution, and of achieving this as quickly as possible. Instead of focusing on rewarding actions that directly improve the solution, this reward function is subject to the performance of the entire search process up to the current step, putting a greater emphasis on acting quickly and selecting heuristics that have a greater impact on the solution. The challenge with using this reward function compared to reward functions such as R_t^5310 and R_t^PM is that there is an inherent delay between when a good heuristic is selected and when the reward function gives a good reward. This makes it more difficult to train an agent using this reward function, making training times much longer and less stable than with the R_t^5310 reward function.

Having said that, the potential upside of using this reward function is very promising, and the results in Table A.1 show that R_t^MC is able to outperform the R_t^5310 reward function on 1k iteration searches. However, the agents were unable to learn effectively for larger numbers of iterations such as 10k (Table A.2), and these results show that R_t^MC performs worse than R_t^5310 on 10k iteration searches. A potential reason why the R_t^MC agents were unable to learn well on 10k iteration searches is that improving iterations are much less frequent, making the feedback signal from the R_t^MC reward function even more delayed and high variance. Another potential reason is that solving 10k iteration searches likely needed more training than what was possible to carry out for our experiments due to time constraints. We encourage future work on improving the integration of the R_t^MC reward function into the framework of DRLH as it likely has a lot of potential.

Tables A.3, A.4 and A.5 list the extended set of heuristics built up from 14 removal operators, 10 insertion operators and 2 additional heuristics, for a total of 14 × 10 + 2 = 142 heuristics, using the generation scheme of Algorithm 2. Most of these heuristics only use problem-independent information, but some of them rely on problem-dependent information specific to the PDPTW problem.

Table A.3. List of extended removal operators.

Random_remove_XS: Removes between 2 and 5 elements chosen randomly
Random_remove_S: Removes between 5 and 10 elements chosen randomly
Random_remove_M: Removes between 10 and 20 elements chosen randomly
Random_remove_L: Removes between 20 and 30 elements chosen randomly
Random_remove_XL: Removes between 30 and 40 elements chosen randomly
Random_remove_XXL: Removes between 80 and 100 elements chosen randomly
Remove_largest_D_S: Removes 5 to 10 elements with the largest D_i
Remove_largest_D_L: Removes 20 to 30 elements with the largest D_i
Remove_τ: Removes a random segment of 2 to 5 consecutive elements in the solution
Remove_least_frequent_S: Removes between 5 and 10 elements that have been removed the least
Remove_least_frequent_M: Removes between 10 and 20 elements that have been removed the least
Remove_least_frequent_XL: (description not recovered)
Remove_one_vehicle: Removes all the elements in one vehicle
Remove_two_vehicles: Removes all the elements in two vehicles

Table A.4. List of extended insertion operators.

Insert_greedy: Inserts each element in the best possible position
Insert_beam_search: Inserts each element in the best position using beam search
Insert_by_variance: Sorts the insertion order based on variance and inserts each element in the best possible position
Insert_first: Inserts each element randomly in the first feasible position
Insert_least_loaded_vehicle: Inserts each element into the least loaded available vehicle
Insert_least_active_vehicle: Inserts each element into the least active available vehicle
Insert_close_vehicle: Inserts each element into the closest available vehicle
Insert_group: Identifies the vehicles that can fit the most of the removed elements and inserts the elements into these
Insert_by_difficulty: Inserts each element using Insert_greedy, ordered by their difficulty, which is a function of their compatibility with vehicles, strictness of time windows, size and more
Insert_best_fit: Inserts each element into the vehicle that is the most compatible with the call

Table A.5. List of extended additional heuristics.
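The 142-heuristic pool above is obtained by pairing every removal operator with every insertion operator and appending the standalone heuristics, following the generation scheme the text attributes to Algorithm 2. A minimal sketch of that pairing is given below; the operator signatures (a removal operator returning a partial solution and the removed elements, an insertion operator re-inserting them) are assumptions for illustration, not the paper's implementation.

```python
from itertools import product


def build_heuristic_pool(removal_ops, insertion_ops, additional_heuristics):
    """Combine removal and insertion operators into composite heuristics and
    append standalone heuristics, mirroring the generation scheme of Algorithm 2."""
    pool = []
    for remove, insert in product(removal_ops, insertion_ops):
        # Each composite heuristic removes part of the solution, then re-inserts it.
        def heuristic(solution, remove=remove, insert=insert):
            partial, removed = remove(solution)
            return insert(partial, removed)
        heuristic.__name__ = f"{remove.__name__}+{insert.__name__}"
        pool.append(heuristic)
    pool.extend(additional_heuristics)  # e.g. Find_single_best
    return pool


# With the extended sets above: 14 removal x 10 insertion + 2 additional = 142 heuristics.
```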
Appendix C. Additional performance plots

Figures 8, 9, 10 and 11 show the performance of DRLH, ALNS and URS averaged over the test set for all the problems that we have tested. These show that DRLH usually reaches better solutions more quickly than ALNS and URS, as well as ending up with better solutions overall.

Appendix D. Experiment on the cross size training scheme

In this experiment, in the training phase, an instance of a specific problem with a different size is solved by DRLH in each episode. This training scheme is referred to as Cross Size (CS) training. During test time, the trained model solved the test instances that were used in Section 6.1 as well as test instances of slightly different sizes than those seen during training. As seen in Fig. 12, it is possible to train one model that can handle many different variations of instance sizes quite well. Moreover, as shown in Fig. 12(e)-(h), the model does not specifically overfit on the specific instance sizes included in the training when evaluated on slightly different test data. This means that DRLH_CS generalizes very well, even to sizes higher than any of the ones included in the training.
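A minimal sketch of the Cross Size training loop described above is given below: each training episode draws a random instance size before generating and solving an instance. The size set and the generator/agent interfaces are illustrative placeholders, not the paper's code.

```python
import random

TRAIN_SIZES = [20, 50, 100]  # assumed set of instance sizes mixed during CS training


def train_cross_size(agent, generate_instance, n_episodes, n_iterations=1000):
    """Cross Size (CS) training: sample a different instance size every episode,
    so that a single DRLH model learns to handle several problem sizes."""
    for _ in range(n_episodes):
        size = random.choice(TRAIN_SIZES)
        instance = generate_instance(size)
        agent.run_episode(instance, n_iterations)  # one full search followed by a PPO update
```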
Fig. 12. Performance of DRLH with Cross Size (CS) training scheme on different problem sizes with 1k iterations.
References

Aksen, D., Kaya, O., Sibel Salman, F., & Özge Tüncel (2014). An adaptive large neighborhood search algorithm for a selective and periodic inventory routing problem. European Journal of Operational Research, 239(2), 413–426. https://doi.org/10.1016/j.ejor.2014.05.043
Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. CoRR. http://arxiv.org/abs/1606.06565
Asta, S., & Özcan, E. (2014). An apprenticeship learning hyper-heuristic for vehicle routing in HyFlex. Orlando, Florida. https://doi.org/10.1109/EALS.2014.7009505
Burke, E. K., Hyde, M., Kendall, G., Ochoa, G., Özcan, E., & Woodward, J. R. (2010). A classification of hyper-heuristic approaches. In M. Gendreau, & J.-Y. Potvin (Eds.), Handbook of metaheuristics (pp. 449–468). Boston, MA: Springer US.
Chen, C., Demir, E., & Huang, Y. (2021). An adaptive large neighborhood search heuristic for the vehicle routing problem with time windows and delivery robots. European Journal of Operational Research, 294(3), 1164–1180. https://doi.org/10.1016/j.ejor.2021.02.027
Chen, X., & Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems: vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/131f383b434fdf48079bff1e44e2d9a5-Paper.pdf
Cowling, P., Kendall, G., & Soubeiga, E. (2001). A hyperheuristic approach to scheduling a sales summit. In E. Burke, & W. Erben (Eds.), Practice and theory of automated timetabling III (pp. 176–190). Berlin, Heidelberg: Springer Berlin Heidelberg.
Crama, Y., & Schyns, M. (2003). Simulated annealing for complex portfolio selection problems. European Journal of Operational Research, 150(3), 546–571. https://doi.org/10.1016/S0377-2217(02)00784-1
Demir, E., Bektaş, T., & Laporte, G. (2012). An adaptive large neighborhood search heuristic for the pollution-routing problem. European Journal of Operational Research, 223(2), 346–359. https://doi.org/10.1016/j.ejor.2012.06.044
Dokeroglu, T., Sevinc, E., Kucukyilmaz, T., & Cosar, A. (2019). A survey on new generation metaheuristic algorithms. Computers & Industrial Engineering, 137, 106040. https://doi.org/10.1016/j.cie.2019.106040
Friedrich, C., & Elbert, R. (2022). Adaptive large neighborhood search for vehicle routing problems with transshipment facilities arising in city logistics. Computers & Operations Research, 137, 105491. https://doi.org/10.1016/j.cor.2021.105491
Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA, USA: MIT Press. http://www.deeplearningbook.org
Grangier, P., Gendreau, M., Lehuédé, F., & Rousseau, L.-M. (2016). An adaptive large neighborhood search for the two-echelon multiple-trip vehicle routing problem with satellite synchronization. European Journal of Operational Research, 254(1), 80–91. https://doi.org/10.1016/j.ejor.2016.03.040
Gullhav, A. N., Cordeau, J.-F., Hvattum, L. M., & Nygreen, B. (2017). Adaptive large neighborhood search heuristics for multi-tier service deployment problems in clouds. European Journal of Operational Research, 259(3), 829–846. https://doi.org/10.1016/j.ejor.2016.11.003
Hemmati, A., & Hvattum, L. M. (2017). Evaluating the importance of randomization in adaptive large neighborhood search. International Transactions in Operational Research, 24(5), 929–942. https://doi.org/10.1111/itor.12273
Hemmati, A., Hvattum, L. M., Fagerholt, K., & Norstad, I. (2014). Benchmark suite for industrial and tramp ship routing and scheduling problems. INFOR: Information Systems and Operational Research, 52(1), 28–38. https://doi.org/10.3138/infor.52.1.28
Homsi, G., Martinelli, R., Vidal, T., & Fagerholt, K. (2020). Industrial and tramp ship routing problems: Closing the gap for real-scale instances. European Journal of Operational Research, 283(3), 972–990. https://doi.org/10.1016/j.ejor.2019.11.068
Hottung, A., & Tierney, K. (2019). Neural large neighborhood search for the capacitated vehicle routing problem. CoRR. http://arxiv.org/abs/1911.09539
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680. https://doi.org/10.1126/science.220.4598.671
Kool, W., van Hoof, H., & Welling, M. (2019). Attention, learn to solve routing problems! In International conference on learning representations.
Laborie, P., & Godard, D. (2007). Self-adapting large neighborhood search: Application to single-mode scheduling problems. In Proceedings MISTA-07: Vol. 8. Paris.
Li, Y., Chen, H., & Prins, C. (2016). Adaptive large neighborhood search for the pickup and delivery problem with time windows, profits, and reserved requests. European Journal of Operational Research, 252(1), 27–38. https://doi.org/10.1016/j.ejor.2015.12.032
Lu, H., Zhang, X., & Yang, S. (2020). A learning-based iterative method for solving vehicle routing problems. In International conference on learning representations. https://openreview.net/forum?id=BJe1334YDH
López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Birattari, M., & Stützle, T. (2016). The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3, 43–58. https://doi.org/10.1016/j.orp.2016.09.002
Nazari, M., Oroojlooy, A., Snyder, L., & Takac, M. (2018). Reinforcement learning for solving the vehicle routing problem. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems: vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/9fb4651c05b2ed70fba5afe0b039a550-Paper.pdf
Özcan, E., Misir, M., Ochoa, G., & Burke, E. (2010). A reinforcement learning – great-deluge hyper-heuristic for examination timetabling. International Journal of Applied Metaheuristic Computing, 1, 39–59.
Pisinger, D., & Ropke, S. (2019). Large neighborhood search. In Handbook of metaheuristics (pp. 99–127). Springer.
Ropke, S., & Pisinger, D. (2006). An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows. Transportation Science, 40(4), 455–472. https://doi.org/10.1287/trsc.1050.0135
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347. http://dblp.uni-trier.de/db/journals/corr/corr1707.html#SchulmanWDRK17
Shaw, P. (1998). Using constraint programming and local search methods to solve vehicle routing problems. In M. Maher, & J.-F. Puget (Eds.), Principles and practice of constraint programming — CP98 (pp. 417–431). Berlin, Heidelberg: Springer Berlin Heidelberg.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge, MA, USA: A Bradford Book.
Turkeš, R., Sörensen, K., & Hvattum, L. M. (2021). Meta-analysis of metaheuristics: Quantifying the effect of adaptiveness in adaptive large neighborhood search. European Journal of Operational Research, 292(2), 423–442. https://doi.org/10.1016/j.ejor.2020.10.045
Tyasnurita, R., Özcan, E., Shahriar, A., & John, R. (2015). Improving performance of a hyper-heuristic using a multilayer perceptron for vehicle routing. Exeter, UK. http://eprints.nottingham.ac.uk/id/eprint/45707