
European Journal of Operational Research 309 (2023) 446–468


Interfaces with Other Disciplines

A general deep reinforcement learning hyperheuristic framework for solving combinatorial optimization problems

Jakob Kallestad a, Ramin Hasibi a,∗, Ahmad Hemmati a, Kenneth Sörensen b

a Department of Informatics, University of Bergen, Norway
b Faculty of Business and Economics, ANT/OR - University of Antwerp Operations Research Group, Belgium

Article info

Article history:
Received 11 October 2021
Accepted 11 January 2023
Available online 16 January 2023

Keywords:
Heuristics
Hyperheuristic
Adaptive metaheuristic
Deep reinforcement learning
Combinatorial optimization

Abstract

Many problem-specific heuristic frameworks have been developed to solve combinatorial optimization problems, but these frameworks do not generalize well to other problem domains. Metaheuristic frameworks aim to be more generalizable compared to traditional heuristics, however their performances suffer from poor selection of low-level heuristics (operators) during the search process. An example of heuristic selection in a metaheuristic framework is the adaptive layer of the popular framework of Adaptive Large Neighborhood Search (ALNS). Here, we propose a selection hyperheuristic framework that uses Deep Reinforcement Learning (Deep RL) as an alternative to the adaptive layer of ALNS. Unlike the adaptive layer, which only considers heuristics' past performance for future selection, a Deep RL agent is able to take into account additional information from the search process, e.g., the difference in objective value between iterations, to make better decisions. This is due to the representation power of Deep Learning methods and the decision making capability of the Deep RL agent, which can learn to adapt to different problems and instance characteristics. In this paper, by integrating the Deep RL agent into the ALNS framework, we introduce Deep Reinforcement Learning Hyperheuristic (DRLH), a general framework for solving a wide variety of combinatorial optimization problems, and show that our framework is better at selecting low-level heuristics at each step of the search process compared to ALNS and a Uniform Random Selection (URS). Our experiments also show that while ALNS can not properly handle a large pool of heuristics, DRLH is not negatively affected by increasing the number of heuristics.

© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

∗ Corresponding author.
E-mail addresses: [email protected] (J. Kallestad), [email protected] (R. Hasibi), [email protected] (A. Hemmati), [email protected] (K. Sörensen).
https://doi.org/10.1016/j.ejor.2023.01.017

1. Introduction

A metaheuristic is an algorithmic framework that offers a coherent set of guidelines for the design of heuristic optimization methods. Classical frameworks such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and Simulated Annealing (SA) are examples of such frameworks (Dokeroglu, Sevinc, Kucukyilmaz, & Cosar, 2019). Moreover, there is a large body of literature that addresses solving combinatorial optimization problems using metaheuristics. Among these, Adaptive Large Neighbourhood Search (ALNS) (Ropke & Pisinger, 2006) is one of the most widely used metaheuristics. It is a general framework based on the principle of Large Neighbourhood Search (LNS) of Shaw (1998), where the objective value is iteratively improved by applying a set of "removal" and "insertion" operators on the solution. In ALNS, each of the removal and insertion operators has a weight associated with it that determines the probability of selecting it during the search. These weights are continuously updated after a certain number of iterations (called a segment) based on their recent effect on improving the quality of the solution during the segment. The ALNS framework was early on an approach specific to routing problems. However, in recent years, there has been a growing number of studies that employ this approach to other problem types, e.g., scheduling problems (Laborie & Godard, 2007). Its high quality of performance at finding solutions has made it a go-to approach in many recent studies in combinatorial optimization problems (Aksen, Kaya, Sibel Salman, & Özge Tüncel, 2014; Chen, Demir, & Huang, 2021; Demir, Bektaş, & Laporte, 2012; Friedrich & Elbert, 2022; Grangier, Gendreau, Lehuédé, & Rousseau, 2016; Gullhav, Cordeau, Hvattum, & Nygreen, 2017; Li, Chen, & Prins, 2016). The ALNS framework has several advantages. For most optimization problems, a number of well-performing heuristics are already known which can be used as the operators in the ALNS framework.
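To make the adaptive layer concrete, the following minimal Python sketch shows segment-based roulette-wheel selection and weight updating in the style of ALNS. It is an illustration only: the dictionary-based bookkeeping and the reaction_factor value are assumptions, not the parameter settings used later in this paper.

    import random

    def select_operator(weights):
        # Roulette-wheel selection: operators are chosen with probability
        # proportional to their current weight.
        total = sum(weights.values())
        r = random.uniform(0.0, total)
        running = 0.0
        for op, w in weights.items():
            running += w
            if running >= r:
                return op
        return op  # guard against floating-point rounding

    def update_weights(weights, segment_scores, segment_uses, reaction_factor=0.1):
        # At the end of a segment, blend each operator's old weight with its
        # average score collected during that segment.
        for op in weights:
            if segment_uses[op] > 0:
                avg_score = segment_scores[op] / segment_uses[op]
                weights[op] = (1 - reaction_factor) * weights[op] + reaction_factor * avg_score
        return weights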

Due to the large size and diversity of the neighborhoods, the ALNS algorithm will explore huge chunks of the solution space in a structured way. As a result, ALNS is very robust as it can adapt to different characteristics of the individual instances, and is able to avoid being trapped in local optima (Pisinger & Ropke, 2019). According to Turkeš, Sörensen, & Hvattum (2021), the adaptive layer of ALNS has only minor impact on the objective function value of the solutions in the studies that have employed this framework. Moreover, the information that the adaptive layer uses for selecting heuristics is limited to the past performance of each heuristic. This limited data can make the adaptive layer naïve in terms of decision making capability because it is not able to capture other (problem-independent) information about the current state of the search process, e.g., the difference in cost between past solutions, whether the current solution has been encountered before during the search, or the number of iterations since the solution was last changed. We refer to the decision making capability of ALNS as performing on a "macro-level" in terms of adaptability, i.e., the weight of each heuristic is only updated at the end of each segment. This means that the heuristics selected within a segment are sampled according to the fixed probabilities of the segment. This limitation makes it impossible for ALNS to take advantage of any short-term dependencies that occur within a segment that could help aid the heuristic selection process.

Another area where ALNS struggles is when faced with a large number of heuristics to choose from. In order to find the best set of available heuristics for ALNS for a specific setting, initial experiments are often required to identify and remove inefficient heuristics, and this can be both time consuming and computationally expensive (Hemmati & Hvattum, 2017). Furthermore, some heuristics are known to perform very well for specific problem variations or specific conditions during the search, but they may have a poor average performance. In this case, it might be beneficial to remove these from the pool of heuristics available to ALNS in order to increase the average performance of ALNS, but this results in a less powerful pool of heuristics that is unable to perform as well during these specific problem variations and conditions.

To address the issues in ALNS, one can use Reinforcement Learning (RL). RL is a subset of machine learning concerned with "learning how to make decisions"—how to map situations to actions—so as to maximize a numerical reward signal. One of the main tasks in machine learning is to generalize a predictive model based on available training data to new unseen situations. An RL agent learns how to generalize a good policy through interaction with an environment, which returns a reward in exchange for receiving an action from the agent. Therefore, through a trial-and-error search process, the agent is trained to achieve the maximum expected future reward at each step of decision making conditioned on the current situation (state). Thus, training an RL agent (to achieve the best possible results in similar situations) makes the agent aware of the dynamics of the environment as well as adaptable to similar environments with slightly different settings. One of the more recent approaches in RL is Deep RL, which benefits from the powerful function approximation property of deep learning tools. In this approach, the different functions that are used to train and make decisions in an RL agent are implemented using Artificial Neural Networks (ANNs). Different Deep RL algorithms dictate the training mechanism and interaction of the ANNs in the decision making process of the agent (Sutton & Barto, 2018). Therefore, integration of Deep RL into the adaptive layer of ALNS can make the resulting framework much smarter at making decisions at each iteration and improve the overall performance of the framework.

In this paper, we propose Deep Reinforcement Learning Hyperheuristic (DRLH), a general approach to selection hyperheuristic frameworks (definition in Section 2) for solving combinatorial optimization problems. In DRLH, we replace the adaptive layer of ALNS with a Deep RL agent responsible for selecting heuristics at each iteration of the search. Our Deep RL agent is trained using the Proximal Policy Optimization (PPO) method of Schulman, Wolski, Dhariwal, Radford, & Klimov (2017), which is a standard approach for stable training of Deep RL agents in different environments. The proposed DRLH utilizes a search state consisting of a problem-independent feature set from the search process and is trained with a problem-independent reward function that encourages better solutions. This approach makes the framework easily applicable to many combinatorial optimization problems without any change in the method, given a separate training step for each problem. The training process of DRLH makes it adaptable to different problem conditions and settings, and ensures that DRLH is able to learn good strategies of heuristic selection prior to testing, while also being effective when encountering new search states. In contrast to the macro-level decision making of ALNS, the proposed DRLH makes decisions on a "micro-level", meaning that the current search state information directly affects the probabilities of choosing heuristics. This allows the probabilities of selecting heuristics to change quickly from one iteration to the next, helping DRLH adapt to new information of the search as soon as it becomes available. The Deep RL agent in DRLH is able to effectively leverage this search state information at each step of the search process in order to make better decisions for selecting heuristics compared to ALNS.

To evaluate the performance and generalizability of DRLH, we choose four different combinatorial optimization problems to benchmark against different baselines in terms of best objective found and the speed of convergence, as well as the time it takes to solve each problem. These problems include the Capacitated Vehicle Routing Problem (CVRP), the Parallel Job Scheduling Problem (PJSP), the Pickup and Delivery Problem (PDP), and the Pickup and Delivery Problem with Time Windows (PDPTW). These problems are commonly used for evaluation in the literature and are diverse in terms of the difficulty of finding good and feasible solutions. They additionally correspond to a broad scope of real world applications. For each problem, we create separate training and test datasets. In our experiments, we compare the performance of DRLH on different problem sizes and over an increasing number of iterations of the search, and demonstrate how the heuristic selection strategy of DRLH differs from other baselines throughout the search process.

Our experiments show the superiority of DRLH compared to the popular method of ALNS in terms of performance quality. For each of the problem sets, DRLH is able to consistently outperform the other baselines when it comes to best objective value, specifically on larger instance sizes. Additionally, DRLH does not add any overhead to the instance solve time, and the performance gain is a result of the decision making capability of the Deep RL agent used. Further experiments also validate that, unlike the other algorithms, the performance of DRLH is not negatively affected by increasing the number of available heuristics to choose from. In contrast to this, ALNS struggles when handling a large number of heuristics. This advantage of our framework makes the development process for DRLH very simple, as DRLH seems to be able to automatically discover the effectiveness of different heuristics during the training phase without the need for initial experiments in order to manually reduce the set of heuristics.

The remainder of this paper is organized as follows: In Section 2, related previous work in hyperheuristics and Deep RL is presented. In Section 3, we propose the overall algorithm of DRLH as well as the choice of heuristics and parameters. The descriptions of the four combinatorial optimization problems used for benchmarking purposes are illustrated in Section 4. The experimental setup and the results of our evaluation are presented in Sections 5 and 6, respectively.


2. Related work

In this section, we first define the term "hyperheuristic", review some of the traditional work that falls into this category, and point out its limitations. We also mention some of the methods that employ Deep RL for solving combinatorial problems and their shortcomings. In the end, we explain how we combine the best of the two domains (hyperheuristics and Deep RL) to take advantage of both their methodologies.

The term hyperheuristic was first used in the context of combinatorial optimization by Cowling, Kendall, & Soubeiga (2001) and described as heuristics to choose heuristics. Burke et al. (2010) later extended the definition of hyperheuristic to "a search method or learning mechanism for selecting or generating heuristics to solve computational search problems". The most common classification of hyperheuristics makes the distinction between selection hyperheuristics and generation hyperheuristics. Selection hyperheuristics are concerned with creating a selection mechanism for heuristics at each step of the search, while generation hyperheuristics are concerned with generating new heuristics using basic components from already existing heuristic methods. This paper will focus on selection hyperheuristic methods.

Although it is possible to create highly effective problem-specific and heuristic-specific methods for heuristic selection, these methods do not always generalize well to other problem domains and different sets of heuristics. A primary motivation of hyperheuristic research is therefore the development of general-purpose, problem-independent methods that can deliver good quality solutions for many combinatorial optimization problems without having to make significant modifications to the methods. Thus, advancements in hyperheuristic research aim to be easily applicable by experts and non-experts alike, to various problems and heuristic sets, without requiring extra effort such as domain knowledge about the specific problem to be solved.

A classic example of using RL in hyperheuristics is the work of Özcan, Misir, Ochoa, & Burke (2010) in which they propose a framework that uses a traditional RL method for solving examination timetabling. Performance is compared against a simple random hyperheuristic and some previous work, and results show that using RL obtains better results than simply selecting heuristics at random. The RL used here learns during the search process by adjusting the probabilities of choosing heuristics based on their recent performance during the search. This type of RL framework shares many similarities with the ALNS framework, and therefore suffers from the same limitations as those mentioned for ALNS.

Apart from RL, supervised learning, which is another machine learning technique, has also been utilized in hyperheuristic frameworks to improve the performance. A hyperheuristic method for the Vehicle Routing Problem named Apprentice Learning-based Hyper-heuristic (ALHH) was proposed by Asta & Özcan (2014), in which an apprentice agent seeks to imitate the behavior of an expert agent through supervised learning. The training of ALHH works by running the expert on a number of training instances and recording the selected actions of the expert together with a search state that consists of the previous action used and the change in objective function value for the past n steps. These recordings of search state and action pairs build up a training dataset on which a decision tree classifier is used in order to predict the action choice of the expert. This makes up a supervised classification problem in which the final accuracy of the model is reported to be around 65%. In the end, ALHH's performance is compared against the expert and is reported to perform very similarly to the expert, even slightly outperforming the expert for some instances.

Tyasnurita, Özcan, Shahriar, & John (2015) further improved upon the apprentice learning approach by replacing the decision tree classifier with a multilayer perceptron (MLP) neural network, and named their approach MLP-ALHH. This change increased the representational power of the search state and resulted in a better performance that is reported to even outperform the expert. A limitation of ALHH and MLP-ALHH is their use of the supervised learning framework, which makes the performance of these approaches bounded by the expert algorithm's performance. A consequence of this is that the feedback used to train the predictive models of ALHH and MLP-ALHH is binary, i.e. it either matches that of the expert or not, leaving no room for alternative strategies that might perform even better than the expert. In contrast, DRLH uses a Deep RL framework that neither requires, nor is bounded by, an expert agent and therefore has more potential to outperform existing methods by coming up with new ways of selecting heuristics. The feedback used to train DRLH depends on the effect of the action on the solutions, and the amount received varies depending on several factors. Additionally, DRLH takes future iterations of the search into account, while ALHH and MLP-ALHH only consider the immediate effect of the action on the solution. Because of this, diversifying behavior is encouraged in DRLH when it gets stuck, as it will help improve the solution in future iterations. Another difference of DRLH compared to ALHH and MLP-ALHH is that the features of the search state used by DRLH contain more information than the search state of the other two methods, which ultimately makes the agent more aware of the search state and thus capable of making effective decisions.

In addition to hyperheuristic approaches, there have also recently been many attempts at solving popular routing problems using Deep RL by the machine learning community. A big limitation of these works is that they all rely on problem-dependent information, and are usually designed to solve a single problem or a small selection of related problems, often requiring significant changes to the approach in order to make them work for several problems. In the first versions of these studies, Deep RL is used as a constructive heuristic approach for solving the vehicle routing problem in which the agent, representing the vehicle, selects the next node to visit at each time step (Kool, van Hoof, & Welling, 2019; Nazari, Oroojlooy, Snyder, & Takac, 2018). Although this is very effective when compared to simple construction heuristics for solving routing problems, it lacks the quality of solutions provided by iterative metaheuristic approaches, as well as being unable to find feasible solutions in the case of more difficult routing problems that involve more advanced constraints such as the pickup and delivery problem with time windows.

Another approach that leverages Deep RL for solving combinatorial optimization is to take advantage of the decision making ability of the agent in generating or selecting low-level heuristics to be applied on the solution. Hottung & Tierney (2019) have used a Deep RL agent to generate a heuristic for rebuilding partially destroyed routes in the CVRP using a large neighbourhood search framework. This method is an example of heuristic generation and is specifically designed to solve the CVRP. Thus, it can not easily be generalized to other problem domains. In Chen & Tian (2019), a framework is presented that uses two Deep RL agents for finding a node in the solution and the best heuristic to apply on that node at each step. Although the authors claim that this method is generalizable to three different combinatorial optimization problems, the details of the representation of the problem and the type of ANNs used for the agents change a lot from one problem to another depending on the nature of the problem. Additionally, one would have to come up with new inputs and representations when applying this method to other optimization problems that are not discussed in the study, which reduces the generalizability of the framework.


Lu, Zhang, & Yang (2020) suggested the use of a Deep RL agent for choosing a low-level heuristic at each step for the CVRP. This work also suffers from limited generalizability to other types of optimization problems due to the elements of the Deep RL agent that are specific to the CVRP. Additionally, in this approach the training process of the agent is designed in such a way that the agent is only focused on intensification rather than diversification. Thus, the diversification in their framework is done by a rule-based escape approach rather than giving the RL agent freedom to find the balance between diversification and intensification, which could lead to better results.

To the best of our knowledge, previous work on this topic either suffers from a lack of generalizability when it comes to other problems in the domain or does not take advantage of the learning mechanism and representation power of Deep RL. In this work we seek to address these issues by introducing DRLH.

3. DRLH

In this section, we present DRLH, a hyperheuristic framework to solve combinatorial optimization problems. Our proposed hyperheuristic framework uses an RL agent for the selection of heuristics. This process improves on the ALNS framework of Ropke & Pisinger (2006) by leveraging the RL agent's decision making capability in choosing the next heuristic to apply on the solution in each iteration. The pseudocode of DRLH is illustrated in Algorithm 1.

Algorithm 1: DRLH.
Function Deep Reinforcement Learning Hyperheuristic
    Generate an initial solution x with objective function f(x) (see Section 3.5)
    H = Generate_heuristics() (see Section 3.1)
    x_best = x, f(x_best) = f(x)
    Repeat
        x′ = x
        choose h ∈ H based on policy π(h|s, θ) (see Section 3.4)
        Apply heuristic h to x′
        if f(x′) < f(x_best) then
            x_best = x′
        end
        if accept(x′, x) (see Section 3.3) then
            x = x′
        end
    Until stop-criterion met (see Section 3.3)
    return x_best

3.1. Generating heuristics

The heuristic generation process follows the steps in Algorithm 2. The set H consists of all possible heuristics that can be applied on the solution x at each iteration. The general method for obtaining these heuristics is to combine a removal and an insertion operator. Furthermore, additional heuristics can also be placed in H that do not share the characteristic of being a combination of removal and insertion operators. In the following, we present one example set of H for the problem types considered in this paper.

Algorithm 2: Generation of the set of heuristics H.
Function Generate_heuristics
    H = {};
    foreach removal operator r ∈ R do
        foreach insertion operator j ∈ I do
            Create a heuristic h by combining r and j;
            H = H ∪ h;
        end
    end
    foreach additional heuristic c ∈ C do
        H = H ∪ c;
    end
    return H

3.2. Sample set of heuristics

Each heuristic h ∈ H is a combination of a removal and an insertion operator presented in Tables 1 and 2. Furthermore, one additional intensifying heuristic is also added to H. In each iteration, a heuristic h ∈ H is applied on the incumbent solution x with cost f(x) and generates a new solution x′ with cost f(x′). For our sample set of heuristics, H has the size of |H| = 29 (7 removals × 4 insertions + 1 additional).

3.2.1. Removal operators R

The set of all removal operators R is provided in Table 1. Seven removal operators are implemented, five of which are focused on inducing diversification through a high degree of randomness, denoted by Random in their name. For intensification purposes, we define the operator "Remove_largest_D", which uses the metric deviation D. We define the deviation D_i as the difference in cost with and without element i in the solution, and thus "Remove_largest_D" removes the elements with the largest D_i. Finally, the "Remove_τ" operator selects a number of consecutive elements in the solution and removes them.

Table 1
List of all removal operators.

Random_remove_XS: Removes between 2–5 elements chosen randomly
Random_remove_S: Removes between 5–10 elements chosen randomly
Random_remove_M: Removes between 10–20 elements chosen randomly
Random_remove_L: Removes between 20–30 elements chosen randomly
Random_remove_XL: Removes between 30–40 elements chosen randomly
Remove_largest_D: Removes 2–5 elements with the largest D_i
Remove_τ: Removes a random segment of 2–5 consecutive elements in the solution

3.2.2. Insertion operators I

Table 2 lists the set of insertion operators I used. A total of 4 insertion operators are utilized to place the removed elements in a suitable position in solution x′. Operator "Insert_greedy" places each removed element in the position which obtains the minimum total cost of the new solution f(x′). Operator "Insert_beam_search" performs beam search with a search width of 10 for inserting each removed element. Beam search keeps track of the 10 best combinations of positions after inserting each removed element in the solution and inserts the elements in the best combination of positions that obtains the minimum f(x′) in the search space. The "Insert_by_variance" operator calculates the variance of the ten best insertion positions for each of the removed elements. Then the elements are ordered from high to low variance and inserted back into the solution with the "Insert_greedy" operator. Finally, operator "Insert_first" places each removed element randomly in the first feasible position found in the new solution.

Table 2
List of all insertion operators.

Insert_greedy: Inserts each element in the best possible position
Insert_beam_search: Inserts each element in the best position using beam search
Insert_by_variance: Sorts the insertion order based on variance and inserts each element in the best possible position
Insert_first: Inserts each element randomly in the first feasible position

3.2.3. Additional heuristic C

Unlike in ALNS, where only removal and insertion operators are used, our framework can also make use of standalone heuristics that share neither of these characteristics. An example of one such additional heuristic, "Find_single_best", is responsible for generating the best possible new solution from the incumbent by changing one element. This heuristic calculates the cost of removing each element and re-inserting it with "Insert_greedy", and applies this procedure on the solution x for the element that achieves the minimum cost f(x′). "Find_single_best" is the only additional heuristic that is used in the proposed sample set of heuristics, H.
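To make Algorithm 2 concrete, a direct Python transcription could look as follows. It is a sketch under an assumed operator interface (each removal operator returns the partial solution plus the removed elements, which the insertion operator re-inserts); the paper does not prescribe these signatures.

    def generate_heuristics(removal_ops, insertion_ops, additional_heuristics):
        """Build H as in Algorithm 2: one heuristic per (removal, insertion) pair,
        plus the standalone heuristics."""
        heuristics = []
        for remove in removal_ops:
            for insert in insertion_ops:
                def heuristic(solution, remove=remove, insert=insert):
                    # Assumed interface: a removal operator returns the partial
                    # solution together with the removed elements, which the
                    # insertion operator then re-inserts.
                    partial, removed = remove(solution)
                    return insert(partial, removed)
                heuristics.append(heuristic)
        heuristics.extend(additional_heuristics)
        return heuristics

    # With 7 removal operators, 4 insertion operators and 1 additional heuristic,
    # this yields |H| = 7 * 4 + 1 = 29 heuristics, as stated in Section 3.2.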


3.3. Acceptance criteria and stopping condition

We use the acceptance criterion accept(x′, x) used in simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983). This acceptance criterion depends on the difference in objective value between the incumbent x and the new solution x′, denoted as ΔE = f(x′) − f(x), together with a temperature parameter T that is gradually decreasing throughout the search. A new solution is always accepted if it has a lower cost than the incumbent, ΔE < 0. In addition, worse solutions are accepted with probability e^(−|ΔE|/T).

To determine the initial temperature T_0, we accept all solutions for the first 100 iterations of the search and keep track of all the non-improving steps, ΔE > 0. Then, we calculate the average ΔĒ of these positive deltas in order to get:

    T_0 = ΔĒ / ln 0.8    (1)

To decrease the temperature we use the cooling schedule of Crama & Schyns (2003), and the search terminates after a certain number of iterations has been reached.

3.4. Deep RL agent for selection of h

In a typical RL setting, an agent is trained to optimize a policy π for choosing an action through interaction with an environment. At each time step (iteration) t, the agent chooses an action A_t and receives a scalar reward R_t from the environment indicating how good the action was. State S_t is defined as the information received at each time step from the environment based on the agent's choice of action A_t from a set of possible actions. Thus, a stochastic policy π for the agent is defined as

    π(a|s) = Pr{A_t = a | S_t = s}.    (2)

One such type of policy is the parameterized stochastic policy function, in which the probability of action selection is also conditioned on a set of parameters θ ∈ R^d. As a result, Eq. (2) is redefined as

    π(a|s, θ) = Pr{A_t = a | S_t = s, θ_t = θ},    (3)

in which θ_t represents the parameters at time step t (Sutton & Barto, 2018). In our setting, the policy π is a MultiLayer Perceptron (MLP), which is a class of non-linear function approximation (Goodfellow, Bengio, & Courville, 2016). In this scenario, the aim is to obtain the optimal policy π* by tuning θ, which represents the weights of the MLP network.

The training process for an RL agent is illustrated in Algorithm 3. For training the weights of the MLP, we follow the policy gradient method of PPO introduced in Schulman et al. (2017). In order to generalize to different variations of an optimization problem, the training process is done for a number of problem instances (episodes), with each instance corresponding to a different set of attributes of the problem. Each instance is optimized for a certain number of iterations (time steps), and at the end of each episode the policy parameters θ are updated until we obtain the optimal policy. Once the training process is complete, the optimal policy π* is used to solve unseen instances in the test sets.

Algorithm 3: Training the Deep RL agent.
Result: π* optimal policy
Start with a random setting of θ for a random policy π;
for e ← 1 to episodes do
    Receive initial state S_1;
    for t ← 1 to steps do
        choose and perform action a ∈ A_t according to π(a|s, θ);
        Receive R_t = v and s ∈ S_{t+1} from the environment
    end
    Optimize the policy parameters θ according to PPO (Schulman et al., 2017).
end

As mentioned above, the three main properties of the RL agent which are used to obtain the optimal policy π* for solving the intended problem are the state representation, the action space, and the reward function. These properties dictate the training process and decision making capability of the agent and are therefore essential for obtaining good solutions to optimization problems. Moreover, in our proposed approach, these properties are set to be independent of the type of problem, which helps this approach generalize to many types of combinatorial optimization problems. The state representation contains the information about the current solution and the overall search state, and is shown to the agent at each step in order to guide the agent in the action selection process. The action space consists of a set of interchangeable heuristics that can be selected at each time step by the agent. Finally, the reward function guides the learning of the agent during training and should be designed in a way that helps the agent optimize the objective of the problem. In the following, we explain the choice for each of these properties.
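Before these properties are detailed, the parameterized policy π(a|s, θ) of Eq. (3) can be sketched, for example in PyTorch. The two hidden layers of 256 units follow Table 4 and the output dimension is |H| = 29; the assumed state encoding (11 scalar features from Table 3 plus a one-hot of the previous action) and the choice of PyTorch are illustrative, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PolicyNetwork(nn.Module):
        """pi(h | s, theta): maps a search-state vector to a distribution over heuristics."""

        def __init__(self, state_dim, num_heuristics, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_heuristics),
            )

        def forward(self, state):
            # Softmax over |H| heuristics, expressed through Categorical(logits=...).
            return torch.distributions.Categorical(logits=self.net(state))

    # Example: 11 scalar features from Table 3 plus a 29-dimensional one-hot of the
    # previous action (the exact encoding is an assumption), and |H| = 29 heuristics.
    policy = PolicyNetwork(state_dim=11 + 29, num_heuristics=29)
    dist = policy(torch.zeros(1, 11 + 29))  # placeholder state vector
    h = dist.sample()                       # index of the heuristic to apply next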


3.4.1. State representation

The state consists of a set of useful features for guiding the agent to select the best action/heuristic at each iteration in the search. We have prioritized general state features that are independent of the specifics of the problem being solved. In other words, the state representation is easily applicable to many optimization problems of different domains. Table 3 lists all the state features used by the agent.

Table 3
A list of all features used for the state representation.

reduced_cost: The difference in cost between the previous & the current solutions
cost_from_min: The difference in cost between the current & the best found solution
cost: The cost of the current solution
min_cost: The cost of the best found solution
temp: The current temperature
cs: The cooling schedule (α)
no_improvement: The number of iterations since the last improvement
index_step: The iteration number
was_changed: 1 if the solution was changed from the previous, 0 otherwise
unseen: 1 if the solution has not previously been encountered in the search, 0 otherwise
last_action_sign: 1 if the previous step resulted in a better solution, 0 otherwise
last_action: The action in the previous iteration, encoded as one-hot

The state features cost and min_cost together with index_step allow the agent to know approximately how well it is doing during the search. This becomes apparent if cost and min_cost are higher than their average values during training with respect to index_step. These state features primarily help at a macro-level by making the agent stick to a high-level strategy of heuristic selection throughout the search. cost_from_min, temp, cs and no_improvement inform the agent about how likely a new solution is to be accepted. These state features help the agent know how much intensification/diversification is appropriate at that step, for instance whether it should try to escape a local optimum or focus on intensification. The last five state features, reduced_cost, was_changed, unseen, last_action_sign and last_action, inform the agent about the immediate changes from the previous solution to the current solution. In particular, reduced_cost shows the difference in cost between the previous and current solution. was_changed indicates if the solution was changed from the previous step to the current step. unseen indicates whether the current solution was encountered before during the search. Finally, last_action_sign indicates if the solution improved or worsened from the previous step, and last_action indicates the action that was used in the previous step. Together these five features give information about what action the agent selected in the previous step and the result of that action. This helps the agent make decisions at a micro-level and is particularly useful as the agent can avoid selecting deterministic or semi-deterministic heuristics such as Remove_largest_D, Insert_by_variance and Find_single_best twice in a row if the first time did not lead to any improvement, because then it is less likely, if at all, to work the second time on the same solution. This is particularly important for Find_single_best, which is a fully deterministic heuristic and produces the same result if applied for two consecutive iterations.

3.4.2. Action

The actions in our setting are the same as the set of heuristics H, i.e., A_t = H. At each iteration of DRLH (cf. Algorithm 1), a heuristic h is selected and applied on the solution by the agent. Therefore the policy function π in Eq. (3) is redefined as

    π(h|s, θ) = Pr{A_t = h | S_t = s, θ_t = θ}.    (4)

3.4.3. Reward function

A good reward function needs to balance the need for gradual and incremental rewards while also preventing the agent from exploiting the reward function without actually optimizing the intended objective (also known as reward hacking, Amodei et al., 2016). For our framework, we propose a reward function that has the above property. We refer to it as R_t^5310, the formula for which is

    R_t^5310 = 5, if f(x′) < f(x_best)
               3, if f(x′) < f(x)
               1, if accept(x′, x)
               0, otherwise    (5)

R_t^5310 is inspired by the scoring mechanism that is applied in the ALNS framework for measuring the performance of each heuristic in a segment. This reward function encourages the agent to find better solutions than the current one, as this gives a high reward. In addition, it also gives a small reward if it finds a slightly worse solution that manages to get accepted by the acceptance criterion. This property of the function in turn motivates the agent to use diversifying operators when it is no longer able to improve upon the current solution. Moreover, other reward functions were considered for the framework which take the step-wise improvement of the solution as well as the amount of improvement into account. Further experiments on these reward functions demonstrate that R_t^5310 proved to be more stable and faster to train compared to the others (results in Appendix A). Furthermore, given the fact that R_t^5310 comes from the original scoring function of ALNS in Ropke & Pisinger (2006), we use the same function for our Deep RL agent and ALNS for an equal comparison.

3.5. Solution representation and initial solution

For all the problems described in Section 4, the solution is represented as a permutation of orders/calls/jobs on each of the available vehicles/machines. Additionally, for the PDP and PDPTW, each call should be in the solution twice, once for each of the pickup and delivery elements respectively, and no call can be present in multiple vehicles, as the same vehicle has to both pick up and deliver the call.

The initial solutions for all of the problems are created by inserting all the orders/calls/jobs into the vehicles/machines using the Insert_greedy operator from Table 2. For each of the problems and each test instance, DRLH, ALNS and URS start with the same initial solution for a fair comparison.
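As a compact illustration of the interface defined in this section, the following sketch assembles the Table 3 features into a state vector and implements the reward of Eq. (5); the `search` bookkeeping object and its attribute names are assumptions for illustration.

    def make_state(search, num_heuristics):
        """Assemble the problem-independent features of Table 3 into a flat vector.
        `search` is an assumed bookkeeping object exposing the attributes below."""
        last_action_one_hot = [0.0] * num_heuristics
        if search.last_action is not None:
            last_action_one_hot[search.last_action] = 1.0
        return [
            search.reduced_cost,       # previous cost minus current cost
            search.cost_from_min,      # current cost minus best found cost
            search.cost,
            search.min_cost,
            search.temp,
            search.cooling_schedule,   # alpha
            search.no_improvement,
            search.index_step,
            float(search.was_changed),
            float(search.unseen),
            float(search.last_action_sign),
        ] + last_action_one_hot

    def reward_5310(new_cost, current_cost, best_cost, accepted):
        """R_t^5310 from Eq. (5)."""
        if new_cost < best_cost:
            return 5.0
        if new_cost < current_cost:
            return 3.0
        if accepted:
            return 1.0
        return 0.0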


4. Problem sets

We consider four combinatorial optimization problems as examples of problems that can be solved using DRLH. These problems are the Capacitated Vehicle Routing Problem (CVRP), the Parallel Job Scheduling Problem (PJSP), the Pickup and Delivery Problem (PDP) and the Pickup and Delivery Problem with Time Windows (PDPTW).

4.1. CVRP

The Capacitated Vehicle Routing Problem is one of the most studied routing problems in the literature. It consists of a set of N orders that need to be served by any of the M vehicles. Additionally, there is a depot from which the vehicles travel and to which they return when serving the orders. Following previous work, the number of vehicles in this particular problem is not fixed, but is naturally limited to M = {1, ..., N}, meaning that the maximum number of vehicles that can be utilized is N and the minimum number is 1. Usually the number of vehicles used will fall somewhere in between, depending on which number results in the best solution. Each order has a weight W_i associated with it, and the vehicles have a maximum capacity. The sequence of orders that a vehicle visits after leaving the depot and before returning to the depot is referred to as a tour. There needs to be a minimum of one tour and a maximum of N tours. The combined weight of the orders in a tour can not exceed the maximum capacity of the vehicle, and so several tours are often needed in order to solve the CVRP. The objective of this problem is to create a set of tours that minimizes the total distance travelled by all the vehicles that are serving at least one order.

4.2. PJSP

In the Parallel Job Scheduling Problem, we are given N jobs and M machines. Each of the machines operates with a different processing speed, and so the time required to complete job i on machine m is T_{i,m}. Each job has a due time associated with it, and if a job is finished after its due time, a delay is calculated for that job. The delay for job i is the difference in time between the due time and the actual finishing time of job i, and can never be lower than 0. The objective of the problem is to find a sequence of jobs to complete on each of the machines in order to minimize the total delay of all the jobs.

4.3. PDP

In the Pickup and Delivery Problem we are given N calls and a single vehicle with a maximum capacity. Each call has a weight, a pickup location, and a delivery location, and is served when the order is transported by the vehicle from the pickup to the delivery location. The objective of the problem is to minimize the traveling distance of the vehicle while serving all the calls and not carrying more than the maximum capacity at any point.

4.4. PDPTW

In the Pickup and Delivery Problem with Time Windows, we are given a set of calls. A call consists of an origin, a destination and an amount of goods that should be transported. A heterogeneous fleet of vehicles serves the calls, picking up goods at their origins and delivering them to their destinations. Time windows are assigned to each call at origins and destinations. Pickups and deliveries must be within the associated time windows. In the event of early arrival of a vehicle to a node before the start of the time window, the vehicle must wait until the beginning of the time window before being able to perform the pickup or delivery. A vehicle is never allowed to arrive at a node after the end of the time window. Additionally, a service time is considered for each time a call gets picked up or delivered, i.e., the time it takes a vehicle to load or unload the goods at each node. For each call, a set of feasible vehicles is determined. Each vehicle has a predetermined maximum capacity of goods as well as a starting terminal in which the duty of the vehicle starts. Moreover, a start time is assigned to each vehicle indicating the time that the vehicle leaves its starting terminal. The vehicle must leave its start terminal at the starting time, even if a possible waiting time at the first node visited occurs. The goal is to construct valid routes for each vehicle, such that time windows and capacity constraints are satisfied along each route, each pickup is served before the corresponding delivery, pickup and delivery of each call are served on the same route, and each vehicle only serves calls it is allowed to serve. The construction of the routes should be in such a way that they minimize the cost function. There is also a compatibility constraint between the vehicles and the calls; thus, not all vehicles are able to handle all the calls. If we are not able to handle all calls with our fleet, we have to outsource them and pay the cost of not transporting them. For more details, readers are referred to Hemmati, Hvattum, Fagerholt, & Norstad (2014).

5. Experimental setup

In this section, we explain the baseline methods, the process of hyperparameter selection, and the dataset generation methods used for the evaluation of the DRLH framework.

5.1. Experimental environment

The computational experiments in this paper were run on a desktop computer running a 64-bit Ubuntu 20.04 operating system with an AMD Ryzen 5 3600 processor and 32GB RAM.

5.2. Baseline models

Four baseline frameworks are chosen to compare with DRLH. Three of these methods use the same approach as DRLH in selecting a heuristic from the same set of heuristics at each iteration, with the difference being the selection strategy. The last baseline uses a trained Deep RL agent to build a route by selecting a node at each step. The details of the baselines are presented in the following.

5.2.1. Adaptive large neighborhood search (ALNS)

As our approach improves on the ALNS algorithm, this method is chosen as a baseline for performance comparison. This framework measures the performance of each heuristic using a scoring function for a certain number of iterations, referred to as a segment. At the end of each segment, the probability of choosing a heuristic during the next segment is updated using the aggregated scores of each heuristic in the previous segment. The extent to which the scores of the previous segment should contribute to updating the weights is controlled by the reaction factor.

There is a trade-off between speed and stability when choosing the values of the segment size and the reaction factor. Longer segments mean less frequent updates of the weights, but may increase the quality of the update. Similarly, a low reaction factor means that the weights can take longer to reach their desired values, but may also prevent sudden unfavorable changes to the weights due to the stochastic nature of the problem.

5.2.2. Uniform random selection (URS)

As a simpler approach to selecting heuristics in each iteration, this method selects the heuristic randomly from H with equal probabilities.
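Returning to the PJSP defined in Section 4.2, its total-delay objective can be sketched as follows; the schedule encoding and argument names are illustrative assumptions, with the processing time of job i on machine m given by T_{i,m} = PS_i / S_m.

    def pjsp_total_delay(schedule, processing_steps, machine_speed, due_time):
        """Total delay of a PJSP solution. schedule[m] is the ordered list of job
        indices assigned to machine m; a job's delay is max(0, finish - due)."""
        total_delay = 0.0
        for m, jobs in enumerate(schedule):
            t = 0.0
            for i in jobs:
                t += processing_steps[i] / machine_speed[m]  # finish time of job i
                total_delay += max(0.0, t - due_time[i])
        return total_delay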


5.2.3. Tuned random selection (TRS)

We introduce another baseline to our experiments, referred to as TRS. For this method, we tuned the probabilities of selecting heuristics using the method of IRace (López-Ibáñez, Dubois-Lacoste, Pérez Cáceres, Birattari, & Stützle, 2016). The package "IRace" applies iterated F-Race to tune a set of parameters in an optimization algorithm (heuristic probabilities in our method) based on the performance on the training dataset.

5.2.4. Attention Module (AM) based Deep RL heuristic

We also consider the AM method of Kool et al. (2019), which achieved state-of-the-art results among the Deep RL based methods for solving combinatorial optimization problems. This method uses a Deep RL agent combined with deep attention representation learning to build the solution at each step in a constructive manner using problem specific features from the environment. As a result, when applied to new problems, a new set of features as well as a problem specific representation learning scheme need to be defined. For example, the time window and vehicle incompatibility constraints were not mentioned in the original paper, and for that reason we can not solve the difficult problem of PDPTW with this framework.

5.3. Hyperparameter selection

The hyperparameters for the Deep RL agent determine the speed and stability of the training process and also the final performance of the trained model. A small learning rate will cause training to take longer, but the smaller updates to the neural network also increase the chance of a better final performance once the model has been fully trained. Because the training process is done in advance of the testing stage, we opt for a slow and stable approach in order to train the best models possible. The hyperparameters of the Deep RL agent for the experiments are listed in Table 4.

Table 4
The hyperparameters used during training for the Deep RL agent of DRLH.

Learning rate: 1e−5
Batch size: 64
First hidden layer size: 256
Second hidden layer size: 256
Discount factor: 0.5

In order to decide on the hyperparameters for DRLH, some initial experiments were performed on the PDP problem (the simplest of our problems) on a separate validation set to see which combinations performed best. The resulting set of hyperparameters has been applied for all experiments in this paper. Our motivation for doing so is that we wanted to test the generalizability of the framework in terms of the hyperparameters as well as the performance on different problems. By tuning the hyperparameters on a simpler problem and applying them to all other problems of all sizes and variations, we tried to avoid overtuning DRLH for every separate problem, to keep the evaluation fair for the baseline methods, and to make sure that the advantage of our approach lies in the decision making approach, not the choice of hyperparameters for each problem. Moreover, this adds to the generalizability trait of the framework, which does not require hyperparameter selection for each specific problem. Based on our experiments we found that this set of hyperparameters works very well across all the problem variations that we tested. It is likely that these hyperparameters can work for any underlying combinatorial optimization problem, as the hyperparameters for DRLH are related to the high-level problem of heuristic selection, which stays the same regardless of what the underlying combinatorial optimization problem actually is. In the case of ALNS, we apply the same set of optimized hyperparameters that are suggested by Hemmati et al. (2014), which is optimized for solving the benchmark of PDPTW.

5.4. Dataset generation

For all the problem variations we generate a distinct training set consisting of 5000 instances, and a distinct testing set consisting of 100 instances. Additionally, for PDPTW we also utilize a known set of benchmark instances for testing (Hemmati et al., 2014).

5.4.1. CVRP

CVRP data instances are generated in accordance with the generation scheme of Nazari et al. (2018), Kool et al. (2019), but we also add two bigger problem variations. Instances of sizes N = 20, N = 50, N = 100, N = 200 and N = 500 are generated, where N is the number of orders. For each instance the depot location and node locations are sampled uniformly at random from the unit square. Additionally, each order has a size associated with it defined as γ̂ = γ_i/D_N, where γ_i is sampled from the discrete set {1, ..., 9}, and the normalization factor D_N is set as D_20 = 30, D_50 = 40, D_100 = 50, D_200 = 50, D_500 = 50, for instances with N orders, respectively.

5.4.2. PJSP

For the PJSP we generate instances of sizes N = 20, N = 50, N = 100, N = 300 and N = 500, where N is the number of jobs, using M = N/4 machines. Job i's required processing steps PS_i are sampled from the discrete set {100, 101, ..., 1000}, and machine m's speed S_m, in processing steps per time unit, is sampled from N(μ, σ²) with μ = 10, σ = 30, and the speed is rounded to the nearest integer and bounded to be at least 1. From there we get that the time required to process job i on machine m is calculated as PS_i/S_m.

5.4.3. PDP

For this problem, PDP data instances of sizes N = 20, N = 50, and N = 100 are generated, where N is the number of nodes, based on the generation scheme of Nazari et al. (2018), Kool et al. (2019). For each instance the depot location and node locations are sampled uniformly at random in the unit square. Half of the nodes are pickup locations whereas the other half are the corresponding delivery locations. Additionally, each call has a size associated with it defined as γ̂ = γ_i/D_N, where γ_i is sampled from the discrete set {1, ..., 9}, and the normalization factor D_N is set as D_20 = 15, D_50 = 20, D_100 = 25, for each problem with N nodes, respectively.

5.4.4. PDPTW

For the PDPTW we use instances with different combinations of number of calls and number of vehicles, see Table 5.

Table 5
Properties of different variations of the PDPTW instance types.

#Calls  #Vehicles  #Vehicle types
18      5          3
35      7          4
80      20         2
130     40         2
300     100        2

For generating the training set and the 100 test instances, we use the provided instance generator of Hemmati et al. (2014).
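As an illustration of the instance-generation schemes above, the CVRP scheme of Section 5.4.1 can be sketched with NumPy as follows; the returned dictionary layout is an assumption, while the normalization factors D_N match the text.

    import numpy as np

    def generate_cvrp_instance(n_orders, rng=None):
        """Sample one CVRP instance following Section 5.4.1: depot and order
        locations uniform in the unit square, order sizes gamma_i drawn from
        {1, ..., 9} and normalized by the size-dependent factor D_N."""
        if rng is None:
            rng = np.random.default_rng()
        d_n = {20: 30, 50: 40, 100: 50, 200: 50, 500: 50}[n_orders]
        return {
            "depot": rng.random(2),
            "locations": rng.random((n_orders, 2)),
            "demands": rng.integers(1, 10, size=n_orders) / d_n,  # gamma_i / D_N
        }

    # Example: a training set of 5000 instances with N = 100 orders.
    # train_set = [generate_cvrp_instance(100) for _ in range(5000)]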


Additionally, we use the benchmark instances of Hemmati et al. (2014) for the remaining results. The benchmark test set consists of some instances of each variation, which are solved 10 times during testing in order to calculate the average best objective for each instance. Previous work by Homsi, Martinelli, Vidal, & Fagerholt (2020) has found the global optimal objectives for these instances, and we use these optimal values in order to calculate the Min Gap (%) and Avg Gap (%) to the optimal values for instances with 18, 35, 80 and 130 calls. Additionally, we also generate and test on a much larger instance size of 300 calls, where we do not have the exact global optimal objectives, but instead use the best known values found by DRLH with 10,000 iterations to calculate the Min Gap (%) and Avg Gap (%).

6. Results

In this section, we present the results of different experiments on the performance of DRLH. In the first experiment (Section 6.1), we set the number of iterations of the search to 1000 to compare the quality of the best found objective by each algorithm over a limited number of iterations for different problem sizes in the test set. In the next experiment (Section 6.2), we increase the number of iterations for all the methods and compare their performance when enough iterations are provided to fully explore the problem space. We also report the results on the benchmark instances of Hemmati et al. (2014) (Section 6.3). In order to demonstrate another advantage of using DRLH, we conduct an experiment with an increased number of heuristics to illustrate the dependence of each framework on the performance of individual heuristics when the number of heuristics exceeds a certain number (Section 6.4). Additionally, we report the convergence speed and the training and inference time of each framework on instances of each problem (Sections 6.5 and 6.6). Next, to gain insight into the reason behind the superiority of DRLH compared to the state of the art, we provide some figures and discuss the difference in strategy behind choosing a heuristic between DRLH and ALNS (Section 6.7). Finally, we compare the performance of DRLH with a Deep RL heuristic approach (Section 6.8). Additional experiments and results regarding the reward function, convergence speed, and dependency of DRLH on the size of the problem can be found in the Appendix.

Fig. 1. Performance of DRLH on the generated test set.

6.1. Experiment on generated test set

For this experiment, each method was evaluated on a test set of 100 generated instances for each of the problems introduced in Section 4. Figure 1(a) shows the improvement in percentage that


Fig. 2. Boxplot results for different iterations of PDP100.

Table 6
Average results for PDPTW instances with mixed call sizes after 10 0 0 iterations.

DRLH ALNS URS

Min Gap Avg Gap Time Min Gap Avg Gap Time Min Gap Avg Gap Time
#C #V (%) (%) (s) (%) (%) (s) (%) (%) (s)

18 5 0.00 0.18 32 0.00 0.46 25 0.00 0.40 12


35 7 2.67 5.78 9 3.45 7.08 36 2.46 6.40 27
80 20 3.04 4.85 37 3.64 6.51 98 4.62 7.23 100
130 40 3.44 4.66 100 4.00 6.24 186 4.85 6.71 176
300 100 2.40 3.15 637 3.10 5.04 599 5.29 6.51 398

Figure 1(a) shows the improvement in percentage that using DRLH, ALNS, and TRS have over using URS on CVRP instances of different sizes. We see that DRLH is able to outperform all the baselines for all the instance sizes except for the smallest size. There is also a clear trend that shows how DRLH becomes increasingly better compared to the other methods on larger instance sizes. Figure 1(b) shows a similar result for the PJSP problem. We see that DRLH is able to outperform the other methods for all of the instance sizes tested. Compared to the previous results, we see that the degree of improvement on larger instance sizes is less prominent for DRLH, but we also see that ALNS does not perform noticeably better on larger instance sizes at all. Because of that, we still see a clear separation in performance between DRLH and ALNS that seems to grow with larger instance sizes. Finally, we observe a similar trend for PDP and PDPTW as for the other problems, which can be seen in Fig. 1(c) and (d), respectively. From this figure we see that DRLH outperforms ALNS and URS on all instance sizes tested and that the performance difference tends to increase with larger instance sizes.

6.2. Experiment on increased number of iterations

Figure 2 shows that the number of iterations available for improving the solution affects the minimum costs found by all the methods. We see that DRLH outperforms the baselines when tested for 1000, 5000, 10,000 and 25,000 iterations, and that the percentage difference between DRLH, ALNS and URS gets smaller as the number of iterations grows larger. Intuitively this makes sense, as all three methods are getting closer to finding the optimal objectives for the test instances, and more iterations for improving the solution during the search makes the choice of which heuristics to select less sensitive compared to searching for a smaller number of iterations.

6.3. Experiment on the PDPTW benchmark dataset

In this section, we report results for PDPTW on the benchmark test set, shown in Tables 6, 7 and 8 for 1000, 5000 and 10,000 iterations, respectively. We see from the tables that DRLH outperforms ALNS and URS on all of the tests on average, showing that it can find high quality solutions and has a robust average performance. Furthermore, we can see that the performance difference between DRLH and the baselines increases on bigger instances, meaning that DRLH scales favorably with the size of the problem, making it more viable for big industrial-sized problems compared to ALNS and URS.

We have also included the average time in seconds for optimizing the test instances. Note that the difference in time usage is not directly dependent on the framework for selecting the heuristics (DRLH, ALNS, URS), but rather on the difference in time usage of the heuristics themselves. This means that if all the heuristics used the same amount of time, then there would not be any time difference between the frameworks. However, because there is a relatively large variation in the time usage between the different heuristics, we see a considerable variation between the frameworks, as they all have different strategies for heuristic selection.


Table 7
Average results for PDPTW instances with mixed call sizes after 5000 iterations.

DRLH ALNS URS

Min Gap Avg Gap Time Min Gap Avg Gap Time Min Gap Avg Gap Time
#C #V (%) (%) (s) (%) (%) (s) (%) (%) (s)

18 5 0.00 0.00 56 0.00 0.11 159 0.00 0.01 64


35 7 1.02 2.95 218 0.78 3.24 207 1.26 3.49 141
80 20 1.76 3.25 201 2.11 4.04 503 2.54 4.14 471
130 40 2.10 3.14 530 2.51 3.93 837 2.91 4.09 767
300 100 0.48 1.15 2580 1.01 2.35 2062 2.07 2.99 2352

Table 8
Average results for PDPTW instances with mixed call sizes after 10,000 iterations.

DRLH ALNS URS

Min Gap Avg Gap Time Min Gap Avg Gap Time Min Gap Avg Gap Time
#C #V (%) (%) (s) (%) (%) (s) (%) (%) (s)

18 5 0.00 0.00 219 0.00 0.02 338 0.00 0.00 102


35 7 0.67 2.02 182 0.78 2.66 410 0.68 2.77 289
80 20 1.80 2.95 321 2.03 3.33 757 2.17 3.36 972
130 40 1.93 2.84 877 2.38 3.34 1307 2.56 3.37 1609
300 100 0.00 0.64 4630 0.55 1.89 4120 1.46 2.18 4203

6.4. Experiment on the increased pool of heuristics

In addition to the set of heuristics mentioned in Section 3.1, we have also created an extended set of heuristics (see the list in Appendix B). In total, this extended set consists of 142 heuristics. Figure 3 shows the average gap of using the extended set compared to using the standard set for each of DRLH, ALNS and URS on the PDPTW. The extended set obtains worse results on average compared to the standard set, but there is an interesting difference between the performance hit of DRLH, ALNS and URS when comparing the results of the extended set and the standard set. We see from Fig. 3 that DRLH is relatively unaffected by increasing the number of available heuristics (being only 0.02% worse on average), while ALNS and URS perform much worse when using the extended set, and ALNS is hit especially hard. A likely reason for this is that there are too many heuristics to accurately explore all of them during the search in order to identify the best heuristics and take advantage of them.
An important conclusion from this result (albeit one that needs
further empirical proof) is that when using DRLH, it is possible to
supply it with a large number of heuristics and let DRLH iden-
tify the best ones to use. This is not possible for ALNS and con-
sequently it is often necessary to spend time carrying out prior
experiments with the aim of finding a small set of the best per-
forming heuristics to include in the final ALNS model. This also
resonates with the conclusion of Turkeš et al. (2021), who argue
that the performance of ALNS benefits more from a careful a priori selection of heuristics than from an elaborate adaptive layer. Considering that prior experiments can be quite time-consuming, using
DRLH can lead to a simpler development phase where heuristics
can be added to DRLH without needing to establish their effective-
ness beforehand, and not having to worry whether adding them
will hurt the overall performance. Should a heuristic be unneces-
sary, then DRLH will learn to not use it during the training phase.
In addition to DRLH having a simpler development phase, an
increased (or more nuanced) set of heuristics also has a larger po-
tential to work well for a wide range of conditions, such as for dif-
ferent problems, instance sizes and specific situations encountered
in the search. In other words, reducing the set of heuristics could
negatively affect the performance of ALNS, but much less so for
DRLH. Some heuristics work well only in specific situations, and so
removing these “specialized” heuristics due to their poor average
performance gives less potential for ALNS to be able to handle a
diverse set of problem and instance variations compared to DRLH,
which learns to take advantage of any heuristic that performs well
in specific situations. Of course, these claims are based on a limited
number of experiments and should be validated in a broad range
of (future) experiments.
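One practical consequence of this is that, in DRLH, the heuristic pool only determines the size of the agent's action space. The sketch below illustrates this point with a hypothetical PyTorch-style policy head; the layer sizes and heuristic counts are illustrative and are not the exact dimensions used in the paper.

```python
import torch.nn as nn


def build_policy_head(state_dim: int, num_heuristics: int, hidden: int = 256) -> nn.Module:
    """Policy head that outputs one logit per available low-level heuristic."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_heuristics),  # action space tracks the pool size
    )


# Moving from a standard pool to the 142-heuristic extended pool only changes
# the output dimension; no per-heuristic vetting or weight tuning is required.
policy_standard = build_policy_head(state_dim=16, num_heuristics=30)
policy_extended = build_policy_head(state_dim=16, num_heuristics=142)
```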

6.5. Average performance results

In this section, we explore the speed and characteristics of the performance of DRLH, ALNS and URS on the different problems. Fig. 4 shows that DRLH is able to quickly find better solutions compared to ALNS and URS for all the problems. For CVRP, DRLH takes a little bit longer initially, but it ultimately reaches a much lower average minimum cost before the convergence of all three methods starts to stagnate.

Fig. 3. Results of an Increased Pool of Heuristics.


Fig. 4. Average performance of DRLH, ALNS and URS on each of the problems.

For all the problems, DRLH is able to reach a better cost after less than 500 iterations than what ALNS is able to reach after 1000 iterations. With the exception of the CVRP problem, DRLH is also extremely efficient in the beginning of the search, reaching costs in only 100 iterations that take ALNS approximately 500 iterations to match. We refer to Appendix C for a complete collection of performance plots for all the problems that we have tested.

6.6. Training and inference time needed for each problem

Tables 9 and 10 report the time needed for training and for solving the instances of each problem, respectively. The main difference between DRLH and the baselines is the approach to decision making when it comes to choosing the next heuristic. This decision making process, in itself, does not add much overhead to the computational time of the methods. The main difference in the speed of these methods is the speed of the operators that they choose. This means that in some cases DRLH chooses operators that are faster or slower compared to the baselines, which results in lower or higher computational time. Therefore, when it comes to computational time, there is not much difference between these methods. This can also be seen in Table 10, in which in some cases DRLH is faster than the other two baselines and in some cases it is slower. It should be noted that the execution time of the operators can be improved if implemented carefully or using a faster programming language, e.g., C. However, the main focus of the paper is to improve the hyperheuristic approach of choosing the next heuristic at each step, not the execution time.

6.7. Comparison between heuristic selection strategies

Figure 5 demonstrates the probability of selecting heuristics at each step of the search for DRLH and ALNS, in which each line corresponds to the probability of one heuristic at every step of the search. The “micro-level” heuristic usage of DRLH means that DRLH is able to drastically change the probabilities of selecting heuristics from one iteration to the next by taking advantage of the information provided by the search state, see Fig. 5(a) and (b). This is in contrast to the “macro-level” heuristic usage of ALNS, where the probabilities of selecting operators are only updated at the beginning of each segment.
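The contrast between the two selection mechanisms can be summarized in a few lines of Python. This is an illustrative sketch only: the function names are ours, the ALNS update follows the commonly used weight-blending rule, and `policy` stands in for the trained DRLH agent producing a probability distribution from the current search state.

```python
import random


def alns_select(weights):
    """ALNS: sample an operator from weights that stay fixed for a whole segment."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]


def alns_update_segment(weights, scores, uses, reaction=0.1):
    """Called only at the end of a segment: blend observed operator scores
    into the selection weights (a common ALNS adaptive-layer variant)."""
    return [
        (1 - reaction) * w + reaction * (s / max(u, 1))
        for w, s, u in zip(weights, scores, uses)
    ]


def drlh_select(policy, state):
    """DRLH: a fresh distribution over heuristics is produced at every single
    iteration, conditioned on the current search state."""
    probs = policy(state)  # e.g., softmax output of the trained agent
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```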


Table 9
Training time for DRLH on different problems.

Problem Size #Iterations #Training Instances Total training time (s) Average time per instance (s)

CVRP 20 1k 1000 4586.85 4.59


CVRP 50 1k 1000 12394.3 12.39
CVRP 100 1k 1000 36330.0 36.33
CVRP 200 1k 100 8618.64 86.19
CVRP 500 1k 50 26483.2 529.66
PJSP 20 1k 1000 28233.7 28.23
PJSP 50 1k 1000 35552.1 35.55
PJSP 100 1k 500 16576.8 33.15
PJSP 300 1k 100 19758.1 197.58
PJSP 500 1k 100 79975.3 799.75
PDP 20 1k 500 1868.66 3.74
PDP 50 1k 100 2160.65 21.61
PDP 100 1k 100 12875.3 128.75
PDPTW 18 1k 600 25340.2 42.23
PDPTW 35 1k 600 12154.9 20.26
PDPTW 80 1k 500 20704.4 41.41
PDPTW 130 1k 100 8595.9 85.96
PDPTW 300 1k 90 53657.5 596.19


Fig. 5. Example of the probability of selecting heuristics for DRLH and ALNS.


Table 10
Average time (seconds) required for solving the test instances for each method: DRLH, ALNS and URS.

Problem   Size   #Iterations   DRLH     ALNS     URS
CVRP      20     1k            4.08     11.58    7.75
CVRP      50     1k            11.58    35.17    23.52
CVRP      100    1k            34.28    99.58    50.65
CVRP      200    1k            102.76   221.22   94.07
CVRP      500    1k            621.74   664.54   238.86
PJSP      20     1k            20.37    18.06    5.65
PJSP      50     1k            30.69    41.84    15.9
PJSP      100    1k            57.05    76.15    34.0
PJSP      300    1k            199.37   237.58   110.81
PJSP      500    1k            453.92   462.34   195.67
PDP       20     1k            3.89     4.17     1.85
PDP       50     1k            31.4     20.15    9.93
PDP       100    1k            159.86   79.58    45.86
PDPTW     18     1k            32.61    23.85    9.18
PDPTW     35     1k            10.67    29.75    21.18
PDPTW     80     1k            34.82    71.27    68.33
PDPTW     130    1k            110.67   139.45   132.32
PDPTW     300    1k            500.9    438.65   361.39

This means that the decision making of ALNS within a single segment is random according to the locked probabilities for that segment, see Fig. 5(c). Depending on the problem and the available heuristics to select from, there might exist exploitable strategies and patterns for heuristic selection, such as heuristics that work well when used together, work well for escaping local minima, or work well on solutions not previously encountered during the search. Using DRLH, these types of exploitable strategies can be discovered automatically, without the need for specially tailored algorithms designed by human experts. We refer to one such exploitable strategy, found by DRLH on our problems with our provided set of heuristics, as minimizing “wasted actions”. We define a wasted action as the selection of a deterministic heuristic (in our case Find_single_best) in two consecutive unsuccessful iterations. The reason that this action is “wasted” is the deterministic nature of the heuristic, which makes it so that if the solution did not change in the previous iteration, then it is guaranteed not to change in the following iteration as well. Even though we have not specifically programmed DRLH to utilize this strategy, it becomes clear by examining Table 11 that DRLH has picked up on this strategy when learning to optimize micro-level heuristic selection. Table 11 shows that the number of wasted actions for DRLH is almost non-existent for most problem variations. ALNS, on the other hand, ends up with far more wasted actions than DRLH, even though ALNS also uses Find_single_best much more seldom on average. Figure 5(c) shows how the heuristic probabilities for ALNS remain locked within the segments, making it impossible for ALNS to exploit strategies such as minimizing wasted actions, which rely on the kind of excellent micro-level heuristic selection that DRLH demonstrates.

6.8. Performance comparison with AM deep RL heuristic

For this experiment, we ran the AM method of Kool et al. (2019) on our test datasets for the CVRP problem. The trained models and the implementation needed to solve the problem have been made publicly available by the authors of that paper. The dataset generation procedure for both our work and the AM paper follows the work of Nazari et al. (2018). As a result, the models are well suited to be evaluated on our test set. For their method we considered three different approaches: Greedy, Sample_128 and Sample_1280. In the greedy approach, at each step the node with the highest probability is chosen. In the sampling approaches, 128 and 1280 different solutions are sampled based on the probability of each node at each step, and the best of these is kept. We test these methods for sizes n = 20, 50, 100 of the CVRP problem. The time and resources (Graphical Processing Units) needed to train the AM method for sizes larger than 100 scale exponentially due to the heavy calculations needed for their representation learning method. Therefore, we only solve this problem for the mentioned instance sizes.

Figure 6 illustrates the comparison of the performance of our method with the AM method of Kool et al. (2019).
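The greedy and sampling evaluation modes described above can be sketched as follows. This is only an illustration of the idea: `rollout` and `cost` are hypothetical stand-ins for the publicly available AM implementation, not its actual API.

```python
import math


def evaluate_constructive_policy(rollout, cost, mode="greedy", n_samples=1280):
    """Evaluate a trained constructive policy on one instance.

    rollout(greedy=True)  builds one solution by always picking the most
    probable node; rollout(greedy=False) samples nodes from the per-step
    probabilities instead.
    """
    if mode == "greedy":
        sol = rollout(greedy=True)
        return sol, cost(sol)
    # Sample_128 / Sample_1280: keep the cheapest of the sampled solutions.
    best_sol, best_cost = None, math.inf
    for _ in range(n_samples):
        sol = rollout(greedy=False)
        c = cost(sol)
        if c < best_cost:
            best_sol, best_cost = sol, c
    return best_sol, best_cost
```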

Table 11
The percentage of wasted actions of the total number of deterministic heuristics selected, averaged over the test set for each problem.

(a) CVRP: Wasted Actions (%)
#Orders   #Iterations   DRLH    ALNS
20        1k            3.37    26.55
50        1k            0.00    23.98
100       1k            1.22    19.48
200       1k            0.00    23.43
500       1k            0.01    25.15

(b) PJSP: Wasted Actions (%)
#Jobs     #Iterations   DRLH    ALNS
20        1k            0.00    20.82
50        1k            0.86    24.57
100       1k            0.00    24.80
300       1k            0.00    24.85
500       1k            0.00    24.50

(c) PDP: Wasted Actions (%)
#Calls    #Iterations   DRLH    ALNS
20        1k            6.82    31.53
50        1k            0.00    29.00
100       1k            0.00    28.01
100       5k            0.02    30.62
100       10k           0.00    33.86
100       25k           0.00    32.69

(d) PDPTW: Wasted Actions (%)
#Calls    #Iterations   DRLH    ALNS
18        1k            0.00    21.68
35        1k            0.00    28.65
80        1k            0.00    24.50
130       1k            0.00    19.60
300       1k            0.00    17.90
18        5k            0.00    30.88
35        5k            0.00    36.26
80        5k            0.00    27.49
130       5k            0.00    26.98
300       5k            0.00    26.10
18        10k           0.25    37.82
35        10k           0.00    36.60
80        10k           0.00    32.41
130       10k           0.08    29.67
300       10k           0.00    26.10
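The wasted-action percentages reported in Table 11 can be computed directly from the per-iteration log of a search. The following sketch is our own illustration of that bookkeeping; the function and argument names are not taken from the paper's code.

```python
def wasted_action_rate(actions, changed, deterministic=frozenset({"Find_single_best"})):
    """Percentage of deterministic-heuristic selections that were wasted.

    A selection is wasted when the same deterministic heuristic is chosen
    again immediately after an iteration in which it failed to change the
    solution (it is then guaranteed to fail again).

    actions: name of the heuristic chosen at each iteration.
    changed: whether the solution changed at each iteration.
    """
    total = wasted = 0
    for t, h in enumerate(actions):
        if h not in deterministic:
            continue
        total += 1
        if t > 0 and actions[t - 1] == h and not changed[t - 1]:
            wasted += 1
    return 100.0 * wasted / total if total else 0.0
```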


Fig. 6. Comparison of DRLH with the Deep RL method of Kool et al. (2019) (AM) on test instances of CVRP.

As shown in the figure, AM is not able to outperform the baseline of URS on size 20 with any of the sampling methods. For size 50 of the problem, the greedy approach still falls behind URS. However, given enough samples, AM manages to perform better than ALNS on some instances of the CVRP problem. On the other hand, our method, DRLH, outperforms this approach on every single instance size, as well as being able to handle different problems without any significant change in the code, which is not the case for the method of Kool et al. (2019).

7. Concluding remarks

For quite some time now, it has increasingly become evident that the fields of machine learning and (heuristic) optimization can mutually benefit from an integration. On the one hand, recent advances in optimization can support the development of advanced machine learning methods, since these methods generally solve an optimization problem (e.g., what is the optimal subset of features from a data set that predict a certain outcome). This paper addressed the mirror issue: how can optimization approaches benefit from an integration of machine learning methods. We demonstrated that applying a well-known machine learning approach to the selection of low-level operators in a metaheuristic framework results in a robust mechanism that can be used to improve the performance of a heuristic on a broad range of optimization problems. We believe that approaches like the one presented in this paper have the potential to make the development of a powerful heuristic less dependent on the knowledge of an experienced developer with a deep insight into the structure of the specific problem being solved, and may therefore be instrumental in the integration of metaheuristic ideas into general purpose software packages. Our proposed DRLH, a general framework for solving combinatorial optimization problems, utilizes a trained Deep RL agent to select the low-level heuristic to be applied to the solution in each iteration of the search, based on a search state consisting of features from the search process. In our experiments, we solved four combinatorial optimization problems (CVRP, PJSP, PDP, and PDPTW) using our proposed approach and compared its performance with the baselines of ALNS and URS. Our results show that DRLH is able to select heuristics in a way that achieves better results in a smaller number of iterations for almost all of the problem variations compared to ALNS and URS. Furthermore, the performance gap between DRLH and the baselines is shown to increase for larger problem sizes, making DRLH a suitable option for large real-world problem instances. Additional experiments on an extended set of heuristics show that DRLH is not negatively affected when selecting from a large set of available heuristics, while the performance of ALNS is much worse in this situation. Enriching or refining the state representation with additional information is possible with very little effort. We have experimented with adding problem-dependent information into the state representation and seen that this gives even better results than sticking with the simple chosen state representation. Yet once we start to introduce problem-dependent structure and constraint information into the state representation, we lose some of the generality that we strive for with DRLH, as we would have to separately engineer a different state representation for each new problem. For this reason we deem this outside the scope of this paper and leave this area open for future work.

Future research should provide more empirical evidence for the superiority of DRLH over ALNS by applying this novel hyperheuristic to different problems. A potential direction for improving the model in the future is designing a reward function that is both stable and takes into account the difference in objective value at each iteration of the search. Initial experiments on alternative reward functions have shown promising results (see Appendix A), but they are time-consuming to train and not very stable compared to the R_t^5310 reward function that we have used in this paper.

Appendix A. Experiments on different reward functions

A1. R_t^PM

R_t^PM = 1 if f(x') < f(x), and −1 otherwise,    (A.1)

where x denotes the current solution and x' the solution obtained after applying the selected heuristic.

The R_t^PM reward function focuses more heavily on intensification by punishing any action choice that does not directly improve upon the current solution. This causes the agent to favor intensifying heuristics more strongly than R_t^5310. However, because the PPO framework leverages the discounted future rewards, as opposed to only the immediate reward, for training the agent, even R_t^PM can cause the agent to select heuristics with a high likelihood of immediate negative reward if doing so sets it up for more positive rewards in future iterations.

Figure A.1 illustrates the distribution of minimum costs found on the PDP test set of size 100 after 1000 and 10,000 iterations for two different versions of DRLH, trained with the reward functions R_t^5310 and R_t^PM, respectively. The model trained with R_t^5310 achieves lower median and quantile values for both iteration variations compared to the model trained with R_t^PM. This makes the R_t^5310 reward function more reliable, and we therefore decided to use the R_t^5310 reward function in this paper.

Table A.1
Average results for PDPTW instances with mixed call sizes after 1000 iterations.

            DRLH with R_t^5310           DRLH with R_t^MC
#C    #V    Min Gap (%)   Avg Gap (%)    Min Gap (%)   Avg Gap (%)
18    5     0.00          0.18           0.00          0.11
35    7     2.67          5.78           1.48          3.65
80    20    3.04          4.85           3.15          4.39
130   40    3.44          4.66           2.99          4.33
300   100   2.40          3.15           2.28          3.00
lines of ALNS and URS. Our results show that DRLH is able to select


Table A.2
Average results for PDPTW instances with mixed call sizes after 10,000 iterations.

            DRLH with R_t^5310           DRLH with R_t^MC
#C    #V    Min Gap (%)   Avg Gap (%)    Min Gap (%)   Avg Gap (%)
18    5     0.00          0.00           0.00          0.13
35    7     0.67          2.02           0.42          2.32
80    20    1.80          2.95           2.55          3.87
130   40    1.93          2.84           2.20          3.04
300   100   0.00          0.64           1.12          1.88

A2. R_t^MC

R_t^MC = (f(x_best) − f(x)) / f(x_best)    (A.2)

The R_t^MC reward function more directly correlates with the intended objective of minimizing the cost of the best found solution, and of achieving this as quickly as possible. Instead of focusing on rewarding actions that directly improve the solution, this reward function is subject to the performance of the entire search process up to the current step, putting a greater emphasis on acting quickly and selecting heuristics that have a greater impact on the solution. The challenge with using this reward function compared to reward functions such as R_t^5310 and R_t^PM is that there is an inherent delay between when a good heuristic is selected and when the reward function gives a good reward. This makes it more difficult to train an agent using this reward function, making training times much longer and less stable than with the R_t^5310 reward function.

Having said that, the potential upside of using this reward function is very promising, and the results in Table A.1 show that R_t^MC is able to outperform the R_t^5310 reward function on 1k-iteration searches. However, the agents were unable to learn effectively for larger numbers of iterations such as 10k (Table A.2), and the results show that R_t^MC performs worse than R_t^5310 on 10k-iteration searches. A potential reason why the R_t^MC agents were unable to learn well on 10k-iteration searches is that improving iterations become much less frequent, making the feedback signal from the R_t^MC reward function even more delayed and of higher variance. Another potential reason is that solving 10k-iteration searches likely required more training than what was possible to carry out for our experiments due to time constraints. We encourage future work on improving the integration of the R_t^MC reward function into the framework of DRLH, as it likely has a lot of potential.

Appendix B. Extended set of heuristics

Tables A.3, A.4 and A.5 list the extended set of heuristics, built up from 14 removal operators, 10 insertion operators and 2 additional heuristics, for a total of 14 × 10 + 2 = 142 heuristics, using the generation scheme of Algorithm 2. Most of these heuristics only use problem-independent information, but some of them rely on problem-dependent information specific to the PDPTW problem.

Table A.3
List of extended removal operators.

Name                         Description
Random_remove_XS             Removes between 2–5 elements chosen randomly
Random_remove_S              Removes between 5–10 elements chosen randomly
Random_remove_M              Removes between 10–20 elements chosen randomly
Random_remove_L              Removes between 20–30 elements chosen randomly
Random_remove_XL             Removes between 30–40 elements chosen randomly
Random_remove_XXL            Removes between 80–100 elements chosen randomly
Remove_largest_D_S           Removes 5–10 elements with the largest Di
Remove_largest_D_L           Removes 20–30 elements with the largest Di
Remove_τ                     Removes a random segment of 2–5 consecutive elements in the solution
Remove_least_frequent_S      Removes between 5–10 elements that have been removed the least
Remove_least_frequent_M      Removes between 10–20 elements that have been removed the least
Remove_least_frequent_XL
Remove_one_vehicle           Removes all the elements in one vehicle
Remove_two_vehicles          Removes all the elements in two vehicles

Table A.4
List of extended insertion operators.

Name                         Description
Insert_greedy                Inserts each element in the best possible position
Insert_beam_search           Inserts each element in the best position using beam search
Insert_by_variance           Sorts the insertion order based on variance and inserts each element in the best possible position
Insert_first                 Inserts each element randomly in the first feasible position
Insert_least_loaded_vehicle  Inserts each element into the least loaded available vehicle
Insert_least_active_vehicle  Inserts each element into the least active available vehicle
Insert_close_vehicle         Inserts each element into the closest available vehicle
Insert_group                 Identifies the vehicles that can fit the most of the removed elements and inserts each element into these
Insert_by_difficulty         Inserts each element using Insert_greedy, ordered by their difficulty, which is a function of their compatibility with vehicles, strictness of time windows, size and more
Insert_best_fit              Inserts each element into the vehicle that is the most compatible with the call

Table A.5
List of extended additional heuristics.

Name                         Description
Find_single_best             Calculates the cost of removing each element and re-inserting it with Insert_greedy, and applies this procedure on the solution x for the element that achieves the minimum cost f(x)
Rearrange_vehicles           Removes all of the elements from each vehicle and inserts them back into the same vehicles using Insert_beam_search
Fig. 7. Comparison of the two reward functions.
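Both reward signals compared in the figure can be written compactly as functions of the relevant costs. The sketch below is only an illustration of Eqs. (A.1) and (A.2); the function and argument names are ours, not the paper's implementation.

```python
def reward_pm(new_cost: float, current_cost: float) -> float:
    """R_t^PM, Eq. (A.1): +1 for a strict improvement of the current solution,
    -1 otherwise."""
    return 1.0 if new_cost < current_cost else -1.0


def reward_mc(best_cost: float, current_cost: float) -> float:
    """R_t^MC, Eq. (A.2): relative gap between the best cost found so far and
    the cost of the current solution."""
    return (best_cost - current_cost) / best_cost
```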


Appendix C. Additional performance plots

Figures 8, 9, 10 and 11 show the performance of DRLH, ALNS and URS averaged over the test set for all the problems that we have tested. These show that DRLH usually reaches better solutions more quickly than ALNS and URS, as well as ending up with better solutions overall.

Appendix D. Experiment on the cross size training scheme

In this experiment, in the training phase, an instance of a specific problem with a different size is solved by DRLH in each episode. This training scheme is referred to as Cross Size (CS) training. During test time, the trained model solved the test instances that were used in Section 6.1, as well as test instances of sizes slightly different from those seen during training. As seen in Fig. 12, it is possible to train one model that can handle many different variations of instance sizes quite well. Moreover, as shown in Fig. 12(e)–(h), the model does not specifically overfit on the specific instance sizes included in the training when evaluated on slightly different test data. This means that DRLH_CS generalizes very well, even to sizes higher than any of the ones included in the training.
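A minimal sketch of this training loop is given below. It is illustrative only: `generate_instance` and `agent.run_episode` are placeholder interfaces, and the size mix and episode counts are not the exact values used for DRLH_CS.

```python
import random


def train_cross_size(agent, generate_instance, sizes=(18, 35, 80, 130),
                     episodes=1000, iterations=1000):
    """Cross Size (CS) training: every episode draws a fresh training instance
    whose size is sampled from a mix of sizes, so that a single model learns
    to handle many different instance sizes."""
    for _ in range(episodes):
        size = random.choice(sizes)
        instance = generate_instance(size)        # new instance each episode
        agent.run_episode(instance, iterations)   # one full search per episode
    return agent
```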


Fig. 8. Average performance of DRLH, ALNS and URS on CVRP.


Fig. 9. Average performance of DRLH, ALNS and URS on PJSP.


Fig. 10. Average performance of DRLH, ALNS and URS on PDP.


Fig. 11. Average performance of DRLH, ALNS and URS on PDPTW.


Fig. 12. Performance of DRLH with Cross Size (CS) training scheme on different problem sizes with 1k iterations.


References

Aksen, D., Kaya, O., Sibel Salman, F., & Özge Tüncel (2014). An adaptive large neighborhood search algorithm for a selective and periodic inventory routing problem. European Journal of Operational Research, 239(2), 413–426. https://doi.org/10.1016/j.ejor.2014.05.043.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. CoRR. http://arxiv.org/abs/1606.06565.
Asta, S., & Özcan, E. (2014). An apprenticeship learning hyper-heuristic for vehicle routing in hyflex. Orlando, Florida. https://doi.org/10.1109/EALS.2014.7009505.
Burke, E. K., Hyde, M., Kendall, G., Ochoa, G., Özcan, E., & Woodward, J. R. (2010). A classification of hyper-heuristic approaches. In M. Gendreau, & J.-Y. Potvin (Eds.) (pp. 449–468). Boston, MA: Springer US.
Chen, C., Demir, E., & Huang, Y. (2021). An adaptive large neighborhood search heuristic for the vehicle routing problem with time windows and delivery robots. European Journal of Operational Research, 294(3), 1164–1180. https://doi.org/10.1016/j.ejor.2021.02.027.
Chen, X., & Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems: vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/131f383b434fdf48079bff1e44e2d9a5-Paper.pdf
Cowling, P., Kendall, G., & Soubeiga, E. (2001). A hyperheuristic approach to scheduling a sales summit. In E. Burke, & W. Erben (Eds.), Practice and theory of automated timetabling iii (pp. 176–190). Berlin, Heidelberg: Springer Berlin Heidelberg.
Crama, Y., & Schyns, M. (2003). Simulated annealing for complex portfolio selection problems. European Journal of Operational Research, 150(3), 546–571. https://doi.org/10.1016/S0377-2217(02)00784-1. Financial Modelling.
Demir, E., Bektaş, T., & Laporte, G. (2012). An adaptive large neighborhood search heuristic for the pollution-routing problem. European Journal of Operational Research, 223(2), 346–359. https://doi.org/10.1016/j.ejor.2012.06.044.
Dokeroglu, T., Sevinc, E., Kucukyilmaz, T., & Cosar, A. (2019). A survey on new generation metaheuristic algorithms. Computers & Industrial Engineering, 137, 106040. https://doi.org/10.1016/j.cie.2019.106040.
Friedrich, C., & Elbert, R. (2022). Adaptive large neighborhood search for vehicle routing problems with transshipment facilities arising in city logistics. Computers & Operations Research, 137, 105491. https://doi.org/10.1016/j.cor.2021.105491.
Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA, USA: MIT Press. http://www.deeplearningbook.org
Grangier, P., Gendreau, M., Lehuédé, F., & Rousseau, L.-M. (2016). An adaptive large neighborhood search for the two-echelon multiple-trip vehicle routing problem with satellite synchronization. European Journal of Operational Research, 254(1), 80–91. https://doi.org/10.1016/j.ejor.2016.03.040.
Gullhav, A. N., Cordeau, J.-F., Hvattum, L. M., & Nygreen, B. (2017). Adaptive large neighborhood search heuristics for multi-tier service deployment problems in clouds. European Journal of Operational Research, 259(3), 829–846. https://doi.org/10.1016/j.ejor.2016.11.003.
Hemmati, A., & Hvattum, L. M. (2017). Evaluating the importance of randomization in adaptive large neighborhood search. International Transactions in Operational Research, 24(5), 929–942. https://doi.org/10.1111/itor.12273.
Hemmati, A., Hvattum, L. M., Fagerholt, K., & Norstad, I. (2014). Benchmark suite for industrial and tramp ship routing and scheduling problems. INFOR: Information Systems and Operational Research, 52(1), 28–38. https://doi.org/10.3138/infor.52.1.28.
Homsi, G., Martinelli, R., Vidal, T., & Fagerholt, K. (2020). Industrial and tramp ship routing problems: Closing the gap for real-scale instances. European Journal of Operational Research, 283(3), 972–990. https://doi.org/10.1016/j.ejor.2019.11.068.
Hottung, A., & Tierney, K. (2019). Neural large neighborhood search for the capacitated vehicle routing problem. CoRR. http://arxiv.org/abs/1911.09539.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680. https://doi.org/10.1126/science.220.4598.671.
Kool, W., van Hoof, H., & Welling, M. (2019). Attention, learn to solve routing problems!
Laborie, P., & Godard, D. (2007). Self-adapting large neighborhood search: Application to single-mode scheduling problems. In Proceedings MISTA-07: Vol. 8. Paris.
Li, Y., Chen, H., & Prins, C. (2016). Adaptive large neighborhood search for the pickup and delivery problem with time windows, profits, and reserved requests. European Journal of Operational Research, 252(1), 27–38. https://doi.org/10.1016/j.ejor.2015.12.032.
Lu, H., Zhang, X., & Yang, S. (2020). A learning-based iterative method for solving vehicle routing problems. In International conference on learning representations. https://openreview.net/forum?id=BJe1334YDH
López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Birattari, M., & Stützle, T. (2016). The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3, 43–58. https://doi.org/10.1016/j.orp.2016.09.002.
Nazari, M., Oroojlooy, A., Snyder, L., & Takac, M. (2018). Reinforcement learning for solving the vehicle routing problem. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems: vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/9fb4651c05b2ed70fba5afe0b039a550-Paper.pdf
Özcan, E., Misir, M., Ochoa, G., & Burke, E. (2010). A reinforcement learning - great-deluge hyper-heuristic for examination timetabling. International Journal of Applied Metaheuristic Computing, 1, 39–59.
Pisinger, D., & Ropke, S. (2019). Large neighborhood search. In Handbook of metaheuristics (pp. 99–127). Springer.
Ropke, S., & Pisinger, D. (2006). An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows. Transportation Science, 40(4), 455–472. https://doi.org/10.1287/trsc.1050.0135.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347. http://dblp.uni-trier.de/db/journals/corr/corr1707.html#SchulmanWDRK17
Shaw, P. (1998). Using constraint programming and local search methods to solve vehicle routing problems. In M. Maher, & J.-F. Puget (Eds.), Principles and practice of constraint programming — CP98 (pp. 417–431). Berlin, Heidelberg: Springer Berlin Heidelberg.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge, MA, USA: A Bradford Book.
Turkeš, R., Sörensen, K., & Hvattum, L. M. (2021). Meta-analysis of metaheuristics: Quantifying the effect of adaptiveness in adaptive large neighborhood search. European Journal of Operational Research, 292(2), 423–442. https://doi.org/10.1016/j.ejor.2020.10.045.
Tyasnurita, R., Özcan, E., Shahriar, A., & John, R. (2015). Improving performance of a hyper-heuristic using a multilayer perceptron for vehicle routing. Exeter, UK. http://eprints.nottingham.ac.uk/id/eprint/45707
