Reinforcement learning using fully connected, attention, and transformer models in knapsack problem solving
DOI: 10.1002/cpe.6509
Beytullah Yildiz
KEYWORDS
attention, combinatorial optimization problem, deep Q-learning, knapsack, reinforcement
learning, transformer
1 INTRODUCTION
We prepare a knapsack with the necessary items when we plan a hiking trip. The items can be bottled water, a sandwich, a compass, a lantern, and so on. At least two attributes, weight (and/or volume) and a value that determines the importance of an item, help us decide which of these items to take with us. Because the knapsack has a limited weight (and/or volume) capacity, determining how to fill it with the combination of items that gives the highest overall value is a major challenge. This challenge is known as the knapsack problem (KP). A wide variety of resource allocation problems can be considered as KP. Many applications, such as allocating study time among subjects while preparing for an exam, transportation logistics optimization, and role-playing games, involve the same problem. Its solutions are used in many practical applications such as retail and online advertising, where product diversity optimization problems arise widely. One of the most important decisions faced by a retailer is to maximize the expected earnings by choosing, from a certain number of items, a subset to be offered to customers.1
Knapsack is a combinatorial optimization problem that has been studied in applied mathematics and computer science for decades. It is classified as an NP-hard problem, and many exact and heuristic algorithms have been proposed.2,3 The need for faster solutions has triggered many new approaches. Reinforcement Learning (RL) has become a candidate with the potential to be an important milestone in solving KP due to its recent successes in other areas. RL, which can be described as trial-and-error learning, determines its actions based on past experiences or new trials. It uses an environment that can be defined as a function that converts an action performed in the current state into the next state and a reward. An
RL agent receives a numerical reward expressing the success of an action’s outcome from the environment and trains itself to choose actions that
maximize the reward accumulated over time. Despite its significant achievements in many areas, RL also has several challenges to be resolved. One of these challenges is the curse of dimensionality: trying to explore all possible actions in all possible states can reach a level of complexity that cannot be overcome. Deep Q-learning, which is largely unaffected by the size of the state-action set, is a convenient method for providing a solution.
Deep Q-learning, an RL method, made a huge impact when it was introduced by DeepMind in 2013. For the first time, RL agents using convo-
lutional neural networks were able to learn from high-dimensional visual inputs.4 RL trained agents performed superhumanly in roughly half of the
Atari 2600 console games, using only screen pixels as inputs and values as rewards. Using deep neural networks5 as function estimators within an
RL system,6 deep reinforcement learning (DRL) has been shown to be effective in many areas including Atari games,4 Go game,7 dialog systems, text
generation, computer vision, and robotics.8 DRL can be considered as a step towards embodying universal artificial intelligence.9 However, modern
DRL systems have several shortcomings. First, learning speeds are slow as they inherit the need for very large datasets for training from deep learn-
ing. Second, a trained agent can perform poorly on a new task. Third, they can perform high-level tasks such as planning and reasoning only with
the patterns they can obtain from the statistical structures found in training data. Fourth, they are unlikely to form a system that creates a humanly
understandable chain of reasons for choosing the right actions.
Using exact solution methods such as branch and bound, and dynamic programming for NP-hard problems can impose a heavy burden due to the length
of processing time. On the other hand, approximate solutions may also have drawbacks. For example, the quality of the labels is an important factor
in the success of supervised learning models. Therefore, the challenges of obtaining high-quality labeled data or relabeling them for new prob-
lems can be among the main obstacles to successful solutions. For these reasons, using a DRL method, which makes use of relatively simple reward
mechanisms and deep neural networks, in solving combinatorial optimization problems provides significant gains. In this article, we present a deep
Q-learning with deep neural network models that use fully connected layers, attention mechanism, and transformer encoder block to solve KP. To
the best of our knowledge, no prior work has applied our deep Q-networks approach to a KP solution. We have taken advantage of the attention mechanism and the transformer encoder block, proven by numerous studies in fields such as language modeling, question answering, and text summarization, to advance RL performance on top of fully connected layers. In addition to this important contribution, we have also created a knapsack simulation environment compatible with the OpenAI Gym library. For the proposed method to exhibit adequate performance, studies have also been carried out to optimize the reward function and the hyperparameters used.
In Section 2, related works are discussed. We provide the necessary information related to research topics in Section 3. The research
methodology is given in Section 4. Evaluation and benchmarking are explored in Section 5. We conclude with the outcomes in Section 6.
2 RELATED WORK
Knapsack is a combinatorial optimization problem that has been explored in many studies. Martello and Toth evaluated both exact and approximate
algorithms for solving KP.10 In the article, branch-and-bound algorithms, dynamic programming methods, and a combination of these two approaches
were examined. The performances of the algorithms were evaluated by randomly generated sample items. It was also noted that approximate
algorithms are mostly based on greedy approaches or scaling methods.
Most real-life KPs are not deterministic as some parameters are unknown when being solved. Kosuch and Lisser11 proposed a solution to fill a
knapsack with items whose actual weights were not yet known. They explained that the optimization problem can be solved either by a single-stage
decision that needs to be taken before the weight values are revealed or by a two-stage decision that allows correction of the previously made
decision. The authors claimed that the second solution was a more accurate method and emphasized that the item weights were determined with a
normal distribution at the beginning. In the second stage, the items were added if there was sufficient capacity but were not added to the knapsack
if there was not enough capacity.
Chu and Beasley12 proposed a heuristic method based on genetic algorithms for multidimensional KP. An intuitive operator using
problem-specific information was included in the standard genetic algorithm approach. It was stated that high-quality solutions for various problems
with the method used were reached within a reasonable time. Sahni13 presented approximate solutions for KP with polynomial time complexity and
linear storage requirement. These approximate solutions were claimed to give near-optimum answers in most cases. Kulkarni and Shabir2 applied
the cohort intelligence method inspired by individuals’ learning from each other with their natural and social inclinations, with the number of objects
varying between 4 and 75 to solve KP. It was indicated that the method applied produced satisfactory results at a reasonable computational cost.
Also, the cohort intelligence algorithm was compared with other modern methods and claimed to be better than other methods for specific prob-
lems. As with KP, optimization has a wide variety of problem areas and solutions. There are optimization techniques such as the analytical hierarchy
process to solve a multi-objective decision-making problem by relying on the judicious allocation of available resources.14,15 On the other hand,
some solutions such as the Probabilistic Trans-Algorithmic Search framework exploit several search algorithms iteratively for optimization.16
Solutions using neural networks for KP are also presented as approximate methods. Gu and Hao17 proposed a purely data-driven approach
using recurrent neural networks for the KP solution. Motivated by the successful application of pointer networks to the traveling salesman problem, a
pointer network was applied to solve KP. The coefficients of each variable of the target function and the limitations in KP were given as an input to the
pointer network. Optimal results were used as labels for supervised training of the model. Denysiuk et al.18 presented a solution for KP by applying
binary classification using artificial neural networks where the values and weights of items were used as inputs. A structure called neuroevolution,
which uses evolutionary algorithms, was applied to adjust the parameters of the neural networks because the target values were unknown.
Bello et al.19 proposed a framework for solving combinatorial optimization problems using RL and neural networks. Two approaches based on
the policy gradient method were discussed. The first approach, called RL pretraining, utilized a training set to optimize a recurrent neural network.
A stochastic policy using the expected reward as an objective was applied to this network. At the test time, the policy was fixed and the result was
obtained with a greedy method. The second approach, called active search, requires no prior training. It starts with a random policy and iteratively
optimizes the parameters of the recurrent neural network on a single test sample, using the expected reward, while watching the best solution sam-
pled during the search. A pointer network that effectively indicates a specific location was also applied in these approaches. Although the approach targeted the traveling salesman problem, the authors stated that the solution could be adapted to KP.
As a combinatorial optimization problem, path planning has been one of the main research topics for many studies, such as land vehicles,
wheeled mobile robots, transportation, and travel path inference.20–23 Nazari et al.24 conducted a study based on an approach provided by Bello
et al. to solve the problem of vehicle routing. They claimed that the direct use of the approach was not appropriate because it was considered to
be static over time. Since the demands could change after the node visit, they proposed an alternative approach that could effectively handle both
static and dynamic situations. The policy model was made up of a recurrent neural network decoder with an attention mechanism added, and the
embeddings of static elements were used as input to the decoder of the recurrent neural network. The outputs of the recurrent neural network and
the embeddings of dynamic elements were fed into the attention mechanism.
Dai et al.25 proposed a method to solve graph problems by using a graph embedding structure and a deep Q-learning algorithm. The greedy policy being learned acted like a meta-algorithm that gradually built a solution. The action was determined by the output of a graph embedding
network over the current state of the solution. It was expressed that the proposed solution could be applied to various optimization problems such as
maximum cutting and traveling salesman problems. However, it could not provide a general solution to the traveling salesman problem, as a particular
node could be visited multiple times. Parisotto et al.26 proposed an architecture with new elements and arrangements for RL training. Although the
structure was similar to the transformer structure, it had differences in terms of reordering layer normalization modules and using a gating layer
instead of standard residual connections. They noted that with these improvements, RL training could be done more consistently.
When we look at studies using RL, they generally aim to provide a solution to combinatorial optimization problems other than KP. The only work
that mentions KP is that of Bello et al.,19 which focused more on the traveling salesman problem. Only a small section of this article was devoted to
KP to confirm that their solution could be generalized for combinatorial optimization problems. In this study, we focus on KP by introducing deep Q
learning with deep neural network models that use fully connected layers, attention mechanism, and transformer encoder block. To the best of our
knowledge, we have not seen a comprehensive study corresponding to our approach to a KP solution.
3 BACKGROUND
In this section, we will briefly explain KP, RL, deep Q-networks, attention mechanism, and transformer.
3.1 Knapsack problem

KP aims to maximize the sum of the values when a set of n items, each with a weight $w_i$ and a value $p_i$, is put into a knapsack with a maximum weight capacity W. KP can be defined as:

$\max_{S \subseteq \{1,\dots,n\}} \; \sum_{i \in S} p_i$   (1)

subject to

$\sum_{i \in S} w_i \le W$   (2)

where $w_i$ and $p_i$ are non-negative integer variables and S denotes the set of selected items. In addition to weight, the knapsack can be subject to a volume capacity:

$\sum_{i \in S} v_i \le V$   (3)

The maximum volume capacity of the knapsack is denoted by V, and the volume of each item added to the bag is indicated by $v_i$.
KP is defined as NP-hard. A simple yet powerful intuitive solution can be achieved by selecting items with the best value/weight ratio to put in
the knapsack until weight capacity is reached. However, this solution may not always be optimal. Moreover, although better results can be obtained
with the branch-and-bound algorithm or dynamic programming methods, the computational cost becomes a high burden for a larger number of items.
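For concreteness, the following is a minimal Python sketch of the two baselines mentioned above: the greedy value-to-weight heuristic and an exact dynamic-programming solver (the latter serves as the optimality baseline in Section 5). The function names and the integer-weight assumption are ours.

```python
from typing import List

def greedy_knapsack(values: List[int], weights: List[int], capacity: int) -> int:
    """Heuristic: take items in decreasing value/weight ratio until the capacity is reached."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / max(weights[i], 1), reverse=True)
    total_value, remaining = 0, capacity
    for i in order:
        if weights[i] <= remaining:
            total_value += values[i]
            remaining -= weights[i]
    return total_value

def dp_knapsack(values: List[int], weights: List[int], capacity: int) -> int:
    """Exact dynamic-programming solution; O(n * capacity) time for integer weights."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):   # iterate backwards so each item is used at most once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]
```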
3.2 Reinforcement learning

RL is a method in which an agent learns from the states of a given environment and tries to maximize reward through its actions. While learning, RL makes mistakes during many trials, corrects them gradually, and finds the desired answer over time. Let us consider an RL agent that interacts with an environment. The environment E is defined by a set of states S, a set of actions A, and a reward function $r: S \times A \to \mathbb{R}$. A policy is a mapping from states to actions, $\pi: S \to A$. Each episode starts with an initial state. At each step, the agent generates an action based on the current state, $a_t = \pi(s_t)$. The environment then returns a new state and a reward $r_t = R(s_t, a_t)$. The discount factor $\gamma \in [0, 1]$ captures the notion that current rewards are more valuable than future rewards. The discounted sum of future rewards is called the return, $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$. The agent's goal is to maximize the expected reward $E_{s_0}[R_0 \mid s_0]$. The Q-function used during these operations is $Q^{\pi}(s_t, a_t) = E[R_t \mid s_t, a_t]$.

If we define the optimal course of actions as $\pi^*$, optimality must hold for every $s \in S$, $a \in A$, and any course of actions $\pi$. In other words, all optimal courses of actions must satisfy the optimal Q-function. The Bellman equation fulfills these requirements:

$Q^{*}(s, a) = E_{s'}\big[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')\big]$   (4)
Thus, RL is the process of observing the rewards achieved by an agent running sequences of state-action pairs and adapting the Q-function
according to the rewards until the agent accurately predicts the best path to take. One of the challenges that arises in the RL process is the choice between exploration and exploitation. To maximize the total reward, an RL agent should choose actions that have been tried in the past and found to be effective. However, in order to discover such actions, it must also try actions that it has not previously chosen. In short, an agent must both exploit what it has already experienced and explore actions it has not encountered before in order to maximize the overall reward. Therefore, various actions should be tried, and those that look best should be promoted gradually.
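One common way to manage this trade-off, and the one used later in this work (Section 4), is an epsilon-greedy strategy. A minimal sketch, with the function name being ours:

```python
import random

def epsilon_greedy_action(q_values, epsilon: float) -> int:
    """With probability epsilon explore (random action); otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # exploration
    return int(max(range(len(q_values)), key=lambda a: q_values[a]))  # exploitation
```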
3.3 Deep Q-networks

Q-learning uses a memory table Q(s, a) in which Q values are stored for all possible combinations of states and actions. It is like keeping a list of the best moves if you are a chess player. During the game, we make a move by choosing an action according to our own and the opponent's situation. After each move, the resulting position, namely our place on the chessboard and the reward we receive, provides learning. Knowing the outcome of a single move at each step and seeing the reward gives a one-step look-ahead advantage. In short, the value of the action we take in the current state, Q(s, a), can be estimated from the reward and the value in the next state, $R + Q(s', a')$. As we continue our actions, the Q values in the memory table keep being updated; they get better and better, and eventually approach the optimal values. We use the information in the memory table for a new move and determine the best action we can take in our position. Q-learning,27 one of the most important inventions for RL, is stated as follows:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big]$   (5)

where $\alpha$ is the learning rate.
Deep Q-networks28 replace the memory table with a deep neural network that approximates the Q-function. Observed transitions of state, action, reward, and next state are stored in an experience replay memory, and training batches are randomly sampled from this experience replay memory. To make the optimization procedure more stable, two identical networks are used, the main network and the target network. The target network is usually updated at a slower rate than the main network. The general practice is to update the weights of the target network periodically with the current weights of the main network. This process is explained in Algorithm 1.
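Algorithm 1 is not reproduced here; the following is a minimal sketch of one deep Q-learning update with a main and a target network, assuming TensorFlow/Keras models and minibatches of (state, action, reward, next state, done) tuples sampled from the replay memory. The helper names are ours.

```python
import tensorflow as tf

GAMMA = 0.99  # discount factor (the value used later in Section 4)

def train_step(main_net, target_net, optimizer, loss_fn, batch):
    """One deep Q-learning update on a minibatch sampled from the experience replay memory."""
    states, actions, rewards, next_states, dones = batch
    # The future value (Q-hat) comes from the slowly updated target network.
    future_q = tf.reduce_max(target_net(next_states), axis=1)
    targets = rewards + GAMMA * future_q * (1.0 - dones)
    with tf.GradientTape() as tape:
        q_values = main_net(states)                            # Q(s, .) from the main network
        mask = tf.one_hot(actions, q_values.shape[1])
        chosen_q = tf.reduce_sum(q_values * mask, axis=1)      # Q(s, a) for the actions taken
        loss = loss_fn(targets, chosen_q)                      # for example, the Huber loss
    grads = tape.gradient(loss, main_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, main_net.trainable_variables))
    return loss

def sync_target(main_net, target_net):
    """Periodically copy the main network weights into the target network."""
    target_net.set_weights(main_net.get_weights())
```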
3.4 Attention
An attention mechanism works on a given input in a way similar to human behavior. We focus on the word we are currently reading, but at the same time, our mind treats the important words of the text more carefully to create context. The human visual system similarly allows us to see the area we focus on in high resolution and the surrounding parts in low resolution, and to make inferences accordingly. Likewise, an attention mechanism focuses on the input and decides which parts of the input are important at each step. In general terms, it can be described as mapping a query and a set of key-value pairs to an output. To perform this mapping, it processes a set of query vectors $q_0, \dots, q_m$, key vectors $k_0, \dots, k_n$, and value vectors $v_0, \dots, v_n$ to create a context vector per query.
An attention mechanism called additive attention was introduced by Bahdanau et al.32 for an encoder-decoder model. The context vector of this attention mechanism is created by summing the hidden states of the input sequence weighted by alignment scores. Each score is estimated by an alignment model that measures how well the input at position i and the output at position t match. For the alignment score, a feed-forward network with a single multilayer perceptron is used and trained together with the other parts of the model. The alignment score using the nonlinear tanh activation function is given as:

$\mathrm{score}(q_t, k_i) = W_s \tanh(W_q q_t + W_k k_i)$   (6)
where $W_s$, $W_q$, and $W_k$ are weight matrices to be learned in the alignment model. Here, dynamic weights that represent the relative importance of the inputs in a sequence (keys) for a particular output (query) are trained. For normalization, softmax is applied to scale the weight values. Multiplying these weights by the input sequence (values) generates the weighted context vector:

$c_t = \sum_{i} \alpha_{t,i} \, v_i, \qquad \alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(q_t, k_i)\big)$   (7)
Global and local attention mechanisms with a slightly different structure were proposed by Luong et al.33 These two categories of attention
mechanisms vary depending on whether the attention is placed on all positions or several positions in the input sequence. The global attention
model is computationally challenging when it takes into account all the words of the long source sequence to predict target values. To overcome this
problem, the local attention mechanism prefers to handle only a small subset of input locations per target as opposed to global attention using the
entire input sequence. Despite their difference, most of the processing steps of these mechanisms are the same except that they diverge on how
the context vectors are created. Three alignment scoring functions are provided: dot, general, and concat. These scoring functions take the encoder outputs and the decoder hidden state generated in the previous step to compute the alignment scores. The dot scoring function is the most basic of the three; it simply multiplies the encoder output by the decoder hidden state: $\mathrm{score}(q_t, k_i) = q_t k_i^{\top}$. The general scoring function extends the dot scoring function with a weight matrix: $\mathrm{score}(q_t, k_i) = q_t W k_i^{\top}$. The concat function is similar to the way alignment scores are computed in the attention mechanism of Bahdanau et al., except that the sum of the decoder hidden state and the encoder hidden state is multiplied by the weight matrix: $\mathrm{score}(q_t, k_i) = v^{\top} \tanh(W(q_t + k_i))$. Attention is then calculated as in Equation (7).
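A small NumPy illustration of the three scoring functions described above, treating $q_t$ and $k_i$ as vectors; the parameter arrays W and v stand in for the learned weights and are ours.

```python
import numpy as np

def dot_score(q, k):
    """Luong 'dot' score: the inner product of query and key."""
    return q @ k

def general_score(q, k, W):
    """Luong 'general' score: a learned matrix W between query and key."""
    return q @ W @ k

def concat_score(q, k, W, v):
    """'concat' score: additive combination passed through tanh, as in Bahdanau-style attention."""
    return v @ np.tanh(W @ (q + k))
```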
Scaled dot product, a new attention mechanism used in the Transformer model, was introduced by Vaswani et al.34 It resembles the dot product
attention mechanism provided by Luong et al. It adds a scaling factor to have effective learning; the dot products of queries with the keys are divided
by the square root of the dimension of queries and keys. A softmax function is applied to the calculated values to obtain the weights:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right) V$   (8)
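A minimal NumPy sketch of Equation (8); the function name is ours, and no masking or batching is handled.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (8): softmax(Q K^T / sqrt(d_k)) V, computed row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (num_queries, num_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # numerically stable softmax
    return weights @ V                                     # one context vector per query
```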
Attention has become very influential in the Artificial Intelligence community as a core component of neural architectures for numerous appli-
cations in Natural Language Processing and Computer Vision. During this quest, various types of attention mechanisms were also introduced in
addition to the attention mechanisms mentioned above. Cheng et al.35 used intra-attention, also called self-attention, to perform machine reading.
Xu et al.36 applied hard or soft attention mechanisms, reaching the entire image or only a portion of the image, for image captioning. To extract the
features, images were first encoded by convolutional layers; then a long short-term memory (LSTM)-based decoder, in which attentive weights are generated by the attention mechanism, was applied to create individual descriptive words.
3.5 Transformer
The Transformer was introduced by Vaswani et al.34 in the famous article "Attention Is All You Need." It has an encoder-decoder architecture. The encoder takes a variable-length sequence as input and transforms it into a fixed-size contextual encoding sequence. The encoder of the Transformer consists of six identical layers, each containing two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection followed by a normalization layer is employed around each of the sub-layers. The decoder computes the conditional probability distribution of a target sequence given the encoded representation from the encoder. The decoder architecture is almost the same as the encoder but differs in that the decoder includes two multi-head attention sub-layers, the first of which is masked so that the model cannot attend to subsequent positions in the target sequence.
The main component of the Transformer is the multi-head self-attention mechanism given in Equation (9). The self-attention structure estimates a portion of a data sample using other parts of the same data sample. The scaled dot-product attention described in the Attention section is used to compute the context vector. The multi-head mechanism runs the scaled dot-product attention multiple times in parallel on split inputs to take advantage of parallelism for fast computation. After each head has been processed, the individual attention outputs are concatenated and projected:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$   (9)
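In practice, the split/attend/concatenate steps are packaged by libraries; for example, the Keras MultiHeadAttention layer can be used for self-attention as sketched below. The shapes and head count are illustrative, not the settings of this paper.

```python
import tensorflow as tf

# Self-attention: query, key, and value are the same sequence.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)
x = tf.random.normal((1, 10, 64))          # (batch, sequence length, feature size)
context = mha(query=x, value=x, key=x)     # heads run in parallel; outputs are concatenated and projected
print(context.shape)                        # (1, 10, 64)
```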
Transformer architecture uses positional encoding. The position is important for processing sequential data. For example, the position and
order of words are key parts of languages. Recurrent Neural Networks naturally consider word order when processing a text. However, Trans-
former removes the recurrence mechanism in favor of the multi-headed self-attention to speed up training time and capture longer dependencies.
Therefore, positional encoding is used to give a sense of position to the model by adding a piece of information about the location.
The Transformer architecture has had breakthrough success in a wide variety of domains: language modeling,37–39 neural machine translation,40
text summarization,41 and question answering.42 It should be noted that the repeated success of the Transformer architecture in many areas makes it an ideal candidate for RL problems as well, especially where episodes can span many steps and the observations critical to a decision are often spread across the entire episode.
4 METHODOLOGY
The proposed architecture that we use to train the models with RL in the developed knapsack environment is shown in Figure 2. There are two cycles in this architecture. The first, which we call the action cycle, is the cycle in which we repeatedly pack our knapsack using the simulation environment. In this cycle, the sample inputs required for Q-network training are collected. The learning cycle, on the other hand, extracts statistics and determines patterns from the data obtained while the action cycle carries on. This cycle continues for a predetermined number of steps. Ultimately, the knowledge of the optimal knapsack packing process is learned by the Q-network.
We used three models as function approximators: a model with fully connected layers, a model with attention, and a model with a transformer encoder block. As the first model, three fully connected layers containing 128, 196, and 128 nodes were utilized to record the policy that progressively improved during deep Q-learning. We determined the initial weights of the neural network with the GlorotUniform initializer.
For the second model, we used the dot-product attention mechanism proposed by Luong et al. In dot-product attention, the score is basically the product of the query and the key. After applying the softmax function, the score is multiplied by the values. Since this is self-attention, the value, query, and key contain the same values. Two fully connected layers containing 96 and 64 nodes were added after the attention layer.
Finally, two transformer encoder blocks were used as the third model. Since we already had numerical values as inputs, the embedding layer that usually maps a given input into an n-dimensional space was removed. Instead, the 9-dimensional input was directly transformed into an n-dimensional space. A single normalizer is utilized for the input instead of the two normalizers used in the paper of Vaswani et al. Two fully connected layers containing 128 and 96 nodes were used after the transformer encoder blocks. In all models, the softmax activation function in the output layer was discarded because the outputs are real values, not probabilities. A linear activation function was used in its place.
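A sketch of the third model under the same assumptions; the projection dimension, number of heads, and feed-forward width are illustrative, and the single-normalizer encoder block follows the description above rather than the original Transformer layout.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, num_heads: int = 2, key_dim: int = 16, ff_dim: int = 64):
    """Transformer encoder block with a single normalization layer, as described above."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)          # residual connection, one normalizer
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return x + ff                                      # second residual connection

def build_transformer_q_network(state_size: int = 9, d_model: int = 32,
                                num_actions: int = 2) -> tf.keras.Model:
    """Q-network with two transformer encoder blocks followed by dense layers of 128 and 96 nodes."""
    inputs = layers.Input(shape=(state_size,))
    x = layers.Reshape((state_size, 1))(inputs)
    x = layers.Dense(d_model)(x)                       # project the raw features instead of using an embedding
    for _ in range(2):                                 # two encoder blocks
        x = encoder_block(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(96, activation="relu")(x)
    outputs = layers.Dense(num_actions, activation="linear")(x)
    return tf.keras.Model(inputs, outputs)
```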
The Adam optimizer and the Huber loss function with a delta value of 2.0 were applied for optimization. The Huber loss function behaves between the mean absolute error and the mean squared error, depending on the delta parameter. It is an effective loss function for data with high variability and a few outliers; therefore, it is an appropriate loss function for RL. The action space includes two actions, representing putting an item into the knapsack and not putting it in. We reduced the epsilon value used to balance exploration and exploitation with a 0.999 decay rate, starting from 1 down to a minimum value of 0.001. We chose 50,000 as the experience replay memory size. The value of the future reward was discounted with a factor of 0.99, reflecting the fact that the reward of the current action is preferred over the value of a future action.
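The corresponding training configuration, collected from the values stated above; the container and function names are ours.

```python
import tensorflow as tf

# Training configuration values stated in the text.
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.Huber(delta=2.0)     # between MAE and MSE, controlled by delta

GAMMA = 0.99                  # discount factor for future rewards
REPLAY_MEMORY_SIZE = 50_000   # experience replay memory capacity
EPSILON_MIN = 0.001
EPSILON_DECAY = 0.999

def decay_epsilon(epsilon: float) -> float:
    """Multiplicative decay from 1.0 down to the minimum of 0.001."""
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```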
5 EVALUATION
5.1 Knapsack environment

We developed a knapsack environment compatible with the Gym library provided by the OpenAI working group. A Gym environment offers a standard interface for RL with functions such as reset, step, and render. A total capacity limit of 1000 for weight and a total capacity limit of 1000 for volume were applied. We assigned random numbers between 0 and 100 to each item's volume, weight, and value. During the training process, we used 30 items for each knapsack filling episode. State, reward, done, and info parameters were gathered at the initial stage of an episode and at each subsequent step. The state consists of nine parameters: weight capacity, volume capacity, bag weight, bag volume, bag value, item weight, item volume, item value, and the number of remaining steps. As the reward, the item value was given if the item was placed in the knapsack and a negative value if it was not. With this simple reward function, we ensured that items were placed in the knapsack in the most appropriate way and that always leaving items out was not preferred.
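A skeleton of a Gym-compatible knapsack environment matching this description. It is a simplified sketch; details such as how an over-capacity "put" action is handled are our assumptions.

```python
import numpy as np
import gym
from gym import spaces

class KnapsackEnv(gym.Env):
    """Simplified knapsack environment: one item is offered per step; action 1 packs it, action 0 skips it."""

    def __init__(self, n_items=30, weight_capacity=1000, volume_capacity=1000, skip_penalty=-20):
        super().__init__()
        self.n_items = n_items
        self.weight_capacity = weight_capacity
        self.volume_capacity = volume_capacity
        self.skip_penalty = skip_penalty                 # negative reward when an item is not packed
        self.action_space = spaces.Discrete(2)           # put / do not put
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(9,), dtype=np.float32)

    def reset(self):
        self.items = np.random.randint(0, 101, size=(self.n_items, 3))  # weight, volume, value in [0, 100]
        self.step_idx = 0
        self.bag_weight = self.bag_volume = self.bag_value = 0
        return self._state()

    def _state(self):
        w, v, p = self.items[self.step_idx]
        return np.array([self.weight_capacity, self.volume_capacity,
                         self.bag_weight, self.bag_volume, self.bag_value,
                         w, v, p, self.n_items - self.step_idx], dtype=np.float32)

    def step(self, action):
        w, v, p = self.items[self.step_idx]
        reward = self.skip_penalty
        # Assumption: an over-capacity "put" is treated like a skip.
        if action == 1 and self.bag_weight + w <= self.weight_capacity \
                and self.bag_volume + v <= self.volume_capacity:
            self.bag_weight += w
            self.bag_volume += v
            self.bag_value += p
            reward = int(p)                              # reward equals the value of the packed item
        self.step_idx += 1
        done = self.step_idx >= self.n_items
        obs = self._state() if not done else np.zeros(9, dtype=np.float32)
        return obs, reward, done, {}
```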
5.2 Training
We ran our deep Q-learning models and the knapsack simulation environment in Google Colab with a graphics card accelerator. For training, the simulation environment must first be initialized with the reset command, which returns an initial state of the environment.
Progress is made with a step function that takes an action as a parameter. Initially, the epsilon value is set to 1. The probability of choosing a random action is higher when this value is close to 1; in other words, learning is in the exploration phase. In this phase, we collect reward and observation pairs from the environment by trying new actions in the knapsack environment. As the epsilon value approaches 0, we start to benefit more from our past experiences. This phase is called exploitation. To put it more clearly, we begin to determine the best action to take in our current state by looking at our past experiences. This action is determined through the neural networks that we are training using RL. After determining an action, either randomly or by looking at our past experiences, we take a step in the knapsack environment.
After each action, the values of our knapsack, such as the total weight, total volume, total value of the items, and the remaining capacities, change. The knapsack environment then informs us about the new state we are in, the reward value we have earned, and whether the knapsack filling process is over. This information is added to the experience replay memory after we update the total reward value that we are trying to maximize. Each experience replay memory entry contains the current state, action, reward, next state, and done flag.
We started training our deep neural networks once a sufficient number of sample entries were stored in the experience replay memory. We speak of deep neural networks in the plural because we use twin networks: one is the main network on which training is carried out, and the other is the network we use to make training more stable. We also used this second network to calculate the future Q̂ value. For training, we sampled a randomly selected batch of 512 entries from the experience replay memory each time. We copied the weights of the trained main network to the target network after every 100 knapsack filling operations. In this way, we achieved more stable learning that was not overly affected by instantaneous changes during training. The guiding factor for correct training is the reward value given as feedback by the knapsack environment. Different functions and values can be used as the reward, but a simple and accurate guiding reward function offers the most convenient and fastest solution.
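Putting the pieces together, a condensed sketch of the training loop described above. It reuses the helper sketches from earlier sections (KnapsackEnv, build_fc_q_network, epsilon_greedy_action, decay_epsilon, train_step, sync_target, and the optimizer/loss objects), and the per-step epsilon decay is our assumption.

```python
import random
from collections import deque
import numpy as np

env = KnapsackEnv()
main_net = build_fc_q_network()          # any of the three Q-network builders can be used here
target_net = build_fc_q_network()
target_net.set_weights(main_net.get_weights())

replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)
epsilon, BATCH_SIZE, TARGET_SYNC_EVERY = 1.0, 512, 100

for episode in range(3000):
    state, done, total_reward = env.reset(), False, 0
    while not done:
        q_values = main_net(state[np.newaxis, :]).numpy()[0]
        action = epsilon_greedy_action(q_values, epsilon)
        next_state, reward, done, _ = env.step(action)
        replay_memory.append((state, action, reward, next_state, float(done)))
        state, total_reward = next_state, total_reward + reward
        epsilon = decay_epsilon(epsilon)             # assumption: epsilon decays once per step
    if len(replay_memory) >= BATCH_SIZE:
        batch = random.sample(replay_memory, BATCH_SIZE)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        train_step(main_net, target_net, optimizer, loss_fn,
                   (states.astype(np.float32), actions.astype(np.int32),
                    rewards.astype(np.float32), next_states.astype(np.float32),
                    dones.astype(np.float32)))
    if episode % TARGET_SYNC_EVERY == 0:
        sync_target(main_net, target_net)
```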
5.3 Results
We conducted experiments for the deep Q-networks using the model having fully connected layers, the model containing the attention layer, and
the model including the transformer’s encoder block. Although the weight, volume, and value of the items are randomly assigned, the order of items
is set the same for different models during training so that the comparison can be made correctly. In other words, the models received the same
sequence of items. The total reward values earned for each knapsack episode were collected during training. As seen in Figure 3, as the number of
episodes increases, the total reward earned also increases. Initially, the total reward obtained by using the randomly chosen actions appears to be
below 400. It is reasonable that the total reward initially sits at these levels: the decision to put an item in the knapsack is made by a random function with a uniform distribution, the reward earned for an item placed in the knapsack equals the value of that item, between 0 and 100, and a value of −20 is received for an item not packed.

FIGURE 3 Changing total reward values while training the deep Q-learning models used in solving KP

TABLE 1 The ratio of total reward to the optimum value of the knapsack solution
As the learning progresses, the Q-networks continue their training with randomly selected batches from the experience replay memory where the state-action-reward information is stored. In the process, deep Q-learning focuses on maximizing the total reward and ensures that
the pattern required for this objective is learned by the Q-networks. In this experiment, the optimal total reward was around 1200 and each knap-
sack episode contained 30 items to pack. When reaching 3000 episodes, it is observed that the average total reward is approaching the optimal
knapsack values measured using dynamic programming as a baseline. Q-network with fully connected layers performs well. However, we observe
superior total rewards when the attention mechanism is introduced. The model that uses the attention layer performs slightly better. We get the best
output when using the transformer encoder blocks. At the same time, this model enables the total reward earned to be maximized earlier. Table 1
gives the ratio of the total reward for each model to the optimal value obtained from the dynamic programming implementation. After training each RL model, we collected the total rewards for 100 knapsack filling processes using the RL models and dynamic programming. The ratios of the total rewards and their standard deviations were calculated from these values. The ratios, especially for the transformer model, are good enough for a KP solution with significant processing acceleration.
The RL agent being trained brings the total reward earned by the knapsack in each episode closer to the optimal value while significantly
shortening the processing time compared to exact solutions. As seen in Figure 4, our trained agent can obtain results that are very close to the optimal value approximately 40 times faster. In this performance test, 100 different knapsack filling processes were carried out and the processing times were averaged. The experimental results show that knapsack filling processes are completed in less than a second using the RL models, while dynamic programming takes over 32 s. Although the models using fully connected layers and attention are faster, their accuracy is not as good as that of the model using transformer encoder blocks. Since the difference in processing time is very small, the trade-off between time and performance favors the model using the transformer encoder blocks. Between dynamic programming and the RL models, the speed difference will grow as the number of items added to the knapsack increases: the RL agents' knapsack filling time increases linearly, whereas, because the problem is NP-hard, the filling time of an exact solution will grow non-polynomially.
FIGURE 4 Comparison of execution times for dynamic programming, which is an exact solution, and the deep Q-learning models, which are approximate methods

FIGURE 5 Average total reward and average processing time (seconds) obtained with an increasing number of items in the deep Q-networks used in the solution of KP
The processing time and the total reward earned by our RL agent, which we trained with 30 items in the knapsack environment, are given in Figure 5 for packing different numbers of items into the knapsack. We collected the processing time and total reward values while increasing the number of items from 20 to 40; the process was repeated 100 times for each item count. It is observed that the processing time increases linearly as the number of items placed in the knapsack increases. We also witness an increase in the total reward earned; however, the increase appears to be limited by the knapsack's weight and volume limitations. In addition, as shown in Figure 5, the model has its limitations. The more the number of items used to fill the knapsack deviates from the number of items used during model training, the further the total reward is from the optimal value. In other words, the quality of the model decreases as the difference between the number of items to be packed and the number of items used while training the model increases.
6 CONCLUSION

The knapsack problem is a combinatorial optimization problem covering a wide variety of resource allocation problems. It is classified as an NP-hard problem, and many exact and approximate methods have been proposed for its solution. In this study, we used the deep Q-learning approach as an approximate solution, as exact solutions take a long computation time. We have shown that deep Q-learning with appropriate Q-networks provides high-quality solutions for KP. We found that the model with fully connected layers produces results of sufficient quality, but using the attention mechanism
and transformer encoder blocks for the Q-networks brings the solution closer to perfection. With these deep Q-network models, a knapsack filling process that is very close to the optimal solution can be obtained 40 times faster than an exact solution using dynamic programming. It should be noted that careful selection of parameters such as the reward value, the discount rate, and a Q-network model with optimal hyperparameters is key for the efficiency and effectiveness of the solution. Additionally, the trained agent generalizes the knapsack solution to numbers of items close to the number of items used during training. Moreover, in most real-life KPs, deciding the fate of an item, that is, whether to put it into the knapsack, is a major problem, as some of the parameters involved cannot be completely known. Therefore, instead of solutions that require all the parameters to be at our disposal, an agent trained with deep Q-networks, an off-policy RL method, emerges as a more suitable approach for real-life problems. In further research, we aim to adapt our approach to other combinatorial problems and make it a more general solution for combinatorial problems.
ORCID
Beytullah Yildiz https://orcid.org/0000-0001-7664-5145
REFERENCES
1. Désir A, Goyal V, Segev D. Assortment optimization under a random swap based distribution over permutations model. Paper presented at: Proceedings
of the 2016 ACM Conference on Economics and Computation; 2016; Maastricht, The Netherlands.
2. Kulkarni AJ, Shabir H. Solving 0–1 knapsack problem using cohort intelligence algorithm. Int J Mach Learn Cybernet. 2016;7(3):427-441.
3. Fréville A. The multidimensional 0–1 knapsack problem: an overview. Eur J Oper Res. 2004;155(1):1-21.
4. Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529-533.
5. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444.
6. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press; 2018.
7. Silver D, Huang A, Maddison CJ, et al. Mastering the game of go with deep neural networks and tree search. Nature. 2016;529(7587):484.
8. Gu S, Holly E, Lillicrap T, Levine S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. Paper presented at:
proceedings of the IEEE international conference on robotics and automation (ICRA). 2017; Singapore, Singapore.
9. Legg S, Hutter M. Universal intelligence: a definition of machine intelligence. Minds Mach. 2007;17(4):391-444.
10. Martello S, Toth P. Algorithms for knapsack problems. North-Holland Mathematics Studies. Vol 132. Amsterdam, Netherlands: Elsevier; 1987:213-257.
11. Kosuch S, Lisser A. On two-stage stochastic knapsack problems. Discrete Appl Math. 2011;159(16):1827-1841.
12. Chu PC, Beasley JE. A genetic algorithm for the multidimensional knapsack problem. J Heuristics. 1998;4(1):63-86.
13. Sahni S. Approximate algorithms for the 0/1 knapsack problem. JACM. 1975;22(1):115-124.
14. Aktas MS, Kapdan M. Implementation of analytical hierarchy process in detecting structural code clones. In: Gervasi O, Murgante B, Misra S, et al., eds.
Computational Science and its Applications – ICCSA 2017. Lecture Notes in Computer Science. Vol 10408. Cham, Switzerland: Springer; 2017.
15. Aktas MS, Kapdan M. Structural code clone detection methodology using software metrics. Int J Softw Eng Knowl Eng. 2016;26(02):307-332.
16. Gonen B, Gunduz G, Yuksel M. Automated network management and configuration using probabilistic trans-algorithmic search. Comput Netw.
2015;76:275-293.
17. Gu S, Hao T. A pointer network based deep learning algorithm for 0–1 knapsack problem. Paper presented at: Tenth International Conference on
Advanced Computational Intelligence (ICACI); 2018; Xiamen, China.
18. Denysiuk R, Gaspar-Cunha A, Delbem AC. Neuroevolution for solving multiobjective knapsack problems. Expert Syst Appl. 2019;116:65-77.
19. Bello I, Pham H, Le QV, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940; 2016.
20. Bulut V. Path planning for autonomous ground vehicles based on quintic trigonometric Bézier curve. J Brazilian Soc Mech Sci Eng. 2021;43(2):1-14.
21. Ozdemir E, Topcu AE, Ozdemir MK. A hybrid HMM model for travel path inference with sparse GPS samples. Transportation. 2018;45(1):233-246.
22. Bulut H. Multiloop transportation simplex algorithm. Optimiz Methods Softw. 2017;32(6):1206-1217.
23. Kul S, Eken S, Sayar A. Distributed and collaborative real-time vehicle detection and classification over the video streams. Int J Adv Robot Syst.
2017;14(4):1–12.
24. Nazari M, Oroojlooy A, Snyder L, Takác M. Reinforcement learning for solving the vehicle routing problem. Paper presented at: advances in neural
information processing systems; 2018; Montréal, Canada.
25. Dai H, Khalil E, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs. Paper presented at: Advances in Neural
Information Processing Systems; 2017; Long Beach, CA.
26. Parisotto E, Song F, Rae J, Pascanu R, Gulcehre C, Jayakumar S, et al. Stabilizing transformers for reinforcement learning. Paper presented at: Proceedings
of the 37th International Conference on Machine Learning; 2020; Virtual Event.
27. Watkins CJCH. Learning from Delayed Rewards. Cambridge, UK: King’s College; 1989.
28. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602; 2013.
29. Ay Karakus B, Talo M, Hallaç IR, Aydin G. Evaluating deep learning models for sentiment classification. Concurr Comput Pract Exp. 2018;30(21):e4783.
30. Yildiz B, Tezgider M. Improving word embedding quality with innovative automated approaches to hyperparameters. Concurr Comput Pract Exp.
2021:e6091.
31. Yildiz B, Tezgider M. Learning quality improved word embedding with assessment of hyperparameters. In: Schwardmann U et al., eds. Euro-Par 2019:
Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science. Vol 11997. Cham, Switzerland: Springer; 2020.
32. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. Paper presented at: 3rd International Conference on
Learning Representations; 2015; San Diego, CA.
33. Luong T, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. Paper presented at: Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing; 2015; Lisbon, Portugal.
34. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Paper presented at: advances in neural information
processing systems; 2017; Long Beach, CA.
35. Cheng J, Dong L, Lapata M. Long short-term memory-networks for machine reading. Paper presented at: Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing; 2016; Austin, TX.
36. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, et al. Show, attend and tell: neural image caption generation with visual attention. Paper presented
at: Proceedings of the 32nd International Conference on Machine Learning; 2015; Lille, France.
37. Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R. Transformer-XL: attentive language models beyond a fixed-length context. Paper presented at:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; jul, 2019; Florence, Italy.
38. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Paper presented at: Pro-
ceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; jun,
2019; Minneapolis, Minnesota.
39. Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. arXiv:2005.14165; 2020.
40. Ott M, Edunov S, Grangier D, Auli M. Scaling neural machine translation. Paper presented at: proceedings of the third conference on machine translation;
2018; Brussels, Belgium.
41. Liu Y, Lapata M. Text summarization with pretrained encoders. Paper presented at: Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP/IJCNLP); 2019; Hong Kong, China.
42. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV. XLNet: generalized autoregressive pretraining for language understanding. Paper presented
at: advances in neural information processing systems; 2019; Vancouver, Canada.
How to cite this article: Yildiz B. Reinforcement learning using fully connected, attention, and transformer models in knapsack
problem solving. Concurrency Computat Pract Exper. 2022;34(9):e6509. doi: 10.1002/cpe.6509