0% found this document useful (0 votes)

14 views

ConcurrencyandComputation-2021-Yildiz-Reinforcementlearningusingfullyconnectedattentionandtransformer

Uploaded by

gungoreceeylul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views

ConcurrencyandComputation-2021-Yildiz-Reinforcementlearningusingfullyconnectedattentionandtransformer

Uploaded by

gungoreceeylul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Received: 30 January 2021 Revised: 8 May 2021 Accepted: 21 June 2021

DOI: 10.1002/cpe.6509

SPECIAL ISSUE PAPER

Reinforcement learning using fully connected, attention, and

transformer models in knapsack problem solving

Beytullah Yildiz

Department of Software Engineering, School of

Engineering, Atilim University, Ankara, Turkey Summary
Knapsack is a combinatorial optimization problem that involves a variety of resource
Correspondence
Beytullah Yildiz, Department of Software allocation challenges. It is defined as non-deterministic polynomial time (NP) hard and
Engineering, School of Engineering, Atilim has a wide range of applications. Knapsack problem (KP) has been studied in applied
University, Ankara, Turkey.
Email: [email protected] mathematics and computer science for decades. Many algorithms that can be classi-
fied as exact or approximate solutions have been proposed. Under the category of exact
solutions, algorithms such as branch-and-bound and dynamic programming and the
approaches obtained by combining these algorithms can be classified. Due to the fact
that exact solutions require a long processing time, many approximate methods have
been introduced for knapsack solution. In this research, deep Q-learning using mod-
els containing fully connected layers, attention, and transformer as function estimators
were used to provide the solution for KP. We observed that deep Q-networks, which
continued their training by observing the reward signals provided by the knapsack
environment we developed, optimized the total reward gained over time. The results
showed that our approaches give near-optimum solutions and work about 40 times
faster than an exact algorithm using dynamic programming.

KEYWORDS
attention, combinatorial optimization problem, deep Q-learning, knapsack, reinforcement
learning, transformer

1 INTRODUCTION

We prepare a knapsack with the necessary items when we plan a hiking trip. The items can be bottled water, sandwich, compass, lantern, and so on. At
least two attributes, weight (and/or volume) and a value that determines the importance of items, help us decide which of these items to take with us.
Because the knapsack has a limited weight (and/or volume) capacity, determining how to fill with a combination of certain types of items that give the
highest overall value is a major challenge. The name of this challenge is knapsack problem (KP). A wide variety of resource allocation problems can be
considered as KP. Many applications, such as study time to be devoted to subjects while studying for an exam, transportation logistics optimization,
role-playing games, include the same problem. The solutions are being used in many practical applications such as retail and online advertising
because product diversity optimization problems arise widely. One of the most important decisions faced by a retailer is to maximize the expected
earnings by choosing a subset of items to be offered to customers among a certain number of items.1
Knapsack is a combinatorial optimization problem that has been studied in applied mathematics and computer science for decades. It is classi-
fied as an NP-hard problem and many exact and heuristic algorithms have been proposed.2,3 The need to provide faster solutions has triggered many
new approaches. Reinforcement Learning (RL) has become a choice with the potential to be an important milestone in solving KP due to its recent
successes in other areas. RL, which can be described as trial and error learning, determines its actions based on past experiences or new trials. It
uses an environment that can be defined as a function that converts an action performed in the current situation into the next state and a reward. An

Concurrency Computat Pract Exper. 2022;34:e6509. wileyonlinelibrary.com/journal/cpe © 2021 John Wiley & Sons, Ltd. 1 of 12
https://doi.org/10.1002/cpe.6509
2 of 12 YILDIZ

RL agent receives a numerical reward expressing the success of an action’s outcome from the environment and trains itself to choose actions that
maximize the reward accumulated over time. Despite its significant achievements in many areas, RL has also several challenges to be resolved. One
of these challenges is called the curse of dimensionality. Trying to explore all possible actions in all possible states can reach complexity that cannot
be overcome. Deep Q-learning, unaffected by the size of the state-action set, is a convenient method to provide a solution.
Deep Q-learning, an RL method, made a huge impact when it was introduced by DeepMind in 2013. For the first time, RL agents using convo-
lutional neural networks were able to learn from high-dimensional visual inputs.4 RL trained agents performed superhumanly in roughly half of the
Atari 2600 console games, using only screen pixels as inputs and values as rewards. Using deep neural networks5 as function estimators within an
RL system,6 deep reinforcement learning (DRL) has been shown to be effective in many areas including Atari games,4 Go game,7 dialog systems, text
generation, computer vision, and robotics.8 DRL can be considered as a step towards embodying universal artificial intelligence.9 However, modern
DRL systems have several shortcomings. First, learning speeds are slow as they inherit the need for very large datasets for training from deep learn-
ing. Second, a trained agent can perform poorly on a new task. Third, they can perform high-level tasks such as planning and reasoning only with
the patterns they can obtain from the statistical structures found in training data. Fourth, they are unlikely to form a system that creates a humanly
understandable chain of reasons for choosing the right actions.
Using exact solution methods such as branch and bound, and dynamic programming for NP-hard problems can cause burdens due to the length
of processing time. On the other hand, approximate solutions may also have drawbacks. For example, the quality of the labels is an important factor
in the success of supervised learning models. Therefore, the challenges of obtaining high-quality labeled data or relabeling them for new prob-
lems can be among the main obstacles to successful solutions. For these reasons, using a DRL method, which makes use of relatively simple reward
mechanisms and deep neural networks, in solving combinatorial optimization problems provides significant gains. In this article, we present a deep
Q-learning with deep neural network models that use fully connected layers, attention mechanism, and transformer encoder block to solve KP. To
the best of our knowledge, we do not witness any work involving our deep Q-networks approach to a KP solution. We have taken advantage of
attention mechanism and transformer encoder block proven by numerous studies in many fields such as language modeling, question answering,
text summarization, and so forth, to advance RL performance on the top of fully connected layers. In addition to this important contribution, we have
also created a knapsack simulation environment compatible with the OpenAI Gym library. In order for the proposed method to exhibit adequate
performance, studies have also been carried out to optimize the reward function and used hyperparameters.
In Section 2, related works are discussed. We provide the necessary information related to research topics in Section 3. The research
methodology is given in Section 4. Evaluation and benchmarking are explored in Section 5. We conclude with the outcomes in Section 6.

2 RELATED WORK

Knapsack is a combinatorial optimization problem that has been explored in many studies. Martello and Toth evaluated both exact and approximate
algorithms for solving KP.10 In the article, branch-and-bound algorithms, dynamic programming methods, and a combination of these two approaches
were examined. The performances of the algorithms were evaluated by randomly generated sample items. It was also noted that approximate
algorithms are mostly based on greedy approaches or scaling methods.
Most real-life KPs are not deterministic as some parameters are unknown when being solved. Kosuch and Lisser11 proposed a solution to fill a
knapsack with items whose actual weights were not yet known. They explained that the optimization problem can be solved either by a single-stage
decision that needs to be taken before the weight values are revealed or by a two-stage decision that allows correction of the previously made
decision. The authors claimed that the second solution was a more accurate method and emphasized that the item weights were determined with a
normal distribution at the beginning. In the second stage, the items were added if there was sufficient capacity but were not added to the knapsack
if there was not enough capacity.
Chu and Beasley12 proposed a heuristic method based on genetic algorithms for multidimensional KP. An intuitive operator using
problem-specific information was included in the standard genetic algorithm approach. It was stated that high-quality solutions for various problems
with the method used were reached within a reasonable time. Sahni13 presented approximate solutions for KP with polynomial time complexity and
linear storage requirement. These approximate solutions were claimed to give near-optimum answers in most cases. Kulkarni and Shabir2 applied
the cohort intelligence method inspired by individuals’ learning from each other with their natural and social inclinations, with the number of objects
varying between 4 and 75 to solve KP. It was indicated that the method applied produced satisfactory results at a reasonable computational cost.
Also, the cohort intelligence algorithm was compared with other modern methods and claimed to be better than other methods for specific prob-
lems. As with KP, optimization has a wide variety of problem areas and solutions. There are optimization techniques such as the analytical hierarchy
process to solve a multi-objective decision-making problem by relying on the judicious allocation of available resources.14,15 On the other hand,
some solutions such as the Probabilistic Trans-Algorithmic Search framework exploit several search algorithms iteratively for optimization.16
Solutions using neural networks for KP are also presented as approximate methods. Gu and Hao17 proposed a purely data-driven approach
using recurrent neural networks for the KP solution. Motivated by the successful application of pointer networks to traveling salesman problem, a
pointer network was applied to solve KP. The coefficients of each variable of the target function and the limitations in KP were given as an input to the
YILDIZ 3 of 12

pointer network. Optimal results were used as labels for supervised training of the model. Denysiuk et al.18 presented a solution for KP by applying
binary classification using artificial neural networks where the values and weights of items were used as inputs. A structure called neuroevolution,
which uses evolutionary algorithms, was applied to adjust the parameters of the neural networks because the target values were unknown.
Bello et al.19 proposed a framework for solving combinatorial optimization problems using RL and neural networks. Two approaches based on
the policy gradient method were discussed. The first approach, called RL pretraining, utilized a training set to optimize a recurrent neural network.
A stochastic policy using the expected reward as an objective was applied to this network. At the test time, the policy was fixed and the result was
obtained with a greedy method. The second approach, called active search, requires no prior training. It starts with a random policy and iteratively
optimizes the parameters of the recurrent neural network on a single test sample, using the expected reward, while watching the best solution sam-
pled during the search. A pointer network that effectively indicates a specific location was also applied in these approaches. Although the traveling
salesman problem was tried to be solved, it was expressed that the solution could be adapted to the KP problem.
As a combinatorial optimization problem, path planning has been one of the main research topics for many studies, such as land vehicles,
wheeled mobile robots, transportation, and travel path inference.20–23 Nazari et al.24 conducted a study based on an approach provided by Bello
et al. to solve the problem of vehicle routing. They claimed that the direct use of the approach was not appropriate because it was considered to
be static over time. Since the demands could change after the node visit, they proposed an alternative approach that could effectively handle both
static and dynamic situations. The policy model was made up of a recurrent neural network decoder with an attention mechanism added, and the
embeddings of static elements were used as input to the decoder of the recurrent neural network. The outputs of the recurrent neural network and
the embeddings of dynamic elements were fed into the attention mechanism.
Dai et al.25 proposed a method to solve graph problems by using a graphic embedding structure and a deep Q-learning algorithm. The greedy
policy being learned acted like a meta-algorithm that gradually built a solution. The action was determined by the output of a graphic embedding
network over the current state of the solution. It was expressed that the proposed solution could be applied to various optimization problems such as
maximum cutting and traveling salesman problems. However, it could not provide a general solution to the traveling salesman problem, as a particular
node could be visited multiple times. Parisotto et al.26 proposed architecture with new elements and arrangements for RL training. Although the
structure was similar to the transformer structure, it had differences in terms of reordering layer normalization modules and using a gating layer
instead of standard residual connections. They noted that with these improvements, RL training could be done more consistently.
When we look at studies using RL, they generally aim to provide a solution to combinatorial optimization problems other than KP. The only work
that mentions KP is that of Bello et al.,19 which focused more on the traveling salesman problem. Only a small section of this article was devoted to
KP to confirm that their solution could be generalized for combinatorial optimization problems. In this study, we focus on KP by introducing deep Q
learning with deep neural network models that use fully connected layers, attention mechanism, and transformer encoder block. To the best of our
knowledge, we have not seen a comprehensive study corresponding to our approach to a KP solution.

3 BACKGROUND

In this section, we will briefly explain KP, RL, deep Q-networks, attention mechanism, and transformer.

3.1 Knapsack problem

KP aims to maximize the sum of the values when a set of n items, each with a weight wi and value pi , are put into a knapsack with a maximum weight
capacity (W). KP can be defined as:

∑
max pi (1)
i⊆{1, … ,n}

with subject to:

∑
wi ≤ W (2)
i⊆{1, … ,n}20

where wi and pi are non-negative integer variables. In addition to weight, knapsack can be subject to volume capacity:

∑
vi ≤ V (3)
i⊆{1, … ,n}

The maximum volume capacity of the knapsack is denoted as V , and the volume of each item added to the bag is indicated by vi .
4 of 12 YILDIZ

KP is defined as NP-hard. A simple yet powerful intuitive solution can be achieved by selecting items with the best value/weight ratio to put in
the knapsack until weight capacity is reached. However, this solution may not always be optimal. Moreover, although better results can be obtained
with branch-and-bound algorithm or dynamic programming methods, the computational cost becomes a high burden for a larger number of items.

3.2 Reinforcement learning

RL is a method that learns the state of an agent in a given environment and tries to maximize reward through actions. While learning, RL makes
mistakes during many trials, corrects them gradually, and finds the desired answer over time. Let us consider an RL agent that interacts with an
environment. Environment (E) is defined by a set of states (S), a set of actions (A), a reward function r ∶ S × A → R. Policy is a mapping from states to
actions: 𝜋 ∶ S → A. Each episode starts with an initial state. At each step the agent generates an action based on the current state: at = 𝜋(st ). Later,
the environment returns a new state and a reward rt = R(st , at ). Discount factor γ ∈ [0,1] is a concept used to realize that current earnings are more
∑
valuable than future earnings. The discounted amount of future rewards is called a return: Rt = γi−t rt . The agent’s goal is to maximize the expected
i=t
reward Es0 [R0 |s0 ]. Q-function is used during these operations Q𝜋 (st , at ) = E [Rt st , at ].
If we define the optimal course of actions as 𝜋 ∗ , the optimal state must be maintained for every s ∈ S, a ∈ A, and any course of actions π. In other
words, all optimal courses of actions must satisfy the optimal Q-function. The Bellman equation fulfills these requirements:

Q∗ (s, a) = Es′ ∼p(.∣s,a) [r(s, a) + γmaxa′∈A Q∗ (s′ , a′ )] (4)

Thus, RL is the process of observing the rewards achieved by an agent running sequences of state-action pairs and adapting the Q-function
according to the rewards until the agent accurately predicts the best path to take. One of the challenges that arise in the RL process is the choice
between exploration and exploitation. To maximize the total reward, an RL agent should choose actions that have been tried in the past and have
been found to be effective. However, in order to discover such actions, it must also try new actions that it had not previously chosen. In short, an
agent not only benefits from what it has already experienced but also seeks new actions that it has not encountered before to maximize the overall
reward. Therefore, various actions should be tried and those that look best should be promoted gradually.

3.3 Deep Q-learning

Q-learning uses a Q (s, a) memory table in which Q values are stored for all possible combinations of states and actions. It is like making a list of the
best moves if you are a chess player. During the game, we make a move by choosing an action according to our own and the opponent’s situation.
After each move, the positions formed, namely our place on the chessboard and the reward we receive, provide learning. Knowing the outcome
of a single move at each step and seeing the reward gives a one-step view advantage forward. In short, the value of the action we will take in the
current situation Q (s, a) can be estimated by the value in the next state and the reward R + Q (s′ , a′ ). As we continue our actions, the average value
of Q continues to be updated in the memory table. The values get better and better and eventually, the Q values get closer to the optimal values. We
use the information in the memory table for a new move and determine the best action we can take in our position. Q-learning,27 one of the most
important inventions for RL, is stated as follows:

Q(St , At ) ← Q𝜋 (St , At ) + 𝛼 [Rt+1 + γmaxa Q(St+1 , a) − Q(St , At )] (5)

where 𝛼 is the learning rate and 𝛾 is the discount rate.

In Q-learning, the learned action-value function Q can predict the optimal value q∗ regardless of the policy followed. In this way, performing the
algorithm is significantly simplified and early convergence can be achieved. Although it is an off-policy method, the policy has the effect of determin-
ing which state-action pairs will occur and be updated. For the convergence to the optimal value, all that is needed is to keep updating state-action
pairs.
The use of a deep neural network approximating the memory table offers an advantageous solution, as the size of the memory table becomes
very large when there are too many states and actions. Therefore, deep Q-Learning, which is an off-policy method for discrete action space, uses
deep neural networks as an estimator instead of using a memory table.28 Neural networks have been successfully used in other machine learn-
ing approaches to solve many problems such as forecasting, sentiment analysis, text classification, image processing.29–31 Figure 1 illustrates the
general structure of deep Q-learning. The neural networks are basically used to converge the optimal state-action value Q∗ . A greedy policy
𝜋Q (s) = argmaxa∈A Q(s, a) is preferred so that the estimation of Q is achieved by using a random action with probability 𝜀 or 𝜋Q (s) with probability
1 − 𝜀. Training can contain many episodes. In each episode, information about the action, the reward, and the next state (sk , ak , rk , sk+1 ) obtained from
the environment is recorded in an experience reply memory. The network is trained by gradient descent using the loss function with the batch data
YILDIZ 5 of 12

FIGURE 1 General structure of deep Q-learning

from this experience reply memory. To make the optimization procedure more stable, two identical networks are used, the main network and the
target network. The target network is usually updated at a slower rate than the main network. The general practice is to update the weights of the
target networks periodically with the current weights of the main networks. This process is explained in Algorithm 1.

Algorithm 1. Deep Q-learning

Output: Optimal policy (𝜋 )

Initialize the experience reply memory (M)
Initialize the main networks (Q)
̂)
Initialize the target networks (Q
foreach episode do
Retrieve initial state from the environment
foreach episode step do
Chose an action ak ∈ A with ε probability,
Otherwise, chose action ak = argmaxa∈A Q(s, a) with (1 − ε) probability
Observe reward rk and the next state sk+1 from environment
Store (sk , ak , rk , sk+1 ) to M.
Generate{an N size batch data from M
rn episode ends
labeln = ̂ (sn+1 , a′ ) episode continues
rn + γmaxa′ Q
Train the main networks using the gradient descent with loss function L= E( Q(sn , an ) − labeln )2
̂ =Q
After every C steps, Q
end for
end for

3.4 Attention

An attention mechanism works for a given input in a similar way to human behavior. We focus on the word we are currently reading, but at the same
time, our mind treats the important words of the text more carefully to create context. The human visual similarly allows us to look at a certain
area we focus on as high resolution and the surrounding parts as low resolution and make inferences accordingly. Likewise, an attention mechanism
focuses on input and decides which parts of the input are important at each step. It can be described by a general definition as mapping a query and
a set of key-value pairs to an output. To perform the mapping function, it processes a set of query vectors q0 , … .qm , key vectors k0 , … .kn , and value
vectors v0 , … .vn to create a context vector per query.
An attention mechanism called additive attention has been introduced by Bahdanau et al.32 for an encoder-decoder model. The context vector
for the attention mechanism is created by summing the input sequence hidden states weighted by the alignment score. This score is estimated by
an alignment model that measures how well the pair of input at position i and output at position t match. For the alignment score, a feed-forward
network with a single multilayer perceptron is used and trained with other parts of the model. The function of the alignment score using the nonlinear
tanh activation function is given as:
6 of 12 YILDIZ

score(qt , ki ) = WsT tanh(Wq qt + Wk ki ) (6)

where Ws , Wq , and Wk are weight matrices to be learned in the alignment model. Here, dynamic weights that represent the relative importance of the
inputs in a sequence (keys) for a particular output (query) are trained. For normalization, softmax is applied to scale the weight values. Multiplying
these weights by the input sequence (values) generates a weighted sequence:

attention(Q, K, V) = softmax (score(qt , ki ))V (7)

Global and local attention mechanisms with a slightly different structure were proposed by Luong et al.33 These two categories of attention
mechanisms vary depending on whether the attention is placed on all positions or several positions in the input sequence. The global attention
model is computationally challenging when it takes into account all the words of the long source sequence to predict target values. To overcome this
problem, the local attention mechanism prefers to handle only a small subset of input locations per target as opposed to global attention using the
entire input sequence. Despite their difference, most of the processing steps of these mechanisms are the same except that they diverge on how
the context vectors are created. Three alignment scoring functions are provided: dot, general, and concat. These scoring functions take the encoder
outputs and the decoder hidden state generated in the previous step to compute the alignment scores. The dot scoring function is the most basic
of the functions and takes the outputs of the encoder and multiply them by the hidden state of the decoder: score(qt , ki ) = QK T . The general scoring
function calculates the alignment scores by extending the dot scoring function with a weighting matrix: score(qt , ki ) = WQK T . The concat function is
similar to the way alignment scores are computed in the attention mechanism provided by Bahdanau et al., except that the addition of the decoder
hidden state and the encoder hidden state is multiplied by the weight matrix: score(qt , ki ) = vT tanh(W(qi + kj )). Attention is similarly calculated by
Equation (7).
Scaled dot product, a new attention mechanism used in the Transformer model, was introduced by Vaswani et al.34 It resembles the dot product
attention mechanism provided by Luong et al. It adds a scaling factor to have effective learning; the dot products of queries with the keys are divided
by the square root of the dimension of queries and keys. A softmax function is applied to the calculated values to obtain the weights:
( )
QK T
Attention(Q, K, V) = softmax √ V (8)
dk

Attention has become very influential in the Artificial Intelligence community as a core component of neural architectures for numerous appli-
cations in Natural Language Processing and Computer Vision. During this quest, various types of attention mechanisms were also introduced in
addition to the attention mechanisms mentioned above. Cheng et al.35 used intra-attention, also called self-attention, to perform machine reading.
Xu et al.36 applied hard or soft attention mechanisms, reaching the entire image or only a portion of the image, for image captioning. To extract the
features, images were first encoded by convolutional layers, then an long short-term memory (LSTM)-based decoder was applied, where attentive
weights are generated by attention mechanisms, to create individual descriptive words.

3.5 Transformer

Transformer was introduced Vaswani et al.34 in the famous article “All You Need is Attention.” It has an encoder–decoder architecture. The encoder
takes a variable-length sequence as input and transforms it into a contextual fixed size encoding sequence. The encoder of the transformer consists
of six identical layers, each containing two sub-layers. The first layer is a multi-head self-attention mechanism and the second layer is a position-
wise fully connected feed-forward network. A residual link followed by a normalizing layer is employed around each of the sub-layers. The decoder
computes the conditional probability distribution of a target sequence given the encoded representation from the encoder. The decoder architec-
ture is almost the same as the encoder but differs in that the decoder includes two multi-head attention sub-layers. The first multi-head attention
is masked to ensure that the model does not consider paddings as input.
The main component in the transformer is the multi-head self-attention mechanism given in Equation (9). The self-attention structure esti-
mates a portion of a data sample using other parts of the same data sample. The scaled dot product attention described in the Attention
section is used to compute the context vector. The multi-head mechanism runs the scaled dot-product attention multiple times in parallel on
split input to take advantage of parallelism for fast computing. After each head has been processed, individual attention outputs are simply
concatenated.

MultiHeadAttention(Q, K, V) = concat(head1 , … ., headn ) (9)

where head = Attention(QW q , KW k , VW v ).

YILDIZ 7 of 12

Transformer architecture uses positional encoding. The position is important for processing sequential data. For example, the position and
order of words are key parts of languages. Recurrent Neural Networks naturally consider word order when processing a text. However, Trans-
former removes the recurrence mechanism in favor of the multi-headed self-attention to speed up training time and capture longer dependencies.
Therefore, positional encoding is used to give a sense of position to the model by adding a piece of information about the location.
The Transformer architecture has had breakthrough success in a wide variety of domains: language modeling,37–39 neural machine translation,40
text summarization,41 and question answering.42 It should be noted that the repeated success of transformer architecture in many areas makes it
an ideal candidate for RL problems as well, especially where episodes can span too many steps and critical observations for any decision often cover
the entire section.

4 METHODOLOGY

The proposed architecture that we used for RL to train the models in the developed knapsack environment is shown in Figure 2. There are two
cycles in the given architecture. The first of these is the cycle where we repeatedly do the packing of our knapsack using the simulation environment
that we call the action cycle. In this cycle, the sample inputs required for the Q-networks training are collected. On the other hand, the learning
cycle extracts statistics and determines patterns from the data we obtain while the action cycle carries on. This cycle continues for a predetermined
number of steps. Ultimately, the knowledge of the optimal knapsack packing process is learned by the Q-network.
We used three models as function approximations; a model with fully connected layers, a model with attention, and a model with transformer
encoder block. As the first model, fully connected three layers of neural networks containing 128, 196, and 128 nodes were utilized to record the pol-
icy that progressively improved during deep Q-learning. We determined the initial weights of the neural network with the GlorotUniform initializer.
For the second model, we used a dot-product attention mechanism proposed by Luong et al. In the dot-product attention, the score value is basi-
cally the product of the query and key values. After applying the softmax function, the score value is multiplied by values. Since this is a self-attention,
value, query, and key values contain the same values. Two fully connected layers containing 96 and 64 nodes were added after the attention. Finally,
two transformer encoder blocks were used as the third model. Since we already had numerical values as inputs, the embeddings that usually map a
given input into an n-dimensional space were removed. Instead, 9-dimensional input was directly transformed into an n-dimensional space. A single
normalizer is utilized for the input instead of using two normalizers as in the paper of Vaswani et al. Two fully connected layers containing 128 and
96 nodes were used after the transformer encoder blocks. In all models, the softmax activation function in the output layers was discarded as the
outputs were real values, not probabilities. In its place, a linear activation function was used.
Adam optimizer and the Huber loss function with a delta value of 2.0 were applied for the optimization. The Huber loss function shows behavior in
the middle of Mean Absolute Error and Mean Square Error, depending on the delta parameter. It is an effective loss function in data with high variability
and a few outliers. Therefore, it exhibits an appropriate loss function behavior for RL. The action space includes two actions that represent putting

FIGURE 2 Deep Q-learning model used for KP

8 of 12 YILDIZ

an item in the knapsack and not putting an item in the knapsack. We reduced the epsilon value used to balance the exploration and exploitation with
a 0.999 reduction rate, starting from 1 to the minimum value of 0.001. We chose 50,000 as experience reply memory size. The value of the future
reward was updated with a discount factor of 0.99. The reason for applying the discount factor was the fact that the reward value of the current
action was preferred over the value of the future action.

5 EVALUATION

5.1 Experimental setup

We developed a knapsack environment compatible with the Gym library provided by the OpenAI working group. The gym environment offers a
standard interface for RL with functions such as reset, step, and render. A total capacity limit of 1000 for the weight and a total capacity limit of
1000 for the volume were applied. We assigned random numbers between 0 and 100 for each item’s volume, weight, and value. During the training
process, we used 30 items for each knapsack filling process. State, reward, done and info parameters were gathered at the initial stage of an episode
and each subsequent step. For a state, nine parameters including weight capacity, volume capacity, bag weight, bag volume, bag value, item weight,
item volume, item value, and the number of remaining steps were used. As a reward, the item value was chosen if placed in the knapsack and a
negative value if not placed in the knapsack. With this simple reward function, we ensured that the items were placed in the knapsack in the most
appropriate way and the preference of not putting them always in the knapsack was prevented.

5.2 Training

We run our deep Q-learning model and knapsack simulation environment in the Google Colab environment supported by a graphics card accel-
erator. For training, the simulation environment must first be initialized with the reset command, which returns an initial state in the environment.
Progress is made with a step function that takes an action as a parameter. Initially, the epsilon value is determined as 1. The probability of choos-
ing a random action is higher where this value is close to 1. In other words, the learning is in the exploration phase. We collect, in this phase,
reward and observation pairs from the environment by trying new actions in the knapsack environment. As the epsilon value approaches 0, we
start to benefit more from our past experiences. This phase is called exploitation. To put it more clearly, we begin to determine the best action
to be taken in our current state by looking at our past experiences. This action is determined through the neural networks that we are train-
ing by using RL. After determining an action either by randomly or looking at our past experiences, we take a step in the knapsack environment.
After each action, the values of our knapsack, such as total weight, total volume, the total value of the items, and capacities of the knapsack are
changed. Hence, the knapsack environment informs us about the changes such as the new state we are in, the reward value we have earned,
and whether the knapsack filling process is over. The received information is added to the experience reply memory after we update the total
reward value being tried to maximize. Each experience reply memory entry contains a bunch of information such as current state, action, reward,
next state, done.
We started training our deep neural networks when a sufficient number of sample entries were stored in the experience reply memory. The
reason we call deep neural networks is that we use twin neural networks. One is the main network where the training is carried out and the other is
the network we use to make the training more stable. We also used this second network to calculate our future Q̂ value. For training, we created a
randomly selected data set of 512 from the experience reply memory each time. We copied the weights of the actual network trained to the target
network after every 100 knapsack filling operations. In this way, we managed to create more stable learning without being too affected by instant
changes during the training. The guiding factor for the correct training is the reward values given as feedback from the knapsack environment.
Different functions and values can be used as the reward value, but the simple and accurate guiding reward function offers the most convenient and
faster solution.

5.3 Results

We conducted experiments for the deep Q-networks using the model having fully connected layers, the model containing the attention layer, and
the model including the transformer’s encoder block. Although the weight, volume, and value of the items are randomly assigned, the order of items
is set the same for different models during training so that the comparison can be made correctly. In other words, the models received the same
sequence of items. The total reward values earned for each knapsack episode were collected during training. As seen in Figure 3, as the number of
episodes increases, the total reward earned also increases. Initially, the total reward obtained by using the randomly chosen actions appears to be
below 400. It is reasonable that the total reward initially appears at these levels because the fact that the decision to put items in the knapsack is
YILDIZ 9 of 12

FIGURE 3 Changing total reward values while training the deep Q-learning models used in solving KP

TA B L E 1 The ratio of total reward to the optimum value of the knapsack solution

Approximator model Ratios to optimality (%) Standard deviation (%)

Dense 86.18 4.20

Attention 88.84 6.19

Transformer 92.71 3.15

made with a random function using a uniform distribution and the reward value earned from an item placed in the knapsack is equal to the value of
that item between 0 and 100, and the value of −20 for the item not packed.
As the learning progresses, the Q-networks continue their learning with a randomly selected batch of datasets from the experience reply mem-
ory where the state-action-reward information is stored. In the process, deep Q-learning focuses on maximizing the total reward and ensures that
the pattern required for this objective is learned by the Q-networks. In this experiment, the optimal total reward was around 1200 and each knap-
sack episode contained 30 items to pack. When reaching 3000 episodes, it is observed that the average total reward is approaching the optimal
knapsack values measured using dynamic programming as a baseline. Q-network with fully connected layers performs well. However, we observe
superior total rewards when the attention mechanism is introduced. The model that uses the attention layer performs slightly better. We get the best
output when using the transformer encoder blocks. At the same time, this model enables the total reward earned to be maximized earlier. Table 1
gives the ratio of the total reward for each model to the optimal value gathered from the dynamic programming application. After training each RL
model, we collected the total rewards for 100 backpack filling processes using RL models and dynamic programming. The ratios of the total rewards
and their standard deviations were calculated with these values. The ratios, especially for the transformer model, are reasonably good enough for
a KP solution with significant processing acceleration.
The RL agent being trained brings the total reward earned by the knapsack in each episode closer to the optimal value while significantly
shortening the processing time compared to exact solutions. As seen in Figure 4, our trained agent can obtain results that are very close to the
optimal value with approximately 40 times faster. In this performance test, 100 different knapsack filling processes were carried out and the pro-
cessing times were averaged. The experimental results show that knapsack filling processes are completed in less than a second using RL models
while dynamic programming takes over 32 s. Although models using fully connected layers and attention are faster, their accuracy is not better
than the model using transformer encoder blocks. Since the difference in processing time is very small, the decision to choose a model leans
towards the one using the transformer encoder blocks due to the trade-off between time and performance. Between the dynamic model and the
RL models, the speed difference will become larger as the number of items added to the knapsack increases. Because the speed of filling the knapsack
of the RL agents increases linearly, as the problem is NP-hard, the filling process of an exact solution will increase at a non-polynomial speed.
10 of 12 YILDIZ

F I G U R E 4 Comparison of execution times for the dynamic programming, which is an exact solution, and the deep Q-learning models, which
are approximate methods

F I G U R E 5 Average total reward and average processing time (seconds) obtained with the increasing number of items in the deep
Q-Networks used in the solution of KP

The processing speed and the total reward earned by our RL agent, which we trained for 30 items using the knapsack environment, are given
in Figure 5 while packing different numbers of items into the knapsack. We collected the processing time and total reward values by increasing
the number of items from 20 to 40. The process was repeated 100 times for each item number. It is observed that the processing speed increases
linearly as the number of items placed in the knapsack increases. On the other hand, we witness an increase in the total reward earned. However,
the reward increase appears to be limited due to the knapsack’s weight and volume limitations. In addition, as shown in Figure 5, the model has its
limitation. The more the number of items used to fill the knapsack deviates from the number of items used during model training, the further the
total value of the reward is from the optimal value. In other words, the quality of the model decreases as the difference between the number of
items for knapsack filling and the number of items of the model being trained increases.

6 CONCLUSION AND FUTURE WORK

The knapsack is a combinatorial optimization problem involving a wide variety of resource allocation problems. It is classified as an NP-hard problem,
and many exact and approximate methods have been proposed for its solution. In this study, we used the deep Q-learning approach as an approximate
solution, as exact solutions take a long computation time. We have shown that deep Q-learning with appropriate Q-networks provides high-quality
solutions for KP. We found that the model with fully connected layers produces results of sufficient quality, but using the attention mechanism
YILDIZ 11 of 12

and transformer encoder blocks for Q-networks brings the solution closer to perfection. With these deep Q-network models, the knapsack filling
process that is very close to the optimal solution can be obtained 40 times faster than an exact solution using dynamic programming. It should be
noted that careful selection of parameters such as reward value, discount rate, and Q-network model with optimal hyperparameters is key for the
efficiency and effectiveness of the solution. Additionally, the trained agent generalizes the knapsack solution for the number of items close to the
number of items used during training. Moreover, in most real-life KPs, deciding for an item faith to put into a knapsack is a major problem, as some
of the parameters involved cannot be completely known. Therefore, instead of the solutions used with all the parameters at our disposal, an agent
trained with deep Q-networks, an off-policy RL, comes out as a more suitable method for real-life problems. In further research, we aim to adapt
our approach to other combinatorial problems and make it a more general solution for combinatorial problems.

DATA AVAILABILITY STATEMENT

Data available on request from the authors

ORCID
Beytullah Yildiz https://orcid.org/0000-0001-7664-5145

REFERENCES
1. Désir A, Goyal V, Segev D. Assortment optimization under a random swap based distribution over permutations model. Paper presented at: Proceedings
of the 2016 ACM Conference on Economics and Computation; 2016; Maastricht, The Netherlands.
2. Kulkarni AJ, Shabir H. Solving 0–1 knapsack problem using cohort intelligence algorithm. Int J Mach Learn Cybernet. 2016;7(3):427-441.
3. Fréville A. The multidimensional 0–1 knapsack problem: an overview. Eur J Oper Res. 2004;155(1):1-21.
4. Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529-533.
5. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444.
6. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press; 2018.
7. Silver D, Huang A, Maddison CJ, et al. Mastering the game of go with deep neural networks and tree search. Nature. 2016;529(7587):484.
8. Gu S, Holly E, Lillicrap T, Levine S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. Paper presented at:
proceedings of the IEEE international conference on robotics and automation (ICRA). 2017; Singapore, Singapore.
9. Legg S, Hutter M. Universal intelligence: a definition of machine intelligence. Minds Mach. 2007;17(4):391-444.
10. Martello S, Toth P. Algorithms for knapsack problems. North-Holland Mathematics Studies. Vol 132. Amsterdam, Netherlands: Elsevier; 1987:213-257.
11. Kosuch S, Lisser A. On two-stage stochastic knapsack problems. Discrete Appl Math. 2011;159(16):1827-1841.
12. Chu PC, Beasley JE. A genetic algorithm for the multidimensional knapsack problem. J Heuristics. 1998;4(1):63-86.
13. Sahni S. Approximate algorithms for the 0/1 knapsack problem. JACM. 1975;22(1):115-124.
14. Aktas MS, Kapdan M. Implementation of analytical hierarchy process in detecting structural code clones. In: Gervasi O, Murgante B, Misra S, et al., eds.
Computational Science and its Applications – ICCSA 2017. Lecture Notes in Computer Science. Vol 10408. Cham, Switzerland: Springer; 2017.
15. Aktas MS, Kapdan M. Structural code clone detection methodology using software metrics. Int J Softw Eng Knowl Eng. 2016;26(02):307-332.
16. Gonen B, Gunduz G, Yuksel M. Automated network management and configuration using probabilistic trans-algorithmic search. Comput Netw.
2015;76:275-293.
17. Gu S, Hao T. A pointer network based deep learning algorithm for 0–1 knapsack problem. Paper presented at: Tenth International Conference on
Advanced Computational Intelligence (ICACI); 2018; Xiamen, China.
18. Denysiuk R, Gaspar-Cunha A, Delbem AC. Neuroevolution for solving multiobjective knapsack problems. Expert Syst Appl. 2019;116:65-77.
19. Bello I, Pham H, Le QV, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940; 2016.
20. Bulut V. Path planning for autonomous ground vehicles based on quintic trigonometric Bézier curve. J Brazilian Soc Mech Sci Eng. 2021;43(2):1-14.
21. Ozdemir E, Topcu AE, Ozdemir MK. A hybrid HMM model for travel path inference with sparse GPS samples. Transportation. 2018;45(1):233-246.
22. Bulut H. Multiloop transportation simplex algorithm. Optimiz Methods Softw. 2017;32(6):1206-1217.
23. Kul S, Eken S, Sayar A. Distributed and collaborative real-time vehicle detection and classification over the video streams. Int J Adv Robot Syst.
2017;14(4):1–12.
24. Nazari M, Oroojlooy A, Snyder L, Takác M. Reinforcement learning for solving the vehicle routing problem. Paper presented at: advances in neural
information processing systems; 2018; Montréal, Canada.
25. Dai H, Khalil E, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs. Paper presented at: Advances in Neural
Information Processing Systems; 2017; Long Beach, CA.
26. Parisotto E, Song F, Rae J, Pascanu R, Gulcehre C, Jayakumar S, et al. Stabilizing transformers for reinforcement learning. Paper presented at: Proceedings
of the 37th International Conference on Machine Learning; 2020; Virtual Event.
27. Watkins CJCH. Learning from Delayed Rewards. Cambridge, UK: King’s College; 1989.
28. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602; 2013.
29. Ay Karakus B, Talo M, Hallaç IR, Aydin G. Evaluating deep learning models for sentiment classification. Concurr Comput Pract Exp. 2018;30(21):e4783.
30. Yildiz B, Tezgider M. Improving word embedding quality with innovative automated approaches to hyperparameters. Concurr Comput Pract Exp.
2021:e6091.
31. Yildiz B, Tezgider M. Learning quality improved word embedding with assessment of hyperparameters. In: Schwardmann U et al., eds. Euro-Par 2019:
Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science. Vol 11997. Cham, Switzerland: Springer; 2020.
32. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. Paper presented at: 3rd International Conference on
Learning Representations; 2015; San Diego, CA.
12 of 12 YILDIZ

33. Luong T, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. Paper presented at: Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing; 2015; Lisbon, Portugal.
34. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Paper presented at: advances in neural information
processing systems; 2017; Long Beach, CA.
35. Cheng J, Dong L, Lapata M. Long short-term memory-networks for machine Reading. Paper presented at: Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing; 2016; Austin, TX.
36. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, et al. Show, attend and tell: neural image caption generation with visual attention. Paper presented
at: Proceedings of the 32nd International Conference on Machine Learning; 2015; Lille, France.
37. Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R. Transformer-XL: attentive language models beyond a fixed-length context. Paper presented at:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; jul, 2019; Florence, Italy.
38. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Paper presented at: Pro-
ceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; jun,
2019; Minneapolis, Minnesota.
39. Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. arXiv:2005.14165; 2020.
40. Ott M, Edunov S, Grangier D, Auli M. Scaling neural machine translation. Paper presented at: proceedings of the third conference on machine translation;
2018; Brussels, Belgium.
41. Liu Y, Lapata M. Text summarization with Pretrained encoders. Paper presented at: Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP/IJCNLP); 2019; Hong Kong, China.
42. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV. XLNet: generalized autoregressive Pretraining for language understanding. Paper presented
at: advances in neural information processing systems; 2019; Vancouver, Canada.

How to cite this article: Yildiz B. Reinforcement learning using fully connected, attention, and transformer models in knapsack
problem solving. Concurrency Computat Pract Exper. 2022;34(9):e6509. doi: 10.1002/cpe.6509

Finite Elements and Approximation
From Everand
Finite Elements and Approximation
O. C. Zienkiewicz
4.5/5 (4)
Engineering Utilities Module 3
No ratings yet
Engineering Utilities Module 3
85 pages
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data
From Everand
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data
EMC Education Services
No ratings yet
Pepsico'S Operations Management, 10 Decisions, Productivity
No ratings yet
Pepsico'S Operations Management, 10 Decisions, Productivity
2 pages
applsci-12-03068-v2
No ratings yet
applsci-12-03068-v2
18 pages
refaei-afshar20a
No ratings yet
refaei-afshar20a
16 pages
neu_bz60xp74g
No ratings yet
neu_bz60xp74g
60 pages
AI for Everyone: An Intermediate Guide to Artificial Intelligence
From Everand
AI for Everyone: An Intermediate Guide to Artificial Intelligence
Nova Clarke
No ratings yet
P307-315 Knapsack
No ratings yet
P307-315 Knapsack
9 pages
s10898-024-01364-6
No ratings yet
s10898-024-01364-6
31 pages
Improved Approximation Results For Stochastic Knapsack Problems
No ratings yet
Improved Approximation Results For Stochastic Knapsack Problems
19 pages
IJAEMR_3931 (1)
No ratings yet
IJAEMR_3931 (1)
11 pages
Astudyreportonsolving0 1knapsackwithimprecise
No ratings yet
Astudyreportonsolving0 1knapsackwithimprecise
6 pages
Machine Learning-Driven Optimization for Solution Space Reduction in the Quadratic Multiple Knapsack Problem
No ratings yet
Machine Learning-Driven Optimization for Solution Space Reduction in the Quadratic Multiple Knapsack Problem
15 pages
A_reinforcement_learning_approach_to_stochastic_bu
No ratings yet
A_reinforcement_learning_approach_to_stochastic_bu
14 pages
Reinforcement Learning: A Practical Guide to Algorithms
From Everand
Reinforcement Learning: A Practical Guide to Algorithms
Trilokesh Khatri
No ratings yet
Knapsack Problem
No ratings yet
Knapsack Problem
5 pages
2014 QKP Heuristic
No ratings yet
2014 QKP Heuristic
12 pages
Teaching and Learning in STEM With Computation, Modeling, and Simulation Practices: A Guide for Practitioners and Researchers
From Everand
Teaching and Learning in STEM With Computation, Modeling, and Simulation Practices: A Guide for Practitioners and Researchers
Alejandra J. Magana
No ratings yet
Knapsack Problem
No ratings yet
Knapsack Problem
2 pages
The 0-1 Knapsack Problem: An Introductory Survey
No ratings yet
The 0-1 Knapsack Problem: An Introductory Survey
13 pages
Reinforcement Learning For Combinatorial Optimization: A Survey
No ratings yet
Reinforcement Learning For Combinatorial Optimization: A Survey
24 pages
List of Knapsack Problems: 2 Multiple Constraints
No ratings yet
List of Knapsack Problems: 2 Multiple Constraints
3 pages
AI To Solve TSP
No ratings yet
AI To Solve TSP
15 pages
Knapsack Problems: I. History
No ratings yet
Knapsack Problems: I. History
7 pages
RL PDF
No ratings yet
RL PDF
4 pages
Algorithms For Reinforcement Learning Csaba Szepesvari instant download
No ratings yet
Algorithms For Reinforcement Learning Csaba Szepesvari instant download
36 pages
2008.06319v2
No ratings yet
2008.06319v2
29 pages
Python Machine Learning
From Everand
Python Machine Learning
Sebastian Raschka
4/5 (18)
Application of Black Hole Algorithm for Solving Knapsack
No ratings yet
Application of Black Hole Algorithm for Solving Knapsack
6 pages
RL For CO Survey
No ratings yet
RL For CO Survey
15 pages
3.RL Unit 3
No ratings yet
3.RL Unit 3
31 pages
Algorithms For Reinforcement Learning - Szepesvari
No ratings yet
Algorithms For Reinforcement Learning - Szepesvari
98 pages
OR-Gym: A Reinforcement Learning Library For Operations Research Problems
No ratings yet
OR-Gym: A Reinforcement Learning Library For Operations Research Problems
28 pages
DAA Assign 3
No ratings yet
DAA Assign 3
7 pages
Elementos Basicos Aprendizaje Por Refuerzo
No ratings yet
Elementos Basicos Aprendizaje Por Refuerzo
52 pages
Knapsack Problem
No ratings yet
Knapsack Problem
10 pages
Week 11 - Dynamic Programming KnapSack
No ratings yet
Week 11 - Dynamic Programming KnapSack
10 pages
Algorithms For Reinforced Learning
No ratings yet
Algorithms For Reinforced Learning
98 pages
Dynamic Programming (Dr. Ashraf Iqbal) (WWW - myUET.net - TC)
No ratings yet
Dynamic Programming (Dr. Ashraf Iqbal) (WWW - myUET.net - TC)
154 pages
Solutions Manual to accompany An Introduction to Numerical Methods and Analysis
From Everand
Solutions Manual to accompany An Introduction to Numerical Methods and Analysis
James F. Epperson
5/5 (1)
1 s2.0 S030505482100201X Main
No ratings yet
1 s2.0 S030505482100201X Main
14 pages
Ashwin Rao, Tikhon Jelvis - Foundations of Reinforcement Learning With Applications in Finance-CRC Press_Chapman & Hall (2022)
No ratings yet
Ashwin Rao, Tikhon Jelvis - Foundations of Reinforcement Learning With Applications in Finance-CRC Press_Chapman & Hall (2022)
522 pages
Paper Fiuri
No ratings yet
Paper Fiuri
17 pages
Four Room Pages 1
No ratings yet
Four Room Pages 1
8 pages
RLAlgs in MDPs
No ratings yet
RLAlgs in MDPs
98 pages
Image Classification: Step-by-step Classifying Images with Python and Techniques of Computer Vision and Machine Learning
From Everand
Image Classification: Step-by-step Classifying Images with Python and Techniques of Computer Vision and Machine Learning
Mark Magic
No ratings yet
A Novel Approach in Solving 0/1 Knapsack Problem Using Neural Selection Principle
No ratings yet
A Novel Approach in Solving 0/1 Knapsack Problem Using Neural Selection Principle
5 pages
Lecture 21ppt
No ratings yet
Lecture 21ppt
17 pages
Foundations Of Reinforcement Learning With Applications In Finance Ashwin Rao instant download
No ratings yet
Foundations Of Reinforcement Learning With Applications In Finance Ashwin Rao instant download
77 pages
(Ebook) Graph-related Optimization and Decision Theory by Saoussen Krichen, Jouhaina Chaouachi ISBN 9781848217430, 1848217439 download
100% (1)
(Ebook) Graph-related Optimization and Decision Theory by Saoussen Krichen, Jouhaina Chaouachi ISBN 9781848217430, 1848217439 download
52 pages
code-cho-bạn-Drake-Ak-Lý-thuyết-đã chuyển đổi
No ratings yet
code-cho-bạn-Drake-Ak-Lý-thuyết-đã chuyển đổi
17 pages
Lec37 Dynamic Programming
No ratings yet
Lec37 Dynamic Programming
23 pages
Srarm_Unit 1
No ratings yet
Srarm_Unit 1
16 pages
Designing An Expert System For Online Shopping Cart Management
No ratings yet
Designing An Expert System For Online Shopping Cart Management
4 pages
13-RL DRL
No ratings yet
13-RL DRL
102 pages
3003 o Ine Reinforcement Learning W
No ratings yet
3003 o Ine Reinforcement Learning W
15 pages
Mastering Partial Least Squares Structural Equation Modeling (Pls-Sem) with Smartpls in 38 Hours
From Everand
Mastering Partial Least Squares Structural Equation Modeling (Pls-Sem) with Smartpls in 38 Hours
Ken Kwong-Kay Wong
3/5 (1)
report algo
No ratings yet
report algo
3 pages
0 1 knapsack using class notes DP
No ratings yet
0 1 knapsack using class notes DP
18 pages
14.4 RL
No ratings yet
14.4 RL
17 pages
0/1 Single Knapsack Problem: - Weight W - Profit P
No ratings yet
0/1 Single Knapsack Problem: - Weight W - Profit P
3 pages
foam ratio controller
No ratings yet
foam ratio controller
14 pages
EEX5362-2021 Case Study 519222428 PERFORMANCE MODELING
No ratings yet
EEX5362-2021 Case Study 519222428 PERFORMANCE MODELING
8 pages
15 000 Word Dissertation
100% (2)
15 000 Word Dissertation
4 pages
What Do You Think Is The Role of The Person in The Picture Brainly - Google Search
No ratings yet
What Do You Think Is The Role of The Person in The Picture Brainly - Google Search
1 page
Adobe Dreamweaver CS5
No ratings yet
Adobe Dreamweaver CS5
1 page
CSE115/ENGR160 Discrete Mathematics 01/17/12: Ming-Hsuan Yang UC Merced
No ratings yet
CSE115/ENGR160 Discrete Mathematics 01/17/12: Ming-Hsuan Yang UC Merced
39 pages
ADET - Lesson 1
No ratings yet
ADET - Lesson 1
21 pages
Rab Komputer Merek Libera
No ratings yet
Rab Komputer Merek Libera
1 page
UNIT-2 Part3
No ratings yet
UNIT-2 Part3
8 pages
Open Closed Thesis Statement
100% (3)
Open Closed Thesis Statement
8 pages
JP Polyplast Swift SFMS-6523
No ratings yet
JP Polyplast Swift SFMS-6523
5 pages
LCD - Programmer - Product - Guide
No ratings yet
LCD - Programmer - Product - Guide
54 pages
RMPart USB
No ratings yet
RMPart USB
3 pages
s14904.r4 Acorn Rev3, Estun Pronet
No ratings yet
s14904.r4 Acorn Rev3, Estun Pronet
1 page
Citrix Xenapp 6.0, 6
No ratings yet
Citrix Xenapp 6.0, 6
14 pages
701 Dumps
No ratings yet
701 Dumps
12 pages
5DC - AI Chat Bot Widget Offer, Whitelabeled Ad
No ratings yet
5DC - AI Chat Bot Widget Offer, Whitelabeled Ad
2 pages
Formulating The Research Design: Session 7-8
No ratings yet
Formulating The Research Design: Session 7-8
25 pages
Parts Catalogue: RM-Z450K8 RM-Z450K9 RM-Z450L0 RM-Z450L1
No ratings yet
Parts Catalogue: RM-Z450K8 RM-Z450K9 RM-Z450L0 RM-Z450L1
132 pages
Module 3 Living in It
No ratings yet
Module 3 Living in It
12 pages
Cisco Adaptive Security Virtual Appliance (Asav)
No ratings yet
Cisco Adaptive Security Virtual Appliance (Asav)
8 pages
Engage_Developer_Advocates[Influitive]
No ratings yet
Engage_Developer_Advocates[Influitive]
13 pages
Effect of Mach Number On Supersonic Wrap Around Fin Aerodynamics
No ratings yet
Effect of Mach Number On Supersonic Wrap Around Fin Aerodynamics
6 pages
Efi Fiery Es3000 Ds en Us
No ratings yet
Efi Fiery Es3000 Ds en Us
2 pages
DSCombiNorm-BUGB
No ratings yet
DSCombiNorm-BUGB
2 pages
2020 Colocation Data Centers in India
No ratings yet
2020 Colocation Data Centers in India
12 pages
Xap Xi Nhi Thuc Voi Phan Phoi Chuan
No ratings yet
Xap Xi Nhi Thuc Voi Phan Phoi Chuan
3 pages
CS File-1
No ratings yet
CS File-1
19 pages