
Online 3D Bin Packing with Constrained Deep Reinforcement Learning

Hang Zhao1, Qijin She1, Chenyang Zhu1, Yin Yang2, Kai Xu1*
1National University of Defense Technology, 2Clemson University

Abstract

We solve a challenging yet practically useful variant of the 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into a single bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement is also subject to the constraints of order dependence and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process (CMDP). To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a prediction-and-projection scheme: the agent first predicts a feasibility mask for the placement actions as an auxiliary task and then uses the mask to modulate the action probabilities output by the actor during training. Such supervision and projection facilitate the agent to learn feasible policies very efficiently. Our method can be easily extended to handle lookahead items, multi-bin packing, and item re-orienting. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A preliminary user study even suggests that our method might attain a human-level performance.

Figure 1: Online 3D-BPP, where the agent observes only a limited number of lookahead items (shaded in green), is widely useful in logistics, manufacture, warehousing, etc.

1 Introduction

As a classic NP-hard problem, the bin packing problem (1D-BPP) seeks an assignment of a collection of items with various weights to bins. The optimal assignment houses all the items with the fewest bins such that the total weight of items in a bin is below the bin's capacity c (Korte and Vygen 2012). In its 3D version, i.e., 3D-BPP (Martello, Pisinger, and Vigo 2000), an item i has a 3D "weight" corresponding to its length l_i, width w_i, and height h_i. Similarly, c is also in 3D, including L ≥ l_i, W ≥ w_i, and H ≥ h_i. It is assumed that l_i, w_i, h_i, L, W, H ∈ Z+ are positive integers. Given the set of items I, we would like to pack all the items into as few bins as possible. Clearly, 1D-BPP is a special case of its three-dimensional counterpart: as long as we constrain h_i = H and w_i = W for all i ∈ I, a 3D-BPP instance can be relaxed to a 1D-BPP. Therefore, 3D-BPP is also highly NP-hard (Man Jr, Garey, and Johnson 1996).

Regardless of its difficulty, the bin packing problem turns out to be one of the most needed academic problems (Skiena 1997) (the second most needed, only after the suffix tree problem), as many real-world challenges could be handled much more efficiently if we had a good solution to it. A good example is large-scale parcel packaging in modern logistics systems (Figure 1), where parcels are mostly in regular cuboid shapes, and we would like to collectively pack them into rectangular bins of a standard dimension. Maximizing the storage use of bins effectively reduces the cost of inventorying, wrapping, transportation, and warehousing. While being strongly NP-hard, 1D-BPP has been extensively studied. With state-of-the-art computing hardware, big 1D-BPP instances (with about 1,000 items) can be exactly solved within tens of minutes (Delorme, Iori, and Martello 2016) using, e.g., integer linear programming (ILP) (Schrijver 1998), and good approximations can be obtained within milliseconds. On the other hand, 3D-BPP, due to the extra complexity imposed, is relatively less explored.

Solving a 3D-BPP of moderate size exactly (either using ILP or branch-and-bound) is much more involved, and we still have to resort to heuristic algorithms (Crainic, Perboli, and Tadei 2008; Karabulut and İnceoğlu 2004).

Most existing 3D-BPP literature assumes that the information of all items is known, does not take physical stability into consideration, and adopts packing strategies that allow backtracking, i.e., one can always repack an item from the bin in order to improve the current solution (Martello, Pisinger, and Vigo 2000). In practice, however, we do not know the information of all items. For instance, see Figure 1, where a robot works beside a bin and a conveyor forwards parcels sequentially. The robot may only have the vision of several upcoming items (similar to Tetris), and an item must be packed within a given time period after its arrival. It is costly and inefficient if the robot frequently unloads and readjusts parcels in packed bins. Such constraints further complicate 3D-BPP in its real-world applications.

*Hang Zhao and Qijin She are co-first authors. Kai Xu is the corresponding author ([email protected]).
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

As an echo to those challenges, we design a deep reinforcement learning algorithm for 3D-BPP. To maximize the applicability, we carefully accommodate restrictions raised in its actual usage. For instance, we require item placement to satisfy order dependence and not to induce unstable stacking. An item is immediately packed upon its arrival, and no adjustment is permitted after it is packed. To this end, we opt to formulate our problem as a constrained Markov decision process (CMDP) (Altman 1999) and propose a constrained DRL approach based on the on-policy actor-critic framework (Mnih et al. 2016; Wu et al. 2017).

In particular, we introduce a prediction-and-projection scheme for the training of constrained DRL. The agent first predicts a feasibility mask for the placement actions as an auxiliary task. It then uses the mask to modulate the action probabilities output by the actor. This supervision and projection enable the agent to learn feasible policies very efficiently. We also show that our method is general, with the ability to handle lookahead items, multi-bin packing, and item re-orienting. With a thorough test and validation, we demonstrate that our algorithm outperforms existing methods by a noticeable margin. It even demonstrates a human-level performance in a preliminary user study.

2 Related Work

1D-BPP is one of the most famous problems in combinatorial optimization, and related literature dates back to the sixties (Kantorovich 1960). Many variants and generalizations of 1D-BPP arise in practical contexts such as the cutting stock problem (CSP), in which we want to cut bins to produce desired items of different weights and minimize the total number of bins used. A comprehensive bibliography on 1D-BPP and CSP can be found in (Sweeney and Paternoster 1992). Known to be strongly NP-hard, most existing literature focuses on designing good heuristic and approximation algorithms and their worst-case performance analysis (Coffman, Garey, and Johnson 1984). For example, the well-known greedy algorithm, the next fit algorithm (NF), has a linear time complexity of O(N) and its worst-case performance ratio is 2, i.e., NF needs at most twice as many bins as the optimal solution does (De La Vega and Lueker 1981). The first fit algorithm (FF) allows an item to be packed into previous bins that are not yet full, and its time complexity increases to O(N log N). The best fit algorithm (BF) aims to reduce the residual capacity of all the non-full bins. Both FF and BF have a better worst-case performance ratio of 17/10 than NF (Johnson et al. 1974). Pre-sorting all the items yields the off-line version of those greedy strategies, sometimes also known as the decreasing version (Martello 1990). While straightforward, NF, FF, and BF form a foundation for more sophisticated approximations to 1D-BPP (e.g., see (Karmarkar and Karp 1982)) or its exact solutions (Martello and Toth 1990; Scholl, Klein, and Jürgens 1997; Labbé, Laporte, and Martello 1995; Delorme, Iori, and Martello 2016). We also refer the reader to the BPPLib library (Delorme, Iori, and Martello 2018), which includes implementations of most known algorithms for the 1D-BPP problem.

2D- and 3D-BPP are natural generalizations of the original BPP. Here, an item does not only have a scalar-valued weight but a high-dimensional size of width, height, and/or depth. The main difference between 1D- and 2D-/3D-packing problems is the verification of the feasibility of the packing, i.e., determining whether an accommodation of the items inside the bin exists such that items do not interpenetrate and the packing is within the bin size. The complexity and the difficulty significantly increase for high-dimensional BPP instances. In theory, it is possible to generalize exact 1D solutions like MTP (Martello and Toth 1990) or branch-and-bound (Delorme, Iori, and Martello 2016) algorithms to 2D-BPP (Martello and Vigo 1998) and 3D-BPP (Martello, Pisinger, and Vigo 2000). However, according to the timing statistics reported in (Martello, Pisinger, and Vigo 2000), exactly solving a 3D-BPP of a size matching an actual parcel packing pipeline, which could deal with tens of thousands of parcels, remains infeasible. Resorting to approximation algorithms is a more practical choice for us. Hifi et al. (2010) proposed a mixed linear programming algorithm for 3D-BPP by relaxing the integer constraints in the problem. Crainic et al. (2008) refined the idea of corner points (Martello, Pisinger, and Vigo 2000), where an upcoming item is placed at the so-called extreme points to better explore the unoccupied space in a bin. Heuristic local search iteratively improves an existing packing by searching within a neighbourhood function over the set of solutions. There have been several strategies for designing fast approximate algorithms, e.g., guided local search (Faroe, Pisinger, and Zachariasen 2003), greedy search (De Castro Silva, Soma, and Maculan 2003), and tabu search (Lodi, Martello, and Vigo 1999; Crainic, Perboli, and Tadei 2009). Similar strategies have also been adapted to online BPP (Ha et al. 2017; Wang et al. 2016). In contrast, genetic algorithms lead to better solutions as a global, randomized search (Li, Zhao, and Zhang 2014; Takahara and Miyamoto 2005).

Deep reinforcement learning (DRL) has demonstrated tremendous success in learning complex behaviour skills and solving challenging control tasks with high-dimensional raw sensory state-spaces (Lillicrap et al. 2015; Mnih et al. 2015, 2016). The existing research can largely be divided into two lines: on-policy methods (Schulman et al. 2017; Wu et al. 2017; Zhao et al. 2018) and off-policy ones (Mnih et al. 2015; Wang et al. 2015; Barth-Maron et al. 2018). On-policy algorithms optimize the policy with agent-environment interaction data sampled from the current policy. While the inability to reuse old data makes them less data-efficient, updates calculated from on-policy data lead to stable optimization. In contrast, off-policy methods are more data-efficient but less stable. In our problem, agent-environment interaction data is easy to obtain (at 2000 FPS), so data efficiency is not our main concern. We base our method on the on-policy actor-critic framework. In addition, we formulate online 3D-BPP as constrained DRL and solve it by projecting the trajectories sampled from the actor to the constrained state-action space, instead of resorting to more involved constrained policy optimization (Achiam et al. 2017).

Figure 2: Left: The environment state of the agent includes the configuration of the bin (the grey boxes) and the size of the next item to be packed (green box). The bin configuration is parameterized as a height map H over an L × W grid. The feasibility mask M is a binary matrix of size L × W indicating the placement feasibility at each grid cell. The three dimensions of the next item are stored in an L × W × 3 tensor D. Right: The network architecture (the three losses other than the standard actor and critic losses are shown in red).
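As a concrete illustration of this state layout, the short NumPy sketch below assembles s_n = (H_n, D_n) as an L × W × 4 array exactly as Figure 2 describes: one height-map channel plus three constant channels holding the next item's dimensions. The function and variable names are illustrative assumptions, not taken from the authors' released code.

```python
import numpy as np

L, W = 10, 10  # bin footprint resolution used in the paper's experiments

def build_state(height_map: np.ndarray, item_dims) -> np.ndarray:
    """Stack the height map H_n with the 'stretched' item tensor D_n.

    height_map: (L, W) integer array of current stack heights.
    item_dims:  (l_n, w_n, h_n) of the next item to be packed.
    Returns an (L, W, 4) array: channel 0 is H_n, channels 1-3 are constant
    planes holding l_n, w_n, h_n (the D_n tensor of Figure 2).
    """
    l_n, w_n, h_n = item_dims
    d = np.stack([np.full((L, W), l_n),
                  np.full((L, W), w_n),
                  np.full((L, W), h_n)], axis=-1)
    return np.concatenate([height_map[..., None], d], axis=-1)

# Example: empty bin, next item of size 2 x 3 x 2.
state = build_state(np.zeros((L, W), dtype=int), (2, 3, 2))
print(state.shape)  # (10, 10, 4)
```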

RL for combinatorial optimization has a distinguished history (Gambardella and Dorigo 1995; Zhang and Dietterich 2000) and is still an active direction, with an especially intensive focus on TSP (Bello et al. 2016). Early attempts strive for heuristics selection using RL (Nareyek 2003). Bello et al. (2016) combined RL pretraining and active search and demonstrated that RL-based optimization outperforms supervised learning frameworks when tackling NP-hard combinatorial problems. Recently, Hu et al. (2017) proposed a DRL solution to 3D-BPP. Laterre et al. (2018) introduced a rewarding strategy based on self-play. Different from ours, these works deal with an offline setting where the main goal is to find an optimal sequence of items, inspired by the Pointer Network (Vinyals, Fortunato, and Jaitly 2015).

3 Method

In online 3D-BPP, the agent is agnostic of l_i, w_i or h_i of all the items in I; only the immediately incoming ones I_o ⊂ I are observable. As soon as an item arrives, we pack it into the bin, and no further adjustment is applied. As the complexity of BPP decreases drastically for bigger items, we further constrain the sizes of all items to l_i ≤ L/2, w_i ≤ W/2, and h_i ≤ H/2. We start with our problem statement under the context of DRL and the formulation based on constrained DRL. We then show how we solve the problem via predicting action feasibility in the actor-critic framework.

3.1 Problem statement and formulation

The 3D-BPP can be formulated as a Markov decision process, which is a tuple (S, A, P, R). S is the set of environment states; A is the action set; R : S × A → R is the reward function; P : S × A × S → [0, 1] is the transition probability function. P(s′|s, a) gives the probability of transiting from s to s′ for a given action a. Our method is model-free since we do not learn P(s′|s, a). The policy π : S → A is a map from states to probability distributions over actions, with π(a|s) denoting the probability of selecting action a under state s. For DRL, we seek a policy π maximizing the accumulated discounted reward J(π) = E_{τ∼π}[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ]. Here, γ ∈ [0, 1] is the discount factor, and τ = (s_0, a_0, s_1, ...) is a trajectory sampled based on the policy π.

The environment state of 3D-BPP is comprised of two parts: the current configuration of the bin and the coming items to be placed. For the first part, we parameterize the bin by discretizing its bottom area as an L × W regular grid along the length (X) and width (Y) directions, respectively. We record at each grid cell the current height of stacked items, leading to a height map H_n (see Figure 2). Here, the subscript n indicates that n is the next item to be packed. Since all the dimensions are integers, H_n ∈ Z^(L×W) can be expressed as a 2D integer array. The dimensionality of item n is given as d_n = [l_n, w_n, h_n] ∈ Z^3. Working with integer dimensions helps to reduce the state/action space and accelerates the policy learning significantly. A spatial resolution of up to 30 × 30 is sufficient in many real scenarios.

Putting it together, the current environment state can be written as s_n = {H_n, d_n, d_{n+1}, ..., d_{n+k-1}}. We first consider the case where k = |I_o| = 1 and name this special instance BPP-1. In other words, BPP-1 only considers the immediately coming item n, i.e., I_o = {n}. We then generalize it to BPP-k with k > 1 afterwards.

BPP-1 In BPP-1, the agent places n's front-left-bottom (FLB) corner (Figure 2 (left)) at a certain grid point, called the loading position (LP), in the bin. For instance, if the agent chooses to put n at the LP (x_n, y_n), this action is represented as a_n = x_n + L · y_n ∈ A, where the action set is A = {0, 1, ..., L · W − 1}. After a_n is executed, H_n is updated by adding h_n to the maximum height over the cells covered by n: H_n(x, y) = h_max(x, y) + h_n for x ∈ [x_n, x_n + l_n], y ∈ [y_n, y_n + w_n], with h_max(x, y) being the maximum height among those cells. The state transition is deterministic: the updated height map is reached with probability 1, and any other height map with probability 0.

During packing, the agent needs to secure enough space in the bin to host item n. Meanwhile, it is equally important to have n statically equilibrated by what lies underneath it at the LP so that all the stacked items are physically stable. Evaluating the physical stability at an LP is involved, taking into account n's center of mass, moment of inertia, and rotational stability (Goldstein, Poole, and Safko 2002). All of them are normally unknown, as the mass distribution differs among items.
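The placement action and the deterministic height-map update of Section 3.1 can be stated in a few lines. The sketch below is a hedged illustration, not the paper's implementation; it assumes the same integer grid and uses half-open index ranges for the item footprint.

```python
import numpy as np

L, W = 10, 10

def decode_action(a: int):
    """Recover the loading position (x_n, y_n) from a_n = x_n + L * y_n."""
    return a % L, a // L

def place_item(height_map: np.ndarray, a: int, item_dims):
    """Apply one BPP-1 placement and return the updated height map.

    The item's FLB corner goes to the decoded LP; every covered cell is
    raised to (max height under the footprint) + h_n, matching the
    deterministic transition described in Section 3.1.
    """
    l_n, w_n, h_n = item_dims
    x, y = decode_action(a)
    h = height_map.copy()
    footprint = h[x:x + l_n, y:y + w_n]      # cells covered by the item
    footprint[:] = footprint.max() + h_n     # h_max(x, y) + h_n
    return h

hm = np.zeros((L, W), dtype=int)
hm = place_item(hm, a=0, item_dims=(2, 2, 3))  # put a 2x2x3 item at (0, 0)
print(hm[:3, :3])
```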

To this end, we employ a conservative and simplified criterion. Specifically, an LP is considered feasible if it not only provides sufficient room for n but also satisfies any of the following conditions with n placed: 1) over 60% of n's bottom area and all of its four bottom corners are supported by existing items; or 2) over 80% of n's bottom area and three out of four bottom corners are supported; or 3) over 95% of n's bottom area is supported. We store the feasibility of all the LPs for item n in a feasibility mask M_n, an L × W binary matrix (also see Figure 2).

Since not all actions are allowed, our problem becomes a constrained Markov decision process (CMDP) (Altman 1999). Typically, one augments the MDP with an auxiliary cost function C : S × A → R mapping state-action tuples to costs and requires that the expectation of the accumulated cost be bounded by c_m: J_C(π) = E_{τ∼π}[ Σ_{t=0}^{∞} γ_C^t C(s_t, a_t) ] ≤ c_m. Several methods have been proposed to solve CMDPs based on, e.g., algorithmic heuristics (Uchibe and Doya 2007), primal-dual methods (Chow et al. 2017), or constrained policy optimization (Achiam et al. 2017). While these methods are proven effective, it is unclear how they could fit 3D-BPP instances, where the constraint is rendered as a discrete mask. In this work, we propose to exploit the mask M to guide the DRL training, enforcing the feasibility constraint without introducing excessive training complexity.

3.2 Network architecture

We adopt the actor-critic framework with Kronecker-Factored Trust Region (ACKTR) (Wu et al. 2017). It iteratively updates an actor and a critic module jointly. In each iteration, the actor learns a policy network that outputs the probability of each action (i.e., placing n at each LP). The critic trains a state-value network producing the value function. We find through experiments that on-policy methods (such as ACKTR) lead to better performance than off-policy ones like SAC (Haarnoja et al. 2018); see a comparison in the supplemental material.

State input In the original ACKTR framework, both the actor and critic networks take the raw state directly as input. In our implementation, however, we devise a CNN, named state CNN, to encode the raw state vector into features. To facilitate this, we "stretch" d_n into a three-channel tensor D_n ∈ Z^(L×W×3) so that each channel of d_n spans an L × W matrix with all of its elements being l_n, w_n or h_n, respectively (also see Figure 2). Consequently, the state s_n = (H_n, D_n) becomes an L × W × 4 array (Figure 2 (right)).

Reward We define a simplistic step-wise reward as the volumetric occupancy introduced by the current item: r_n = 10 × l_n · w_n · h_n / (L · W · H) for item n. When the current item is not placeable, its reward is zero and the episode ends. While the feasibility mask saves the effort of exploring invalid actions, this step-wise reward directs the agent to place as many items as possible. We find through comparison that this step-wise reward is superior to a termination one (e.g., the final space utilization); see the supplemental material.

Feasibility constraints We devise a prediction-and-projection mechanism to enforce feasibility constraints. First, we introduce an independent multilayer perceptron module, namely the mask predictor (Figure 2 (right)), to predict the feasibility mask M_n for the item n. The predictor takes the state CNN features of the current state as input and is trained with the ground-truth mask as supervision. Next, we use the predicted mask to modulate the output, i.e., the probability distribution of the actions. In theory, if the LP at (x, y) is infeasible for n, the corresponding probability P(a_n = x + L·y | s_n) should be set to 0. However, we find that setting P to a small positive quantity such as 10^-3 works better in practice: it provides a strong penalty to an invalid action but a smoother transformation beneficial to the network training. The inset shows that softening the mask-based modulation improves the training convergence. To further discourage infeasible actions, we explicitly minimize the summed probability over all infeasible LPs, E_inf = Σ_{(x,y): M_n(x,y)=0} P(a_n = x + L · y | s_n), which is plugged into the final loss function for training.

Loss function Our loss function is defined as:

L = α · L_actor + β · L_critic + λ · L_mask + ω · E_inf − ψ · E_entropy. (1)

Here, L_actor and L_critic are the loss functions used for training the actor and the critic, respectively. L_mask is the MSE loss for mask prediction. To push the agent to explore more LPs, we also utilize an action entropy loss E_entropy = Σ_{M_n(x,y)=1} −P(a_n|s_n) · log P(a_n|s_n). Note that the entropy is computed only over the set of all feasible actions whose LP satisfies M_n(x, y) = 1. In this way, we stipulate that the agent explores only feasible actions. We find through experiments that the following weights lead to consistently good performance throughout our tests: α = 1, β = λ = 0.5, and ω = ψ = 0.01.

3.3 BPP-k with k = |I_o| > 1

In a more general case, the agent receives the information of k > 1 lookahead items (i.e., from n to n + k − 1). Obviously, the additional items inject more information into the environment state, which should be exploited in learning the policy π(a_n | H_n, d_n, ..., d_{n+k-1}). One possible solution is to employ sequential modeling of the state sequence (d_n, ..., d_{n+k-1}) using, e.g., recurrent neural networks. We found, however, that such state encoding cannot well inform the agent about the lookahead items during DRL training and yields limited improvement. Alternatively, we propose a search-based solution leveraging the height map update and feasibility mask prediction.

The core idea is to condition the placement of the current item n on the next k − 1 ones. Note that the actual placement of the k items still follows the order of arrival.
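The ground-truth feasibility mask follows from the three support conditions listed in Section 3.1 (60% support plus all four bottom corners, 80% plus three corners, or 95% of the bottom area). Below is one possible reading of those rules over the discretized height map; treating cells at the item's resting height as the supporting set, and checking the height fit against H as the "sufficient room" test, are simplifying assumptions of this sketch rather than the authors' exact implementation.

```python
import numpy as np

def feasibility_mask(height_map, item_dims, H=10):
    """Ground-truth feasibility mask M_n for item n (1 = feasible LP)."""
    L, W = height_map.shape
    l, w, h = item_dims
    mask = np.zeros((L, W), dtype=np.int8)
    for x in range(L - l + 1):
        for y in range(W - w + 1):
            cells = height_map[x:x + l, y:y + w]
            top = cells.max()
            if top + h > H:                       # not enough room in the bin
                continue
            support = (cells == top)              # cells touching the item bottom
            ratio = support.mean()
            corners = [support[0, 0], support[0, -1],
                       support[-1, 0], support[-1, -1]]
            if (ratio > 0.60 and all(corners)) \
               or (ratio > 0.80 and sum(corners) >= 3) \
               or (ratio > 0.95):
                mask[x, y] = 1
    return mask

hm = np.zeros((10, 10), dtype=int)
print(feasibility_mask(hm, (3, 3, 3)).sum())  # 64: every in-bounds LP on the empty floor
```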

Table 1: This ablation study compares the space utilization and the total number of packed items with different combinations of MP, MC and FE, on the CUT-2 dataset. The five tested configurations range from 7.82% space utilization with 2.0 items packed and 27.9% with 7.5 items for the weakest settings, through 63.0% / 16.7 items and 63.7% / 16.9 items, up to 66.9% / 17.5 items for the full combination of MP, MC, and FE.
Figure 3: The permutation tree for I_o = {1, 2, 3}. To find the best packing for item 1, our method explores different virtual placing orders satisfying the order dependence constraint, e.g., 1 cannot be placed on top of virtually placed 2 or 3.

To make the current placement account for the future ones, we opt to "hallucinate" the placement of future items by updating the height map accordingly. Conditioned on the virtually placed future items, the decision for the current item could be globally more optimal. However, such virtual placement must satisfy the order dependence constraint, which stipulates that earlier items should never be packed on top of later ones. In particular, given two items p and q with p < q in I_o, if q is (virtually) placed before p, we require that the placement of p be spatially independent of the placement of q. This means p can never be packed at any LPs that overlap with q. The constraint is enforced by setting the height values in H at the corresponding LPs to H, the maximum height value allowed: H_p(x, y) ← H for all x ∈ [x_q, x_q + l_q] and y ∈ [y_q, y_q + w_q]. Combining explicit height map updating with feasibility mask prediction, the agent utilizes the trained policy with the order dependence constraint satisfied implicitly.

Monte Carlo permutation tree search We opt to search for a better a_n by exploring the permutations of the sequence (d_n, ..., d_{n+k-1}). This amounts to a permutation tree search during which only the actor network is evaluated; no training is needed. Figure 3 shows a k-level permutation tree: a path (r, v_1, v_2, ..., v_k) from the root to a leaf forms a possible permutation of the placement of the k items in I_o, where r is the (empty) root node and item(v_i) represents the i-th item being placed in the permutation. Given two items, item(v_i) < item(v_j) means item(v_i) arrives before item(v_j) in the actual order. If i > j along a permutation path, meaning that item(v_j) is virtually placed before item(v_i), we block the LPs corresponding to item(v_j)'s occupancy to avoid placing item(v_i) on top of item(v_j). Clearly, enumerating all the permutations of k items quickly becomes prohibitive with an O(k!) complexity. To make the search scalable, we adapt Monte Carlo tree search (MCTS) (Silver et al. 2017) to our problem. With MCTS, the permutation tree is expanded in a priority-based fashion by evaluating how promising a node is to lead to the optimal solution. The latter is evaluated by sampling a fixed number of paths starting from that node and computing for each path a value summing up the accumulated reward and the critic value ("reward to go") at the leaf (k-th level) node. After the search, we choose the action a_n corresponding to the permutation with the highest path value. Please refer to the supplemental material for more details on our adaptations of the standard MCTS. MCTS allows a scalable lookahead for BPP-k with a complexity of O(km), where m is the number of paths sampled.

4 Experiments

We implement our framework on a desktop computer (Ubuntu 16.04) equipped with an Intel Xeon Gold 5115 CPU @ 2.40 GHz, 64 GB memory, and an Nvidia Titan V GPU with 12 GB memory. The DRL and all other networks are implemented with PyTorch (Paszke et al. 2019). The model training takes about 16 hours at a spatial resolution of 10 × 10. The test time of the BPP-1 model (no lookahead) is less than 10 ms. Please refer to the supplemental material for more implementation details.

Training and test set We set L = W = H = 10 in our experiments with 64 pre-defined item dimensions (|I| = 64). Results with higher spatial resolution are given in the supplemental material. We also set l_i ≤ L/2, w_i ≤ W/2 and h_i ≤ H/2 to avoid over-simplified scenarios. The training and test sequences are synthesized by generating items out of I, and the total volume of the items should be equal to or bigger than the bin's volume. We first create a benchmark called RS where the sequences are generated by sampling items out of I randomly. A disadvantage of the random sampling is that the optimality of a sequence is unknown (unless performing a brute-force search). Without knowing whether a sequence would lead to a successful packing, it is difficult to gauge the packing performance with this benchmark. Therefore, we also generate training sequences via cutting stock (Gilmore and Gomory 1961). Specifically, items in a sequence are created by sequentially "cutting" the bin into items of the 64 pre-defined types, so that we know the sequence can be perfectly packed and restored back to the bin. There are two variations of this strategy. CUT-1: after the cutting, we sort the resulting items into the sequence based on the Z coordinates of their FLBs, from bottom to top. If the FLBs of two items have the same Z coordinate, their order in the sequence is randomly determined. CUT-2: the cut items are sorted based on their stacking dependency: an item can be added to the sequence only after all of its supporting items are there. A 2D toy example is given in the inset figure with the FLB of each item highlighted. Under CUT-1, both {1, 2, 3, 4} and {2, 1, 3, 4} are valid item sequences. If we use CUT-2, on the other hand, {1, 3, 2, 4} and {2, 4, 1, 3} would also be valid sequences, as the placement of 3 or 4 depends on 1 or 2. For testing purposes, we generate 2,000 sequences using RS, CUT-1, and CUT-2, respectively.
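Algorithm details for the cutting-stock generation are left to the supplemental material; the sketch below only illustrates the idea behind CUT-1 under simplifying assumptions (guillotine cuts along the longest axis, item sides kept within [2, 5], ties in the FLB Z coordinate broken by a random shuffle). CUT-2 would instead order the same items by stacking dependency.

```python
import random

def cut_bin(x, y, z, l, w, h, max_dim=5, min_dim=2):
    """Recursively cut a cuboid into items no larger than max_dim per side.

    Returns a list of (flb, dims) tuples, flb = (x, y, z), dims = (l, w, h).
    """
    dims = [l, w, h]
    if max(dims) <= max_dim:
        return [((x, y, z), (l, w, h))]
    axis = max(range(3), key=lambda i: dims[i])          # cut the longest side
    cut = random.randint(min_dim, dims[axis] - min_dim)  # keep both parts valid
    a, b = list(dims), list(dims)
    a[axis], b[axis] = cut, dims[axis] - cut
    off = [0, 0, 0]
    off[axis] = cut
    return (cut_bin(x, y, z, *a, max_dim, min_dim)
            + cut_bin(x + off[0], y + off[1], z + off[2], *b, max_dim, min_dim))

def cut1_sequence(L=10, W=10, H=10, seed=0):
    """CUT-1: order the cut items by the Z coordinate of their FLB corners."""
    random.seed(seed)
    items = cut_bin(0, 0, 0, L, W, H)
    random.shuffle(items)                   # ties broken randomly
    items.sort(key=lambda it: it[0][2])
    return [dims for _, dims in items]

print(cut1_sequence()[:5])
```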

Figure 4: Packing results in the ablation study on Seq 1 (CUT-1), Seq 2 (CUT-2), and Seq 3 (RS). The numbers beside each bin are space uti. and # items.

Figure 5: HM shows a clear advantage over vector-based height parameterizations (HV and ISV).

Figure 6: Comparison to DRL with reward tuning. Our method obtains much better space utilization.
The performance of the packing algorithm is quantitated with space utilization (space uti.) and the total number of items packed in the bin (# items). In the supplemental material, we provide visual packing results on all three datasets, as well as an evaluation of model generalization across different datasets. Animated results can be found in the accompanying video.

Ablation study Table 1 reports an ablation study. From the results, we find that the packing performance drops significantly if we do not incorporate the feasibility mask prediction (MP) during the training. The performance is impaired if the mask constraint (MC) is not enforced with our projection scheme. The feasibility-based entropy (FE) is also beneficial for both the training and the final performance. Figure 4 demonstrates the packing results visually for the different method settings.

Height parameterization Next, we show that the environment parameterization using the proposed 2D height map (HM) (i.e., the H matrix) is necessary and effective. To this end, we compare our method using HM against two straightforward 1D alternatives. The first competitor is the height vector (HV), which is an L · W-dimensional vector stacking the columns of H. The second competitor is referred to as the item sequence vector (ISV). The ISV lists all the information of the items currently packed in the bin. Each packed item has 6 parameters corresponding to the X, Y and Z coordinates of its FLB as well as the item's dimensions. In our test on CUT-1, HM leads to 16.0% and 19.1% higher space utilization and 4.3 and 5.0 more items packed than HV and ISV, respectively. The plots in Figure 5 compare the average reward received using the different parameterizations. These results show that the 2D height map (HM) is an effective way to describe the state-action space for 3D-BPP.

Constraint vs. reward In DRL training, one usually discourages low-profile moves by tuning the reward function. We show that this strategy is less effective than our constraint-based method (i.e., learning invalid moves by predicting the mask). In Figure 6, we compare to an alternative method which uses a negative reward to penalize unsafe placements. Constraint-based DRL seldom predicts invalid moves (predicted placements are 99.5% legit).

Scalability of BPP-k With the capability of lookahead, it is expected that the agent better exploits the remaining space in the bin and delivers a more compact packing. On the other hand, due to the NP-hard nature, big k values increase the environment space exponentially. Therefore, it is important to understand whether MCTS is able to effectively navigate a space at the scale of O(k!) for a good packing strategy. In Figure 7(a,b), we compare our method with a brute-force permutation search, which traverses all k! permutations of the k coming items and chooses the best packing strategy (i.e., the global optimum).

Figure 7: (a): Our permutation-based MCTS maintains good time efficiency as the number of lookahead items increases. (b): Our MCTS-based BPP-k model achieves similar performance (avg. space utility) to the brute-force search over the permutation tree. (c): The distribution of space utilization using boundary rule (Heu.), human intelligence (Hum.), and our BPP-1 method (Ours).

Table 2: Multi-bin packing tested with the CUT-2 dataset.

# bins | Space uti. | # items per bin | # total items | Decision time
1      | 67.4%      | 17.6            | 17.6          | 2.2 × 10^-3 s
4      | 69.4%      | 18.8            | 75.2          | 6.3 × 10^-3 s
9      | 72.1%      | 19.1            | 171.9         | 1.8 × 10^-2 s
16     | 75.3%      | 19.6            | 313.6         | 2.8 × 10^-2 s
25     | 77.8%      | 20.2            | 505.0         | 4.5 × 10^-2 s

Table 3: Comparison with three baselines including both online and offline approaches (# items / space uti.).

Method                 | RS           | CUT-1        | CUT-2
Boundary rule (Online) | 8.7 / 34.9%  | 10.8 / 41.2% | 11.1 / 40.8%
BPH (Online)           | 8.7 / 35.4%  | 13.5 / 51.9% | 13.1 / 49.2%
LBP (Offline)          | 12.9 / 54.7% | 14.9 / 59.1% | 15.2 / 59.5%
Our BPP-1 (Online)     | 12.2 / 50.5% | 19.1 / 73.4% | 17.5 / 66.9%

We also compare to an MCTS-based action search with k lookahead items in which no item permutation is involved. We find that our MCTS-based permutation tree search yields the best results: although having a slightly lower space utilization rate (≈ 3%) than the brute-force search, it is far more efficient. The search time of the brute-force permutation quickly surpasses 100 s when k = 8. Our method takes only 3.6 s even for k = 20, when permutation needs hours. A larger k makes the brute-force search computationally intractable.

Extension to different 3D-BPP variants Our method is versatile and can be easily generalized to handle different 3D-BPP variants such as admitting multiple bins or allowing item re-orientation. To realize multi-bin 3D-BPP, we initialize multiple BPP-1 instances matching the total bin number. When an item arrives, we pack it into the bin in which the item introduces the least drop of the critic value given by the corresponding BPP-1 network. More details can be found in the supplemental material. Table 2 shows our results for a varying number of bins. More bins provide more options to host an item, thus leading to better performance (avg. number of items packed). Both time (decision time per item) and space complexities grow linearly with the number of bins. We consider only horizontal, axis-aligned orientations of an item, which means that each item has two possible orientations. We therefore create two feasibility masks for each item, one for each orientation. The action space is also doubled. The network is then trained to output actions in the doubled action space. In our test on the RS dataset, we find that allowing re-orientation increases the space utilization by 11.6% and the average number of items packed by 3, showing that our network handles item re-orientation well.

Comparison with non-learning methods Existing works mostly study offline BPP and usually adopt non-learning methods. We compare to two representatives with source code available. The first is a heuristic-based online approach, BPH (Ha et al. 2017), which allows the agent to select the next best item from k lookahead ones (i.e., BPP-k with re-ordering). In Table 3, we compare to its BPP-1 version to be fair. In Figure 8, we compare online BPH and our method under the setting of BPP-k. Most existing methods focus on offline packing where the full sequence of items is known a priori. The second method is the offline LBP method (Martello, Pisinger, and Vigo 2000), which is again heuristic based. In addition, we also design a heuristic baseline which we call the boundary rule method. It replicates human behavior by trying to place a new item side-by-side with the existing packed items and keeping the packing volume as regular as possible (details in the supplemental material). From the comparison in Table 3, our method outperforms all alternative online methods on all three benchmarks and even beats the offline approach on CUT-1 and CUT-2. Through examining the packing results visually, we find that our method automatically learns the above "boundary rule" even without imposing such constraints explicitly. From Figure 8, our method performs better than online BPH consistently with a varying number of lookahead items, even though BPH allows re-ordering of the lookahead items.

Figure 8: Comparison with the online BPH method (Ha et al. 2017) on BPP-k. Note that BPH allows lookahead item re-ordering while ours does not.

We also conducted a preliminary comparison on a real robot test of BPP-1 (see our accompanying video). Over 50 random item sequences, our method achieves 66.3% space utilization on average, much higher than the boundary rule (39.2%) and online BPH (43.2%).

Our method vs. human intelligence The strongest competitor to all heuristic algorithms may be human intuition. To this end, we created a simple Sokoban-like app (see the supplemental material) and asked 50 human users to pack items manually vs. AI (our method). The winner is the one with the higher space utilization rate. 15 of the users are palletizing workers and the rest are CS-majored undergraduate/graduate students. We do not impose any time limit on the users. The statistics are plotted in Figure 7(c). To our surprise, our method outperforms human players in general (1,339 AI wins vs. 406 human wins and 98 evens): it achieves 68.9% average space utilization over 1,851 games, while human players only reach 52.1%.

5 Conclusion

We have tackled a challenging online 3D-BPP by formulating it as a constrained Markov decision process and solving it with constrained DRL. The constraints include order dependence and physical stability. Within the actor-critic framework, we achieve policy optimization subject to the complicated constraints based on a height-map bin representation and action feasibility prediction. In realizing BPP with multiple lookahead items, we adopt MCTS to search for the best action over different permutations of the lookahead items. In the future, we would like to investigate more relaxations of the problem. For example, one could lift the order dependence constraint by adding a buffer zone smaller than |I_o|. Another more challenging relaxation is to learn to pack items with irregular shapes.

Acknowledgments

We thank the anonymous PC, AC and extra reviewers for their insightful comments and valuable suggestions. We are also grateful to the colleagues of Speed-Bot Robotics for their help on the real robot test. Thanks also go to Chi Trung Ha for providing the source code of their work (Ha et al. 2017). This work was supported in part by the National Key Research and Development Program of China (No. 2018AAA0102200), NSFC (62002376, 62002375, 61532003, 61572507, 61622212) and NUDT Research Grants (No. ZK19-30).

References

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 22–31. JMLR.org.
Altman, E. 1999. Constrained Markov decision processes, volume 7. CRC Press.
Barth-Maron, G.; Hoffman, M. W.; Budden, D.; Dabney, W.; Horgan, D.; Muldal, A.; Heess, N.; and Lillicrap, T. 2018. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
Chow, Y.; Ghavamzadeh, M.; Janson, L.; and Pavone, M. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1): 6070–6120.
Coffman, E. G.; Garey, M. R.; and Johnson, D. S. 1984. Approximation algorithms for bin-packing—an updated survey. In Algorithm design for computer system design, 49–106. Springer.
Crainic, T. G.; Perboli, G.; and Tadei, R. 2008. Extreme point-based heuristics for three-dimensional bin packing. Informs Journal on Computing 20(3): 368–384.
Crainic, T. G.; Perboli, G.; and Tadei, R. 2009. TS2PACK: A two-level tabu search for the three-dimensional bin packing problem. European Journal of Operational Research 195(3): 744–760.
De Castro Silva, J.; Soma, N.; and Maculan, N. 2003. A greedy search for the three-dimensional bin packing problem: the packing static stability case. International Transactions in Operational Research 10(2): 141–153.
De La Vega, W. F.; and Lueker, G. S. 1981. Bin packing can be solved within 1+ε in linear time. Combinatorica 1(4): 349–355.
Delorme, M.; Iori, M.; and Martello, S. 2016. Bin packing and cutting stock problems: Mathematical models and exact algorithms. European Journal of Operational Research 255(1): 1–20.
Delorme, M.; Iori, M.; and Martello, S. 2018. BPPLIB: a library for bin packing and cutting stock problems. Optimization Letters 12(2): 235–250.
Faroe, O.; Pisinger, D.; and Zachariasen, M. 2003. Guided local search for the three-dimensional bin-packing problem. Informs Journal on Computing 15(3): 267–283.
Gambardella, L. M.; and Dorigo, M. 1995. Ant-Q: A reinforcement learning approach to the traveling salesman problem. In Machine Learning Proceedings 1995, 252–260. Elsevier.
Gilmore, P. C.; and Gomory, R. E. 1961. A linear programming approach to the cutting-stock problem. Operations Research 9(6): 849–859.
Goldstein, H.; Poole, C.; and Safko, J. 2002. Classical mechanics.
Ha, C. T.; Nguyen, T. T.; Bui, L. T.; and Wang, R. 2017. An online packing heuristic for the three-dimensional container loading problem in dynamic environments and the Physical Internet. In European Conference on the Applications of Evolutionary Computation, 140–155. Springer.
Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Hifi, M.; Kacem, I.; Nègre, S.; and Wu, L. 2010. A linear programming approach for the three-dimensional bin-packing problem. Electronic Notes in Discrete Mathematics 36: 993–1000.
Hu, H.; Zhang, X.; Yan, X.; Wang, L.; and Xu, Y. 2017. Solving a new 3d bin packing problem with deep reinforcement learning method. arXiv preprint arXiv:1708.05930.
Johnson, D. S.; Demers, A.; Ullman, J. D.; Garey, M. R.; and Graham, R. L. 1974. Worst-case performance bounds for simple one-dimensional packing algorithms. SIAM Journal on Computing 3(4): 299–325.
Kantorovich, L. V. 1960. Mathematical methods of organizing and planning production. Management Science 6(4): 366–422.
Karabulut, K.; and İnceoğlu, M. M. 2004. A hybrid genetic algorithm for packing in 3D with deepest bottom left with fill method. In International Conference on Advances in Information Systems, 441–450. Springer.
Karmarkar, N.; and Karp, R. M. 1982. An efficient approximation scheme for the one-dimensional bin-packing problem. In 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982), 312–320. IEEE.
Korte, B.; and Vygen, J. 2012. Bin packing. In Combinatorial Optimization, 499–516. Springer.
Labbé, M.; Laporte, G.; and Martello, S. 1995. An exact algorithm for the dual bin packing problem. Operations Research Letters 17(1): 9–18.
Laterre, A.; Fu, Y.; Jabri, M. K.; Cohen, A.-S.; Kas, D.; Hajjar, K.; Dahl, T. S.; Kerkeni, A.; and Beguir, K. 2018. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint arXiv:1807.01672.

Li, X.; Zhao, Z.; and Zhang, K. 2014. A genetic algorithm for the three-dimensional bin packing problem with heterogeneous bins. In IIE Annual Conference Proceedings, 2039. Institute of Industrial and Systems Engineers (IISE).
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Lodi, A.; Martello, S.; and Vigo, D. 1999. Approximation algorithms for the oriented two-dimensional bin packing problem. European Journal of Operational Research 112(1): 158–166.
Man Jr, E. C.; Garey, M.; and Johnson, D. 1996. Approximation algorithms for bin packing: A survey. Approximation algorithms for NP-hard problems 46–93.
Martello, S. 1990. Knapsack problems: algorithms and computer implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization.
Martello, S.; Pisinger, D.; and Vigo, D. 2000. The three-dimensional bin packing problem. Operations Research 48(2): 256–267.
Martello, S.; and Toth, P. 1990. Lower bounds and reduction procedures for the bin packing problem. Discrete Applied Mathematics 28(1): 59–70.
Martello, S.; and Vigo, D. 1998. Exact solution of the two-dimensional finite bin packing problem. Management Science 44(3): 388–399.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529.
Nareyek, A. 2003. Choosing search heuristics by non-stationary reinforcement learning. In Metaheuristics: Computer Decision-Making, 523–544. Springer.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8024–8035.
Scholl, A.; Klein, R.; and Jürgens, C. 1997. Bison: A fast hybrid procedure for exactly solving the one-dimensional bin packing problem. Computers & Operations Research 24(7): 627–645.
Schrijver, A. 1998. Theory of linear and integer programming. John Wiley & Sons.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354–359.
Skiena, S. S. 1997. The Stony Brook Algorithm Repository. http://www.cs.sunysb.edu/algorith/implement/nauty/implement.shtml.
Sweeney, P. E.; and Paternoster, E. R. 1992. Cutting and packing problems: a categorized, application-orientated research bibliography. Journal of the Operational Research Society 43(7): 691–706.
Takahara, S.; and Miyamoto, S. 2005. An evolutionary approach for the multiple container loading problem. In Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 6-pp. IEEE.
Uchibe, E.; and Doya, K. 2007. Constrained reinforcement learning from intrinsic and extrinsic rewards. In 2007 IEEE 6th International Conference on Development and Learning, 163–168. IEEE.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2692–2700.
Wang, R.; Nguyen, T. T.; Kavakeb, S.; Yang, Z.; and Li, C. 2016. Benchmarking dynamic three-dimensional bin packing problems using discrete-event simulation. In European Conference on the Applications of Evolutionary Computation, 266–279. Springer.
Wang, Z.; Schaul, T.; Hessel, M.; Van Hasselt, H.; Lanctot, M.; and De Freitas, N. 2015. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.
Wu, Y.; Mansimov, E.; Grosse, R. B.; Liao, S.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, 5279–5288.
Zhang, W.; and Dietterich, T. G. 2000. Solving combinatorial optimization tasks by reinforcement learning: A general methodology applied to resource-constrained scheduling. Journal of Artificial Intelligence Research 1: 1–38.
Zhao, Y.; Xu, K.; Zhu, E.; Liu, X.; Zhu, X.; and Yin, J. 2018. Triangle lasso for simultaneous clustering and optimization in graph datasets. IEEE Transactions on Knowledge and Data Engineering 31(8): 1610–1623.

A Supplemental material overview

In this supplemental document, we report more implementation and experiment details.

• Section B gives more descriptions regarding the network architecture, training selection, Monte Carlo permutation tree search, etc.
• Section C elaborates how CUT-1, CUT-2, and RS are constructed, with details about the heuristic baseline we compared in our experiment.
• The user study design is reported in Section E.
• Section F analyzes the performance difference between the step-wise reward and the termination reward in our problem.
• The details of the reward function used to penalize unsafe placements are reported in Section G.
• More experiment results are reported in Section H.

B Implementation Details

We report the details of our implementation in this section, and our source code is also submitted with this supplemental material.

Network architecture and training configurations A detailed specification of our network is shown in Figure 9. Our pipeline consists of three major components: an actor network, a critic network, and the feasibility mask predictor. It takes three inputs: the height map H_n and the dimensionality d_n = [l_n, w_n, h_n] ∈ Z^3 of the current item n to be packed as the state, and the feasibility mask M_n as the ground truth. Note that M_n is only used in the training process.

Figure 9: Detailed network architecture.

The whole network is trained via a composite loss consisting of the actor loss L_actor, critic loss L_critic, mask prediction loss L_mask, infeasibility loss E_inf and action entropy loss E_entropy. These loss functions are defined as:

L_actor = (R_n − V(s_n)) log P(a_n|s_n),
L_critic = (R_n − V(s_n))^2,
L_mask = Σ_{(x,y)} (M_n^gt − M_n^pred)^2,
E_inf = Σ_{M_n(x,y)=0} P(a_n = L · x + y | s_n),
E_entropy = Σ_{M_n(x,y)=1} −P(a_n|s_n) · log P(a_n|s_n), (2)

where R_n = r_n + γ V(s_{n+1}) and r_n = 10 × l_n · w_n · h_n / (L · W · H) is our reward function, which indicates the space utilization. When the current item is not placeable, its reward is zero and the packing sequence ends. Here, γ ∈ [0, 1] is the discount factor, and we set γ to 1 so that R_n directly represents how much utilization the agent can obtain from s_n on. The output of the critic network V(s_n) gives a state-value prediction of s_n and helps the training of the actor network, which outputs a probability matrix for the next move. This probability is scaled based on M_n: if a move is infeasible, its probability is multiplied by a penalty factor of 0.001. Afterwards, a softmax operation is adopted to output the final action distribution. Note that the infeasibility penalty can be absent at test time with the help of E_inf, and our method still works well in this situation.
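A compact PyTorch sketch of the mask modulation and the composite loss of Eq. (2) is given below, using the weights reported in the paper (1, 0.5, 0.5, 0.01, 0.01) and the 0.001 penalty factor. Tensor shapes, function names, and the explicit re-normalization are assumptions made for illustration; this is not the released implementation.

```python
import torch
import torch.nn.functional as F

def masked_policy(logits, mask, penalty=1e-3):
    """Modulate the actor's action distribution with the (predicted) mask.

    logits: (B, L*W) raw actor outputs; mask: (B, L*W) in {0, 1}, float.
    Infeasible actions keep a small probability (penalty) instead of 0,
    which the paper reports trains more smoothly.
    """
    probs = F.softmax(logits, dim=-1)
    probs = probs * (mask + (1.0 - mask) * penalty)
    return probs / probs.sum(dim=-1, keepdim=True)

def composite_loss(probs, mask, action, value, ret, mask_pred, mask_gt):
    adv = (ret - value).detach()
    logp = torch.log(probs.gather(1, action[:, None]).squeeze(1) + 1e-8)
    l_actor = -(adv * logp).mean()                   # policy-gradient term, minimized
    l_critic = ((ret - value) ** 2).mean()
    l_mask = F.mse_loss(mask_pred, mask_gt)
    e_inf = (probs * (1 - mask)).sum(dim=-1).mean()  # probability mass on infeasible LPs
    feas = probs * mask                              # entropy only over feasible actions
    e_entropy = -(feas * torch.log(feas + 1e-8)).sum(dim=-1).mean()
    return l_actor + 0.5 * l_critic + 0.5 * l_mask + 0.01 * e_inf - 0.01 * e_entropy

# Tiny smoke test with random tensors (B items, L*W = 100 actions).
B, A = 4, 100
logits, mask = torch.randn(B, A), torch.randint(0, 2, (B, A)).float()
probs = masked_policy(logits, mask)
loss = composite_loss(probs, mask, torch.zeros(B, dtype=torch.long),
                      torch.randn(B), torch.randn(B), torch.rand(B, A), mask)
print(float(loss))
```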
Monte Carlo permutation tree search Our algorithm is inspired by the Monte Carlo tree search of (Silver et al. 2017). The main differences lie in: firstly, the goal of our MCTS is to find the best packing order for the next k items; secondly, the max reward is used in our MCTS instead of the mean reward as the evaluation strategy. Algorithm 1 outlines the entire procedure step by step, wherein T is the maximum simulation time, I is the set of lookahead items, Last is the first item after I, i_0 is the current item, s is the state input of each node and a is the action of each node. The environment simulator SimEnv takes the height map, the next item dimension, and the action (position of the item) as input, and returns the updated height map. The action-choosing function π uses the policy network from the BPP-1 model to get the action with the highest probability. N is the visit count of a node. Q is the expected return from a node. We test our method on our three benchmarks, and more results can be found in Figure 10.

Multi-bin scoring It is straightforward to generalize our method to multiple bins. The only difference is that the agent needs to determine into which bin the item is to be packed, and the rest is naturally reduced to a single-bin BPP instance. To this end, we estimate the packing score for all the available bins and pack the item into the one with the highest score. The score of each bin indicates how much the state value changes if the current item is packed into it. The state value is given by the value network, which is an estimation of how much reward we will get from the selected bin. If the value of the selected bin drops significantly when an item is packed into it, it implies that we should pack this item into other bins. Our bin selection method is described in Algorithm 2. Here, val is the last state value estimation of bin b, V is the state value estimation function via the value network, n is the current item, B is the set of bins and H is the set of height maps of the bins. The default score for each bin at the beginning is s_def = −0.2.
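Since Algorithm 2 itself is not reproduced in this excerpt, the following schematic shows one way to implement the described bin selection: score each bin by the change in its value estimate if the item were packed there, and commit to the highest-scoring bin. The helpers value_net and policy_step, and the way the default score −0.2 is applied to untouched bins, are assumptions of this sketch.

```python
import numpy as np

S_DEF = -0.2  # default score for a bin before it receives its first item

def select_bin(bins, item, value_net, policy_step):
    """Pick the bin whose state value drops the least if `item` is packed into it.

    bins: list of dicts with keys 'height_map' and 'last_val' (previous V estimate).
    value_net(height_map, item) -> scalar state-value estimate.
    policy_step(height_map, item) -> height map after the BPP-1 placement.
    """
    scores = []
    for b in bins:
        if b['last_val'] is None:
            scores.append(S_DEF)                         # untouched bin keeps the default score
        else:
            new_hm = policy_step(b['height_map'], item)  # hypothetical placement in this bin
            scores.append(value_net(new_hm, item) - b['last_val'])
    best = int(np.argmax(scores))
    bins[best]['height_map'] = policy_step(bins[best]['height_map'], item)
    bins[best]['last_val'] = value_net(bins[best]['height_map'], item)
    return best

# Dummy stand-ins for the trained value network and BPP-1 policy.
bins = [{'height_map': np.zeros((10, 10), dtype=int), 'last_val': None} for _ in range(3)]
idx = select_bin(bins, (2, 2, 2),
                 value_net=lambda hm, it: float(-hm.mean()),
                 policy_step=lambda hm, it: hm)
print(idx)  # 0
```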

Figure 10: Packing performance on (a) RS, (b) CUT-1, and (c) CUT-2. When k increases, the space utilization rate first goes up and later enters a "plateau" zone.

Here, val is the last state-value estimate of bin b, V is the state-value estimation function given by the value network, n is the current item, B is the set of bins, and H holds the height maps of the bins. The default score of every bin at the beginning is sdef = −0.2.

Oriented items Our method can also incorporate items with different orientations. We enlarge the action space and the related mask according to how many orientations are considered, e.g., the feasibility mask Mn and the action space become m times larger if m different poses are allowed for an item. Doing so gives the packing more flexibility and potentially leads to better results, as observed and reported in Table 4. Note that orientation only happens around the Z axis in our problem setting.

Table 4: Performance comparison with and without orientation on different benchmarks.

                    RS       CUT-1    CUT-2
  w/ orientation    62.1%    76.2%    70.2%
  w/o orientation   50.5%    73.4%    66.9%

C Benchmark Construction
All 64 pre-defined items are visualized in Figure 11. Algorithm 3 outlines how the dataset is constructed given the bin size (L, W, H) and a valid item size threshold.
The sequences of the RS benchmark are generated by random sampling: each item along a sequence is picked randomly from our pre-defined item set I. However, since everything is random, we do not know the optimal packing configuration of an RS sequence ourselves (unless we run an exhaustive branch-and-bound search (Martello, Pisinger, and Vigo 2000), which is far too time-consuming).
For a better quantitative evaluation, we therefore also generate item sequences via cutting stock (Gilmore and Gomory 1961). Clearly, a sequence created by cutting the bin can be packed back into the bin perfectly, with a space utilization of 100%. Algorithm 3 provides the detailed procedure of the data generation.

D Heuristic Baseline Method
Online BPP is an under-investigated problem. To better demonstrate the effectiveness of our method, we design a heuristic baseline approach against which we evaluate our DRL method, and we report its details in this section. The method is based on a simple observation: during packing, a human tends to keep the packed volume of the bin “regular” so that the remaining free space stays as large as possible. Such regularity is used as the metric to score a packing action.
To describe the regularity of a bin, we introduce the concept of the spare cuboid. As shown in Figure 12, a spare cuboid is an unoccupied, rectangular space in the bin, and the regularity of a bin is defined based on its maximum spare cuboids. Intuitively, we would like the maximum spare cuboid to be as big as possible; if a packed bin has many small spare cuboids, its remaining space is not “regular”. As illustrated in Figure 12, the right packing strategy leaves the biggest spare cuboid. Since I is pre-defined, we know how many item types can be packed into a maximum spare cuboid. Based on this, we rate each maximum spare cuboid c by the number of item types that can be packed into it, RS_c = |I_valid| + c_volume with I_valid ⊆ I. If a maximum spare cuboid fits all the items in I, an additional reward is given: RS_c = |I| + c_volume + 10. The final score BS_p of a bin after packing the current item at position p is the sum of RS_c over its maximum spare cuboids, and the best packing position p_best is found as

    p_best = argmax_p (1/|C|) Σ_{c∈C} RS_c,    (3)

where C denotes the set of maximum spare cuboids left by placing the item at p.

E User Study
Figure 13 shows the interface of our user study app, which consists of two parts: visualization and action space. The test sequences are randomly picked from the CUT-2 test set. Users can drag our UI to change the viewing angle and thus obtain a full observation of the packed items. To help users make better decisions, the app allows them to choose any suggestion circle in the action space and virtually place the item there before making the final decision. No time limit is given. When there is no suitable place for the current item, the test is reset and the selected sequence is saved.
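As a concrete reading of the scoring rule above, here is a small Python sketch of RS_c and of the placement score in Equation (3); it assumes the maximum spare cuboids left by a candidate placement have already been enumerated and ignores item orientation, so it illustrates the formula rather than reproducing the baseline's actual implementation.

    def rs_score(cuboid, item_set):
        # RS_c for one maximum spare cuboid c = (cl, cw, ch): the number of item
        # types of I that fit inside it plus its volume, with a bonus of 10 when
        # every item type fits.
        cl, cw, ch = cuboid
        n_fit = sum(1 for (l, w, h) in item_set if l <= cl and w <= cw and h <= ch)
        score = n_fit + cl * cw * ch
        if n_fit == len(item_set):
            score += 10
        return score

    def placement_score(spare_cuboids, item_set):
        # Equation (3): average RS_c over the maximum spare cuboids C that remain
        # after placing the current item at position p.
        return sum(rs_score(c, item_set) for c in spare_cuboids) / len(spare_cuboids)

    # Toy usage with a tiny item set and two hypothetical spare cuboids.
    tiny_items = [(2, 2, 2), (3, 2, 2), (2, 3, 4)]
    print(placement_score([(5, 5, 5), (2, 2, 1)], tiny_items))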

Figure 11: The pre-defined item set I, covering the size ranges (2, 2, 2)~(2, 2, 5), (3, 2, 2)~(3, 5, 5), (4, 2, 2)~(4, 5, 5), and (5, 2, 2)~(5, 5, 5).
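Because every edge length of a pre-defined item lies in {2, ..., 5}, the set I can be enumerated directly; the snippet below is only an illustrative enumeration (the variable names are ours), and extending the range to include 1 is one way to obtain the 125-type set studied later in this appendix.

    from itertools import product

    # The 64 pre-defined item types of I: every edge length is in {2, 3, 4, 5}.
    ITEM_SET_64 = [(l, w, h) for l, w, h in product(range(2, 6), repeat=3)]
    assert len(ITEM_SET_64) == 64

    # Allowing an edge length of 1 as well yields 5^3 = 125 item types, matching
    # the enlarged set in "Study of increasing item dimensions".
    ITEM_SET_125 = [(l, w, h) for l, w, h in product(range(1, 6), repeat=3)]
    assert len(ITEM_SET_125) == 125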

Table 5: Extra information enables the agent to make use of the termination reward.

                               RS       CUT-1    CUT-2
  step-wise reward             50.5%    73.4%    66.9%
  termination reward           39.2%    72.5%    66.5%
  termination reward & uti.    50.3%    73.2%    66.7%

Figure 12: The maximum spare cuboid.


Figure 13: Left: the 3D visualization of packed items in our user study app; the current item is highlighted with a red frame. Right: the action space of the bin; the empty circles indicate suggested placements for the current item.

F Reward function design
In this section we analyze the design of our reward function. We have two candidate reward functions, a step-wise reward and a termination reward. For the current item n, the step-wise reward is defined as rn = 10 × ln · wn · hn/(L · W · H) if n is placed successfully, and rn = 0 otherwise. The termination reward is defined as the final capacity utilization, r = 10 × Σi li · wi · hi/(L · W · H) ∈ (0, 10], and it is only granted when the packing of a sequence terminates.
We evaluate the performance of these two reward functions; the result is presented in Table 5. The termination reward performs similarly to the step-wise reward on the CUT-1 and CUT-2 benchmarks. However, it does not perform well on RS, because the way RS sequences are constructed cannot guarantee that the height map is sufficient to describe the packed bin, as illustrated in Figure 15. The step-wise reward, in contrast, focuses more on how much space is left above the packed items at each step, which lets it perform well even on RS.
We also design an experiment to further investigate the above assumption about the two reward functions. We encode an additional matrix as input for the termination reward, indicating whether there exists free space below the height map. In this case, the state input is no longer ambiguous for the agent's value prediction. Table 5 shows that, with this additional information, the performance of the termination reward nearly equals that of the step-wise one. Since CUT-1 and CUT-2 sequences can be packed relatively tightly, leaving little free space under the height map, the termination reward does not affect performance much on those benchmarks.

G Penalized reward
To explore whether reward-guided DRL or our constraint-based DRL better helps the agent avoid placing items in unsafe positions, we design a reward-guided DRL alternative for comparison in our main paper; this section reports it in detail. The basic idea of this guided method is to reward the agent when it packs an item in a safe place and to penalize it when it performs a dangerous move. The reward is designed as follows: if item n is placed successfully, the agent is rewarded rn = 10 × ln · wn · hn/(L · W · H); otherwise, if the placement of item n violates physical stability, the agent is penalized with rn = −1 and the packing sequence is terminated.

Figure 14: Left: packing performance at different resolutions; Size-10 means the resolution of the test bin is 10 × 10 × 10, and so on. Second to right: imposing the C&B rule leads to inferior performance (lower average reward).

Table 6: Evaluating the effect of the boundary rule.

                                 Space uti.    # packed items
  w/o corner & boundary rule     66.9%         17.5
  w/ corner & boundary rule      60.9%         16.2

Figure 15: (a) and (b) have the same height map but different space utilizations, which is ambiguous for the agent when predicting the state value under the termination reward.

Figure 16: Visual comparison between the learned and the heuristic strategy; different packing strategies for the red item lead to different performance. (a) Before packing the red item. (b) Following the C&B rule: 16 items packed. (c) Our method: 19 items packed.

We found that even when we explicitly penalize the agent for placing items in unsafe positions during training, it still makes mistakes every now and then at test time. Reward shaping cannot guarantee placement safety the way our constraint-based DRL method does.
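For reference, the reward variants discussed in Appendices F and G can be summarized in a few lines of Python; the function names, the episode bookkeeping, and the assumption of a 10 × 10 × 10 bin are illustrative, not the authors' code.

    BIN = (10, 10, 10)  # (L, W, H); the experiments use a 10 x 10 x 10 bin by default

    def step_wise_reward(item, placed_ok, bin_size=BIN):
        # r_n = 10 * l_n * w_n * h_n / (L * W * H) when item n is placed, else 0.
        l, w, h = item
        L, W, H = bin_size
        return 10.0 * l * w * h / (L * W * H) if placed_ok else 0.0

    def termination_reward(placed_items, bin_size=BIN):
        # Final capacity utilization in (0, 10], granted only when the sequence ends.
        L, W, H = bin_size
        return 10.0 * sum(l * w * h for l, w, h in placed_items) / (L * W * H)

    def penalized_reward(item, placed_ok, stable, bin_size=BIN):
        # Reward-guided alternative of Appendix G.
        if placed_ok:
            return step_wise_reward(item, True, bin_size)
        if not stable:
            return -1.0   # unsafe placement: penalize and terminate the episode
        return 0.0        # no feasible placement: the episode simply ends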
H More Results
More visualized results Figure 19 shows more packing results on the three benchmarks. An animated packing can be found in the supplemental video.

Study of action space resolution We also investigate how the resolution of the bin affects our performance. In this experiment, we increase the spatial discretization from 10×10×10 to 20×20×20 and 30×30×30. As shown in Figure 14, the performance only slightly decreases. Increased discretization widens the distribution over the possible action space and dilutes the weight of the optimal action. Nevertheless, our method remains efficient even when the problem complexity is about 27× bigger, which demonstrates good scalability of our method in high-resolution environments.

Learned vs. heuristic strategy In real-world bin packing, humans tend to place a box so that it touches the sides or corners of the bin or the other already placed boxes. We refer to this intuitive strategy as the corner & boundary rule (C&B rule). An interesting discovery from our experiments is that our method can automatically learn when to follow the C&B rule in order to obtain a globally more optimal packing. In contrast, imposing such constraints explicitly leads to inferior performance: the average reward drops by about 20% when the constraint is added, as shown on the right of Figure 14, which is also verified by the experiment in Table 6.
To illustrate why our agent can decide when to follow the C&B rule, we give a visual example. As shown in Figure 16 (b), if the agent exactly follows the C&B rule when packing the red item, it leaves gaps around the item. Our method (Figure 16 (c)) instead decides to pack the item in the middle, on top of the yellow and blue ones. Our method is trained to consider whether there is enough room for the next moves rather than only the current situation; this move reserves enough space around the red item for the following items, and the decision lets our method pack 3 more items on the same sequence.

Figure 17: Performance of different DRL frameworks on CUT-2.

Figure 18: Packing results of our BPP-1 model with 125 types of pre-defined items. The annotated examples reach space utilizations of 0.726 (24 items) and 0.698 (34 items) on RS, 0.830 (27 items) and 0.762 (56 items) on CUT-1, and 0.884 (41 items) and 0.772 (62 items) on CUT-2.
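For completeness, a toy version of the C&B rule can be phrased as a score over the height map that counts how many sides of a candidate footprint touch a wall or an already-placed box; this sketch (including the height-map encoding and the function name) is our own illustration, not the heuristic used in the experiments.

    import numpy as np

    def cb_touch_count(height_map, x, y, l, w):
        # Count how many of the item's four footprint sides touch a bin wall or a
        # neighbouring stack that rises above the item's resting height.
        L, W = height_map.shape
        z = height_map[x:x + l, y:y + w].max()   # resting height of the new item
        touches = 0
        touches += (x == 0) or np.any(height_map[x - 1, y:y + w] > z)
        touches += (x + l == L) or np.any(height_map[x + l, y:y + w] > z)
        touches += (y == 0) or np.any(height_map[x:x + l, y - 1] > z)
        touches += (y + w == W) or np.any(height_map[x:x + l, y + w] > z)
        return int(touches)

    # Example: a 10 x 10 bin with a 3 x 3 stack of height 2 in one corner.
    hm = np.zeros((10, 10), dtype=int)
    hm[0:3, 0:3] = 2
    print(cb_touch_count(hm, 3, 0, 2, 2))  # -> 2 (touches one wall and the stack)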
Generalizability with unseen items We also test our method on unseen items. In this experiment, we randomly choose 40 items from the pre-defined item set I to train an agent and test it on the complete I. All items are generated with RS, so dimensions encountered at test time may not have been seen during training. The result is presented in Table 7: our method does demonstrate some generalizability and provides a reasonable benchmark.

Table 7: Space utilization of unseen items.

         64 → 64    40 → 64
  RS     50.5%      49.4%

Generalizability with untrained sequences Since our RS, CUT-1 and CUT-2 benchmarks are constructed from different types of sequences, we can also evaluate the performance of our method on sequence types different from the training data. The result is presented in Table 8: our method still performs well when tested on varied sequences. Note that our model trained on CUT-2 attains the best generalization, since this benchmark has the best balance between variation and completeness.

Table 8: Evaluation of generalizability with untrained sequences.

  Train     Test      Space utilization    # packed items
  RS        RS        50.5%                12.2
  RS        CUT-1     47.6%                10.7
  RS        CUT-2     43.4%                11.6
  CUT-1     RS        60.8%                15.7
  CUT-1     CUT-1     73.4%                19.1
  CUT-1     CUT-2     69.4%                17.9
  CUT-2     RS        60.9%                16.1
  CUT-2     CUT-1     62.4%                16.6
  CUT-2     CUT-2     66.9%                17.5
Different DRL frameworks We also test our method with different DRL frameworks on the CUT-2 dataset, with well-tuned parameters for each. For on-policy methods we evaluate A2C (Mnih et al. 2016) and ACKTR (Wu et al. 2017); for off-policy methods we evaluate DQN (Mnih et al. 2015), RAINBOW (Hessel et al. 2018), and SAC (Haarnoja et al. 2018). Figure 17 and Table 10 demonstrate that ACKTR achieves the fastest convergence and the best performance.

Study of increasing item dimensions We add more item dimensions to our pre-defined item set I, enlarging |I| to 125. The newly added items also satisfy the conditions l ≤ L/2, w ≤ W/2, and h ≤ H/2 to maintain the complexity of the BPP, which also means more small items are added (one of a new item's axes must be 1). The results can be seen in Table 9 and Figure 18.

Table 9: Performance on |I| = 125.

           Space uti.    # items
  RS       46.3%         18.3
  CUT-1    59.2%         21.0
  CUT-2    58.3%         22.1

Table 10: Performance of different DRL frameworks on CUT-2.

  DRL         Space uti.    # items
  ACKTR       66.9%         17.5
  RAINBOW     58.8%         15.5
  A2C         53.0%         13.6
  SAC         44.2%         11.8
  DQN         35.3%         9.3

Figure 19: Packing results of our BPP-1 model on the random sample (RS), CUT-1, and CUT-2 benchmarks.



Algorithm 1: Permutation MCTS

 1  Function SEARCH(s0):
 2      Create root node v0 with state s0;
        while t < T do
 3          Copy I as R;
 4          Sort R according to original order;
 5          vl, R ← TREEPOLICY(v0, R);
 6          Δ ← DEFAULTPOLICY(vl.s, R);
 7          BACKUP(vl, Δ);
 8          t ← t + 1;
 9      while v.item is not i0 do
10          v ← BESTCHILD(v, 0);
11      return v.a;

12  Function TREEPOLICY(v, R):
13      while v.s is non-terminal do
14          if v not fully expanded then
15              return EXPAND(v, R);
16          else
17              R ← R \ v.item;
18              v ← BESTCHILD(v, c);
19      return v, R;

20  Function DEFAULTPOLICY(s, R):
21      eval = 0;
22      while s is non-terminal or R is not empty do
23          i ← first item in R;
24          a ← π(s, i);
25          s ← SimEnv(s, i, a);
26          if s is non-terminal then
27              eval ← eval + Reward(i);
28          else
29              R ← R \ i;
30          if R is empty then
31              eval ← eval + V(s, Last);
32      return eval;

33  Function EXPAND(v, R):
34      Choose i ← unused item from R;
35      Add new child v′ to v
36          with v′.a = π(v.s, i)
37          with v′.s = SimEnv(v.s, i, v′.a)
38          with v′.item = i;
39      R ← R \ i;
40      return v′, R;

41  Function BESTCHILD(v, c):
42      return argmax_{v′ ∈ children(v)} ( v′.Q + c · v.N / (1 + v′.N) );

43  Function BACKUP(v, Δ):
44      while v is not null do
45          v.N ← v.N + 1;
46          v.Q ← max(v.Q, Δ);
47          v ← parent of v;

Algorithm 2: Bin Selecting Algorithm

Input: the current item n, the set of candidate bins B;
Output: the most fitting bin bbest;
 1  Initialize the set of bin scores B.val with sdef;
 2  bbest ← argmax_{b ∈ B} V(H(b), n) − b.val;
 3  bbest.val ← V(H(bbest), n);
 4  return bbest;

Algorithm 3: Benchmark Construction

Inputs: valid item size threshold (lmin, wmin, hmin) ~ (lmax, wmax, hmax), bin size (L, W, H);
 1  Function Construction of pre-defined items collection():
 2      Initialize the invalid item list Linvalid = {(L, W, H)} and the valid item list Lvalid = ∅;
 3      while Linvalid ≠ ∅ do
 4          Randomly pop an item_i from Linvalid;
 5          Randomly select an axis a_i of item_i with a_i > a_max, a_i ∈ {x_i, y_i, z_i};
 6          Randomly split item_i into two sub-items along axis a_i;
 7          Calculate the sub-items' FLB corner coordinates (lx, ly, lz);
 8          for item ∈ item_sub do
 9              if a_min ≤ a_sub ≤ a_max then
10                  Add the item into Lvalid;
11              else
12                  Add the item into Linvalid;
13      return Lvalid;

14  Function CUT-1(Lvalid):
15      Initialize the item sequence S = ∅;
16      Sort Lvalid by the lz coordinate of each item in ascending order;
17      s_i ← item_i's index in the sorted list;
18      return S;

19  Function CUT-2(Lvalid):
20      Initialize the height map Hn ∈ Z^{L×W};
21      Hn = 0^{L×W}, S = ∅;
22      while Lvalid ≠ ∅ do
23          Randomly pop an item_i from Lvalid satisfying lz_i = Hn(item_i);
24          Add item_i into S;
25          Hn(item_i) ← Hn(item_i) + h_i;
26      return S;
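To illustrate Algorithm 3, here is a compact Python sketch of the CUT-2 sequence generation; it assumes a valid cut decomposition is given, with each item carrying its front-left-bottom (FLB) coordinates and size, and the dictionary layout and function name are illustrative assumptions rather than the authors' code.

    import random
    import numpy as np

    def cut2_sequence(items, L, W):
        # items: list of dicts {"x", "y", "z", "l", "w", "h"} obtained by recursively
        # cutting the (L, W, H) bin, with (x, y, z) the FLB corner of each item.
        height_map = np.zeros((L, W), dtype=int)
        remaining, sequence = list(items), []
        while remaining:
            # An item is ready when its bottom face rests exactly on the height map,
            # i.e. lz_i = Hn(item_i) in the pseudocode above.
            ready = [it for it in remaining
                     if np.all(height_map[it["x"]:it["x"] + it["l"],
                                          it["y"]:it["y"] + it["w"]] == it["z"])]
            item = random.choice(ready)
            sequence.append(item)
            height_map[item["x"]:item["x"] + item["l"],
                       item["y"]:item["y"] + item["w"]] += item["h"]
            remaining.remove(item)
        return sequence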
