Online 3D Bin Packing with Constrained Deep Reinforcement Learning
Hang Zhao1, Qijin She1, Chenyang Zhu1, Yin Yang2, Kai Xu1*
1National University of Defense Technology, 2Clemson University
…further complicate 3D-BPP in its real-world applications.

As an echo to those challenges, we design a deep reinforcement learning algorithm for 3D-BPP. To maximize applicability, we carefully accommodate restrictions raised in its actual usage. For instance, we require item placement to satisfy order dependence and not to induce unstable stacking. An item is immediately packed upon its arrival, and no adjustment is permitted after it is packed. To this end, we formulate our problem as a constrained Markov decision process (CMDP) (Altman 1999) and propose a constrained DRL approach based on the on-policy actor-critic framework (Mnih et al. 2016; Wu et al. 2017).

In particular, we introduce a prediction-and-projection scheme for the training of constrained DRL. The agent first predicts a feasibility mask for the placement actions as an auxiliary task. It then uses the mask to modulate the action probabilities output by the actor. This supervision and projection enable the agent to learn a feasible policy very efficiently. We also show that our method is general, with the ability to handle lookahead items, multi-bin packing, and item re-orienting. With thorough tests and validation, we demonstrate that our algorithm outperforms existing methods by a noticeable margin. It even demonstrates human-level performance in a preliminary user study.

2 Related Work

1D-BPP is one of the most famous problems in combinatorial optimization, and related literature dates back to the sixties (Kantorovich 1960). Many variants and generalizations of 1D-BPP arise in practical contexts such as the cutting stock problem (CSP), in which we want to cut bins to produce desired items of different weights while minimizing the total number of bins used. A comprehensive bibliography on 1D-BPP and CSP can be found in (Sweeney and Paternoster 1992). Known to be strongly NP-hard, most existing literature focuses on designing good heuristic and approximation algorithms and their worst-case performance analysis (Coffman, Garey, and Johnson 1984). For example, the well-known greedy next fit algorithm (NF) has a linear time complexity of O(N), and its worst-case performance ratio is 2, i.e., NF needs at most twice as many bins as the optimal solution does (De La Vega and Lueker 1981). The first fit algorithm (FF) allows an item to be packed into previous bins that are not yet full, and its time complexity increases to O(N log N). The best fit algorithm (BF) aims to reduce the residual capacity of all the non-full bins. Both FF and BF have a better worst-case performance ratio of 17/10 than NF (Johnson et al. 1974). Pre-sorting all the items yields the offline version of those greedy strategies, sometimes also known as the decreasing version (Martello 1990). While straightforward, NF, FF, and BF form a foundation for more sophisticated approximations to 1D-BPP (e.g., see (Karmarkar and Karp 1982)) or for its exact solutions (Martello and Toth 1990; Scholl, Klein, and Jürgens 1997; Labbé, Laporte, and Martello 1995; Delorme, Iori, and Martello 2016). We also refer the reader to the BPPLib library (Delorme, Iori, and Martello 2018), which includes implementations of most known algorithms for the 1D-BPP problem.

2D- and 3D-BPP are natural generalizations of the original BPP. Here, an item does not only have a scalar-valued weight but a high-dimensional size of width, height, and/or depth. The main difference between 1D- and 2D-/3D-packing problems is the verification of the feasibility of the packing, i.e., determining whether an accommodation of the items inside the bin exists such that items do not interpenetrate and the packing is within the bin size. The complexity and difficulty increase significantly for high-dimensional BPP instances. In theory, it is possible to generalize exact 1D solutions like MTP (Martello and Toth 1990) or branch-and-bound (Delorme, Iori, and Martello 2016) algorithms to 2D-BPP (Martello and Vigo 1998) and 3D-BPP (Martello, Pisinger, and Vigo 2000). However, according to the timing statistics reported in (Martello, Pisinger, and Vigo 2000), exactly solving a 3D-BPP of a size matching an actual parcel packing pipeline, which could deal with tens of thousands of parcels, remains infeasible. Resorting to approximation algorithms is a more practical choice for us. Hifi et al. (2010) proposed a mixed linear programming algorithm for 3D-BPP by relaxing the integer constraints in the problem. Crainic et al. (2008) refined the idea of corner points (Martello, Pisinger, and Vigo 2000), where an upcoming item is placed at so-called extreme points to better explore the unoccupied space in a bin. Heuristic local search iteratively improves an existing packing by searching within a neighbourhood function over the set of solutions. There have been several strategies for designing fast approximate algorithms, e.g., guided local search (Faroe, Pisinger, and Zachariasen 2003), greedy search (De Castro Silva, Soma, and Maculan 2003), and tabu search (Lodi, Martello, and Vigo 1999; Crainic, Perboli, and Tadei 2009). Similar strategies have also been adapted to online BPP (Ha et al. 2017; Wang et al. 2016). In contrast, genetic algorithms lead to better solutions as a global, randomized search (Li, Zhao, and Zhang 2014; Takahara and Miyamoto 2005).

Deep reinforcement learning (DRL) has demonstrated tremendous success in learning complex behaviour skills and solving challenging control tasks with high-dimensional raw sensory state-spaces (Lillicrap et al. 2015; Mnih et al. 2015, 2016). The existing research can largely be divided into two lines: on-policy methods (Schulman et al. 2017; Wu et al. 2017; Zhao et al. 2018) and off-policy ones (Mnih et al. 2015; Wang et al. 2015; Barth-Maron et al. 2018). On-policy algorithms optimize the policy with agent-environment interaction data sampled from the current policy. While their inability to reuse old data makes them less data-efficient, updates calculated from on-policy data lead to stable optimization. In contrast, off-policy methods are more data-efficient but less stable. In our problem, agent-environment interaction data is easy to obtain (at about 2000 FPS), so data efficiency is not our main concern. We base our method on the on-policy actor-critic framework. In addition, we formulate online 3D-BPP as constrained DRL and solve it by projecting the trajectories sampled from the actor to the constrained state-action space, instead of resorting to more involved constrained policy optimization (Achiam et al. 2017).
Figure 2: Left: The environment state of the agent includes the configuration of the bin (the grey boxes) and the size of the next item to be packed (green box). The bin configuration is parameterized as a height map H over an L × W grid. The feasibility mask M is a binary matrix of size L × W indicating the placement feasibility at each grid cell. The three dimensions of the next item are stored in an L × W × 3 tensor D. Right: The network architecture (the three losses other than the standard actor and critic losses are shown in red).
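For concreteness, here is a small sketch of how such a state input could be assembled in Python/NumPy: the next item's dimensions are broadcast over the grid into the L × W × 3 tensor D, as the caption describes. Stacking H and D into a single network input is our assumption for illustration; this excerpt does not confirm how the two are combined.

import numpy as np

def build_state(H, next_item):
    # H: (L, W) integer height map; next_item: (l, w, h) of the coming item.
    L, W = H.shape
    D = np.ones((L, W, 3), dtype=int) * np.array(next_item)  # broadcast item size
    return np.concatenate([H[..., None], D], axis=-1)        # (L, W, 4) input (assumed layout)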
RL for combinatorial optimization has a distinguished history (Gambardella and Dorigo 1995; Zhang and Dietterich 2000) and is still an active direction, with an especially intensive focus on TSP (Bello et al. 2016). Early attempts strive for heuristics selection using RL (Nareyek 2003). Bello et al. (2016) combined RL pretraining and active search and demonstrated that RL-based optimization outperforms supervised learning frameworks when tackling NP-hard combinatorial problems. Recently, Hu et al. (2017) proposed a DRL solution to 3D-BPP. Laterre et al. (2018) introduced a rewarding strategy based on self-play. Different from ours, these works deal with an offline setting where the main goal is to find an optimal sequence of items, inspired by the Pointer Network (Vinyals, Fortunato, and Jaitly 2015).

3 Method

In online 3D-BPP, the agent is agnostic of li, wi or hi of all the items in I – only the immediately incoming ones Io ⊂ I are observable. As soon as an item arrives, we pack it into the bin, and no further adjustment will be applied. As the complexity of BPP decreases drastically for bigger items, we further constrain the sizes of all items to li ≤ L/2, wi ≤ W/2, and hi ≤ H/2. We start with our problem statement under the context of DRL and the formulation based on constrained DRL. We then show how we solve the problem via predicting action feasibility in the actor-critic framework.

3.1 Problem statement and formulation

The 3D-BPP can be formulated as a Markov decision process, which is a tuple (S, A, P, R). S is the set of environment states; A is the action set; R : S × A → R is the reward function; P : S × A × S → [0, 1] is the transition probability function: P(s′|s, a) gives the probability of transiting from s to s′ for a given action a. Our method is model-free since we do not learn P(s′|s, a). The policy π : S → A is a map from states to probability distributions over actions, with π(a|s) denoting the probability of selecting action a under state s. For DRL, we seek a policy π that maximizes the accumulated discounted reward J(π) = E_{τ∼π}[Σ_{t=0}^{∞} γ^t R(s_t, a_t)], where γ ∈ [0, 1] is the discount factor and τ = (s_0, a_0, s_1, . . .) is a trajectory sampled based on the policy π.

The environment state of 3D-BPP is comprised of two parts: the current configuration of the bin and the coming items to be placed. For the first part, we parameterize the bin by discretizing its bottom area as an L × W regular grid along the length (X) and width (Y) directions, respectively. We record at each grid cell the current height of stacked items, leading to a height map Hn (see Figure 2); here, the subscript n indicates that n is the next item to be packed. Since all the dimensions are integers, Hn ∈ Z^{L×W} can be expressed as a 2D integer array. The dimensionality of item n is given as dn = [ln, wn, hn] ∈ Z³. Working with integer dimensions helps to reduce the state/action space and accelerates the policy learning significantly. A spatial resolution of up to 30 × 30 is sufficient in many real scenarios. Putting it together, the current environment state can be written as sn = {Hn, dn, dn+1, ..., dn+k−1}. We first consider the case where k = |Io| = 1, and name this special instance BPP-1; in other words, BPP-1 only considers the immediately coming item n, i.e., Io = {n}. We then generalize it to BPP-k with k > 1 afterwards.

BPP-1 In BPP-1, the agent places n's front-left-bottom (FLB) corner (Figure 2 (left)) at a certain grid point, or loading position (LP), in the bin. For instance, the agent may choose to put n at the LP (xn, yn). This action is represented as an = xn + L · yn ∈ A, where the action set is A = {0, 1, . . . , L · W − 1}. After an is executed, Hn is updated by adding hn to the maximum height over the cells covered by n: Hn(x, y) ← hmax + hn for x ∈ [xn, xn + ln], y ∈ [yn, yn + wn], with hmax being the maximum height among those covered cells. The state transition is deterministic: P(H′|Hn, an) = 1 if H′ is the updated height map, and P(H′|Hn, an) = 0 otherwise.
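The action decoding and deterministic height-map update above are simple enough to state in a few lines. Below is a minimal Python/NumPy sketch under the paper's L = W = 10 resolution; the function names are ours, not from the released code.

import numpy as np

L, W = 10, 10  # grid resolution used in the paper's experiments

def decode_action(a):
    # Recover the LP (x, y) from the flat action a = x + L * y.
    return a % L, a // L

def place_item(H, a, item):
    # Deterministic transition: rest item (l, w, h) with its FLB corner at
    # the decoded LP, on top of the highest cell under its footprint.
    l, w, h = item
    x, y = decode_action(a)
    h_max = H[x:x + l, y:y + w].max()   # support height under the item
    H[x:x + l, y:y + w] = h_max + h     # new top surface over the footprint
    return H

H = place_item(np.zeros((L, W), dtype=int), a=3, item=(2, 2, 3))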
During packing, the agent needs to secure enough space in the bin to host item n. Meanwhile, it is equally important to have n statically equilibrated by the items underneath at the LP so that the whole stack is physically stable. Exactly evaluating the physical stability at an LP is involved, taking into account n's center of mass, moment of inertia, and rotational stability (Goldstein, Poole, and Safko 2002), all of which are normally unknown as the mass distribution differs among items. To this end, we employ a conservative and simplified criterion. Specifically, an LP is considered feasible if it not only provides sufficient room for n but also satisfies any of the following conditions with n placed: 1) over 60% of n's bottom area and all four of its bottom corners are supported by existing items; or 2) over 80% of n's bottom area and three out of four bottom corners are supported; or 3) over 95% of n's bottom area is supported. We store the feasibility of all the LPs for item n in a feasibility mask Mn, an L × W binary matrix (also see Figure 2).

Since not all actions are allowed, our problem becomes a constrained Markov decision process (CMDP) (Altman 1999). Typically, one augments the MDP with an auxiliary cost function C : S × A → R mapping state-action tuples to costs, and requires that the expectation of the accumulated cost be bounded by cm: JC(π) = E_{τ∼π}[Σ_{t=0}^{∞} γ_C^t C(s_t, a_t)] ≤ cm. Several methods have been proposed to solve CMDPs based on, e.g., algorithmic heuristics (Uchibe and Doya 2007), primal-dual methods (Chow et al. 2017), or constrained policy optimization (Achiam et al. 2017). While these methods are proven effective, it is unclear how they would fit 3D-BPP instances, where the constraint is rendered as a discrete mask. In this work, we propose to exploit the mask M to guide the DRL training, enforcing the feasibility constraint without introducing excessive training complexity.

Feasibility constraints We devise a prediction-and-projection mechanism to enforce the feasibility constraints. First, we introduce an independent multilayer perceptron module, namely the mask predictor (Figure 2 (right)), to predict the feasibility mask Mn for the item n. The predictor takes the CNN features of the current state as input and is trained with the ground-truth mask as supervision. Next, we use the predicted mask to modulate the output, i.e., the probability distribution of the actions. In theory, if the LP at (x, y) is infeasible for n, the corresponding probability P(an = x + L·y | sn) should be set to 0. However, we find that setting P to a small positive quantity like ε = 10⁻³ works better in practice – it provides a strong penalty for an invalid action but a smoother transformation beneficial to the network training. The inset shows that softening the mask-based modulation improves the training convergence. To further discourage infeasible actions, we explicitly minimize the summed probability over all infeasible LPs, Einf = Σ_{(x,y): Mn(x,y)=0} P(an = x + L·y | sn), which is plugged into the final loss function for training.
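For illustration, here is one plausible reading of the support test and the projection step as NumPy code. The resting height is taken as the maximum of the height map under the item's footprint, the thresholds follow the three conditions above, and the renormalization after flooring infeasible entries at ε is our assumption.

import numpy as np

EPS = 1e-3  # the small positive floor suggested in the text

def is_supported(H, lp, item):
    # Conservative stability test for placing item (l, w, h) at LP (x, y).
    l, w, _ = item
    x, y = lp
    footprint = H[x:x + l, y:y + w]
    contact = footprint == footprint.max()      # cells touching the item's bottom
    area_ratio = contact.mean()
    corners = sum(bool(contact[i, j]) for i in (0, -1) for j in (0, -1))
    return ((area_ratio > 0.60 and corners == 4) or
            (area_ratio > 0.80 and corners >= 3) or
            area_ratio > 0.95)

def masked_distribution(logits, mask):
    # Project the actor's distribution onto the feasible set: infeasible LPs
    # keep probability EPS instead of exactly 0, then renormalize.
    z = np.exp(logits - logits.max())
    probs = z / z.sum()
    probs = np.where(mask.reshape(-1) > 0, probs, EPS)
    return probs / probs.sum()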
Loss function Our loss function is defined as a composite of the actor loss Lactor, the critic loss Lcritic, the mask prediction loss Lmask, the infeasibility loss Einf, and the action entropy loss Eentropy; the individual terms are defined in the supplemental material (Section B).

[Table 1 body garbled in extraction; legible space utilizations include 63.0%, 63.7%, and 66.9%, with packed-item counts in the range 16.7–17.5.]

Table 1: This ablation study compares the space utilization and the total number of packed items with different combinations of MP, MC and FE, on the CUT-2 dataset.
Figure 3: The permutation tree for Io = {1, 2, 3}. To find the best packing for item 1, our method explores different virtual placing orders satisfying the order dependence constraint, e.g., 1 cannot be placed on top of virtually placed 2 or 3.
To make the current placement account for the future ones, we opt to "hallucinate" the placement of future items through updating the height map accordingly. Conditioned on the virtually placed future items, the decision for the current item can be globally more optimal. However, such virtual placement must satisfy the order dependence constraint, which stipulates that the earlier items should never be packed on top of the later ones. In particular, given two items p and q with p < q in Io, if q is (virtually) placed before p, we require that the placement of p be spatially independent of the placement of q: p can never be packed at any LPs that overlap with q. This constraint is enforced by setting the height values in H at the corresponding LPs to H, the maximum height value allowed: Hp(x, y) ← H for all x ∈ [xq, xq + lq] and y ∈ [yq, yq + wq], as sketched below. Combining explicit height-map updating with feasibility mask prediction, the agent utilizes the trained policy with the order dependence constraint satisfied implicitly.

Monte Carlo permutation tree search We opt to search for a better an by exploring the permutations of the sequence (dn, ..., dn+k−1). This amounts to a permutation tree search during which only the actor network is run – no training is needed. Figure 3 shows a k-level permutation tree: a path (r, v1, v2, ..., vk) from the root to a leaf forms a possible permutation of the placement of the k items in Io, where r is the (empty) root node and item(vi) denotes the i-th item placed in the permutation. For two items, item(vi) < item(vj) means item(vi) arrives before item(vj) in the actual order. If i > j along a permutation path, meaning that item(vj) is virtually placed before item(vi), we block the LPs corresponding to item(vj)'s occupancy to avoid placing item(vi) on top of item(vj).

Clearly, enumerating all the permutations of k items quickly becomes prohibitive with an O(k!) complexity. To make the search scalable, we adapt Monte Carlo tree search (MCTS) (Silver et al. 2017) to our problem. With MCTS, the permutation tree is expanded in a priority-based fashion by evaluating how likely a node is to lead to the optimal solution. The latter is estimated by sampling a fixed number of paths starting from that node and computing for each path a value summing the accumulated reward and the critic value ("reward to go") at the leaf (k-th level) node. After the search, we choose the action an corresponding to the permutation with the highest path value. Please refer to the supplemental material for more details on our adaptations of the standard MCTS. MCTS allows a scalable lookahead for BPP-k with a complexity of O(km), where m is the number of paths sampled.
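The blocking rule translates directly into a height-map edit. A short sketch, with our own (hypothetical) helper names; H_MAX stands for the bin height H:

def block_occupied_lps(H, virtual_item, lp, H_MAX):
    # Order dependence: after virtually placing a later item, forbid any
    # earlier item from stacking on top of it by raising the height map to
    # the bin ceiling over the occupied footprint. These LPs then fail
    # every feasibility test.
    l, w, _ = virtual_item
    x, y = lp
    H[x:x + l, y:y + w] = H_MAX
    return H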
4 Experiments

We implement our framework on a desktop computer (Ubuntu 16.04) equipped with an Intel Xeon Gold 5115 CPU @ 2.40 GHz, 64 GB memory, and an Nvidia Titan V GPU with 12 GB memory. The DRL and all other networks are implemented with PyTorch (Paszke et al. 2019). The model training takes about 16 hours at a spatial resolution of 10 × 10. The test time of the BPP-1 model (no lookahead) is less than 10 ms. Please refer to the supplemental material for more implementation details.

Training and test set We set L = W = H = 10 in our experiments with 64 pre-defined item dimensions (|I| = 64). Results with higher spatial resolution are given in the supplemental material. We also set li ≤ L/2, wi ≤ W/2 and hi ≤ H/2 to avoid over-simplified scenarios. The training and test sequences are synthesized by generating items out of I, with the total volume of the items equal to or bigger than the bin's volume. We first create a benchmark called RS where the sequences are generated by sampling items out of I randomly. A disadvantage of random sampling is that the optimality of a sequence is unknown (unless performing a brute-force search); without knowing whether a sequence can lead to a successful packing, it is difficult to gauge the packing performance with this benchmark. Therefore, we also generate training sequences via cutting stock (Gilmore and Gomory 1961). Specifically, items in a sequence are created by sequentially "cutting" the bin into items of the 64 pre-defined types, so that we know the sequence can be perfectly packed, restoring the bin. There are two variations of this strategy. CUT-1: after the cutting, we sort the resulting items into the sequence based on the Z coordinates of their FLBs, from bottom to top; if the FLBs of two items have the same Z coordinate, their order in the sequence is randomly determined. CUT-2: the cut items are sorted based on their stacking dependency: an item can be added to the sequence only after all of its supporting items are there. A 2D toy example is given in the inset figure with the FLB of each item highlighted. Under CUT-1, both {1, 2, 3, 4} and {2, 1, 3, 4} are valid item sequences. If we use CUT-2 on the other hand, {1, 3, 2, 4} and {2, 4, 1, 3} would also be valid sequences, as the placement of 3 or 4 depends on 1 or 2. For the testing purpose, …
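To illustrate the two orderings, here is a sketch of CUT-1 and CUT-2 sequence generation. It assumes each cut item carries its FLB z-coordinate (flb_z) and that a support map from the cut layout is available; both are hypothetical names, not from the paper's code.

import random

def cut_1_order(items):
    # CUT-1: sort by FLB z-coordinate, bottom to top, random tie-breaking.
    return sorted(items, key=lambda it: (it.flb_z, random.random()))

def cut_2_order(items, supports):
    # CUT-2: emit an item only after all of its supporting items
    # (supports[item] is a set, derived from the cut layout) are emitted.
    seq, placed, pending = [], set(), list(items)
    while pending:
        ready = [it for it in pending if supports[it] <= placed]
        nxt = random.choice(ready)   # any dependency-satisfying item; a valid
        seq.append(nxt)              # cut layout guarantees ready is non-empty
        placed.add(nxt)
        pending.remove(nxt)
    return seq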
[Figure 4: packing results visualized for different method settings (w/o MP, w/o MC, full method); annotated space utilizations include 0.417 / 0.524 / 0.674 (12 / 22 / 18 items) and 0.228 / 0.240 / 0.297 (6 / 10 / 8 items).]

Figure 5: HM shows a clear advantage over vector-based height parameterizations (HV and ISV).
Ablation study Table 1 reports an ablation study. From the results, we found that the packing performance drops significantly if we do not incorporate the feasibility mask prediction (MP) during training. The performance is also impaired if the mask constraint (MC) is not enforced with our projection scheme. The feasibility-based entropy (FE) is beneficial for both the training and the final performance. Figure 4 shows the packing results visually for the different method settings.

Height parameterization Next, we show that the environment parameterization using the proposed 2D height map (HM) (i.e., the H matrix) is necessary and effective. To this end, we compare our method using HM against variants employing vector-based height parameterizations (HV and ISV); Figure 5 shows HM's clear advantage.

Scalability of BPP-k With the capability of lookahead, it is expected that the agent better exploits the remaining space in the bin and delivers a more compact packing. On the other hand, due to the NP-hard nature, big k values increase the environment space exponentially. Therefore, it is important to understand whether MCTS is able to effectively navigate the space at the scale of O(k!) for a good packing strategy. In Figure 7(a,b), we compare our method with a brute-force permutation search, which traverses all k! permutations of the k coming items and chooses the best packing strategy (i.e., the global optimum). We also compare to MCTS-based ac…
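For reference, the brute-force baseline just described can be sketched as follows; env.simulate is a hypothetical rollout helper that virtually places a given order with the trained policy and returns the accumulated packing reward.

from itertools import permutations

def brute_force_lookahead(env, items, policy):
    # Try all k! virtual placing orders of the k lookahead items and keep
    # the order with the highest total reward (the global optimum).
    best_order, best_value = None, float("-inf")
    for order in permutations(items):
        value = env.simulate(order, policy)   # hypothetical rollout helper
        if value > best_value:
            best_order, best_value = order, value
    return best_order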
# bins | Space uti. | # items per bin | # total items | Decision time
1      | 67.4%      | 17.6            | 17.6          | 2.2 × 10⁻³ s
4      | 69.4%      | 18.8            | 75.2          | 6.3 × 10⁻³ s
9      | 72.1%      | 19.1            | 171.9         | 1.8 × 10⁻² s
16     | 75.3%      | 19.6            | 313.6         | 2.8 × 10⁻² s
25     | 77.8%      | 20.2            | 505.0         | 4.5 × 10⁻² s

Method (# items / % space uti.) | RS           | CUT-1        | CUT-2
Boundary rule (Online)          | 8.7 / 34.9%  | 10.8 / 41.2% | 11.1 / 40.8%
BPH (Online)                    | 8.7 / 35.4%  | 13.5 / 51.9% | 13.1 / 49.2%
LBP (Offline)                   | 12.9 / 54.7% | 14.9 / 59.1% | 15.2 / 59.5%
Our BPP-1 (Online)              | 12.2 / 50.5% | 19.1 / 73.4% | 17.5 / 66.9%
References

Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.

Chow, Y.; Ghavamzadeh, M.; Janson, L.; and Pavone, M. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1): 6070–6120.

Coffman, E. G.; Garey, M. R.; and Johnson, D. S. 1984. Approximation algorithms for bin-packing—an updated survey. In Algorithm Design for Computer System Design, 49–106. Springer.

Coffman Jr, E. G.; Garey, M. R.; and Johnson, D. S. 1996. Approximation algorithms for bin packing: A survey. In Approximation Algorithms for NP-hard Problems, 46–93.

Crainic, T. G.; Perboli, G.; and Tadei, R. 2008. Extreme point-based heuristics for three-dimensional bin packing. INFORMS Journal on Computing 20(3): 368–384.

Crainic, T. G.; Perboli, G.; and Tadei, R. 2009. TS2PACK: A two-level tabu search for the three-dimensional bin packing problem. European Journal of Operational Research 195(3): 744–760.

De Castro Silva, J.; Soma, N.; and Maculan, N. 2003. A greedy search for the three-dimensional bin packing problem: the packing static stability case. International Transactions in Operational Research 10(2): 141–153.

De La Vega, W. F.; and Lueker, G. S. 1981. Bin packing can be solved within 1+ε in linear time. Combinatorica 1(4): 349–355.

Hifi, M.; Kacem, I.; Negre, S.; and Wu, L. 2010. A linear programming approach for the three-dimensional bin-packing problem. Electronic Notes in Discrete Mathematics 36: 993–1000.

Hu, H.; Zhang, X.; Yan, X.; Wang, L.; and Xu, Y. 2017. Solving a new 3D bin packing problem with deep reinforcement learning method. arXiv preprint arXiv:1708.05930.

Johnson, D. S.; Demers, A.; Ullman, J. D.; Garey, M. R.; and Graham, R. L. 1974. Worst-case performance bounds for simple one-dimensional packing algorithms. SIAM Journal on Computing 3(4): 299–325.

Kantorovich, L. V. 1960. Mathematical methods of organizing and planning production. Management Science 6(4): 366–422.

Karabulut, K.; and İnceoğlu, M. M. 2004. A hybrid genetic algorithm for packing in 3D with deepest bottom left with fill method. In International Conference on Advances in Information Systems, 441–450. Springer.

Karmarkar, N.; and Karp, R. M. 1982. An efficient approximation scheme for the one-dimensional bin-packing problem. In 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982), 312–320. IEEE.

Korte, B.; and Vygen, J. 2012. Bin packing. In Combinatorial Optimization, 499–516. Springer.

Li, X.; Zhao, Z.; and Zhang, K. 2014. A genetic algorithm for the three-dimensional bin packing problem with heterogeneous bins. In IIE Annual Conference Proceedings, 2039. Institute of Industrial and Systems Engineers (IISE).

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lodi, A.; Martello, S.; and Vigo, D. 1999. Approximation algorithms for the oriented two-dimensional bin packing problem. European Journal of Operational Research 112(1): 158–166.

Martello, S. 1990. Knapsack Problems: Algorithms and Computer Implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization.

Martello, S.; Pisinger, D.; and Vigo, D. 2000. The three-dimensional bin packing problem. Operations Research 48(2): 256–267.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8024–8035.

Scholl, A.; Klein, R.; and Jürgens, C. 1997. BISON: A fast hybrid procedure for exactly solving the one-dimensional bin packing problem. Computers & Operations Research 24(7): 627–645.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354–359.

Skiena, S. S. 1997. The Stony Brook Algorithm Repository. http://www.cs.sunysb.edu/algorith/implement/nauty/implement.shtml.

Sweeney, P. E.; and Paternoster, E. R. 1992. Cutting and packing problems: a categorized, application-orientated research bibliography. Journal of the Operational Research Society 43(7): 691–706.

Takahara, S.; and Miyamoto, S. 2005. An evolutionary approach for the multiple container loading problem. In Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 6 pp. IEEE.

Uchibe, E.; and Doya, K. 2007. Constrained reinforcement learning from intrinsic and extrinsic rewards. In 2007 IEEE 6th International Conference on Development and Learning, 163–168. IEEE.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2692–2700.

Zhao, Y.; Xu, K.; Zhu, E.; Liu, X.; Zhu, X.; and Yin, J. 2018. Triangle lasso for simultaneous clustering and optimization in graph datasets. IEEE Transactions on Knowledge and Data Engineering 31(8): 1610–1623.
In this supplemental document, we report more implementation details:
• The network architecture, training selection, and the Monte Carlo permutation tree search are detailed in Section B.
• The user study design is reported in Section E.
• Section F analyzes the performance difference between step-wise reward and termination reward in our problem.
• The details of the reward function to penalize unsafe place…

Figure 9: Detailed network architecture (critic network, actor network, and mask predictor network).

B Implementation Details

We report the details of our implementation in this section; our source code is also submitted with this supplemental material.
Network architecture and training configurations A detailed specification of our network is shown in Figure 9. Our pipeline consists of three major components: an actor network, a critic network, and the feasibility mask predictor. It takes three inputs: the height map Hn and the dimensionality dn = [ln, wn, hn] ∈ Z³ of the current item n to be packed as the state, and the feasibility mask Mn as the ground truth. Note that Mn is only used during training. … Afterwards, a softmax operation is adopted to output the final action distribution. Note that the infeasibility penalty can be absent at test time with the help of Einf, and our method still works well in this situation.

The whole network is trained via a composite loss consisting of the actor loss Lactor, critic loss Lcritic, mask prediction loss Lmask, infeasibility loss Einf, and action entropy loss Eentropy. These losses are defined as:

Lactor = −(Rn − V(sn)) log P(an | sn),
Lcritic = (Rn − V(sn))²,
Lmask = Σ_{(x,y)} (Mn^gt(x, y) − Mn^pred(x, y))²,
Einf = Σ_{(x,y): Mn(x,y)=0} P(an = x + L·y | sn).

Monte Carlo permutation tree search Our algorithm is inspired by the Monte Carlo tree search of (Silver et al. 2017). The main differences lie in two aspects: firstly, the goal of our MCTS is to find the best packing order for the next k items; secondly, the max reward is used in our MCTS instead of the mean reward as the evaluation strategy. Algorithm 1 outlines the entire procedure step by step, wherein T is the maximum simulation time, I is the set of lookahead items, Last is the first item after I, i0 is the current item, s is the state input of each node, and a is the action of each node. The environment simulator SimEnv takes the height map, the next item dimension, and the action (position of the item) as input, and returns the updated height map. The action-choosing function uses the policy network from the BPP-1 model to get the action with the highest probability. N is the visit count of a node; Q is the expected return from a node. We test our method on our three benchmarks, and more results can be found in Figure 10.
(a) Packing performance on RS. (b) Packing performance on CUT-1. (c) Packing performance on CUT-2.
Figure 10: When k increases, the space utilization rate first goes up and later enters a "plateau" zone.
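To make the composite objective concrete, here is a minimal PyTorch sketch of the loss terms defined in Section B above; the weighting coefficients and mean reductions are our assumptions, as this excerpt does not specify them.

import torch
import torch.nn.functional as F

def composite_loss(returns, values, log_prob_a, probs, mask_gt, mask_pred,
                   w_mask=1.0, w_inf=1.0, w_ent=0.01):
    # returns/values/log_prob_a: (B,); probs, mask_gt, mask_pred: (B, L*W),
    # with mask entries 1 = feasible LP. Weights are hypothetical.
    adv = returns - values
    l_actor = -(adv.detach() * log_prob_a).mean()                  # L_actor
    l_critic = adv.pow(2).mean()                                   # L_critic
    l_mask = F.mse_loss(mask_pred, mask_gt)                        # L_mask
    e_inf = (probs * (1.0 - mask_gt)).sum(dim=-1).mean()           # E_inf
    e_ent = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()  # E_entropy
    return l_actor + l_critic + w_mask * l_mask + w_inf * e_inf - w_ent * e_ent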
…Algorithm 2. Here, val is the last state value estimation of bin b, V is the state value estimation function via a value network, n is the current item, B is the set of bins, and H is the set of height maps of the bins. The default score for each bin at the beginning is sdef = −0.2.

Orientated items Our method can also incorporate items with different orientations. We multiply the action space and the related mask based on how many different orientations are considered; e.g., we have an m-times larger feasibility mask Mn and action space if m different poses are allowed for an item (see the sketch below). Doing so induces more flexibility for the packing, and it potentially leads to a better result; this is observed and reported in Table 4. Note that orientation only happens around the Z axis in our problem setting.

Table 4: Performance comparison with and without orientation on different benchmarks.

                | RS    | CUT-1 | CUT-2
w/ orientation  | 62.1% | 76.2% | 70.2%
w/o orientation | 50.5% | 73.4% | 66.9%
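A sketch of how the mask grows with allowed poses (here m = 2 rotations about Z); feasibility_fn stands in for the per-pose feasibility check and is a hypothetical helper.

import numpy as np

def oriented_mask(H, item, feasibility_fn):
    # Stack one L x W feasibility mask per pose, giving an m-times larger
    # mask and action space.
    l, w, h = item
    poses = [(l, w, h), (w, l, h)]                    # rotations about Z only
    return np.stack([feasibility_fn(H, p) for p in poses])  # shape (m, L, W)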
C Benchmark Construction

All 64 pre-defined items are visualized in Figure 11. Algorithm 3 outlines how the dataset is constructed given the bin size L, W, H and a valid item size threshold. The sequence in the RS benchmark is generated by random sampling: each item along the sequence is picked out of our pre-defined item set I randomly. However, as everything is random, we do not know the optimal packing configuration of an RS sequence ourselves (unless we run an exhaustive branch-and-bound search (Martello, Pisinger, and Vigo 2000), which is much too time-consuming to accomplish). For a better quantitative evaluation, we also generate item sequences via cutting stock (Gilmore and Gomory 1961). It is clear that a sequence created by cutting the bin can be packed back into the bin perfectly, with a space utilization of 100%. Algorithm 3 provides the detailed procedure of the data generation.

D Heuristic Baseline Method

Online BPP is an under-investigated problem. To better demonstrate the effectiveness of our method, we design a heuristic baseline approach against which to evaluate the performance of our DRL method; we report its details in this section. The method is based on a simple observation: a human would try to keep the volume of the packed bin regular during the packing, so as to keep the remaining space as large as possible. Such "regularity" is used as the metric to score a packing action.

To describe the regularity of a bin, we introduce the concept of the spare cuboid. As shown in Figure 12, a spare cuboid is an unoccupied, rectangular space in the bin, and the regularity of a bin is defined based on its maximum spare cuboids. Intuitively, we would like to have a bigger maximum spare cuboid; if a packed bin has many small-size spare cuboids, it implies the remaining space of the bin is not "regular". As illustrated in Figure 12, the right packing strategy leaves the biggest spare cuboid. Since I is pre-defined, we know how many item types can be packed into a maximum spare cuboid. Based on this, we rate each maximum spare cuboid c by the number of item types that can be packed into it: RSc = |Ivalid| + cvolume, with Ivalid ⊆ I. If a maximum spare cuboid fits all the items in I, an additional reward is given: RSc = |I| + cvolume + 10. The final score BSp of a bin after packing the current item at p aggregates the RSc of its maximum spare cuboids c, and we find the best packing position pbest as:

pbest = arg max_p (1/|C|) Σ_{c∈C} RSc    (3)
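A sketch of this scoring, under assumptions: spare_cuboids(H) is a hypothetical helper that enumerates the maximal empty rectangular spaces of a bin as (dims, volume) pairs, items are (l, w, h) triples, and place reuses the place_item sketch from Section 3.1.

def cuboid_score(cuboid, item_set):
    # RS_c = |I_valid| + volume(c), with a +10 bonus when every item fits.
    dims, volume = cuboid
    fits = [it for it in item_set if all(a <= b for a, b in zip(it, dims))]
    bonus = 10 if len(fits) == len(item_set) else 0
    return len(fits) + volume + bonus

def best_position(H, candidates, item, item_set, spare_cuboids, place):
    # Eq. (3): choose the LP whose resulting bin maximizes the mean RS_c.
    def bin_score(p):
        H_after = place(H.copy(), p, item)
        cuboids = list(spare_cuboids(H_after))   # hypothetical helper
        return sum(cuboid_score(c, item_set) for c in cuboids) / len(cuboids)
    return max(candidates, key=bin_score)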
E User Study

Figure 13 shows the interface of our user study app, which consists of two parts: visualization and the action space. The test sequences are randomly picked from the CUT-2 test set. Users can drag our UI to change the angle of view, thus having a full observation of the packed items. To help users make better decisions, our app allows them to choose any suggestion circle in the action space and virtually place an item before they make…
[Figure 11: the 64 pre-defined item types, with dimensions ranging over (2, 2, 2)–(2, 2, 5), (3, 2, 2)–(3, 5, 5), (4, 2, 2)–(4, 5, 5), and (5, 2, 2)–(5, 5, 5).]
Figure 14: Left: Packing performance at different resolutions; Size-10 means the resolution of the test bin is 10 × 10 × 10, etc. Second to right: Imposing the C&B rule leads to inferior performance (lower average reward).
Figure 15: Both (a) and (b) have the same height map but different space utilizations, which is ambiguous for the agent when predicting the state value given a termination reward.

[Figure 16: (a) before packing the red item; (b) following the C&B rule, 16 items are packed; (c) with our method, 19 items.]

…room for next moves, rather than only taking the current situation into consideration. This move reserves enough space around the red item for the following item, and this decision lets our method pack 3 more items on the same sequence.

Figure 18: Packing results of our BPP-1 model with 125 types of pre-defined items (example bins: RS 0.726 / 24 items and 0.698 / 34 items; CUT-1 0.830 / 27 items and 0.762 / 56 items; CUT-2 0.884 / 41 items and 0.772 / 62 items).
Generalizability with unseen items We also test our method with unseen items. In this experiment, we randomly choose 40 items from the pre-defined item set I to train an agent and test it on the complete I. All the items are generated with RS, and their test dimensions may not be seen during training. The result is presented in Table 7: our method does demonstrate some generalizability and provides a reasonable benchmark.

Generalizability with untrained sequences Since our RS, CUT-1 and CUT-2 benchmarks are constructed from different types of sequences, we can also evaluate the performance of our method on sequence types different from the training data. The result is presented in Table 8: our method still performs well when tested on varied sequences. Note that our model trained on CUT-2 attains the best generalization, since this benchmark has the best balance between variation and completeness.

Table 8: Evaluation of generalizability with untrained sequences.

Test  | Train | Space utilization | #Packed items
RS    | RS    | 50.5%             | 12.2
RS    | CUT-1 | 47.6%             | 10.7
RS    | CUT-2 | 43.4%             | 11.6
CUT-1 | RS    | 60.8%             | 15.7
CUT-1 | CUT-1 | 73.4%             | 19.1
CUT-1 | CUT-2 | 69.4%             | 17.9
CUT-2 | RS    | 60.9%             | 16.1
CUT-2 | CUT-1 | 62.4%             | 16.6
CUT-2 | CUT-2 | 66.9%             | 17.5

…l ≤ L/2, w ≤ W/2, and h ≤ H/2 to ensure the complexity of BPP, which also means more little items are added (one of an item's axes must be 1). The results can be seen in Table 9 and Figure 18.
Different DRL frameworks We also test our method with different DRL frameworks on the CUT-2 dataset with well-tuned parameters. For on-policy methods, we evaluated A2C (Mnih et al. 2016) and ACKTR (Wu et al. 2017); for off-policy methods, we evaluated DQN (Mnih et al. 2015), Rainbow (Hessel et al. 2018), and SAC (Haarnoja et al. 2018). Figure 17 and Table 10 demonstrate that ACKTR achieves the fastest convergence and the best performance.
Algorithm 1 (MCTS), node selection and backup:

42  return argmax_{v′ ∈ children(v)} ( v′.Q + c · √(v.N) / (1 + v′.N) );
43  Function BACKUP(v, δ):
44      while v is not null do
45          v.N ← v.N + 1;
46          v.Q ← max(v.Q, δ);
47          v ← parent of v;

Algorithm 3 (benchmark construction), item placement loop:

22  while Lvalid ≠ ∅ do
23      Randomly pop an item_i from Lvalid satisfying lz_i = Hn(item_i);
24      Add item_i into S;
25      Hn(item_i) ← Hn(item_i) + h_i;
26  return S;
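Restated as a minimal Python sketch, with field names mirroring the pseudocode; the max-reward update follows the supplemental's stated deviation from mean-reward MCTS.

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.N = 0                 # visit count
        self.Q = float("-inf")     # best (max) path value seen so far

def backup(v, delta):
    # Propagate a simulated path value delta to the root (Algorithm 1, 43-47).
    while v is not None:
        v.N += 1
        v.Q = max(v.Q, delta)
        v = v.parent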