Solving the Rubik’s cube with deep reinforcement learning and search
https://doi.org/10.1038/s42256-019-0070-z
The Rubik’s cube is a prototypical combinatorial puzzle that has a large state space with a single goal state. The goal state is unlikely to be accessed using sequences of randomly generated moves, posing unique challenges for machine learning. We solve the Rubik’s cube with DeepCubeA, a deep reinforcement learning approach that learns how to solve increasingly difficult states in reverse from the goal state without any specific domain knowledge. DeepCubeA solves 100% of all test configurations, finding a shortest path to the goal state 60.3% of the time. DeepCubeA generalizes to other combinatorial puzzles and is able to solve the 15 puzzle, 24 puzzle, 35 puzzle, 48 puzzle, Lights Out and Sokoban, finding a shortest path in the majority of verifiable cases.
The Rubik’s cube is a classic combinatorial puzzle that poses unique and interesting challenges for artificial intelligence and machine learning. Although the state space is exceptionally large (4.3 × 10^19 different states), there is only one goal state. Furthermore, the Rubik’s cube is a single-player game and a sequence of random moves, no matter how long, is unlikely to end in the goal state. Developing machine learning algorithms to deal with this property of the Rubik’s cube might provide insights into learning to solve planning problems with large state spaces. Although machine learning methods have previously been applied to the Rubik’s cube, these methods have either failed to reliably solve the cube1–4 or have had to rely on specific domain knowledge5,6. Outside of machine learning methods, methods based on pattern databases (PDBs) have been effective at solving puzzles such as the Rubik’s cube, the 15 puzzle and the 24 puzzle7,8, but these methods can be memory-intensive and puzzle-specific.

More broadly, a major goal in artificial intelligence is to create algorithms that are able to learn how to master various environments without relying on domain-specific human knowledge. The classical 3 × 3 × 3 Rubik’s cube is only one representative of a larger family of possible environments that broadly share the characteristics described above, including (1) cubes with longer edges or higher dimension (for example, 4 × 4 × 4 or 2 × 2 × 2 × 2), (2) sliding tile puzzles (for example, the 15 puzzle, 24 puzzle, 35 puzzle and 48 puzzle), (3) Lights Out and (4) Sokoban. As the size and dimensions are increased, the complexity of the underlying combinatorial problems rapidly increases. For example, while finding an optimal solution to the 15 puzzle takes less than a second on a modern-day desktop, finding an optimal solution to the 24 puzzle can take days, and finding an optimal solution to the 35 puzzle is generally intractable9. Not only are the aforementioned puzzles relevant as mathematical games, but they can also be used to test planning algorithms10 and to assess how well a machine learning approach may generalize to different environments. Furthermore, because the operation of the Rubik’s cube and other combinatorial puzzles is deeply rooted in group theory, these puzzles also raise broader questions about the application of machine learning methods to complex symbolic systems, including mathematics. In short, for all these reasons, the Rubik’s cube poses interesting challenges for machine learning.

To address these challenges, we have developed DeepCubeA, which combines deep learning11,12 with classical reinforcement learning13 (approximate value iteration14–16) and path-finding methods (weighted A* search17,18). DeepCubeA is able to solve combinatorial puzzles such as the Rubik’s cube, 15 puzzle, 24 puzzle, 35 puzzle, 48 puzzle, Lights Out and Sokoban (Fig. 1). DeepCubeA works by using approximate value iteration to train a deep neural network (DNN) to approximate a function that outputs the cost to reach the goal (also known as the cost-to-go function). Given that random play is unlikely to end in the goal state, DeepCubeA trains on states obtained by starting from the goal state and randomly taking moves in reverse. After training, the learned cost-to-go function is used as a heuristic to solve the puzzles using a weighted A* search17–19.

DeepCubeA builds on DeepCube20, a deep reinforcement learning algorithm that solves the Rubik’s cube using a policy and value function combined with Monte Carlo tree search (MCTS). MCTS, combined with a policy and value function, is also used by AlphaZero, which learns to beat the best existing programs in chess, Go and shogi21. In practice, we find that, for combinatorial puzzles, MCTS has relatively long runtimes and often produces solutions many moves longer than the length of a shortest path. In contrast, DeepCubeA finds a shortest path to the goal for puzzles for which a shortest path is computationally verifiable: 60.3% of the time for the Rubik’s cube and over 90% of the time for the 15 puzzle, 24 puzzle and Lights Out.
1Department of Computer Science, University of California Irvine, Irvine, CA, USA. 2Department of Statistics, University of California Irvine, Irvine, CA, USA. 3These authors contributed equally: Forest Agostinelli, Stephen McAleer, Alexander Shmakov. *e-mail: [email protected]
Fig. 1 | Visualization of scrambled states and goal states. Visualization of a scrambled state (top) and the goal state (bottom) for four puzzles investigated here.
Deep approximate value iteration
Value iteration15 is a dynamic programming algorithm14,16 that iteratively improves a cost-to-go function J. In traditional value iteration, J takes the form of a lookup table, where the cost-to-go J(s) is stored in a table for all possible states s. However, this lookup table representation becomes infeasible for combinatorial puzzles with large state spaces like the Rubik’s cube. Therefore, we turn to approximate value iteration16, where J is represented by a parameterized function implemented by a DNN. The DNN is trained to minimize the mean squared error between its estimation of the cost-to-go of state s, J(s), and the updated cost-to-go estimation J′(s):

$$J'(s) = \min_a \left( g^a(s, A(s, a)) + J(A(s, a)) \right) \quad (1)$$

where A(s, a) is the state obtained from taking action a in state s and g^a(s, s′) is the cost to transition from state s to state s′ by taking action a. For the puzzles investigated in this Article, g^a(s, s′) is always 1. We call the resulting algorithm ‘deep approximate value iteration’ (DAVI).

We train on a state distribution that allows information to propagate from the goal state to all the other states seen during training. Our method of achieving this is simple: each training state x_i is obtained by randomly scrambling the goal state k_i times, where k_i is uniformly distributed between 1 and K.
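To make the procedure concrete, the following is a minimal Python/PyTorch sketch of one DAVI iteration. The environment interface (`env.goal_state`, `env.next_state`, `env.actions`, `env.encode`, `env.is_goal`) is a hypothetical stand-in rather than the authors’ released code, the per-state loop would be batched in a real implementation, and the paper’s target-network bookkeeping (targets are computed with a separate, periodically updated copy of the parameters) is omitted for brevity.

```python
import random
import torch
import torch.nn.functional as F

def davi_step(dnn, optimizer, env, batch_size=10_000, K=30):
    """One DAVI iteration: sample scrambled states, build the one-step
    lookahead target of equation (1), and regress the DNN onto it."""
    # Each training state is the goal scrambled k times, k ~ Uniform{1..K},
    # so information can propagate from the goal out to harder states.
    states = []
    for _ in range(batch_size):
        s = env.goal_state()
        for _ in range(random.randint(1, K)):
            s = env.next_state(s, random.choice(env.actions))
        states.append(s)

    # Target: J'(s) = min_a [ g^a(s, A(s, a)) + J(A(s, a)) ], with unit
    # move costs and J fixed to zero at the goal state.
    with torch.no_grad():
        targets = []
        for s in states:
            succ = [env.next_state(s, a) for a in env.actions]
            j = dnn(env.encode(succ)).squeeze(1)         # J of each successor
            goal = torch.tensor([env.is_goal(t) for t in succ])
            j = torch.where(goal, torch.zeros_like(j), j)
            targets.append(1.0 + j.min())
        targets = torch.stack(targets)

    # Minimize the mean squared error between J(s) and J'(s).
    optimizer.zero_grad()
    loss = F.mse_loss(dnn(env.encode(states)).squeeze(1), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```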
Although the update in equation (1) is only a one-step lookahead, it has been shown that, as training progresses, J approximates the optimal cost-to-go function J* (ref. 16). This optimal cost-to-go function computes the total cost incurred when taking a shortest path to the goal. Instead of equation (1), multi-step lookaheads such as a depth-N search or Monte Carlo tree search can be used. We experimented with different multi-step lookaheads and found that multi-step lookahead strategies resulted in, at best, similar performance to the one-step lookahead used by DAVI (see Methods for more details).

Batch weighted A* search
After learning a cost-to-go function, we can then use it as a heuristic to search for a path between a starting state and the goal state. The search algorithm that we use is a variant of A* search17, a best-first search algorithm that iteratively expands the node with the lowest cost until the node associated with the goal state is selected for expansion. The cost of each node x in the search tree is determined by the function f(x) = g(x) + h(x), where g(x) is the path cost, which is the distance between the starting state and x, and h(x) is the heuristic function, which estimates the distance between x and the goal state. The heuristic function h(x) is obtained from the learned cost-to-go function:
$$h(x) = \begin{cases} 0 & \text{if } x \text{ is associated with the goal state} \\ J(x) & \text{otherwise} \end{cases} \quad (2)$$

We use a variant of A* search called weighted A* search18. Weighted A* search trades potentially longer solutions for potentially less memory usage by using, instead, the function f(x) = λg(x) + h(x), where λ is a weighting factor between zero and one. Furthermore, using a computationally expensive model for the heuristic function h(x), such as a DNN, could result in an intractably slow solver. However, h(x) can be computed for many nodes in parallel by expanding the N lowest-cost nodes at each iteration. We call the combination of A* search with a path-cost coefficient λ and batch size N ‘batch weighted A* search’ (BWAS).
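A compact Python sketch of BWAS follows. It assumes hashable states, the same hypothetical `env` interface as above, and a heuristic `h` that scores a batch of states (zero at the goal, as in equation (2)); the default values of λ and N are illustrative, not the tuned values.

```python
import heapq
from itertools import count

def bwas(start, env, h, lam=0.6, N=1000):
    """Batch weighted A* search: expand the N lowest-cost nodes per
    iteration, scoring nodes with f(x) = lam * g(x) + h(x)."""
    tie = count()                        # tiebreaker so states are never compared
    g = {start: 0}                       # path cost from the start state
    parent = {start: None}
    OPEN = [(h([start])[0], next(tie), start)]
    while OPEN:
        batch = [heapq.heappop(OPEN)[2] for _ in range(min(N, len(OPEN)))]
        succs = []
        for x in batch:
            if env.is_goal(x):           # goal selected for expansion: done
                return reconstruct_path(parent, x)
            for a in env.actions:
                y = env.next_state(x, a)
                if y not in g or g[x] + 1 < g[y]:    # unit move costs
                    g[y] = g[x] + 1
                    parent[y] = x
                    succs.append(y)
        # One batched heuristic call per iteration (for example, a batched
        # DNN forward pass), which is what makes the DNN heuristic tractable.
        for y, hy in zip(succs, h(succs)):
            heapq.heappush(OPEN, (lam * g[y] + hy, next(tie), y))
    return None

def reconstruct_path(parent, x):
    """Walk parent pointers back from the goal to recover the path."""
    path = []
    while x is not None:
        path.append(x)
        x = parent[x]
    return path[::-1]
```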
In summary, the algorithm presented in this Article uses DAVI to train a DNN as the cost-to-go function on states whose difficulty ranges from easy to hard. The trained cost-to-go function is then used as a heuristic for BWAS to find a path from any given state to the goal state. We call the resulting algorithm DeepCubeA.
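Tying the pieces together, a hypothetical end-to-end run might look like the following; `make_env`, `make_dnn` and `env.scramble` are invented helpers for illustration, and `davi_step` and `bwas` are the sketches above.

```python
import torch

env, dnn = make_env("rubiks_cube"), make_dnn()          # hypothetical helpers
optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-4)

for iteration in range(1_000_000):                      # DAVI training loop
    davi_step(dnn, optimizer, env)

def h(states):                                          # heuristic, equation (2)
    with torch.no_grad():
        j = dnn(env.encode(states)).squeeze(1).tolist()
    return [0.0 if env.is_goal(s) else v for s, v in zip(states, j)]

scrambled = env.scramble(env.goal_state(), 1_000)       # a hard test state
solution = bwas(scrambled, env, h, lam=0.6, N=10_000)
```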
Results
To test the approach, we generate a test set of 1,000 states by randomly scrambling the goal state between 1,000 and 10,000 times.
Table 1 | Comparison of DeepCubeA with optimal solvers based on PDBs along the dimensions of solution length, percentage of optimal solutions, number of nodes generated, time taken to solve the problem and number of nodes generated per second on the Rubik’s cube states that are furthest away from the goal

| Puzzle | Solver | Length | Percentage of optimal solutions | No. of nodes | Time taken (s) | Nodes per second |
|---|---|---|---|---|---|---|
| Rubik’s cube_h | PDBs7 | – | – | – | – | – |
| | PDBs+24 | 26.00 | 100.0 | 2.41 × 10^10 | 13,561.27 | 1.78 × 10^6 |
| | DeepCubeA | 26.00 | 100.0 | 5.33 × 10^6 | 18.77 | 2.96 × 10^5 |

PDBs+ refers to Rokicki’s optimal solver, which uses PDBs combined with knowledge of group theory24,25. DeepCubeA finds a shortest path to the goal for all of the states furthest away from the goal.
Table 2 | Comparison of the size (in GB) of the lookup tables for PDBs and the size of the DNN used by DeepCubeA

| Solver | Rubik’s cube | 15 puzzle | 24 puzzle | 35 puzzle | 48 puzzle | Lights Out | Sokoban |
|---|---|---|---|---|---|---|---|
| PDBs | 4.67 | 8.51 | 1.86 | 0.64 | 4.86 | – | – |
| PDBs+ | 182.00 | – | – | – | – | – | – |
| DeepCubeA | 0.06 | 0.06 | 0.08 | 0.08 | 0.10 | 0.05 | 0.06 |

PDBs+ refers to Rokicki’s PDB combined with knowledge of group theory24,25. The table shows that DeepCubeA always uses memory that is orders of magnitude less than PDBs.
Additionally, we test the performance of DeepCubeA on the three known states that are the furthest possible distance away from the goal (26 moves)22. To assess how often DeepCubeA finds a shortest path to the goal, we need to compare our results to a shortest path solver. We can obtain a shortest path solver by using an iterative deepening A* search (IDA*)23 with an admissible heuristic computed from a PDB. Initially, we used the PDB described in Korf’s work on finding optimal solutions to the Rubik’s cube7; however, this solver only solves a few states a day. Therefore, we use the optimal solver that was used to find the maximum of the minimum number of moves required to solve the Rubik’s cube from any given state (so-called ‘God’s number’)24,25. This human-engineered solver relies on large PDBs26 (requiring 182 GB of memory) and sophisticated knowledge of group theory to find a shortest path to the goal state. Comparisons between DeepCubeA and shortest path solvers are shown later in Table 5.

The DNN architecture consists of two fully connected hidden layers, followed by four residual blocks27, followed by a linear output unit that represents the cost-to-go estimate. The hyperparameters of BWAS were chosen by doing a grid search over λ and N on data generated separately from the test set (see Methods for more details). When performing BWAS, the heuristic function is computed in parallel across four NVIDIA Titan V graphics processing units (GPUs).
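For illustration, the architecture just described could be sketched in PyTorch as follows; the hidden-layer widths are placeholders (assumptions), since the exact sizes are reported in the Methods rather than here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two fully connected layers with a skip connection (ref. 27 style)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):
        out = torch.relu(self.fc1(x))
        return torch.relu(self.fc2(out) + x)      # skip connection

class CostToGoNet(nn.Module):
    """Cost-to-go DNN: two fully connected hidden layers, four residual
    blocks, and a linear output unit giving the scalar estimate."""
    def __init__(self, in_dim, h1=5000, h2=1000):  # widths are illustrative
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            *[ResidualBlock(h2) for _ in range(4)],
            nn.Linear(h2, 1),                      # cost-to-go estimate
        )

    def forward(self, x):
        return self.body(x)
```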
Performance. DeepCubeA finds a solution to 100% of all test states. DeepCubeA finds a shortest path to the goal 60.3% of the time. Aside from the optimal solutions, 36.4% of the solutions are only two moves longer than the optimal solution, while the remaining 3.3% are four moves longer than the optimal solution. For the three states that are furthest away from the goal, DeepCubeA finds a shortest path to the goal for all three states (Table 1). Although we relate the performance of DeepCubeA to the performance of shortest path solvers based on PDBs, a direct comparison cannot be made because shortest path solvers guarantee an optimal solution while DeepCubeA does not.

Although PDBs can be used in shortest path solvers, they can also be used in BWAS in place of the heuristic learned by DeepCubeA. We use Korf’s PDB heuristic for BWAS and compare to DeepCubeA. We perform BWAS with N = 10,000 and λ = 0.0, 0.1 and 0.2. We compute the PDB heuristic in parallel across 32 central processing units (CPUs). Note that at λ = 0.3 BWAS runs out of memory when using PDBs. Figure 2 shows that performing BWAS with DeepCubeA’s learned heuristic consistently produces shorter solutions, generates fewer nodes and is overall much faster than Korf’s PDB heuristic.

We also compare the memory footprint and speed of pattern databases to DeepCubeA. In terms of memory, for pattern databases, it is necessary to load lookup tables into memory. For DeepCubeA, it is necessary to load the DNN into memory. Table 2 shows that DeepCubeA uses significantly less memory than PDBs. In terms of speed, we measure how quickly PDBs and DeepCubeA compute a heuristic for a single state, averaging over 1,000 states. Given that DeepCubeA uses neural networks, which benefit from GPUs and batch processing, we measure the speed of DeepCubeA with both a single CPU and a single GPU, and with both sequential and batch processing of the states. Table 3 shows that, as expected, PDBs on a single CPU are faster than DeepCubeA on a single CPU; however, the speed of PDBs on a single CPU is comparable to the speed of DeepCubeA on a single GPU with batch processing.

During training we monitor how well the DNN is able to solve the Rubik’s cube using a greedy best-first search; we also monitor how well the DNN is able to estimate the optimal cost-to-go function (computed with Rokicki’s shortest path solver25). How these performance metrics change as a function of training iteration is shown in Fig. 3. The results show that DeepCubeA first learns to solve states closer to the goal before it learns to solve states further away from the goal. Cost-to-go estimation is less accurate for states further away from the goal; however, the cost-to-go function still correctly orders the states according to difficulty. In addition, we found that DeepCubeA frequently used the conjugate patterns of moves of the form aba^−1 in its solutions and often found symmetric solutions to symmetric states. An example of this is shown in Fig. 4 (see Methods for more details).
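The greedy best-first monitoring just mentioned can be sketched in a few lines, reusing the hypothetical `env` and batched heuristic `h` from the earlier sketches:

```python
def greedy_solve(state, env, h, max_steps=30):
    """Greedy best-first search used only for monitoring training:
    repeatedly move to the successor with the lowest estimated cost-to-go."""
    path = [state]
    for _ in range(max_steps):
        if env.is_goal(path[-1]):
            return path                              # solved
        succ = [env.next_state(path[-1], a) for a in env.actions]
        costs = h(succ)
        best = min(range(len(succ)), key=costs.__getitem__)
        path.append(succ[best])
    return None                                      # not solved within budget
```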
Fig. 2 | The performance of DeepCubeA versus PDBs when solving the Rubik’s cube with BWAS (N = 10,000; λ = 0.0, 0.1 or 0.2). The plots show solution length as a function of calculation time (s) and of the number of nodes generated, for DeepCubeA (λ = 0.0, 0.1, 0.2) and PDBs (λ = 0.0, 0.1); each dot represents the result on a single state. DeepCubeA is both faster and produces shorter solutions.
Table 4 | Comparison of DeepCubeA with optimal solvers based on PDBs along the dimensions of solution length, percentage of optimal solutions, number of nodes generated, time taken to solve the problem and number of nodes generated per second for the 24 puzzle and 35 puzzle

| Puzzle | Solver | Length | Percentage of optimal solutions | No. of nodes | Time taken (s) | Nodes per second |
|---|---|---|---|---|---|---|
| 24 puzzle | PDBs9 | 89.41 | 100.0 | 8.19 × 10^10 | 4,239.54 | 1.91 × 10^7 |
| | DeepCubeA | 89.49 | 96.98 | 6.44 × 10^6 | 19.33 | 3.34 × 10^5 |
| 35 puzzle | PDBs9 | – | – | – | – | – |
| | DeepCubeA | 124.64 | – | 9.26 × 10^6 | 28.45 | 3.25 × 10^5 |

For the 24 puzzle, DeepCubeA finds a shortest path to the goal the overwhelming majority of the time. For the 35 puzzle, no tractable optimal solver exists.
Fig. 3 | The performance of DeepCubeA over the course of training. The plots show, as a function of training iteration (×10^6), that DeepCubeA first learns how to solve cubes closer to the goal and then learns to solve increasingly difficult cubes; the right panel shows the average cost-to-go. Dashed lines represent the true average cost-to-go.
Generalization to other combinatorial puzzles. The Rubik’s cube is only one combinatorial puzzle among many others. To demonstrate the ability of DeepCubeA to generalize to other puzzles, we applied DeepCubeA to four popular sliding tile puzzles: the 15 puzzle, the 24 puzzle, the 35 puzzle and the 48 puzzle. Additionally, we applied DeepCubeA to Lights Out and Sokoban. Sokoban posed a unique challenge for DeepCubeA because actions taken in its environment are not always reversible.
Sliding tile puzzles. The 15 puzzle has 1.0 × 10^13 possible combinations, the 24 puzzle has 7.7 × 10^24 possible combinations, the 35 puzzle has 1.8 × 10^41 possible combinations and the 48 puzzle has 3.0 × 10^62 possible combinations. The objective is to move the puzzle into its goal configuration shown in Fig. 1. For these sliding tile puzzles, we generated a test set of 500 states randomly scrambled between 1,000 and 10,000 times. The same DNN architecture and hyperparameters that are used for the Rubik’s cube are also used for the n puzzles, with the exception of the addition of two more residual layers. We implemented an optimal solver using additive pattern databases9. DeepCubeA not only solved every test puzzle, but also found a shortest path to the goal 99.4% of the time for the 15 puzzle and 96.98% of the time for the 24 puzzle. We also test on the 17 states that are furthest away from the goal for the 15 puzzle (these states are not known for the 24 puzzle)28. Solutions produced by DeepCubeA are, on average, 2.8 moves longer than the length of a shortest path, and DeepCubeA finds a shortest path to the goal for 17.6% of these states. For the 24 puzzle, on average, PDBs take 4,239 s and DeepCubeA takes 19.3 s, over 200 times faster. Moreover, in the worst case we observed that the longest time needed to solve the 24 puzzle is 5 days for PDBs and 2 min for DeepCubeA. The average solution length is 124.76 for the 35 puzzle and 253.53 for the 48 puzzle; however, we do not know how many of these solutions are optimal because the optimal solver is prohibitively slow for the 35 puzzle and 48 puzzle. The performances of DeepCubeA on the 24 puzzle and 35 puzzle are summarized in Table 4.

Although the shortest path solver for the 35 puzzle and 48 puzzle was prohibitively slow, we compare DeepCubeA to PDBs using BWAS. The results show that, compared to PDBs, DeepCubeA produces shorter solutions and generates fewer nodes, as shown in Supplementary Figs. 6 and 7. In combination, these results suggest that the heuristic learned by DeepCubeA performs favorably compared to PDBs.
Mirrored starts: F U R′ U′ F′ R′ L′ F′ D R′ / F′ U′ L U F L R F D′ L
Mirrored starts: F′ R′ B R′ B U′ F U B′ U′ L′ / F L B′ L B′ U F′ U′ B U R
Fig. 4 | An example of symmetric solutions that DeepCubeA finds to symmetric states. Conjugate triplets are indicated by green boxes. Note that the last two conjugate triplets are overlapping.
Lights Out. Lights Out is a grid-based puzzle consisting of an N × N board of lights that may be either active or inactive. The goal is to convert all active lights to inactive from a random starting position, as seen in Fig. 1. Pressing any light in the grid will switch the state of that light and its immediate horizontal and vertical neighbours. At any given state, a player may click on any of the N^2 lights. However, one difference of Lights Out compared to the other environments is that its moves are commutative. We tested DeepCubeA on the 7 × 7 Lights Out puzzle. A theorem by Scherphuis29 shows that, for the 7 × 7 Lights Out puzzle, any solution that does not contain any duplicate moves is an optimal solution. Using this theorem, we found that DeepCubeA found a shortest path to the goal for all test cases.
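The optimality certificate implied by Scherphuis’s theorem is easy to state in code; a sketch, assuming a solution is given as a sequence of pressed-light identifiers:

```python
def is_certified_shortest(solution_moves):
    """Scherphuis's result for 7x7 Lights Out: because presses commute and
    pressing the same light twice cancels out, any solution without
    duplicate moves is a shortest solution."""
    return len(solution_moves) == len(set(solution_moves))
```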
Sokoban. Sokoban30 is a planning problem that requires an agent to move boxes onto target locations. Boxes can only be pushed, not pulled. Note that training states are generated by pulling boxes instead of pushing them (see Methods for more details). To test our method on Sokoban, we train on the 900,000 training examples and test on the 1,000 testing examples used by previous research on single-agent policy tree search applied to Sokoban31. DeepCubeA successfully solves 100% of all test examples. We compare the solution length and number of nodes expanded to this same previous research32. Although the goals of the aforementioned paper are slightly different from ours, DeepCubeA finds shorter paths than previously reported methods and also expands at least three times fewer nodes (Table 5).
Table 5 | Comparison of DeepCubeA with optimal solvers based on PDBs along the dimensions of solution length, percentage of optimal solutions, number of nodes generated, time taken to solve the problem and number of nodes generated per second

| Puzzle | Solver | Length | Percentage of optimal solutions | No. of nodes | Time taken (s) | Nodes per second |
|---|---|---|---|---|---|---|
| Rubik’s cube | PDBs7 | – | – | – | – | – |
| | PDBs+24 | 20.67 | 100.0 | 2.05 × 10^6 | 2.20 | 1.79 × 10^6 |
| | DeepCubeA | 21.50 | 60.3 | 6.62 × 10^6 | 24.22 | 2.90 × 10^5 |
| Rubik’s cube_h | PDBs7 | – | – | – | – | – |
| | PDBs+24 | 26.00 | 100.0 | 2.41 × 10^10 | 13,561.27 | 1.78 × 10^6 |
| | DeepCubeA | 26.00 | 100.0 | 5.33 × 10^6 | 18.77 | 2.96 × 10^5 |
| 15 puzzle | PDBs9 | 52.02 | 100.0 | 3.22 × 10^4 | 0.002 | 1.45 × 10^7 |
| | DeepCubeA | 52.03 | 99.4 | 3.85 × 10^6 | 10.28 | 3.93 × 10^5 |
| 15 puzzle_h | PDBs9 | 80.00 | 100.0 | 1.53 × 10^7 | 0.997 | 1.56 × 10^7 |
| | DeepCubeA | 82.82 | 17.65 | 2.76 × 10^7 | 69.36 | 3.98 × 10^5 |
| 24 puzzle | PDBs9 | 89.41 | 100.0 | 8.19 × 10^10 | 4,239.54 | 1.91 × 10^7 |
| | DeepCubeA | 89.49 | 96.98 | 6.44 × 10^6 | 19.33 | 3.34 × 10^5 |
| 35 puzzle | PDBs9 | – | – | – | – | – |
| | DeepCubeA | 124.64 | – | 9.26 × 10^6 | 28.45 | 3.25 × 10^5 |
| 48 puzzle | PDBs | – | – | – | – | – |
| | DeepCubeA | 253.35 | – | 1.96 × 10^7 | 74.46 | 2.63 × 10^5 |
| Lights Out | DeepCubeA | 24.26 | 100.0 | 1.14 × 10^6 | 3.27 | 3.51 × 10^5 |
| Sokoban | LevinTS32 | 39.80 | – | 6.60 × 10^3 | – | – |
| | LevinTS(*)32 | 39.50 | – | 5.03 × 10^3 | – | – |
| | LAMA32 | 51.60 | – | 3.15 × 10^3 | – | – |
| | DeepCubeA | 32.88 | – | 1.05 × 10^3 | 2.35 | 5.60 × 10^1 |

The datasets with an ‘h’ subscript are datasets containing the states that are furthest away from the goal state. PDBs+ refers to Rokicki’s PDB combined with knowledge of group theory24,25. For Sokoban, we compare nodes expanded instead of nodes generated to allow for a direct comparison to previous work. DeepCubeA often finds a shortest path to the goal. For the states that are furthest away from the goal, DeepCubeA either finds a shortest path or a path close in length to a shortest path.
Methods
Lights Out. Lights Out contains N^2 lights on an N × N board. The lights can either be on or off. The representation given to the DNN is a vector of size N^2. Each element is 1 if the corresponding light is on and 0 if the corresponding light is off.

Sokoban. The Sokoban environment we use is a 10 × 10 grid that contains four boxes that an agent needs to push onto four targets. In addition to the agent, boxes and targets, Sokoban also contains walls. The representation given to the DNN contains four binary vectors of size 10^2 that represent the positions of the agent, boxes, targets and walls. Given that boxes can only be pushed, not pulled, some actions are irreversible. For example, a box pushed into a corner can no longer be moved.
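Under the representations just described, the encodings might look as follows; the input formats (a 0/1 board array, and (row, column) cell lists) are assumptions made for illustration.

```python
import numpy as np

def encode_lights_out(board):
    """N x N array of 0/1 lights -> flat binary vector of size N^2."""
    return np.asarray(board, dtype=np.float32).reshape(-1)

def encode_sokoban(agent, boxes, targets, walls, n=10):
    """Four binary position maps of size n^2 (agent, boxes, targets,
    walls), concatenated into a single input vector."""
    def grid(cells):                      # cells: iterable of (row, col)
        m = np.zeros(n * n, dtype=np.float32)
        for r, c in cells:
            m[r * n + c] = 1.0
        return m
    return np.concatenate([grid([agent]), grid(boxes), grid(targets), grid(walls)])
```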
Distributed training. In the Rubik’s cube environment, there are 12 possible actions that can be applied to every state. Using equation (1) to update the cost-to-go estimate of a single state thus requires applying the DNN to 12 states. As a result, equation (1) takes up the majority of the computational time. However, as is the case with methods such as ExIt38, this is a trivially parallelizable task that can easily be distributed across multiple GPUs.
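One way to realize this in PyTorch is sketched below; `dnn` and `successor_batch` are stand-ins carried over from the earlier sketches, not the authors’ code.

```python
import torch

# If several GPUs are visible, the batched DNN evaluation of successor
# states can be scattered across them with a one-line wrapper.
if torch.cuda.device_count() > 1:
    dnn = torch.nn.DataParallel(dnn.cuda())

with torch.no_grad():
    # One forward pass scores the 12 successors of many states at once.
    costs = dnn(successor_batch)   # successor_batch: (B * 12, state_dim)
```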
BWAS. A* search17 is a heuristic-based search algorithm that finds a path between a starting node x_s and a goal node x_g. A* search maintains a set, OPEN, from which it iteratively removes and expands the node with the lowest cost. The cost of each node x is determined by the function f(x) = g(x) + h(x), where g(x) is the path cost, which is the distance between the starting state and x, and h(x) is the heuristic function, which estimates the distance between x and the goal state.