Chip Placement With Deep Reinforcement Learning
Azalia Mirhoseini*, Anna Goldie*, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee,
{azalia, agoldie, mustafay, wenjiej, esonghori, shenwang, youngjoonlee}@google.com
Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa,
William Hang, Emre Tuncer, Anand Babu, Quoc Le, James Laudon, Richard Ho, Roger Carpenter, Jeff Dean
…training is guided by a fast-but-approximate reward signal for each of the agent's chip placements.

To our knowledge, the proposed method is the first placement approach with the ability to generalize, meaning that it can leverage what it has learned from placing previous netlists to generate placements for new unseen netlists. In particular, we show that, as our agent is exposed to a greater volume and variety of chips, it becomes both faster and better at generating optimized placements for new chip blocks, bringing us closer to a future in which chip designers are assisted by artificial agents with vast chip placement experience.

We believe that the ability of our approach to learn from experience and improve over time unlocks new possibilities for chip designers. We show that we can achieve superior PPA on real AI accelerator chips (Google TPUs), as compared to state-of-the-art baselines. Furthermore, our methods generate placements that are superior or comparable to those of human expert chip designers in under 6 hours, whereas the highest-performing alternatives require human experts in the loop and take several weeks for each of the dozens of blocks in a modern chip. Although we evaluate primarily on AI accelerator chips, our proposed method is broadly applicable to any chip placement optimization.

2. Related Work

Global placement is a longstanding challenge in chip design, requiring multi-objective optimization over circuits of ever-growing complexity. Since the 1960s, many approaches have been proposed, so far falling into three broad categories: 1) partitioning-based methods, 2) stochastic/hill-climbing methods, and 3) analytic solvers.

Starting in the 1960s, industry and academic labs took a partitioning-based approach to the global placement problem, proposing a variety of partitioning-based methods (Breuer, 1977; Kernighan, 1985; Fiduccia & Mattheyses, 1982), as well as resistive-network based methods (Chung-Kuan Cheng & Kuh, 1984; Ren-Song Tsay et al., 1988). These methods are characterized by a divide-and-conquer approach: the netlist and the chip canvas are recursively partitioned until sufficiently small sub-problems emerge, at which point the sub-netlists are placed onto the sub-regions using optimal solvers. Such approaches are quite fast to execute, and their hierarchical nature allows them to scale to arbitrarily large netlists. However, by optimizing each sub-problem in isolation, partitioning-based methods sacrifice quality of the global solution, especially routing congestion. Furthermore, a poor early partition may result in an unsalvageable end placement.

In the 1980s, analytic approaches emerged, but were quickly overtaken by stochastic / hill-climbing algorithms, particularly simulated annealing (Kirkpatrick et al., 1983; Sechen & Sangiovanni-Vincentelli, 1986; Sarrafzadeh et al., 2003). Simulated annealing (SA) is named for its analogy to metallurgy, in which metals are first heated and then gradually cooled to induce, or anneal, energy-optimal crystalline surfaces. SA applies random perturbations to a given placement (e.g., shifts, swaps, or rotations of macros), and then measures their effect on the objective function (e.g., the half-perimeter wirelength described in Section 3.3.1). If the perturbation is an improvement, it is applied; if not, it may still be applied with some probability governed by a parameter referred to as temperature. Temperature is initialized to a particular value and is then gradually annealed to a lower value. Although SA generates high-quality solutions, it is very slow and difficult to parallelize, thereby failing to scale to the increasingly large and complex circuits of the 1990s and beyond.

The 1990s-2000s were characterized by multi-level partitioning methods (Agnihotri et al., 2005; Roy et al., 2007), as well as the resurgence of analytic techniques, such as force-directed methods (Tao Luo & Pan, 2008; Bo Hu & Marek-Sadowska, 2005; Obermeier et al., 2005; Spindler et al., 2008; Viswanathan et al., 2007b;a) and non-linear optimizers (Kahng et al., 2005; Chen et al., 2006). The renewed success of quadratic methods was due in part to algorithmic advances, but also to the large size of modern circuits (10-100 million nodes), which justified approximating the placement problem as that of placing nodes with zero area. However, despite the computational efficiency of quadratic methods, they are generally less reliable and produce lower quality solutions than their non-linear counterparts.

Non-linear optimization approximates cost using smooth mathematical functions, such as log-sum-exp (William et al., 2001) and weighted-average (Hsu et al., 2011) models for wirelength, as well as Gaussian (Chen et al., 2008) and Helmholtz models for density. These functions are then combined into a single objective function using a Lagrange penalty or relaxation. Due to the higher complexity of these models, it is necessary to take a hierarchical approach, placing clusters rather than individual nodes, an approximation which degrades the quality of the placement.

The last decade has seen the rise of modern analytic techniques, including more advanced quadratic methods (Kim et al., 2010; 2012b; Kim & Markov, 2012; Brenner et al., 2008; Lin et al., 2013), and more recently, electrostatics-based methods like ePlace (Lu et al., 2015) and RePlAce (Cheng et al., 2019). Modeling netlist placement as an electrostatic system, ePlace (Lu et al., 2015) proposed a new formulation of the density penalty in which each node (macro or standard cell) of the netlist is analogous to a positively charged particle whose area corresponds to its electric charge.
In this setting, nodes repel each other with a force proportional to their charge (area), and the density function and gradient correspond to the system's potential energy. Variations of this electrostatics-based approach have been proposed to address standard-cell placement (Lu et al., 2015) and mixed-size placement (Lu et al., 2015; Lu et al., 2016). RePlAce (Cheng et al., 2019) is a recent state-of-the-art mixed-size placement technique that further optimizes ePlace's density function by introducing a local density function, which tailors the penalty factor for each individual bin size. Section 5 compares the performance of the state-of-the-art RePlAce algorithm against our approach.

Recent work (Huang et al., 2019) proposes training a model to predict the number of Design Rule Check (DRC) violations for a given macro placement. DRCs are rules that ensure that the placed and routed netlist adheres to tape-out requirements. To generate macro placements with fewer DRCs, Huang et al. (2019) use the predictions from this trained model as the evaluation function in simulated annealing. While this work represents an interesting direction, it reports results on netlists with no more than 6 macros, far fewer than any modern block, and the approach does not include any optimization during the place and route steps. Because such optimization can change the placement and routing dramatically, the actual DRC count will change accordingly, invalidating the model's prediction. In addition, although adhering to the DRC criteria is a necessary condition, the primary objective of macro placement is to optimize for wirelength, timing (e.g., Worst Negative Slack (WNS) and Total Negative Slack (TNS)), power, and area, and this work does not consider these metrics.

To address this classic problem, we propose a new category of approach: end-to-end learning-based methods. This type of approach is most closely related to analytic solvers, particularly non-linear ones, in that all of these methods optimize an objective function via gradient updates. However, our approach differs from prior approaches in its ability to learn from past experience to generate higher-quality placements on new chips. Unlike existing methods that optimize the placement for each new chip from scratch, our work leverages knowledge gained from placing prior chips to become better over time. In addition, our method enables direct optimization of the target metrics, such as wirelength, density, and congestion, without having to define convex approximations of those functions as is done in other approaches (Cheng et al., 2019; Lu et al., 2015). Not only does our formulation make it easy to incorporate new cost functions as they become available, but it also allows us to weight their relative importance according to the needs of a given chip block (e.g., timing-critical or power-constrained).

Domain adaptation is the problem of training policies that can learn across multiple experiences and transfer the acquired knowledge to perform better on new unseen examples. In the context of chip placement, domain adaptation involves training a policy across a set of chip netlists and applying that trained policy to a new unseen netlist. Recently, domain adaptation for combinatorial optimization has emerged as a trend (Zhou et al., 2019; Paliwal et al., 2019; Addanki et al., 2019). While the focus in prior work has been on using domain knowledge learned from previous examples of an optimization problem to speed up policy training on new problems, we propose an approach that, for the first time, enables the generation of higher quality results by leveraging past experience. Not only does our novel domain adaptation produce better results, it also reduces the training time 8-fold compared to training the policy from scratch.

3. Methods

3.1. Problem Statement

In this work, we target the chip placement optimization problem, in which the objective is to map the nodes of a netlist (the graph describing the chip) onto a chip canvas (a bounded 2D space), such that final power, performance, and area (PPA) are optimized. In this section, we give an overview of how we formulate the problem as a reinforcement learning (RL) problem, followed by a detailed description of the reward function, action and state representations, policy architecture, and policy updates.

3.2. Overview of Our Approach

We take a deep reinforcement learning approach to the placement problem, where an RL agent (policy network) sequentially places the macros; once all macros are placed, a force-directed method is used to produce a rough placement of the standard cells, as shown in Figure 1. RL problems can be formulated as Markov Decision Processes (MDPs), consisting of four key elements (a concrete sketch of this formulation follows the list):

• states: the set of possible states of the world (e.g., in our case, every possible partial placement of the netlist onto the chip canvas).

• actions: the set of actions that can be taken by the agent (e.g., given the current macro to place, the available actions are the set of all the locations in the discrete canvas space (grid cells) onto which that macro can be placed without violating any hard constraints on density or blockages).

• state transition: given a state and an action, the probability distribution over next states.

• reward: the reward for taking an action in a state (e.g., in our case, the reward is 0 for all actions except the last action, where the reward is a negative weighted sum of proxy wirelength and congestion, subject to density constraints as described in Section 3.3).
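The following is a minimal environment-loop sketch of this formulation. It is illustrative only: the netlist/policy interfaces and the helper functions (feasibility_mask, place_macro, place_standard_cell_clusters, proxy_cost) are assumptions standing in for the components described in Sections 3.3.4 through 3.3.6, not our implementation.

```python
def run_episode(netlist, policy, grid=(128, 128)):
    """Sketch of one placement episode: the policy places macros one at a time,
    r_t = 0 at every intermediate step, and a negative proxy cost is returned
    only after the final macro is placed."""
    state = netlist.initial_state(grid)              # empty canvas, no macros placed
    for macro in netlist.macro_order():              # T = number of macros
        mask = feasibility_mask(state, macro)        # hard density/blockage constraint
        action = policy.sample(state, mask)          # grid cell for the current macro
        state = place_macro(state, macro, action)    # deterministic state transition
    place_standard_cell_clusters(state)              # force-directed step (Section 3.3.4)
    return -proxy_cost(state)                        # weighted wirelength + congestion
```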
Figure 1. The RL agent (i.e., the policy network) places macros one at a time. Once all macros are placed, the standard cells are placed using a force-directed method. The reward, a linear combination of the approximate wirelength and congestion, is calculated and passed to the agent to optimize its parameters for the next iteration.
In our setting, at the initial state, s0, we have an empty chip canvas and an unplaced netlist. The final state sT corresponds to a completely placed netlist. At each step, one macro is placed; thus, T is equal to the total number of macros in the netlist. At each time step t, the agent begins in state (st), takes an action (at), arrives at a new state (st+1), and receives a reward (rt) from the environment (0 for t < T and negative proxy cost for t = T).

We define st to be a concatenation of features representing the state at time t, including a graph embedding of the netlist (including both placed and unplaced nodes), a node embedding of the current macro to place, metadata about the netlist (Section 4), and a mask representing the feasibility of placing the current node onto each cell of the grid (Section 3.3.6).

The action space is all valid placements of the tth macro, which is a function of the density mask described in Section 3.3.6. Action at is the cell placement of the tth macro that was chosen by the RL policy network.

st+1 is the next state, which includes an updated representation containing information about the newly placed macro, an updated density mask, and an embedding for the next node to be placed.

In our formulation, rt is 0 for every time step except for the final rT, where it is a weighted sum of approximate wirelength and congestion as described in Section 3.3. Through repeated episodes (sequences of states, actions, and rewards), the policy network learns to take actions that will maximize cumulative reward. We use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to update the parameters of the policy network, given the cumulative reward for each placement.

In this section, we define the reward r, state s, actions a, the policy network architecture πθ(a|s) parameterized by θ, and finally the optimization method we use to train those parameters.

3.3. Reward

Our goal in this work is to optimize power, performance, and area (PPA), subject to constraints on routing congestion and density. Our true reward is the output of a commercial EDA tool, including wirelength, routing congestion, density, power, timing, and area. However, RL policies require 100,000s of examples to learn effectively, so it is critical that the reward function be fast to evaluate, ideally running in a few milliseconds. In order to be effective, these approximate reward functions must also be positively correlated with the true reward. Therefore, a component of our cost is wirelength, because it is not only much cheaper to evaluate, but also correlates with power and performance (timing). We define approximate cost functions for both wirelength and congestion, as described in Section 3.3.1 and Section 3.3.5, respectively.

To combine multiple objectives into a single reward function, we take a weighted sum of proxy wirelength and congestion, where the weight can be used to explore the trade-off between the two metrics.

While we treat congestion as a soft constraint (i.e., lower congestion improves the reward function), we treat density as a hard constraint, masking out actions (grid cells to place nodes onto) whose density exceeds the target density, as described further in Section 3.3.6.
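As an illustration of how these pieces fit together, here is a minimal sketch of such a proxy cost. The HPWL helper follows Section 3.3.1, while the congestion estimate, the weight value, and the placement data structures are stand-in assumptions rather than our implementation.

```python
def hpwl(net_coords):
    """Half-perimeter wirelength of one net: half the perimeter of its bounding
    box (Section 3.3.1), given (x, y) coordinates of its pins/cluster centers."""
    xs = [x for x, _ in net_coords]
    ys = [y for _, y in net_coords]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def proxy_cost(placement, congestion_weight=0.1):
    """Weighted sum of proxy wirelength and congestion (Section 3.3).
    Density is not penalized here: it is enforced as a hard constraint by
    masking infeasible grid cells before each action is taken (Section 3.3.6)."""
    wirelength = sum(hpwl(placement.net_coords(net)) for net in placement.nets)
    congestion = placement.congestion_estimate()   # e.g., mean of most-congested cells
    return wirelength + congestion_weight * congestion

# The reward passed to the agent at the final step is the negative proxy cost:
#   r_T = -proxy_cost(final_placement)
```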
To keep the runtime per iteration small, we apply several approximations to the calculation of the reward function:

1. We group millions of standard cells into a few thousand clusters using hMETIS (Karypis & Kumar, 1998), a partitioning technique based on the normalized minimum cut objective. Once all the macros are placed, we use force-directed methods to place the standard cell clusters, as described in Section 3.3.4. Doing so enables us to achieve an approximate but fast standard cell placement that facilitates policy network optimization.

2. We discretize the grid into a few thousand grid cells and place the centers of macros and standard cell clusters onto the centers of the grid cells.

3. When calculating wirelength, we make the simplifying assumption that all wires leaving a standard cell cluster originate at the center of the cluster.

4. To calculate routing congestion cost, we only consider the average congestion of the top 10% most congested grid cells, as described in Section 3.3.5.

3.3.1. Wirelength

Following the literature (Shahookar & Mazumder, 1991), we employ half-perimeter wirelength (HPWL), the most commonly used approximation for wirelength. HPWL is defined as the half-perimeter of the bounding boxes for all nodes in the netlist. The HPWL for a given net (edge) i is shown in the equation below:

HPWL(i) = (max_{b∈i} x_b − min_{b∈i} x_b) + (max_{b∈i} y_b − min_{b∈i} y_b)

where x_b and y_b are the coordinates of the endpoints of net i.

3.3.2. Selection of Grid Rows and Columns

The choice of grid granularity affects both the difficulty of the optimization and the quality of the final placement. We limit the maximum number of rows and columns to 128. We treat choosing the optimal number of rows and columns as a bin-packing problem and rank different combinations of rows and columns by the amount of wasted space they incur. We use an average of 30 rows and columns in the experiments described in Section 5.

3.3.3. Selection of Macro Order

To select the order in which the macros are placed, we sort macros by descending size and break ties using a topological sort. By placing larger macros first, we reduce the chance of there being no feasible placement for a later macro. The topological sort can help the policy network learn to place connected nodes close to one another. Another potential approach would be to learn to jointly optimize the ordering of macros and their placement, making the choice of which node to place next part of the action space. However, this enlarged action space would significantly increase the complexity of the problem, and we found that this heuristic worked in practice.
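The two heuristics above (grid selection by wasted space and macro ordering) are simple enough to sketch directly. The snippet below is illustrative only: the Macro fields, the wasted-space measure, and the dependency graph used for the topological tie-break are assumptions rather than our exact implementation.

```python
from graphlib import TopologicalSorter   # Python 3.9+

def rank_grids(canvas_w, canvas_h, macros, max_dim=128):
    """Rank (rows, cols) choices by wasted space: the gap between each macro's
    area and the area of the grid cells it would occupy, summed over macros."""
    def wasted_space(rows, cols):
        cell_w, cell_h = canvas_w / cols, canvas_h / rows
        waste = 0.0
        for m in macros:
            cells_x = -(-m.width // cell_w)    # ceiling division
            cells_y = -(-m.height // cell_h)
            waste += cells_x * cell_w * cells_y * cell_h - m.width * m.height
        return waste
    candidates = [(r, c) for r in range(1, max_dim + 1) for c in range(1, max_dim + 1)]
    return sorted(candidates, key=lambda rc: wasted_space(*rc))

def macro_order(macros, deps):
    """Sort macros by descending area, breaking ties with a topological order
    over a (hypothetical) connectivity/dependency graph `deps`."""
    topo_rank = {m: i for i, m in enumerate(TopologicalSorter(deps).static_order())}
    return sorted(macros, key=lambda m: (-m.width * m.height, topo_rank.get(m, 0)))
```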
Figure 2. Policy and value network architecture. An embedding layer encodes information about the netlist adjacency, node features, and
the current macro to be placed. The policy and value networks then output a probability distribution over available placement locations
and an estimate of the expected reward for the current placement, respectively.
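To make the architecture in Figure 2 more concrete, the sketch below shows one way the deconvolution-based policy head and the feedforward value head could be wired up in PyTorch. The 3x3 kernels, stride 2, and 16/8/4/2/1 filter channels follow Section 4.2 and its footnote; the state-embedding size, the 4x4 spatial seed, the hidden sizes, and the masking value are assumptions, not our code.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    """Illustrative policy/value heads over a 128x128 placement grid."""
    def __init__(self, state_dim: int = 96):
        super().__init__()
        self.fc = nn.Linear(state_dim, 32 * 4 * 4)   # project state embedding to a 4x4 seed
        chans = [32, 16, 8, 4, 2, 1]                 # five deconvolutions: 4x4 -> 128x128
        layers = []
        for i, (cin, cout) in enumerate(zip(chans[:-1], chans[1:])):
            layers.append(nn.ConvTranspose2d(cin, cout, kernel_size=3, stride=2,
                                             padding=1, output_padding=1))
            if i < len(chans) - 2:                   # BatchNorm + ReLU between layers
                layers += [nn.BatchNorm2d(cout), nn.ReLU()]
        self.deconv = nn.Sequential(*layers)
        self.value = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, state_emb, feasibility_mask):
        # feasibility_mask: boolean (batch, 128*128) tensor of allowed grid cells
        x = self.fc(state_emb).view(-1, 32, 4, 4)
        logits = self.deconv(x).flatten(1)                      # one logit per grid cell
        logits = logits.masked_fill(~feasibility_mask, -1e9)    # density/blockage mask
        return torch.distributions.Categorical(logits=logits), self.value(state_emb)
```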
To train this supervised model, we needed a large dataset of chip placements and their corresponding reward labels. We therefore created a dataset of 10,000 chip placements, where the input is the state associated with a given placement and the label is the reward for that placement (wirelength and congestion). We built this dataset by first picking 5 different accelerator netlists and then generating 2,000 placements for each netlist. To create diverse placements for each netlist, we trained a vanilla policy network at various congestion weights (ranging from 0 to 1) and random seeds, and collected snapshots of each placement during the course of policy training. An untrained policy network starts off with random weights and the generated placements are of low quality, but as the policy network trains, the quality of generated placements improves, allowing us to collect a diverse dataset with placements of varying quality.

To train a supervised model that can accurately predict wirelength and congestion labels and generalize to unseen data, we developed a novel graph neural network architecture that embeds information about the netlist. The role of graph neural networks is to distill information about the type and connectivity of a node within a large graph into low-dimensional vector representations which can be used in downstream tasks. Some examples of such downstream tasks are node classification (Nazi et al., 2019), device placement (Zhou et al., 2019), link prediction (Zhang & Chen, 2018), and Design Rule Violation (DRC) prediction (Zhiyao Xie, Duke University, 2018).

We create a vector representation of each node by concatenating the node features. The node features include node type, width, height, and x and y coordinates. We also pass node adjacency information as input to our algorithm. We then repeatedly perform the following updates: 1) each edge updates its representation by applying a fully connected network to an aggregated representation of intermediate node embeddings, and 2) each node updates its representation by taking the mean of adjacent edge embeddings. The node and edge updates are shown in Equation 5:

e_ij = fc_1(concat(fc_0(v_i) | fc_0(v_j) | w^e_ij))        (5)
v_i = mean_{j ∈ N(v_i)}(e_ij)

Node embeddings are denoted by v_i for 1 <= i <= N, where N is the total number of macros and standard cell clusters. Vectorized edges connecting nodes v_i and v_j are represented as e_ij. Both edge (e_ij) and node (v_i) embeddings are randomly initialized and are 32-dimensional. fc_0 is a 32 × 32 feedforward network, fc_1 is a 65 × 32 feedforward network, and the w^e_ij are learnable 1 × 1 weights corresponding to edges. N(v_i) denotes the neighbors of v_i. The outputs of the algorithm are the node and edge embeddings.

Our supervised model consists of: (1) the graph neural network described above, which embeds information about node types and the netlist adjacency matrix; (2) a fully connected feedforward network that embeds the metadata, including information about the underlying semiconductor technology (horizontal and vertical routing capacity), the total number of nets (edges), macros, and standard cell clusters, the canvas size, and the number of rows and columns in the grid; and (3) a fully connected feedforward network (the prediction layer) whose input is a concatenation of the netlist graph and metadata embeddings and whose output is the reward prediction. The netlist graph embedding is created by applying a reduce-mean function on the edge embeddings. The supervised model is trained via regression to minimize the weighted sum of the mean squared losses of wirelength and congestion.
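As a concrete reading of Equation 5, the following is a minimal numpy sketch of one round of the edge and node updates. The random weights, the bias-free linear layers, and the toy netlist are assumptions for illustration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_round(v, edges, w_e, W0, W1):
    """One round of Equation 5.
    v:     (N, 32) node embeddings
    edges: list of (i, j) index pairs for the netlist edges
    w_e:   (E,) learnable scalar weight per edge
    W0:    (32, 32) weights of fc_0;  W1: (65, 32) weights of fc_1
    """
    h = v @ W0                                            # fc_0 applied to every node
    e = np.stack([np.concatenate([h[i], h[j], [w_e[k]]]) @ W1
                  for k, (i, j) in enumerate(edges)])     # fc_1 over concatenated features
    v_new = v.copy()
    for i in range(len(v)):                               # node update: mean of incident edges
        adj = [k for k, (a, b) in enumerate(edges) if i in (a, b)]
        if adj:
            v_new[i] = e[adj].mean(axis=0)
    return v_new, e

# Toy 3-node netlist with random weights.
N, E = 3, 2
v, w_e = rng.normal(size=(N, 32)), rng.normal(size=E)
W0, W1 = rng.normal(size=(32, 32)), rng.normal(size=(65, 32))
v, e = gnn_round(v, [(0, 1), (1, 2)], w_e, W0, W1)
```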
This supervised task allowed us to find the features and architecture necessary to generalize reward prediction across netlists. To incorporate this architecture into our policy network, we removed the prediction layer and then used it as the encoder component of the policy network, as shown in Figure 2.

To handle different grid sizes corresponding to different choices of rows and columns, we set the grid size to 128 × 128 and mask the unused L-shaped section for grid sizes smaller than 128 rows and columns.

To place a new test netlist at inference time, we load the pre-trained weights of the policy network and apply it to the new netlist. We refer to placements generated by a pre-trained policy network with no finetuning as zero-shot placements. Such a placement can be generated in less than a second, because it only requires a single inference step of the pre-trained policy network. We can further optimize placement quality by finetuning the policy network. Doing so gives us the flexibility to either use the pre-trained weights (that have learned a rich representation of the input state) or further finetune these weights to optimize for the properties of a particular chip netlist.

4.2. Policy Network Architecture

Figure 2 depicts an overview of the policy network (modeled by πθ in Equation 3) and the value network architecture that we developed for chip placement. The inputs to these networks are the netlist graph (graph adjacency matrix and node features), the id of the current node to be placed, and the metadata of the netlist and the semiconductor technology. The netlist graph is passed through our proposed graph neural network architecture as described earlier. This graph neural network generates embeddings of (1) the partially placed graph and (2) the current node. We use a simple feedforward network to embed (3) the metadata. These three embedding vectors are then concatenated to form the state embedding, which is passed to a feedforward neural network. The output of the feedforward network is then fed into the policy network (composed of 5 deconvolution¹ and Batch Normalization layers) to generate a probability distribution over actions, and is also passed to a value network (composed of a feedforward network) to predict the value of the input state.

¹ The deconvolution layers have a 3x3 kernel size with stride 2 and 16, 8, 4, 2, and 1 filter channels, respectively.

4.3. Policy Network Update: Training Parameters θ

In Equation 3, the objective is to train a policy network πθ that maximizes the expected value (E) of the reward (Rp,g) over the policy network's placement distribution. To optimize the parameters of the policy network, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) with a clipped objective as shown below:

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)]

where Ê_t represents the expected value at timestep t, r_t is the ratio of the new policy and the old policy, and Â_t is the estimated advantage at timestep t.

5. Results

In this section, we evaluate our method and answer the following questions: Does our method enable domain transfer and learning from experience? What is the impact of using pre-trained policies on the quality of result? How does the quality of the generated placements compare to state-of-the-art baselines? We also inspect the visual appearance of the generated placements and provide some insights into why our policy network made those decisions.

5.1. Transfer Learning Results

Figure 3 compares the quality of placements generated using pre-trained policies to those generated by training the policy network from scratch. Zero-shot means that we applied a pre-trained policy network to a new netlist with no finetuning, yielding a placement in less than one second. We also show results where we finetune the pre-trained policy network on the details of a particular design for 2 and 12 hours. The policy network trained from scratch takes much longer to converge, and even after 24 hours, the results are worse than what the finetuned policy network achieves after 12 hours, demonstrating that the learned weights and exposure to many different designs are helping us to achieve higher quality placements for new designs in less time.

Figure 4 shows the convergence plots for training from scratch vs. training from a pre-trained policy network for the Ariane RISC-V CPU. The pre-trained policy network starts with a lower placement cost at the beginning of the finetuning process. Furthermore, the pre-trained policy network converges to a lower placement cost and does so more than 30 hours faster than the policy network that was trained from scratch.
Figure 3. Domain adaptation results. For each block, the zero-shot results, as well as the finetuned results after 2 and 12 hours of training, are shown. We also include results for policies trained from scratch. As can be seen in the table, the pre-trained policy network consistently outperforms the policy network that was trained from scratch, demonstrating the effectiveness of learning from training data offline.
Figure 4. Convergence plots for training a policy network from scratch vs. finetuning a pre-trained policy network for a block of Ariane.
5.2. Learning from Larger Datasets

As we train on more chip blocks, we are able to speed up the training process and generate higher quality results faster. Figure 5 (left) shows the impact of a larger training set on performance. The training dataset is created from internal TPU blocks. The training data consists of a variety of blocks, including memory subsystems, compute units, and control logic. As we increase the training set from 2 blocks to 5 blocks and finally to 20 blocks, the policy network generates better placements both at zero-shot and after being finetuned for the same number of hours. Figure 5 (right) shows the placement cost on the test data as the policy network is being (pre-)trained. We can see that for the small training dataset, the policy network quickly overfits to the training data and performance on the test data degrades, whereas it takes longer for the policy network to overfit on the largest dataset, and the policy network pre-trained on this larger dataset yields better results on the test data. This plot suggests that as we expose the policy network to a greater variety of distinct blocks, while it might take longer for the policy network to pre-train, the policy network becomes less prone to overfitting and better at finding optimized placements for new unseen blocks.

5.3. Visualization Insights

Figure 6 shows the placement results for the Ariane RISC-V CPU. On the left, placements from the zero-shot policy network are shown, and on the right, placements from the finetuned policy network. The zero-shot placements are generated at inference time on a previously unseen chip. The zero-shot policy network places the standard cells in the center of the canvas surrounded by macros, which is already quite close to the optimal arrangement. After finetuning, the placements of macros become more regularized and the standard cell area in the center becomes less congested.

Figure 7 shows the visualized placements: on the left, results from a manual placement, and on the right, results from our approach.
Figure 5. We pre-train the policy network on three different training datasets (the small dataset is a subset of the medium one, and the
medium dataset is a subset of the large one). We then finetune this pre-trained policy network on the same test block and report cost at
various training durations (shown on the left of the figure). As the dataset size increases, both the quality of generated placements and
time to convergence on the test block improve. The right figure shows evaluation curves for policies trained on each dataset (each dot in
the right figure shows the cost of the placement generated by the policy under training).
Figure 6. Visualization of placements. On the left, zero-shot placements from the pre-trained policy and on the right, placements from
the finetuned policy are shown. The zero-shot policy placements are generated at inference time on a previously unseen chip. The
pre-trained policy network (with no fine-tuning) places the standard cells in the center of the canvas surrounded by macros, which is
already quite close to the optimal arrangement and in line with the intuitions of physical design experts.
The white area shows the macro placements and the green area shows the standard cell placements. Our method creates donut-shaped placements of macros surrounding standard cells, which results in a reduction in the total wirelength.

5.4. Comparing with Baseline Methods

In this section, we compare our method with three baseline methods: Simulated Annealing, RePlAce, and human expert baselines. For our method, we use a policy pre-trained on the largest dataset (of 20 TPU blocks) and then finetune it on 5 target unseen blocks, denoted by Blocks 1 to 5. Our dataset consists of a variety of blocks, including memory subsystems, compute units, and control logic. Due to confidentiality, we cannot disclose the details of these blocks, but to give an idea of the scale, each block contains up to a few hundred macros and millions of standard cells.

Comparisons with Simulated Annealing: Simulated Annealing (SA) is known to be a powerful, but slow, optimization method. However, like RL, simulated annealing is capable of optimizing arbitrary non-differentiable cost functions. To show the relative sample efficiency of RL, we ran experiments in which we replaced it with a simulated annealing based optimizer. In these experiments, we use the same inputs and cost function as before, but in each episode the simulated annealing optimizer places all macros, followed by an FD step to place the standard cell clusters. Each macro placement is accepted according to the SA update rule using an exponential decay annealing schedule (Kirkpatrick et al., 1983). SA takes 18 hours to converge, whereas our method takes no more than 6 hours.
Figure 7. Human-expert placements are shown on the left and results from our approach are shown on the right. The white area represents
macros and the green area represents standard cells. The figures are intentionally blurred as the designs are proprietary.
Table 1. Experiments to evaluate the sample efficiency of Deep RL compared to Simulated Annealing (SA). We replaced our RL policy network with SA and ran 128 different SA experiments for each block, sweeping different hyper-parameters, including min and max temperature, seed, and max step size. The results from the run with minimum cost are reported. The results show proxy wirelength and congestion values for each block. Note that because these proxy metrics are relative, comparisons are only valid for different placements of the same block.
           Replacing Deep RL with SA in our framework       Ours
           Wirelength        Congestion                     Wirelength        Congestion
Block 1    0.048             1.21                           0.047             0.87
Block 2    0.045             1.11                           0.041             0.93
Block 3    0.044             1.14                           0.034             0.96
Block 4    0.030             0.87                           0.024             0.78
Block 5    0.045             1.29                           0.038             0.88
To make comparisons fair, we ran multiple SA experiments that sweep different hyper-parameters, including min and max temperature, seed, and max SA episodes, such that SA and RL spend the same amount of CPU-hours in simulation and search a similar number of states. The results from the experiment with minimum cost are reported in Table 1. As shown in the table, even with additional time, SA struggles to produce high-quality placements compared to our approach, and produces placements with 14.4% higher wirelength and 24.1% higher congestion on average.

Comparisons with RePlAce (Cheng et al., 2019) and manual baselines: Table 2 compares our results with the state-of-the-art method RePlAce (Cheng et al., 2019) and manual baselines. The manual baseline is generated by a production chip design team, and involved many iterations of placement optimization, guided by feedback from a commercial EDA tool, over a period of several weeks.

With respect to RePlAce, we share the same optimization goals, namely to optimize global placement in chip design, but we use different objective functions. Thus, rather than comparing results from different cost functions, we treat the output of a commercial EDA tool as ground truth. To perform this comparison, we fix the macro placements generated by our method and by RePlAce and allow a commercial EDA tool to further optimize the standard cell placements, using the tool's default settings. We then report total wirelength, timing (worst (WNS) and total (TNS) negative slack), area, and power metrics. As shown in Table 2, our method outperforms RePlAce in generating placements that meet the design requirements. Given constraints imposed by the underlying semiconductor technology, placements of these blocks will not be able to meet timing constraints in the later stages of the design flow if the WNS is significantly above 100 ps or if the horizontal or vertical congestion is over 1%, rendering some RePlAce placements (Blocks 1, 2, 3) unusable. These results demonstrate that our congestion-aware approach is effective in generating high-quality placements that meet design criteria.

RePlAce is faster than our method, as it converges in 1 to 3.5 hours, whereas our results were achieved in 3 to 6 hours. However, some of the fundamental advantages of our approach are: 1) our method can readily optimize for various non-differentiable cost functions, without the need to formulate closed-form or differentiable equivalents of those cost functions (for example, while it is straightforward to model wirelength as a convex function, this is not true for routing congestion or timing); 2) our method has the ability to improve over time as the policy is exposed to more chip blocks; and 3) our method is able to adhere to various design constraints, such as blockages of differing shapes.
Table 2. Comparing our method with the state-of-the-art (RePlAce (Cheng et al., 2019)) method and manual expert placements using an
industry standard electronic design automation (EDA) tool. For all metrics in this table, lower is better. For placements which violate
constraints on timing (WNS significantly greater than 100 ps) or congestion (horizontal or vertical congestion greater than 1%), we
render their metrics in gray to indicate that these placements are infeasible.
Name       Method    WNS (ps)   TNS (ns)   Area (µm²)   Power (W)   Wirelength (m)   Congestion H (%)   Congestion V (%)
Block 1 RePlAce 374 233.7 1693139 3.70 52.14 1.82 0.06
Manual 136 47.6 1680790 3.74 51.12 0.13 0.03
Ours 84 23.3 1681767 3.59 51.29 0.34 0.03
Block 2 RePlAce 97 6.6 785655 3.52 61.07 1.58 0.06
Manual 75 98.1 830470 3.56 62.92 0.23 0.04
Ours 59 170 694757 3.13 59.11 0.45 0.03
Block 3 RePlAce 193 3.9 867390 1.36 18.84 0.19 0.05
Manual 18 0.2 869779 1.42 20.74 0.22 0.07
Ours 11 2.2 868101 1.38 20.80 0.04 0.04
Block 4 RePlAce 58 11.2 944211 2.21 27.37 0.03 0.03
Manual 58 17.9 947766 2.17 29.16 0.00 0.01
Ours 52 0.7 942867 2.21 28.50 0.03 0.02
Block 5 RePlAce 156 254.6 1477283 3.24 31.83 0.04 0.03
Manual 107 97.2 1480881 3.23 37.99 0.00 0.01
Ours 68 141.0 1472302 3.28 36.59 0.01 0.03
Table 2 also shows the results generated by human expert chip designers. Both our method and human experts consistently generate viable placements, meaning that they meet the timing and congestion design criteria. We also outperform or match manual placements in WNS, area, power, and wirelength. Furthermore, our end-to-end learning-based approach takes less than 6 hours, whereas the manual baseline involves a slow iterative optimization process with experts in the loop and can take multiple weeks.

5.5. Discussions

Opportunities for further optimization of our approach: There are multiple opportunities to further improve the quality of our method. For example, the process of standard cell partitioning, row and column selection, and selecting the order in which the macros are placed can all be further optimized. In addition, we would also benefit from a more optimized approach to standard cell placement. Currently, we use a force-directed method to place standard cells due to its fast runtime. However, we believe that more advanced techniques for standard cell placement, such as RePlAce (Cheng et al., 2019) and DREAMPlace (Lin et al., 2019), can yield more accurate standard cell placements to guide the policy network training. This is helpful because, if the policy network has a clearer signal on how its macro placements affect standard cell placement and final metrics, it can learn to make more optimal macro placement decisions.

Implications for a broader class of problems: This work is just one example of domain-adaptive policies for optimization and can be extended to other stages of the chip design process, such as architecture and logic design, synthesis, and design verification, with the goal of training ML models that improve as they encounter more instances of the problem. A learning-based method also enables further design space exploration and co-optimization within the cascade of tasks that compose the chip design process.

6. Conclusion

In this work, we target the complex and impactful problem of chip placement. We propose an RL-based approach that enables transfer learning, meaning that the RL agent becomes faster and better at chip placement as it gains experience on a greater number of chip netlists. We show that our method outperforms state-of-the-art baselines and can generate placements that are superior or comparable to those of human experts on modern accelerators. Our method is end-to-end and generates placements in under 6 hours, whereas the strongest baselines require human experts in the loop and take several weeks.

7. Acknowledgments

This project was a collaboration between Google Research and the Google Chip Implementation and Infrastructure (CI2) Team. We would like to thank Cliff Young, Ed Chi, Chip Stratakos, Sudip Roy, Amir Yazdanbakhsh, Nathan Myung-Chul Kim, Sachin Agarwal, Bin Li, Martin Abadi, Amir Salek, Samy Bengio, and David Patterson for their help and support.
References

Addanki, R., Venkatakrishnan, S. B., Gupta, S., Mao, H., and Alizadeh, M. Placeto: Learning generalizable device placement algorithms for distributed machine learning. CoRR, abs/1906.08879, 2019. URL http://arxiv.org/abs/1906.08879.

Agnihotri, A., Ono, S., and Madden, P. Recursive bisection placement: Feng shui 5.0 implementation details. In Proceedings of the International Symposium on Physical Design, pp. 230–232, 2005. doi: 10.1145/1055137.1055186.

Bo Hu and Marek-Sadowska, M. Multilevel fixed-point-addition-based VLSI placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(8):1188–1203, Aug 2005. ISSN 1937-4151. doi: 10.1109/TCAD.2005.850802.

Brenner, U., Struzyna, M., and Vygen, J. BonnPlace: Placement of leading-edge chips by advanced combinatorial algorithms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(9):1607–1620, September 2008. ISSN 0278-0070. doi: 10.1109/TCAD.2008.927674. URL https://doi.org/10.1109/TCAD.2008.927674.

Breuer, M. A. A class of min-cut placement algorithms. In Proceedings of the 14th Design Automation Conference, DAC '77, pp. 284–290. IEEE Press, 1977.

Chen, T., Jiang, Z., Hsu, T., Chen, H., and Chang, Y. NTUplace3: An analytical placer for large-scale mixed-size designs with preplaced blocks and density constraints. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(7):1228–1240, July 2008. ISSN 1937-4151. doi: 10.1109/TCAD.2008.923063.

Chen, T.-C., Jiang, Z.-W., Hsu, T.-C., Chen, H.-C., and Chang, Y.-W. A high-quality mixed-size analytical placer considering preplaced blocks and density constraints. In Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design, ICCAD '06, pp. 187–192, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933891.

Cheng, C., Kahng, A. B., Kang, I., and Wang, L. RePlAce: Advancing solution quality and routability validation in global placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(9):1717–1730, 2019.

Chung-Kuan Cheng and Kuh, E. S. Module placement based on resistive network optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 3(3):218–225, July 1984. ISSN 1937-4151. doi: 10.1109/TCAD.1984.1270078.

Fiduccia, C. M. and Mattheyses, R. M. A linear-time heuristic for improving network partitions. In 19th Design Automation Conference, pp. 175–181, June 1982. doi: 10.1109/DAC.1982.1585498.

Gilbert, E. N. and Pollak, H. O. Steiner minimal trees. SIAM Journal on Applied Mathematics, 16(1):1–29, 1968.

Hanan, M. and Kurtzberg, J. Placement techniques. In Design Automation of Digital Systems, 1972.

Hsu, M., Chang, Y., and Balabanov, V. TSV-aware analytical placement for 3D IC designs. In 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 664–669, June 2011.

Huang, Y., Xie, Z., Fang, G., Yu, T., Ren, H., Fang, S., Chen, Y., and Hu, J. Routability-driven macro placement with embedded CNN-based prediction model. In Teich, J. and Fummi, F. (eds.), Design, Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence, Italy, March 25-29, 2019, pp. 180–185. IEEE, 2019.

Kahng, A. B., Reda, S., and Qinke Wang. Architecture and details of a high quality, large-scale analytical placer. In ICCAD-2005, IEEE/ACM International Conference on Computer-Aided Design, pp. 891–898, Nov 2005. doi: 10.1109/ICCAD.2005.1560188.

Karypis, G. and Kumar, V. A hypergraph partitioning package. In hMETIS, 1998.

Kernighan, D. A procedure for placement of standard-cell VLSI circuits. In IEEE TCAD, 1985.

Kim, M. and Markov, I. L. ComPLx: A competitive primal-dual Lagrange optimization for global placement. In DAC Design Automation Conference 2012, pp. 747–755, June 2012.

Kim, M.-C., Lee, D.-J., and Markov, I. L. SimPL: An effective placement algorithm. In Proceedings of the International Conference on Computer-Aided Design, ICCAD '10, pp. 649–656. IEEE Press, 2010. ISBN 9781424481927.

Kim, M.-C., Viswanathan, N., Alpert, C. J., Markov, I. L., and Ramji, S. MAPLE: Multilevel adaptive placement for mixed-size designs. In Proceedings of the 2012 ACM International Symposium on Physical Design, ISPD '12, pp. 193–200, New York, NY, USA, 2012a. Association for Computing Machinery.
Kim, M.-C., Viswanathan, N., Alpert, C. J., Markov, I. L., and Ramji, S. MAPLE: Multilevel adaptive placement for mixed-size designs. In Proceedings of the 2012 ACM International Symposium on Physical Design, ISPD '12, pp. 193–200, New York, NY, USA, 2012b. Association for Computing Machinery. ISBN 9781450311670. doi: 10.1145/2160916.2160958. URL https://doi.org/10.1145/2160916.2160958.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science, 220(4598):671–680, 1983. ISSN 0036-8075. doi: 10.1126/science.220.4598.671. URL https://science.sciencemag.org/content/220/4598/671.

Lin, T., Chu, C., Shinnerl, J. R., Bustany, I., and Nedelchev, I. POLAR: Placement based on novel rough legalization and refinement. In Proceedings of the International Conference on Computer-Aided Design, ICCAD '13, pp. 357–362. IEEE Press, 2013. ISBN 9781479910694.

Lin, Y., Dhar, S., Li, W., Ren, H., Khailany, B., and Pan, D. Z. DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement. In Proceedings of the 56th Annual Design Automation Conference 2019, DAC '19, 2019.

Lu, J., Chen, P., Chang, C.-C., Sha, L., Huang, D. J.-H., Teng, C.-C., and Cheng, C.-K. ePlace: Electrostatics-based placement using fast Fourier transform and Nesterov's method. ACM Trans. Des. Autom. Electron. Syst., 20(2), 2015. ISSN 1084-4309.

Lu, J., Zhuang, H., Chen, P., Chang, H., Chang, C., Wong, Y., Sha, L., Huang, D., Luo, Y., Teng, C., and Cheng, C. ePlace-MS: Electrostatics-based placement for mixed-size circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(5):685–698, 2015.

Lu, J., Zhuang, H., Kang, I., Chen, P., and Cheng, C.-K. ePlace-3D: Electrostatics based placement for 3D-ICs. In Proceedings of the 2016 International Symposium on Physical Design, ISPD '16, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450340397. doi: 10.1145/2872334.2872361. URL https://doi.org/10.1145/2872334.2872361.

Nazi, A., Hang, W., Goldie, A., Ravi, S., and Mirhoseini, A. GAP: Generalizable approximate graph partitioning framework, 2019.

Obermeier, B., Ranke, H., and Johannes, F. Kraftwerk: A versatile placement approach. In ISPD, pp. 242–244, 2005. doi: 10.1145/1055137.1055190.

Paliwal, A. S., Gimeno, F., Nair, V., Li, Y., Lubin, M., Kohli, P., and Vinyals, O. REGAL: Transfer learning for fast optimization of computation graphs. ArXiv, abs/1905.02494, 2019.

Ren-Song Tsay, Kuh, E. S., and Chi-Ping Hsu. Proud: A sea-of-gates placement algorithm. IEEE Design & Test of Computers, 5(6):44–56, Dec 1988. ISSN 1558-1918. doi: 10.1109/54.9271.

Roy, J. A., Papa, D. A., and Markov, I. L. Capo: Congestion-Driven Placement for Standard-cell and RTL Netlists with Incremental Capability, pp. 97–133. Springer US, Boston, MA, 2007.

Sarrafzadeh, M., Wang, M., and Yang, X. Dragon: A Placement Framework, pp. 57–89. Springer, 2003. ISBN 978-1-4419-5309-4. doi: 10.1007/978-1-4757-3781-3_3.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017.

Sechen, C. and Sangiovanni-Vincentelli, A. L. TimberWolf3.2: A new standard cell placement and global routing package. In DAC, pp. 432–439. IEEE Computer Society Press, 1986. doi: 10.1145/318013.318083.

Shahookar, K. and Mazumder, P. VLSI cell placement techniques. ACM Comput. Surv., 23(2):143–220, June 1991. ISSN 0360-0300. doi: 10.1145/103724.103725. URL https://doi.org/10.1145/103724.103725.

Spindler, P., Schlichtmann, U., and Johannes, F. M. Kraftwerk2: A fast force-directed quadratic placement approach using an accurate net model. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(8):1398–1411, Aug 2008. ISSN 1937-4151. doi: 10.1109/TCAD.2008.925783.

Tao Luo and Pan, D. Z. DPlace2.0: A stable and efficient analytical placement based on diffusion. In 2008 Asia and South Pacific Design Automation Conference, pp. 346–351, March 2008. doi: 10.1109/ASPDAC.2008.4483972.

Viswanathan, N., Nam, G.-J., Alpert, C., Villarrubia, P., Ren, H., and Chu, C. RQL: Global placement via relaxed quadratic spreading and linearization. In Proceedings of the Design Automation Conference, pp. 453–458, 2007a. ISBN 978-1-59593-627-1. doi: 10.1145/1278480.1278599.

Viswanathan, N., Pan, M., and Chu, C. FastPlace: An Efficient Multilevel Force-Directed Placement Algorithm, pp. 193–228. Springer, 2007b. doi: 10.1007/978-0-387-68739-1_8.
Zhou, Y., Roy, S., Abdolrashidi, A., Wong, D., Ma, P. C., Xu, Q., Zhong, M., Liu, H., Goldie, A., Mirhoseini, A., and Laudon, J. GDP: Generalized device placement for dataflow graphs, 2019.