VLSI Placement Parameter Optimization Using Deep Reinforcement Learning
Anthony Agnesina, Kyungwook Chang, and Sung Kyu Lim
School of ECE, Georgia Institute of Technology, Atlanta, GA
[email protected]
ABSTRACT
The quality of placement is essential in the physical design flow. To achieve PPA goals, a human engineer typically spends a considerable amount of time tuning the multiple settings of a commercial placer (e.g. maximum density, congestion effort, etc.). This paper proposes a deep reinforcement learning (RL) framework to optimize the placement parameters of a commercial EDA tool. We build an autonomous agent that learns to tune parameters optimally without human intervention and domain knowledge, trained solely by RL from self-search. To generalize to unseen netlists, we use a mixture of handcrafted features from graph topology theory along with graph embeddings generated using unsupervised Graph Neural Networks. Our RL algorithms are chosen to overcome the sparsity of data and the latency of placement runs. Our trained RL agent achieves up to 11% and 2.5% wirelength improvements on unseen netlists compared with a human engineer and a state-of-the-art tool auto-tuner, in just one placement iteration (20× and 50× fewer iterations).

1 INTRODUCTION
In recent years, the combination of deep learning techniques with reinforcement learning (RL) principles has resulted in the creation of self-learning agents achieving superhuman performance at the games of Go, Shogi and Chess [13]. Deep RL is also used with large success in real-world applications such as robotics, finance, self-driving cars, etc.

The quality of VLSI placement is essential for the subsequent steps of physical design, with influential repercussions on design quality and design closure. Recent studies [10] however show that existing placers cannot produce near-optimal solutions. The goal of a placement engine is to assign locations to the cells inside the chip's area. The most common target of state-of-the-art placers is to minimize the total interconnect length, i.e. the estimated half-perimeter wirelength (HPWL) of the placed cell locations.

The algorithms implemented inside EDA tools have parameter settings that users can modify to achieve the desired power-performance-area (PPA). In the authors' experience, more time is spent on tuning and running a commercial placer than on creating a first version of the design. Tools and flows have steadily increased in complexity, with modern place-and-route tools offering more than 10,000 parameter settings. Expert users are required in particular for the latest technology nodes, with increased cost and risk. Indeed, as the design space of the parameters is too big and complex to be explored by a human engineer alone, one usually relies on expertise and domain knowledge when tuning. However, the correlations between the different parameters and the resulting PPA may be complex or nonintuitive. Placement engines may exhibit nondeterministic behaviors as they heavily rely on handcrafted rules and metaheuristics. Moreover, the advertised goal of a parameter may not always directly translate onto the targeted metric.

A state-of-the-art tool auto-tuner [1] is used in EDA, such as in [18] and [19], to optimize Quality of Results (QoR) in the FPGA and high-level synthesis (HLS) compilation flows. It leverages a Multi-Armed Bandit (MAB) to organize a set of classical optimization techniques and efficiently explore the design space. However, these techniques rely on heuristics that are too general and do not consider the specificities of each netlist. Therefore, each new netlist requires starting parameter exploration over. We overcome this limitation in our RL agent by first encoding the netlist information using a mixture of handcrafted graph features and graph neural network embeddings. This helps generalize the tuning process from netlist to netlist, saving long and costly placement iterations.

The goal of our RL framework is to learn an optimization process that finds placement parameters minimizing wirelength after placement. The main contributions of this paper are:

• We reduce the significant time expense of VLSI development by applying deep RL to pre-set the placement parameters of a commercial EDA tool. To the best of our knowledge, this is the first work on RL applied to placement parameter optimization.
• Our RL algorithm overcomes the sparsity of data and the latency of design tool runs using multiple environments collecting experiences in parallel.
• We use a mixture of features relative to graph topological characteristics along with graph embeddings generated by a graph neural network to train an RL agent capable of generalizing its tuning process to unseen netlists.
• We build an autonomous agent that iteratively learns to tune parameter settings to optimize placement, without supervised samples. We achieve better wirelengths on unseen netlists than a state-of-the-art auto-tuner, without any additional training and in just one placement iteration.

This material is based upon work supported by the National Science Foundation under Grant No. CNS 16-24731 and the industry members of the Center for Advanced Electronics in Machine Learning.
2.5 Our Reward Structure
In order to learn with a single RL agent across various netlists with different wirelengths, we cannot define a reward directly linear with HPWL. Thus, to help convergence, we adopt a normalized reward function which renders the magnitude of the value approximations similar among netlists:

$$R_t := \frac{HPWL_{\text{Human Baseline}} - HPWL_t}{HPWL_{\text{Human Baseline}}}. \qquad (6)$$
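As an illustration, the normalized reward of Eq. (6) can be computed as below; this is a minimal sketch (the function name is ours, not the authors' code), where a positive value means the current placement beats the human-tuned baseline of the same netlist.

```python
def normalized_reward(hpwl_t: float, hpwl_human_baseline: float) -> float:
    """Eq. (6): relative HPWL improvement over the human-tuned baseline
    placement of the same netlist (positive = better than the baseline)."""
    return (hpwl_human_baseline - hpwl_t) / hpwl_human_baseline

# e.g. a placement with HPWL 4.99 m against a 5.26 m human baseline gives ~ +0.051
```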
The goal of this optimization problem is to learn directly which action a to take in a specific state s. We represent the parametrized policy π_θ by a deep neural network. The main reasons for choosing this framework are as follows:

• It is model-free, which is important as the placer tool environment is very complex and may be hard to model.
• Our intuition is that the optimal policy may be simple to learn and represent (e.g. keep increasing the effort), while the value of a parameter setting may not be trivial or may change significantly based on the observation.
• Policy optimization often shows good convergence properties.

Figure 4: Actor-critic framework. The critic learns about and critiques the policy currently being followed by the actor.
3.3 How to Learn: the Actor-Critic Framework
In our chosen architecture we learn a policy that optimizes the value while learning the value simultaneously. For learning, it is often beneficial to use as much knowledge observed from the environment as possible and to hang other predictions off it, rather than solely predicting the policy. This type of framework, called actor-critic, is shown in Figure 4. The policy is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor.

Actor-critic algorithms combine value-based and policy-based methods. Value-based algorithms learn to approximate values v_w(s) ≈ v_π(s) by exploiting the Bellman equation:

$$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma\, v_\pi(s_{t+1}) \mid s_t = s\right] \qquad (8)$$

which is used in temporal difference (TD) learning:

$$\Delta w_t = \left(R_{t+1} + \gamma\, v_w(s_{t+1}) - v_w(s_t)\right) \nabla_w v_w(s_t). \qquad (9)$$

On the other hand, policy-based algorithms update a parameterized policy π_θ(a_t|s_t) directly through stochastic gradient ascent in the direction of the value:

$$\Delta \theta_t = G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t). \qquad (10)$$

In actor-critic, the policy updates are computed from incomplete episodes by using truncated returns that bootstrap on the value estimate at state s_{t+n} according to v_w:

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n v_w(s_{t+n}). \qquad (11)$$

This reduces the variance of the updates and propagates rewards faster. The variance can be further reduced using state values as a baseline in the policy updates, as in advantage actor-critic updates:

$$\Delta \theta_t = \left(G_t^{(n)} - v_w(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t). \qquad (12)$$

The critic updates the parameters w of v_w by n-step TD (Eq. 9) and the actor updates the parameters θ of π_θ in the direction suggested by the critic by policy gradient (Eq. 12). In this work we use the advantage actor-critic method, called A2C [12], which was shown to produce excellent results on diverse environments. As shown in Equation 12, an advantage function formed as the difference between the returns and a baseline state-value estimate is used instead of raw returns. The advantage can be thought of as a measure of how good a given action is compared to some average.
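The following minimal sketch (Python with NumPy; the function names are ours, not the authors' implementation) shows how the n-step return of Eq. (11) and the advantage-weighted update directions of Eqs. (9) and (12) fit together for one rollout segment.

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma):
    """Eq. (11): G_t^(n) = sum_k gamma^k R_{t+k+1} + gamma^n v_w(s_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):          # fold the rollout rewards in backwards
        g = r + gamma * g
    return g

def a2c_update_directions(rewards, value_s_t, bootstrap_value,
                          grad_log_pi, grad_v, gamma=0.997):
    """Advantage actor-critic: actor direction of Eq. (12) and the matching
    n-step critic direction (the 1-step case reduces to Eq. (9))."""
    g_n = n_step_return(rewards, bootstrap_value, gamma)
    advantage = g_n - value_s_t                          # G_t^(n) - v_w(s_t)
    delta_theta = advantage * np.asarray(grad_log_pi)    # policy (actor) step
    delta_w = advantage * np.asarray(grad_v)             # value (critic) step
    return delta_theta, delta_w
```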
Figure 5: Synchronous parallel learner. The global network sends actions to the actors through the step model. Each actor gathers experiences from its own environment.

3.4 Synchronous Actor/Critic Implementation
The main issues plaguing the use of RL in EDA are the latency of tool runs (it takes minutes to hours to perform one placement) as well as the sparsity of data (there is no database of millions of netlists, placed designs or layouts). To solve both issues, we implement a parallel version of A2C, as depicted in Figure 5. In this implementation, an agent learns from the experiences of multiple Actors interacting in parallel with their own copy of the environment. This configuration increases the throughput of acting and learning and helps decorrelate samples during training for data efficiency [6].

In parallel training setups, the learning updates may be applied synchronously or asynchronously. We use a synchronous version, i.e. a deterministic implementation that waits for each Actor to finish its segment of experience (according to the current policy provided by the step model) before performing a single batch update to the weights of the network. One advantage is that it provides larger batch sizes, which are used more effectively by computing resources.

The parallel training setup does not modify the equations presented before. The gradients are simply accumulated across all the environments' batches.
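A sketch of this synchronous scheme is shown below, assuming each environment wraps one placer run per step; `envs`, `step_model` and `learner` are hypothetical objects standing in for the actual tool wrappers, not part of the authors' code.

```python
def collect_synchronous_batch(envs, states, step_model, n_steps=6):
    """Every actor advances n_steps with the same frozen step model; the learner
    then applies one batched update (e.g. 16 envs x 6 steps = 96 transitions)."""
    transitions = []
    for _ in range(n_steps):
        actions = [step_model.act(s) for s in states]             # current policy
        results = [env.step(a) for env, a in zip(envs, actions)]  # waits for every placer
        next_states = [r[0] for r in results]
        rewards = [r[1] for r in results]
        transitions.append((list(states), actions, rewards))
        states = next_states
    return transitions, states

# training loop (schematic):
#   batch, states = collect_synchronous_batch(envs, states, step_model)
#   learner.update(batch)               # gradients accumulated over all environments
#   step_model.load_weights(learner)    # refresh the acting policy
```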
3.5 A Two-Head Network Architecture
The actor-critic framework uses both policy and value models. The full agent network can be represented as a deep neural network (π_θ, v_w) = f(s). This neural network takes the state s = (p ∘ n), made of parameter values p and netlist features n, and outputs a vector of action probabilities with components π_θ(a) for each action a, and a scalar value v_w(s) estimating the expected cumulative reward G from state s.

The policy tells how to modify a placement parameter setting and the value network tells us how good this current setting is. We share the body of the network to allow value and policy predictions to inform one another. The parameters are adjusted by gradient ascent on a loss function that sums over the losses of the policy and the value plus a regularization term, whose gradient is defined as follows, as in [16]:

$$\underbrace{\left(G_t^{(n)} - v_w(s_t)\right)\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)}_{\text{policy gradient}}
+ \underbrace{\beta \left(G_t^{(n)} - v_w(s_t)\right)\nabla_{w} v_w(s_t)}_{\text{value estimation gradient}}
+ \underbrace{\eta \sum_a \pi_{\theta}(a \mid s_t) \log \pi_{\theta}(a \mid s_t)}_{\text{entropy regularization}}. \qquad (13)$$

The entropy regularization pushes entropy up to encourage exploration, and β and η are hyper-parameters that balance the importance of the loss components.
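As a hedged illustration of the three terms behind Eq. (13), the sketch below writes them in the usual "minimize a surrogate loss" form for a single transition; the advantage is treated as a constant in the policy term, the default β and η values are those reported in Section 4.1, and the function and variable names are ours.

```python
import numpy as np

def a2c_loss(policy_probs, action, g_n, value, beta=0.25, eta=0.01):
    """Surrogate loss combining the policy-gradient, value-estimation and
    entropy terms of Eq. (13) for one transition."""
    advantage = g_n - value                                   # G_t^(n) - v_w(s_t)
    policy_loss = -np.log(policy_probs[action]) * advantage   # policy-gradient term
    value_loss = beta * 0.5 * advantage ** 2                  # value estimation term
    entropy = -np.sum(policy_probs * np.log(policy_probs + 1e-8))
    entropy_bonus = -eta * entropy                            # pushes entropy up
    return policy_loss + value_loss + entropy_bonus
```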
Figure 6: Overall network architecture of our agent. The combination of an LSTM with an Attention mechanism enables the learning of a complex recurrent optimization process. Table 4 provides the details of the sub-networks used here.

The complete architecture of our deep neural network is shown in Figure 6. To compute value and policy, the concatenation of placement parameters with graph-extracted features is first passed through two feed-forward fully-connected (FC) layers with tanh activations, followed by a FC linear layer. This is followed by a Long Short-Term Memory (LSTM) module with layer normalization and 16 standard hidden units with forget gate. The feed-forward FC layers have no memory; by introducing an LSTM, which is a recurrent layer, the model can base its actions on previous states. This is motivated by the fact that traditional optimization methods are based on recurrent approaches. Moreover, we add a sequence-to-one global attention mechanism [9], inspired from state-of-the-art Natural Language Processing architectures, to help the recurrent layer (RNN) focus on important parts of the recursion. Let h_t be the hidden state of the RNN. Then the attention alignment weights a_t with each source hidden state h_s are defined as:

$$a_t(s) = \frac{\exp\, \mathrm{score}(h_t, h_s)}{\sum_{s'} \exp\, \mathrm{score}(h_t, h_{s'})} \qquad (14)$$

where the alignment score function is:

$$\mathrm{score}(h_t, h_s) = h_t^{\top} W_a h_s. \qquad (15)$$

The global context vector:

$$c_t = \sum_s a_t(s)\, h_s \qquad (16)$$

is combined with the hidden state to produce an attentional hidden state as follows:

$$\tilde{h}_t = \tanh\!\left(W_c \left[c_t \circ h_t\right]\right). \qquad (17)$$

This hidden state is then fed to the two heads of the network, both composed of two FC layers, with an output softmax layer for the policy and an output linear layer for the value. The parameters of our network are summarized in Table 4.

Table 4: Details of the sub-networks of our agent (see Figure 6). The 79 input features decompose as follows: 32 from GNN, 20 from Table 2, 24 one-hot encodings for the enum/bool types from Table 1, and 3 integer types from Table 1.

| Part | Input | Hidden | Output |
|---|---|---|---|
| 1. Shared Body | 79 | (64, 32) (tanh) | 16 (linear) |
| 2. LSTM (6 unroll) | 16 | 16 | 16 × 6 |
| 3. Attention | 16 × 6 | W_a, W_c | 16 |
| 4. Policy | 16 | (32, 32) (tanh) | 11 (softmax) |
| 5. Value | 16 | (32, 16) (tanh) | 1 (linear) |
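To make the attention step concrete, here is a NumPy sketch of Eqs. (14)-(17) over the 6 unrolled LSTM hidden states, using the dimensions of Table 4 (the random matrices stand in for the learned W_a and W_c).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_t, H_s, W_a, W_c):
    """h_t: (16,) current hidden state; H_s: (6, 16) source hidden states."""
    scores = H_s @ W_a.T @ h_t                  # score(h_t, h_s) = h_t^T W_a h_s  (Eq. 15)
    a_t = softmax(scores)                       # alignment weights                (Eq. 14)
    c_t = a_t @ H_s                             # global context vector            (Eq. 16)
    return np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional hidden state   (Eq. 17)

rng = np.random.default_rng(0)
h_t, H_s = rng.standard_normal(16), rng.standard_normal((6, 16))
W_a, W_c = rng.standard_normal((16, 16)), rng.standard_normal((16, 32))
h_tilde = global_attention(h_t, H_s, W_a, W_c)  # (16,) vector fed to the two heads
```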
3.6 Our Self-Play Strategy
Inspired by AlphaZero [13], our model learns without any supervised samples. We do not use expert knowledge to pre-train the network using known good parameter sets or actions. While the agent makes random moves at first, the idea is that by relying on zero human bias, the agent may learn counter-intuitive moves and achieve superhuman tuning capabilities.

4 EXPERIMENTAL RESULTS
To train and test our agent, we select 15 benchmark designs from OpenCores, the ISPD 2012 contest, and two RISC-V single cores, presented in Table 5. We use the first eleven for training and the last four for testing. We synthesize the RTL netlists using Synopsys Design Compiler. We use a TSMC 28nm technology node. The placements are done with Cadence Innovus 17.1. The aspect ratio of the floorplans is fixed to 1 and appropriate fixed clock frequencies are selected. Memory macros of the RocketTile and OpenPiton Core benchmarks are pre-placed by hand. For successful placements, a lower bound of total cell area divided by floorplan area is set on the parameter max density. IO pins are placed automatically by the tool between metals 4 and 6.
Table 5: Benchmark statistics based on a commercial 28nm technology. RCC₃ is the Rich Club Coefficient (×10⁻⁴), LL is the maximum logic level and Sp. R. denotes the Spectral Radius. RT is the average placement runtime using Innovus (in minutes).

| Name | #cells | #nets | #IO | RCC₃ | LL | Sp. R. | RT |
|---|---|---|---|---|---|---|---|
| training set | | | | | | | |
| PCI | 1.2K | 1.4K | 361 | 510 | 17 | 25.6 | 0.5 |
| DMA | 10K | 11K | 959 | 65 | 25 | 26.4 | 1 |
| B19 | 33K | 34K | 47 | 19 | 86 | 36.1 | 2 |
| DES | 47K | 48K | 370 | 14 | 16 | 25.6 | 2 |
| VGA | 52K | 52K | 184 | 15 | 25 | 26.5 | 3 |
| ECG | 83K | 84K | 1.7K | 7.5 | 23 | 26.8 | 4 |
| Rocket | 92K | 95K | 377 | 8.1 | 42 | 514.0 | 6 |
| AES | 112K | 112K | 390 | 5.8 | 14 | 102.0 | 6 |
| Nova | 153K | 155K | 174 | 4.6 | 57 | 11,298 | 9 |
| Tate | 187K | 188K | 1.9K | 3.2 | 21 | 25.9 | 10 |
| JPEG | 239K | 267K | 67 | 2.8 | 30 | 287.0 | 12 |
| test set (unseen netlists) | | | | | | | |
| LDPC | 39K | 41K | 4.1K | 18 | 19 | 328.0 | 2 |
| OpenPiton | 188K | 196K | 1.6K | 3.9 | 76 | 3940 | 19 |
| Netcard | 300K | 301K | 1.8K | 2.9 | 32 | 27.3 | 24 |
| Leon3 | 326K | 327K | 333 | 2.4 | 44 | 29.5 | 26 |

Table 6: Comparison of half-perimeter wirelength (HPWL) after placement on the training netlists among human design, Multi-Armed Bandit (MAB) [1], and our RL-based method. HPWL is reported in m. Δ denotes the percentage improvement over the human design.

| Netlist | human | MAB [1] (Δ%) | RL (Δ%) |
|---|---|---|---|
| PCI | 0.010 | 0.0092 (−8.0%) | 0.0092 (−8.0%) |
| DMA | 0.149 | 0.139 (−6.7%) | 0.135 (−9.4%) |
| B19 | 0.30 | 0.28 (−6.7%) | 0.28 (−6.7%) |
| DES | 0.42 | 0.37 (−11.9%) | 0.36 (−14.3%) |
| VGA | 1.52 | 1.40 (−7.9%) | 1.41 (−7.2%) |
| ECG | 0.72 | 0.65 (−9.7%) | 0.68 (−5.5%) |
| Rocket | 1.33 | 1.27 (−4.5%) | 1.20 (−9.8%) |
| AES | 1.49 | 1.44 (−2.7%) | 1.40 (−6.0%) |
| AVC-Nova | 1.59 | 1.49 (−6.3%) | 1.46 (−8.2%) |
| Tate | 1.53 | 1.42 (−7.2%) | 1.45 (−5.2%) |
| JPEG | 2.14 | 1.96 (−8.4%) | 1.88 (−12.2%) |

Figure 7: Training our agent for 150 iterations (= 14,400 placements). The reward is an aggregate reward from all training netlists. Training time is within 100 hours. Human baseline: reward = 0.

4.1 RL Network Training Setting
We define our environment using the OpenAI Gym interface [2] and implement our RL agent network in TensorFlow. We use 16 parallel environments (16 threads) in our synchronous A2C framework. We perform tuning of the hyperparameters of our network using Bayesian Optimization, which results in stronger agents. The learning curve of our A2C agent in our custom Innovus environment is shown in Figure 7. We observe that the mean reward across all netlists converges asymptotically to a value of 6.8%, meaning wirelength is reduced on average by 6.8%.

Training over 150 iterations (= 14,400 placements) takes about 100 hours. Note that 99% of that time is spent performing the placements, while updating the neural network weights takes less than 20 minutes. Without parallelization, training over the same number of placements would take 16 × 100 hr ≈ 67 days.

An explained variance of 0.67 shows that the value function explains the observed returns relatively well. We use a discount factor γ = 0.997, a coefficient for the value loss β = 0.25, an entropy cost of η = 0.01, and a learning rate of 0.0008. We use standard non-centered RMSProp as the gradient ascent optimizer. The weights are initialized using orthogonal initialization. The learning updates are batched across rollouts of 6 actor steps for 16 parallel copies of the environment, totalling a mini-batch size of 96. All experiments use gradient clipping by norm to avoid exploding gradients (a phenomenon common with LSTMs), with a maximum norm of 0.5. Note that with 14,400 placements, we only explore 10⁻⁶% of the total parameter space.

For a given environment, we select a random netlist and we always start with the corresponding human parameter set to form an initial state for each episode. Each environment is reset for episodes of 16 steps. Training on K environments in parallel, each performing a placement on a different netlist, the reward signal is averaged over the K netlists for the network updates, which decreases the reward variance and ultimately helps the network generalize to unseen netlists, as prescribed in [11].
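For reference, the training configuration reported in this subsection can be collected in one place (the key names are ours, not the authors' code):

```python
a2c_training_config = {
    "discount_gamma": 0.997,
    "value_loss_coef_beta": 0.25,
    "entropy_cost_eta": 0.01,
    "learning_rate": 0.0008,
    "optimizer": "non-centered RMSProp",
    "weight_init": "orthogonal",
    "rollout_length": 6,            # actor steps per update
    "num_parallel_envs": 16,        # one Innovus environment per thread
    "mini_batch_size": 16 * 6,      # = 96 transitions per update
    "grad_clip_max_norm": 0.5,
    "episode_length": 16,           # steps before an environment is reset
    "initial_state": "human parameter set of the sampled netlist",
}
```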
4.2 Netlist Training Results
For comparison, we use the state-of-the-art tool auto-tuner OpenTuner [1], which we adapt for Innovus, as our baseline. In this framework, a Multi-Armed Bandit (MAB) selects at each iteration a search technique among Genetic Algorithm, Differential Evolution, Simulated Annealing, Torczon hillclimber, Nelder-Mead and Particle Swarm Optimization, based on a score that forces a trade-off between exploitation (use the arm that worked best) and exploration (use a more rarely used arm). We run the search techniques in parallel, evaluating 16 candidate parameter sets at the same time. We run the tuner on the eleven training netlists and record the best achieved wirelength, performing 1,300 placements per netlist so that the total number of placements equals that of the RL agent training.

Table 6 shows the best wirelengths found by the MAB as well as the RL agent during training. The human baseline is set by an experienced engineer who tuned the parameters for a day. We see that the RL agent outperforms the MAB on most netlists, reducing HPWL by 9.8% on the Rocket Core benchmark. All in all, both methods improve considerably on the human baseline.
Figure 8: OpenPiton Core layouts after placement and routing. (a) Human design (took 7 hrs): HPWL 5.26 m, WL 6.31 m. (b) Multi-Armed Bandit (took 16 hrs): HPWL 5.11 m, WL 6.24 m. (c) Reinforcement learning (took 20 min): HPWL 4.99 m, WL 6.10 m.
Table 7: HPWL (in m) after placement on the four unseen test netlists, with the number of placement iterations (#iter.) used by each method. Δ denotes the percentage improvement over the human design.

| Netlist | human | #iter. | MAB [1] | #iter. | RL | #iter. |
|---|---|---|---|---|---|---|
| LDPC | 1.14 | 20 | 1.04 (−8.8%) | 50 | 1.02 (−10.5%) | 1 |
| OpenPiton | 5.26 | 20 | 5.11 (−2.9%) | 50 | 4.99 (−5.1%) | 1 |
| Netcard | 4.88 | 20 | 4.45 (−8.8%) | 50 | 4.34 (−11.1%) | 1 |
| Leon3 | 3.52 | 20 | 3.37 (−4.3%) | 50 | 3.29 (−6.5%) | 1 |

Table 8: Best placement parameters found for the Netcard benchmark.

| Name | MAB [1] | RL |
|---|---|---|
| eco max distance | 54 | 81 |
| legalization gap | 1 | 5 |
| max density | 0.92 | 0.94 |
| eco priority | eco | fixed |
| activity power driven | none | none |
| wire length opt | high | none |
| blockage channel (for macros) | none | soft |
| timing effort | medium | high |
| clock power driven | none | none |
| congestion effort | low | high |
| clock gate aware | true | true |
| uniform density | false | false |

Table 9: PPA comparison of the routed test designs among human design, MAB [1], and our RL-based method.

| ckt | metric | human | MAB [1] | RL |
|---|---|---|---|---|
| LDPC | WL (m) | 1.65 | 1.57 | 1.53 |
| | WNS (ns) | −0.005 | −0.001 | −0.001 |
| | Power (mW) | 162.10 | 156.49 | 153.77 |
| OpenPiton | WL (m) | 6.31 | 6.24 | 6.10 |
| | WNS (ns) | −0.003 | −0.001 | 0 |
| | Power (mW) | 192.08 | 190.95 | 189.72 |
| Netcard | WL (m) | 8.01 | 7.44 | 7.15 |
| | WNS (ns) | −0.006 | −0.007 | −0.004 |
| | Power (mW) | 174.05 | 170.51 | 167.70 |
| Leon3 | WL (m) | 5.66 | 5.53 | 5.41 |
| | WNS (ns) | −0.005 | −0.001 | −0.003 |
| | Power (mW) | 156.83 | 156.00 | 155.51 |

4.3 Unseen Netlist Testing Results
To verify the ability of our agent to generalize, we check its performance on the four unseen test netlists. Without any additional training (the network parameters are fixed), the RL agent iteratively improves a random initial parameter set by selecting the action a with the highest predicted π_θ(a) value, as described in Equation 5. Because our actions are deterministic, the resulting set of parameters is known and fed back to the network. We repeat this process until the estimated value decreases for 3 consecutive updates and backtrack to the settings with the highest value. This way a "good" candidate parameter set is found without performing any placement. We then perform a unique placement with that parameter set and record the obtained wirelength.
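A minimal sketch of this greedy, placement-free tuning loop is given below; `agent`, `apply_action` and the stopping constant are hypothetical wrappers of ours, and the single Innovus placement is only run once a candidate setting has been selected.

```python
def tune_parameters(agent, netlist_features, init_params, patience=3):
    """Greedily follow argmax_a pi_theta(a | state) on the frozen network, stop
    after the predicted value drops for `patience` consecutive updates, and
    backtrack to the highest-value parameter setting seen."""
    state = (init_params, netlist_features)
    best_state, best_value = state, agent.value(state)
    drops = 0
    while drops < patience:
        action = agent.best_action(state)      # argmax of the policy head
        state = apply_action(state, action)    # deterministic parameter change
        value = agent.value(state)             # predicted quality of the new setting
        if value > best_value:
            best_state, best_value, drops = state, value, 0
        else:
            drops += 1
    return best_state                          # placed once with Innovus afterwards
```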
In comparison, the MAB needs the reward signal to propose a new set of parameters and therefore needs actual placements to be performed by the tool. We track the best wirelength found, allotting 50 sequential iterations to the MAB.

The best wirelengths found by our RL agent and the MAB on all four test netlists are shown in Table 7. We observe that our RL agent consistently achieves superior wirelengths while performing only one placement. Table 8 shows the best parameter settings found by the MAB and the RL agent on Netcard. Interestingly enough, we can see how the two optimizers found separate ways to minimize HPWL: WL-driven vs. congestion-driven.

4.4 PPA Comparison After Routing
To confirm that the improvement in HPWL after placement translates into one in the final routed wirelength, we perform routing of the placed designs. They are all routed with the same settings, where metal layers 1 to 6 are used. The layouts of the OpenPiton Core are shown in Figure 8. We verify that the target frequency is achieved and that routing succeeded without congestion issues or DRC violations. The PPA of the routed designs is summarized in Table 9. We observe that the HPWL reduction after placement is conserved after routing on all test designs, reaching 7.3% and 11% wirelength savings on LDPC and Netcard compared with the human baseline. Footprints are 74,283 μm² for LDPC, 1,199,934 μm² for OpenPiton, 728,871 μm² for Netcard and 894,115 μm² for Leon3.

5 CONCLUSIONS
Our placement parameter optimizer based on deep RL provides pre-sets of improved parameter settings without human intervention. We believe this is an important step in the shift from the "CAD" mindset to the "Design Automation" mindset. We use a novel representation to formulate states and actions applied to placement optimization. Our experimental results show that our agent generalizes well to unseen netlists and consistently reduces wirelength compared with a state-of-the-art tool auto-tuner, in only one iteration and without any additional training.
REFERENCES
[1] J. Ansel et al. OpenTuner: An Extensible Framework for Program Autotuning. PACT '14.
[2] G. Brockman et al. OpenAI Gym, 2016.
[3] C. Bron and J. Kerbosch. Algorithm 457: Finding All Cliques of an Undirected Graph. Commun. ACM, 1973.
[4] V. Colizza et al. Detecting rich-club ordering in complex networks. Nature Physics, 2006.
[5] O. Coudert. Exact Coloring Of Real-life Graphs Is Easy. DAC '97.
[6] L. Espeholt et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018.
[7] W. Hamilton et al. Inductive Representation Learning on Large Graphs. NIPS '17.
[8] R. B. Lehoucq and D. C. Sorensen. Deflation Techniques for an Implicitly Restarted Arnoldi Iteration. SIAM J. Matrix Analysis Applications, 17:789–821, 1996.
[9] M.-T. Luong et al. Effective Approaches to Attention-based Neural Machine Translation.
[10] I. L. Markov et al. Progress and Challenges in VLSI Placement Research. ICCAD '12.
[11] A. Mirhoseini et al. Chip Placement with Deep Reinforcement Learning, 2020.
[12] V. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning, 2016.
[13] D. Silver et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017.
[14] R. E. Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM J. Comput., 1:146–160, 1972.
[15] L. van der Maaten and G. E. Hinton. Visualizing Data using t-SNE. 2008.
[16] O. Vinyals et al. StarCraft II: A New Challenge for Reinforcement Learning, 2017.
[17] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 1998.
[18] C. Xu et al. A Parallel Bandit-Based Approach for Autotuning FPGA Compilation. FPGA '17.
[19] C. H. Yu et al. S2FA: An Accelerator Automation Framework for Heterogeneous Computing in Datacenters. DAC '18.