
VLSI Placement Parameter Optimization using Deep Reinforcement Learning
Anthony Agnesina, Kyungwook Chang, and Sung Kyu Lim
School of ECE, Georgia Institute of Technology, Atlanta, GA
[email protected]

ABSTRACT
The quality of placement is essential in the physical design flow. To achieve PPA goals, a human engineer typically spends a considerable amount of time tuning the multiple settings of a commercial placer (e.g. maximum density, congestion effort, etc.). This paper proposes a deep reinforcement learning (RL) framework to optimize the placement parameters of a commercial EDA tool. We build an autonomous agent that learns to tune parameters optimally without human intervention and domain knowledge, trained solely by RL from self-search. To generalize to unseen netlists, we use a mixture of handcrafted features from graph topology theory along with graph embeddings generated using unsupervised Graph Neural Networks. Our RL algorithms are chosen to overcome the sparsity of data and the latency of placement runs. Our trained RL agent achieves up to 11% and 2.5% wirelength improvements on unseen netlists compared with a human engineer and a state-of-the-art tool auto-tuner, respectively, in just one placement iteration (20x and 50x fewer iterations).

This material is based upon work supported by the National Science Foundation under Grant No. CNS 16-24731 and the industry members of the Center for Advanced Electronics in Machine Learning.

ICCAD '20, November 2-5, 2020, Virtual Event, USA. (c) 2020 Association for Computing Machinery. https://doi.org/10.1145/3400302.3415690

1 INTRODUCTION
In recent years, the combination of deep learning techniques with reinforcement learning (RL) principles has resulted in the creation of self-learning agents achieving superhuman performance at the games of Go, Shogi, and Chess [13]. Deep RL is also used with large success in real-world applications such as robotics, finance, self-driving cars, etc.

The quality of VLSI placement is essential for the subsequent steps of physical design, with significant repercussions on design quality and design closure. Recent studies [10], however, show that existing placers cannot produce near-optimal solutions. The goal of a placement engine is to assign locations to the cells inside the chip's area. The most common target of state-of-the-art placers is to minimize the total interconnect length, i.e. the estimated half-perimeter wirelength (HPWL) computed from the placed cell locations.

The algorithms implemented inside EDA tools have parameter settings that users can modify to achieve the desired power-performance-area (PPA). In the authors' experience, more time is spent on tuning and running a commercial placer than on creating a first version of the design. Tools and flows have steadily increased in complexity, with modern place and route tools offering more than 10,000 parameter settings. Expert users are required in particular for the latest technology nodes, with increased cost and risk. Indeed, as the design space of the parameters is too big and complex to be explored by a human engineer alone, one usually relies on expertise and domain knowledge when tuning. However, the correlations between the different parameters and the resulting PPA may be complex or nonintuitive. Placement engines may exhibit nondeterministic behaviors, as they heavily rely on handcrafted rules and metaheuristics. Moreover, the advertised goal of a parameter may not always translate directly into the targeted metric.

A state-of-the-art tool auto-tuner [1] is used in EDA, such as in [18] and [19], to optimize Quality of Results (QoR) in the FPGA and high-level synthesis (HLS) compilation flows. It leverages a Multi-Armed Bandit (MAB) to organize a set of classical optimization techniques and efficiently explore the design space. However, these techniques rely on heuristics that are too general and do not consider the specificities of each netlist. Therefore, parameter exploration must start over for each new netlist. We overcome this limitation in our RL agent by first encoding the netlist information using a mixture of handcrafted graph features and graph neural network embeddings. This helps generalize the tuning process from netlist to netlist, saving long and costly placement iterations.

The goal of our RL framework is to learn an optimization process that finds placement parameters minimizing wirelength after placement. The main contributions of this paper are:

• We reduce the significant time expense of VLSI development by applying deep RL to pre-set the placement parameters of a commercial EDA tool. To the best of our knowledge, this is the first work on RL applied to placement parameter optimization.
• Our RL algorithm overcomes the sparsity of data and the latency of design tool runs using multiple environments collecting experiences in parallel.
• We use a mixture of features relative to graph topological characteristics along with graph embeddings generated by a graph neural network to train an RL agent capable of generalizing its tuning process to unseen netlists.
• We build an autonomous agent that iteratively learns to tune parameter settings to optimize placement, without supervised samples. We achieve better wirelengths on unseen netlists than a state-of-the-art auto-tuner, without any additional training and in just one placement iteration.

2 RL ENVIRONMENT

2.1 Overview
We build an RL agent that tunes the parameter settings of the placement tool autonomously, with the objective of minimizing wirelength. Our RL problem consists of the following four key elements:

• States: the set of all netlists in the world and all possible parameter setting combinations (P) from the placement tool (e.g. Cadence Innovus or Synopsys ICC2). A single state s consists of a unique netlist and a current parameter set.
• Actions: the set of actions that the agent can use to modify the current parameters. An action a changes the setting of a subset of parameters.
• State transition: given a state s_t and an action, the next state s_{t+1} is the same netlist with updated parameters.
• Reward: minus the HPWL output from the commercial EDA placement tool. The reward increases if the action improved the parameter settings in terms of minimizing wirelength.

Figure 1: Reinforcement learning agent-environment interaction in the proposed methodology. The agent observes the current state (netlist and parameters) and sends an action (new parameters) to the placement engine; the environment returns the reward R_t = -HPWL and the new state s_{t+1}.

As depicted in Figure 1, in RL an agent learns from interacting with its environment over a number of discrete time steps. At each time step t the agent receives a state s_t and selects an action a_t from a set of possible actions A according to its policy π, where π maps states to actions. In return, the agent receives a reward signal R_t and transitions to the next state s_{t+1}. This process continues until the agent reaches a terminal state, after which the process restarts. The goal of the agent is to maximize its long-term return:

    G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}    (1)

where γ is a factor discounting future rewards.

An optimal policy is one that maximizes the expected returns or values. The value function v_π(s_t) is the expected return starting from state s_t when following policy π:

    v_\pi(s_t) = \mathbb{E}_\pi[G_t \mid s_t] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, s_t\Big].    (2)

2.2 Our RL Settings
We formulate the placement parameter optimization task led by an RL agent as follows:

RL Placement Parameter Optimization Problem
Goal: Given a netlist, find arg min_{p ∈ P} HPWL(p), where P is the set of all parameter combinations and HPWL is given by the tool.
How? (1) Define the environment as the placement tool black box.
(2) Define the state s ≈ current set of parameters p_curr ∈ P and target netlist.
(3) Define actions to modify p_curr.
(4) Define a reward R ∝ -HPWL so that the agent's goal is to decrease wirelength.
(5) Select a discount factor γ < 1 to force the agent to reduce wirelength in as few steps as possible.

This is a combinatorial optimization problem where P is very large and exhaustive search is infeasible. For an agent to correctly select an action, we must first define a good representation of its environment. In our case, the representation of the environment is given by a human expert as presented.
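Section 4.1 later notes that the environment is exposed through the OpenAI Gym interface. The sketch below shows one way such a black-box placement environment could look; it is an illustrative assumption rather than the authors' code, and `run_placement`, `encode_params` and `apply_action` are hypothetical wrappers around the EDA tool and the parameter encoding.

```python
import numpy as np
import gym
from gym import spaces


class PlacementEnv(gym.Env):
    """Black-box placement environment sketch: a state is the concatenation of
    the encoded parameter set and the netlist features, an action rewrites a
    subset of parameters, and the reward is minus the HPWL reported by the
    placer. All tool-facing callables are hypothetical and injected."""

    def __init__(self, netlist_features, run_placement, encode_params,
                 apply_action, initial_params, n_actions=11):
        super().__init__()
        self.action_space = spaces.Discrete(n_actions)   # the 11 moves of Table 3
        state_dim = len(encode_params(initial_params)) + len(netlist_features)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(state_dim,),
                                            dtype=np.float32)
        self.netlist_features = np.asarray(netlist_features, dtype=np.float32)
        self.run_placement = run_placement     # params dict -> HPWL (calls the EDA tool)
        self.encode_params = encode_params     # params dict -> flat numeric vector
        self.apply_action = apply_action       # (params dict, action id) -> new params dict
        self.initial_params = dict(initial_params)
        self.params = dict(initial_params)

    def _state(self):
        return np.concatenate([self.encode_params(self.params), self.netlist_features])

    def reset(self):
        self.params = dict(self.initial_params)
        return self._state()

    def step(self, action):
        self.params = self.apply_action(self.params, action)  # deterministic transition
        hpwl = self.run_placement(self.params)                 # expensive: one placement run
        reward = -hpwl                                         # Section 2.1 (normalized in 2.5)
        return self._state(), reward, False, {"hpwl": hpwl}
```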

2.3 Our States
We define our state as the joint values of the 12 placement parameters from Cadence Innovus used to perform the current placement (Table 1), along with information metrics on the netlist being placed. The netlist information consists of a mixture of metadata knowledge (number of cells, floorplan area, etc.), graph topological features (Table 2), and unsupervised features extracted using a graph neural network. Netlist characteristics are important to transfer knowledge across very different netlists so that our agent generalizes its tuning process to unseen netlists. Indeed, the optimal policy is likely to be related to the particularities of each netlist. Our state can be written as a concatenation of one-hot encoded categorical parameters (Booleans or enumerates), integer parameters, and integer and float netlist features.

Table 1: The 12 placement parameters we are targeting. The solution space is 6x10^9.

Name | Objective | Type | Groups | # val
eco max distance | maximum distance allowed during placement legalization | integer | detail | [0, 100]
legalization gap | minimum sites gap between instances | integer | detail | [0, 100]
max density | controls the maximum density of local bins | integer | global | [0, 100]
eco priority | instance priority for refine place | enum | detail | 3
activity power driven | level of effort for activity power driven placer | enum | detail + effort | 3
wire length opt | optimizes wirelength by swapping cells | enum | detail + effort | 3
blockage channel | creates placement blockages in narrow channels between macros | enum | global | 3
timing effort | level of effort for timing driven placer | enum | global + effort | 2
clock power driven | level of effort for clock power driven placer | enum | global + effort | 3
congestion effort | the effort level for relieving congestion | enum | global + effort | 3
clock gate aware | specifies that placement is aware of clock gate cells in the design | bool | global | 2
uniform density | enables even cell distribution | bool | global | 2

Table 2: Our 20 handcrafted netlist features.

Metadata (10), all integers: # cells, # nets, # cell pins, # IO, # nets w. fanout ∈ ]5, 10[, # nets w. fanout ≥ 10, # FFs, total cell area (um^2), # hardmacros, macro area (um^2).
Topological (10): average degree (float), average fanout (float), largest SCC (integer), max. clique (integer), chromatic nb. (integer), max. logic level (integer), RCC (float), CC (float), Fiedler value (float), spectral radius (float).

2.3.1 Topological Graph Features. In order to learn rich graph representations that uniquely classify each graph, we borrow concepts from graph theory. We use Boost and BLAS optimized C++ libraries to extract our features efficiently on large graphs (all collected in less than 1 hr for the larger netlists). Let G = (V, E) be the directed cyclic graph representing the netlist, obtained using a fully-connected clique model. Global signals such as reset, clock or VDD/VSS are not considered when building the graph. Multiple connections between two vertices in the graph are merged into one and self-loops are eliminated. We consider the following graph features to capture complex topology characteristics (e.g. connectivity and spectral properties) from the netlist:

• Strongly Connected Components (SCC): A strong component of G is a subgraph S of G if for any pair of vertices u, v ∈ S there is a directed cycle in S containing u and v. We compute them in linear time using directed depth-first search with an algorithm due to Tarjan [14].
• Clique: Given an integer m, does G contain K_m (the complete graph on m vertices) as a subgraph? We use the Bron–Kerbosch algorithm [3] to find the maximal cliques.
• k-Colorability: Given an integer k, is there a coloring of G with k or fewer colors? A coloring is a map χ : V → C such that no two adjacent vertices have the same color; i.e., if (u, v) ∈ E then χ(u) ≠ χ(v). The minimum such k (the chromatic number) is computed using a technique proposed in [5].
• Logic Levels: What is the maximum distance (number of gates traversed) between two flip-flops? LL = max_{a,b ∈ FFs} d(a, b).
• Rich Club Coefficient: How well do high-degree (rich) nodes know each other? Let G_k = (V_k, E_k) be the filtered graph of G with only the nodes of degree > k; then RCC_k = 2|E_k| / (|V_k|(|V_k| - 1)) [4].
• Clustering Coefficient: A measure of the cliquishness of node neighborhoods [17]:

    CC = \frac{1}{|V|} \sum_{i \in V} \frac{|\{e_{jk} : v_j, v_k \in \mathrm{Neighbors}(i),\ e_{jk} \in E\}|}{\deg(i)(\deg(i) - 1)}.    (3)

• Spectral characteristics: Using the implicitly restarted Arnoldi method [8], we extract from the Laplacian matrix of G the Fiedler value (second smallest eigenvalue), deeply related to the connectivity properties of G, as well as the spectral radius (largest eigenvalue), related to the regularity of G.

These features give important information about the netlist. For example, connectivity features such as SCC, maximal clique and RCC are important to capture congestion considerations (considered during placement refinement), while logic levels indirectly capture the difficulty of meeting timing by extracting the longest logic path.

2.3.2 Features from Graph Neural Network (GNN). Starting from simple node features including degree, fanout, area and encoded gate type, we generate node embeddings enc(v) using unsupervised GraphSAGE [7] with convolutional aggregation, dropout, and an output size of 32. The GNN algorithm iteratively propagates information from a node to its neighbors. The GNN is trained on each graph individually. Then the graph embedding enc(G) is obtained from the node embeddings with a permutation-invariant aggregator, as shown in Figure 2:

    \mathrm{enc}(G) = \mathrm{mean}\{\mathrm{enc}(v) \mid v \in V\}.    (4)

Figure 2: Graph embedding using the graph neural network package GraphSAGE. In our experiments, we first extract 32 features for each node in the graph. Next, we calculate the mean among all nodes for each feature. In the end, we obtain 32 features for the entire graph.
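GraphSAGE node embeddings are trained with an unsupervised objective [7]; the numpy sketch below only illustrates the two mechanisms reused here, mean aggregation over a node's neighborhood and the permutation-invariant mean readout of Equation (4). The weight matrices are random stand-ins rather than trained parameters.

```python
import numpy as np

def sage_layer(H, adj, W_self, W_neigh):
    """One mean-aggregator GraphSAGE-style layer: each node mixes its own
    embedding with the mean of its neighbors' embeddings, followed by ReLU."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    neigh_mean = adj @ H / deg
    return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)

def graph_embedding(node_feats, adj, dims=(8, 32), rng=np.random.default_rng(0)):
    """Stack two layers, then apply the mean readout of Eq. (4): enc(G) = mean_v enc(v)."""
    H = node_feats
    for d in dims:
        W_self = rng.normal(scale=0.1, size=(H.shape[1], d))
        W_neigh = rng.normal(scale=0.1, size=(H.shape[1], d))
        H = sage_layer(H, adj, W_self, W_neigh)
    return H.mean(axis=0)                        # 32-dimensional graph embedding

# toy example: 5 nodes, 4 simple node features (degree, fanout, area, gate type)
adj = np.array([[0, 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1], [1, 0, 0, 1, 0]], dtype=float)
x = np.random.default_rng(1).random((5, 4))
print(graph_embedding(x, adj).shape)             # (32,)
```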

Figure 3: t-SNE visualization of our 20 handcrafted plus 32 GNN features combined. Points representative of each netlist are well separated, showing that the graph features capture the differences between netlists well.

The t-SNE projection in two dimensions [15] of the vector of graph features is displayed in Figure 3. We see that all netlist points are far apart, indicating that the combination of our handcrafted and learned graph features distinguishes well the particularities of each netlist.

2.4 Our Actions
We define our own deterministic actions to change the setting of a subset of parameters. They render the state Markovian, i.e. given a state-action pair (s_t, a_t), the resulting state s_{t+1} is unique. An advantage of fully-observed determinism is that it allows planning. Starting from state s_0 and following a satisfactory policy π, the trajectory

    s_0 \xrightarrow{\pi(s_0)} s_1 \xrightarrow{\pi(s_1)} \dots \xrightarrow{\pi(s_{n-1})} s_n    (5)

leads to a parameter set s_n of good quality. If π has been learned, s_n can be computed directly in O(1) time without performing any placement.

Defining only two actions per placement parameter would result in 24 different actions, which is too many for the agent to learn well. Thus, we decide to first group the tuning variables per type (Boolean, Enumerate, Numeric) and per placement "focus" (Global, Detailed, Effort). On each of these groups, we define simple yet expressive actions such as flip (for Booleans), up, down, etc. For integers, we define prepared ranges where up ≡ "put in upper range", while for enumerates down ≡ "pass from high to medium", for example. We also add one arm that does not modify the current set of parameters. It serves as a trigger to reset the environment, in case it gets picked multiple times in a row. This leads to the 11 different actions A presented in Table 3. Our action space is designed to be as simple as possible in order to help neural network training, but also expressive enough that any parameter setting can be reached by such transformations.

Table 3: Our 11 actions.
1. FLIP Booleans
2. UP Integers
3. DOWN Integers
4. UP Efforts
5. DOWN Efforts
6. UP Detailed
7. DOWN Detailed
8. UP Global (does not touch the Booleans)
9. DOWN Global (does not touch the Booleans)
10. INVERT-MIX timing vs. congestion vs. WL efforts
11. DO NOTHING

2.5 Our Reward Structure
In order to learn with a single RL agent across various netlists with different wirelengths, we cannot define a reward directly linear in HPWL. Thus, to help convergence, we adopt a normalized reward function which renders the magnitude of the value approximations similar among netlists:

    R_t := \frac{HPWL_{\text{Human Baseline}} - HPWL_t}{HPWL_{\text{Human Baseline}}}.    (6)

While defining rewards in this manner necessitates knowing HPWL_Human Baseline, an expected baseline wirelength per design, this only requires one placement to be completed by an engineer.
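To make the action grouping of Table 3 and the reward of Equation (6) concrete, the sketch below applies a few of the moves to a parameter dictionary and computes the normalized reward. The parameter names, ranges and group membership are illustrative assumptions and do not reproduce the exact Innovus encoding.

```python
EFFORT_LEVELS = ["low", "medium", "high"]            # assumed ordering for enum parameters
INT_RANGES = {"eco_max_distance": (0, 100), "legalization_gap": (0, 100), "max_density": (0, 100)}
BOOL_PARAMS = ["clock_gate_aware", "uniform_density"]
EFFORT_PARAMS = ["timing_effort", "congestion_effort", "wire_length_opt"]

def apply_action(params, action):
    """Deterministic transition: each action rewrites one group of parameters."""
    p = dict(params)
    if action == "FLIP_BOOLEANS":
        for k in BOOL_PARAMS:
            p[k] = not p[k]
    elif action in ("UP_INTEGERS", "DOWN_INTEGERS"):
        step = 1 if action == "UP_INTEGERS" else -1
        for k, (lo, hi) in INT_RANGES.items():
            span = (hi - lo) // 3                     # "prepared ranges": low / mid / high thirds
            p[k] = min(hi, max(lo, p[k] + step * span))
    elif action in ("UP_EFFORTS", "DOWN_EFFORTS"):
        step = 1 if action == "UP_EFFORTS" else -1
        for k in EFFORT_PARAMS:
            i = EFFORT_LEVELS.index(p[k])
            p[k] = EFFORT_LEVELS[min(len(EFFORT_LEVELS) - 1, max(0, i + step))]
    elif action == "DO_NOTHING":
        pass                                          # arm 11: leave parameters untouched
    return p

def reward(hpwl, baseline_hpwl):
    """Normalized reward of Eq. (6): positive iff the placement beats the human baseline."""
    return (baseline_hpwl - hpwl) / baseline_hpwl

params = {"clock_gate_aware": True, "uniform_density": False, "eco_max_distance": 30,
          "legalization_gap": 5, "max_density": 70, "timing_effort": "medium",
          "congestion_effort": "low", "wire_length_opt": "high"}
print(apply_action(params, "UP_EFFORTS"))
print(reward(hpwl=4.34, baseline_hpwl=4.88))          # Netcard-like numbers: about 0.11
```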
2.6 Extensions in Physical Design Flow
The environment description, and in particular the action definition, can be applied to parameter optimization at any stage of the physical design flow, such as routing, clock tree synthesis, etc. As our actions act on abstracted representations of tool parameters, we can perform rapid design space exploration. The reward function can be easily adjusted to target PPA metrics such as routed wirelength or congestion, in order to optimize the design for different trade-offs. For example, a reward function combining various attributes into a single numerical value can be:

    R = \exp\Big(\sum_k \alpha_k \, \mathrm{QoR}_k\Big) - 1.    (7)

3 RL PLACEMENT AGENT

3.1 Overview
Using the definition of the environment presented in the previous section, we train an agent to autonomously tune the parameters of the placement tool. Here is our approach:

• The agent learns the optimal action for a given state. This action is chosen based on its policy network probability outputs.
• To train the policy network effectively, we adopt an actor-critic framework which brings the benefits of value-based and policy-based optimization algorithms together.
• To solve the well-known shortcomings of RL in EDA, namely latency and sparsity, we implement multiple environments collecting different experiences in parallel.
• To enable the learning of a recursive optimization process with complex dependencies, our agent architecture utilizes a deep neural network comprising a recurrent layer with an attention mechanism.

3.2 Goal of Learning
Among the many ways of learning how to behave in an environment, we choose to use what is called policy-based reinforcement learning. We state the formal definition of this problem as follows:

Policy-Based RL Problem
Goal: Learn the optimal policy π*(a|s).
How? (1) Approximate a policy by a parameterized π_θ(a|s).
(2) Define the objective J(θ) = E_{π_θ}[v_{π_θ}(s)].
(3) Find arg max_θ J(θ) with stochastic gradient ascent.

The goal of this optimization problem is to learn directly which action a to take in a specific state s. We represent the parametrized policy π_θ by a deep neural network. The main reasons for choosing this framework are as follows:

• It is model-free, which is important as the placer tool environment is very complex and may be hard to model.
• Our intuition is that the optimal policy may be simple to learn and represent (e.g. keep increasing the effort), while the value of a parameter setting may not be trivial, or may change significantly based on observation.
• Policy optimization often shows good convergence properties.

Figure 4: Actor-critic framework. The critic learns about and critiques the policy currently being followed by the actor.

3.3 How to Learn: the Actor-Critic Framework
In our chosen architecture we learn a policy that optimizes the value while learning the value simultaneously. For learning, it is often beneficial to use as much knowledge observed from the environment as possible and to hang other predictions off it, rather than solely predicting the policy. This type of framework, called actor-critic, is shown in Figure 4. The policy is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor.

Actor-critic algorithms combine value-based and policy-based methods. Value-based algorithms learn to approximate values v_w(s) ≈ v_π(s) by exploiting the Bellman equation:

    v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(s_{t+1}) \mid s_t = s]    (8)

which is used in temporal difference (TD) learning:

    \Delta w_t = \big(R_{t+1} + \gamma v_w(s_{t+1}) - v_w(s_t)\big) \nabla_w v_w(s_t).    (9)

On the other hand, policy-based algorithms update a parameterized policy π_θ(a_t|s_t) directly through stochastic gradient ascent in the direction of the value:

    \Delta \theta_t = G_t \nabla_\theta \log \pi_\theta(a_t|s_t).    (10)

In actor-critic, the policy updates are computed from incomplete episodes by using truncated returns that bootstrap on the value estimate at state s_{t+n} according to v_w:

    G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n v_w(s_{t+n}).    (11)

This reduces the variance of the updates and propagates rewards faster. The variance can be further reduced using state values as a baseline in the policy updates, as in advantage actor-critic updates:

    \Delta \theta_t = \big(G_t^{(n)} - v_w(s_t)\big) \nabla_\theta \log \pi_\theta(a_t|s_t).    (12)

The critic updates the parameters w of v_w by n-step TD (Eq. 9) and the actor updates the parameters θ of π_θ in the direction suggested by the critic by policy gradient (Eq. 12). In this work we use the advantage actor-critic method, called A2C [12], which was shown to produce excellent results on diverse environments. As shown in Equation 12, an advantage function formed as the difference between the returns and a baseline state-value estimate is used instead of raw returns. The advantage can be thought of as a measure of how good a given action is compared to some average.

3.4 Synchronous Actor/Critic Implementation
The main issues plaguing the use of RL in EDA are the latency of tool runs (it takes minutes to hours to perform one placement) as well as the sparsity of data (there is no database of millions of netlists, placed designs or layouts). To solve both issues, we implement a parallel version of A2C, as depicted in Figure 5. In this implementation, an agent learns from the experiences of multiple Actors interacting in parallel with their own copy of the environment. This configuration increases the throughput of acting and learning and helps decorrelate samples during training for data efficiency [6].

Figure 5: Synchronous parallel learner. The global network sends actions to the actors through the step model. Each actor gathers experiences from its own environment.

In parallel training setups, the learning updates may be applied synchronously or asynchronously. We use a synchronous version, i.e. a deterministic implementation that waits for each Actor to finish its segment of experience (according to the current policy provided by the step model) before performing a single batch update to the weights of the network. One advantage is that it provides larger batch sizes, which are used more effectively by the computing resources.

The parallel training setup does not modify the equations presented before. The gradients are simply accumulated over all the environments' batches.
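A minimal numpy sketch of the quantities behind Equations (11) and (12), computed for one 6-step rollout segment in each of 16 parallel environments: the n-step bootstrapped returns and the advantages that weight the synchronous batch update. The reward and value numbers are random placeholders.

```python
import numpy as np

def n_step_returns(rewards, value_bootstrap, gamma=0.997):
    """Eq. (11): G_t^(n) for every step t of a rollout segment, bootstrapping on
    the critic's value estimate of the state that follows the segment.
    rewards: (K, n) array, one row per parallel environment."""
    K, n = rewards.shape
    returns = np.zeros_like(rewards)
    running = value_bootstrap.copy()            # v_w(s_{t+n}) for each environment
    for t in reversed(range(n)):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns

def advantages(returns, values):
    """Eq. (12): A_t = G_t^(n) - v_w(s_t); the policy gradient weights log pi by this,
    while the critic is regressed toward the returns."""
    return returns - values

# 16 environments x 6 actor steps, as in the training setup of Section 4.1
rng = np.random.default_rng(0)
rewards = rng.normal(scale=0.02, size=(16, 6))          # normalized HPWL rewards
values = rng.normal(scale=0.02, size=(16, 6))           # critic estimates v_w(s_t)
bootstrap = rng.normal(scale=0.02, size=16)             # v_w(s_{t+n})
G = n_step_returns(rewards, bootstrap)
A = advantages(G, values)
print(G.shape, A.shape)                                  # (16, 6) (16, 6)
```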

3.5 A Two-Head Network Architecture
The actor-critic framework uses both policy and value models. The full agent network can be represented as a deep neural network (π_θ, v_w) = f(s). This neural network takes the state s = (p ∘ n), made of the parameter values p and the netlist features n, and outputs a vector of action probabilities with components π_θ(a) for each action a, and a scalar value v_w(s) estimating the expected cumulative reward G from state s.

The policy tells how to modify a placement parameter setting and the value network tells us how good this current setting is. We share the body of the network to allow the value and policy predictions to inform one another. The parameters are adjusted by gradient ascent on a loss function that sums the losses of the policy and the value plus a regularization term, whose gradient is defined as follows, as in [16]:

    \underbrace{\big(G_t^{(n)} - v_w(s_t)\big)\nabla_\theta \log \pi_\theta(a_t|s_t)}_{\text{policy gradient}}
    + \underbrace{\beta \big(G_t^{(n)} - v_w(s_t)\big)\nabla_w v_w(s_t)}_{\text{value estimation gradient}}
    + \underbrace{\eta \sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t)}_{\text{entropy regularization}}.    (13)

The entropy regularization pushes the entropy up to encourage exploration, and β and η are hyper-parameters that balance the importance of the loss components.

The complete architecture of our deep neural network is shown in Figure 6. To compute the value and policy, the concatenation of the placement parameters with the graph-extracted features is first passed through two feed-forward fully-connected (FC) layers with tanh activations, followed by an FC linear layer. This is followed by a Long Short-Term Memory (LSTM) module with layer normalization and with 16 standard hidden units with forget gate. The feed-forward FC layers have no memory. By introducing an LSTM, which is a recurrent layer, into the network, the model can base its actions on previous states. This is motivated by the fact that traditional optimization methods are based on recurrent approaches. Moreover, we add a sequence-to-one global attention mechanism [9], inspired by state-of-the-art Natural Language Processing architectures, to help the recurrent layer (RNN) focus on important parts of the recursion. Let h_t be the hidden state of the RNN. Then the attention alignment weights a_t with each source hidden state h_s are defined as:

    a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, h_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, h_{s'})\big)}    (14)

where the alignment score function is:

    \mathrm{score}(h_t, h_s) = h_t^\top W_a h_s.    (15)

The global context vector:

    c_t = \sum_s a_t(s)\, h_s    (16)

is combined with the hidden state to produce an attentional hidden state as follows:

    \tilde{h}_t = \tanh\big(W_c [c_t \circ h_t]\big).    (17)

This hidden state is then fed to the two heads of the network, both composed of two FC layers, with an output softmax layer for the policy and an output linear layer for the value. The parameters of our network are summarized in Table 4.

Figure 6: Overall network architecture of our agent. The combination of an LSTM with an attention mechanism enables the learning of a complex recurrent optimization process. Table 4 provides the details of the sub-networks used here.

Table 4: Neural network parameters used in our RL agent architecture in Figure 6. The number of inputs of the first FC layer is as follows: 32 from the GNN, 20 from Table 2, 24 for the one-hot encoding of the enum/bool types from Table 1, and 3 for the integer types from Table 1.

Part | Input | Hidden | Output
1. Shared Body | 79 | (64, 32) (tanh) | 16 (linear)
2. LSTM (6 unroll) | 16 | 16 | 16 x 6
3. Attention | 16 x 6 | W_a, W_c | 16
4. Policy | 16 | (32, 32) (tanh) | 11 (softmax)
5. Value | 16 | (32, 16) (tanh) | 1 (linear)

3.6 Our Self-Play Strategy
Inspired by AlphaZero [13], our model learns without any supervised samples. We do not use expert knowledge to pre-train the network using good known parameter sets or actions. While the agent makes random moves at first, the idea is that by relying on zero human bias, the agent may learn counter-intuitive moves and achieve superhuman tuning capabilities.

4 EXPERIMENTAL RESULTS
To train and test our agent, we select 15 benchmark designs from OpenCores, the ISPD 2012 contest, and two RISC-V single cores, presented in Table 5. We use the first eleven for training and the last four for testing. We synthesize the RTL netlists using Synopsys Design Compiler. We use the TSMC 28nm technology node. The placements are done with Cadence Innovus 17.1. The aspect ratio of the floorplans is fixed to 1 and appropriate fixed clock frequencies are selected. Memory macros of the RocketTile and OpenPiton Core benchmarks are pre-placed by hand. For successful placements, a lower bound of total cell area divided by floorplan area is set on the parameter max density. IO pins are placed automatically by the tool between metals 4 and 6.

Table 5: Benchmark statistics based on a commercial 28nm technology. RCC_3 is the Rich Club Coefficient (in units of 10^-4), LL is the maximum logic level and Sp. R. denotes the spectral radius. RT is the average placement runtime using Innovus (in minutes).

Name | #cells | #nets | #IO | RCC_3 | LL | Sp. R. | RT
training set:
PCI | 1.2K | 1.4K | 361 | 510 | 17 | 25.6 | 0.5
DMA | 10K | 11K | 959 | 65 | 25 | 26.4 | 1
B19 | 33K | 34K | 47 | 19 | 86 | 36.1 | 2
DES | 47K | 48K | 370 | 14 | 16 | 25.6 | 2
VGA | 52K | 52K | 184 | 15 | 25 | 26.5 | 3
ECG | 83K | 84K | 1.7K | 7.5 | 23 | 26.8 | 4
Rocket | 92K | 95K | 377 | 8.1 | 42 | 514.0 | 6
AES | 112K | 112K | 390 | 5.8 | 14 | 102.0 | 6
Nova | 153K | 155K | 174 | 4.6 | 57 | 11,298 | 9
Tate | 187K | 188K | 1.9K | 3.2 | 21 | 25.9 | 10
JPEG | 239K | 267K | 67 | 2.8 | 30 | 287.0 | 12
test set (unseen netlists):
LDPC | 39K | 41K | 4.1K | 18 | 19 | 328.0 | 2
OpenPiton | 188K | 196K | 1.6K | 3.9 | 76 | 3940 | 19
Netcard | 300K | 301K | 1.8K | 2.9 | 32 | 27.3 | 24
Leon3 | 326K | 327K | 333 | 2.4 | 44 | 29.5 | 26

4.1 RL Network Training Setting
We define our environment using the OpenAI Gym interface [2] and implement our RL agent network in TensorFlow. We use 16 parallel environments (16 threads) in our synchronous A2C framework. We perform tuning of the hyperparameters of our network using Bayesian Optimization, which results in stronger agents. The learning curve of our A2C agent in our custom Innovus environment is shown in Figure 7. We observe that the mean reward across all netlists converges asymptotically to a value of 6.8%, meaning wirelength is reduced on average by 6.8%.

Figure 7: Training our agent for 150 iterations (= 14,400 placements). The reward is an aggregate reward from all training netlists. Training time is within 100 hours. Human baseline: reward = 0.

Training over 150 iterations (= 14,400 placements) takes about 100 hours. Note that 99% of that time is spent performing the placements, while updating the neural network weights takes less than 20 min. Without parallelization, training over the same number of placements would take 16 x 100 hr ≈ 67 days.

An explained variance of 0.67 shows that the value function explains the observed returns relatively well. We use a discount factor γ = 0.997, a coefficient for the value loss β = 0.25, an entropy cost of η = 0.01, and a learning rate of 0.0008. We use standard non-centered RMSProp as the gradient ascent optimizer. The weights are initialized using orthogonal initialization. The learning updates are batched across rollouts of 6 actor steps for 16 parallel copies of the environment, totalling a mini-batch size of 96. All experiments use gradient clipping by norm to avoid exploding gradients (a phenomenon common with LSTMs), with a maximum norm of 0.5. Note that with 14,400 placements, we only explore 10^-6 % of the total parameter space.

For a given environment, we select a random netlist and we always start with the corresponding human parameter set to form an initial state for each episode. Each environment is reset for episodes of 16 steps. Training on K environments in parallel, each performing a placement on a different netlist, the reward signal is averaged over the K netlists for the network updates, which decreases the reward variance and ultimately helps the network generalize to unseen netlists, as prescribed in [11].
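Putting these hyper-parameters together, the sketch below reproduces the shape of one synchronous update: 16 environments each advance 6 steps under the current policy, the 96 transitions are batched, and a single clipped gradient step follows. `StubEnv` and `StubModel` are stand-ins for the Innovus-backed Gym environments and the TensorFlow network; only the data flow is meant to be faithful.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ENVS, N_STEPS, N_ACTIONS, STATE_DIM = 16, 6, 11, 79      # Section 4.1 / Tables 3-4
GAMMA, LR, MAX_GRAD_NORM = 0.997, 8e-4, 0.5

class StubEnv:
    """Placeholder for the Innovus-backed environment."""
    def reset(self):
        return rng.random(STATE_DIM)
    def step(self, a):
        return rng.random(STATE_DIM), rng.normal(scale=0.02), False, {}

class StubModel:
    """Placeholder for the two-head policy/value network."""
    def act(self, states):
        return rng.integers(N_ACTIONS, size=len(states)), rng.normal(size=len(states))
    def update(self, states, actions, returns, advantages, lr, max_grad_norm):
        pass    # one clipped RMSProp step on the Eq. (13) loss in the real network

envs = [StubEnv() for _ in range(N_ENVS)]
model = StubModel()
states = np.stack([e.reset() for e in envs])

def collect_and_update():
    """One synchronous update: every actor finishes its 6-step segment before a
    single batched gradient step (mini-batch of 16 x 6 = 96 transitions)."""
    global states
    batch_s, batch_a, batch_r, batch_v = [], [], [], []
    for _ in range(N_STEPS):
        actions, values = model.act(states)
        results = [e.step(a) for e, a in zip(envs, actions)]
        next_states = np.stack([r[0] for r in results])
        rewards = np.array([r[1] for r in results])
        batch_s.append(states); batch_a.append(actions)
        batch_r.append(rewards); batch_v.append(values)
        states = next_states
    _, bootstrap = model.act(states)
    returns = np.zeros((N_STEPS, N_ENVS)); running = bootstrap
    for t in reversed(range(N_STEPS)):                     # n-step returns, Eq. (11)
        running = batch_r[t] + GAMMA * running
        returns[t] = running
    advantages = returns - np.stack(batch_v)               # Eq. (12)
    model.update(np.concatenate(batch_s), np.concatenate(batch_a),
                 returns.ravel(), advantages.ravel(), LR, MAX_GRAD_NORM)

collect_and_update()
```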

4.2 Netlist Training Results
For comparison, we use the state-of-the-art tool auto-tuner OpenTuner [1], which we adapt for Innovus as a baseline. In this framework, a Multi-Armed Bandit (MAB) selects at each iteration a search technique among Genetic Algorithm, Differential Evolution, Simulated Annealing, Torczon hillclimber, Nelder-Mead and Particle Swarm Optimization, based on a score that forces a trade-off between exploitation (use the arm that worked best) and exploration (use a more rarely used arm). We run the search techniques in parallel, evaluating 16 candidate parameter sets at the same time. We run the tuner on the eleven training netlists and record the best achieved wirelength, performing 1,300 placements per netlist so that the total number of placements equals that of the RL agent training.

Table 6: Comparison of half-perimeter wirelength (HPWL) after placement on the training netlists among the human design, the Multi-Armed Bandit (MAB) [1], and our RL-based method. HPWL is reported in m. Δ denotes the percentage change relative to the human design (negative is better).

Netlist | human | MAB [1] (Δ%) | RL (Δ%)
PCI | 0.010 | 0.0092 (-8.0%) | 0.0092 (-8.0%)
DMA | 0.149 | 0.139 (-6.7%) | 0.135 (-9.4%)
B19 | 0.30 | 0.28 (-6.7%) | 0.28 (-6.7%)
DES | 0.42 | 0.37 (-11.9%) | 0.36 (-14.3%)
VGA | 1.52 | 1.40 (-7.9%) | 1.41 (-7.2%)
ECG | 0.72 | 0.65 (-9.7%) | 0.68 (-5.5%)
Rocket | 1.33 | 1.27 (-4.5%) | 1.20 (-9.8%)
AES | 1.49 | 1.44 (-2.7%) | 1.40 (-6.0%)
AVC-Nova | 1.59 | 1.49 (-6.3%) | 1.46 (-8.2%)
Tate | 1.53 | 1.42 (-7.2%) | 1.45 (-5.2%)
JPEG | 2.14 | 1.96 (-8.4%) | 1.88 (-12.2%)

Table 6 shows the best wirelengths found by the MAB as well as by the RL agent during training. The human baseline is set by an experienced engineer who tuned the parameters for a day. We see that the RL agent outperforms the MAB on most netlists, reducing HPWL by 9.8% on the Rocket Core benchmark. All in all, both methods improve on the human baseline.

Figure 8: 28nm full-chip GDSII layouts (placement and routing) of OpenPiton. (a) human design (took 7 hrs): HPWL 5.26 m, WL 6.31 m; (b) Multi-Armed Bandit (took 16 hrs): HPWL 5.11 m, WL 6.24 m; (c) reinforcement learning (took 20 min): HPWL 4.99 m, WL 6.10 m.

Table 7: Comparison on test netlists of the best wirelength found (one iteration = one placement performed). HPWL is reported in m.

Netlist | human | #iter. | MAB [1] | #iter. | RL | #iter.
LDPC | 1.14 | 20 | 1.04 (-8.8%) | 50 | 1.02 (-10.5%) | 1
OpenPt | 5.26 | 20 | 5.11 (-2.9%) | 50 | 4.99 (-5.1%) | 1
Netcard | 4.88 | 20 | 4.45 (-8.8%) | 50 | 4.34 (-11.1%) | 1
Leon3 | 3.52 | 20 | 3.37 (-4.3%) | 50 | 3.29 (-6.5%) | 1

Table 8: Best placement parameters found for the Netcard benchmark.

Name | MAB [1] | RL
eco max distance | 54 | 81
legalization gap | 1 | 5
max density | 0.92 | 0.94
eco priority | eco | fixed
activity power driven | none | none
wire length opt | high | none
blockage channel (for macros) | none | soft
timing effort | medium | high
clock power driven | none | none
congestion effort | low | high
clock gate aware | true | true
uniform density | false | false

Table 9: PPA comparison after routing on the test set. The target frequencies are 1 GHz, 500 MHz, 833 MHz and 666 MHz from top to bottom.

ckt | metric | human | MAB [1] | RL
LDPC | WL (m) | 1.65 | 1.57 | 1.53
LDPC | WNS (ns) | -0.005 | -0.001 | -0.001
LDPC | Power (mW) | 162.10 | 156.49 | 153.77
OpenPiton | WL (m) | 6.31 | 6.24 | 6.10
OpenPiton | WNS (ns) | -0.003 | -0.001 | 0
OpenPiton | Power (mW) | 192.08 | 190.95 | 189.72
Netcard | WL (m) | 8.01 | 7.44 | 7.15
Netcard | WNS (ns) | -0.006 | -0.007 | -0.004
Netcard | Power (mW) | 174.05 | 170.51 | 167.70
Leon3 | WL (m) | 5.66 | 5.53 | 5.41
Leon3 | WNS (ns) | -0.005 | -0.001 | -0.003
Leon3 | Power (mW) | 156.83 | 156.00 | 155.51
4.3 Unseen Netlist Testing Results
To verify the ability of our agent to generalize, we check its performance on the four unseen test netlists. Without any additional training (the network parameters are fixed), the RL agent iteratively improves a random initial parameter set by selecting the action a with the highest predicted π_θ(a) value, as described in Equation 5. Because our actions are deterministic, the resulting set of parameters is known and is fed back to the network. We repeat this process until the estimated value decreases for 3 consecutive updates, and then backtrack to the settings with the highest value. This way, a "good" candidate parameter set is found without performing any placement. We then perform a single placement with that parameter set and record the obtained wirelength.
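A sketch of this test-time procedure: greedily follow the policy through the deterministic parameter transitions, track the critic's value estimate, stop after the value has decreased three times in a row, and return the highest-value setting; no placement is run inside the loop. `model.policy_value`, `encode` and `apply_action` are hypothetical wrappers around the trained network, the state encoding and the Table 3 moves.

```python
import numpy as np

def propose_parameters(model, netlist_features, initial_params, encode, apply_action,
                       max_steps=16, patience=3):
    """Greedy roll-out of the trajectory in Eq. (5) at test time: pick
    argmax_a pi_theta(a|s) at each step, stop once the value estimate v_w(s)
    has decreased `patience` times in a row, and back-track to the best
    setting seen. Only the returned setting needs an actual placement run."""
    params = dict(initial_params)
    best_params, best_value = dict(params), -np.inf
    prev_value, drops = -np.inf, 0
    for _ in range(max_steps):
        state = np.concatenate([encode(params), netlist_features])
        probs, value = model.policy_value(state)   # (11,) action probabilities, scalar v_w(s)
        if value > best_value:
            best_params, best_value = dict(params), value
        drops = drops + 1 if value < prev_value else 0
        if drops >= patience:
            break
        prev_value = value
        params = apply_action(params, int(np.argmax(probs)))   # deterministic, no placement
    return best_params
```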
In comparison, the MAB needs the reward signal to propose a new set of parameters and therefore needs actual placements to be performed by the tool. We track the best wirelength found, allotting 50 sequential iterations to the MAB.

The best wirelengths found by our RL agent and the MAB on all four test netlists are shown in Table 7. We observe that our RL agent consistently achieves superior wirelengths, performing only one placement. Table 8 shows the best parameter settings found by the MAB and the RL agent on Netcard. Interestingly enough, we can see how the two optimizers found separate ways to minimize HPWL: WL-driven vs. congestion-driven.

4.4 PPA Comparison After Routing
To confirm that the improvement in HPWL after placement translates into an improvement in final routed wirelength, we perform routing of the placed designs. They are all routed with the same settings, where metal layers 1 to 6 are used. The layouts of the OpenPiton Core are shown in Figure 8. We verify that the target frequency is achieved and that routing succeeded without congestion issues or DRC violations. The PPA of the routed designs is summarized in Table 9. We observe that the HPWL reduction after placement is conserved after routing on all test designs, reaching 7.3% and 11% wirelength savings on LDPC and Netcard compared with the human baseline. Footprints are 74,283 um^2 for LDPC, 1,199,934 um^2 for OpenPiton, 728,871 um^2 for Netcard and 894,115 um^2 for Leon3.

5 CONCLUSIONS
Our placement parameter optimizer based on deep RL provides pre-set, improved parameter settings without human intervention. We believe this is an important step in shifting from the "CAD" mindset to the "Design Automation" mindset. We use a novel representation to formulate states and actions applied to placement optimization. Our experimental results show that our agent generalizes well to unseen netlists and consistently reduces wirelength compared with a state-of-the-art tool auto-tuner, in only one iteration and without any additional training.

REFERENCES
[1] J. Ansel et al. OpenTuner: An Extensible Framework for Program Autotuning. PACT '14.
[2] G. Brockman et al. OpenAI Gym, 2016.
[3] C. Bron and J. Kerbosch. Algorithm 457: Finding All Cliques of an Undirected Graph. Commun. ACM, 1973.
[4] V. Colizza et al. Detecting rich-club ordering in complex networks. Nature Physics, 2006.
[5] O. Coudert. Exact Coloring of Real-life Graphs Is Easy. DAC '97.
[6] L. Espeholt et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018.
[7] W. Hamilton et al. Inductive Representation Learning on Large Graphs. NIPS '17.
[8] R. B. Lehoucq and D. C. Sorensen. Deflation Techniques for an Implicitly Restarted Arnoldi Iteration. SIAM J. Matrix Analysis Applications, 17:789–821, 1996.
[9] M.-T. Luong et al. Effective Approaches to Attention-based Neural Machine Translation.
[10] I. L. Markov et al. Progress and Challenges in VLSI Placement Research. ICCAD '12.
[11] A. Mirhoseini et al. Chip Placement with Deep Reinforcement Learning, 2020.
[12] V. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning, 2016.
[13] D. Silver et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017.
[14] R. E. Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM J. Comput., 1:146–160, 1972.
[15] L. van der Maaten and G. E. Hinton. Visualizing Data using t-SNE. 2008.
[16] O. Vinyals et al. StarCraft II: A New Challenge for Reinforcement Learning, 2017.
[17] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 1998.
[18] C. Xu et al. A Parallel Bandit-Based Approach for Autotuning FPGA Compilation. FPGA '17.
[19] C. H. Yu et al. S2FA: An Accelerator Automation Framework for Heterogeneous Computing in Datacenters. DAC '18.
