FPGAEnergy Eff Transformers

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

FTRANS: Energy-Efficient Acceleration of Transformers using

FPGA
Bingbing Li1 , Santosh Pandey2 , Haowen Fang3 , Yanjun Lyv1 , Ji Li4 , Jieyang Chen5 , Mimi Xie6 ,
Lipeng Wan5 , Hang Liu2 and Caiwen Ding1
1 Universityof Connecticut 2 Stevens Institute of Technology 3 Syracuse University
4 Microsoft Corporation 5 Oak Ridge National Laboratory 6 University of Texas at San Antonio
1 {bingbing.li, lyu.yanjun, caiwen.ding}@uconn.edu 2 {spande1, Hang.liu}@stevens.edu 3 [email protected]
4 [email protected] 5 {chenj3, wanl}@ornl.gov 6 [email protected]

ABSTRACT due to the bottleneck in the memory (hidden state) and compli-
In natural language processing (NLP), the “Transformer" architec- cated bypassing logic (additive and derivative branches) where long
arXiv:2007.08563v1 [cs.DC] 16 Jul 2020

ture was proposed as the first transduction model replying entirely range information is passed. In addition, the inherently sequential
on self-attention mechanisms without using sequence-aligned re- nature precludes parallelization within training examples through
current neural networks (RNNs) or convolution, and it achieved backpropagation, which is critical at longer sequence lengths [9].
significant improvements for sequence to sequence tasks. The in- To overcome the shortcomings in RNNs, the “Transformer" ar-
troduced intensive computation and storage of these pre-trained chitecture was proposed as the first transduction model replying
language representations has impeded their popularity into compu- entirely on self-attention mechanisms without using sequence-
tation and memory constrained devices. The field-programmable aligned RNNs or convolution. It achieved notable improvements
gate array (FPGA) is widely used to accelerate deep learning algo- for sequence to sequence tasks [18]. The breakthroughs and devel-
rithms for its high parallelism and low latency. However, the trained opments of new models have accelerated at an unprecedented pace
models are still too large to accommodate to an FPGA fabric. In since the attention mechanisms have become the mainstream in
this paper, we propose an efficient acceleration framework, Ftrans, NLP domain with the invention of Transformer. Many transformer-
for transformer-based large scale language representations. Our based NLP language models like BERT [4] and RoBERTa [10] intro-
framework includes enhanced block-circulant matrix (BCM)-based duced pretraining procedures to the transformer architecture and
weight representation to enable model compression on large-scale achieved record-breaking results on major NLP tasks, including
language representations at the algorithm level with few accuracy question answering, sentiment analysis, and language inference.
degradation, and an acceleration design at the architecture level. Ex- Nevertheless, the introduced intensive computation and power
perimental results show that our proposed framework significantly footprint of these pre-trained language representations has impeded
reduce the model size of NLP models by up to 16 times. Our FPGA their popularity into computation and energy constrained as edge
design achieves 27.07× and 81 × improvement in performance and devices. Moreover, despite of the rapid advancement achieved by
energy efficiency compared to CPU, and up to 8.80× improvement the recent transformer-based NLP models, there is a serious lack of
in energy efficiency compared to GPU. studies on compressing these models for embedded and internet-
of-things (IoT) devices.
ACM Reference Format: In this paper, we propose an energy-efficient acceleration frame-
Bingbing Li1 , Santosh Pandey2 , Haowen Fang3 , Yanjun Lyv1 , Ji Li4 , Jieyang work, Ftrans, for transformer-based large scale language repre-
Chen5 , Mimi Xie6 , Lipeng Wan5 , Hang Liu2 and Caiwen Ding1 . 2020. FTRANS:
sentations using FPGA. Ftrans is comprised of an enhanced BCM-
Energy-Efficient Acceleration of Transformers using FPGA. In ACM/IEEE
International Symposium on Low Power Electronics and Design (ISLPED ’20),
based method enabling model compression on language represen-
August 10–12, 2020, Boston, MA, USA. ACM, New York, NY, USA, 7 pages. tations at the algorithm level, and an acceleration design at the
https://doi.org/10.1145/3370748.3406567 architecture level. Our contributions are summarized as follows:
• Enhanced BCM-based model compression for Transformer.
1 INTRODUCTION We address the accuracy degradation caused by traditional BCM
compression, and propose an enhanced BCM-based compression
RNN and its variant Long Short-Term Memory (LSTM) unit [6] and
to reduce the footprint of weights in Transformer. With small
Gated Recurrent unit (GRU) [3] used to dominate in sequence mod-
accuracy loss, Ftrans achieves up to 16 times compression ratio.
eling, language modeling and machine translation, etc. However,
• Holistic optimization for Transformers on FPGA. Given
they in general lack efficiency in transmitting global information,
the large size and complex data flow of transformer-based mod-
els, even with model compression, we still need to schedule
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed the computation resources carefully to optimize latency and
for profit or commercial advantage and that copies bear this notice and the full citation throughput. We propose a two stage optimization approach to
on the first page. Copyrights for components of this work owned by others than ACM mitigate the resource constraints and achieve high throughput.
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a • Low hardware footprint and low power (energy) consump-
fee. Request permissions from [email protected]. tion. We propose an FPGA architecture design to support the
ISLPED ’20, August 10–12, 2020, Boston, MA, USA model compression technique and we develop a design automa-
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7053-0/20/08. . . $15.00 tion and optimization technique. Overall, the proposed Ftrans
https://doi.org/10.1145/3370748.3406567 achieves the lowest hardware cost and energy consumption in
ISLPED ’20, August 10–12, 2020, Boston, MA, USA Bingbing Li et al.

implementing Transformer and RoBERTa compared to CPU and Encoder: The encoder consists of a stack of N identical layers.
GPU references. Each layer has two sub-layers. The first is a multi-head self-attention
Experimental results show that our proposed framework signifi- mechanism, and the second is a FC feed-forward network. There is
cantly reduce the size of NLP models by up to 16 times. Our FPGA a residual connection around each of the two sub-layers, followed
design achieves 27.07× and 81 × improvement in performance and by layer normalization.
energy efficiency compared to CPU. The power consumption of Decoder: The decoder contains of a stack of N identical layers.
GPU is up to 5.01× compared to that of FPGA, and we achieve up Within each layer, there are three sub-layers, where the third sub-
to 8.80× improvement in energy efficiency compared to GPU. layer is the same as the encoder. The inserted second sub-layer
performs multi-head attention over the output of encoder stack. The
first-sublayer utilizes masked multi-head attention, to ensure that
2 RELATED WORK
predictions for position i only depends on its previous positions.
Attention mechanisms have become an integral part of compelling
sequence modeling and transduction models in various tasks [9]. Ev- 3.1 Attention
idence of NLP community moving towards attention-based models
can be found by more attention-based neural networks developed The attention function can be described as mapping a query q and a
by companies like Amazon [8], Facebook [16], and Salesforce [2]. set of keys k and values v pairs to an output o as shown in Figure 2
The novel approach of Transformer is the first model to eliminate (a), named scaled dot-product attention, or single head attention.
recurrence completely with self-attention to handle the dependen- 3.1.1 Single Head Attention. In this paper, we select dot-product
cies between input and output. BERT [4] and RoBERTa [10] extend attention as the attention function since it is much faster and more
Transformer’s capacity from a sequence to sequence model to a space-efficient [18]. The input consists of queries and p keys of di-
general language model by introducing the pretraining procedure, mension dk , and values of dimension dv . We denote dk is the
and achieved state-of-the-art results on major NLP benchmarks. scaling factor for dot-product attention. We compute
Although RNNs and Convolutional Neural Networks (CNNs) are p the dot prod-
ucts of the query with all keys, divide each by dk , and apply a
being replaced by Transformer-based models in NLP community, softmax function to obtain the weights on the values. The attention
there are only a few works that accelerate Transformers and fo- function on q, k, and v can be computed simultaneously by con-
cus on reducing the energy and power footprint, e.g., a case study catenated into matrix Q, K, and V, respectively. Accordingly, the
of Transformer is presented in [1] using one of the cutting-edge output matrix Oat t is:
FPGA boards. However, it is noteworthy that [1] targets at a special-
ized FPGA architecture, which employs High Bandwidth Memory QKT
Oat t = At t ent ion(Q, K, V) = sof tmax ( p ) (1)
(HBM) technology. Unlike the conventional FPGA, HBM is pack- dk
aged directly within the FPGA fabric to alleviate the on chip mem-
ory constraint. However, work [1] did not adopt model compression 3.2 Multi-head Attention
technique, and used the sequence length of 8 and 16, which are Multi single-head attention are then concatenated as multi-head
too short and not favorable in practise. The model details such as attention, as shown in Figure 2 (b). MultiHead (Q, K, V) = Concat
number of encoder/decoders, hidden size are also not listed. (Head1 , · · · , Headh )×WO , where the Head is defined as:
Q
Headi = At t ent ion(QWi , KWiK , VWV
i ) (2)
3 TRANSFORMER WORKLOAD ANALYSIS Q
where the projections are parameter matrices ∈ Wi Rdmod el ×dk ,
The “Transformer" architecture is the heart for all state-of-the- WiK ∈ Rdmod el ×dk , and WVi ∈ Rdmod el ×dv . Multi-head attention
art large scale language models. It has an encoder-decoder struc- enables the model to jointly attend to information from different
ture [18] as shown in Figure 1. The encoder maps a sequence of the representation subspaces at different positions [18].
input symbols x = (x 1 ; x 2 ; x 3 ; ...; x n ) to a sequence of continuous In this work, we implement a shallow Transformer and a large
representations z = (z 1 ; z 2 ; z 3 ; ...; zn ). Given x, the decoder then scale Transformer, i.e., RoBERTa. The shallow Transformer has h
produces an output sequence y = (y1 ; y2 ; y3 ; ...; ym ) of symbols one = 2 parallel attention layers with 4 attention heads and RoBERTa
element per time step. For the next time step, the model takes the (base configuration) has 12 layers with 12 heads. For each head
previously generated symbols as additional input when generating we use dk = dv = dmodel /h = 200 and 768 for Transformer and
the next. The Transformer follows this overall architecture using RoBERTa, respectively.
stacked self-attention and fully-connected (FC) layers for both the Q
MatMul Scale Mask Softmax MatMul
encoder and decoder, shown in Figure 1. K
Positional Encoder *N V
encoding (a) Scaled dot-product attention
Inputs
Inputs Multi-head Add & FC Add &
+ h
{
embedding attention Norm layers Norm Q Linear
Output Scaled
probabilities K Linear dot-product Concat Linear
Masked Multi-head Add & FC attention
Outputs Add & Add &
+ multi-head Linear Softmax
embedding Norm attention Norm layers Norm V Linear
attention (b) Multi-head attention
Outputs
Positional
Decoder *N
encoding

Figure 2: : (a) Scaled Dot-Product Attention. (b) Multi-Head


Figure 1: Model structure of Transformer. Attention.
FTRANS: Energy-Efficient Acceleration of Transformers using FPGA ISLPED ’20, August 10–12, 2020, Boston, MA, USA

4 TRANSFORMER COMPRESSION USING stacks. The embedding layer contributes 30.89% of parameters. Es-
ENHANCED BLOCK-CIRCULANT MATRIX sentially it is a look-up table which transforms discrete tokens into
continuous space, the computation is less than that of encoder and
The introduced intensive computation and weight storage of large
decoder. Therefore, our basic idea is to off-load embedding layer
pre-trained language representations have brought challenges in
to off-chip memory, thus it is possible to deploy the most compu-
hardware implementation. Therefore, model compression is a natu-
tational intensive part, i.e. the encoder and decoder stack on chip,
ral method to mitigate the these challenges.
avoiding frequently access off-chip weights, hence to accelerate
computation. Second, to mitigate the I/O constrain, we developed
4.1 Enhanced BCM-based Transformer the inter-layer coarse grained pipelining, intra-layer fine grained
CirCNN [5] and C-LSTM [19] have adopted BCM for model com- pipeling, and computation scheduling.
pression on small to medium scale datasets in image classification
and speech recognition, respectively, and achieved significant im-
provement in terms of performance and energy efficiency compared 5.1 Overall Hardware Architecture
to the prior arts. Using this method, we can reduce weight storage As shown in Figure 3, the proposed hardware architecture consists
by replacing the original weight matrix with one or multiple blocks of computation units for encode/decoder computation, on-chip
of circulant matrices, where each row/column is the cyclic reformu- memory banks, a transformer controller, and an off-chip memory
lation of the others. We use b to represent the row/column size of (DDR) and DDR controller. The transformer controller communi-
each circulant matrix (or block size, FFT size). Suppose the shape of cates with the host and controls all the modules in FPGA. The host
Q
a weight matrix in Transformer (e.g., Wi , WiK , WVi ) is W ∈ Rm×n , PC loads the inputs (i.e., sentence pairs ) to the FPGA for inference
there will be f × д blocks after partitioning W, where f = m ÷ b through PCIE. On the FPGA part, given the tokenized sentences,
and д = n ÷ b. Then W = [Wi j ], i ∈ {1 . . . f }, j ∈ {1 . . . д}. the embedding look up module accesses DDR to fetch embeddings.
The input x is also partitioned as x = [xT1 , xT2 , . . . , xTд ]T . In Next, the embeddings will be fed into the pipelined encoder/decoder
each BCM, only the first column/row is needed for storage and stacks to perform inference.
computation, and is termed the index vector, pi j . The theoretical The computing units consist of multi-head attention, scaled dot
foundation is derived in [20], which demonstrates the universal product attention, point wise feed forward layer, linear, and ad-
approximation property and the error bounds of BCM-based neural d/norm. The transformer controller orchestrates the computing
networks are as efficient as general neural networks. flow and data flow of inputs from PCIEs, BRAMs and computing
Prior works [5, 19] have not investigated large-scale language units on the FPGA fabric. Since the encoder and decoder share same
representations. To further maintain the prediction accuracy, we type of operations, so we first decompose them into different com-
use an enhanced BCM-based model compression. We modify the puting primitives, including matrix multiplication of different sizes,
formulation of the index vector as follows: vectorized exponentials etc. The multi-head attention, linear, and
 b1 bj=1 W1j 
Í add/norm modules are reconfigured to form as encoder or decoder
 Íb
1 W 
 under the transformer control logic. We have two pipeline strate-
pi j =  b j=1 2j  (3) gies. For shallow networks, the entire network can be straightfor-
 Í. . .
1 b W  wardly implemented, i.e. all layers can be implemented by dedicated

 b j=1 b j 
FPGA resources. For the state-of-the-art designs such as BERT and
where Wi j is a circulant matrix. We observe that in this way, we
RoBERTa, there are multiple encoders/decoders, hence the resource
can better preserve the parameter information and maintain the
such as DSPs may not enough. In such cases, reuse of certain PE or
overall prediction accuracy. The main reason is that prior works
entire encoder/decoder module are necessary.
take the first column/row as the index vector, missing the effective
representations for other rows/columns.
Based on the circulant convolution theorem [14, 17], instead of 5.2 Multi-Head Attention Design
directly performing the matrix-vector multiplication, we could use
Multi-head attention includes multi-processing elements (named
the fast Fourier transform (FFT)-based multiplication method, and
PE) banks, for matrix multiplication), buffers (K buf, Q buf, and
it is equivalent to matrix-vector multiplication. The calculation
V buf), a normalization module (Norm), a masking function for
of a BCM-based matrix-vector multiplication Wi j xj is: Wi j xj =
masked multi-head attention, and a softmax module as described
pi j ⊛ xj = IFFT FFT(pi j ) ◦ FFT(xj ) , where ‘⊛ ’ represents circular

in Equation (2) and shown in Fig. 4.
convolution, and ◦ is element-wise multiplication. Therefore, the
computational complexity is reduced from O(b 2 ) to O(b log b). Host DDR
Controller
DDR FPGA
Host
CPU Embedding
Lookup Memory Bank Memory Bank
5 ARCHITECTURE
FPGA is widely used to accelerate deep learning models for its
Buffer

high parallelism and low latency. As large amount of transformer Host


Memory
PCIe

parameters exceed the on-chip memory or block RAM (BRAM) Encoder Stack
Transformer
Decoder Stack

capacity on FPGA fabric, even with model compression technique, Control Logic

the full model cannot be stored on chip. To address the challenge,


we partition a model into embedding layer and encoder/decoder Figure 3: The overall hardware architecture on FPGA.
ISLPED ’20, August 10–12, 2020, Boston, MA, USA Bingbing Li et al.

Cntl Head h units to support scaling and softmax required by multi-head atten-
BRAM ...
... K
PE Bank K Buf
tion. The output of multipliers are fed into divider or accumulator
PE Bank Norm
BRAM
Cntl Head 2 0 as stream, hence scaling and softmax layer can be overlapped with
BRAM PE Bank Q Buf
K
Q PE Bank K Buf matrix multiplication.
Cntl HeadSoftmax
1
PE Bank Norm
BRAM
DDR BRAM
KBRAM
PE Bank K BufV Buf
PE Bank
V PE Bank Q Buf
PE Bank0 5.3.2 FFT/IFFT-based PE . Figure 5 shows the design of FFT/IFFT-
Weights Q Norm Fabric
PE Bank FPGA
Softmax
0
based PE and softmax, including a FFT/IFFT kernel, an accumulator,
BRAM
PE Bank Q Buf
QBRAM PE Bank V Buf PE Bank
and an adder. The accumulator is an adder tree with N inputs (the
V Softmax
PCIE FPGA Fabric size is chosen the same as the FFT/IFFT kernel size). We select
Input BRAM
V
PE Bank V Buf PE Bank
Radix-2 Cooley Tukey algorithm [7] for FFT implementation.
FPGA Fabric

5.3.3 Softmax Module. Figure 5 (b) shows the implementation of


Figure 4: Multi-head attention (Head1 , · · · , Headh ) design. exp(x )
the softmax function softmax(x)i = Í exp(xi )) . The exponential
j j
The input are fetched from DDR and fed into encoder pipeline, function exp(x i ) or exp(x j ) is expensive in resource consumption
then multiplied with a set of query matrix Q and key matrix K stored for FPGAs. We adopt piece-wise linear functions to estimate their
on BRAMs. The intermediate results QWQ and KWK are then prop- outputs, in order to simultaneously reduce the resource consump-
agated to the buffers (i.e., K buffer and Q buffer to store KWK , and tion and maintain the accuracy. A buffer is used to store exp(x i )
QWQ , respectively). Next, we compute the matrix multiplication and an accumulator is used to compute the summation of exp(x j ).
of the values stored in the K buffer and W buffer. The product will Next, we perform the division and generate the softmax results.
pr oduct
be loaded to the normalization (Norm) module, i.e., √ . After
dk 6 DESIGN AUTOMATION & OPTIMIZATION
the softmax module, the results will be propagated to a PE bank
to perform matrix multiplication with the matrix stored in V Buf, We developed a workflow to prototype and explore the hardware
i.e., VWV . Each head will have a local controller to orchestrate the architecture. First, we generate a data dependency graph based on
computing flow and data flow of PE banks and buffers. The local trained models to illustrate the computation flow. The operators
controller will also enable the multiplexer for masked multi-head in graph are scheduled to compose the pipeline under the design
attention with a goal of masking the future tokens in a sequence constraints, to achieve maximum throughput. At last, a code gener-
(setting to 0), to prevent the current output predictions from being ator receives the scheduling results and generates the final C/C++
able to see later into the sentence. To support masked multi-head implementation, which can be fed into the commercial HLS tool
attention, the local controller controls multiplexer to set future for synthesis. Our target synthesis backend is Xilinx SDx.
tokens to 0, such that current output predictions are not able to The major computationally intensive operations are shown in
see later sequences. Decoder has one more multi-head attention, Figure 6. Other operations such as division and softmax consume
thus takes longer time to compute than encoder. In the case of much less time, and can be merged/overlapped with these major
decoder module has to be reused, to prevent encoder pipeline stall, operations. The computation in different layers can be decomposed
a buffer is placed between the encoder and decoder stacks. The into common computing elements, i.e., PEs. The layers in same
buffer also stores the output from the last encoder to support the color can be performed by same PE, however, with unbalanced
residue connection. operations. For example, the time consumed by the KWK , QWQ
and VWV is roughly 4 times of computation required by the n
heads. To improve the utilization of pipeline, it is desirable to let
5.3 PE Design and Softmax Module Input VWV split

We develop three different configurable PEs, which as PE-A, PE-B, Head


n
Attn n

and PE-FFT/IFFT. For the BCM-based matrix-vector multiplication Input QW Q split FC Add 1
FFT-
IFFT 1
FFT-
IFFT 2
Add 2

in FC layers, we use FFT/IFFT-based processing elements (PE); for Head


other layers, we use matrix-vector multiplication, i.e., PE-A and
Attn 2
2
Input KWK split
PE-B for different matrix sizes. Head
Attn 1
1

0 1 2 3 4 5 6 7 8
5.3.1 Matrix-vector multiplication-based PE. The major part of the
PE-A or PE-B is a multiplier for matrix multiplication of different Figure 6: Data flow of major operations.
sizes. It also consists of two accumulators, dividers and exponential Resource Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 Stage 7 Stage 8
MM-A 1
Wk * k Wv * v
PE BRAM FFT(W) MM-A 2
MM-A 3
Wq * q FC
FFT MM-A 4
FFT
MM-B 1 Head Head Att Att
MAC Exp(-x) Acc MM-B 2 Head Head Att Att
MM-B 3 Head Head Att Att
Buf DIV MM-B 4 Head Head Att Att
Adder
Add Add
Register bank Adder
(FFT twiddle factors)
FFT-IFFT FFT-IFFT FFT-IFFT
(a) PE design (b) Softmax module

Figure 5: FFT/IFFT-based PE and softmax module design. Figure 7: Fine grained operation scheduling
FTRANS: Energy-Efficient Acceleration of Transformers using FPGA ISLPED ’20, August 10–12, 2020, Boston, MA, USA

each layers consumes roughly the same time. This can be achieved Algorithm 1: Pseudo-code for operation scheduling
by allocating more resources to the slowest layer. We adopt a two- Input: Dependency graph G(V , E), available PEs Op = {P E _A1, P E _A2, ..., Adder }
stage optimization flow. In first stage, we find a resource scheme Output: C and designated P E for each Layer
Q = T O PO _SO RT (G(V , E)) \\ Topological sort to obtain priority queue of all layers
that can minimize the maximum time required by layers. In second P = Q [0] \\ List of layers to be scheduled
stage, under such resource constrains, we optimize the scheduling E = ∅ \\ List of layers being executed
S = ∅ \\ The final schedule result
of an individual encoder/decoder. st aдe = 0
while Q , ∅ ∧ E , ∅ do
The optimization starts from a basic implementation of an indi- for l ayer ∈ Q do
vidual encoder/decoder, i.e. no parallization nor resource reusing, if available P E∃Op for l ayer then
Q .pop()
such that we can obtain an estimation of resource consumption, Op .r emove(P E)
number of operations and execution time of each layer, through- E .push _back ((l ayer, P E))
for V ∈ N EI GH BO R(l ayer ) do
put obtained by unit number of resources. Then we will examine Q .push _back (V )
how much resource can be allocated to each encoder/decoder to end
end
minimize the execution time of the slowest layer: st aдe+ = 1
minimize max(T1 ,T2 , ...,Tn ), for l ayer, P E ∈ E do
if I S _F I N I S H ED(l ayer ) == T rue then
(4) E .pop()
Õ
subject to R F [i] ≥ M R j [i] + Rmisc [i] S .push _back ((l ayer, st aдe, P E))
j Op .push _back (P E)
where i ∈ (0, ..., 3), j ∈ n, n is the number of layers, M is the to- end
tal number of encoder/decoder, R F = [R F F , R LU T , R DS P , R BRAM ] end
return S
is on-chip resource constraints for look-up table (LUT), flip-flop
(FF), digital signal processing unit (DSP), and BRAM, respectively. representative Transformer structures, i.e., a shallow Transformer
T j is the time required by the j-th layer. R j is resource utiliza- with both encoder and decoder, and a pretrained deep Transformer
tion of the j-th layer, which is also represented as a vector: R j = architecture - RoBERTa (base configuration) which only has en-
j j j j
[R F F , R LU T , R DS P , R BRAM ]. Rmisc is the resource utilization of coder [10]. The shallow Transformer is evaluated in a language
modules except encoder/decoder module, such as DDR controller, modeling task, which is an unsupervised sequence-to-sequence
PCIE controller, etc. T j can be given as: problem that requires the decoder part. On the other hand, we run
i a RoBERTa on a sentiment classification task that is a supervised
T j = ⌈Nop /(F j · K j )⌉, j ∈ n (5)
i
where Nop is the number of operations required by the j-th layer. classification problem without the requirement for decoder block.
K j is resource allocation factor of the j-th layer. F j is the through- The software is implemented in PyTorch deep learning framework
put of non-optimized design, which can be obtained empirically. [15] and FairSeq sequence modeling toolkit [13]. Table 1 summa-
Therefore, the throughput is: rizes the key parameters of the shallow Transformer and RoBERTa
Throuдhput = f req/(n · max(T1 ,T2 , ...,T j )) (6) models in the experiments.
It finds the slowest layer, allocates more resources, then up- Table 1: Key parameters of Shallow Transformer and
dates the resource consumption and execution time. If resource RoBERTa
constraints are satisfied, we repeat this procedure until no more
speedup. Then the algorithm will examine the fastest layer. If it Model Transformer Transformer Hidden Attention Total
Configuration Structure Layers Size Heads Params
takes significantly less time than the slowest layer, it is possible to Shallow Transformer encoder-decoder 2 200 4 6M
allocate less resources for that layer, hence more resources can be as- RoBERTa (base config.) encoder only 12 768 12 125M
signed to the slowest layer. After this procedure, we obtain resource
constraints, e.g. the No. of different PEs of an encoder and decoder. 7.1.1 Finetuned RoBERTa for Sentiment Classification. We evaluate
Under resource constraints, each layer may not have dedicated com- the proposed model compression method for finetuned RoBERTa [10]
putation resource, hence matrix multipliers, adders, etc. have to be on IMDB movie review sentiment classification [11] to shed some
shared. Therefore, the computation has to be carefully scheduled to light on training trial reductions. Starting from the saved state of
minimize the computation latency. The encoder/decoder can be rep- pretrained models in work [10], we finetune the model until it
resented as a Directed Acyclic Graph (DAG) – G(V , E), where V is a reaches to its best validation accuracy at 95.7%. To maintain over-
set of vertices representing different computation, edges E indicate all accuracy, we compress partial layers. The process suppresses
the dependency. The available computation units such as PEs and randomness by using a deterministic seed. Thus the accuracy dif-
adders are represented by a set Op = {PE − A1, PE − A2, ..., Adder }. ference between the original RoBERTa and compressed version is
The algorithm used for operation scheduling takes G and Op as sorely contributed by the compression techniques.
input, is shown in Algorithm 1.
7.1.2 Shallow Transformer. Language modeling task takes a se-
7 EVALUATION quence of words as input and determines how likely that sequence
is the actual human language. We consider the popular WikiText-2
7.1 Training of Transformer-Based Language dataset [12] in this experiment, which contains 2M training tokens
Representation with a vocabulary size of 33k. A shallow Transformer model with 4
In this section, we apply both enhanced BCM-based model compres- attention heads and 200 hidden dimension is established.
sion on the linear layers, and adopt 16 fixed-point data representa- The baseline and model compression results of shallow Trans-
tion for all the weights. We evaluate the accuracy impact with two former and RoBERTa on WikiText-2 and IMDB review are shown
ISLPED ’20, August 10–12, 2020, Boston, MA, USA Bingbing Li et al.

Table 2: Comparison among different model configurations size is 8 since the latency will be significantly increased and the
ID
Network Block WikiText-2 ACC loss ACC loss throughput will not be increased when we use larger batch size.
Type Size (ACC) % with BCM (%) with BCM & Quant. (%)
1 Shallow Transformer − 91.3 − − 7.2.3 Cross-platform comparison. We compare the performance
2 Shallow Transformer 4 90.7 0.6 0
3 Shallow Transformer 8 90.7 0.6 0.6 (throughput) and energy efficiency among CPU, GPU, and FPGA
4 Shallow Transformer 16 90.0 1.3 0.6 using same model and same benchmark (IMDB), as shown in Ta-
ID
Network Block IMDB ACC loss ACC loss ble 4. We also validate our method on embedded low-power devices,
Type Size (ACC)% with BCM (%) with BCM & Quant. (%)
4 RoBERTa (base) − 95.7 − − implement our pruned model on Jetson TX2, an embedded AI com-
5 RoBERTa (base) 4 91.5 4.2 4.3 puting device. ItâĂŹs built by a 256-core NVIDIA Pascal-family
6 RoBERTa (base) 8 91.4 4.3 4.3
GPU and the memory is 8 GB with 59.7 GB/s bandwidth. Our FPGA
Table 3: Comparison among different model configurations design achieves 27.07× and 81× improvement in throughput and
energy efficiency compared to CPU. For GPU TRX5000, the power
Shallow Transformer consumption is 5.01× compared to that of FPGA, and Our FPGA
Batch Size DSP FF LUT Latency (ms) Power (W) Throughput (FPS)
1 5647 304012 268933 2.94 22.45 680.91 design achieves 8.80× improvement in energy efficiency and 1.77×
4 5647 304296 269361 11.59 22.52 690.50 throughput improvement compared to GPU. For embedded GPU
8 5647 305820 269753 22.90 22.66 698.72 Jason TX2, our FPGA design achieves 2.44× improvement in energy
16 5647 306176 270449 45.54 22.73 702.54
efficiency.
RoBERTa (base)
Batch Size DSP FF LUT Latency (ms) Power (W) Throughput (FPS) Table 4: The performance and energy efficiency comparison
1 6531 506612 451073 10.61 25.06 94.25 among CPU, GPU, FPGA using RoBERTa
4 6531 506936 451545 40.33 25.13 99.13
8 6531 508488 452005 79.03 25.89 101.23 CPU GPU FPGA Jetson TX2
16 6531 508916 452661 157.18 25.96 101.79 i7-8700K RTX5000 VCU118 Embedded GPU
Throughput (FPS) 3.76 57.46 101.79 9.75
in Table 2, respectively. We compress the models using enhanced Power (W) 80 126 25.13 5.86
BCM-based method with block size of 4 or 8. From Table 2, we Energy efficiency (FPS/W) 0.05 0.46 4.05 1.66
observe that for the shallow Transformer, thers is no accuracy loss
with block size of 4 and only 0.6% accuracy loss with block size of 8 CONCLUSION
8. The RoBERTa, on the other hand, incurs 4.2% and 4.3% accuracy
drop after model compression using 4 and 8 block size, respectively1 . In this paper, we propose an energy-efficient acceleration frame-
We also observe that changing from 32-bit floating point to 16-bit work for transformer-based large scale language representations.
fixed point will not cause accuracy loss. The comparable accuracy Our framework includes an enhanced BCM-based method to enable
between the original model and the weight compressed version model compression on large-scale language representations at the
demonstrates the effectiveness of the proposed model compression algorithm level, and an acceleration design at the architecture level.
method. We propose an FPGA architecture design to support the model
compression technique and we develop a design automation and
7.2 Performance and Energy Efficiency optimization technique to explore the parallelism and achieve high
throughput and performance. Experimental results show that our
7.2.1 Experimental Platform. The Xilinx Virtex UltraScale+ VCU118 proposed framework significantly reduces the size of NLP models
board, comprising 345.9Mb BRAM, 6,840 DSPs, 2,586K logic cells with small accuracy loss on Transformer. Our FPGA-based imple-
(LUT), and two 4GB DDR5 memory, is connected to the host ma- mentation significantly outperforms CPU and GPU in terms of
chine through PCIE Gen3 × 8 I/O Interface. The host machine energy efficiency.
adopted in our experiments is a server configured with multiple
Intel Core i7-8700 processors. We use Xilinx SDX 2017.1 as the com-
REFERENCES
mercial high-level synthesis backend to synthesize the high-level [1] 2019. Supercharge Your AI and Database Applications with Xilinx’s HBM-Enabled
(C/C++) based designs on the selected FPGAs. UltraScale+ Devices Featuring Samsung HBM2. Xilinx white paper, WP508 (v1.1.2)
(2019).
7.2.2 Experimental Results of Transformer and RoBERTa. We imple- [2] James Bradbury and Richard Socher. 2017. Towards Neural Machine Translation
ment the compressed model to FPGA to evaluate the performance with Latent Tree Attention. EMNLP 2017 (2017), 12.
[3] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Ben-
and energy efficiency. For different batch sizes, we obtain the paral- gio. 2014. On the properties of neural machine translation: Encoder-decoder
lelism per stage for the 7 stages in encoder/decoders of Transformer approaches. arXiv preprint arXiv:1409.1259 (2014).
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert:
and RoBERTa based on Algorithm 1 as shown in Table 3, respec- Pre-training of deep bidirectional transformers for language understanding. arXiv
tively. We report the resource utilization on FPGA including DSP, preprint arXiv:1810.04805 (2018).
LUT, and FF. The latency (ms), throughput (frame/sequence per [5] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang,
Xuehai Qian, Yu Bai, Geng Yuan, et al. 2017. Circnn: accelerating and compressing
second) and power consumption (W) are also reported. Our results deep neural networks using block-circulant weight matrices. In Proceedings of the
shows that there is a trade-off between latency and power con- 50th Annual IEEE/ACM International Symposium on Microarchitecture. 395–408.
sumption. For both Transformer and RoBERTa, we can achieve the [6] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
computation 9, 8 (1997), 1735–1780.
best trade-off (the lowest ratio of Latency/Power) when the batch [7] S Lennart Johnsson and Robert L Krawitz. 1992. Cooley-tukey FFT on the
connection machine. Parallel Comput. 18, 11 (1992), 1201–1221.
1 The accuracy drop on RoBERTa is slightly higher because its parameters are carefully
[8] Joo-Kyung Kim and Young-Bum Kim. 2018. Supervised Domain Enablement
pretrained on the Giga byte dataset (160GB of text) using a masked language model [10] Attention for Personalized Domain Classification. In Proceedings of the 2018
and more sensitive to compression. Conference on Empirical Methods in Natural Language Processing. 894–899.
FTRANS: Energy-Efficient Acceleration of Transformers using FPGA ISLPED ’20, August 10–12, 2020, Boston, MA, USA

[9] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured [15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
attention networks. arXiv preprint arXiv:1702.00887 (2017). Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer 2017. Automatic differentiation in PyTorch. (2017).
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A [16] Peng Shi, Jinfeng Rao, and Jimmy Lin. 2018. Simple Attention-Based Rep-
robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 resentation Learning for Ranking Short Social Media Posts. arXiv preprint
(2019). arXiv:1811.01013 (2018).
[11] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and [17] Julius Orion Smith. 2007. Mathematics of the discrete Fourier transform (DFT):
Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL. with audio applications. Julius Smith.
Association for Computational Linguistics, 142–150. [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
[12] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. [n. d.]. Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
Pointer Sentinel Mixture Models. In 5th International Conference on Learning you need. In Advances in neural information processing systems. 5998–6008.
Representations, ICLR 2017, Toulon, France, April 24-26, 2017. [19] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun
[13] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, Liang. 2018. C-LSTM: Enabling Efficient LSTM Using Structured Compression
David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Techniques on FPGAs. In FPGA’18.
Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations. [20] Liang Zhao, Siyu Liao, Yanzhi Wang, Zhe Li, Jian Tang, and Bo Yuan. 2017. Theo-
[14] Victor Pan. 2012. Structured matrices and polynomials: unified superfast algorithms. retical Properties for Neural Networks with Weight Matrices of Low Displacement
Springer Science & Business Media. Rank. In International Conference on Machine Learning. 4082–4090.

You might also like