Artigo Científico
Artigo Científico
Artigo Científico
Abstract—This paper is focused in the inverse transforms increase video compression in 50% while maintaining the same
defined in the video coding standard HEVC – High Efficiency computational complexity. However, the latter goal was not
Video Coding. The transforms stage is one of the innovations reached and this standard achieved a higher computational
proposed by HEVC since it allows the use of the biggest number complexity than H.264/AVC. HEVC final draft was approved
of transforms sizes (four) and also the biggest transform sizes (till by the JCT-VC group in January, 2013 and now the document
32x32) when compared with previous standards. The inverse is being evaluated by the other ITU-T and ISO instances.
DCT is performed by the video encoder and decoder as well. This
paper presents an efficient hardware design for the 32x32 HEVC Current video coding standards are called hybrid encoders,
IDCT based on the separability principle. The hardware design since they use an initial prediction stage (intra-frame or inter-
was planned to reach real time processing (at least 30 frames per frame predictions), followed by a transforms and quantization
second) for high resolution videos, exploiting a high parallelism stage and, finally, an entropy coding is applied [1]. HEVC
level (32 samples consumed per clock cycle). The architecture follows this condign structure. Among the coding stages, the
was also planned to reach a low latency and a low cost, then it transforms hold an important position. The purpose of the
was designed in a purely combinational way and using a transform stage is to concentrate the energy of an image block
multiplierless approach. The synthesis process was targeted to an in just a few numerical coefficients. Thereby, the following
Altera Stratix IV FPGA. The synthesis results show that the stages (quantization and entropy coding) can be performed in a
designed architecture is capable to process more than 30 QFHD
much more efficient way.
frames (3840x2160 pixels) per second, with a latency of 33 clock
cycles. The process to encode a frame generates data losses,
especially by the quantization stage. Since the current frame is
Keywords—HEVC; IDCT; Hardware Design; FPGA; used as reference in the prediction stage for the next frames,
Multiplierless. the current frame must be also decoded in the encoder side to
ensure that the encoder and decoder will use the same
I. INTRODUCTION
references. Thus, an inverse transform process it is necessary
Nowadays, the resolution and the quality of digital videos also in the encoder size. This way, the inverse transforms are
have been improving in a fast and steady manner. Additionally, used in both encoder and decoder sides [6].
such videos are becoming supported by an increasing number
of electronic devices (smartphones, set-top-box for digital The Discrete Cosine Transform (DCT) is the main HEVC
television, blu-ray players, etc.). As consequence, the study and transform likewise the other image and video coding standards
the improvement of video encoders/decoders is an extremely [1]. The HEVC use the same idea presented in the H.264/AVC
relevant activity in the current scenario, since the devices that [7] when an integer approximation of the DCT is used to
process digital videos, with their diverse features, must be able simplify the calculations and to avoid mismatching between
to process high-resolution videos in real time. For this reason, coders and decoders [8]. But the HEVC includes some
topics such as compression rate, video quality, computational important novelties in the transform stage. The first innovation
complexity and energy consumption must be improved, hence was the definition of an independent data structure to handle
they are thoroughly investigated in this area. with the transforms decisions. This structure is called
Transform Unit (TU) and it is organized as a quadtree to
Video coding is imperative in applications that handle evaluate the encoding options [16]. The other innovation is the
digital videos, since an uncompressed video requires a use of four different transform sizes: 4x4, 8x8, 16x16 and
prohibitive volume of bits to be represented [1]. H.264/AVC 32x32 [7], increasing the decision options in the TU structure.
[2] is the latest video coding standard available, presenting The current standards commonly use 4x4 and/or 8x8 sizes
significant compression gains when compared to the MPEG-2 only. The use of more transform sizes increase the compression
standard [3]. On January 2010, the JCT-VC [4] (Joint efficiency, but also it increases the encoder complexity [9].
Collaborative Team – Video Coding) was created, composed
by experts from ITU-T and ISO/IEC, to start the development The forward and inverse discrete cosine transforms for all
of a new video coding standard called HEVC – High the four presented sizes are used in the encoder side and in the
Efficiency Video Coding [5]. The goal of the JCT-VC was to decoder side only the inverse transforms are necessary.
978-1-4799-1132-5/13/$31.00 ©2013 IEEE
This paper presents a hardware design for the 32x32 The HEVC also uses the traditional approach to reduce 2-D
Inverse Discrete Cosine Transform (IDCT) defined in the DCT/IDCT number of calculations with the use of the
HEVC. The hardware designed aims to process 30 Quad Full separability property. This property considers that the 2-D
High Definition (QFHD) frames (3840x2160 pixels) per DCT/IDCT can be calculated through the calculations of two
second. This high throughput was obtained through the 1-D DCTs/IDCTs. Firstly, for each column of the input matrix
parallelism exploration, where 32 input samples are processed the 1-D IDCT is performed and the output coefficients are
at each clock cycle. stored in a transposition matrix row by row. Then, the 1-D
IDCT is performed again column by column from the
The architecture was designed also to have a low latency, transposition matrix.
since this feature is very important in the encoder side,
especially by the intra-frame prediction [10]. The intra-
prediction uses as references the neighbor blocks inside a
frame and these reference blocks must be processed by the
forward and inverse transforms (and also by forward and
inverse quantization) before be uses as references [7]. The low
latency was obtained though a purely combinational design.
Then a group of 32 samples are processed at each clock cycle.
Finally, the architecture was also designed to have a low
hardware cost, respecting the two previous goals. To reach this
goal, the hardware cost was reduced using a multiplierless
approach. All the multiplications presented in the 32-points Fig. 1 Demonstration of separability property.
IDCT equations were converted in shifts and adds.
This paper is organized as follows: in section 2, HEVC
IDCT will be explained. In section 3 some related works will III. RELATED WORKS
be presented. After, in section 4 the designed architecture will Since the HEVC was not approved yet, there are a few
be presented. Section 5 shows the synthesis results and finally, published works about hardware implementations of the
section 6 presents conclusions and future works. Inverse DCT transforms for this standard. The few published
II. INVERSE DISCRETE COSINE TRANSFORM papers about this issue, are focusing on different technologies,
hampering a fair comparison.
A generic 2-D IDCT is defined in (1) and (2), where N and
M are IDCT points number, F(u,v) is the input for the position Shen [11] presents a hardware design for the IDCT of
(u,v) of input matrix, and F(x,y) is the output coefficient. As it different sizes present on video encoders. This work uses
is able to be seen, a direct design of equation (1) and (2), memory instead of registers to design the transposition matrix.
without any kind of optimization would implement a huge The same 1-D DCT architecture is reused to do the 2-D
numbers adders and multipliers, consuming a prohibitive calculation and then the processing rate is impaired. Moreover,
hardware in terms of area. Shen shows that the MCM technique is not so advantageous
for the bigger DCTs. Thus, he uses a common multiplier for
the 16 and 32-points DCTs. Finally, Shen’s architecture is able
to consume four input samples per clock cycle, and it was
(1) designed in five pipeline stages, reaching an operation
frequency of 350MHz. This work was implemented at 0.13um
standard-cells technology. None algorithmic optimization is
presented.
(2)
The work presented in [12] shows a novel hardware-shared
As discussed before, the HEVC transforms present some architecture to compute the 8x8 integer IDCTs of the HEVC
innovations when compared with previous standards. The basic and the H.264/AVC. The hardware was described in Verilog
unity of HEVC transforms and quantization are called and it was synthesized in a Xilinx Vertex4 FPGA. The design
Transform Units (TU). Their size is always square and can was also synthesized using 0.18μm CMOS technology. The
assume a size of 4x4, 8x8, 16x16 and 32x32 samples, resource-shared design costs 12.3K gates and 4K standard cells
structured in quadtrees [7]. The use of bigger transforms sizes with a maximum operating frequency of 211.4MHz. This
together with a higher number of available transforms sizes architecture is capable to process one sample per clock cycle.
increased a lot the coding efficiency of this module, but it is
also increased a lot the number of the necessary calculations. In [13] we present a previous work of our group. This paper
presents the hardware design of a 16-points 1-D Forward DCT
The 1-D DCT HEVC transforms present a peculiar feature: used in the HEVC. The architecture processes 16 samples per
the 4-point DCT (used to generate the 4x4 DCT) is part of the clock cycle and it was synthesized for two different FPGAs.
8-point DCT (used to generate the 8x8 DCT); the 8-point DCT The implemented hardware was purely combinational and
is also part of the 16 points DCT and this is repeated to build some algorithm optimizations were done. The highest
the 32-points DCT [8]. frequency operation was 87.60MHz, reaching a competitive
processing rate.
Edirisuriya [14] proposes an architecture for a transform and adds. Then the multipliers were avoided in the designed
engine capable of computing the HEVC 16×16 2-D DCT/DST architecture.
multitransform without the use of multipliers. This architecture
works at 250MHz and process 16 samples per cycle clock. The Table I shows an example of constants which are used in
proposed architecture was implemented using a FPGA and it the 32-points IDCT multiplications and the respective sums
was also mapped to a 45nm technology. and shift operations used to represent the original operations. In
these examples it is considered a variable “X” as an input
Budagav [15] purposed an unified hardware solution for the coefficient which will be multiplied by the respective constant.
DCT/IDCT. This architecture was described in RTL and
synthesized for a 45nm standard-cells technology. This TABLE I. EXAMPLE OF CONSTANT MULTIPLICATIONS AND THEIR
architecture is able to perform its calculations in a frequency of RESPECTIVE SUMS AND SHIFT
250 MHz and it process 32 samples per clock cycle. Constant Shifts and Adds
The work presented in [16] shows a multiplierless 89 X<<6 + X<<4 + X<<3 + X
architecture solution for the 16-points DCT. It was
implemented in 90nm standard-cells technology. This 75 X<<6 + X<<3 + X<<1 + X
hardware works in a frequency of 150 MHz and it is able to 50 X<<5 + X<<4 + X<<1
process 16 samples per second.
18 X<<4 + X<<1
83 X<<6 + X<<4 + X<<1 + X
IV. DESIGNED ARCHITECTURE
36 X<<5 + X<<2
The architecture was divided in five parts: two registers sets
(input and output), two 1-D IDCT architectures and a 64 X<<6
transposition matrix. Figure 2 shows a block diagram of the
designed architecture. Table II shows examples of equations with additions and
multiplications done among the input coefficients in order to
The architecture was described in behavioral VHDL and generate the respective results. The index used in Table II for
the synthesis was target to Altera FPGAs. each calculation step represents the position of the specific
The 1st 1-D IDCT consumes 32 samples per clock cycle input inside a line of 32 inputs.
which are provided by the input register set. Following the
separability property, this part of architecture stores the TABLE II. EXAMPLE OF MULTIPLICATION STAGE EQUATIONS
resultant coefficients on the respective column of the
transposition matrix. After 32 groups of 32 samples provided Result Operations
by the input registers set are processed, the 2nd 32-points IDCT 4*I1 - 13*I3 + 22*I5 + -31*I7 + 38*I9 - 36*I11 +
is able to start its processing from each line of the transposition O15 54*I13 - 61*I15 + 67*I17 - 73*I19 + 78*I21 - 82*I23
matrix, processing 32 samples per clock cycle. This means that + 85*I25 - 88*I27 + 90*I29 + -90*I31
the first set of 32 outputs are delivered in 64 clock cycles and a
new group of 32 samples are delivered at each new clock 9*I2 - 25*I6 + 43*I10 - 57*I14 + 70*I18 - 80*I22 +
EO7
cycle. 87*I26 - 90*I30
A. The 32 Points IDCT Architecture EEO3 18*I4 - 50*I12 + 75*I20 - 89*I28
The main part of the 2-D IDCT architecture is the 1-D EEEO1 36*I8 - 83*I24
IDCT, since two instances of the 1-D IDCT are used in the 2-D EEEE1 64*I0 - 64*I16
design, exploring the separability process.
The first part of the designed 1-D DCT architecture handles Table III shows a comparison between the number of
with specific equations that contain multiplications. The IDCT operations necessary to perform the 32 points DCT and the
equations defined in the HEVC Model reference software 32x32 DCT calculations through: (a) the mathematical
(HM) [18], [19] presents many multiplications by constants. definition, presented in (1) and (2); (b) the equations extracted
However, complete multipliers are very costly in terms of area from HEVC Model Version 9 (HM:9) [18], [19]; and (c) the
and power consumption due the necessary number of logic decomposition of the multiplications in shifts and adds. From
gates to implement them [17]. These operators are also slower Table III it is possible to notice the important impacts caused
than other most common operators, like adders or subtractors. by the simplifications used in the HEVC/HM over the original
Thus, to reduce the hardware cost and to increase the mathematical definition. It is also possible to notice the impacts
processing rate, the multiplications were decomposed in shifts of the multiplierless approach defined in this work, where the
number of adders increased but no multipliers were used.
Fig. 2 2-D IDCT Architecture Butterfly Block
TABLE III. COMPARISON OF THE NUMBER OF OPERATIONS USED TO In a final stage, the 1-D IDCT architecture performs a
CALCULATE 2-D DCTS rounding process in each output sample.
The main difference between the 1st and 2nd 1-D IDCT
IDCT Version Adds Multiplications
architectures are the number of bits necessary to represent the
Mathematical 1,023 1,024 input and output samples and the number of deleted bits in the
32 rounding stage. The first architecture uses 16 bits to represent
HM:9 404 344
points each input sample and 14 bits to represent each output sample.
Proposed 1,240 0 On the other hand, the second architecture uses, respectively,
Mathematical 1,047,522 1,048,576 14 and 9 bits to represent each input and output sample.
32x32 HM:9 25,856 22,016 Moreover, the first architecture deletes 7 bits at the rounding
process whilst the second one deletes 12 bits, according with
Proposed 2,480 0 the HM version 9 [18], [19].
Figure 4 shows the complete block diagram of the 1-D
The results presented in Table II are used as input of the IDCT architectures, linking the three parts explained above.
second 1-D DCT step. The first 1-D DCT step generates
sixteen “O” outputs, eight “EO” outputs, four “EEO” outputs,
two EEEO outputs and two “EEEE” outputs. Most of equations
used to generate these 32 outputs were omitted in function of
the available space, but the complexity of the equations at each
level is equivalent with those presented in Table III.
The second part of the 1-D IDCT architecture performs the
butterfly calculations. In this process, only sums and
subtractions are done, thus only adders and subtractors were
implemented. Figure 3 illustrates the butterfly block
architecture. Each black circle represents an adder and white Fig. 4 1-D Inverse DCT 32-points Block Diagram
circles represent subtractors. The black squares are used only in B. The Transposition Matrix Architecture
order to make easy the butterfly understanding, and then they
Figure 5 shows the transposition matrix architecture, which
are not implemented in hardware.
was designed using registers and multiplexers.