Artigo Científico

Hardware Design for the 32x32 IDCT of the
HEVC Video Coding Standard

Ruhan Conceição, J. Cláudio Souza Jr., Ricardo Jeske, Marcelo Porto, Júlio Mattos, Luciano Agostini
Group of Architectures and Integrated Circuits – GACI
Federal University of Pelotas – UFPel
Pelotas, Brazil
{radconceicao,jcdsouza,rgjeske,porto,julius,agostini}@inf.ufpel.edu.br
Abstract: This paper focuses on the inverse transforms defined in the High Efficiency Video Coding (HEVC) standard. The transform stage is one of the innovations introduced by HEVC, since it supports more transform sizes (four) and larger transforms (up to 32x32) than previous standards. The inverse DCT is performed by both the video encoder and the decoder. This paper presents an efficient hardware design for the 32x32 HEVC IDCT based on the separability principle. The design targets real-time processing (at least 30 frames per second) of high-resolution videos by exploiting a high level of parallelism (32 samples consumed per clock cycle). The architecture was also planned to reach low latency and low cost, so it was designed in a purely combinational way using a multiplierless approach. The synthesis targeted an Altera Stratix IV FPGA. The synthesis results show that the designed architecture is able to process more than 30 QFHD frames (3840x2160 pixels) per second, with a latency of 33 clock cycles.

Keywords: HEVC; IDCT; Hardware Design; FPGA; Multiplierless.

I. INTRODUCTION

The resolution and quality of digital videos have been improving quickly and steadily. Additionally, such videos are supported by an increasing number of electronic devices (smartphones, set-top boxes for digital television, Blu-ray players, etc.). As a consequence, the study and improvement of video encoders and decoders is extremely relevant in the current scenario, since the devices that process digital videos, with their diverse features, must be able to process high-resolution videos in real time. For this reason, compression rate, video quality, computational complexity and energy consumption must all be improved, and they are thoroughly investigated in this area.

Video coding is imperative in applications that handle digital videos, since an uncompressed video requires a prohibitive volume of bits to be represented [1]. H.264/AVC [2] is the latest video coding standard available, presenting significant compression gains when compared to the MPEG-2 standard [3]. In January 2010, the JCT-VC [4] (Joint Collaborative Team on Video Coding) was created, composed of experts from ITU-T and ISO/IEC, to start the development of a new video coding standard called High Efficiency Video Coding (HEVC) [5]. The goal of the JCT-VC was to increase video compression by 50% while maintaining the same computational complexity. However, the latter goal was not reached, and the standard has a higher computational complexity than H.264/AVC. The HEVC final draft was approved by the JCT-VC group in January 2013, and the document is now being evaluated by the other ITU-T and ISO instances.

Current video coding standards are called hybrid encoders, since they use an initial prediction stage (intra-frame or inter-frame prediction), followed by a transform and quantization stage and, finally, an entropy coding stage [1]. HEVC follows this coding structure. Among the coding stages, the transforms hold an important position. The purpose of the transform stage is to concentrate the energy of an image block in just a few numerical coefficients, so that the following stages (quantization and entropy coding) can be performed much more efficiently.

Encoding a frame introduces data losses, especially in the quantization stage. Since the current frame is used as a reference in the prediction stage for the next frames, the current frame must also be decoded on the encoder side to ensure that the encoder and the decoder use the same references. Thus, an inverse transform process is also necessary on the encoder side, and the inverse transforms are used on both the encoder and decoder sides [6].

The Discrete Cosine Transform (DCT) is the main HEVC transform, as in other image and video coding standards [1]. HEVC uses the same idea presented in H.264/AVC [7], where an integer approximation of the DCT is used to simplify the calculations and to avoid mismatches between encoders and decoders [8]. However, HEVC includes some important novelties in the transform stage. The first innovation is the definition of an independent data structure to handle the transform decisions. This structure is called Transform Unit (TU), and it is organized as a quadtree to evaluate the encoding options [16]. The other innovation is the use of four different transform sizes: 4x4, 8x8, 16x16 and 32x32 [7], which increases the decision options in the TU structure. Previous standards commonly use only the 4x4 and/or 8x8 sizes. The use of more transform sizes increases the compression efficiency, but it also increases the encoder complexity [9].

The forward and inverse discrete cosine transforms for all four sizes are used on the encoder side, whereas only the inverse transforms are necessary on the decoder side.
978-1-4799-1132-5/13/$31.00 ©2013 IEEE
This paper presents a hardware design for the 32x32 Inverse Discrete Cosine Transform (IDCT) defined in HEVC. The designed hardware aims to process 30 Quad Full High Definition (QFHD) frames (3840x2160 pixels) per second. This high throughput was obtained by exploiting parallelism: 32 input samples are processed at each clock cycle.

The architecture was also designed to have low latency, since this feature is very important on the encoder side, especially for intra-frame prediction [10]. Intra-prediction uses the neighboring blocks inside a frame as references, and these reference blocks must be processed by the forward and inverse transforms (and also by the forward and inverse quantization) before being used as references [7]. The low latency was obtained through a purely combinational design, in which a group of 32 samples is processed at each clock cycle.

Finally, the architecture was also designed to have a low hardware cost, respecting the two previous goals. To reach this goal, a multiplierless approach was used: all the multiplications present in the 32-point IDCT equations were converted into shifts and adds.

This paper is organized as follows: Section II explains the HEVC IDCT. Section III presents some related works. Section IV presents the designed architecture. Section V shows the synthesis results and, finally, Section VI presents conclusions and future works.

II. INVERSE DISCRETE COSINE TRANSFORM

A generic 2-D IDCT is defined in (1) and (2), where N and M are the numbers of IDCT points, F(u,v) is the input at position (u,v) of the input matrix, and f(x,y) is the output at position (x,y). As can be seen, a direct design of equations (1) and (2), without any kind of optimization, would implement a huge number of adders and multipliers, consuming a prohibitive hardware area.

$$f(x,y) = \frac{2}{\sqrt{NM}} \sum_{u=0}^{N-1} \sum_{v=0}^{M-1} C(u)\,C(v)\,F(u,v)\,\cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2M}\right] \qquad (1)$$

$$C(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases} \qquad (2)$$

As discussed before, the HEVC transforms present some innovations when compared with previous standards. The basic units of the HEVC transforms and quantization are called Transform Units (TUs). They are always square and can assume a size of 4x4, 8x8, 16x16 or 32x32 samples, structured in quadtrees [7]. The use of larger transform sizes together with a higher number of available transform sizes considerably increased the coding efficiency of this module, but it also considerably increased the number of necessary calculations.

The 1-D DCT HEVC transforms present a peculiar feature: the 4-point DCT (used to generate the 4x4 DCT) is part of the 8-point DCT (used to generate the 8x8 DCT); the 8-point DCT is also part of the 16-point DCT, and this is repeated to build the 32-point DCT [8].

HEVC also uses the traditional approach of reducing the number of 2-D DCT/IDCT calculations through the separability property. This property states that the 2-D DCT/IDCT can be calculated through two 1-D DCTs/IDCTs. First, the 1-D IDCT is performed for each column of the input matrix and the output coefficients are stored in a transposition matrix row by row. Then, the 1-D IDCT is performed again, column by column, from the transposition matrix.

Fig. 1 Demonstration of separability property.

III. RELATED WORKS

Since HEVC has not been approved yet, there are few published works about hardware implementations of the inverse DCT for this standard. The few published papers on this issue focus on different technologies, which hampers a fair comparison.

Shen [11] presents a hardware design for the IDCT of the different sizes present in video encoders. This work uses memory instead of registers to implement the transposition matrix. The same 1-D DCT architecture is reused to perform the 2-D calculation, which impairs the processing rate. Moreover, Shen shows that the MCM (multiple constant multiplication) technique is not so advantageous for the larger DCTs. Thus, he uses a common multiplier for the 16-point and 32-point DCTs. Finally, Shen's architecture is able to consume four input samples per clock cycle, and it was designed with five pipeline stages, reaching an operating frequency of 350 MHz. This work was implemented in a 0.13um standard-cell technology. No algorithmic optimization is presented.

The work presented in [12] shows a novel hardware-shared architecture to compute the 8x8 integer IDCTs of HEVC and H.264/AVC. The hardware was described in Verilog and synthesized for a Xilinx Virtex-4 FPGA. The design was also synthesized using a 0.18um CMOS technology. The resource-shared design costs 12.3K gates and 4K standard cells, with a maximum operating frequency of 211.4 MHz. This architecture is able to process one sample per clock cycle.

In [13] we present a previous work of our group. That paper presents the hardware design of a 16-point 1-D forward DCT used in HEVC. The architecture processes 16 samples per clock cycle and was synthesized for two different FPGAs. The implemented hardware was purely combinational and some algorithmic optimizations were applied. The highest operating frequency was 87.60 MHz, reaching a competitive processing rate.
Edirisuriya [14] proposes an architecture for a transform engine capable of computing the HEVC 16x16 2-D DCT/DST multitransform without the use of multipliers. This architecture works at 250 MHz and processes 16 samples per clock cycle. The proposed architecture was implemented on an FPGA and was also mapped to a 45nm technology.

Budagavi [15] proposed a unified hardware solution for the DCT/IDCT. This architecture was described in RTL and synthesized for a 45nm standard-cell technology. It performs its calculations at a frequency of 250 MHz and processes 32 samples per clock cycle.

The work presented in [16] shows a multiplierless architecture for the 16-point DCT. It was implemented in a 90nm standard-cell technology. This hardware works at a frequency of 150 MHz and is able to process 16 samples per clock cycle.

IV. DESIGNED ARCHITECTURE

The architecture was divided into five parts: two register sets (input and output), two 1-D IDCT architectures and a transposition matrix. Figure 2 shows a block diagram of the designed architecture.

The architecture was described in behavioral VHDL and the synthesis targeted Altera FPGAs.

The 1st 1-D IDCT consumes 32 samples per clock cycle, which are provided by the input register set. Following the separability property, this part of the architecture stores the resulting coefficients in the respective column of the transposition matrix. After the 32 groups of 32 samples provided by the input register set are processed, the 2nd 32-point IDCT is able to start its processing from each line of the transposition matrix, processing 32 samples per clock cycle. This means that the first set of 32 outputs is delivered in 64 clock cycles and a new group of 32 samples is delivered at each new clock cycle.

A. The 32 Points IDCT Architecture

The main part of the 2-D IDCT architecture is the 1-D IDCT, since two instances of the 1-D IDCT are used in the 2-D design, exploiting the separability property.

The first part of the designed 1-D IDCT architecture handles the equations that contain multiplications. The IDCT equations defined in the HEVC Test Model (HM) reference software [18], [19] present many multiplications by constants. However, complete multipliers are very costly in terms of area and power consumption due to the number of logic gates necessary to implement them [17]. These operators are also slower than more common operators, such as adders or subtractors. Thus, to reduce the hardware cost and to increase the processing rate, the multiplications were decomposed into shifts and adds, and multipliers were avoided in the designed architecture.

Table I shows examples of constants used in the 32-point IDCT multiplications and the respective shift and add operations used to represent the original operations. In these examples, the variable "X" is an input coefficient that is multiplied by the respective constant.

TABLE I. EXAMPLE OF CONSTANT MULTIPLICATIONS AND THEIR RESPECTIVE SUMS AND SHIFT

Constant  Shifts and Adds
89        X<<6 + X<<4 + X<<3 + X
75        X<<6 + X<<3 + X<<1 + X
50        X<<5 + X<<4 + X<<1
18        X<<4 + X<<1
83        X<<6 + X<<4 + X<<1 + X
36        X<<5 + X<<2
64        X<<6

Table II shows examples of the equations, with additions and multiplications among the input coefficients, that generate the respective results. The index used in Table II for each input represents the position of that input inside a line of 32 inputs.

TABLE II. EXAMPLE OF MULTIPLICATION STAGE EQUATIONS

Result  Operations
O15     4*I1 - 13*I3 + 22*I5 - 31*I7 + 38*I9 - 46*I11 + 54*I13 - 61*I15 + 67*I17 - 73*I19 + 78*I21 - 82*I23 + 85*I25 - 88*I27 + 90*I29 - 90*I31
EO7     9*I2 - 25*I6 + 43*I10 - 57*I14 + 70*I18 - 80*I22 + 87*I26 - 90*I30
EEO3    18*I4 - 50*I12 + 75*I20 - 89*I28
EEEO1   36*I8 - 83*I24
EEEE1   64*I0 - 64*I16

Table III compares the number of operations necessary to perform the 32-point DCT and the 32x32 DCT calculations through: (a) the mathematical definition, presented in (1) and (2); (b) the equations extracted from HEVC Test Model version 9 (HM:9) [18], [19]; and (c) the decomposition of the multiplications into shifts and adds. Table III shows the important impact of the simplifications used in the HEVC/HM over the original mathematical definition. It also shows the impact of the multiplierless approach defined in this work, where the number of adders increased but no multipliers were used.
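The decompositions listed in Table I can be checked with a short script. The following Python sketch is illustrative only and is not the VHDL description used in this work; the names SHIFT_ADD_TABLE, multiply_by_constant and adder_count are hypothetical. It encodes each Table I constant as a tuple of left-shift amounts and confirms that summing the shifted copies of X reproduces the multiplication.

```python
# Shift-and-add decompositions copied from Table I.
# Each constant maps to the left-shift amounts of its terms (0 means plain X).
SHIFT_ADD_TABLE = {
    89: (6, 4, 3, 0),
    75: (6, 3, 1, 0),
    50: (5, 4, 1),
    18: (4, 1),
    83: (6, 4, 1, 0),
    36: (5, 2),
    64: (6,),
}


def multiply_by_constant(x: int, constant: int) -> int:
    """Multiply x by a Table I constant using only shifts and additions."""
    return sum(x << shift for shift in SHIFT_ADD_TABLE[constant])


def adder_count(constant: int) -> int:
    """Number of two-input adders needed to sum the shifted terms."""
    return len(SHIFT_ADD_TABLE[constant]) - 1


if __name__ == "__main__":
    for constant in SHIFT_ADD_TABLE:
        # Sweep a range of signed 16-bit inputs (hypothetical test values).
        for x in range(-32768, 32768, 257):
            assert multiply_by_constant(x, constant) == constant * x
        print(constant, "adders:", adder_count(constant))
```

In this sketch, multiplying by 89 uses four shifted terms and therefore three adders, whereas 64 needs a single shift and no adder, which is consistent with the growth in the number of adds reported for the multiplierless approach.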
Fig. 2 2-D IDCT Architecture

TABLE III. COMPARISON OF THE NUMBER OF OPERATIONS USED TO CALCULATE 2-D DCTS

Transform  IDCT Version  Adds       Multiplications
32 points  Mathematical  1,023      1,024
32 points  HM:9          404        344
32 points  Proposed      1,240      0
32x32      Mathematical  1,047,522  1,048,576
32x32      HM:9          25,856     22,016
32x32      Proposed      2,480      0

The results of equations such as those presented in Table II are used as inputs of the second step of the 1-D IDCT. The first step generates sixteen "O" outputs, eight "EO" outputs, four "EEO" outputs, two "EEEO" outputs and two "EEEE" outputs. Most of the equations used to generate these 32 outputs were omitted due to the available space, but the complexity of the equations at each level is equivalent to that of the equations presented in Table II.

The second part of the 1-D IDCT architecture performs the butterfly calculations. In this process only sums and subtractions are performed, so only adders and subtractors were implemented. Figure 3 illustrates the butterfly block architecture. Each black circle represents an adder and each white circle represents a subtractor. The black squares are used only to make the butterfly easier to understand, and they are not implemented in hardware.

Fig. 3 1-D IDCT Architecture Butterfly Block

In a final stage, the 1-D IDCT architecture performs a rounding process on each output sample.

The main differences between the 1st and 2nd 1-D IDCT architectures are the number of bits necessary to represent the input and output samples and the number of bits deleted in the rounding stage. The first architecture uses 16 bits to represent each input sample and 14 bits to represent each output sample. The second architecture uses, respectively, 14 and 9 bits to represent each input and output sample. Moreover, the first architecture deletes 7 bits in the rounding process whilst the second one deletes 12 bits, in accordance with HM version 9 [18], [19].

Figure 4 shows the complete block diagram of the 1-D IDCT architectures, linking the three parts explained above.

Fig. 4 1-D Inverse DCT 32-points Block Diagram

B. The Transposition Matrix Architecture

Figure 5 shows the transposition matrix architecture, which was designed using registers and multiplexers.

Fig. 5 Transposition Matrix
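The role of the transposition matrix in the separable 2-D computation of Section II can be illustrated in software. The Python sketch below is a behavioral illustration under stated assumptions, not the designed hardware: it uses a floating-point orthonormal 1-D IDCT in place of the HEVC integer transform, it works on a small square block for readability, and all function names are hypothetical. It applies the 1-D IDCT to each input column, stores each result as a row of the transposition matrix, and then applies the 1-D IDCT again to each column of that matrix. A direct evaluation of the 2-D definition is included only as a cross-check.

```python
import math


def scale(k: int, n: int) -> float:
    """Orthonormal DCT scale factor for frequency index k of an n-point transform."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def idct_1d(coeffs):
    """Direct (unoptimized) orthonormal 1-D IDCT; a stand-in for the integer HEVC IDCT."""
    n = len(coeffs)
    return [
        sum(scale(u, n) * c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
            for u, c in enumerate(coeffs))
        for x in range(n)
    ]


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def idct_2d_separable(block):
    """2-D IDCT through two 1-D passes and a transposition matrix."""
    # First pass: one 1-D IDCT per input column, stored row by row.
    transposition_matrix = [idct_1d(column) for column in transpose(block)]
    # Second pass: one 1-D IDCT per column of the transposition matrix.
    # The result computed from column x is row x of the output block.
    return [idct_1d(column) for column in transpose(transposition_matrix)]


def idct_2d_direct(block):
    """Direct double-sum 2-D IDCT for a square block, used only as a reference."""
    n = len(block)
    return [
        [sum(scale(u, n) * scale(v, n) * block[u][v]
             * math.cos((2 * x + 1) * u * math.pi / (2 * n))
             * math.cos((2 * y + 1) * v * math.pi / (2 * n))
             for u in range(n) for v in range(n))
         for y in range(n)]
        for x in range(n)
    ]


if __name__ == "__main__":
    # Arbitrary 4x4 test coefficients (hypothetical values).
    block = [[float((3 * u + 5 * v) % 7 - 3) for v in range(4)] for u in range(4)]
    fast = idct_2d_separable(block)
    ref = idct_2d_direct(block)
    assert all(abs(fast[x][y] - ref[x][y]) < 1e-9 for x in range(4) for y in range(4))
    print("separable and direct results match")
```

For an N x N block, the two-pass form evaluates 2N one-dimensional transforms instead of the N x N double sums of the direct definition, which is the reduction that the separability property provides.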

Every 32 clock cycles the control signal changes, which changes the multiplexer outputs. This means that the data reading and writing directions are switched from row to column and vice versa. Thus, even using only a single array of registers, it is possible to carry out simultaneously the reading and writing operations necessary in a transposition matrix.

V. RESULTS AND COMPARISON WITH RELATED WORKS

This section presents the synthesis results of the designed architecture. The synthesis targeted the EP4SE820F43I4 device of the Altera Stratix IV FPGA family. Table IV presents the synthesis results for the 32x32 architecture and also the details of its internal modules.

TABLE IV. SYNTHESIS RESULTS OF EACH ARCHITECTURE MODULE

Module                              Frequency (MHz)  ALUTs    Registers
Transposition Matrix and Registers  394.48           478      15,367
1st 1-D IDCT                        43.78            13,517   -
2nd 1-D IDCT                        45.65            12,686   -
2-D IDCT                            43.62            28,311   16,167

The synthesized architecture reached a low operating frequency because a purely combinational approach was used, which results in a long critical path. It is important to notice that this choice was made to allow a low latency in this design. A comparison with related works is presented in the next paragraphs and also in Table V.

Shen [11], Budagavi [15] and Ahmed [16] synthesized the architectures described in their papers for technologies other than FPGAs. On the other hand, Jeske [13] and Edirisuriya [14] implemented their architectures targeting an FPGA device. Martuza [12] implemented his architecture targeting two different technologies: standard cells and FPGA.

Another issue is that some of the related works focus only on the 1-D IDCT design, others focus on the forward DCT and not on the inverse DCT, and others focus on a smaller transform size. Therefore, a fair comparison is very difficult.

Even with the related works targeting different synthesis technologies, some comparisons are possible. Considering the number of samples per cycle that the architectures can process and the achieved frequency, it is possible to calculate the number of samples per second that each architecture is able to process and also the number of frames processed per second. This information is presented in Table V. Table V also presents other results for these architectures, such as hardware consumption and operating frequency.

Some results are not presented in Table V because they are not specified in the respective papers. Moreover, some parameters do not apply to a given implementation, such as the FPGA device for a CMOS implementation.

TABLE V. COMPARISON WITH RELATED WORKS

Parameter                   This Work     Shen [11]               Martuza [12]  Martuza [12]  Jeske [13]   Budagavi [15]  Ahmed [16]
Transform Type              2-D IDCT      2-D IDCT                2-D DCT       2-D DCT       1-D DCT      2-D IDCT/DCT   2-D DCT
Transform Size              32            4/8/16/32               8             8             16           4/8/16/32      16
Technology                  FPGA          0.13um                  FPGA          0.18um        FPGA         45nm           90nm
Hardware Consumption        28,311 ALUTs  109.2K gates            706 LUTs      12.3K gates   5,168 ALUTs  156K gates     -
Registers                   16,167        -                       -             -             -            -              -
SRAM (bit)                  -             2,560                   -             -             -            -              -
FPGA Device                 Stratix IV    -                       Virtex 4      -             Stratix III  -              -
Implement Multipliers       No            Yes                     No            No            No           Yes            No
Frequency (MHz)             43.62         350                     -             211.40        87.60        250            150
Latency (cycles)            33            6 (4x4), 261 (32x32)    65            65            1            64             33
Samples per Clock Cycle     32            4                       1             1             16           32             16
Samples per Second (x10^6)  1,395.84      1,400.00                -             211.4         1,401.6      8,000          2,400
HD 1080p fps (4:2:0)        448           448                     -             68            448          2,572          768
QFHD fps (4:2:0)            112           112                     -             17            112          643            192
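The processing rates in Table V follow from the operating frequency and the number of samples consumed per clock cycle. The Python sketch below reproduces the figures of the "This Work" column. It assumes that a 4:2:0 frame carries 1.5 samples per pixel (one luma sample per pixel plus two chroma planes, each with one quarter of the luma samples) and that frame rates are truncated to whole frames; the function names are hypothetical.

```python
def samples_per_second(frequency_mhz: float, samples_per_cycle: int) -> int:
    """Samples processed per second for a design clocked at frequency_mhz."""
    frequency_hz = round(frequency_mhz * 1_000_000)  # integer Hz avoids float drift
    return frequency_hz * samples_per_cycle


def frames_per_second(frequency_mhz: float, samples_per_cycle: int,
                      width: int, height: int) -> int:
    """Whole 4:2:0 frames per second (assumption: 1.5 samples per pixel)."""
    samples_per_frame = width * height * 3 // 2
    return samples_per_second(frequency_mhz, samples_per_cycle) // samples_per_frame


if __name__ == "__main__":
    # Frequency and parallelism of the 2-D IDCT taken from Tables IV and V.
    rate = samples_per_second(43.62, 32)
    print(rate / 1e6, "Msamples/s")
    print(frames_per_second(43.62, 32, 1920, 1080), "HD 1080p fps")
    print(frames_per_second(43.62, 32, 3840, 2160), "QFHD fps")
```

Under these assumptions the sketch yields 1,395.84 million samples per second, 448 HD 1080p frames per second and 112 QFHD frames per second, matching the values reported for this work.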
When comparing our results with the work presented in [11], both works present almost the same processing rate. However, our architecture is a multiplierless solution whilst Shen's design implements complete multipliers. Thus, even using different technologies, it is possible to infer that our architecture consumes less hardware than Shen's design. The latency of [11] varies according to the size of the IDCT: the 4/8/16/32-point 1-D IDCTs have, respectively, 6/19/68/261 cycles of latency. Considering the 32x32 case, Shen's latency is 7.9 times higher than that of our work.

Martuza [12] implements his architecture in two different technologies, but focuses only on the 8x8 transform size. The frequency result is given only for the standard-cell implementation. Even for a smaller block size, this architecture has a higher latency than our work and it is not able to process QFHD videos in real time.

The hardware implementation presented in [13] is a previous design of our group, but it focuses only on the 16-point 1-D forward DCT. This architecture is able to achieve high processing rates, but the 2-D design is not presented in that work.

Budagavi's architecture [15] reaches higher processing rates than our implementation. On the other hand, his architecture implements complete multipliers, so it is possible to infer that our architecture probably has a lower hardware consumption than [15]. The latency of Budagavi's architecture is almost double our latency.

Ahmed's architecture [16] focuses on the 16x16 forward DCT. His architecture does not implement multipliers and is able to reach good processing rates. The latency is the same as that of our work, but the smaller block size contributes considerably to these good results.

VI. CONCLUSIONS

This work presented a hardware design for the HEVC 32x32 IDCT. The main goals of this design were: (1) to reach high processing rates; (2) to reach low latency; and (3) to reach low hardware resource consumption. The high processing rate was reached by exploiting parallelism, since 32 samples are processed per cycle. The low latency was possible through a combinational design of the 1-D IDCTs, so that only one cycle is necessary to process a complete 1-D IDCT over 32 samples. The 2-D IDCT has a latency of only 33 clock cycles. The low hardware cost was achieved through a multiplierless approach, decomposing the multiplications into shifts and adds.

Synthesis results targeting an Altera FPGA showed that the designed architecture is able to process more than 30 QFHD frames per second. Moreover, it was shown that, even with few published works in the literature, some comparisons were possible, and our architecture reached competitive processing rates and latencies when compared to other works.

As future works, we plan to implement a multi-size architecture that is able to process all IDCT sizes and to research ways to reduce the hardware resource consumption through the sharing of common subexpressions, as described in [13]. Furthermore, we plan to synthesize our solution targeting a standard-cell technology in order to collect information about area and energy consumption.

REFERENCES

[1] D. Salomon, Data Compression: The Complete Reference, 4th ed., Springer-Verlag: Northridge, 2007.
[2] International Telecommunication Union, "ITU-T Recommendation H.264/AVC (03/05): advanced video coding for generic audiovisual services", 2005.
[3] International Telecommunication Union, "ITU-T Recommendation H.262 (11/94): generic coding of moving pictures and associated audio information - part 2: video", 1994.
[4] Joint Collaborative Team on Video Coding (JCT-VC). Available at: http://www.itu.int/en/ITU-T/studygroups/com16/video/Pages/jctvc.aspx
[5] JCT-VC Editors, Recommendation ITU-T H.265 - High Efficiency Video Coding (ITU-T Rec. H.265), April 2013.
[6] L. Agostini, et al., "Design and FPGA Prototyping of a H.264/AVC Main Profile Decoder for HDTV", Journal of the Brazilian Computer Society, vol. 13, pp. 25-36, May 2007.
[7] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard", IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648-1667, Dec. 2012.
[8] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform", IEEE Trans. Commun., vol. COM-25, no. 9, pp. 1004-1009, Sep. 1977.
[9] F. Bossen, B. Bross, K. Sühring, and D. Flynn, "HEVC complexity and implementation analysis", IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1684-1695, Dec. 2012.
[10] D. Palomino, F. Sampaio, L. Agostini, S. Bampi, A. Susin, "A Memory Aware and Multiplierless VLSI Architecture for the Complete Intra Prediction of the HEVC Emerging Standard", in IEEE International Conference on Image Processing (ICIP), Orlando, Los Alamitos, 2012, pp. 201-204.
[11] S. Shen, et al., "A unified 4/8/16/32-point integer IDCT architecture for multiple video coding standards", in Multimedia and Expo (ICME), 2012 IEEE International Conference on, Melbourne, 9-13 July 2012, pp. 788-793.
[12] M. Martuza, et al., "A cost effective implementation of 8x8 transform of HEVC from H.264/AVC", in Electrical & Computer Engineering (CCECE), 2012 25th IEEE Canadian Conference on, Montreal, Quebec, pp. 1-4.
[13] R. Jeske, et al., "Low cost and high throughput multiplierless design of a 16 point 1-D DCT of the new HEVC video coding standard", in Programmable Logic (SPL), Bento Gonçalves, RS, 2012, pp. 1-6.
[14] A. Edirisuriya, et al., "A Multiplication-free Digital Architecture for 16x16 2-D DCT/DST Transform for HEVC", in Electrical & Electronics Engineers in Israel (IEEEI), Israel, 2012, pp. 1-5.
[15] M. Budagavi, et al., "Unified Forward+Inverse Transform Architecture for HEVC", in Image Processing (ICIP), Orlando, Los Alamitos, 2012, Paper: MA.P2.8.
[16] A. Ahmed, et al., "VLSI implementation of 16-point DCT for H.265/HEVC using Walsh Hadamard transform and lifting scheme", in Multitopic Conference (INMIC), Pakistan, 2011, pp. 144-148.
[17] S. Brown, Fundamentals of Digital Logic with VHDL Design, 2nd ed., Higher Education: New York, 2005.
[18] Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, "HM9: High Efficiency Video Coding (HEVC) Test Model 9 Encoder Description", 11th Meeting: Shanghai, CN, 10-19 October 2012.
[19] Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, "High Efficiency Video Coding (HEVC) text specification draft 9", 11th Meeting: Shanghai, CN, 10-19 October 2012.
