
IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, VOL. 64, NO. 1, FEBRUARY 2018 127

Design and Implementation of Efficient Streaming Deblocking and SAO Filter for HEVC Decoder
Swamy Baldev, Kaustubh Shukla, Sushanta Gogoi, Pradeep Kumar Rathore, and Rangababu Peesapati

Manuscript received January 12, 2018; revised February 11, 2018; accepted February 15, 2018. Date of publication March 7, 2018; date of current version March 29, 2018. (Corresponding author: Swamy Baldev.)
S. Baldev, S. Gogoi, P. K. Rathore, and R. Peesapati are with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong 793003, India (e-mail: [email protected]; [email protected]; pradeeprathore@nitm.ac.in; [email protected]).
K. Shukla is with the Department of Electronics and Communication Engineering, Maharaja Surajmal Institute of Technology, New Delhi 110058, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCE.2018.2812518
1558-4127 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—This paper aims to design an efficient mixed serial five-stage pipeline processing hardware architecture of the deblocking filter (DBF) and sample adaptive offset (SAO) filter for a high efficiency video coding decoder. The proposed hardware is designed to increase the throughput and reduce the number of clock cycles by processing the pixels in a stream of 4 × 36 samples, in which edge filters are applied vertically in a parallel fashion for processing of luma/chroma samples. Subsequently, these filtered pixels are transposed and reprocessed through the vertical filter for horizontal filtering in a pipeline fashion. Finally, the filtered block is transposed back to the original orientation and forwarded to a three-stage pipeline SAO filter. The proposed architecture is implemented on field programmable gate array and application specific integrated circuit platforms using a 90-nm library. Experimental results illustrate that the proposed DBF and SAO architecture decreases the processing cycles (172) required for processing each 64 × 64 block, or large coding unit, compared with the state-of-the-art literature, with an increase of gate count (593.32K) including memory. The results show that the throughput of the proposed filter can successfully decode ultrahigh definition video sequences at 200 frames/s at 341 MHz.

Index Terms—Deblocking filter (DBF), field programmable gate array (FPGA), high efficiency video coding (HEVC), sample adaptive offset (SAO) filter.

I. INTRODUCTION

The Joint Collaborative Team on Video Coding, formed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group in 2010, has recently developed a new international video compression standard called high efficiency video coding (HEVC) [1], [2]. It was finalized in January 2013 and it aims to reduce the bit rate by 50% in comparison with the existing advanced video coding (AVC) or H.264 high profile standard, with the same visual quality [3]. Similar to the previous video coding standards, it is also based on a hybrid coding scheme which uses block-based prediction and transform coding [4].

In H.264/AVC, a picture is divided into fixed size macroblocks of 16 × 16 samples, but in HEVC, a picture is divided into coding tree units (CTUs) of 16 × 16, 32 × 32, or 64 × 64 samples. The CTU is further divided into smaller blocks using a quad-tree structure; such a block is called a coding unit (CU). These CUs can subsequently be split into prediction units (PUs), and also act as the root of the transform quad-tree. Each of the child nodes of the transform quad-tree defines a transform unit (TU). The size of the transforms used in the prediction error coding can vary from 4 × 4 to 32 × 32 samples, thus allowing transforms larger than in the paradigm of H.264/AVC.

In the coding scheme, block-based prediction and transform coding lead to discontinuities in the reconstructed signal at the block boundaries. Visible discontinuities at the block boundaries are known as blocking artifacts. A major source of blocking artifacts is the block-transform coding of the prediction error followed by coarse quantization. Moreover, in a motion-compensated prediction process, predictions for adjacent blocks in the current picture might not come from the adjacent blocks in the previously coded pictures, which can create discontinuities at the block boundaries of the prediction signal. Similarly, when applying intraprediction, the difference in the parameters of the prediction process for adjacent blocks causes discontinuities at the block boundaries of the prediction signal [2]. To reduce these blocking artifacts, two filter algorithms are used in HEVC, which are applied sequentially to the reconstructed picture. Collectively they are called the in-loop filter (LF) algorithms, namely the deblocking filter (DBF) and the sample adaptive offset (SAO) filter.

In coming years, consumer grade virtual reality (VR) headsets shall occupy a major portion of the entertainment market and dominate the gaming industry [5]. Since VR applicable devices such as mobile phones contain hardware video decoders that are tailored to resolutions used in a traditional video service like FHD or UHD, it is important to build hardware that fulfils the emerging VR requirements. These devices need 360° video services with a higher resolution, hence the traditional UHD leads to a major bottleneck in video streaming devices. The hardware systems designed to meet the emerging streaming demands need to exhibit high bandwidth, high throughput, and low power. Therefore, developing an efficient architecture on field programmable gate array (FPGA) and application specific integrated circuit (ASIC) is an important step before making commercial prototypes.

The HEVC DBF algorithm has a lesser computational complexity than that of H.264. It accounts for one-fifth of the computational complexity of an HEVC video decoder [6]. Therefore, developing a hardware architecture that makes use of pipeline, parallel processing, and memory reuse techniques plays a vital role in increasing the throughput. In this paper, authors have

studied the hardware architecture for combined DBF and SAO filter in HEVC.

II. RELATED WORKS

Several works have been reported on improving different aspects of throughput, area, and power. Dajiang et al. [3] proposed a highly parallel DBF architecture for H.264 to process one macroblock in 48 clock cycles and give real-time support to quad full high definition sequences at 60 frames/s at less than 100 MHz. The results when synthesized using a 130-nm process cost a gate count of 30.2K. Zhou et al. [7] proposed a high-throughput and multiparallel very large-scale integration (VLSI) hardware architecture of the DBF for HEVC. This architecture improves the performance at the expense of a slightly increased gate count as compared to the previously known architectures in HEVC. It supports an operating frequency of 278 MHz using a 90-nm library and the real-time requirement of the DBF for 8K × 4K video format is 123 frames/s.

Shen et al. [8] proposed a four-stage pipeline hardware architecture on a quarter-large CU (LCU) basis with a memory interlacing technique to increase the throughput, which can access the data in the process of both vertical and horizontal filtering efficiently. The design can support 4K × 2K at 30-frames/s applications at 28-MHz frequency. The same working group [9] proposed a hardware architecture of combined DBF and SAO that was designed for the HEVC intraencoder, and proposed a simplified bit rate estimation method of SAO that can be applied to both intracoding and intercoding. This design can support ultrahigh definition (UHD) applications (7680 × 4320) at 40 frames/s and 182-MHz working frequency. The total logic gate count is found to be 103.3K using a 65-nm library.

Zhu et al. [6] proposed an HEVC in-LF architecture composed of fully utilized DBF and SAO. Due to pipelining, the architecture achieved high throughput and a synthesized frequency of 240 MHz. Hence it can process 3.84 Gpixels/s and support (7680 × 4320) @120 frames/s decoding. The work of Ozcan et al. [10] is the first HEVC DBF hardware that uses two parallel data-paths in order to increase its performance. The results signify that the proposed hardware can decode full HD (1920 × 1080) at 30 frames/s. Srinivasarao et al. [11] proposed a new dual-standard DBF architecture, which supports both H.264/AVC and HEVC standards. The architecture takes 26 clock cycles for H.264/AVC and 14 cycles for HEVC to complete the filtering of a 16 × 16 pixel block. It occupies an area equivalent of 70.1K and the frequency of operation is 100 MHz.

Cheng et al. [12] proposed a memory ping-pong and interlacing VLSI architecture to prevent the DBF from unnecessarily waiting for pixels in both vertical and horizontal stages, which takes at most 435 cycles to process an LCU of 64 × 64 pixels. A four-stage pipeline with a prefilter was proposed to eliminate the data dependency in the filtering process. This design can support 8K × 4K@90 frames/s real-time applications with an operation frequency of 318 MHz at the cost of 62.9K gates. Diniz et al. [13] proposed a DBF. It consumes 1027 clock cycles to complete the filtering of one 64 × 64 block at 140 MHz. The ASIC implementation of the hardware architecture in 45-nm technology works at 200-MHz frequency and has a gate count of 3K. Furthermore, Hsu and Shen [14] implemented a six-stage two lines 64 × 64 block pipeline architecture with low latency and high processing throughput. The ASIC implementation in 90-nm technology works at 100 MHz and the gate count was found to be 466.5K. It consumes 768 clock cycles. Shukla et al. [15] and Peesapati et al. [18] designed an area efficient dataflow architecture of the SAO filter and a streaming DBF. The novelty of the proposed work is the use of one set of four edge filters for both horizontal and vertical filtering along with the SAO filter. In [18], two sets of four edge filters were used. This paper reduces the area compared to [18] at the cost of a small increase in processing clock cycles. This paper also proposes the use of multiple SAO filters in a pipeline approach along with the DBF. The proposed hardware architecture presents a novelty over previous architectures in terms of processing clock cycles with an increase in area. The proposed design gives high throughput for UHD video sequences.

The rest of this paper is organized as follows. Section II reviews the conventional DBF algorithms in HEVC. Section III explains the flow of the DBF algorithm, followed by the SAO filter algorithm in Section IV. In Section V, the proposed hardware design is introduced and the proposed architecture of the combined DBF and SAO filter is delineated. Section VI provides the experimental results and discussion by comparison with the existing literature. Section VII concludes this paper.

III. DEBLOCKING FILTER

The DBF is applied at the edges of all the 8 × 8 luma and chroma samples which are adjacent to a PU or TU boundary, with the exception when the DBF is disabled across a slice/tile or frame boundary. Both PU and TU boundaries are included because PU boundaries are not always aligned with the TU boundaries in some cases of interpicture predicted coding blocks (CBs). The syntax elements that control the DBF across the slice and tile boundaries are situated in the sequence parameter set, picture parameter set (PPS), and slice headers. In HEVC, the DBF is applied to the edges that are aligned on an 8 × 8 sample grid for both luma and chroma samples instead of a 4 × 4 sample grid basis as used in H.264/AVC. This restriction reduces the worst case computational complexity without noticeable degradation of the image visual quality. It also helps in improving parallel processing operation by preventing cascading interactions between nearby filtering operations. DBF operation can be broadly analyzed in three stages.
1) Boundary strength (BS) calculation on filter edge.
2) Filtering decision.
3) Filtering (vertical/horizontal) operation.

A. Boundary Strength Determination

As in the H.264/AVC scheme, the strength of the DBF in HEVC is controlled by the values of several syntax elements. However, only three out of the five filter strengths are used [2], [16]. For example, as shown in Fig. 1, given that P and Q are two adjacent blocks with a common 8 × 8 grid boundary, the filter strength of “2” is assigned when one of the blocks is predicted using intrapicture prediction. Otherwise, the filter strength of “1” is assigned.
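The boundary-strength rule can be illustrated with a minimal Python sketch. This is not the authors' hardware logic: the `BlockInfo` fields and the `boundary_strength` function are hypothetical names introduced here, and the conditions for strength 1 (nonzero transform coefficients at a transform edge, a different reference picture, or a motion vector difference of at least one integer sample) follow a simplified reading of the HEVC standard [1] that assumes uni-prediction with one motion vector per block. Edges for which none of these conditions hold receive strength 0.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BlockInfo:
    """Hypothetical per-block summary; field names are illustrative only."""
    is_intra: bool
    has_nonzero_coeffs: bool = False
    ref_pic: Optional[int] = None   # single reference picture (uni-prediction assumed)
    mv: Tuple[int, int] = (0, 0)    # motion vector in quarter-sample units


def boundary_strength(p: BlockInfo, q: BlockInfo, is_transform_edge: bool) -> int:
    """Return a boundary strength of 0, 1, or 2 for the edge between P and Q."""
    # Strength 2: at least one of the two blocks is intra-predicted.
    if p.is_intra or q.is_intra:
        return 2
    # Strength 1: transform-block edge where either side has nonzero coefficients.
    if is_transform_edge and (p.has_nonzero_coeffs or q.has_nonzero_coeffs):
        return 1
    # Strength 1: the two blocks reference different pictures.
    if p.ref_pic != q.ref_pic:
        return 1
    # Strength 1: motion vectors differ by at least one integer sample
    # (4 quarter-sample units) horizontally or vertically.
    if abs(p.mv[0] - q.mv[0]) >= 4 or abs(p.mv[1] - q.mv[1]) >= 4:
        return 1
    # Strength 0: none of the conditions hold, so the edge is left unfiltered.
    return 0
```

Bi-prediction, which the standard handles by comparing both motion vector pairings, is deliberately left out to keep the sketch short.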
BALDEV et al.: DESIGN AND IMPLEMENTATION OF EFFICIENT STREAMING DEBLOCKING AND SAO FILTER 129

The filter strength of “0” is assigned if none of the conditions are met as per the standard [1]. Fig. 1 depicts an example of the filtering decision for vertical edge pixel samples. According to the filter strength and the average quantization parameter of blocks P and Q, two thresholds tc and β are determined from the predefined tables. As per the standard, no filtering, strong filtering, or weak filtering is chosen based on the value of β for luma samples. For chroma samples, there are only two cases: 1) no filtering and 2) normal filtering. When the filter strength is greater than 1, normal filtering is applied.

Fig. 1. DBF decision on/off and weak/strong filtering.

B. Filtering Decision

The filtering decision is included in HEVC to reduce the computational complexity by sharing this decision across four luma rows or columns, using the first and the last rows or columns. The filter on/off decision determines whether the filter is applied on 4 × 8 pixel samples for vertical filtering and 8 × 4 pixel samples for horizontal filtering, as shown in Fig. 1. The filtering process is then performed using the control variables tc and β. From Fig. 1, the vertical filtering decision can be made by considering rows 1 and 4 of each pixel block, and the horizontal filtering decision can be made by considering columns 1 and 4 of the block.

C. Filtering

In HEVC, initially the horizontal filtering for vertical edges of the entire frame is performed, which is followed by the filtering of horizontal edges. This specific order enables either multiple horizontal filtering or vertical filtering processes to be applied in parallel. Moreover, instead of a parallel process, filtering can be implemented on a coding tree block (CTB) basis. Fig. 1 clearly shows that if strong filtering is applied on pixels, only the six pixels in rows 1–4 (shown in red) will be modified. Subsequently, if weak filtering is applied, only four pixels of rows 5–8 (shown in blue) are modified.

IV. SAMPLE ADAPTIVE OFFSET FILTER

The SAO filter is based on the concept of monotonicity in a set of classified pixels. The concept of SAO is to reduce the mean sample distortion of a region by improving the monotonic nature of a group of classified pixels. It achieves this purpose by first classifying the region samples into multiple categories with a selected classifier. It then calculates the offset for each sample of the category. These offsets are the set of values that report the mean distortion between the deblocked pixel values and the post-SAO pixel values [17]. Two algorithms are defined for classifying a pixel into multiple categories, namely edge offset (EO) and band offset (BO).

A. Edge Offset

EO uses four 1-D directional patterns for sample classification: 0° horizontal, 90° vertical, 135° diagonal, and 45° diagonal. The EO algorithm deploys the adaptive nature of the SAO. According to these patterns, four EO classes are decided, and a fifth one which describes the monotonicity of the pixel group. On the encoder side only one EO class can be selected for each CTB that enables EO. Based on the rate-distortion optimization, the best EO class is sent into the bit stream. For a given EO class, each sample inside the CTB is classified into one of the five categories [17]. If the current pixel has a lower value when compared to either or both of the neighbors, the positive offsets for these categories result in smoothing, since the local valleys rise upward, thus leveling with the neighboring pixels, while negative offsets for these categories can only increase the discontinuity. On the other hand, for categories where the current pixel is greater than either or both of the neighboring pixels, the negative offsets result in smoothing while positive offsets increase the discontinuity. A particular EO class is constant for the entire 64 × 64 CTU.

B. Band Offset

In BO, one offset is added to all samples of the same band. The sample value range is equally divided into 32 bands. For 8-bit samples ranging from 0 to 255, the width of a band is 8, and for a particular band n the sample values range from 8n to 8n+7. The mean distortion between the original samples and the reconstructed samples in a band is signaled to the decoder. There is no constraint on the signs of the offsets. Out of the 32 equally spaced bands, a starting band and four consecutive bands from the starting band are specified and signaled to the HEVC decoder [9].

V. PROPOSED DEBLOCKING FILTER AND SAO FILTER

A. Deblocking Filter

The combined DBF and SAO filter architecture is designed using a combination of pipeline, parallel, and sequential processing. This section describes the proposed DBF and SAO architecture. The flowchart shown in Fig. 3 describes the control flow of this architecture. Initially, MUX-1 selects one of the two process blocks according to the selection line; next, the related boundary operations are performed, which go through a filter flag check. Based on the result of the filter flag, strong or weak filtering is performed. The output of the filtering process is loaded into a transpose buffer, followed by a subsequent count check. This process is repeated for nine pixel blocks. Finally, the selection lines of DEMUX-2 are checked and the 4 × 36 block is fed into the SAO filter, while simultaneously a new set of nine pixel blocks is processed through the DBF. In the proposed architecture block diagram, during stage 1, MUX-1 shown in Fig. 2 selects one of the two operations according to the selection lines. If the selection line is 0, it

Fig. 2. Block diagram of proposed DBF and SAO hardware architecture.

loads the input buffer of 4 × 36 into the input samples of nine 4 × 4 pixels; however, if the selection line is 1, it loads the filtered buffer of 4 × 36 samples into the input samples applied to the four edge filters. Edge filtering operations are applied on the MUX-1 output according to the BSs, and parallel decision making is performed on each of the above nine input samples of 4 × 4. In stage 2, the filtered samples are loaded into the transpose buffer and the transpose operation is applied on the 4 × 36 buffer. All the remaining eight 4 × 36 buffers are processed in a pipeline fashion for completely filling up the 36 × 36 block, as shown in Fig. 2.

In stage 3, DEMUX-2 selects one of two operations according to the selection lines. If the selection line is 0, it loads the nine transposed buffers of 4 × 36 into the output buffer of 36 × 36 (in the form of nine 4 × 36 blocks). However, if the selection line is 1, it loads the transposed buffer of 36 × 36 into the input buffer of 4 × 36 in a pipeline fashion using a 9-to-1 multiplexer logic, i.e., MUX-3. The proposed architecture is further illustrated in detail in the following sections by describing the various units involved in this architecture along with the SAO hardware. In order to avoid the dependency in processing the corner pixels of an 8 × 8 block, the proposed work processes additional edge pixels in terms of 4 × 36 instead of 4 × 32. However, only a set of 4 × 32 pixels is selected for the four edge filters. The rightmost additional edge is used when the next set of 4 × 36 is processed for filtering.

1) Edge Filter Unit: Initially, the control logic unit sets the MUX-1 selection line to 0 to load the 4 × 36 block input buffer, in which the block gets partitioned into nine blocks, each of size 4 × 4. Each 4 × 4 block contains 16 pixels, with each pixel of 8-bit size. Thus, each 4 × 4 block is represented as a 128-bit buffer. These nine 128-bit buffers load the input samples in a parallel fashion. In the next step, the decision making block performs the filtering operation decisions like BS calculation and filter skip flag. Edge filter parameters like β and tc are calculated based on the filter weak/strong decision and are applied as input parameters along with the input samples. Next, the filtered pixels are fed into the transpose block. This transpose stage and the 4 × 36 edge filter stage form two pipeline stages. The nine parallel 4 × 4 pixel blocks are sent to the transpose unit while a new set of parallel 4 × 4 pixel blocks is stored in the 128-bit buffers.

2) Transpose Unit: In this unit, the filtered block of nine 4 × 4 from the edge filter buffer is transposed into the 36 × 4 block and loaded into the transpose buffer. The pipeline relationship between the edge filter block and the transpose buffer has already been exhibited in the edge filter section. Once the 36 × 4 block is loaded, the control unit sends the appropriate selection signal to DEMUX-2. If the selection signal is 0, the 36 × 4 block is sent back to the edge filter via MUX-3 for horizontal filtering. Since the original pixel block has been transposed, this horizontal filtering shall represent vertical filtering. Therefore, this architecture reuses the single horizontal filter block for both horizontal and vertical filtering. Once the 36 × 4 block is filtered, transposed back to 4 × 36, and reloaded into the transpose buffer, the DEMUX-2 selection line toggles to 1 and the 4 × 36 block is transmitted to the DBF output buffer for SAO filtering. This procedure is repeated for all nine 4 × 36 blocks as regulated by the selection logic from the control unit fed into the 9-to-1 MUX-3, so that the transpose buffer and the output buffer are loaded for the complete CU (36 × 36). The transpose buffer and the output buffer both exhibit a two-stage pipeline for each of the nine 4 × 36 blocks: once the horizontally and vertically filtered pixels are sent to

Fig. 3. Flowchart of the proposed DBF architecture.

the output buffer, a new set of pixels from the edge filter is simultaneously loaded into the transpose buffer, as shown in Figs. 2 and 3. This arrangement is also in pipeline with the edge filtering stage. The edge filter, transpose buffer, and output buffer together constitute a three-stage pipeline in the DBF, as shown in Fig. 2.
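The reuse of one edge filter for both directions through transposition can be illustrated with a small Python sketch. It is a behavioural toy under stated assumptions, not the proposed hardware: the block is 8 × 8 rather than a 4 × 36 stream, `toy_smooth` is a hypothetical stand-in for the HEVC edge filter, and all function names are introduced here for illustration. The sketch shows that filtering, transposing, filtering again with the same routine, and transposing back gives the same result as using a separate filter for the second direction.

```python
def transpose(block):
    """Swap rows and columns of a 2-D list of samples."""
    return [list(col) for col in zip(*block)]


def toy_smooth(p, q):
    """Toy stand-in for an edge filter: pull the two edge samples together."""
    d = (q - p) // 4
    return p + d, q - d


def filter_vertical_edges(block, edge_cols, smooth):
    """Filter across each vertical edge (between columns c-1 and c), row by row."""
    out = [row[:] for row in block]
    for row in out:
        for c in edge_cols:
            row[c - 1], row[c] = smooth(row[c - 1], row[c])
    return out


def filter_horizontal_edges(block, edge_rows, smooth):
    """Reference path: a separate filter across horizontal edges (rows r-1 and r)."""
    out = [row[:] for row in block]
    for r in edge_rows:
        for x in range(len(out[0])):
            out[r - 1][x], out[r][x] = smooth(out[r - 1][x], out[r][x])
    return out


def deblock_with_one_filter(block, edges, smooth):
    """Filter, transpose, filter again with the same routine, transpose back."""
    first_pass = filter_vertical_edges(block, edges, smooth)
    second_pass = filter_vertical_edges(transpose(first_pass), edges, smooth)
    return transpose(second_pass)


# Example: an 8 x 8 block with one edge on the 4-sample grid line.
block = [[(3 * r + 7 * c) % 50 for c in range(8)] for r in range(8)]
single_filter_result = deblock_with_one_filter(block, [4], toy_smooth)
two_filter_result = filter_horizontal_edges(
    filter_vertical_edges(block, [4], toy_smooth), [4], toy_smooth
)
```

In this toy setting the two results are identical, which is why a single filter plus a transpose buffer can stand in for two direction-specific filters, at the price of the extra transpose passes.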
3) Control Unit: In the proposed architecture, the control logic unit incorporates a finite state machine. At the reset condition, the MUX-1 and DEMUX-2 selection pins are set to the “00” state. In the next state “01,” the MUX-1 selection pin is set as 0 and the DEMUX-2 selection pin is set as 1. After setting the MUX-1 pin as 0, the input buffer 4 × 36 block of pixels loads into the four edge filters (128-bit size each), and by setting the DEMUX-2 pin as 1, the buffer of 4 × 36 loads from the transpose output buffer of the 36 × 36 block in a nine-stage pipeline manner. In the next state “10,” the MUX-1 selection pin is set as 1 and the buffer of the 4 × 36 block of pixels loads the samples into the nine input buffers, each of 128-bit size. The DEMUX-2 selection pin set as 0 implies that the transpose output buffer of 36 × 36 loads into the deblock buffer of the 4 × 36 block. Next, the MUX-1 selection pin is set as 0 and the above procedure is repeated.

B. Sample Adaptive Offset Filter

The proposed SAO filter architecture is quite similar to the DBF. The 36 × 36 CU, which was obtained after expanding the 32 × 32 CU to accommodate boundary conditions in the filtering procedures, is processed in the form of nine 4 × 36 CBs. These nine 4 × 36 CBs are analogous to the ones used in the DBF architecture, and hence undergo parallel processing throughout the hardware. The SAO filter dataflow is evident from the flowchart shown in Fig. 4. The input samples are read as a 4 × 36 block and then processed via 72-bit buffers. Each 72-bit buffer is individually processed and filtered through the appropriate filtering block based on the SAO parameter. Once the filtering operation is done for 9 pixels, a new set of pixels is read from the next 72-bit buffer. This process is repeated until the 4 × 36 block is filtered. The proposed SAO can be analyzed by considering the following stages.

Fig. 4. Flowchart of proposed SAO filter architecture.

1) Control Unit for SAO: The control unit consists of a control logic block as shown in Fig. 2. This block runs on a clock controlled, sequential, Moore-based finite state machine. The control logic block regulates the data flow in the entire SAO filter hardware. The primary function of this block is to access 9 pixels from the nine 4 × 36 blocks in parallel. This is done by passing the appropriate address values via an address pointer

block. The address pointer block consists of an address increment logic which ticks for the appropriate state conditions. The address pointer block also passes the address values for one pixel from each of the nine 4 × 36 blocks, thus processing nine 8-bit pixels through a data bus. Hence, a 72-bit data bus is processed in a single clock cycle, which exhibits the parallel processing of the proposed hardware.

Second, the control logic block initiates the pipeline processing by using parallel pipeline buffers. In the pipeline procedure, as evident from Fig. 2, sixteen 72-bit data registers store 4 × 36 pixels from the deblock buffer in the form of 16 × 9 samples (each 72-bit buffer consists of 9 pixels); therefore the hardware deploys sixteen 72-bit buffers for the pipeline. Each 72-bit buffer is processed as follows: the control logic processes a set of 9 pixels, i.e., 72 bits, in a single clock cycle for filtering and simultaneously stores the next set of 72-bit pixels in the corresponding 72-bit buffer. Thus the SAO unit also exhibits a single-stage internal parallel pipeline for 9 pixels. The 72-bit buffer sets are processed sequentially and the timing is maintained in such a manner that by the time the current 4 × 36 block has been filtered and mapped into the 4 × 32 CU, the new 4 × 36 block is ready for processing from the DBF. Thus, the SAO filter maintains synchronization of pixel blocks with the DBF for streaming processing. Also, Fig. 2 exhibits the pipeline relation between the SAO filter and the final output buffer which stores the 4 × 32 pixel block. Once the 4 × 36 block has been filtered and mapped, it is stored in the output buffer while simultaneously a new 4 × 36 block arrives. This process is repeated for all nine sets of pixel blocks, which get mapped and completely loaded into the output buffer, thus constituting a 32 × 32 filtered CU.

2) SAO Filtering Stage: The control logic block sends the offset values to the different filtering blocks and deploys a multiplexer to select one of the three filtering blocks using the SAO type as the selection lines. These are obtained from the SAO syntax header as signaled from the encoder. The filtering stage, as the name suggests, is the part of the architecture where the 9 pixels obtained from the pipeline buffers are filtered using the offset values. The filtering stage consists of the following filtering blocks.

a) Horizontal filtering: The horizontal filtering block, shown as H-Filter in Fig. 2, is brought into an active state by the enable signal, which is driven by a DEMUX based on the “SAO type” which acts as the selection input for the selection MUX. This block corresponds to SAO type-0, which states that the SAO filtering belongs to the EO class 0 category as per the standard [1], and the pixels will be compared with their horizontally adjacent neighbors. In the horizontal filtering block, the pixels obtained from the nine 8-bit pipeline buffers are divided into the sets of pixels 0–5 and 6–8. Since the pixels are compared with their immediate neighbors, the boundary pixels, i.e., pixels 0 and 8, need to be evaluated in multiple states. For the filtering of pixel 0, unless the boundary condition is true, pixel 8 from the previous set is required, and similarly for the filtering of pixel 8, pixel 0 from the next set is essential. However, if the boundary condition is satisfied, pixel 0 remains unfiltered. Thus, in one state or in one clock cycle, 7 pixels are processed, which results in six filtered pixels, while pixels 6 and 7 are filtered in the next cycle. Pixel 8 shall be filtered when the next pixel set is transmitted from the 72-bit buffer. Storage buffers are used to preserve the essential pixel values during state transitions; the buffer also stores the offset values for this filtering block. To summarize, the horizontal filtering block filters 8 pixels per clock cycle.

b) Vertical filtering: The vertical filtering block, shown as V-Filter in Fig. 2, is brought into an active state by the enable signal driven by the DEMUX on the SAO side, which chooses the enable signal based on the selection input driven by the control logic. Vertical filtering involves either 45°, 90°, or 135° filtering. Thus, this block covers the final three categories in the EO filtering. The novel aspect of this architecture is that it deploys the same state and then reuses the same hardware for all three categories in the vertical filtering. Thus, the vertical filtering block is versatile for any sort of angular comparison of adjacent pixels. The parameter that varies for the said three categories is the number of pixels filtered per cycle. In the case of the 90° category, the proposed architecture processes 27 pixels in three clock cycles and filters 9 pixels of the middle row after comparison with the upper and lower rows. Thus 9 pixels are filtered in three clock cycles. For the 45° and 135° categories, which consider the adjacent diagonal pixels, 27 pixels are processed in three clock cycles and 8 pixels are filtered in three clock cycles. Pixel 8 of each set is filtered after processing the next corresponding pixel set. A buffer is utilized to store the offset values and to preserve pixel values in state transitions. Two comparators are used to reuse the buffer in storing the pixels from the upper and lower rows.

c) Band offset filtering: The BO filtering block, shown in Fig. 2 as BO, is brought into an active state by the enable signal driven by the DEMUX, which chooses the enable signal based on the selection input. The selection input sends the BO signal as high and the BO block is selected. This filtering block processes 9 pixels in one clock cycle. Pixel preservation is not required in this block; therefore, no buffer is utilized. As mentioned in the BO part of Section IV, the offsets for pixels are signaled based on the bands these pixels lie in. Therefore, comparators and adders are used to check the described bands and add the required offsets.

To summarize the sequence of pipeline stages in Fig. 2, in stage 1, a 4 × 36 block is fed into the edge filter where the horizontal filtering operation is carried out as explained in Section V-B2a. Once a 4 × 36 block is filtered, it is sent to a transpose buffer where the block is changed to 36 × 4. This 36 × 4 block is sent to a buffer based on the logic of DEMUX-2, and these processes correspond to stages 1 and 2. Finally, in stage 3, the 36 × 4 block is sent back to the edge filter in stage 1 based on the selection logic of MUX-3 and the whole sequence of stages 1–3 is repeated. Thus, as evident from Fig. 2, pipeline stages 1–3 are operated twice for one 4 × 36 block to perform both horizontal and vertical filtering. After stage 3, the 4 × 36 block is sent to stage 4 (SAO) and finally to stage 5 (SAO filter output), while simultaneously each of the pipeline stages accepts the next set of inputs. Therefore, this stage 1–5 pipeline processing is executed for nine 4 × 36 blocks, which finally results in a fully filtered 32 × 32 CU in stage 5.

VI. RESULTS AND DISCUSSION

The proposed DBF and SAO architecture is implemented in Verilog-HDL at RTL level, ASIC and FPGA platforms are
BALDEV et al.: DESIGN AND IMPLEMENTATION OF EFFICIENT STREAMING DEBLOCKING AND SAO FILTER 133

TABLE I
D IFFERENT CU S OF DBF D EVICE U TILIZATION IN FPGA

TABLE II
D EVICE U TILIZATION AND L ATENCY OF C OMBINED DBF AND SAO IN FPGA

TABLE III
D IFFERENT CU P ROCESSING OF DBF P OWER C ONSUMPTION AND A REA R EPORT

used to test the proposed hardware. Behavioral and post-layout simulations with different test sequences have been carried out at 125 MHz. The simulation results match those of the HM 10.0 HEVC software [1]. The proposed design has been implemented on an FPGA platform, and its results are given in Table I. It can be observed that the synthesized frequency is almost the same across the DBF processing modules for different CU sizes. Similarly, the resource utilization and processing clock cycles of the DBF processing modules for different CU sizes are also reported in Table I. The 12 × 12 module consumes a smaller amount of resources in terms of Slice registers (3%) and consumes 1024 clock cycles per LCU. However, the 68 × 68 block consumes 62% of the Slice registers and 50 clock cycles. In order to find a tradeoff between speed and area, a 36 × 36 block was selected, which utilizes 20% of resources and 136 clock cycles. Further, after integration, the proposed DBF and SAO critical path is reduced to 3.437 ns. The SAO integration with DBF increases the resource utilization of the FPGA by 4% of Slice registers and 3% of Slice LUTs. The SAO filter hardware alone consumes a smaller amount of resources, but owing to pipeline buffers and control logic the resource utilization is increased by a small percentage. In addition, the total number of clock cycles consumed by the 36 × 36 CU hardware is increased to 172 per LCU, as shown in Table II.

Different DBF CU processing hardware were further implemented on an ASIC platform using a 90-nm library. Various parameters including critical path delay, gate count, and power consumption for the implemented hardware are given in Table III. The maximum frequency (critical path delay) for the 12 × 12 DBF processing module was found to be 546.44 MHz, which is higher than that of the other CU processing hardware. The gate count is given in terms of NAND 2 × 1 gates. In the case of 12 × 12 CU DBF processing, the gate count is 77.44K. Similarly, the 36 × 36 CU consumes a gate count of 503.58K, which includes logic as well as memory. A total power of 339 mW is consumed by the 36 × 36 DBF hardware. Critical path delay and gate count for DBF, SAO, and combined DBF and SAO are given in Table IV. The gate count and total power consumed for the combined DBF and SAO are found to be 593.32K and 385 mW, respectively. The layout of the proposed 36 × 36 DBF with SAO after place and route and sign-off is shown in Fig. 5. A larger amount of area is occupied by the DBF, buffers, and transpose, whereas SAO occupies 56.05K gates. The operating frequency of the proposed hardware can be calculated for the desired resolution as per [19] using (1), where W is the frame width, H is the frame height, and Fps is the number of frames per second. Format is set to 1 for the 4:2:0 YUV format and 2 for the 4:4:4 YUV format

$$\text{Frequency} = W \times H \times \text{Format} \times \text{Fps} \times \frac{\text{clock cycles}}{\text{LCU}}. \tag{1}$$
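As a rough illustration of how (1) can be evaluated, the following Python sketch computes an operating frequency from the frame size, format factor, frame rate, and cycle count. The function name `required_frequency_hz` and the parameter `lcu_samples` are hypothetical. In particular, reading the LCU term in the denominator as the number of luma samples in one 64 × 64 LCU (4096) is an assumption that the text does not state, so the output should not be taken as reproducing the operating point reported in this paper. The 1920 × 1080, 30 frames/s input is likewise only an illustrative choice.

```python
def required_frequency_hz(width, height, fps, fmt, cycles_per_lcu, lcu_samples):
    """Evaluate Eq. (1): W * H * Format * Fps * (clock cycles / LCU).

    ASSUMPTION: `lcu_samples` is the number of luma samples per LCU;
    the paper does not spell out the unit of the LCU term.
    """
    if lcu_samples <= 0:
        raise ValueError("lcu_samples must be positive")
    return width * height * fmt * fps * cycles_per_lcu / lcu_samples


if __name__ == "__main__":
    # Illustrative input only: 1080p at 30 frames/s, Format = 1 (4:2:0),
    # 172 cycles per LCU (Table II), assumed 64 x 64 = 4096 samples per LCU.
    freq = required_frequency_hz(1920, 1080, 30, 1, 172, 64 * 64)
    print(f"{freq / 1e6:.3f} MHz")
```

Because the frequency scales linearly with every factor in (1), doubling the frame rate or switching Format from 1 to 2 doubles the result under this reading.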

TABLE IV
COMBINED DBF AND SAO FILTER POWER AND AREA REPORT

TABLE V
COMPARISON OF THIS PAPER WITH LITERATURE
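The timing figures quoted in this section can be related with two simple conversions: a critical path delay bounds the clock frequency, and a cycle count at a given clock gives the time per LCU. The sketch below is a minimal illustration under the idealized assumption that the maximum clock frequency is the reciprocal of the critical path delay (ignoring clock skew, jitter, and other margins). The helper names are hypothetical; the input values are the ones quoted in the text (3.437 ns and 2.93 ns critical path delays, 136 and 172 cycles per LCU, 125 MHz clock).

```python
def max_frequency_mhz(critical_path_ns: float) -> float:
    """Idealized f_max in MHz: reciprocal of the critical path delay in ns."""
    if critical_path_ns <= 0:
        raise ValueError("critical_path_ns must be positive")
    return 1000.0 / critical_path_ns


def time_per_lcu_us(cycles_per_lcu: int, clock_mhz: float) -> float:
    """Processing time per LCU in microseconds at the given clock."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return cycles_per_lcu / clock_mhz


if __name__ == "__main__":
    for delay_ns in (3.437, 2.93):
        print(f"{delay_ns} ns -> {max_frequency_mhz(delay_ns):.1f} MHz")
    for cycles in (136, 172):
        print(f"{cycles} cycles at 125 MHz -> {time_per_lcu_us(cycles, 125.0):.3f} us")
```

Under this idealization, a 2.93 ns critical path corresponds to roughly 341 MHz, and 172 cycles at 125 MHz take about 1.38 µs per LCU.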

Fig. 5. Layout of proposed 32 × 32 block architecture.

In order to support the required specification of 200 frames/s for UHD (8K × 4K) resolution, an operating frequency of 125 MHz is required. However, the proposed hardware can work even faster than this owing to its lower critical path delay of 2.93 ns. The proposed hardware was tested with an operating frequency of 125 MHz on both FPGA and ASIC platforms. It is primarily suitable for high-definition to UHD video decoders with high bandwidth requirements (GB/s). It decodes frames that contain a minimum CU size of 32 × 32. It achieves higher throughput owing to a smaller number of processing cycles per LCU.

Table V compares the performance of the proposed architecture with the state of the art in the literature in terms of hardware cost, speed, gate count, and memory. As the works reported in the literature are implemented using different process technologies and different platforms, comparing them with the proposed work does not give an accurate estimate. However, the comparison of the proposed work and the literature was based on processing clock cycles, which are independent of process technology and platforms. The proposed architecture is faster than the other approaches in terms of bandwidth, processing clock cycles, and frequency of operation, albeit with a higher resource utilization. Zhu et al. [6] consumed 256 clock cycles per LCU, which is higher than that of the proposed work. The gate count of their combined DBF and SAO is 67.7K, which is low compared to the proposed work, and the maximum frequency is also low compared to the proposed work. Shen et al. [8] implemented a DBF targeted at an ASIC platform. The number of processing clock cycles (110) is higher and the gate count is less (75K) compared to the proposed hardware. However, the synthesis frequency is also lower than that of the proposed work. In [9], a gate count of 103.3K + 4.8 kB (including memory) is used, whereas the proposed work consumes 593.32K including memory. Similarly, in [10], the number of processing cycles is higher while the gate count and synthesis frequency are lower than those of the proposed DBF. In [12], the number of processing clock cycles per LCU (435) is higher than in the proposed work and the synthesis frequency is comparable. The gate count is 62.9K and an on-chip SRAM of 53.2K is consumed. As compared to [18], the proposed work is a combined DBF and SAO and consumes a higher number of processing clock cycles per LCU (i.e., 172), and lesser bandwidth (8.7 GB/s) and gate count. To summarize, the proposed design shows a tradeoff between processing capability and hardware cost. It showcases its novelty by exhibiting a higher synthesis frequency due to a lower critical path delay and achieves a higher throughput than the state of the art owing to the minimal processing cycles as compared to the literature. The relative hardware cost is higher due to the utilization of parallel edge filters.

VII. CONCLUSION
This paper presents an efficient streamed processing hardware for DBF and SAO of the HEVC decoder. The work

investigated the usage of the same set of parallel filters in both horizontal and vertical filtering while processing various CU sizes. Results show a tradeoff between processing capability and hardware cost. An optimal 36 × 36 CU processing hardware is chosen for integration of SAO. The combined DBF and SAO hardware is implemented and tested at 125 MHz on FPGA and ASIC platforms. It processes a 64 × 64 block in 172 clock cycles with a gate count of 593.32K and a bandwidth of 8.73 GB/s. Experimental results illustrate that the combined DBF and SAO filter architecture is the best choice in terms of the clock cycle count and bandwidth, with an increased gate count, compared to other architectures known so far. With these design approaches, the proposed design can be used in real-time VR gaming applications where it decodes 8K UHD at 200 frames/s. Future work involves the design of an energy-efficient DBF and SAO hardware.

REFERENCES
[1] ITU-T and ISO/IEC, High Efficiency Video Coding, document H.265 and ISO/IEC 23008-2:2013, ITU-T Recommendation, Geneva, Switzerland, 2013.
[2] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] Z. Dajiang, Z. Jinjia, Z. Jiayi, and S. Goto, “A 48 cycles/MB H.264/AVC deblocking filter architecture for ultra high definition applications,” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E92-A, no. 12, pp. 3203–3210, 2009.
[4] V. Sze, M. Budagavi, and G. J. Sullivan, “In-loop filters in HEVC,” in Integrated Circuit and Systems, Algorithms and Architectures. Cham, Switzerland: Springer, 2014, pp. 171–185.
[5] R. Skupin, Y. Sanchez, C. Hellge, and T. Schierl, “Tile based HEVC video for head mounted displays,” in Proc. Int. Symp. Multimedia (ISM), San Jose, CA, USA, Dec. 2016, pp. 399–400.
[6] J. Zhu, D. Zhou, G. He, and S. Goto, “A combined SAO and de-blocking filter architecture for HEVC video decoder,” in Proc. 20th IEEE Int. Conf. Image Process. (ICIP), Melbourne, VIC, Australia, Sep. 2013, pp. 1967–1971.
[7] W. Zhou, J. Zhang, X. Zhou, Z. Liu, and X. Liu, “A high-throughput and multi-parallel VLSI architecture for HEVC deblocking filter,” IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1034–1047, Jun. 2016.
[8] W. Shen, Q. Shang, S. Shen, Y. Fan, and X. Zeng, “A high-throughput VLSI architecture for deblocking filter in HEVC,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Beijing, China, 2013, pp. 673–676.
[9] W. Shen et al., “A combined deblocking filter and SAO hardware architecture for HEVC,” IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1022–1033, Jun. 2016.
[10] E. Ozcan, Y. Adibelli, and I. Hamzaoglu, “A high performance deblocking filter hardware for high efficiency video coding,” IEEE Trans. Consum. Electron., vol. 59, no. 3, pp. 714–720, Aug. 2013.
[11] B. K. N. Srinivasarao, I. Chakrabarti, and M. N. Ahmad, “High-speed low-power very-large-scale integration architecture for dual-standard deblocking filter,” IET Circuits Devices Syst., vol. 9, no. 5, pp. 377–383, Sep. 2015.
[12] W. Cheng, Y. Fan, Y. Lu, Y. Jin, and X. Zeng, “A high-throughput HEVC deblocking filter VLSI architecture for 8k×4k application,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Lisbon, Portugal, May 2015, pp. 605–608.
[13] C. M. Diniz, M. Shafique, F. V. Dalcin, S. Bampi, and J. Henkel, “A deblocking filter hardware architecture for the high efficiency video coding standard,” in Proc. Design Autom. Test Europe Conf. Exhibit. (DATE), Grenoble, France, Mar. 2015, pp. 1509–1514.
[14] P.-K. Hsu and C.-A. Shen, “The VLSI architecture of a highly efficient deblocking filter for HEVC systems,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 5, pp. 1091–1103, May 2017.
[15] K. Shukla, S. Baldev, and P. Rangababu, “Area efficient dataflow hardware design of SAO filter for HEVC,” in Proc. IEEE Int. Conf. Innov. Electron. Signal Process. Commun. (IESC), Shillong, India, Apr. 2017, pp. 16–21.
[16] W.-Y. Wei, Deblocking Algorithms in Video and Image Compression Coding, Nat. Taiwan Univ., Taipei, Taiwan, 2009.
[17] C.-M. Fu et al., “Sample adaptive offset in the HEVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1755–1764, Dec. 2012.
[18] R. Peesapati, S. Das, S. Baldev, and S. R. Ahamed, “Design of streaming deblocking filter for HEVC decoder,” IEEE Trans. Consum. Electron., vol. 63, no. 3, pp. 1–9, Aug. 2017.
[19] S. Shen, W. Shen, Y. Fan, and X. Zeng, “A pipelined VLSI architecture for sample adaptive offset (SAO) filter and deblocking filter of HEVC,” IEICE Electron. Exp., vol. 10, no. 11, 2013, Art. no. 20130272.

Swamy Baldev is currently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong, India.
His current research interests include video processing architectures for H.264 and high efficiency video coding standards.

Kaustubh Shukla is currently pursuing the Bachelor of Technology degree with the Department of Electronics and Communication Engineering, Maharaja Surajmal Institute of Technology, New Delhi, India, affiliated to Guru Gobind Singh Indraprastha University, New Delhi.
His current research interest includes very large-scale integration design.

Sushanta Gogoi is currently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong, India.
His current research interest includes video processing architectures.

Pradeep Kumar Rathore received the Ph.D. degree from the Indian Institute of Technology Delhi, New Delhi, India, in 2015.
He is currently an Assistant Professor with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong, India. His current research interests include design and development of CMOS and integrated MEMS-based devices.

Rangababu Peesapati received the Ph.D. degree from the University of Hyderabad, Hyderabad, India, in 2014.
He is currently an Assistant Professor with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong, India. His current research interest includes reconfigurable systems for multimedia.
