Parallel HEVC Decoding on Multi- and Many-core
Architectures
A Power and Performance Analysis
Chi Ching Chi · Mauricio Alvarez-Mesa · Jan Lucas · Ben Juurlink ·
Thomas Schierl
Chi Ching Chi, Mauricio Alvarez-Mesa, Jan Lucas and Ben Juurlink
Technische Universität Berlin, Sekretariat EN 12, Einsteinufer 17, 10587 Berlin, Germany
Tel.: +49.30.314-73130, Fax: +49.30.314-22943
E-mail: {chi.c.chi,mauricio.alvarezmesa,j.lucas,b.juurlink}@tu-berlin.de

Mauricio Alvarez-Mesa and Thomas Schierl
Fraunhofer-Institute for Telecommunications, Heinrich-Hertz-Institut, Einsteinufer 37, 10587 Berlin, Germany
Tel.: +49 30 31002-227, Fax: +49 30 31002-190
E-mail: [email protected]

Abstract The Joint Collaborative Team on Video Coding is developing a new standard named High Efficiency Video Coding (HEVC) that aims at reducing the bitrate of H.264/AVC by another 50%. In order to fulfill the computational demands of the new standard, in particular for high resolutions and at low power budgets, exploiting parallelism is no longer an option but a requirement. Therefore, HEVC includes several coding tools that allow each picture to be divided into several partitions that can be processed in parallel, without degrading quality or bitrate. In this paper we adapt one of these approaches, the Wavefront Parallel Processing (WPP) coding tool, and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel. This has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 with 36 cores from Tilera. The results show that our parallel HEVC decoder is capable of achieving an average frame rate of 116 fps for 4k resolution on a standard multicore CPU. The results also demonstrate that exploiting more parallelism by increasing the number of cores can substantially improve the energy efficiency measured in terms of Joules per frame.
Keywords HEVC · Video coding · Parallel processing · Power analysis · Real-time 4k · UHD
1 Introduction
Recent increasing demands to support higher resolutions such as 4k or UHD in consumer video devices have driven video codec development towards higher compression efficiency. To meet these demands the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T
and ISO/IEC MPEG has started a project to develop a
new video coding standard aiming to reduce the bitrate
of the H.264/AVC High Profile [13] by another 50%.
The target application is, besides 4k resolution, also the support of native HD on mobile devices. Future extensions of the standard also aim to support higher bit depths of up to 14 bit, and higher chrominance fidelity with 4:2:2 and 4:4:4 chroma subsampling. Some of the application use cases, which have been selected for the first test model evaluation, are random access, as used in Video-on-Demand or Broadcast applications, and low delay for conversational applications. The HEVC project started in 2010; the standard was published as a Draft International Standard in July 2012 and is scheduled for finalization in early 2013 [22]. The HEVC project uses the HEVC Test Model (HM), which is the reference software, to integrate and evaluate new coding tools for standardization.
To support 4k resolution in real time at frame rates of 50 fps and higher, HEVC includes several so-called coding tools that allow each picture to be divided into partitions that can be processed in parallel, without degrading quality or bitrate. For lower resolutions the
provided parallelism can be exploited to improve power
efficiency of computer systems, which we will show in
this paper. Improvements to power efficiency are of key importance in the increasingly mobile market, because power is not scaling down at the same rate as feature size (the so-called power wall).
To investigate whether contemporary multi- and many-cores are able to decode 4k HEVC video sequences in real time within limited power budgets, we perform a performance and power analysis of (parallel) HEVC decoding.
In particular, our contributions can be summarized as
follows:
– We improve the single-threaded performance compared to the HEVC Test Model (HM) 8.0 by an average of 4.1× using both architecture-independent and architecture-specific optimizations.
– We show that by using the novel overlapped wavefront (OWF) approach on top of the optimized single-threaded baseline, high speedups can be obtained, resulting in much higher than real-time performance (up to 186 fps) for 4k sequences.
– Performance, power, and energy efficiency results
are provided for three platforms: a server Intel CPU
with 8 cores, a laptop Intel CPU with 4 cores, and
a TILE-Gx36 with 36 cores from Tilera.
This paper is organized as follows: first, in Section 2,
we present a brief overview of the HEVC standard.
Then, in Section 3 we describe the tools for parallel
processing that have been included in HEVC. In Section 4 we present the details of the implementation of an
optimized parallel HEVC decoder. Section 5 describes
the experimental setup, followed by the experimental results in Section 6. Finally, we summarize and conclude
the paper in Section 7.
2 Overview of the HEVC Codec
HEVC is based on the same structure as prior hybrid
block-based video codecs such as H.264/AVC, but with
enhancements and generalizations in each coding stage.
Figure 1 depicts a general diagram of the HEVC decoder and its coding stages [23].
In HEVC the motion compensation uses the same quarter-pixel motion vector resolution as H.264/AVC, but the derivation of interpolated pixels is generalized using a larger 8-tap interpolation filter for luma and a 4-tap interpolation filter for chroma. Intra prediction is generalized as well by parametrizing the prediction angle, allowing 33 different angles. The transform is still an integer transform but allows more block sizes, ranging from 4 × 4 to
32 × 32, and has higher internal processing precision.
CABAC is the only entropy coding algorithm available
in HEVC with improvements to coefficient scan patterns and context grouping to improve implementation
efficiency. As in H.264/AVC, an in-loop deblocking filter is applied to reduce blocking artifacts. The HEVC deblocking filter is only applied to edges on an 8 × 8 grid, creating opportunities to filter edges in parallel. HEVC also includes an additional in-loop filter:
the sample adaptive offset (SAO) filter [11]. The SAO
filter can be activated on a CTB basis by transmitting
offset values or using the offset values of the top or left
neighboring CTB. These offsets can either correspond
to the intensity bands of pixel values (band offset mode)
or the difference compared to neighboring pixels (edge
offset mode).
HEVC also defines a more efficient block structure, based on Coding Tree Blocks (CTBs). The sequence is coded using a CTB size of 16 × 16, 32 × 32, or 64 × 64 pixels. Each CTB can be recursively subdivided using a quad-tree segmentation into coding units (CUs), which can in turn be further subdivided into prediction units (PUs) and transform units (TUs) [14]. Coding units can be subdivided down to a minimum CU size of 8 × 8 pixels. The minimum PU sizes are 8 × 4 and 4 × 8, and the minimum TU size is 4 × 4 pixels.
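To make the recursive structure concrete, the following is a minimal C sketch of how a decoder can walk such a CTB quad-tree; the split decision and the leaf-CU decoding are left as hypothetical helper functions, since the actual bitstream parsing is not shown here.

typedef struct decoder decoder_t;              /* opaque decoder state (illustrative) */

extern int  cu_is_split(decoder_t *dec, int x, int y, int size);  /* e.g. a parsed split_cu_flag */
extern void decode_cu(decoder_t *dec, int x, int y, int size);    /* leaf CU: PUs and TUs        */

/* Recursively traverse the CU quad-tree of one CTB. CUs are quad-split down
 * to the minimum CU size of 8x8 pixels. */
static void decode_cu_tree(decoder_t *dec, int x, int y, int size)
{
    if (size > 8 && cu_is_split(dec, x, y, size)) {
        int half = size / 2;
        decode_cu_tree(dec, x,        y,        half);
        decode_cu_tree(dec, x + half, y,        half);
        decode_cu_tree(dec, x,        y + half, half);
        decode_cu_tree(dec, x + half, y + half, half);
    } else {
        decode_cu(dec, x, y, size);
    }
}

/* For a 64x64 CTB the traversal starts with decode_cu_tree(dec, ctb_x, ctb_y, 64). */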
3 Parallel Video Decoding with HEVC
Previous video codecs, in particular H.264/AVC, have been parallelized using mainly slice-level or macroblock-level parallelism [17, 21]. In H.264/AVC, as well as in HEVC, a picture can be partitioned into multiple arbitrarily sized slices for independent processing. Having multiple slices in a picture, however, degrades objective and subjective quality due to the additional slice header overhead and slice boundary discontinuities [18].
In H.264/AVC independent macroblocks inside a frame
can be reconstructed in parallel using a wavefront approach [24]. Furthermore, macroblocks from different
frames can be processed in parallel provided the dependencies due to motion compensation are handled
correctly [18]. Entropy decoding, however, can only be
parallelized at the frame (slice) level and therefore it
has to be decoupled from macroblock reconstruction.
Although this approach can scale to a many-core architecture it increases the memory usage [9].
Fig. 1 Block diagram of the HEVC decoder (entropy decoding, inverse quantization, inverse DCT, intra prediction, motion compensation, deblocking filter, SAO filter, reference pictures).

In order to solve the above-mentioned problems, two tools aiming at facilitating high-level parallel processing have been included in the (draft) HEVC standard: Wavefront Parallel Processing (WPP) [15] and Tiles [12].
These tools allow subdividing each picture into multiple partitions that can be processed in parallel. Tiles allow dividing the picture into rectangular groups of CTBs separated by vertical and horizontal boundaries. Tile boundaries, similarly to slice boundaries, break all the dependencies and because of that have high coding losses and can generate boundary artifacts.
WPP defines one picture partition per CTB row, but does not require special handling of the row borders, preserving the entropy, prediction, and filtering dependencies. The header overhead is small, as only the partition entry point offsets need to be signaled additionally.
As a result the rate-distortion loss of a WPP bitstream
is small compared to a non-parallel bitstream, while
enabling a decent amount of parallelism that increases
with the picture resolution.
Before WPP was completely defined another tool
called entropy slices was considered in HEVC [19]. Entropy slices break entropy dependencies but maintain
the prediction dependencies. When using one entropy
slice per CTB row it is possible to exploit wavefront parallelism in a similar way to WPP. An implementation
of a parallel HEVC decoder using wavefront processing
with entropy slices on a multicore system with 12 cores
showed a speedup of 7.3 for 4K resolution [2].
The scalability of wavefront processing is limited by
the reduced number of independent blocks (CTBs or
macroblocks) at the beginning and at the end of each
frame. To solve this limitation, and increase the parallel scalability of WPP, a technique called Overlapped
Wavefront (OWF) has been proposed [8]. With OWF
multiple pictures can be decoded simultaneously resulting in a more constant parallelism during execution. An
implementation of OWF on a multicore system consisting of 12 cores has shown average speedups of 10X for
4K resolution. An in-depth analysis of the parallelization tools included in HEVC has shown that when WPP
is combined with the OWF algorithm it has a better
scalability than Tiles [7].
Fig. 2 Frames can be overlapped with a restricted motion vector size, because the reference area is fully decoded.
4 Optimized Parallel HEVC Decoder
Implementation
To be able to provide representative power and performance results an optimized parallel HEVC decoder is
developed. The developed decoder is compatible with
the coding tools described in the HEVC 8.0 draft standard [5]. We first discuss the employed parallelization
strategy followed by a concise overview of the steps
involved in decoding the Coding Tree Blocks (CTBs)
in our implementation. Thereafter, we present the improvements in single-threaded performance over the HEVC Test Model (HM) reference code and briefly discuss where the main improvements originate from.
4.1 Overlapped Wavefront
As mentioned in Section 3, by using the WPP coding tool in HEVC, one thread per CTB row can be used to decode each picture in parallel. When the WPP coding tool is used, the bitstream contains an entry point offset for each CTB row. These offsets allow up to as many threads as there are CTB rows to start decoding in parallel, at a small coding efficiency cost of around 1 percent [15]. In previous work it has
been found that a high parallelization efficiency can be
achieved when WPP is combined with the overlapped
execution of consecutive frames [8]. Figure 2 illustrates
the overlapped wavefront (OWF) approach.
Fig. 3 Decoder organization supporting multi-threaded overlapping wavefront execution.
Instead of waiting for the entire picture to finish, threads can already start decoding the next frame to mitigate the parallelism ramping inefficiencies of regular wavefront execution. As the figure illustrates, to
overlap consecutive pictures, a restriction on the size of
the vertical component of the motion vectors is required
to ensure that the reference area is available. The maximum number of parallel CTB rows using OWF can be derived using

PAR_OWF = ⌊(H_Pic − MMV − 8) / H_CTB⌋    (1)
where H_Pic is the picture height in pixels, H_CTB is the CTB height in pixels, and MMV denotes the maximum size of the vertical motion vector component. Eight pixels are additionally subtracted to take into account the delay of the in-loop filters (deblocking filter and SAO) and the additional pixel rows required by the interpolation filter, which will be clarified in the next section. Because the HEVC draft currently does not define the MMV, we instead assume the same MMV as H.264, i.e., 512 pixels for 1080p, and doubled this to 1024 for 2160p. This restriction allows up to 8 threads to be used for 1080p and up to 17 threads for 2160p resolution sequences.
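As an illustration of Equation (1), the following small C sketch (the function and variable names are ours, not part of the decoder) computes the resulting thread limits for the two resolutions used in this paper:

#include <stdio.h>

/* Eq. (1): PAR_OWF = floor((H_Pic - MMV - 8) / H_CTB), assuming 64x64 CTBs. */
static int par_owf(int pic_height, int mmv, int ctb_height)
{
    return (pic_height - mmv - 8) / ctb_height;   /* integer division acts as floor here */
}

int main(void)
{
    printf("1080p: up to %d threads\n", par_owf(1080, 512, 64));   /* prints 8  */
    printf("2160p: up to %d threads\n", par_owf(2160, 1024, 64));  /* prints 17 */
    return 0;
}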
Figure 3 depicts the decoder organization used for
the implementation of OWF. The decoder consists of
two “control” threads (parse and output) and N “worker”
threads. The parse thread acquires a free picture buffer from the decoded picture buffer (DPB) for every new picture and pushes a task to the shared worker queue for each WPP partition it encounters. The worker threads pop tasks from the queue in order and maintain the wavefront dependencies among themselves. The worker thread that decodes the last CTB of a picture notifies the output thread of completion by pushing the completed picture. The output thread reorders the decoded pictures into presentation order and releases the pictures after they have been displayed or output.
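A minimal sketch of how a worker thread can enforce these wavefront dependencies is shown below; it assumes a per-row counter of completed CTBs protected by a mutex and condition variable (the data structure and function names are illustrative, not the decoder's actual API, which may use lock-free counters instead):

#include <pthread.h>

/* Per-picture wavefront progress: ctbs_done[row] counts finished CTBs in that
 * row. A CTB may start once its above-right neighbor has been fully decoded. */
typedef struct {
    int            *ctbs_done;
    pthread_mutex_t lock;
    pthread_cond_t  progress;
} wpp_progress_t;

/* Block until CTB 'col' of 'row' may be decoded. */
static void wait_wavefront_dep(wpp_progress_t *p, int row, int col)
{
    if (row == 0)
        return;                                    /* the first row has no upper neighbor */
    pthread_mutex_lock(&p->lock);
    while (p->ctbs_done[row - 1] < col + 2)        /* above-right CTB not finished yet    */
        pthread_cond_wait(&p->progress, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Mark one more CTB of 'row' as finished and wake up waiting workers. */
static void signal_ctb_done(wpp_progress_t *p, int row)
{
    pthread_mutex_lock(&p->lock);
    p->ctbs_done[row]++;
    pthread_cond_broadcast(&p->progress);
    pthread_mutex_unlock(&p->lock);
}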
Fig. 4 Order and translation of the filtering steps to allow CTB-based execution: (a) reconstruction, (b) vertical edges, (c) horizontal edges, (d) SAO.

The parse thread is also responsible for releasing no longer used reference pictures. In HEVC the reference pictures that need to be kept in the DPB are signaled for every slice, which is a departure from H.264 where the reference pictures that need to be released after decoding the slice are signaled instead. For overlapped execution this is problematic: in case the current picture uses a reference picture that is not used in the next picture, the reference picture could be released too early.

A solution is to release a reference picture not directly when it is absent from the reference picture set (RPS), but only when it is not present in the RPS of two consecutive pictures. This delays the release of the reference pictures by one picture. In addition, for this scheme to work, it must be ensured that at any time a maximum of two pictures are in flight. This is implemented by having the parse thread wait until the output thread notifies the completion of a picture if two pictures are already in flight.
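The delayed release can be sketched as follows; the DPB, picture, and RPS types as well as the helper functions are hypothetical stand-ins for the decoder's internal structures:

typedef struct { int poc; int is_reference; } picture_t;
typedef struct { picture_t **pictures; int num_pictures; } dpb_t;
typedef struct rps rps_t;

extern int  rps_contains(const rps_t *rps, int poc);            /* is this POC referenced?  */
extern void dpb_release_reference(dpb_t *dpb, picture_t *pic);  /* free the reference slot  */

/* Release a reference picture only when it is absent from the reference
 * picture sets of the two most recently parsed pictures, so that a picture
 * still being decoded in overlapped fashion cannot lose its references. */
static void release_unused_references(dpb_t *dpb, const rps_t *rps_current,
                                      const rps_t *rps_previous)
{
    for (int i = 0; i < dpb->num_pictures; i++) {
        picture_t *pic = dpb->pictures[i];
        if (pic->is_reference &&
            !rps_contains(rps_current, pic->poc) &&
            !rps_contains(rps_previous, pic->poc))
            dpb_release_reference(dpb, pic);
    }
}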
4.2 Coding Tree Block Decoding
A requirement for OWF execution is that all the decoding steps for one CTB are performed before continuing
with the next CTB to ensure that the required reference
area is available for the threads processing the consecutive picture. In addition, performing all the decoding
steps on a CTB basis also improves overall implementation efficiency compared to performing the decoding
steps on a slice or picture basis due to increased data
locality. The two in-loop filters, the deblocking and SAO filters, use and could alter pixels from surrounding CTBs, which are not all available at the time the CTB is decoded. Processing these filters on a CTB basis therefore requires delaying them, as illustrated in Figure 4.
Fig. 5 Subdivision of a CTB into coding units, prediction units, and transform units.

Figure 4 shows the sequence of filters that are applied after parsing and reconstruction (prediction and transform) of a CTB. In this example the CTB split depth is 2 for each leaf CU and no further prediction and transform subdivision is assumed. First the vertical
edges of the CTB are deblocked, followed by the horizontal edges. The deblocking of the horizontal edges must be delayed horizontally by 4 pixels, because HEVC specifies that the horizontal edges must be deblocked using the vertically deblocked pixels as input. These 4 pixels have not been vertically deblocked yet, as the last edge belongs to the next CTB. In turn the SAO filter is also delayed, because it uses the horizontally filtered pixels as input, and would require a minimum translation of 4 pixels upwards and 1 pixel to the left. Delaying the filter 1 pixel to the left, however, would mean that the SAO application window could cross 4 CTBs, which might all have different SAO filter types. We decided to delay the SAO filter one entire CTB horizontally to reduce this control overhead.
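In code, the resulting per-CTB processing order can be sketched as follows (the function names are illustrative; in the actual decoder the 4-pixel shift of the horizontal deblocking is handled inside the corresponding kernel):

typedef struct decoder decoder_t;

extern void parse_and_reconstruct_ctb(decoder_t *dec, int ctb_x, int ctb_y);
extern void deblock_vertical_edges(decoder_t *dec, int ctb_x, int ctb_y);
extern void deblock_horizontal_edges(decoder_t *dec, int ctb_x, int ctb_y); /* shifted 4 pixels left */
extern void sao_filter_ctb(decoder_t *dec, int ctb_x, int ctb_y);

/* All decoding steps of one CTB are performed before moving on to the next
 * CTB; the later filter stages trail behind so that every pixel they touch
 * has already passed the preceding stages. */
static void decode_ctb(decoder_t *dec, int ctb_x, int ctb_y)
{
    parse_and_reconstruct_ctb(dec, ctb_x, ctb_y);  /* entropy decoding, prediction, transform */
    deblock_vertical_edges(dec, ctb_x, ctb_y);
    deblock_horizontal_edges(dec, ctb_x, ctb_y);
    if (ctb_x > 0)
        sao_filter_ctb(dec, ctb_x - 1, ctb_y);     /* SAO delayed by one CTB horizontally */
}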
The parsing and reconstruction follow the quad-tree CTB structure illustrated in Figure 5. Each CTB can be split into four CUs, which can be further split into smaller CUs down to a minimum CU size of 8×8 pixels.
Each leaf CU can be intra or inter predicted and has
one of the prediction unit (PU) shapes. In case of intra
CUs only the first two PU shapes are available, while for
inter CUs all the PU shapes are possible. For each PU
a different intra-prediction mode or motion vector and
reference index pair can be derived. Inter PUs can directly be motion compensated after deriving the motion
vectors and reference indices. Intra prediction, however,
has to be performed for each transform unit (TU) following the residual quadtree (RQT) block structure.
For each CU an RQT containing TUs can be transmitted in the bitstream. Like the CTB quad-tree, the RQT can also be split further, but has a minimum size of 4×4 pixels. In our implementation the coefficient parsing and the inverse transform follow each other directly for optimal locality. Also, adding the residual to the prediction and clipping the result is merged with the inverse transform.
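As an example of such a fused loop, a scalar version of the residual add-and-clip that the inverse transform ends with could look as follows (a simplified sketch for 8-bit output):

#include <stdint.h>
#include <stddef.h>

/* Add the inverse-transformed residual to the prediction held in 'dst' and
 * clip to the 8-bit range in a single pass, avoiding an intermediate buffer. */
static void add_residual_clip_u8(uint8_t *dst, ptrdiff_t dst_stride,
                                 const int16_t *residual, ptrdiff_t res_stride,
                                 int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = dst[x] + residual[x];
            dst[x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
        dst      += dst_stride;
        residual += res_stride;
    }
}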
4.3 Single-threaded Performance Improvements
In addition to the parallelization, the single-threaded performance has also been significantly improved compared to the reference HM code. The performance improvements do not result from a few concentrated changes, but instead originate from many small improvements over the entire codec, which include both architecture-independent and architecture-specific changes. Some of the more prominent architecture-independent changes are a much simplified neighbor context derivation (for parsing, prediction, and filtering), fusing many kernel loops (transform-add-clip, interpolation-weighting, inverse quantization and coefficient parsing), skipping zero-block transforms, a branchless CABAC implementation, removing redundantly stored syntax elements, the use of a scratchpad for better TLB locality, switching from CTB-based to CU-based reconstruction, CTB-based filtering, using reference pictures with 8-bit pixel depth when possible, an internal bit depth of 8 bit, an improved Annex B parser and emulation prevention, etc.
The architecture-specific performance improvements originate mostly from SIMD optimizations, which are applied to accelerate several time-consuming kernels such as the 8-tap interpolation filter, the inverse transform with block sizes up to 32×32, and the SAO filter. Attention was also paid to prefetching reference blocks in the interpolation filter and to using write-combining store operations when writing the final reconstructed picture back to memory. For the Tilera architecture the scratchpad memory allocated for each thread is additionally homed locally, which improves the cache utilization by avoiding redundant cache line copies as long as the decoding thread remains pinned to the same core.
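To give an impression of the kind of SIMD optimization applied on the x86 platforms, the sketch below shows an SSE2 version of the residual add-and-clip from the previous section for one row of 16 pixels; the saturating pack instruction performs the clipping to the 8-bit range implicitly (this is an illustrative kernel, not necessarily identical to the one in our decoder):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Add 16 residual samples to 16 predicted pixels and clip to [0, 255]. */
static void add_residual_clip16_sse2(uint8_t *dst, const int16_t *residual)
{
    __m128i zero = _mm_setzero_si128();
    __m128i pred = _mm_loadu_si128((const __m128i *)dst);           /* 16 x u8 prediction  */
    __m128i p_lo = _mm_unpacklo_epi8(pred, zero);                    /* widen to 16 bit     */
    __m128i p_hi = _mm_unpackhi_epi8(pred, zero);
    __m128i r_lo = _mm_loadu_si128((const __m128i *)residual);       /* 8 x i16 residual    */
    __m128i r_hi = _mm_loadu_si128((const __m128i *)(residual + 8));
    __m128i s_lo = _mm_adds_epi16(p_lo, r_lo);
    __m128i s_hi = _mm_adds_epi16(p_hi, r_hi);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(s_lo, s_hi));  /* pack + clip to u8   */
}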
It should be noted that additional improvements
can be achieved by applying SIMD optimizations to
the deblocking filter and intra-prediction. For the deblocking filter, the performance gains with SIMD will be
smaller compared to inverse transform or interpolation
filters mainly because of the branches introduced by the
filter adaptation. Although intra-prediction can benefit from SIMD optimization, it consumes only a small fraction of the total execution time.
5 Experimental Setup
5.1 Platforms
Our experimental setup consists of three different platforms with different numbers of cores, microarchitectures, and performance levels. Table 1 presents a summary of the main properties of the three platforms.
Table 1 Properties of the three different systems used in the experiments.

System              Intel X86-LP            Intel X86-HP            Tilera TILE-Gx
Processor           Core i7-2760qm          Xeon E5 2687W           TILE-Gx8036
µarchitecture       Sandy Bridge            Sandy Bridge            TILE-Gx
Num. cores          4                       8                       36
SMT                 2-way                   2-way                   no
Frequency [GHz]     0.8-2.4                 1.2-3.1                 1.0
Voltage [V]         0.76-1.06               0.84-1.20               0.96
LLC cache [MB]      6                       20                      9
Memory              2-ch. DDR3 1600 MHz     4-ch. DDR3 1600 MHz     2-ch. DDR3 1333 MHz
Process [nm]        32                      32                      40
Operating system    Kubuntu 12.04           Kubuntu 12.04           Tilera MDE-4.0.3.145127
Linux kernel        3.2.0-25                3.2.0-29                2.6.38.8
Compiler            GCC 4.6.3               GCC 4.6.3               GCC 4.6.3
Compiler flags      -O3, AVX enabled        -O3, AVX enabled        -O2
To test a high-performance multicore platform we selected a server with a Xeon E5 2687W processor that consists of 8 Intel x86-64 cores running at 3.1 GHz. We will refer to this system as X86-HP. As a power-optimized multicore platform we used a laptop with a Core i7-2760qm processor that has 4 Intel x86-64 cores running at up to 2.4 GHz. We will refer to it as X86-LP. To evaluate a many-core processor we used a TILE-Gx8036 processor on a TILEncore-Gx36 card, which is connected via PCIe to a host system. The TILE-Gx8036 has 36 cores running at 1.0 GHz, where each core is a 64-bit VLIW processor. All the cores are connected with a mesh on-chip interconnect network [3]. The chip includes other peripherals such as the cryptographic unit (MICA) and 4 network interfaces (mPIPE), which are not used in the experiments reported in this paper, except for one of the Ethernet interfaces and the power sensors. In the rest of the paper we will refer to this system as TILE-Gx.
5.2 Power measurement
To measure the power on the Intel platforms we used
the Running Average Power Limit (RAPL) feature introduced with the Sandy Bridge microarchitecture [10].
RAPL uses an architectural power predictor, which is also exposed to software, to implement power capping of the chip and more consistent turbo clocking behavior. The architectural power predictor updates a model-specific register (MSR) once every millisecond, and provides high accuracy and correlation with actual power consumption [20]. RAPL exposes the energy consumed by the complete package (the complete CPU die), as well as by only the cores and their caches. Additionally, depending on the model, RAPL exposes the power of the integrated graphics processing unit or of the DRAM controllers. To verify the accuracy we have compared the
power reported by RAPL for the two Intel platforms
with the power measured for the entire system at different voltage and frequency points, and we observed
very good correlation at all operating points. For our
power measurements we access the RAPL MSR for the
complete package power via PAPI [6].
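For reference, the package energy counter that RAPL exposes can also be read directly from the model-specific registers; the following minimal sketch (requiring the msr kernel module and root privileges) uses the MSR addresses documented by Intel and is independent of the PAPI-based setup actually used in our measurements:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t read_msr(int fd, uint32_t reg)
{
    uint64_t value = 0;
    pread(fd, &value, sizeof(value), reg);   /* the msr device uses the register as file offset */
    return value;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t units = read_msr(fd, 0x606);                       /* MSR_RAPL_POWER_UNIT   */
    double joules_per_unit = 1.0 / (double)(1u << ((units >> 8) & 0x1f));

    uint64_t e0 = read_msr(fd, 0x611);                          /* MSR_PKG_ENERGY_STATUS */
    sleep(1);
    uint64_t e1 = read_msr(fd, 0x611);

    /* The energy counter is 32 bits wide; the unsigned subtraction below
     * tolerates a single wrap-around during the measurement interval. */
    printf("package power: %.2f W\n", (uint32_t)(e1 - e0) * joules_per_unit);
    close(fd);
    return 0;
}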
We measure the power of the TILE-Gx8036 CPU
core using the INA219 power monitor chip. This chip measures power by measuring the voltage and the current, the latter via the voltage drop over a shunt resistor. The INA219 contains the required signal conditioning circuits, a 12-bit ADC, and an I2C bus interface. The measurement error of the INA219 is lower than 0.5%, while the used
shunt resistors provide better than 1% accuracy. Overall the error of the measurements should thus be within
±1.5%.
The Tilera TILEncore-Gx36 PCIe card contains multiple power monitors, measuring the different voltage
rails on the board. The rails are measured behind the
power conversion and therefore do not include the losses
of the power conversion circuits. For comparability with
the Intel RAPL counters we only record the power consumed by the core voltage rail.
The power monitors can be queried using the Tilera
provided board test kit (BTK). We used this capability
to sample the core power at a rate of approximately 10 Hz
and save power and timestamps to a data file. When we
run applications on the board we record start and end
time stamps and average all power samples collected
during this time interval to calculate average power for
the application. Energy was then calculated by multiplying runtime by average power.
5.3 Test Sequences and HEVC Encoding
Because parallelism is mainly required at HD resolutions, we selected videos for 1080p (1920×1080) and
2160p (4096×2160) resolutions. 1080p is representative for current high-definition systems, while 2160p is representative for the next generation of high-quality video.

Table 2 Coding options.

Options                                  Value
Max. CU size                             64×64
Max. partition depth                     4
Transform size: Min.-Max.                4-32
Period of I-frames                       256
Number of B-frames (GOP size)            8
Number of reference frames               4
Motion estimation algorithm              EPZS [25]
Search range                             48
Asymmetrical Motion Partition            Enabled
Internal bit depth                       8
Sample Adaptive Offset (SAO)             Enabled
Wavefront Parallel Processing (WPP)      Enabled
Quantization Parameter (QP)              22, 26, 30, 34
For 1080p we used the 5 test sequences described in the HEVC “common conditions” [4]. For 2160p resolution we used four videos from the EBU (European Broadcasting Union) 4K test set [16]. The 1080p sequences have 8 bit per sample and the 2160p sequences have 10 bit per sample. The 1080p sequences are in YUV 4:2:0 format, and the 2160p sequences were originally in 4:2:2 format but were converted to 4:2:0 (because currently the HEVC reference encoder cannot handle formats other than 4:2:0).
All the test sequences have been encoded with the
HEVC HM reference encoder version 8.0 (svn revision
r2738) [5]. Encoding options are based on the HEVC Main profile using the random access configuration [4].
Table 2 shows the main configuration parameters of the
encoder. In order to enable parallel processing WPP
has been enabled. In addition, for supporting OWF,
the maximum length of the vertical motion vectors has
been constrained to 512 pixels for 1080p and 1024 for
2160p. As a result a maximum of 8 and 17 threads can
be used for 1080p and 2160p respectively. Table 3 shows the resulting bitrate and weighted PSNR (0.75×Y + 0.125×U + 0.125×V) for all the videos under consideration.
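For completeness, the weighted PSNR used in Table 3 is simply a 6:1:1 weighting of the per-component PSNR values:

/* Weighted PSNR as used for Table 3: PSNR_YUV = 0.75*Y + 0.125*U + 0.125*V. */
static double weighted_psnr(double psnr_y, double psnr_u, double psnr_v)
{
    return 0.75 * psnr_y + 0.125 * psnr_u + 0.125 * psnr_v;
}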
6 Experimental Results
For the experiments that do not use frequency and voltage scaling the x86 platforms are configured at their
highest rated frequency and DVFS and Turbo Boost
are disabled in the OS and BIOS, respectively. For improved reproducibility and reduced effect of the OS
thread scheduling policies, threads are pinned to cores.
In the X86 platforms the decoding runs include executions with and without simultaneous multithreading (SMT) enabled.

Fig. 6 Normalized average execution time, showing the effect of the architecture-independent (scalar) and architecture-dependent (simd+) optimizations with respect to the reference code.
6.1 Single-threaded Optimizations
In Section 4.3 we described the single-threaded optimizations applied to the HEVC decoder. Figure 6 shows the normalized execution time of the architecture-independent (scalar) and architecture-dependent (simd+) optimizations compared to the reference decoder (compiled with autovectorization disabled). For the X86-HP platform the scalar optimizations give a reduction of 48% in execution time compared to the baseline, and the simd+ optimizations give an additional 32% reduction. For the TILE-Gx architecture the results are
similar: a 51% reduction in execution time due to scalar
optimizations and an additional 27% reduction due to
simd+ optimizations. The optimized decoder that includes all the optimizations will be used as baseline for
the parallel executions that will be presented in the next
sections.
6.2 Performance
We executed the optimized parallel decoder on the three
platforms under study for all the videos at different QP
values and measured the execution time. Based on it we
computed the performance, expressed in frames per second. Tables 4 and 5 show the performance for 1080p and
2160p resolutions respectively. They include results for
experiments using one thread and the maximum thread
count.
The X86-HP platform achieves the highest performance, with up to 414 fps for 1080p and up to 185 fps
for 2160p. When using 8 threads it is possible to decode
all the 2160p sequences with more than 50 fps, even the most difficult ones. The X86-LP platform achieves between 76 and 230 fps when using 4 cores for the 1080p sequences, and between 23 and 86 fps for the 2160p sequences.

Table 3 Bitrate (in Kb/s) and weighted PSNR (in dB) for all the encoded video sequences.

                                             QP22            QP26            QP30            QP34
Video            Resolution  Frames   Bitrate  PSNR    Bitrate  PSNR    Bitrate  PSNR    Bitrate  PSNR
BasketballDrive  1080p50     500      16595    40.44   6889     39.09   3651     37.74   2091     36.25
BQTerrace        1080p60     600      38394    38.83   8866     37.21   2823     36.22   1236     34.92
Cactus           1080p50     500      16588    39.30   6030     38.07   3148     36.74   1776     35.21
Kimono           1080p24     241      4422     42.25   2326     40.79   1302     39.19   732      37.48
ParkScene        1080p24     240      6401     40.71   3093     38.70   1600     36.81   829      34.92
DancerPillar     2160p50     500      21071    41.60   3042     41.05   1573     40.36   928      39.46
DancerWater      2160p50     500      35657    43.22   18966    41.96   10104    40.54   5302     39.00
FountainPan      2160p50     500      105154   41.02   52612    39.20   26369    37.46   12791    35.81
LupoPuppet       2160p50     500      56928    40.80   21889    39.94   11554    39.00   6236     37.95
On the TILE-Gx platform, real-time is achieved for
most 1080p sequences, except those that require 60 fps
at low QPs (smaller than 26). For the 2160p sequences
the performance, at the maximum core count, is between 16 and 51 fps. For most of the sequences it is not
possible to reach real-time performance. The main limitation is the single-threaded performance, which is significantly lower compared to the other architectures because of frequency and microarchitectural disadvantages. Although there are more cores available, we cannot use more threads because the maximum thread limit of the OWF algorithm has been reached.

The results show that the performance depends heavily on the video content and the bitrate. On the one hand, for sequences with complex or fast motion, such as LupoPuppet, there are fewer skip blocks and more motion compensation operations need to be applied per frame. On the other hand, when the bitrate increases (and the QP decreases), the number of coefficients that need to be parsed increases as well, resulting in more CABAC operations. Due to its sequential behavior CABAC has a low IPC and cannot be optimized with SIMD instructions.
6.3 Speedup
Figure 7 shows the average speedup achieved using multiple cores compared to the optimized code as baseline
for each of the three test platforms. The figure shows
that the scaling for the X86 platforms is high, and because of the higher parallelism 2160p scales better for
higher core counts than 1080p. Also SMT shows up
to 25% performance improvement at low core count
and around 12% performance improvement at high core
count.
For the TILE-Gx platform similar speedup results are observed up to 8 cores, with speedups of 6.8× and 7.6× for 1080p and 2160p respectively. At 17 cores, however, a more moderate speedup of 14× is achieved, which is partly caused by the thread stalls resulting from maintaining the wavefront dependencies. As will be shown in the next section, contention on the TILE-Gx36 memory subsystem reduces scalability at high core counts as well.
6.4 Power and Energy
Figure 8 shows the power in Watts (W) for each platform using different numbers of cores. The power numbers indicate that a high amount of power is associated with the high performance of the X86-HP platform, with over 100 W at the highest core count. The X86-LP and TILE-Gx platforms fare much better in this respect, with a maximum of 31.6 W when using 4 cores with SMT and 20 W at 17 cores, respectively. It can also be observed that the TILE-Gx platform has a relatively high idle-to-load power ratio. This can be explained by the larger number of power management options available on the X86 platforms. On the TILE-Gx platform the OS does not implement DVFS, and no clock or power gating is available or performed. In contrast, the X86 platforms implement DVFS (although disabled for this experiment) and many power states for different parts of the chip.
Despite the usage of fine-grained power gating on
the X86 platforms, the power consumption for 1 core is
higher than the maximum power consumption divided
by the number of cores on the chip. This is caused by the
parts of the chip that are always on, such as the PCI-e
controller and QPI interfaces, and because parts of the
chip cannot be power gated when at least one thread
is actively using it, such as the memory controllers and L3 cache partitions.

Table 4 Performance in frames per second for 1080p videos at different bitrates for three different platforms.

                       X86-LP                 X86-HP                 Tilera
Video            QP    1 thread   4 threads   1 thread   8 threads   1 thread   8 threads
BasketballDrive  22    28.8       107.0       35.6       219.7       5.9        36.3
                 26    39.8       148.3       48.6       302.3       7.9        49.2
                 30    48.0       179.2       57.9       368.2       9.3        60.0
                 34    54.9       204.8       65.9       424.4       10.4       67.6
BQTerrace        22    20.0       76.3        25.1       172.6       4.2        28.5
                 26    41.3       151.1       50.4       318.8       8.2        50.8
                 30    58.4       211.3       70.2       431.0       11.0       69.2
                 34    68.0       246.7       81.0       501.7       12.4       78.2
Cactus           22    31.9       116.9       39.4       247.3       6.6        40.6
                 26    51.3       186.4       62.2       363.9       10.2       59.3
                 30    63.9       230.6       76.4       449.6       12.2       72.1
                 34    74.6       267.7       88.6       529.7       13.8       83.8
Kimono1          22    34.8       131.1       42.7       283.0       7.0        45.5
                 26    43.0       161.1       52.2       346.8       8.4        56.0
                 30    50.4       187.5       60.9       398.7       9.6        63.5
                 34    57.5       210.9       68.8       433.2       10.7       70.1
ParkScene        22    29.4       110.6       36.2       235.8       6.1        38.3
                 26    39.1       146.3       47.8       305.3       7.9        49.6
                 30    48.0       178.8       58.0       369.2       9.3        60.1
                 34    56.8       208.7       68.3       425.5       10.7       69.0
Average                47.0       173.1       56.8       356.3       9.1        57.4
Table 5 Performance in frames per second for 2160p videos at different bitrates for three different platforms.

                     X86-LP                 X86-HP                 Tilera
Video          QP    1 thread   4 threads   1 thread   8 threads   1 thread   17 threads
DancerPillar   22    11.7       44.8        14.0       101.3       2.2        27.3
               26    18.0       68.4        20.6       151.3       3.2        41.8
               30    21.0       79.2        23.6       171.6       3.7        46.9
               34    23.1       86.8        25.8       185.6       4.0        51.0
DancerWater    22    9.9        38.3        11.8       84.4        1.9        21.6
               26    12.3       47.0        14.4       103.4       2.3        26.1
               30    14.7       55.9        17.1       121.5       2.7        31.5
               34    16.8       64.0        19.3       137.9       3.0        37.0
FountainPan    22    6.1        23.8        7.5        56.2        1.3        16.4
               26    8.2        31.9        10.0       74.3        1.6        21.1
               30    10.4       40.1        12.6       90.5        2.0        26.1
               34    12.6       48.3        14.9       109.9       2.3        30.6
LupoPuppet     22    8.7        33.8        10.7       79.7        1.7        22.0
               26    12.6       48.9        15.0       112.4       2.3        31.3
               30    14.6       56.6        17.1       128.6       2.6        36.5
               34    16.4       62.9        18.9       142.2       2.9        40.5
Average              13.6       51.9        15.8       115.7       2.5        31.7
The power efficiency of the chip depends both on
the performance and the power [1]. Figure 9 shows the
power efficiency expressed in Joules per frame for the
different platforms at different core counts. The common trend for all platforms is that using more cores improves power efficiency. While the power increases with
the core count, the performance increases to a greater
extent. This especially holds for the TILE-Gx platform
due to the relatively high idle power. On the X86 platforms SMT improves power efficiency at low counts,
but loses its effectiveness at higher core counts. At their
most efficient points the X86-LP platform achieves the
lowest energy per frame (179 mJ/F and 614 mJ/F for
1080p and 2160p), followed by the TILE-Gx (334 mJ/F
and 696 mJ/F for 1080p and 2160p), and finally the
X86-HP (298 mJ/F and 995 mJ/F for 1080p and 2160p).
Fig. 7 Speedup for X86-HP, X86-LP and Tilera.

Fig. 8 Power for X86-HP, X86-LP and Tilera.

Fig. 9 Energy per frame for X86-HP, X86-LP and Tilera.
6.5 Frequency and Voltage Scaling on Intel Sandy Bridge

In many practical applications of an HEVC decoder, decoding at the highest possible speed is not desired. Instead the decoder needs to meet a certain frame rate for real-time performance. To measure the power efficiency for these use cases, we have conducted additional experiments on the X86-LP platform in which we limited the decoding speed to 50 fps at different voltage/frequency operating points. These include six static configuration points with frequencies ranging from 800 MHz to 2.4 GHz, and three dynamic configurations: “On Demand” (OD), “On Demand with Turbo” (OD+T), and “Performance with Turbo” (Perf+T). With OD the
processor runs at the lowest possible frequency and increases to maximum when CPU usage reaches 100%.
OD+T adds the Turbo Boost feature that allows the
processor to dynamically increase the speed above its
nominal operating frequency [20]. Finally, in Perf+T
the processor is set to its maximum frequency with
Turbo Boost enabled.
For these experiments the Cactus 1080p50 sequence
encoded with QP 26 and 30 is used for which decoding
speed is close to the average. Figure 10 shows the power
consumption for the different core count/frequency points that achieve 50 fps. The voltage and frequency scale linearly with respect to each other following the range reported in Table 1.

The results show that it is not possible to achieve real-time decoding with only one core, even at the maximum nominal frequency. This is only possible when Turbo Boost is enabled, but with Turbo Boost the decoder uses 2.4 times more power compared to the most efficient setting. With two cores it is possible to decode in real time at 1.6 GHz with around 50% of the power used with one core. Using four cores at 800 MHz is the most efficient setting for this experiment, using just 8.0 and 7.3 W (or 159 mJ/frame and 145 mJ/frame) for QP 26 and QP 30 respectively.

Our results also show that the standard DVFS strategy (OD) produces suboptimal results. DVFS is only slightly more efficient than always running at the stock 2.4 GHz clock speed when Turbo Boost is disabled. When decoding using four cores, DVFS uses between 33% and 46% extra power compared to running at a fixed 800 MHz clock speed. Also, enabling Turbo Boost results in poor power efficiency due to operating at a higher voltage and frequency (3.2-3.4 GHz). These results show that with playback type of workloads the default configurations of many systems, in which both Turbo Boost and DVFS are enabled, produce a much lower power efficiency than the system is capable of.

Fig. 10 Power at different frequency/voltage configurations for the Cactus 1080p50 sequence at two different QP encodings with real-time decoding for the Intel X86-LP system: (a) QP26, (b) QP30.

6.6 Increasing the workload on TILE-Gx

On the TILE-Gx system we have more cores available than we are able to use with one bitstream. Combined with the high idle power this impacts the power efficiency negatively. For that reason we measured the performance and energy per frame while decoding two 2160p bitstreams or four 1080p bitstreams at the same time. This way we are able to use most of the cores. The results of these experiments can be seen in Figure 11. The figure shows that the energy used per frame decreases when more cores are used.

Compared to using only 8 cores for one 1080p bitstream, decoding four 1080p bitstreams at a time decreases the energy per frame from 0.333 J/F to 0.136 J/F. Similarly, for 2160p the energy per frame decreases from 696 mJ/F to 477 mJ/F. The TILE-Gx processor is able to achieve better energy efficiency, when most cores are used, than both Intel platforms, despite its process technology disadvantage (40 nm vs. 32 nm). The lowest energy per frame achieved by the X86-HP platform is 299 mJ/F for 1080p and 968 mJ/F for 2160p, and for the X86-LP platform this is 177 mJ/F for 1080p and 601 mJ/F for 2160p. The power efficiency results are even slightly pessimistic, as part of the idle power is consumed by the on-chip accelerators for high-speed networking (mPIPE) and cryptography (MICA). In our test setup the network is required in order to start executions remotely from the host machine, but it is not strictly required for the decoding process. Disabling these accelerators lowered the power by 2.0 Watts both in idle and at full load, leading to approximately 9% lower mJ/F.

While the aggregated performance using multiple streams is much higher than with a single one, the speedup is not linear and especially towards the end of the curve starts to saturate. When scaling to the number of cores that the TILE-Gx offers, the effects of contention on shared resources become more visible. More optimizations targeting the memory hierarchy, such as improved data prefetching, could improve the results to a greater extent than on the Intel platforms, which have fewer cores and relatively more cache memory.
Fig. 11 Aggregated frames per second and energy per frame for Tilera when increasing the total load: 4 times for 1080p (1080p-4X) and 2 times for 2160p (2160p-2X).

7 Conclusions

In this paper we have presented a power and performance analysis of an optimized parallel HEVC decoder. The parallelization strategy, called Overlapped Wavefront (OWF), which is an extension of the Wavefront Parallel Processing (WPP) tool, allows processing multiple picture partitions as well as multiple pictures in parallel with minimal compression losses.

The parallel decoder has been evaluated on three different architectures: a high-performance 8-core Intel server processor, a 4-core Intel mobile processor, and a 36-core low-power Tilera processor. In addition to performance results we have measured power and computed energy efficiency in terms of Joules per frame. Our parallel HEVC decoder is the first to achieve a frame rate of more than 100 fps at 4k resolution using a standard multi-core CPU. With the 8-core Intel processor we achieved a speedup of 6.3 for 1080p and 7.3 for 2160p. On the Tilera processor a maximum speedup of 12.8 with 17 cores is achieved, but this results only in an average of 31.7 fps for 2160p. Our parallelization approach enables up to 17 threads for 4k and up to 8 for 2k, which is not sufficient to fully utilize the Tilera many-core. Therefore, we have also conducted experiments with decoding four 1080p sequences and two 4k sequences in parallel. In these cases the aggregate performance is 186 fps and 55.6 fps, respectively. For these configurations the Tilera processor obtains a better energy efficiency than the server and laptop Intel CPUs.

The results show that in general using more of the available processors improves the energy efficiency in terms of energy per frame, in particular for a small number of cores. For example, for 4k resolution, on the server CPU going from 1 to 2 cores improves the energy per frame from 743 mJ/frame to 471 mJ/frame, and going to 3 cores improves it further to 384 mJ/frame. Going beyond 4 cores still improves the energy efficiency, but to a lesser extent.

Because the obtained performance, in some cases, is beyond the requirements of real-time video decoding, the additional parallelism in the application can be used to improve power efficiency. For example, on the Intel mobile CPU, we found that 1080p real-time decoding at 50 fps requires 1 core at maximum frequency with Turbo Boost, consuming 19.2 W. Alternatively, the same performance can be achieved with 4 cores running at 800 MHz consuming only 8 W. It has been observed, however, that current dynamic voltage and frequency scaling (DVFS) approaches are not able to reach this optimal power point.

References
1. Akenine-Möller, T., Johnsson, B.: Performance per
What?
Journal of Computer Graphics Techniques
(JCGT) 1(1), 37–41 (2012)
2. Alvarez-Mesa, M., Chi, C.C., Juurlink, B., George, V.,
Schierl, T.: Parallel Video Decoding in the Emerging
HEVC Standard. In: Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012)
3. Bell, S., Edwards, B., Amann, J., Conlin, R., Joyce, K.,
Leung, V., MacKay, J., Reif, M., Bao, L., Brown, J.,
Mattina, M., Miao, C.C., Ramey, C., Wentzlaff, D., Anderson, W., Berger, E., Fairbanks, N., Khan, D., Montenegro, F., Stickney, J., Zook, J.: TILE64 - Processor:
A 64-Core SoC with Mesh Interconnect. In: Digest of
Technical Papers of the IEEE International Solid-State
Circuits Conference (ISSCC), pp. 88–598 (2008)
4. Bossen, F.: Common Test Conditions and Software Reference Configurations. Tech. Rep. JCTVC-H1100 (2012)
5. Bross, B., Han, W.J., Sullivan, G.J., Ohm, J.R., Wiegand, T.: High Efficiency Video Coding (HEVC) Text
Specification Draft 8. Tech. Rep. JCTVC-J1003 (2012)
6. Browne, S., Dongarra, J., Garner, N., Ho, G., Mucci,
P.: A Portable Programming Interface for Performance
Evaluation on Modern Processors. International Journal of High Performance Computing Applications 14(3),
189–204 (2000)
7. Chi, C.C., Alvarez-Mesa, M., Juurlink, B., Clare, G.,
Henry, F., Pateux, S., Schierl, T.: Parallel Scalability and Efficiency of HEVC Parallelization Approaches.
IEEE Transactions on Circuits and Systems for Video
Technology (2012)
8. Chi, C.C., Alvarez-Mesa, M., Juurlink, B., George, V.,
Schierl, T.: Improving the Parallelization Efficiency of
HEVC Decoding. In: Proceedings of IEEE International
Conference on Image Processing (ICIP) (2012)
9. Chi, C.C., Juurlink, B.: A QHD-capable Parallel H.264
Decoder. In: Proceedings of the International Conference
on Supercomputing (ICS), pp. 317–326 (2011)
10. David, H., Gorbatov, E., Hanebutte, U.R., Khanna, R.,
Le, C.: RAPL: Memory power estimation and capping.
In: Proceedings of the ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED),
pp. 189–194 (2010)
11. Fu, C.M., Chen, C.Y., Tsai, C.Y., Huang, Y.W., Lei, S.:
CE13: Sample Adaptive Offset with LCU-Independent
Decoding. Tech. Rep. JCTVC-E409 (2011)
12. Fuldseth, A., Horowitz, M., Xu, S., Zhou, M.: Tiles. Tech.
Rep. JCTVC-E408 (2011)
13. Advanced Video Coding for Generic Audiovisual Services. ITU-T Recommendation H.264 and ISO/IEC
14496-10 (MPEG-4 AVC) (2003)
14. Han, W.J., Min, J., Kim, I.K., Alshina, E., Alshin,
A., Lee, T., Chen, J., Seregin, V., Lee, S., Hong,
Y.M., Cheon, M.S., Shlyakhov, N., McCann, K., Davies,
T., Park, J.H.: Improved Video Compression Efficiency
Through Flexible Unit Representation and Corresponding Extension of Coding Tools. IEEE Transactions on
Circuits and Systems for Video Technology 20(12), 1709–
1720 (2010)
15. Henry, F., Pateux, S.: Wavefront Parallel Processing.
Tech. Rep. JCTVC-E196 (2011)
16. Hoffman, H., Kouadio, A., Thomas, Y., Visca, M.:
The Turin Shoots.
In: EBU Tech-i, 13, pp. 8–9.
European Broadcasting Union (EBU) (2012). URL
http://tech.ebu.ch/docs/tech-i/ebu_tech-i_013.pdf
17. Juurlink, B., Alvarez-Mesa, M., Chi, C.C., Azevedo, A.,
Meenderinck, C., Ramirez, A.: Scalable Parallel Programming Applied to H.264/AVC Decoding. Springer (2012)
18. Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B.,
Ramírez, A.: Parallel Scalability of Video Decoders. Journal of Signal Processing Systems 57, 173–194 (2009)
19. Misra, K., Zhao, J., Segall, A.: Entropy Slices for Parallel
Entropy Coding. Tech. Rep. JCTVC-B111 (2010)
20. Rotem, E., Naveh, A., Rajwan, D., Ananthakrishnan,
A., Weissmann, E.: Power-Management Architecture of
the Intel Microarchitecture Code-Named Sandy Bridge.
IEEE Micro 32(2), 20–27 (2012)
21. Seitner, F.H., Schreier, R.M., Bleyer, M., Gelautz, M.:
Evaluation of Data-parallel Splitting Approaches for
H.264 Decoding. In: Proceedings of the International
Conference on Advances in Mobile Computing and Multimedia, pp. 40–49 (2008)
22. Sullivan, G.J., Ohm, J.R.: Recent Developments in Standardization of High Efficiency Video Coding (HEVC). In:
Proceedings of SPIE, Applications of Digital Image Processing XXXIII, pp. 77980V–77980V-7 (2010)
23. Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.:
Overview of the High Efficiency Video Coding (HEVC)
Standard. IEEE Transactions on Circuits and Systems
for Video Technology (2012)
24. van der Tol, E.B., Jaspers, E.G.T., Gelderblom, R.H.:
Mapping of H.264 Decoding on a Multiprocessor Architecture. In: Proceedings of SPIE, 5022, Image and Video
Communications and Processing, pp. 707–718 (2003)
25. Tourapis, A.M.: Enhanced Predictive Zonal Search for
Single and Multiple Frame Motion Estimation. In: Proceedings of SPIE Visual Communications and Image Processing 2002, pp. 1069–1079 (2002)