OVERVIEW OF THE SCALABLE H.264/MPEG4-AVC EXTENSION
Heiko Schwarz, Detlev Marpe, and Thomas Wiegand
Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute, Image Processing Department
Einsteinufer 37, 10587 Berlin, Germany, [hschwarz|marpe|wiegand]@hhi.fraunhofer.de
ABSTRACT

The scalable extension of H.264/MPEG4-AVC is a current standardization project of the Joint Video Team of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. This paper gives an overview of the design of the scalable H.264/MPEG4-AVC extension and describes the basic concepts for supporting temporal, spatial, and SNR scalability. The efficiency of the described concepts for providing spatial and SNR scalability is analyzed by means of simulation results and compared to H.264/MPEG4-AVC compliant single-layer coding.

Index Terms - video coding

1. INTRODUCTION
Scalable Video Coding (SVC) is currently a very active
working area in the research community and in international
standardization. A project on SVC standardization was
originally started by the ISO/IEC Moving Picture Experts
Group (MPEG). Based on an evaluation of the submitted
proposals, MPEG and the ITU-T Video Coding Experts
Group (VCEG) agreed to jointly finalize the SVC project as
an Amendment of their H.264/MPEG4-AVC standard [1],
for which the scalable extension of H.264/MPEG4-AVC as
proposed in [2] was selected as the first Working Draft.
As an important feature of the SVC design, most components of H.264/MPEG4-AVC are used as specified in the
standard. This includes the motion-compensated and intra
prediction, the transform and entropy coding, the deblocking
as well as the NAL unit packetization (NAL - Network Abstraction Layer). The base layer of an SVC bit-stream is generally coded in compliance with H.264/MPEG4-AVC, and
each standard-conforming H.264/MPEG4-AVC decoder is
capable of decoding this base layer representation when it is
provided with an SVC bit-stream. New tools are only added
for supporting spatial and SNR scalability.
This paper gives an overview of the current SVC design
[3][4]. The basic concepts for providing temporal, spatial, and
SNR scalability are described and analyzed regarding their
coding efficiency. For more detailed information, the reader
is referred to the SVC Working Draft [3] and the Joint Scalable Video Model (JSVM) [4].
2. OVERVIEW

The basic SVC design can be classified as a layered video codec. In general, the coder structure as well as the coding efficiency depends on the scalability space that is required by an application. For illustration, Fig. 1 shows a typical coder structure with two spatial layers.
Fig. 1. Coder structure example with two spatial layers.
In each spatial or coarse-grain SNR layer, the basic
concepts of motion-compensated prediction and intra prediction are employed as in H.264/MPEG4-AVC. The redundancy between different layers is exploited by additional
inter-layer prediction concepts that include prediction
mechanisms for motion parameters as well as texture data
(intra and residual data). A base representation of the input
pictures of each layer is obtained by transform coding similar to that of H.264/MPEG4-AVC; the corresponding NAL
units contain motion information and texture data, and the NAL
units of the lowest layer are compatible with single-layer
H.264/MPEG4-AVC [1]. The reconstruction quality of these
base representations can be improved by an additional coding of so-called progressive refinement (PR) slices. In contrast to
all other slice data NAL units, the corresponding NAL units
can be arbitrarily truncated in order to support fine granular
quality scalability or flexible bit-rate adaptation.
An important feature of the SVC design is that scalability is provided at a bit-stream level. Bit-streams for a reduced spatial and/or temporal resolution are simply obtained
by discarding NAL units (or network packets) from a global
SVC bit-stream that are not required for decoding the target
resolution. NAL units of PR slices can additionally be truncated in order to further reduce the bit-rate and the associated reconstruction quality.
3. TEMPORAL SCALABILITY AND
HIERARCHICAL CODING STRUCTURES
In contrast to older video coding standards such as MPEG-2/4, the coding and display order of pictures is completely decoupled in H.264/MPEG4-AVC. Any picture can be marked as a reference picture and used for motion-compensated prediction of following pictures, independently of the corresponding slice coding types. These features allow the coding of picture sequences with arbitrary temporal dependencies.

Fig. 2. Hierarchical prediction structure with 4 dyadic levels within one group of pictures (GOP).

Temporally scalable bit-streams can be generated by using hierarchical prediction structures as illustrated in Fig. 2, without any changes to H.264/MPEG4-AVC. So-called key pictures are coded at regular intervals by using only previous key pictures as references. The pictures between two key pictures are hierarchically predicted as shown in Fig. 2. It is obvious that the sequence of key pictures represents the coarsest supported temporal resolution, which can be refined by adding pictures of following temporal prediction levels.

In addition to enabling temporal scalability, the hierarchical prediction structures also provide an improved coding efficiency compared to classical IBBP coding, at the cost of an increased encoding-decoding delay [5]. Furthermore, the efficiency of the tools for supporting spatial and SNR scalability is improved, as will be shown in the following sections. It should also be noted that the delay of hierarchical prediction structures can be controlled by restricting the motion-compensated prediction from pictures in the future.
4. SPATIAL SCALABILITY

Spatial scalability is achieved by an oversampled pyramid approach. The pictures of different spatial layers are independently coded with layer-specific motion parameters, as illustrated in Fig. 1. However, in order to improve the coding efficiency of the enhancement layers in comparison to simulcast, additional inter-layer prediction mechanisms have been introduced. These prediction mechanisms have been made switchable, so that an encoder can freely choose which base layer information should be exploited for an efficient enhancement layer coding. Since the incorporated inter-layer prediction concepts include techniques for motion parameter and residual prediction, the temporal prediction structures of the spatial layers should be temporally aligned for an efficient use of the inter-layer prediction. It should be noted that all NAL units for a time instant form an access unit and thus have to follow each other inside an SVC bit-stream.

4.1. Inter-layer prediction techniques

The following three inter-layer prediction techniques are included in the SVC design. In the following, only the original concepts based on simple dyadic spatial scalability are described. For an extension to arbitrary resolution ratios, the reader is referred to [3][4][6].

4.1.1. Inter-layer motion prediction

In order to employ base layer motion data for spatial enhancement layer coding, additional macroblock modes have been introduced in spatial enhancement layers. The macroblock partitioning is obtained by upsampling the partitioning of the co-located 8x8 block in the lower resolution layer. The reference picture indices are copied from the co-located base layer blocks, and the associated motion vectors are scaled by a factor of 2. These scaled motion vectors are either used unmodified or refined by an additional quarter-sample motion vector refinement. Additionally, a scaled motion vector of the lower resolution can be used as motion vector predictor for the conventional macroblock modes.
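For the dyadic case, the derivation of enhancement-layer motion data from the base layer can be pictured in a few lines. The sketch below is a simplified illustration of the upsampling and scaling step only; the bookkeeping of the actual macroblock and sub-macroblock modes is omitted.

```python
from typing import List, Tuple

MotionVector = Tuple[int, int]  # (x, y) displacement in quarter-sample units

def upsample_motion(base_mvs: List[List[MotionVector]],
                    base_ref_idx: List[List[int]]):
    """Map base-layer motion data to a spatial enhancement layer with a
    dyadic resolution ratio: the block partitioning is upsampled by a
    factor of 2 in each dimension, reference indices are copied, and the
    motion vectors are scaled by 2 (an optional quarter-sample refinement
    could then be added on top)."""
    enh_mvs, enh_ref = [], []
    for mv_row, ref_row in zip(base_mvs, base_ref_idx):
        up_mv_row = [(2 * mx, 2 * my) for (mx, my) in mv_row for _ in range(2)]
        up_ref_row = [r for r in ref_row for _ in range(2)]
        enh_mvs.extend([up_mv_row, list(up_mv_row)])
        enh_ref.extend([up_ref_row, list(up_ref_row)])
    return enh_mvs, enh_ref
```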
4.1.2. Inter-layer residual prediction

The usage of inter-layer residual prediction is signaled by a flag that is transmitted for all inter-coded macroblocks. When this flag is true, the base layer signal of the co-located block is block-wise upsampled and used as prediction for the residual signal of the current macroblock, so that only the corresponding difference signal is coded.
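The effect of this flag on what the enhancement layer actually transmits can be sketched as follows; note that the nearest-neighbour upsampling used here is only a placeholder for the normative upsampling filter.

```python
import numpy as np

def upsample2x(block: np.ndarray) -> np.ndarray:
    """Placeholder 2x upsampling (sample repetition) of a base-layer block."""
    return np.repeat(np.repeat(block, 2, axis=0), 2, axis=1)

def enhancement_residual(original: np.ndarray,
                         mcp_prediction: np.ndarray,
                         base_residual: np.ndarray,
                         residual_pred_flag: bool) -> np.ndarray:
    """Residual signal left to be transform-coded in the enhancement layer.
    With inter-layer residual prediction enabled, only the difference to the
    upsampled base-layer residual remains."""
    residual = original.astype(np.int32) - mcp_prediction.astype(np.int32)
    if residual_pred_flag:
        residual = residual - upsample2x(base_residual.astype(np.int32))
    return residual
```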
4.1.3. Inter-layer intra prediction

Furthermore, an additional intra macroblock mode is introduced, in which the prediction signal is generated by upsampling the co-located reconstruction signal of the lower layer. For this prediction it is generally required that the lower layer is completely decoded, including the computationally complex operations of motion-compensated prediction and deblocking. However, as shown in [7], this problem can be circumvented when the inter-layer intra prediction is restricted to those parts of the lower layer picture that are intra-coded. With this restriction, each supported target layer can be decoded with a single motion compensation loop.
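A decoder that wants to stay within a single motion-compensation loop therefore only needs to honour a simple per-macroblock eligibility rule. The sketch below states that rule; the parameter name constrained_intra_only is illustrative and not an SVC syntax element.

```python
def inter_layer_intra_allowed(base_mb_is_intra: bool,
                              constrained_intra_only: bool) -> bool:
    """Restriction enabling single-loop decoding (cf. [7]): inter-layer intra
    prediction is used only where the co-located base-layer macroblock is
    intra-coded, so inter-coded base-layer areas never have to be
    reconstructed by the enhancement-layer decoder."""
    return base_mb_is_intra if constrained_intra_only else True
```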
4.2. Performance evaluation

The performance of the spatial scalability tools has been evaluated in comparison to simulcast and single-layer coding. The base layer was coded at a fixed bit-rate; for the encoding of the spatial enhancement layers, the bit-rate as well as the set of enabled inter-layer prediction mechanisms was varied. All encoders have been rate-distortion optimized following [8]. The intra period was set to 32 pictures; simulations have been carried out for a GOP size of 16 pictures as well as for IPPP coding. In Fig. 3, the results for the sequence "Soccer" with a CIF and a 4CIF layer are shown.
[Fig. 3: two rate-distortion plots, "SOCCER, GOP 16 (CIF 30Hz -> 4CIF 30Hz)" and "SOCCER, IPPP (CIF 30Hz -> 4CIF 30Hz)", PSNR [dB] over bit-rate [kbit/s], with curves for the CIF base layer, single layer, simulcast, intra prediction, intra & motion prediction, intra, motion & residual prediction, and multiple-loop decoding.]
Fig. 3. Performance of the inter-layer prediction mechanisms.

The black and the grey curve represent single-layer coding and simulcast, respectively. For the blue, green, and red curves, the inter-layer intra, motion, and residual prediction have been successively enabled. For all these curves, the inter-layer prediction was restricted in a way that allows single-loop decoding. The efficiency of spatially scalable coding that requires multiple-loop decoding is represented by the brown curve. By comparing both diagrams of Fig. 3, it can be seen that the efficiency of the inter-layer prediction is improved by using hierarchical prediction structures.

5. SNR SCALABILITY

For SNR scalability, coarse-grain scalability (CGS) and fine-grain scalability (FGS) are distinguished.

5.1. Coarse-grain SNR scalability

Coarse-grain SNR scalable coding is achieved using the concepts for spatial scalability. The only difference is that for CGS the upsampling operations of the inter-layer prediction mechanisms are omitted. Note that the restricted inter-layer prediction that enables single-loop decoding is even more important for CGS than for spatially scalable coding.

5.2. Fine-grain SNR scalability

In order to support fine-granular SNR scalability, so-called progressive refinement (PR) slices have been introduced. Each PR slice represents a refinement of the residual signal that corresponds to a bisection of the quantization step size (a QP decrease of 6). These signals are represented in a way that only a single inverse transform has to be performed for each transform block at the decoder side. The ordering of transform coefficient levels in PR slices allows the corresponding PR NAL units to be truncated at any arbitrary byte-aligned point, so that the quality of the SNR base layer can be refined in a fine-granular way.
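The relation between QP and quantization step size, and the byte-aligned truncation of a PR NAL unit, can be illustrated with the small sketch below. The step-size formula uses the well-known approximate doubling per 6 QP in H.264/MPEG4-AVC, and the truncation is shown on an opaque payload rather than on real slice syntax.

```python
def quant_step(qp: int) -> float:
    """Approximate H.264/MPEG4-AVC quantization step size: it doubles for
    every QP increase of 6, so a QP decrease of 6 bisects the step size."""
    return 0.625 * 2.0 ** (qp / 6.0)

# A PR refinement coded with QP reduced by 6 halves the quantization step.
assert abs(quant_step(26) / quant_step(32) - 0.5) < 1e-9

def truncate_pr_nal(payload: bytes, target_bytes: int) -> bytes:
    """A PR NAL unit may be cut at any byte-aligned point; the decoder simply
    uses as many refinement coefficients as it receives."""
    return payload[:max(0, min(target_bytes, len(payload)))]
```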
points, and the drift between encoder and decoder reconThe black and the grey curve rcopresent single-layer codstruction is efficiently limited.
ing and simulcast, respectively. Foir the blue, green, and red
In order to improve the FGS coding efficiency, especially for low-delay IPPP coding, leaky prediction concepts
curve, the inter-layer intra, motionL, and residual prediction
have been successively enabled. IFor all these curves, the
for the motion-compensated prediction of key pictures have
inter-layer prediction was restrictc-d in a way that allows
been additionally incorporated in the SVC design [3][4][9].
single-loop decoding. The efficienc y of spatial scalable codIn [10], a method for further improving the FGS coding effiing that requires multiple-loop de(coding is represented by
ciency by allowing the coding of motion parameter refinethe brown curve. By comparing b()th diagrams of Fig. 3 it
ments as part of the PR slices has been proposed.
can be seen that the efficiency of the inter-layer prediction is
SoccerCIF 3OHz-GOP16
39
improved by using hierarchical prediction structures.
38
5. SNR SCALABILITY
37
36
For SNR scalability, coarse-grain scalability (CGS) and finegrain scalability (FGS) are distinguished.
m
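This different treatment of key and non-key pictures amounts to a simple reference-selection rule, sketched below; the parameter name fgs_layers_received is hypothetical and merely stands for the number of PR refinements that survived truncation.

```python
def mcp_reference_quality(is_key_picture: bool, fgs_layers_received: int) -> int:
    """Quality level of the reference used for motion-compensated prediction
    (0 = SNR base representation). Key pictures predict only from the base
    representation, which bounds drift; non-key pictures use the highest
    quality that is available."""
    return 0 if is_key_picture else fgs_layers_received
```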
In order to improve the FGS coding efficiency, especially for low-delay IPPP coding, leaky prediction concepts for the motion-compensated prediction of key pictures have been additionally incorporated in the SVC design [3][4][9]. In [10], a method for further improving the FGS coding efficiency by allowing the coding of motion parameter refinements as part of the PR slices has been proposed.
[Fig. 5: rate-distortion plot "Soccer, CIF 30Hz, GOP 16"; PSNR [dB] over bit-rate [kbit/s].]
Fig. 5. Performance of SNR scalable coding strategies.
5.3. Performance evaluation
The performance of the different presented SNR scalable coding strategies has been compared to single-layer coding. Only the first picture has been intra-coded, and a GOP size of 16 pictures was chosen. The difference between the quantization parameters of the lowest and highest SNR layer was set to 12. The simulation results for "Soccer" in CIF resolution and a frame rate of 30 Hz are depicted in Fig. 5.

The black curve represents the coding efficiency of single-layer coding. The blue and green curves represent CGS runs with quantization parameter differences between successive layers of 6 and 2, respectively. It is clearly visible that the coding efficiency of CGS decreases with smaller bit-rate ratios between successive SNR layers. The performance of MPEG4-like FGS coding, in which only the SNR base layer is used for MCP, is shown by the orange curve. The FGS coding efficiency can be significantly improved when higher quality references are used for the prediction of non-key pictures, as specified in the SVC design and shown by the brown curve. The refinement of motion parameters in PR slices as proposed in [10] leads to further improvements of the FGS coding efficiency, as illustrated by the red curve. By means of the light blue curve, it is illustrated for the example of the adaptive FGS concept [10] how the coding efficiency can be traded off between low and high rates by modifying the ratio between motion and texture rate.
6. COMBINED SCALABILITY

The presented concepts for temporal, spatial, and SNR scalability can be easily combined. In Fig. 6, the coding efficiency of combined scalable coding is compared to the coding efficiency of single-layer, purely spatially scalable, and purely SNR scalable coding for the sequence "Soccer". The intra period was generally set to 1.07 s (64 pictures at 60 Hz), and for all encodings a dyadic hierarchical prediction structure with a GOP size of 32 pictures at 60 Hz has been used. Since temporal scalability with resolutions from 1.875 Hz to 60 Hz (4CIF) is supported in the same manner in all bit-streams, it has not been tested separately.

The black curves show the coding efficiency of single-layer coding; each point represents a separate bit-stream. For SNR scalability, a separate bit-stream has been generated for each spatial resolution and is represented by the corresponding red curves. Similarly, the blue curves show the coding efficiency of spatially scalable coding. Three bit-streams have been generated, and each of these bit-streams includes either the lowest, the middle, or the highest plotted rate point for each spatial resolution. The coding efficiency of combined scalable coding, for which all plotted spatio-temporal rate points are supported in a single bit-stream, is represented by the green curves. As can be seen from the example of Fig. 6, the coding efficiency of SVC bit-streams scales with the range of supported spatio-temporal rate points.

[Fig. 6: rate-distortion plot "Soccer, QCIF 15Hz - CIF 30Hz - 4CIF 60Hz"; PSNR [dB] over bit-rate [kbit/s]; curves for single layer, spatial scalability, SNR scalability, and combined scalability.]
Fig. 6. Performance of combined scalability coding.
7. CONCLUSION
In this paper, the design of the scalable H.264/MPEG4-AVC extension is described, and the basic tools for providing spatial and SNR scalability are analyzed regarding their coding efficiency. The coding efficiency of SVC bit-streams and the number of supported spatio-temporal rate points can be traded off according to the needs of an application.
REFERENCES
[1] ITU-T Rec. H.264 & ISO/IEC 14496-10 AVC, "Advanced Video Coding for Generic Audiovisual Services," version 3, 2005.
[2] H. Schwarz et al., "Technical Description of the HHI Proposal for SVC CE1," ISO/IEC JTC1/WG11, Doc. m11244, Palma de Mallorca, Spain, Oct. 2004.
[3] J. Reichel, H. Schwarz, and M. Wien (eds.), "Scalable Video
Coding - Joint Draft 4," Joint Video Team, Doc. JVT-Q201,
Nice, France, Oct. 2005.
[4] J. Reichel, H. Schwarz, and M. Wien (eds.), "Joint Scalable
Video Model JSVM-4," Joint Video Team, Doc. JVT-Q202,
Nice, France, Oct. 2005.
[5] H. Schwarz, D. Marpe, and T. Wiegand, "Hierarchical B pictures," Joint Video Team, Doc. JVT-P014, Poznan, Poland,
July 2005.
[6] E. Francois and J. Vieron, "Extended spatial scalability: a
generalization of spatial scalability for non-dyadic configurations," submitted to ICIP 2006.
[7] H. Schwarz, T. Hinz, D. Marpe, and T. Wiegand, "Constrained inter-layer prediction for single-loop decoding in spatial scalability," Proc. of ICIP 2005, Genova, Italy, Sep. 2005.
[8] T. Wiegand et al., "Rate-constrained coder control and comparison of video coding standards," IEEE Trans. CSVT,
vol. 13, pp. 688-703, July 2003.
[9] J. Ridge, X. Wang, Y. Bao, and M. Karczewicz, "Low-delay,
low-complexity scalable bit-rate video coding," submitted to
ICIP 2006.
[10] M. Winken, H. Schwarz, D. Marpe, and T. Wiegand, "Adaptive motion refinement for FGS slices," Joint Video Team,
Doc. JVT-Q031, Nice, France, Oct. 2005.