Digital Video Transcoding: Jun Xin, Chia-Wen Lin, Ming-Ting Sun
JUN XIN, MEMBER, IEEE, CHIA-WEN LIN, SENIOR MEMBER, IEEE, AND
MING-TING SUN, FELLOW, IEEE
Invited Paper
Video transcoding, due to its high practical value for a wide range of networked video applications, has become an active research topic. In this paper, we outline the technical issues and research results related to video transcoding. We also discuss techniques for reducing the complexity, and techniques for improving the video quality, by exploiting the information extracted from the input video bit stream.

Keywords: Error resilience, motion estimation, rate control, transcoding architecture, video transcoding.

Manuscript received January 16, 2004; revised July 9, 2004.
J. Xin is with Mitsubishi Electric Research Laboratories, Cambridge, MA 02139 USA (e-mail: [email protected]).
C.-W. Lin is with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 612, Taiwan, R.O.C. (e-mail: [email protected]).
M.-T. Sun is with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/JPROC.2004.839620

I. INTRODUCTION

Video transcoding is the operation of converting a video from one format into another format. A format is defined by such characteristics as the bit rate, frame rate, spatial resolution, coding syntax, and content, as shown in Fig. 1.

One of the earliest applications of transcoding is to adapt the bit rate of a precompressed video stream to a channel bandwidth. For example, a TV program may be originally compressed at a high bit rate for studio applications, but later needs to be transmitted over a channel at a much lower bit rate.

In universal multimedia access [1], different terminals may have different accesses to the Internet, including local area network (LAN), digital subscriber line (DSL), cable, wireless networks, integrated services digital network (ISDN), and dial-up. The different access networks have different channel characteristics such as bandwidths, bit error rates, and packet loss rates. At the users’ end, network appliances including handheld computers, personal digital assistants (PDAs), set-top boxes, and smart cellular phones are slated to replace personal computers as the dominant terminals for accessing the Internet. These network terminals vary significantly in resources such as computing power and display capability. To flexibly deliver multimedia data to users with different available resources, access networks, and interests, the multimedia content may need to be adapted dynamically according to the usage environment [2]. Transcoding is one of the key technologies to fulfill this challenging task. Transcoding is also useful for content adaptation for peer-to-peer networking over shared multihop communication links [3].

There are many other transcoding applications besides the universal multimedia access. In statistical multiplexing [4], multiple variable-bit-rate video streams are multiplexed together to achieve the statistical multiplexing gain. When the aggregated bit rate exceeds the channel bandwidth, a transcoder can be used to adapt the bit rates of the video streams to ensure that the aggregated bit rate always satisfies the channel bandwidth constraint. A transcoder can also be used to insert new information including company logos, watermarks, as well as error-resilience features into a compressed video stream. Transcoding techniques have also been shown to be useful for supporting VCR trick modes, i.e., fast forward, reverse play, etc., for on-demand video applications [8]–[10]. In addition, object-based transcoding techniques are discussed in [7] for adaptive video content delivery. A general utility-based framework is introduced in [11] to formulate some transcoding and adaptation issues as resource-constrained utility maximization problems. In [12], a utility-function prediction is performed using automatic feature extraction and regression for MPEG-4 video transcoding. Several rate-distortion models for transcoding optimization are introduced in [13] to facilitate the selection of transcoding methods under a rate constraint. Envisioning the need for transcoding, the emerging MPEG-7 standard [5], which standardizes a framework for describing audiovisual contents, has defined “transcoding hints” to facilitate the transcoding of compressed video contents [6], [7].

Dynamic change of coding parameters such as bit rates, frame rates, and spatial resolutions could also be achieved
to a limited extent by scalable coding [14], [15]. However, in the current scalable video coding standards, the enhancement layers are generated by coding the prediction residuals between the original video and the base-layer video. In many applications, the network bandwidth may fluctuate wildly with time. Therefore, it may be difficult to set the base-layer bit rate. If the base-layer bit rate is set low, the base-layer video quality will be relatively low and the overall video quality degradation may be severe, since the prediction becomes less effective. On the other hand, if the base-layer bit rate is set high, the base-layer video may not get through the network completely. In general, the achievable quality of scalable coding is significantly lower than that of nonscalable coding. In addition, scalable video coding demands additional complexity at both encoders and decoders. The inherent weaknesses of scalable coding have kept it from being widely deployed in practical applications. Nevertheless, scalable coding is still an active research area. A new coding standard is being developed to overcome these drawbacks [16]. With these problems addressed, the scalable coding schemes may be more suitable for streaming video applications when a large number of users require different levels of format adaptations, since less computation is involved.

In this paper, the input to the transcoder is a compressed video produced by a standard video encoder. Current major video coding standards, namely MPEG-1 [17], MPEG-2 [18], MPEG-4 [19], H.263 [20], and the emerging H.264 [21], all use hybrid discrete cosine transform (DCT) and block-based motion compensation (MC) schemes. Note that H.264 uses an integer transform that approximates DCT. A block diagram of the standard video encoders is shown in Fig. 2. MC reduces the temporal redundancy. DCT reduces the spatial redundancy and achieves energy compaction. Quantization is performed to achieve a higher compression ratio. Variable-length coding (VLC) is applied after the quantization to reduce the remaining redundancy. A decoder is embedded in the encoder to reconstruct video frames, which are stored in the frame memory for prediction of future frames.

A straightforward realization of a transcoder is to cascade a decoder and an encoder: the decoder decodes the compressed input video and the encoder reencodes the decoded video into the target format. This is computationally very expensive. Therefore, reducing the complexity of the straightforward decoder-encoder implementation is a major driving force behind many research activities on transcoding.

What makes transcoding different from video encoding is that the transcoder has access to many coding parameters and statistics that can be easily obtained from the input compressed video stream. They may be used not only to simplify the computation, but also to improve the video quality. Transcoding can be considered as a special two-pass encoding: the “first-pass” encoding produces the input compressed video stream, and the “second-pass” encoding in the transcoder can use the information obtained from the first pass to do a better encoding. Therefore, it is possible for the transcoder to achieve better video quality than the straightforward implementation, where the encoding is single pass. The challenge of the research on transcoding is then how to intelligently utilize the coding statistics and parameters extracted from the input to achieve the best possible video quality and the lowest possible computational complexity.

In this paper, we discuss the issues and research results related to the transcoding of video streams compressed using
the hybrid MC/DCT schemes. An overview of transcoding architectures and techniques has been given in [22], which presents many of the fundamentals in this area. This paper is intended to provide a more in-depth view of architectures and techniques, and cover such topics as quality optimization, complexity reduction techniques, and related applications such as logo and watermark insertion.

The remainder of this paper is organized as follows. In Section II, we first review the transcoding techniques used for the bit-rate reduction. In Section III, we discuss the transcoding techniques for spatial and temporal resolution reductions. Section IV discusses the issues associated with the standards conversion. Section V addresses the transcoding quality optimization. Section VI discusses the transcoding for information insertion. Finally, Section VII concludes this paper.

II. BIT-RATE TRANSCODING

Generally, there exist three transcoding architectures for the bit-rate transcoding: open-loop transcoders [23]–[25], cascaded pixel-domain transcoders (CPDTs) [24], [26] and DCT-domain transcoders (DDTs) [27]–[29]. The open-loop architectures include selective transmission [23], [24], where the high-frequency DCT coefficients are discarded, and requantization [24], [25], where the DCT coefficients are requantized. Fig. 3 shows a requantization transcoder. The open-loop transcoders are computationally efficient, since they operate directly on the DCT coefficients. However, they suffer from the drift problem.

The drift problem is explained as follows. A video picture is predicted from its reference pictures and only the prediction errors are coded. For the decoder to work properly, the reference pictures reconstructed and stored in the decoder predictor must be the same as those in the encoder predictor. The open-loop transcoders change the prediction errors and, therefore, make the reference pictures in the decoder predictor different from those in the encoder predictor. The differences accumulate and cause the video quality to deteriorate with time until an intrapicture is reached. The error accumulation caused by the encoder/decoder predictor mismatch is called drift and it may cause severe degradation to the video quality [26], [30]. It should be noted that in the following discussions, many transcoder architectures are not strictly drift free. However, the degree of the video quality degradation caused by the drift varies with architectures. In addition, the drift will be terminated by an intrapicture. In the applications where the number of coded pictures between two consecutive intrapictures is small and the quality degradation caused by the drift is acceptable, these architectures, although not drift free, can still be quite useful due to the potentially lower cost in terms of computation and required frame memory.

Fig. 4 illustrates the drift-free CPDT [24], a concatenation of a decoder and a simplified encoder. Rather than performing the full-scale motion estimation, as in a stand-alone video encoder, the encoder reuses the motion vectors along with other information extracted from the input video bitstream. Thus, the motion estimation, which usually accounts for 60%–70% of the encoder computation [31], is omitted. To
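The drift mechanism described in Section II can be illustrated with a deliberately small sketch. It assumes each "picture" is a single integer sample, prediction is simply the previous reconstructed sample, and quantization is a uniform integer quantizer. The function names (`quantize`, `encode`, `decode`, `open_loop_transcode`, `cascaded_transcode`) are hypothetical and do not come from the paper or from any codec library. The open-loop path requantizes residuals without reconstructing references, while the cascaded path decodes fully and re-encodes inside a closed prediction loop.

```python
def quantize(value, step):
    """Uniform integer quantizer: nearest multiple of step (ties round up)."""
    return step * ((value + step // 2) // step)


def encode(frames, step, gop):
    """Toy predictive encoder. Returns a list of (is_intra, residual)."""
    stream, ref = [], 0
    for i, frame in enumerate(frames):
        intra = (i % gop == 0)
        pred = 0 if intra else ref          # intra pictures use no prediction
        residual = quantize(frame - pred, step)
        ref = pred + residual               # encoder-side reconstruction
        stream.append((intra, residual))
    return stream


def decode(stream):
    """Toy decoder: accumulates residuals onto the previous reconstruction."""
    out, ref = [], 0
    for intra, residual in stream:
        ref = residual if intra else ref + residual
        out.append(ref)
    return out


def open_loop_transcode(stream, step):
    """Requantize each residual directly; references are never rebuilt."""
    return [(intra, quantize(residual, step)) for intra, residual in stream]


def cascaded_transcode(stream, step, gop):
    """Decode fully, then re-encode with a closed prediction loop."""
    return encode(decode(stream), step, gop)


def abs_errors(reference, test):
    return [abs(a - b) for a, b in zip(reference, test)]


if __name__ == "__main__":
    source = [100 + 3 * i for i in range(10)]   # slowly changing content
    first_pass = encode(source, step=1, gop=10)
    open_loop = decode(open_loop_transcode(first_pass, step=8))
    cascaded = decode(cascaded_transcode(first_pass, step=8, gop=10))
    print("open-loop errors:", abs_errors(source, open_loop))
    print("cascaded errors: ", abs_errors(source, cascaded))
```

In this toy model the open-loop error keeps growing until the next intra picture, whereas the closed-loop error stays within half the output quantizer step. Real transcoders operate on motion-compensated DCT blocks, so the numbers here only illustrate the mechanism.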
Table 1. Runtime Complexity Comparison of Five Different Transcoders. The Video Sequences Are Encoded at QP = 7, and Then Transcoded at QP = 15

for , and , respectively). The number associated with each operation point indicates the bit rate generated. We can observe from Figs. 8 and 9 that the drift caused by the two DCT-domain transcoders is not serious for small GOP sizes. However, the performance degradation, especially for SDDT, can become rather significant with large GOP sizes. Such large GOP sizes may be used in applications such as networked video streaming and wireless video that demand high coding efficiency.

III. SPATIAL AND TEMPORAL TRANSCODING

The heterogeneity of communication networks and network access terminals often demands the conversion of compressed video not only in the bit rates, but also in the spatial/temporal resolutions. One of the challenging tasks in spatial/temporal transcoding is how to efficiently reestimate (or map) the target motion vectors from the input motion vectors.

Many works on the motion reestimation for spatial transcoding consider the simple case of 2:1 downscaling. Fig. 10 illustrates a case of the motion-mapping problem, where the input MBs have four motion vectors while the target output MB has a single motion vector. Several strategies have been proposed to compose the target motion vector using the input motion vectors. One strategy is to randomly choose from the four input motion vectors [41], [42]. The weighted average taking into account the prediction error is presented in [43]. Different methods are compared in [31] and [41], including median, majority, average, and random selection. The median method is shown to achieve the best performance. The work in [44] selects the motion vector using a likelihood score based on the statistical characteristics of the MBs associated with the best matching motion vectors. Note that the motion vectors formed by the above algorithms need to be downscaled to the target spatial resolution.

Recent works extend these strategies to tackle transcoding with arbitrary down-sampling ratios [45], [46] by taking care of the unequal contributions of related input motion vectors. When the down-sampling ratio is large and one target MB is down-sampled from a number of input MBs, the motion vectors of the input MBs are more likely to be inconsistent. A multicandidate approach is proposed in [47] to address this issue. The transcoding of interlaced video is discussed in detail in [46], where the motion mapping is further complicated by various types of frame and field motion vectors.

For transcoding with temporal resolution changes, due to the frame dropping, one has to derive a new set of motion vectors that do not exist in the input video. This issue is addressed in [48], where a technique called forward dominant vector selection (FDVS) is proposed. The FDVS scheme is illustrated in Fig. 11. The best-match area referenced by the motion vector of the current MB overlaps with at most four MBs in its reference frame. The motion vector of the MB with the largest overlapping portion is called the dominant motion vector and is selected for composing the target motion vector. This process is repeated for all the dropped frames and the final target motion vector is formed by adding all the dominant motion vectors together, followed by a motion vector refinement. In [49], the dominant motion vector is selected based on the activity of the overlapping MBs, instead of the area as in FDVS. Another method, telescopic vector composition (TVC) [31], accumulates all motion vectors of the current MB’s colocated MBs in the dropped frames and adds the resulting motion vector to the current MB’s motion vector. For typical videos with small motion vectors, TVC can achieve performance similar to that of FDVS. It is shown [31] that a 2-pixel refinement around the composed motion vector can achieve performance similar to that of the full-scale full-search motion reestimation. For the spatial resolution reduction, typically a half-pixel refinement is enough to achieve a good quality [31], [46]. For the temporal resolution reduction, as the number of skipped frames increases, more refinement may be desirable. The refinement range may be dynamically decided based on the motion vector magnitudes and the number of skipped frames [50]. In [48], the refinement range is determined based
on the input/output quantization scales and the prediction errors. In [51], a fast motion vector refinement scheme is proposed for the DCT-domain spatial downscaling, where the speedup is achieved through exploiting the redundancies in the DCT-MC computations of two adjacent checkpoints.

Due to the spatial/temporal resolution reduction, the drift problem is usually significant in open-loop transcoding architectures [52]. Therefore, the drift-free CPDT architecture is more favorable in terms of quality. In [53] and [54], the drift in spatial transcoding is analyzed. Based on the analyses, several drift-compensation architectures providing different levels of complexity-quality tradeoffs are proposed. In-depth analyses and performance comparisons of these alternative architectures and CPDT are provided in [55].

Fig. 9. Performance comparison of average PSNR for CPDT, SDDT, and CDDT for different GOP sizes. The “Mobile & Calendar” sequence is encoded at QP = 5 using size different GOP sizes, and then transcoded at QP = 11.

Fig. 10. In 2:1 spatial transcoding, the target motion vector(s) for that MB is highly correlated with the four input motion vectors.

A DCT-domain architecture is proposed in [56] for temporal transcoding, where the reencoding errors are reduced using the direct addition of DCT coefficients and signals
from an error compensation feedback loop. In [57], a hybrid DCT/pixel-domain transcoder architecture is proposed for video downscaling. It contains a DCT-domain decoder followed by a pixel-domain encoder, where a modified DCT-domain inverse transformation and down-sampling method is developed to convert a DCT block into a downscaled pixel block.

IV. STANDARDS TRANSCODING

In many applications, video coded in one coding standard (e.g., MPEG-2) may need to be converted to another standard (e.g., MPEG-4) besides the changes in bit rate and resolution. In what follows, we use two examples to illustrate how the information obtained from the input video sequence may be used to help the standards transcoding process.

A. MPEG-2 to MPEG-4 Simple Profile (SP) Transcoding

MPEG-4 SP is aimed at low-complexity and low-bit-rate video applications. Compared to MPEG-2 video, it does not support B-frames and interlaced video. In addition, it usually operates at lower spatial resolutions and frame rates than MPEG-2 video. Fig. 12 illustrates a typical scenario: an interlaced MPEG-2 video of 720 × 480 resolution and at 30 frames/s is transcoded to the progressive MPEG-4 SP of 176 × 144 resolution at 15 frames/s. It involves conversion of video formats and frame coding types besides the spatial and temporal resolution conversions. The new challenge is that the motion vectors of an incoming video frame may not use the same reference frame as the target frame.

The motion reestimation problem in the case of frame type conversion is first discussed in [31], where the target motion vector is chosen from several candidate motion vectors that are formed by using the motion information from current and adjacent frames. The work in [58] and [59] introduces an intermediate, virtual layer of video, which has the same frame rate and frame type as the target video and the same spatial resolution as the input video. The motion reestimation process consists of two steps. In the first step, one or more intermediate motion vectors are formed for each MB in the intermediate video frame using the motion information of the input video. In the second step, these motion vectors of the intermediate-layer video are used to compose the motion vectors for the target video. This step also takes care of the mismatch of the motion vector types caused by the interlaced input and the progressive output. Effectively, the first step handles the frame-rate reduction and the frame-type conversion, and the second step deals with the spatial-resolution reduction and the interlaced-to-progressive processing. This two-step process has low complexity, since all operations are performed on the motion vectors and, therefore, the computationally expensive block matching is not needed.

B. MPEG-2 to MPEG-4 Advanced Simple Profile (ASP) Transcoding

Aiming at providing high quality video coding, MPEG-4 ASP incorporates several new coding tools. One of the tools is global MC (GMC), which can improve the coding performance for scenes with global motions [60]. No previous video coding standards, including MPEG-2, support GMC. Therefore, in the transcoding of MPEG-2 to MPEG-4 ASP, global motion (GM) parameters may be estimated to take advantage of this tool. The estimation is referred to as global motion estimation (GME). Direct GME methods operate in the pixel-domain [61], [62]. They are computationally expensive due to the iterative processes in the nonlinear esti-
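As a rough illustration of motion reestimation that works only on motion vectors, the sketch below implements simplified readings of three methods surveyed in Section III: FDVS and TVC for frame dropping, and median selection followed by downscaling for 2:1 spatial reduction. Several points are assumptions made for illustration: the 16-pixel MB size, the dictionary layout of a motion field, treating a missing (e.g., intra) MB as zero motion, the vector-median definition, and all function names. It does not reproduce the specific algorithms of [48], [31], or [58], [59], and it omits the refinement step.

```python
MB = 16  # assumed macroblock size in pixels


def dominant_mb(px, py):
    """Grid cell (bx, by) with the largest overlap with the MB-sized area
    whose top-left pixel is (px, py). The overlap area factorizes per axis,
    so each axis is decided independently."""
    bx0, by0 = px // MB, py // MB
    ox, oy = px - bx0 * MB, py - by0 * MB   # offsets inside the top-left cell
    bx = bx0 if ox <= MB // 2 else bx0 + 1
    by = by0 if oy <= MB // 2 else by0 + 1
    return bx, by


def fdvs(bx, by, mv, dropped_fields):
    """Forward dominant vector selection (simplified).
    dropped_fields: motion fields {(bx, by): (mvx, mvy)} of the dropped
    frames, nearest dropped frame first."""
    total_x, total_y = mv
    cur_bx, cur_by, cur_mv = bx, by, mv
    for field in dropped_fields:
        px = cur_bx * MB + cur_mv[0]
        py = cur_by * MB + cur_mv[1]
        cur_bx, cur_by = dominant_mb(px, py)
        cur_mv = field.get((cur_bx, cur_by), (0, 0))  # assumption: no MV means zero motion
        total_x += cur_mv[0]
        total_y += cur_mv[1]
    return total_x, total_y


def tvc(bx, by, mv, dropped_fields):
    """Telescopic vector composition: add the colocated MB's vectors."""
    total_x, total_y = mv
    for field in dropped_fields:
        dx, dy = field.get((bx, by), (0, 0))
        total_x += dx
        total_y += dy
    return total_x, total_y


def median_mv(mvs):
    """Vector median: the candidate with the smallest summed Euclidean
    distance to all candidates (one common definition, assumed here)."""
    def cost(v):
        return sum(((v[0] - u[0]) ** 2 + (v[1] - u[1]) ** 2) ** 0.5 for u in mvs)
    return min(mvs, key=cost)


def downscale_mv(mv, factor=2):
    """Scale a vector to the reduced resolution; a real transcoder would
    round the result to the precision its output syntax supports."""
    return (mv[0] / factor, mv[1] / factor)
```

A two-stage use, loosely mirroring the frame-rate-then-resolution ordering described for the intermediate-layer approach, would call `fdvs` or `tvc` for each of the four input MBs and then pass the four composed vectors through `median_mv` and `downscale_mv`.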
their complexities. In video coding, however, the current and future frame complexities are usually unknown prior to encoding. Rate-control algorithms designed for encoding, such as MPEG-2 TM5, based on the stationary assumption, estimate the complexity of the current frame using the complexity of the previous frame of the same type. It is well known that this estimation is poor when the stationary assumption fails. In transcoding, intuitively it is possible to compute the frame complexities from the input bit-stream (since the quantization step sizes and the number of bits information are available), and then use these complexities for the bit allocation in transcoding. The approach of scaling down the input frame bits is one special case of such algorithms where the number of bits is taken as the complexity measure. It is found in [46] and [80] that complexity measures depend on the coding bit rate. Therefore, the complexity measures calculated from the input video bit-stream at the input bit rate may not be suitable to directly serve as the complexity measure for coding the frames at the output bit rate. Instead, the correlations between the complexity measures of the input and output videos are utilized to provide a more accurate estimation of the output frame complexity, which leads to improved bit-allocation and video quality as shown in Fig. 13, where the transcoder input and output are both MPEG-2, the bit rate is reduced from 10 to 4 Mb/s. The above techniques can be adapted to the joint transcoding of multiple preencoded video streams (statistical multiplexing) [4], [81], [82].

MB-layer rate control adjusts the quantization parameters based on the encoder buffer feedback and is particularly desirable for low-delay transcoding. Research works on this topic can be found in [59], [79], and [83]. In [7], a joint rate-control scheme taking into account the various spatiotemporal tradeoffs among all objects in a scene for MPEG-4 object-based transcoding is proposed. Dynamic rate-control algorithms tailored for multipoint video conferencing are discussed in [84]–[86].

C. Mode Decision

There are various levels of mode decisions, including the MB-level, frame-level, and object-level. Rate-distortion optimized mode decision techniques are explained in detail in [87]. These techniques are also applicable to transcoding. However, suboptimal but simple mode decision strategies are often desirable in complexity-constrained transcoding.

In bit-rate transcoding, typically the modes of the input video are reused by the transcoder [28].
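The frame-level bit allocation idea from the rate-control discussion above, computing frame complexities from the input bit-stream and dividing the output budget accordingly, can be sketched as follows. The sketch assumes the TM5-style complexity measure (bits spent on a frame multiplied by its average quantizer scale) and a plain proportional split of a GOP budget. The picture-type weighting used in TM5 and the input/output complexity correlation models of [46] and [80] are deliberately left out, and the function names are hypothetical.

```python
def frame_complexity(bits, avg_qscale):
    """TM5-style complexity: coded bits times average quantizer scale."""
    return bits * avg_qscale


def allocate_by_complexity(input_frames, gop_budget):
    """Split gop_budget across frames in proportion to input complexity.
    input_frames: list of (bits, avg_qscale) parsed from the input stream."""
    complexities = [frame_complexity(b, q) for b, q in input_frames]
    total = sum(complexities)
    return [gop_budget * x / total for x in complexities]


def allocate_by_bit_scaling(input_frames, gop_budget):
    """Special case mentioned in the text: the bit count itself is the
    complexity measure, so input frame bits are scaled down uniformly."""
    bits = [b for b, _ in input_frames]
    total = sum(bits)
    return [gop_budget * b / total for b in bits]


if __name__ == "__main__":
    # Hypothetical statistics for three input frames: (bits, average qscale).
    frames = [(400000, 8), (150000, 10), (50000, 10)]
    print(allocate_by_complexity(frames, 260000))
    print(allocate_by_bit_scaling(frames, 260000))
```

The two allocations differ whenever the input frames were coded with different quantizer scales, which is one reason the bit count alone can be a misleading complexity measure.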
In spatial transcoding, the MB-type (inter/intra) decision usually follows the majority of the input MB-types [31], [46]. Various heuristic strategies are discussed in [54] to perform the decision for the open-loop architectures. MB prediction-mode decision techniques, including the frame/field prediction for interlaced transcoding, are discussed in [46]. A good overview of MB-level mode decision techniques can be found in [22].

Dynamic frame skipping based on the accumulated magnitude of motion vectors is proposed in [50] and [86]. In [7], strategies are proposed to drop less relevant objects if the scene has been coded as a set of objects. In [6], it is demonstrated that transcoding hints of MPEG-7 are valuable in improving mode decisions at various levels.

VI. INFORMATION INSERTION TRANSCODING

In general, any operation that changes the content of a compressed video stream may be regarded as transcoding. In this section, we discuss two information insertion examples.

A. Logo/Watermark Insertion

For copyright protection, video watermarks and company logos can be inserted into the compressed video stream [88]. In the pixel-domain transcoders, the logo insertion can be implemented as illustrated in Fig. 14(a)

where , and are the pixel values of the logo signal, the decoded picture, and the logo-inserted picture for frame , respectively. and are the scaling factors for frame controlling the intensity of the logo in order to provide uniform visibility [89], [90]. Efficient architectures performing this operation in the compressed domain as illustrated in Fig. 14(b) are proposed in [88] and [90]. These architectures realize the same function as their pixel-domain counterpart does.

Fig. 15 shows a logo and a sample picture after logo insertion in the DCT domain [90]. The logo is inserted into an MPEG-2 encoded bit-stream whose bit rate is reduced from 8 to 4 Mb/s. The approach of [89] inserts the information in a simple, open-loop manner, where only the area affected by the inserted information needs to be modified. However, it is subject to drift. The insertion of new information may affect the optimality of the existing coding parameters of the affected picture area, including the motion vectors and coding modes as discussed in [91], where techniques are discussed to modify these coding parameters.

B. Error-Resilience Transcoding

In practical applications where video contents are compressed and stored for future delivery, the encoding process is typically performed without enough prior knowledge about the channel characteristics of network hops between the encoder and the decoder. In addition, the heterogeneity