Unit-4

Basic Steps of Video Processing


Video:
Video is the technology that captures moving images electronically. Those
moving images are really just a series of still images that change so fast that
the scene appears to be in motion.
Our eyes perceive visual information as a sequence of pictures per
second. Although visual perception differs from person to person, the most
commonly cited range is between 30 and 60 frames per second, and more
recent research suggests that we can perceive frame rates considerably higher
than that. Modern frame rates commonly extend from 60-120 fps up to
180 fps.

Video refers to pictorial (visual) information, including both still images and
time-varying images. A still image is a spatial distribution of intensity that is
constant with respect to time. A time-varying image is one whose spatial
intensity pattern changes with time. Hence, a time-varying image is a spatio-
temporal intensity pattern, denoted by f (x1, x2, t), where x1 and x2 are the
spatial variables and t is the temporal variable.

Analog Video:
Today most video recording, storage, and transmission is still in analog
form.
For example, images that we see on TV are recorded in the form of
analog electrical signals, transmitted on the air by means of analog amplitude
modulation, and stored on magnetic tape using video cassette recorders as
analog signals.
Motion pictures are recorded on photographic film, which is a high-
resolution analog medium, or on laser discs as analog signals using optical
technology.
The analog video signal refers to a one-dimensional (1-D) electrical
signal f(t) of time that is obtained by sampling f (x1, x2, t) in the vertical x2 and
temporal coordinates. This periodic sampling process is called scanning. The
signal f(t), then, captures the time-varying image intensity f (x1, x2, t) only
along the scan lines, such as those shown in Figure 1.1. It also contains the
timing information and the blanking signals needed to align the pictures
correctly.
The most commonly used scanning methods are
1) Progressive scanning:
A progressive scan traces a complete picture, called a frame, every Δt
sec. The computer industry uses progressive scanning with Δt = 1/72 sec
for high-resolution monitors.
2) Interlaced scanning:
The TV industry uses 2:1 interlace, where the odd-numbered and even-
numbered lines, called the odd field and the even field, respectively, are
traced in turn. In Figure 1.1, the solid line and the dotted line represent the
odd and the even fields, respectively. The spot snaps back from point B to
C, called the horizontal retrace, and from D to E, and from F to A, called
the vertical retrace.
An analog video signal f(t) is shown in Figure 1.2. Blanking pulses
(black) are inserted during the retrace intervals to blank out retrace lines
on the receiving CRT. Sync pulses are added on top of the blanking
pulses to synchronize the receiver’s horizontal and vertical sweep
circuits. The sync pulses ensure that the picture starts at the top left corner
of the receiving CRT. The timing of the sync pulses is, of course,
different for progressive and interlaced video.

Some important parameters of the video signal are:


1) Vertical resolution: The vertical resolution is related to the
number of scan lines per frame.
2) Aspect ratio: The aspect ratio is the ratio of the width to the
height of a frame.
3) Frame/field rate: The frame rate, or frames per second (FPS), is
the number of frames captured or displayed per second. It is
measured in hertz (frames per second), and the successive
frames are displayed in sequence to produce the video.
The frame rate determines the smoothness of a video. The most
common frame rate in film is 24 frames per second, while
30 FPS is typical for internet video, which is enough to give a
smooth progression without visible stutter.
There exist several analog video signal standards, which have different
image parameters (e.g., spatial and temporal resolution) and differ in the way
they handle color.

The analog video standards can be grouped as:

1) Component analog video

2) Composite video

3) S-video (Y/C video)

Component analog video:

In component analog video (CAV), each primary is considered as a
separate monochromatic video signal. The primaries can be either simply the R,
G, and B signals or a luminance-chrominance transformation of them.

The luminance component (Y) corresponds to the gray-level
representation of the video, given by

Y = 0.30R + 0.59G + 0.11B

The chrominance components contain the color information, given by

I = 0.60R - 0.28G - 0.32B

Q = 0.21R - 0.52G + 0.31B
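As a small illustration, the sketch below applies the Y, I, Q equations above to a normalized RGB triple (values assumed to lie in [0, 1]); the coefficient matrix simply restates those formulas.

```python
import numpy as np

# RGB-to-YIQ coefficient matrix, restating the Y, I, Q equations above
RGB_TO_YIQ = np.array([
    [0.30,  0.59,  0.11],   # Y (luminance)
    [0.60, -0.28, -0.32],   # I (chrominance)
    [0.21, -0.52,  0.31],   # Q (chrominance)
])

def rgb_to_yiq(rgb):
    """Convert a normalized RGB triple (values in [0, 1]) to (Y, I, Q)."""
    return RGB_TO_YIQ @ np.asarray(rgb, dtype=float)

# Example: a pure white pixel maps to Y = 1, I = 0, Q = 0
print(rgb_to_yiq([1.0, 1.0, 1.0]))
```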

In practice, these components are subject to normalization and gamma
correction. The CAV representation yields the best color reproduction.
However, transmission of CAV requires perfect synchronization of the three
components and three times more bandwidth.
Composite video:

Composite video signal formats encode the chrominance components on
top of the luminance signal for distribution as a single signal which has the
same bandwidth as the luminance signal. There are different composite video
formats, such as NTSC (National Television Systems Committee), PAL (Phase
Alternation Line), and SECAM (Système Électronique Couleur Avec Mémoire),
being used in different countries around the world.
The NTSC composite video standard, defined in 1952, is currently in use
mainly in North America and Japan. The NTSC signal is a 2:1 interlaced video
signal with 262.5 lines per field (525 lines per frame), 60 fields per second, and
a 4:3 aspect ratio.
PAL and SECAM, developed in the 1960s, are mostly used in Europe
today. They are also 2:1 interlaced, but in comparison to NTSC they have
different vertical and temporal resolution, slightly higher bandwidth (8 MHz),
and treat color information differently. Both PAL and SECAM have 625 lines
per frame and 50 fields per second; thus, they have higher vertical resolution in
exchange for lower temporal resolution as compared with NTSC. One of the
differences between PAL and SECAM is how they represent color information.
Both transmit two chrominance components along with the luminance;
however, the way the color components are integrated with the luminance
signal differs between PAL and SECAM. Both PAL and SECAM are said to
have better color reproduction than NTSC.
S-video (Y/C video):

The composite signal formats usually result in errors in color rendition,
known as hue and saturation errors, because of inaccuracies in the separation of
the color signals. Thus, S-video is a compromise between composite video and
component analog video, where the video is represented by two component
signals: a luminance signal and a composite chrominance signal.
Analog video equipment can be classified as
1) Broadcast-quality
2) Professional-quality
3) Consumer-quality
Broadcast-quality equipment has the best performance, but is the most
expensive. For consumer-quality equipment, cost and ease of use are the
highest priorities.
Video images may be acquired
1) By electronic live pickup cameras and recorded on videotape, or
2) By motion picture cameras and recorded on motion picture film (24
frames/sec), or
3) formed by sequential ordering of a set of still-frame images such as in
computer animation.
In electronic pickup cameras, the image is optically focused on a two-
dimensional surface of photosensitive material that is able to collect light from
all points of the image all the time.
There are two major types of electronic cameras, which differ in the way
they scan out the integrated and stored charge image.
1) In vacuum-tube cameras (e.g., vidicon), an electron beam scans out
the image.
2) In solid-state imagers (e.g., CCD cameras), the image is scanned out
by a solid-state array.
Color cameras can be
1) three-sensor type: Three-sensor cameras suffer from synchronicity
problems and high cost.
2) single-sensor type: single-sensor cameras often have to compromise
spatial resolution. Solid-state sensors are particularly suited for single-
sensor cameras since the resolution capabilities of CCD cameras are
continuously improving.
Cameras specifically designed for television pickup from motion picture
film are called telecine cameras. These cameras usually employ frame rate
conversion from 24 frames/sec to 60 fields/sec.
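This 24 frames/sec to 60 fields/sec conversion is commonly implemented with 3:2 pulldown, where successive film frames alternately contribute three and two video fields. The sketch below only illustrates that field-repetition pattern (it is not a description of any particular telecine device).

```python
def three_two_pulldown(film_frames):
    """Map 24 frame/sec film frames to 60 field/sec video fields (3:2 pulldown).

    Successive film frames alternately contribute 3 and 2 fields, so every
    4 film frames yield 10 fields (24 frames/sec -> 60 fields/sec).
    """
    fields = []
    for i, frame in enumerate(film_frames):
        repeat = 3 if i % 2 == 0 else 2
        fields.extend([frame] * repeat)
    return fields

# 4 film frames -> 10 video fields
print(three_two_pulldown(["A", "B", "C", "D"]))
# ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D']
```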

Digital Video:
The main advantage of digital representation and transmission is that it is
more robust and makes it easier to provide a diverse range of services over the
same network.
Digital video on the desktop brings computers and communications
together in a truly revolutionary manner.
Almost all digital video systems use component representation of the
color signal. Most color video cameras provide RGB outputs which are
individually digitized. Component representation avoids the artifacts that result
from composite encoding, provided that the input RGB signal has not been
composite-encoded before.
In digital video, there is no need for blanking or sync pulses, since a
computer knows exactly where a new line starts as long as it knows the number
of pixels per line. Thus, all blanking and sync pulses are removed in the A/D
conversion.
Even if the input video is a composite analog signal, e.g., from a
videotape, it is usually first converted to component analog video, and the
component signals are then individually digitized.
It is also possible to digitize the composite signal directly using one A/D
converter with a clock high enough to leave the color subcarrier components
free from aliasing, and then perform digital decoding to obtain the desired RGB
or YIQ component signals. This requires sampling at a rate three or four times
the color subcarrier frequency, which can be accomplished by special-purpose
chip sets. Such chips do exist in some advanced TV sets for digital processing
of the received signal for enhanced image quality.
The horizontal and vertical resolution of digital video is related to the
number of pixels per line and the number of lines per frame.
The artifacts in digital video due to lack of resolution are quite different
than those in analog video.
In analog video the lack of spatial resolution results in blurring of the
image in the respective direction.
In digital video, we have pixellation (aliasing) artifacts due to lack of
sufficient spatial resolution.
The arrangement of pixels and lines in a contiguous region of the memory
is called a bitmap.
There are five key parameters of a bitmap:
1) the starting address in memory,
2) the number of pixels per line,
3) the pitch value,
4) the number of lines, and
5) number of bits per pixel.

The pitch value specifies the distance in memory from the start of one
line to the next.
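A minimal sketch of how these five parameters locate a pixel in memory; the function and the byte arithmetic below are illustrative and assume a whole number of bytes per pixel.

```python
def pixel_address(start_address, pitch, bits_per_pixel, line, pixel):
    """Return the byte address of a pixel in a bitmap.

    start_address  - address of the first pixel of the first line
    pitch          - distance in bytes from the start of one line to the next
                     (may exceed pixels_per_line * bytes_per_pixel due to padding)
    bits_per_pixel - assumed here to be a multiple of 8
    """
    bytes_per_pixel = bits_per_pixel // 8
    return start_address + line * pitch + pixel * bytes_per_pixel

# Example: 8 bits/pixel, lines padded so that the pitch is 1024 bytes
print(hex(pixel_address(start_address=0x1000, pitch=1024,
                        bits_per_pixel=8, line=2, pixel=10)))  # 0x180a
```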
After two vertical scans a “composite frame” in a contiguous region of
the memory is formed.
Each component signal is usually represented with 8 bits per pixel to
avoid “contouring artifacts.” Contouring artifacts appear in slowly varying
regions of image intensity when the bit resolution is insufficient.
Color mapping techniques exist to map 2^24 distinct colors to 256 colors
for display on 8-bit color monitors without noticeable loss of color resolution.
The major factor in preventing the widespread use of digital video today
has been the huge storage and transmission bandwidth requirements. For
example, digital video requires much higher data rates and transmission
bandwidths as compared to digital audio.
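A rough back-of-the-envelope comparison; the frame size, frame rate, bit depth, and audio parameters below are assumed for illustration only.

```python
# Uncompressed data rate of a 720 x 576, 25 frames/sec, 4:2:2 component video
# (active picture only) versus CD-quality stereo audio -- illustrative values
width, height = 720, 576          # pixels per line, lines per frame
frames_per_sec = 25
bits_per_pixel = 16               # 4:2:2 chroma subsampling, 8 bits per sample

video_bps = width * height * frames_per_sec * bits_per_pixel
audio_bps = 44100 * 16 * 2        # sampling rate * bits per sample * channels

print(f"video: {video_bps / 1e6:.1f} Mbps")   # ~165.9 Mbps
print(f"audio: {audio_bps / 1e6:.2f} Mbps")   # ~1.41 Mbps
```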
Exchange of digital video between different applications and products
requires digital video format standards. Video data needs to be exchanged in
compressed form, which leads to compression standards.

CCITT Group 3 and 4 codes are developed for fax image transmission,
and are presently being used in all fax machines.
JBIG has been developed to fix some of the problems with the CCITT
Group 3 and 4 codes, mainly in the transmission of halftone images.
JPEG is a still-image (monochrome and color) compression standard, but
it also finds use in frame-by-frame video compression, mostly because of its
wide availability in VLSI hardware.
CCITT Recommendation H.261 is concerned with the compression of video
for videoconferencing applications over ISDN lines. Typically, videoconferencing
using the CIF format requires 384 kbps.
MPEG-1 targets 1.5 Mbps for storage of CIF format digital video on CD-ROM
and hard disk. MPEG-2 is developed for the compression of higher-definition
video at 10-20 Mbps, with HDTV as one of the intended applications.
Interoperability of various digital video products requires not only
standardization of the compression method but also of the representation
(format) of the data. There is an abundance of digital video formats/standards,
besides the CCIR 601 and CIF standards.

A committee under the Society of Motion Picture and Television
Engineers (SMPTE) is working to develop a universal header/descriptor that
would make any digital video stream recognizable by any device.
Digital representation of video offers many benefits, including:
i) Open architecture video systems, meaning the existence of video
at multiple spatial, temporal, and SNR resolutions within a single
scalable bitstream.
ii) Interactivity, allowing interruption to take alternative paths
through a video database, and retrieval of video.
iii) Variable-rate transmission on demand.
iv) Easy software conversion from one standard to another.
v) Integration of various video applications, such as TV,
videophone, and so on, on a common multimedia platform.
vi) Editing capabilities, such as cutting and pasting, zooming,
removal of noise and blur.
vii) Robustness to channel noise and ease of encryption.
Digital video processing refers to manipulation of the digital video
bitstream. All known applications of digital video today require digital
processing for data compression. In addition, some applications may benefit
from additional processing for motion analysis, standards conversion,
enhancement, and restoration in order to obtain better-quality images or extract
some specific information.
Digital processing of still images has found use in military, commercial,
and consumer applications since the early 1960s, including space missions,
surveillance imaging, night vision, computed tomography, magnetic resonance
imaging, and fax machines.

Time-Varying Image Formation Models:


A time-varying image is represented by a function of three continuous
variables, f (x1, x2, t), which is formed by projecting a time-varying three-
dimensional (3-D) spatial scene into the two-dimensional (2-D) image plane.
The temporal variations in the 3-D scene are usually due to movements of
objects in the scene. Thus, time-varying images reflect a projection of 3-D
moving objects into the 2-D image plane as a function of time.
Digital video corresponds to a spatio-temporally sampled version of this
time varying image. A block diagram representation of the time-varying image
formation model is depicted in Figure 2.1.

“3-D scene modeling” refers to modeling the motion and structure of
objects in 3-D.
“Image formation,” which includes geometric and photometric image
formation, refers to mapping the 3-D scene into an image plane intensity
distribution. Geometric image formation considers the projection of the 3-D
scene into the 2-D image plane. Photometric image formation models variations
in the image plane intensity distribution due to changes in the scene illumination
in time as well as the photometric effects of the 3-D motion.
In order to obtain an analog or digital video signal representation, the
continuous time-varying image f (x1, x2, t) needs to be sampled in both the
spatial and temporal variables. An analog video signal representation requires
sampling f (x1, x2, t) in the vertical and temporal dimensions. For a digital
video representation, f (x1, x2, t) is sampled in all three dimensions. The spatio-
temporal sampling process is depicted in Figure 3.1, where (n1, n2, k) denote
the discrete spatial and temporal coordinates.
 Three-Dimensional Motion Models:
Modeling of the relative 3-D motion between the camera and the objects
in the scene is called 3D motion model.
This includes 3-D motion of the objects in the scene, such as translation
and rotation, as well as the 3-D motion of the camera, such as zooming and
panning.
3D motion models are presented to describe the relative motion of a set of
3-D object points and the camera:
1) in the Cartesian coordinate system (X1, X2, X3) and
2) in the homogeneous coordinate system (kX1, kX2, kX3, k)
According to classical kinematics, 3-D motion can be classified as
1) rigid motion (In the case of rigid motion, the relative distances
between the set of 3-D points remain fixed as the object evolves in
time. That is, the 3-D structure (shape) of the moving object can be
modeled by a nondeformable surface, e.g., a planar, piecewise planar,
or polynomial surface.)
2) nonrigid motion (In nonrigid motion, a deformable surface model is
utilized in modeling the 3-D structure.)

 Rigid Motion in the Cartesian Coordinates:


3-D displacement of a rigid object in the Cartesian coordinates can be
modelled by an affine transformation of the form

X' = R X + T

where R is a 3 x 3 rotation matrix,

T = [T1 T2 T3]^T is a 3-D translation vector, and

X = [X1 X2 X3]^T and X' = [X1' X2' X3']^T

denote the coordinates of an object point at times t and t' with respect to the
center of rotation, respectively.
That is, the 3-D displacement can be expressed as the sum of a 3-D
rotation and a 3-D translation.
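A minimal sketch of applying this affine model to a set of 3-D object points; the rotation matrix and translation vector below are arbitrary example values.

```python
import numpy as np

def rigid_motion(points, R, T):
    """Apply X' = R X + T to an array of 3-D points (one point per row)."""
    return points @ R.T + T

# Example: small clockwise rotation about the X3 axis plus a translation (assumed values)
phi = 0.02                                   # radians
R = np.array([[ np.cos(phi), np.sin(phi), 0.0],
              [-np.sin(phi), np.cos(phi), 0.0],
              [ 0.0,         0.0,         1.0]])
T = np.array([1.0, 0.0, -0.5])
X = np.array([[10.0, 5.0, 100.0]])
print(rigid_motion(X, R, T))
```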
 The Rotation Matrix:
Three-dimensional rotation in the Cartesian coordinates can be
characterized either by the Eulerian angles of rotation about the three coordinate
axes, or by an axis of rotation and an angle about this axis. The two descriptions
can be shown to be equivalent under the assumption of infinitesimal rotation.
Eulerian angles in the Cartesian coordinates: An arbitrary rotation in the
3-D space can be represented by the Eulerian angles, θ, ψ, and ɸ, of rotation
about the X1, X2, and X3 axes, respectively.

The matrices that describe clockwise rotation about the individual axes
are given by

Rθ = [ 1      0       0
       0    cos θ   sin θ
       0   -sin θ   cos θ ]

Rψ = [ cos ψ   0   -sin ψ
       0       1    0
       sin ψ   0    cos ψ ]

Rɸ = [ cos ɸ   sin ɸ   0
      -sin ɸ   cos ɸ   0
       0       0       1 ]

Assuming that the rotation from frame to frame is infinitesimal, i.e., θ, ψ, ɸ ≈ 0,
and thus approximating cos ɸ ≈ 1 and sin ɸ ≈ ɸ, and so on, these matrices
simplify to

Rθ ≈ [ 1   0   0
       0   1   θ
       0  -θ   1 ]

Rψ ≈ [ 1   0  -ψ
       0   1   0
       ψ   0   1 ]

Rɸ ≈ [ 1   ɸ   0
      -ɸ   1   0
       0   0   1 ]

Then the composite rotation matrix R can be found (neglecting second-order
terms in the small angles) as

R = Rθ Rψ Rɸ ≈ [  1    ɸ   -ψ
                 -ɸ    1    θ
                  ψ   -θ    1 ]

Rotation about an arbitrary axis in the Cartesian coordinates: An alternative
characterization of the rotation matrix results if the 3-D rotation is described by
an angle α about an arbitrary axis through the origin, specified by the directional
cosines n1, n2, and n3, as depicted in Figure 2.3.

Then the rotation matrix is given by

R = cos α I + (1 - cos α) n n^T - sin α N

where I is the 3 x 3 identity matrix, n = [n1 n2 n3]^T, and N is the
skew-symmetric matrix

N = [  0   -n3    n2
       n3    0   -n1
      -n2   n1    0 ]

For an infinitesimal rotation by the angle Δα, R reduces to

R ≈ [  1        n3 Δα   -n2 Δα
      -n3 Δα    1        n1 Δα
       n2 Δα   -n1 Δα    1     ]

Thus, the two representations are equivalent with

Δθ = n1 Δα
Δψ = n2 Δα
Δɸ = n3 Δα

In video imagery, the assumption of infinitesimal rotation usually holds,
since the time difference between frames is on the order of 1/30 sec.
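The equivalence of the two descriptions under small rotations can be checked numerically; the sketch below compares the first-order Euler-angle composite with the axis/angle form for an assumed small rotation, using the clockwise convention of the matrices above.

```python
import numpy as np

def euler_small(dtheta, dpsi, dphi):
    """First-order composite rotation matrix for small clockwise Euler angles."""
    return np.array([[ 1.0,    dphi,   -dpsi],
                     [-dphi,   1.0,     dtheta],
                     [ dpsi,  -dtheta,  1.0]])

def axis_angle_small(n, dalpha):
    """First-order clockwise rotation by dalpha about the unit axis n."""
    n1, n2, n3 = n
    N = np.array([[ 0.0, -n3,  n2],
                  [ n3,  0.0, -n1],
                  [-n2,  n1,  0.0]])
    return np.eye(3) - dalpha * N

n = np.array([0.0, 0.6, 0.8])      # unit axis (assumed)
dalpha = 0.01                      # small angle in radians
# Equivalence: dtheta = n1*dalpha, dpsi = n2*dalpha, dphi = n3*dalpha
print(np.allclose(euler_small(*(n * dalpha)), axis_angle_small(n, dalpha)))  # True
```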
 Rigid Motion in the Homogeneous Coordinates:
We define the homogeneous coordinate representation of a Cartesian point
X = [X1 X2 X3]^T as

X̃ = [k X1  k X2  k X3  k]^T

where k is an arbitrary nonzero scale factor (usually taken as k = 1).

Then the affine transformation in the Cartesian coordinates can be expressed as
a linear transformation in the homogeneous coordinates

X̃' = A X̃

where X̃' = [k X1'  k X2'  k X3'  k]^T

and the matrix A is

A = [ r11  r12  r13  T1
      r21  r22  r23  T2
      r31  r32  r33  T3
      0    0    0    1  ]

which can be decomposed into the product of the translation and rotation
matrices defined below (rotation followed by translation).

Translation in the Homogeneous Coordinates:


Translation can be represented as a matrix multiplication in the homogeneous
coordinates, given by

X̃' = T̃ X̃

where

T̃ = [ 1  0  0  T1
      0  1  0  T2
      0  0  1  T3
      0  0  0  1  ]

is the translation matrix.


Rotation in the Homogeneous Coordinates:
Rotation in the homogeneous coordinates is represented by a 4 x 4 matrix
multiplication of the form

X̃' = R̃ X̃

where

R̃ = [ r11  r12  r13  0
      r21  r22  r23  0
      r31  r32  r33  0
      0    0    0    1 ]

and rij denote the elements of the rotation matrix R in the Cartesian coordinates.
Zooming in the Homogeneous Coordinates:
The effect of zooming can be incorporated into the 3-D motion model as

X̃' = S̃ X̃

where

S̃ = [ s1  0   0   0
      0   s2  0   0
      0   0   s3  0
      0   0   0   1 ]

and s1, s2, s3 are the scale (zoom) factors along the three coordinate axes.
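The 4 x 4 matrices above compose by ordinary matrix multiplication. A small sketch with assumed parameter values:

```python
import numpy as np

def translation(T):
    """4x4 homogeneous translation matrix."""
    A = np.eye(4)
    A[:3, 3] = T
    return A

def rotation(R):
    """Embed a 3x3 rotation matrix R into a 4x4 homogeneous matrix."""
    A = np.eye(4)
    A[:3, :3] = R
    return A

def zoom(s1, s2, s3):
    """4x4 homogeneous scaling (zoom) matrix."""
    return np.diag([s1, s2, s3, 1.0])

# Compose zoom, rotation, and translation, and apply to a homogeneous point
phi = 0.1
R = np.array([[ np.cos(phi), np.sin(phi), 0.0],
              [-np.sin(phi), np.cos(phi), 0.0],
              [ 0.0,         0.0,         1.0]])
A = zoom(1.2, 1.2, 1.2) @ rotation(R) @ translation([1.0, 0.0, -2.0])
X_tilde = np.array([10.0, 5.0, 100.0, 1.0])     # homogeneous coordinates, k = 1
print(A @ X_tilde)
```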

 Geometric Image Formation:


Imaging systems capture 2-D projections of a time-varying 3-D scene.
This projection can be represented by a mapping from a 4-D space to a 3-D
space:

f : R^4 → R^3
(X1, X2, X3, t) → (x1, x2, t)

where (X1, X2, X3) are the 3-D world coordinates, (x1, x2) are the 2-D image
plane coordinates, and t is time; all are continuous variables.
The two types of projection are
1) Perspective (central) and
2) Orthographic (parallel)
 Perspective Projection:
Perspective projection reflects image formation using an ideal pinhole
camera according to the principles of geometrical optics. Thus, all the rays from
the object pass through the center of projection, which corresponds to the center
of the lens. For this reason, it is also known as “central projection.” Perspective
projection is illustrated in Figure 2.4 when the center of projection is between
the object and the image plane, and the image plane coincides with the (X1, X2)
plane of the world coordinate system.

The algebraic relations that describe the perspective transformation for
the configuration shown in Figure 2.4 can be obtained from the similar triangles
formed by drawing perpendicular lines from the object point (X1, X2, X3) and
the image point (x1, x2, 0) to the X3 axis, respectively. This leads to

x1 / f = X1 / (f - X3)   and   x2 / f = X2 / (f - X3)

or

x1 = f X1 / (f - X3)   and   x2 = f X2 / (f - X3)

where f denotes the distance from the center of projection to the image plane.
If we move the center of projection to coincide with the origin of the
world coordinates, a simple change of variables yields the following equivalent
expressions:

x1 = f X1 / X3   and   x2 = f X2 / X3

The configuration and the similar triangles used to obtain these expressions are
shown in Figure 2.5, where the image plane is parallel to the (X1, X2) plane of
the world coordinate system.

We note that the perspective projection is nonlinear in the Cartesian coordinates,
since it requires division by the X3 coordinate. However, it can be expressed as
a linear mapping in the homogeneous coordinates, as

[ x̃1 ]   [ f  0  0  0 ] [ k X1 ]
[ x̃2 ] = [ 0  f  0  0 ] [ k X2 ]
[ x̃3 ]   [ 0  0  1  0 ] [ k X3 ]
                        [ k    ]

where [k X1  k X2  k X3  k]^T and [x̃1  x̃2  x̃3]^T denote the world and image plane
points, respectively, in the homogeneous coordinates, and the image plane
coordinates are recovered as x1 = x̃1 / x̃3 and x2 = x̃2 / x̃3.
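A sketch of the homogeneous-coordinate form of the perspective projection (center of projection at the origin, image plane at X3 = f), with an assumed focal length:

```python
import numpy as np

def perspective_project(X, f):
    """Project 3-D points onto the image plane X3 = f (center of projection at the origin)."""
    X = np.atleast_2d(X)
    P = np.array([[f,   0.0, 0.0, 0.0],
                  [0.0, f,   0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    X_h = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous coordinates, k = 1
    x_h = X_h @ P.T                                   # (x1~, x2~, x3~)
    return x_h[:, :2] / x_h[:, 2:3]                   # divide by x3~ to get (x1, x2)

f = 0.05                                   # focal length (assumed, e.g. 50 mm)
X = np.array([[1.0, 2.0, 10.0]])
print(perspective_project(X, f))           # [[0.005, 0.01]] = (f*X1/X3, f*X2/X3)
```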


 Orthographic Projection:
Orthographic projection is an approximation of the actual imaging
process where it is assumed that all the rays from the 3-D object (scene) to the
image plane travel parallel to each other. For this reason it is sometimes called
the “parallel projection.” Orthographic projection is depicted in Figure 2.6.

When the image plane is parallel to the X1 - X2 plane of the world
coordinate system, the orthographic projection can be described in Cartesian
coordinates as

x1 = X1   and   x2 = X2

or in vector-matrix notation as

[ x1 ]   [ 1  0  0 ] [ X1 ]
[ x2 ] = [ 0  1  0 ] [ X2 ]
                     [ X3 ]

where x1 and x2 denote the image plane coordinates.


The distance of the object from the camera does not affect the image
plane intensity distribution in orthographic projection.
However, orthographic projection provides a good approximation to the
actual image formation process when the distance of the object from the camera
is much larger than the relative depth of points on the object with respect to a
coordinate system on the object itself.
In such cases, orthographic projection is usually preferred over more
complicated but realistic models because it is a linear mapping and thus leads to
algebraically and computationally more tractable algorithms.
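A brief numerical check of this approximation: when the depth variation on the object is small compared to its distance from the camera, the perspective image coordinates differ from the orthographic ones only slightly. The object points and focal length below are assumed values chosen so that f/X3 is close to 1.

```python
import numpy as np

# Object points whose relative depth spread (~1 unit) is small compared to
# their distance from the camera (~100 units) -- assumed values
X = np.array([[ 1.0,  2.0, 100.0],
              [-1.5,  0.5, 100.5],
              [ 0.5, -1.0, 101.0]])
f = 100.0                                   # focal length chosen so that f/X3 ~ 1

persp = f * X[:, :2] / X[:, 2:3]            # x1 = f*X1/X3, x2 = f*X2/X3
ortho = X[:, :2]                            # x1 = X1, x2 = X2 (parallel projection)

print(np.max(np.abs(persp - ortho)))        # small residual, ~0.01 here
```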

 Photometric Image Formation:


Image intensities can be modeled as proportional to the amount of light
reflected by the objects in the scene. The scene reflectance function is generally
assumed to contain a Lambertian and a specular component.
The surfaces where the specular component can be neglected are called
Lambertian surfaces.
 Lambertian Reflectance Model:
If a Lambertian surface is illuminated by a single point source with uniform
intensity (in time), the resulting image intensity is given by

f(x1, x2, t) = ρ N(t) · L

where ρ denotes the surface albedo (the fraction of the light reflected by the
surface),
L = (L1, L2, L3) is the unit vector in the mean illuminant direction, and
N(t) is the unit surface normal of the scene at the spatial location
(X1, X2, X3(x1, x2)) and time t, given by

N(t) = ( -p, -q, 1 )^T / sqrt( p^2 + q^2 + 1 )

in which p = ∂X3/∂x1 and q = ∂X3/∂x2 are the partial derivatives of the depth
X3(x1, x2) with respect to the image coordinates x1 and x2, respectively, under
the orthographic projection.
The illuminant direction can also be expressed in terms of tilt and slant angles
as
L = (L1, L2, L3) = (cos τ sin σ, sin τ sin σ, cos σ)
where
τ is the tilt angle of the illuminant (angle between L and the X1 – X3 plane),
and
σ is the slant angle (angle between L and the positive X3 axis)
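A small sketch that evaluates the Lambertian model for an assumed albedo, surface gradient (p, q), and illuminant tilt/slant angles:

```python
import numpy as np

def surface_normal(p, q):
    """Unit surface normal from the depth gradients p = dX3/dx1, q = dX3/dx2."""
    return np.array([-p, -q, 1.0]) / np.sqrt(p**2 + q**2 + 1.0)

def illuminant_direction(tilt, slant):
    """Unit illuminant vector L from the tilt and slant angles (radians)."""
    return np.array([np.cos(tilt) * np.sin(slant),
                     np.sin(tilt) * np.sin(slant),
                     np.cos(slant)])

def lambertian_intensity(albedo, p, q, tilt, slant):
    """Image intensity rho * N . L under the Lambertian model."""
    return albedo * surface_normal(p, q) @ illuminant_direction(tilt, slant)

# Assumed values: albedo 0.8, gentle surface slope, light 30 deg off the optical axis
print(lambertian_intensity(0.8, p=0.1, q=-0.2,
                           tilt=np.radians(45), slant=np.radians(30)))
```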
 Photometric effects of 3-D motion:
As an object moves in 3-D, the photometric properties of the surface and
the surface normal change as functions of time.
Assuming that the mean illuminant direction L remains constant, we can
express the change in intensity due to the photometric effects of the motion as

df(x1, x2, t)/dt = ρ ( dN(t)/dt ) · L

The rate of change of the normal vector N at the point (X1, X2, X3) can
be approximated by

dN(t)/dt ≈ ΔN / Δt

where ΔN denotes the change in the direction of the normal vector due to
the 3-D motion from the point (X1, X2, X3) to (X1', X2', X3') within the period Δt.
This change can be expressed as

ΔN = N(X1', X2', X3') - N(X1, X2, X3)

   = ( -p', -q', 1 )^T / sqrt( p'^2 + q'^2 + 1 ) - ( -p, -q, 1 )^T / sqrt( p^2 + q^2 + 1 )

where p' and q' denote the components of N(X1', X2', X3'), given by

p' = ∂X3'/∂x1'   and   q' = ∂X3'/∂x2'

Observation Noise:
Image capture mechanisms are never perfect. As a result, images
generally suffer from graininess due to electronic noise, photon noise, film-
grain noise, and quantization noise.
In video scanned from motion picture film, streaks due to possible
scratches on film can be modeled as impulsive noise.
Speckle noise is common in radar image sequences and biomedical cine-
ultrasound sequences.
The available signal-to-noise ratio (SNR) varies with the imaging devices
and image recording media.
Even if the noise may not be perceived at full-speed video due to the
temporal masking effect of the eye, it often leads to poor-quality “freeze-
frames.”
The observation noise in video can be modeled as additive or
multiplicative noise, signal-dependent or signal-independent noise, and white or
colored noise.
For example, photon and film-grain noise are signal-dependent, whereas
CCD sensor and quantization noise are usually modeled as white, Gaussian
distributed, and signal independent.
Ghosts in TV images can also be modeled as signal-dependent noise.
We will assume a simple additive noise model given by

g(x1, x2, t) = f(x1, x2, t) + v(x1, x2, t)

where f(x1, x2, t) and v(x1, x2, t) denote the ideal video and the noise at
time t, respectively.
The SNR is an important parameter for most digital video processing
applications, because noise hinders our ability to effectively process the data.
For example, in 2-D and 3-D motion estimation, it is very important to
distinguish the variation of the intensity pattern due to motion from that of the
noise.
In image resolution enhancement, noise is the fundamental limitation on
our ability to recover high frequency information.
Furthermore, in video compression, random noise increases the entropy
hindering effective compression.
The SNR of video imagery can be enhanced by spatio-temporal filtering,
also called noise filtering.
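A minimal sketch of the additive noise model and the resulting SNR for a synthetic test frame; the ramp image and the noise level are assumed values for illustration.

```python
import numpy as np

def add_gaussian_noise(frame, sigma, seed=0):
    """Additive noise model g = f + v with white, Gaussian, signal-independent v."""
    rng = np.random.default_rng(seed)
    return frame + rng.normal(0.0, sigma, size=frame.shape)

def snr_db(ideal, noisy):
    """SNR in dB: 10*log10(signal variance / noise variance)."""
    noise = noisy - ideal
    return 10.0 * np.log10(np.var(ideal) / np.var(noise))

# Synthetic "ideal" frame: a smooth horizontal intensity ramp (assumed test signal)
frame = np.tile(np.linspace(0, 255, 256), (256, 1))
noisy = add_gaussian_noise(frame, sigma=10.0)
print(f"SNR = {snr_db(frame, noisy):.1f} dB")
```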

 Sampling of Video Signals:


 Sampling Structures for Analog Video:
An analog video signal is obtained by sampling the time-varying image
intensity distribution in the vertical x2, and temporal t directions by a 2-D
sampling process known as scanning.
Continuous intensity information along each horizontal line is concatenated
to form the 1-D analog video signal as a function of time.
The two most commonly used vertical-temporal sampling structures are
1) Orthogonal sampling structure (used in the representation of progressive
analog video, such as that shown on workstation monitors)
2) Hexagonal sampling structure (used in the representation of 2:1 interlaced
analog video, such as that shown on TV monitors)

Each dot indicates a continuous line of video perpendicular to the plane
of the page. The matrices V shown in these figures are called the sampling
matrices.
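A sketch of how a sampling matrix generates vertical-temporal sample locations: each sample sits at V @ n for an integer index vector n. The specific orthogonal and hexagonal (interlaced) matrices below are illustrative assumptions with unit line spacing and unit field period, not the matrices of the original figures.

```python
import numpy as np

def lattice_points(V, n_range):
    """Generate sampling locations V @ n for integer index vectors n."""
    pts = [V @ np.array([n1, n2]) for n1 in n_range for n2 in n_range]
    return np.array(pts)

# Illustrative vertical-temporal sampling matrices; rows are (vertical; temporal)
V_progressive = np.array([[1.0, 0.0],    # orthogonal lattice: lines aligned
                          [0.0, 1.0]])   # from frame to frame
V_interlaced  = np.array([[2.0, 1.0],    # 2 lines per step within a field; the next
                          [0.0, 1.0]])   # field (1 time unit later) is offset by 1 line

print(lattice_points(V_progressive, range(2)))
print(lattice_points(V_interlaced, range(2)))
```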
 Sampling Structures for Digital Video:
Digital video can be obtained by sampling analog video in the horizontal
direction along the scan lines, or by applying an inherently 3-D sampling
structure to sample the time-varying image, as in the case of some solid-state
sensors.
In these figures, each circle indicates a pixel location, and the number inside
the circle indicates the time of sampling.
The first three sampling structures are lattices, whereas the sampling
structure in Figure 3.7 is not a lattice, but a union of two cosets of a lattice.
The vector c in Figure 3.7 shows the displacement of one coset with respect
to the other.
