Real-time Image Based Rendering From Uncalibrated Images

Geert Willems¹, Frank Verbiest¹, Maarten Vergauwen¹, Luc Van Gool¹,²

¹ ESAT / PSI-VISICS, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
² D-ITET / BIWI, Swiss Federal Institute of Technology (ETH), Gloriastrasse 35, 8092 Zürich, Switzerland
[email protected]
[email protected]
Abstract
We present a novel real-time image-based rendering system for generating realistic novel views of complex scenes from a set of uncalibrated images. A combination of structure-and-motion and stereo techniques is used to obtain calibrated cameras and dense depth maps for all recorded images. These depth maps are converted into restricted quadtrees, which allow for adaptive, view-dependent tessellations while storing per-vertex quality. When rendering a novel view, a subset of suitable cameras is selected based upon a ranking criterion. In the spirit of the unstructured lumigraph rendering approach, a blending field is evaluated, although the implementation is adapted in several respects. We alleviate the need to create a geometric proxy for each novel view, while the camera blending field is sampled in a more optimal, non-uniform way and combined with the per-vertex quality to reduce texture artifacts. In order to make real-time visualization possible, all critical steps of the visualization pipeline are programmed in a highly optimized way on commodity graphics hardware using the OpenGL Shading Language. The proposed system can handle complex scenes such as large outdoor scenes as well as small objects with a large number of acquired images.
1. Introduction
The two major concepts known in the literature for rendering novel views are geometry-based and image-based rendering. For complex scenes or objects, the geometry-based approach has some drawbacks: a very high number of vertices is needed to represent the structure accurately, and a realistic surface appearance is hard to obtain. Image-based rendering (IBR) has become an alternative to this approach over the last decade. IBR aims at generating novel views by interpolating information from images that resemble the requested view. An exact geometrical description of the scene is not necessary in this approach [9], although approximate geometrical information can be used to improve the results [2, 3, 5].
Recent real-time image-based rendering techniques from uncalibrated images are those of Mueller et al. [10] and Sainz et al. [11]. Both approaches are similar in that, after camera calibration, an intermediate step is taken in which a full volumetric reconstruction of the scene is computed, by means of shape-from-silhouette and voxel carving respectively. In [10] the obtained 3D model is converted into a wireframe and rendered using view-dependent multi-texturing to obtain the final image. Sainz et al. [11], on the other hand, use the volumetric reconstruction to compute the depth-images of each reference view. An efficient piece-wise linear depth-image approximation and warping technique [12], from the same main authors, is used for rendering. Both methods [10, 11] eventually perform a weighted blending of the warped reference views by computing a single weight per view based on the configuration of the cameras. Using only a single weight per camera, however, is not ideal when one wants to capture, for example, less diffuse objects or scenes which are not fully visible from each view, as discussed in detail by Buehler et al. [1]. More specifically, this is due to the lack of the minimal angular deviation and continuity properties.
We, however, obtain the depth-maps directly from the same set of input images, after calibration, using stereo techniques as in [7, 8], thereby removing the intermediate step of a full volumetric reconstruction. This allows for an IBR approach which can handle large and complex scenes, while specific memory handling is included to render scenes which are captured from a large number of viewpoints. In this respect, our pipeline resembles the approach of Evers-Senne et al. [4], although they use a multi-rig camera system to capture multiple images simultaneously and put most effort into creating a geometric proxy via depth-fusion from multiple cameras, while focusing less on texture blending.
In this paper, we present a novel pipeline for the generation of photo-realistic images in real-time from uncalibrated images from a variety of sources, be it hand-held video, hand-held digital stills, or images obtained in a more controlled way. We combine the strength of the efficient piece-wise linear depth-image approximation of Pajarola et al. [12] for fast texture warping with the Unstructured Lumigraph Rendering (ULR) technique of Buehler et al. [1] in order to correctly blend reference views. The required computations are carried out in real-time by taking advantage of the programmable vertex and fragment shader pipeline of current graphics hardware. More specifically, our implementation uses the OpenGL Shading Language (OGLSL) [14], which became available in mid-2003.

A global overview is depicted in figure 2. This paper is organized as follows. In section 2 we discuss the steps involved in the off-line processing of the uncalibrated images. Section 3 deals with the on-line generation of novel views, while section 4 contains more details on the specific shader computations. Finally, some experiments are discussed in section 5.

2. Offline processing in the IBR Pipeline

2.1. Image acquisition

The acquisition can be done in a variety of ways, ranging from the completely uncalibrated case of a sequence of images recorded with a still camera or a hand-held video camera, to a controlled, structured acquisition using a dome set-up with at least partial or approximate calibration. Recently, our lab has built such a dome set-up, consisting of a turntable and a camera mounted on a gantry that can move in the vertical plane under a hemisphere of controllable light sources. An image of the dome together with the corresponding calibrated camera path is shown in figure 1. The object of interest is placed on the turntable and, for each position of the table, can be photographed from different viewing angles, thereby collecting a dense set of images over the upper hemisphere. This image set is ideally suited for image-based interpolation.

Figure 1. Left: Image of the dome, used to capture image sequences of delicate objects. Right: The corresponding calibrated camera path. Cameras are depicted by a wireframe pyramid and their view of the scene.

2.2. Camera calibration and dense 3D reconstruction
For the calibration of closely spaced still images recorded with a camera, our lab already had a well-established structure-and-motion recovery (SaM) pipeline [13]. It is beyond the scope of this paper to go into the details of this problem and its solutions. For more information, we refer the reader to [6, 13]. The calibration of video footage and of the image set recorded with the dome requires some special attention, however, as they cannot be used out-of-the-box with our standard SaM pipeline, though many of the building blocks used are the same.
In the case of video, computing the epipolar geometry between two consecutive views is not well determined: as long as the camera has not moved sufficiently, the motion of the features can just as well be explained by a homography. The Geometric Robust Information Criterion (GRIC) proposed by Torr [15] allows us to evaluate which of the two models, epipolar geometry (F) or homography (H), is best suited to explain the data, and hence when the camera has moved enough to reliably estimate the epipolar geometry, at which point we instantiate a keyframe. Given these keyframes, calibration happens in the same way as with still images.
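For illustration, a minimal C++ sketch of the GRIC score is given below. It follows Torr's formulation; the noise variance, the robustification constant, and the per-model dimensions and parameter counts are illustrative assumptions rather than the exact settings of our pipeline.

#include <algorithm>
#include <cmath>
#include <vector>

// GRIC score for a motion model, following Torr's formulation.
// sqResiduals: squared residuals e_i^2 of the tracked features under the model
// sigma2: assumed noise variance; d: model dimension (H: 2, F: 3)
// k: number of model parameters (H: 8, F: 7); r: data dimension (4 for two views)
double gric(const std::vector<double>& sqResiduals, double sigma2, int d, int k, int r = 4)
{
    const int n = static_cast<int>(sqResiduals.size());
    const double lambda1 = std::log(static_cast<double>(r));
    const double lambda2 = std::log(static_cast<double>(r) * n);
    double score = 0.0;
    for (double e2 : sqResiduals)                        // robustified data term
        score += std::min(e2 / sigma2, 2.0 * (r - d));
    return score + lambda1 * d * n + lambda2 * k;        // model complexity terms
}

// A frame becomes a keyframe once gric(resF, s2, 3, 7) < gric(resH, s2, 2, 8),
// i.e. the epipolar geometry explains the tracked features better than a homography.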
For the controlled lab conditions of the dome, the intrinsics are calculated beforehand using the well-known algorithm of Tsai [17], providing us with a partial calibration. Given the intrinsics, calculating the motion of the camera for one sweep of the gantry boils down to the simpler problem of relative pose estimation. After feature extraction and pairwise matching between consecutive images, the relative pose can be computed. Notice that we do not rely on the mechanical settings of the gantry and turntable to obtain the extrinsic camera calibration. They are only used to put the reconstructions of the sweeps for different positions of the turntable approximately in the same global frame. Feature matching between images of neighboring reconstructions (sweeps), followed by a bundle adjustment including all reconstructions, results in the final calibration.
Since the calibration is known at this point, the epipolar geometry can be used to constrain the dense correspondence search between image pairs to a 1-D search range. Image pairs are warped so that epipolar lines coincide with the image scan lines [18]. In addition to the epipolar geometry, other constraints are used to guide the correspondence search towards the most probable scan-line match using a dynamic programming scheme [19].
The pairwise disparity estimation allows us to compute independent depth estimates for each camera viewpoint. An optimal joint estimate is achieved by fusing all independent estimates into a common 3D model using a Kalman filter and controlled correspondence linking [20]. Besides the depth estimate, two other values are generated that can be used as quality measurements. The first expresses the certainty of the resulting estimate, while the second indicates how many images were involved in the linking of a particular pixel. They will be used in the post-processing of the depth-maps, as discussed next.
2.3. Depth-map post-processing
The obtained depth-maps are, however, not directly suitable for our purposes, and some post-processing steps are in order. In a first step, depth values with a low linking count are removed. This leaves some undefined regions in the depth-map. The depth values in these regions are then reconstructed from nearby valid depths through a dilation-diffusion process.
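A minimal C++ sketch of one plausible dilation-diffusion fill is given below: invalid pixels repeatedly take the average of their valid 4-neighbours until the holes are closed. The row-major grid layout and the INVALID marker are assumptions for illustration, not the exact filter used in our pipeline.

#include <vector>

void fillDepthHoles(std::vector<float>& depth, int w, int h, float INVALID = -1.0f)
{
    bool changed = true;
    while (changed) {                                    // dilate valid depths into the holes
        changed = false;
        std::vector<float> next = depth;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                if (depth[y * w + x] != INVALID) continue;
                const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
                float sum = 0.0f; int cnt = 0;
                for (int i = 0; i < 4; ++i) {
                    int nx = x + dx[i], ny = y + dy[i];
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    float d = depth[ny * w + nx];
                    if (d != INVALID) { sum += d; ++cnt; }
                }
                if (cnt > 0) { next[y * w + x] = sum / cnt; changed = true; }
            }
        depth.swap(next);
    }
    // a few additional averaging (diffusion) sweeps over the filled regions could follow
}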
Secondly, in order to compact the information given by the depth-maps, we convert and store them as restricted quadtrees (RQT), using the efficient depth-image representation of [12]. An example of such a quadtree can be seen in figure 3. This piece-wise linear, hierarchical approximation of the depth-maps allows real-time adaptive generation of view-dependent, crack-free, triangulated depth-meshes. The so-called rubber-sheet triangles, which are introduced by surface interpolation over depth discontinuities, are eliminated, and a quality value is stored for each vertex. This quality measure gives a high weight to depth-map regions perpendicular to the view direction, as they are adequately sampled from the current view. The construction of these quadtrees is computationally intensive, but can be done beforehand in an offline step. For more in-depth information about these criteria we refer the reader to [12].
Figure 2. Schematic overview of the offline (left) and online (right) processing steps of our IBR system.
3. Online processing in the IBR Pipeline
3.1. Memory Management
Since our system needs to handle complex scenes with a large number (over 200) of captured images, only the intrinsic and extrinsic camera parameters of all available cameras are loaded into the system at start-up. This allows for a fast initialization of the rendering system. Upon generating a novel view, the necessary resources, being the textures and the depth-image data, have to be read from disk. To decrease the load time, the reconstructed quadtrees are stored in a binary format, while the textures are available in a compressed format which can be uploaded directly to the graphics card.

As memory can be a limiting factor when dealing with a large number of reference views, memory management has been included in the rendering pipeline, which deletes the resources of the least recently used cameras if not enough free memory is available in RAM or on the graphics board.
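A minimal C++ sketch of such a least-recently-used cache is given below; the CameraResources type, the capacity expressed as a number of loaded cameras, and loadFromDisk are illustrative assumptions, since the actual implementation also tracks memory on the graphics board.

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

struct CameraResources { /* compressed texture, RQT depth-mesh data, ... */ };

class ResourceCache {
public:
    explicit ResourceCache(std::size_t maxLoaded) : maxLoaded_(maxLoaded) {}

    // Returns the resources of a camera, loading them from disk if necessary
    // and evicting the least recently used cameras when the budget is exceeded.
    CameraResources& acquire(int cameraId) {
        auto it = entries_.find(cameraId);
        if (it != entries_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.first);  // mark as most recently used
            return it->second.second;
        }
        while (entries_.size() >= maxLoaded_) {                 // evict least recently used cameras
            entries_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(cameraId);
        return entries_.emplace(cameraId,
            std::make_pair(lru_.begin(), loadFromDisk(cameraId))).first->second.second;
    }

private:
    CameraResources loadFromDisk(int /*cameraId*/) { return CameraResources{}; }  // stub
    std::size_t maxLoaded_;
    std::list<int> lru_;                                        // most recently used camera at the front
    std::unordered_map<int, std::pair<std::list<int>::iterator, CameraResources>> entries_;
};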
3.2. Pipeline overview
We now give a short description for each of the steps in
the online part of the pipeline.
1. From the set of available cameras, select the n most
suitable cameras based on the given virtual viewpoint
from which a novel view has to be synthesized. (see
section 3.3)
2. Adaptively triangulate the depth-meshes from the selected cameras and generate segmented triangular
strips. (see section 3.4)
3. Render all triangle strips without illumination and texturing to obtain the ε-Z-buffer of the scene, seen from the virtual viewpoint. (see section 3.5)
4. Render each tessellated depth-mesh a second time from the virtual viewpoint, with its texture and per-vertex quality stored in the alpha channel. During rendering, the Z-buffer is set to read-only and the fixed graphics pipeline is altered by enabling a shader program which computes the per-pixel final weights. Each resulting image is copied into a separate RGBA image, where the warped texture is stored in RGB and the alpha channel contains the final weight. (see section 3.6)
5. Synthesize the novel view by rendering all obtained images onto a full-screen quad using multi-texturing under orthographic projection, while blending is performed via a second normalization shader program. This fragment shader normalizes the alpha values of all textures in each pixel and blends the RGB values accordingly. (see section 3.7)
3.3. Camera Selection
From the whole set of cameras that we have at our disposal, not all cameras are equally suited for rendering from a certain virtual viewpoint, and therefore a subset of cameras is selected. In [4] the authors developed a ranking criterion for ordering the reference views with respect to the requested virtual viewpoint. This criterion combines the proximity and the difference in viewing angle of the real cameras with respect to the virtual viewpoint. As the cameras are then listed in order of descending quality, we can select the n best cameras from which we will generate the novel view.
Loading the data from several cameras at once, however, can lead to a temporary drop in the frame rate. We therefore propose a more opportunistic selection process.

From the ordered list of cameras, we first select the n best cameras. We then loop backwards over this subset of cameras and swap each unloaded camera with the next best, already loaded, camera that is not yet in our subset. This procedure is repeated until our subset of cameras includes at most one camera whose information has not been loaded. This kind of opportunistic selection avoids serious frame-rate drops when the user moves quickly to a viewpoint for which no or only a few real cameras have been loaded. This technique may, however, result in a temporarily less ideal rendering, as cameras with a lower quality are used. Fortunately, this problem resolves itself automatically after a few frames.
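A minimal C++ sketch of this ranking and opportunistic swapping is given below. The cost function that combines proximity and viewing-angle difference, and the Camera structure, are illustrative assumptions; the actual criterion follows [4].

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Camera {
    double pos[3], dir[3];   // position and (unit) optical axis
    bool loaded;             // resources already in memory?
};

// Illustrative ranking cost with respect to the virtual camera (smaller is better).
static double rankingCost(const Camera& c, const Camera& v) {
    double d2 = 0.0, cosA = 0.0;
    for (int i = 0; i < 3; ++i) {
        double diff = c.pos[i] - v.pos[i];
        d2 += diff * diff;
        cosA += c.dir[i] * v.dir[i];
    }
    return (1.0 - cosA) + std::sqrt(d2);
}

// Pick the n best cameras, then swap unloaded ones for the next best loaded
// cameras until at most one unloaded camera remains in the subset.
std::vector<std::size_t> selectCameras(const std::vector<Camera>& cams,
                                       const Camera& virtualCam, std::size_t n) {
    std::vector<std::size_t> order(cams.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return rankingCost(cams[a], virtualCam) < rankingCost(cams[b], virtualCam);
    });

    std::vector<std::size_t> subset(order.begin(),
                                    order.begin() + std::min(n, order.size()));
    std::size_t next = subset.size();                     // next candidate in the ranking
    for (std::size_t i = subset.size(); i-- > 0; ) {      // loop backwards over the subset
        std::size_t unloaded = 0;
        for (std::size_t id : subset) if (!cams[id].loaded) ++unloaded;
        if (unloaded <= 1) break;                         // tolerate one unloaded camera
        if (cams[subset[i]].loaded) continue;
        while (next < order.size() && !cams[order[next]].loaded) ++next;
        if (next == order.size()) break;                  // no loaded candidates left
        subset[i] = order[next++];                        // swap in the next best loaded camera
    }
    return subset;
}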
3.4. Depth-mesh tessellation
For each selected camera, we tessellate the corresponding restricted quadtree according to an image-space geometric error tolerance [12]. In case view-independent tessellations are used, this only has to be done if no tessellation is currently loaded in memory.
The tessellation is stored as segmented triangle strips to speed up rendering and save memory. No texture coordinates are saved, as we use automatic texture coordinate generation to project the textures onto the triangle strips. The color values of the vertices are set to (1.0, 1.0, 1.0, ρ), where ρ is the per-vertex quality. This approach differs from [12], where the color values of each vertex are set to (ρ, ρ, ρ, ρ), which in some cases suffers from color quantization issues, as explained in section 3.7.

Figure 3. Left: a reference view from the "Car Crash" sequence. Right: wireframe of the computed restricted quadtree. Bottom: the calibrated camera path.
3.5. Rendering the ε-Z-buffer

In order to blend between depth-meshes, the standard Z-buffer visibility test cannot be used, since all depth-meshes are noisy to some degree. We therefore relax the visibility test by generating an ε-Z-buffer in the space of the novel view, which has slightly lower z-values. This step is identical to Pajarola et al. [12] and we will therefore not elaborate on it further here.
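For illustration, the sketch below shows one way such a relaxed depth buffer could be produced with plain OpenGL, by rendering all depth-meshes untextured with a small negative depth offset and then locking the depth buffer. The DepthMesh type and the offset values are assumptions; the exact procedure follows [12].

#include <GL/gl.h>
#include <vector>

struct DepthMesh {
    void drawTriangleStrips() const {}   // stub: issues the triangle-strip draw calls in the real system
};

void renderEpsilonZBuffer(const std::vector<DepthMesh>& meshes) {
    glClear(GL_DEPTH_BUFFER_BIT);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   // depth-only pass
    glEnable(GL_POLYGON_OFFSET_FILL);
    glPolygonOffset(-1.0f, -2.0f);                         // pull z slightly towards the viewer (the "epsilon")
    for (const DepthMesh& m : meshes)
        m.drawTriangleStrips();                            // no lighting, no texturing
    glDisable(GL_POLYGON_OFFSET_FILL);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);                                 // later passes use the Z-buffer read-only
}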
3.6. Warping textures while evaluating the blending field
The key property of image-based rendering techniques is
the combination of different images to render a new virtual
view. Not every image that is used for the rendering has the same impact in all areas of the new image. To this end, a blending field is sampled that describes the relative importance of each camera for every pixel in the virtual view.
3.6.1. Optimized blending field sampling. In the original proposal, Buehler et al. [1] triangulate the virtual image plane using a regular grid in combination with the projection of a manually created low-resolution geometric proxy, using constrained Delaunay triangulation to capture the most interesting spatial variations. We, however, already have a good approximation of the scene. The obtained quadtree triangulations are piece-wise linear approximations of the surface and therefore have a denser sampling of points in areas with high curvature. This makes the vertices of these tessellations ideal sampling positions. Such non-regular, adaptive sampling of blending fields was also mentioned as future work in [1].
An alternative way to sample the blending field presents itself. Instead of uniformly sampling the blending field once, we resample the blending field for each camera, each time using the vertices of the tessellation of its corresponding depth-mesh as sample positions. Using different sample positions for each camera is not a problem if the depth-maps, from which the depth-meshes were created, are good approximations of the scene surface. Using adaptive triangulation via restricted quadtrees, we can safely assume that the tessellations are a good description of the real scene surface, provided the obtained depth-maps are consistent. This emphasizes the importance of cleaning up the depth-maps in the post-processing step discussed in section 2. Hence, although we sample the blending field at different positions for each camera, the resulting blending fields (after Gouraud shading) will be almost identical. We want to draw the reader's attention to the fact that in this case the whole virtual view is not necessarily sampled. However, the parts of the virtual view which are left unsampled are those where no scene geometry is present.
3.6.2. Computation of blending weights. The downside of re-sampling the blending field for each selected camera is more than compensated by the fact that we no longer have to recreate a (uniform) proxy for each frame. Furthermore, the sampling can be performed fully on the GPU using a shader program. The implementation of the shader itself is discussed in more detail in section 4.

For each selected reference view, we render the tessellated depth-mesh from the virtual viewpoint together with its corresponding texture, using projective texture coordinate generation. At the same time, the color values of the vertices are set to (1.0, 1.0, 1.0, ρ), where ρ is the per-vertex quality. During rendering, the previously computed ε-Z-buffer is set to read-only and the blending field shader is enabled. The shader computes a per-pixel weight w_k^{final} (see section 4.2) for the current reference view C_k. The result is rendered into a separate RGBA texture in which each pixel contains (tex_R(s,t), tex_G(s,t), tex_B(s,t), w_k^{final}). Notice that, in areas which are not seen by all depth-meshes, a low alpha value can sometimes be strongly amplified by the normalization. Since the RGB values of the warped 8-bit textures are not pre-multiplied with the blending weights, as proposed in [12], our method does not suffer from color quantization in these cases.
Figure 4. Top: Flow of the online processing steps of our shader-based IBR system. Bottom: Detailed visualization of the two shader programs, showing the (limited) data transfer between CPU and GPU.
3.7. Synthesizing the novel view
Once the textures for all selected cameras are computed, they are orthographically projected onto a full-screen quad using multi-texturing, while a second shader program is enabled that blends the textures based on their respective weights (stored in the alpha component). The weights need to be re-normalized, as their sum is not necessarily equal to one. There are two reasons for this: the multiplication of the weights by the per-vertex quality, and the fact that not all pixels are covered by the depth-meshes of all selected cameras.
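The math of this normalization is simple; the C++ sketch below mirrors what the normalization fragment shader does for a single pixel. The RGBA struct and the fixed-size array of warped textures are illustrative assumptions.

#include <array>
#include <cstddef>

struct RGBA { float r, g, b, a; };   // a holds the (unnormalized) weight w_k^final

// Blend the warped textures of the N selected cameras at one pixel.
template <std::size_t N>
RGBA blendPixel(const std::array<RGBA, N>& warped) {
    RGBA out{0.0f, 0.0f, 0.0f, 1.0f};
    float sumW = 0.0f;
    for (const RGBA& c : warped) sumW += c.a;
    if (sumW <= 0.0f) return out;                // pixel not covered by any depth-mesh
    for (const RGBA& c : warped) {               // weights are normalized to sum to one
        float w = c.a / sumW;
        out.r += w * c.r;
        out.g += w * c.g;
        out.b += w * c.b;
    }
    return out;
}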
4. Shader-based blending field computations
As discussed in section 3.6, we separately render the tessellated depth-mesh from the virtual view C_v for each selected camera C_k, together with its corresponding texture.
4.1. Shader initialization
At the beginning of this stage, before warping any textures, the blending field shader is initialized and enabled. The camera positions of the selected reference views and the position of the virtual viewpoint are passed to the shader, together with some penalty factors set by the user. These parameters stay fixed for the current rendering loop. Before warping each individual texture, the ID of the current real camera is also set. This tells the shader for which camera the weights should be stored.
4.2. Blending weight computations
For each vertex p of the depth-mesh tessellation belonging to camera C_k (1 ≤ k ≤ n), the normalized blending weights w̃_k = [w̃_k^1 w̃_k^2 ... w̃_k^n] for all n selected cameras C_i (1 ≤ i ≤ n) are calculated in the vertex shader. To this end, we first compute for each vertex p of camera C_k the combined penalty pen_k^i with respect to each camera C_i. The penalty of a camera C_i at the position of a vertex belonging to view C_k is computed by combining the angular, resolution and field-of-view penalties, as described in [1]:

pen_k^i = α · pen_k^{i,angular} + β · pen_k^{i,resolution} + γ · pen_k^{i,fov}    (1)

[ vertex shader ]
const int n;                  // number of selected cameras used for IBR of current frame
uniform int cameraID;         // ID of currently selected camera
uniform vec3 virtualCamera;   // containing position of virtual camera
uniform vec3 cameras[n];      // array containing positions of all selected cameras
uniform float alphas[3];      // relative weights for the different penalties
varying vec2 ProjCoord;       // texture coordinate via tex. coord. generation

void main(void) {
    vec4 texcoord = gl_TextureMatrix[cameraID] * gl_Vertex;
    ProjCoord = texcoord.xy / texcoord.w;     // generate texture coordinates
    float weight[n];                          // filled in by the two steps below
    // 1. calculate the different penalties (angular, field-of-view, resolution)
    // 2. convert them to weights, enforce epipolar consistency and normalize
    gl_FrontColor.r = weight[cameraID];       // store weight[cameraID] in red channel
    gl_FrontColor.a = gl_Color.a;             // store the vertex quality in alpha channel
    gl_Position = ftransform();
}

[ fragment shader ]
uniform int cameraID;
uniform sampler2D texUnit[n];
varying vec2 ProjCoord;

void main(void) {
    vec4 texColor = texture2D(texUnit[cameraID], ProjCoord);
    // store texture color in RGB channels
    // store (interpolated) blending field weight * vertex quality in alpha channel
    float weightedquality = gl_Color.r * gl_Color.a;
    gl_FragColor = vec4(texColor.rgb, weightedquality);
}

Listing 1. Lay-out of the OGLSL code of the blending field shader.
The angular penalty pen_k^{i,angular} is defined as the angular difference between the rays pC_i and pC_v, which are respectively the rays from the vertex p to the position of camera C_i and to the virtual camera C_v:

pen_k^{i,angular} = 1 − (pC_i · pC_v) / (‖pC_i‖ ‖pC_v‖)    (2)
We approximate the resolution penalty pen_k^{i,resolution} similarly to [1], by only penalizing under-sampling:

pen_k^{i,resolution} = max(0, ‖pC_i‖ − ‖pC_v‖)    (3)
The field-of-view penalty pen_k^{i,fov} is especially important when we want to render whole scenes. We do not want to give importance to cameras for which the point p projects outside their field-of-view, as this leads to highly visible artifacts at the borders of the depth-meshes (see figure 5). In order to keep the resulting blending field smooth, the penalty implemented in the shader goes towards infinity as the vertex p is projected closer to the border of the field-of-view of camera C_i.
By using texture coordinate generation, we can use the cameras' texture projection matrices to calculate the texture coordinates of vertex p within each camera view. These texture coordinates can then be used to determine the direction of the ray through vertex p with respect to the field-of-view.
The angular and resolution penalties are calculated based on the positions of the virtual camera and of the n selected cameras. Since the positions of the cameras are not known to a vertex shader by default, we initialize these shader variables at the beginning of the rendering loop. The texture projection matrices are always accessible from the vertex shader and do not have to be passed on via additional variables.
Once the penalties of all selected reference views at a certain vertex have been determined, we can compute how each view is weighted to reconstruct the final image:

w̃_k^i = w_k^i / Σ_{i=1}^{n} w_k^i    (4)
Finally, the normalized weight w̃_k^k of the current camera C_k for vertex p is passed to the fragment shader, together with the per-vertex quality ρ. During this stage, Gouraud shading is enabled to obtain both the per-pixel quality ρ and the blending field weight w̃_k^k.

In the fragment shader, the alpha value of each pixel in the virtual view is set to the final weight w_k^{final} for the corresponding camera C_k:

w_k^{final} = ρ · w̃_k^k    (5)
The RGB values obtained by projecting the camera's texture onto the depth-mesh are kept unaltered, and the resulting image is copied into an RGBA texture. The general lay-out of the shader is given in listing 1. The GPU version of the pipeline is shown in figure 4, which is deliberately depicted in a similar fashion as in [12] for easy comparison.
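To make equations (1)-(5) concrete, the C++ sketch below computes the final weight of camera C_k at one of its vertices on the CPU. The penalty-to-weight conversion (a simple reciprocal) and the in-view assumption for the field-of-view term are illustrative; the shader follows [1] and additionally enforces epipolar consistency.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static double len(const Vec3& a) { return std::sqrt(dot(a, a)); }
static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Final weight of camera k at vertex p of its own depth-mesh, equations (1)-(5).
double finalWeight(const Vec3& p, const std::vector<Vec3>& cams, std::size_t k,
                   const Vec3& virtualCam, double rho,           // rho: per-vertex quality
                   double alpha, double beta, double gamma) {
    const Vec3 pCv = sub(virtualCam, p);
    std::vector<double> w(cams.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < cams.size(); ++i) {
        const Vec3 pCi = sub(cams[i], p);
        double angular = 1.0 - dot(pCi, pCv) / (len(pCi) * len(pCv));     // eq. (2)
        double resolution = std::max(0.0, len(pCi) - len(pCv));           // eq. (3)
        double fov = 0.0;              // assumed in view; grows towards the image border
        double pen = alpha * angular + beta * resolution + gamma * fov;   // eq. (1)
        w[i] = 1.0 / (pen + 1e-6);     // assumed penalty-to-weight conversion
        sum += w[i];
    }
    double wTilde = w[k] / sum;        // eq. (4): normalize over the n selected cameras
    return rho * wTilde;               // eq. (5): modulate by the per-vertex quality
}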
5. Experiments
We have tested our image-based rendering pipeline on a variety of footage from scenes acquired by different methods. One scene was recorded using a hand-held video camera, while another scene was taken with a hand-held digital still camera. The "Skull" has been captured using our dome. An overview of the footage, together with some additional information, is given in table 1. All examples were rendered using an nVidia GeForce FX 5200 graphics card at frame rates of 5 to 10 fps. The dimensions of the rendered novel views are also listed in table 1.

Scene            Car Crash        Ename Site        Skull
Scene diameter   5 m              20 m              25 cm
Acquisition      hand-held video  hand-held stills  stills (dome)
# Images         91               14                237
Resolution       720 x 576        3072 x 2048       1600 x 1200
Novel view       720 x 576        768 x 512         800 x 600

Table 1. List of acquired scenes.

Figure 5. Top to bottom: cropped view of the reference image; novel view generated from 4 viewpoints with the original viewpoint excluded, using the proposed method; visualization of the weights; novel view using one relative weight per viewpoint as in [10] (combined with per-vertex quality); visualization of the weights.

Figure 6. Left: original image. Right: virtual image, generated from 3 viewpoints with the original viewpoint excluded.
We evaluate the proposed algorithm using the "Car Crash" scene, which has been recorded with a Sony 3CCD digital video camera. As the scene has been recorded from close by, each view only sees part of the scene. The different positions from which the images were taken, together with a rough visualization of the scene, are shown in figure 3. The scene is furthermore geometrically complex and specular in the area of the headlights. In figure 5, a partial view of a reference image is shown together with the reconstruction from 4 neighboring views. For the visualization of the weights, each camera is assigned a color. The resulting pixel color is obtained by blending these colors based on their respective weights. Although the PSNR of the novel view created by using only one weight per camera (as done in [10]) is hardly different from that of our proposed method (around 22.5 dB), clear artifacts (marked by ellipses in figure 5) can be noticed.
On the left, a clear discontinuity is visible, caused by the fact that part of the scene visible in the novel view lies outside that camera's field-of-view. Secondly, the dense sampling of the blending field, in contrast to using only one weight per camera, allows for a more faithful reconstruction of the specular highlights in the headlights.
In figure 6, a reference view and a reconstruction using three viewpoints with the original viewpoint excluded are shown for the "Car Crash" and "Ename Site" scenes. Finally, in figure 7, a novel view of the "Skull" sequence is shown using three cameras. An original image is shown together with its rendered counterpart and the computed blending weights, where each of the 3 cameras is assigned a primary color. Notice the gradual changes, due to a shift of camera importance, together with the local perturbations due to the per-vertex quality.

Figure 7. Left to right: original view, virtual image generated from 3 viewpoints with the original viewpoint excluded, computed blending weights.
6. Conclusion
We have described a new image-based rendering approach that can handle uncalibrated images taken from hand-held video or digital stills. The combination of depth-map computation from stereo, efficient depth-image warping and non-uniformly sampled blending field weights allows for a good IBR framework that handles scenes which are only partly visible from each camera, as well as more specular scenes, in a visually pleasing way. The pipeline can handle complex scenes and a large number of reference views, while high-level shader programs are used to allow for the real-time rendering of novel views. However, the accuracy of the reconstruction is of course dependent on the quality of the depth-maps. Furthermore, although in the case of large scenes a good blending between cameras is assured, there still exists a "popping effect" due to the camera selection procedure when a new set of cameras is selected for the creation of the next novel view. One solution would be to include a criterion which tries to maximize the "coverage" of the scene. Another possibility is to select a larger set of cameras and to decide in the shader, by sorting the cameras on a vertex-per-vertex basis, which n cameras should be used for blending. Such implementations are currently not possible due to the limited size of the shader programs. Next-generation hardware will most likely enable this possibility in the near future.
Acknowledgements
The authors gratefully acknowledge support from the European IST project INVIEW (IST-2000-28459) and the European Network of Excellence EPOCH (IST-2002-507382).
Co-funding by the K.U. Leuven Research Council GOA
project ’MARVEL’ is also gratefully acknowledged.
References
[1] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. In Proc. SIGGRAPH 2001, pages 425–432, 2001.
[2] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum. Plenoptic sampling. In Proc. SIGGRAPH 2000, pages 307–318, 2000.
[3] P. Debevec, Y. Yu, and G. Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. In Proc. Eurographics Rendering Workshop 1998, pages 105–116, 1998.
[4] J.-F. Evers-Senne and R. Koch. Image based interactive rendering with view dependent geometry. In Proc. Eurographics 2003, Computer Graphics Forum, pages 573–582. Eurographics Association, 2003.
[5] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Proc. SIGGRAPH 1996, pages 43–54, 1996.
[6] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 1998.
[7] B. Heigl, R. Koch, M. Pollefeys, J. Denzler, and L. Van Gool. Plenoptic modeling and rendering from image sequences taken by hand-held camera. In DAGM'99, pages 94–101, 1999.
[8] R. Koch, M. Pollefeys, B. Heigl, L. Van Gool, and H. Niemann. Calibration of hand-held camera sequences for plenoptic modeling. In ICCV (1), pages 585–591, 1999.
[9] M. Levoy and P. Hanrahan. Light field rendering. In Proc. SIGGRAPH 1996, pages 31–42, 1996.
[10] K. Mueller, A. Smolic, P. Merkle, B. Kaspar, P. Eisert, and T. Wiegand. 3D reconstruction of natural scenes with view-adaptive multi-texturing. In 3DPVT, pages 116–123, 2004.
[11] M. Sainz, R. Pajarola, and A. Susin. Photorealistic image based objects from uncalibrated images. In Posters of IEEE Visualization Conference (VIS'03), 2003.
[12] R. Pajarola, M. Sainz, and Y. Meng. DMesh: Fast depth-image meshing and warping. Int. J. Image Graphics, 4(4):653–681, 2004.
[13] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207–232, 2004.
[14] R. J. Rost, J. M. Kessenich, B. Lichtenbelt, and M. Olano. OpenGL Shading Language. Addison-Wesley Professional, 2004.
[15] P. Torr, A. Fitzgibbon, and A. Zisserman. Maintaining multiple motion model hypotheses over many views to recover matching and structure. In Proc. International Conference on Computer Vision, pages 485–491, 1998.
[16] M. Pollefeys, L. Van Gool, M. Vergauwen, K. Cornelis, F. Verbiest, and J. Tops. Video-to-3D. In ISPRS Commission V Symposium, Corfu, Greece, 2–6 September 2002.
[17] R. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In Proc. Computer Vision and Pattern Recognition, 1986.
[18] M. Pollefeys, R. Koch, and L. Van Gool. A simple and efficient rectification method for general motion. In Proc. ICCV'99, Corfu, Greece, pages 496–501, 1999.
[19] G. Van Meerbergen, M. Vergauwen, M. Pollefeys, and L. Van Gool. A hierarchical symmetric stereo algorithm using dynamic programming. International Journal of Computer Vision, 47(1-3):275–285, 2002.
[20] R. Koch, M. Pollefeys, and L. Van Gool. Multi viewpoint stereo from uncalibrated video sequences. In Proc. European Conference on Computer Vision, Freiburg, Germany, pages 55–71, 1998.