
Real-Time Image Based Rendering from Uncalibrated Images

2005, Fifth International Conference on 3-D Digital Imaging and Modeling (3DIM'05)


Real-time Image Based Rendering from Uncalibrated Images

Geert Willems (1), Frank Verbiest (1), Maarten Vergauwen (1), Luc Van Gool (1,2)

(1) ESAT / PSI-VISICS, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
(2) D-ITET / BIWI, Swiss Federal Institute of Technology (ETH), Gloriastrasse 35, 8092 Zürich, Switzerland

Abstract

We present a novel real-time image-based rendering system for generating realistic novel views of complex scenes from a set of uncalibrated images. A combination of structure-and-motion and stereo techniques is used to obtain calibrated cameras and dense depth maps for all recorded images. These depth maps are converted into restricted quadtrees, which allow for adaptive, view-dependent tessellations while storing per-vertex quality. When rendering a novel view, a subset of suitable cameras is selected based upon a ranking criterion. In the spirit of the unstructured lumigraph rendering approach, a blending field is evaluated, although the implementation is adapted on several points. We alleviate the need to create a geometric proxy for each novel view, while the camera blending field is sampled in a more optimal, non-uniform way and combined with the per-vertex quality to reduce texture artifacts. To make real-time visualization possible, all critical steps of the visualization pipeline are implemented in a highly optimized way on commodity graphics hardware using the OpenGL Shading Language. The proposed system can handle complex scenes such as large outdoor scenes, as well as small objects captured in a large number of images.

1. Introduction

The two major concepts known in the literature for rendering novel views are geometry-based and image-based rendering. For complex scenes or objects, the geometry-based approach has some drawbacks: an accurate representation of the structure requires a very high number of vertices, and the surface appearance often lacks realism. Image-based rendering (IBR) has become an alternative to this approach over the last decade. IBR aims at generating novel views by interpolating information from images that resemble the requested view. An exact geometrical description of the scene is not necessary in this approach [9], although approximate geometrical information can be used to improve the results [2, 3, 5].

Recent real-time image-based rendering techniques from uncalibrated images are those of Mueller et al. [10] and Sainz et al. [11]. Both approaches are similar in that, after camera calibration, an intermediate step is taken in which a full volumetric reconstruction of the scene is computed, by means of shape-from-silhouette and voxel carving respectively. In [10] the obtained 3D model is converted into a wireframe and rendered using view-dependent multi-texturing to obtain the final image. Sainz et al. [11], on the other hand, use the volumetric reconstruction to compute the depth-images of each reference view. An efficient piece-wise linear depth-image approximation and warping technique [12], from the same main authors, is used for rendering. Both methods [10, 11] eventually perform a weighted blending of the warped reference views by computing a single weight per view, based on the configuration of the cameras. Using only a single weight per camera, however, is not ideal when one wants to capture, e.g., less diffuse objects or scenes which are not fully visible from each view, as discussed in detail by Buehler et al. [1].
This is, more specifically, due to the lack of the minimal angular deviation and continuity properties. We, however, obtain the depth-maps directly from the same set of input images, after calibration, using stereo techniques as in [7, 8], thereby removing the intermediate step of a full volumetric reconstruction. This allows for an IBR approach that can handle large and complex scenes, while dedicated memory handling is included to render scenes captured from a large number of viewpoints. In this respect, our pipeline resembles the approach of Evers-Senne et al. [4], although they use a multi-rig camera system to capture multiple images simultaneously and put most of their effort into creating a geometric proxy via depth fusion from multiple cameras, while focusing less on texture blending.

In this paper, we present a novel pipeline for the generation of photo-realistic images in real time from uncalibrated images from a variety of sources, be it hand-held video, hand-held digital stills, or images obtained in a more controlled way. We combine the strength of the efficient piece-wise linear depth-image approximation of Pajarola et al. [12] for fast texture warping with the Unstructured Lumigraph Rendering (ULR) technique of Buehler et al. [1] in order to correctly blend reference views. The required computations are performed in real time by taking advantage of the programmable vertex and fragment shader pipeline of current graphics hardware. More specifically, our implementation uses the OpenGL Shading Language (OGLSL), which became available mid-2003. A global overview is depicted in figure 2.

This paper is organized as follows. In section 2 we discuss the steps involved in the off-line processing of the uncalibrated images. Section 3 deals with the on-line generation of novel views, while section 4 contains more details on the specific shader computations. Finally, some experiments are discussed in section 5.

2. Offline processing in the IBR Pipeline

2.1. Image acquisition

The acquisition can be done in a variety of ways, ranging from the completely uncalibrated cases of a sequence of images recorded with a still camera or a hand-held video, to a controlled, structured acquisition using a dome set-up with at least partial or approximate calibration. Recently, our lab has built a dome set-up, consisting of a turntable and a camera mounted on a gantry that can move in the vertical plane under a hemisphere of controllable light sources. An image of the dome together with the corresponding calibrated camera path is shown in figure 1. The object of interest is placed on the turntable and, for each position of the table, can be photographed from different viewing angles, thereby collecting a dense set of images over the upper hemisphere. This image set is ideally suited for image-based interpolation.

Figure 1. Left: Image of the dome, used to capture image sequences of delicate objects. Right: The corresponding calibrated camera path. Cameras are depicted by a wireframe pyramid and their view of the scene.

2.2. Camera calibration and dense 3D reconstruction

For the calibration of closely spaced still images recorded with a camera, our lab already had a well-established structure-and-motion recovery (SaM) pipeline [13]. It is beyond the scope of this paper to go into the details of this problem and its solutions; for more information, we refer the reader to [6, 13].

The calibration of video footage and of the image set recorded with the dome requires some special attention, however, as they cannot be used out-of-the-box with our standard SaM pipeline, though many of the building blocks used are the same. In the case of video, the epipolar geometry between two consecutive views is not well determined: as long as the camera has not moved sufficiently, the motion of the features can just as well be explained by a homography. The Geometric Robust Information Criterion (GRIC) proposed by Torr [15] allows us to evaluate which of the two models, epipolar geometry (F) or homography (H), is best suited to explain the data, and thus when the camera has moved enough to reliably estimate the epipolar geometry, at which point we instantiate a keyframe. Given these keyframes, calibration proceeds in the same way as with still images.
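As an illustration of this model-selection test, the comparison could be organized as in the following C++ sketch. The scoring function and constants follow Torr's GRIC as commonly stated (r = 4; d = 3 and k = 7 for F; d = 2 and k = 8 for H), and the residual vectors are assumed to come from the robust F- and H-estimators of the SaM pipeline; this is a minimal sketch, not the actual code of the pipeline.

    #include <vector>
    #include <cmath>
    #include <algorithm>

    // Torr's GRIC score for a model fitted to n correspondences:
    //   GRIC = sum_i rho(e_i^2) + lambda1 * d * n + lambda2 * k
    // with rho(e^2) = min(e^2 / sigma^2, lambda3 * (r - d)).
    double gric(const std::vector<double>& residuals, double sigma, double d, double k)
    {
        const double r = 4.0;          // dimension of a point correspondence (x, y, x', y')
        const double lambda3 = 2.0;    // limits the influence of outliers
        const double n = static_cast<double>(residuals.size());
        const double lambda1 = std::log(r);
        const double lambda2 = std::log(r * n);

        double score = 0.0;
        for (std::size_t i = 0; i < residuals.size(); ++i) {
            const double e2 = residuals[i] * residuals[i];
            score += std::min(e2 / (sigma * sigma), lambda3 * (r - d));  // robust data term
        }
        return score + lambda1 * d * n + lambda2 * k;                    // model complexity terms
    }

    // A frame becomes a keyframe once the epipolar geometry explains the
    // matches better than a homography, i.e. GRIC(F) < GRIC(H).
    bool isKeyframe(const std::vector<double>& residualsF,   // residuals w.r.t. the fitted F
                    const std::vector<double>& residualsH,   // residuals w.r.t. the fitted H
                    double sigma)                            // expected noise level (pixels)
    {
        const double gricF = gric(residualsF, sigma, 3.0, 7.0);  // F: manifold dim d = 3, k = 7 parameters
        const double gricH = gric(residualsH, sigma, 2.0, 8.0);  // H: manifold dim d = 2, k = 8 parameters
        return gricF < gricH;
    }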
For the controlled lab conditions of the dome, the intrinsics are calculated beforehand using the well-known algorithm of Tsai [17], providing us with a partial calibration. Given the intrinsics, calculating the motion of the camera for one sweep of the gantry boils down to the simpler problem of relative pose estimation. After feature extraction and pairwise matching between consecutive images, the relative poses can be estimated. Notice that we do not rely on the mechanical settings of the gantry and turntable to obtain the extrinsic camera calibration; they are only used to put the reconstructions of sweeps for different positions of the turntable approximately in the same global frame. Feature matching between images of neighboring reconstructions (sweeps), followed by a bundle adjustment including all reconstructions, results in the final calibration.

Since the calibration is known at this point, the epipolar geometry can be used to constrain the dense correspondence search between image pairs to a 1-D search range. Image pairs are warped so that epipolar lines coincide with the image scan lines [18]. In addition to the epipolar geometry, other constraints are used to guide the correspondence search towards the most probable scan-line match using a dynamic programming scheme [19]. The pairwise disparity estimation allows us to compute independent depth estimates for each camera viewpoint. An optimal joint estimate is achieved by fusing all independent estimates into a common 3D model using a Kalman filter and controlled correspondence linking [20]. Besides the depth estimate, two other values are generated that can be used as quality measurements. The first expresses the certainty of the resulting estimate, while the second indicates how many images were involved in the linking of a particular pixel. They are used in the post-processing of the depth-maps, as discussed next.

2.3. Depth-map post-processing

The obtained depth-maps are, however, not directly suitable for our purposes, and some post-processing steps are in order. In a first step, depth values with a low linking count are removed. This leaves some undefined regions in the depth-map. The depth values in these regions are then reconstructed from nearby valid depths through a dilation-diffusion process.
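A minimal sketch of this clean-up step is given below, assuming the depth-map is stored as a flat float array with a per-pixel linking count; the threshold and the fixed number of dilation-diffusion iterations are illustrative choices rather than values taken from the paper.

    #include <vector>

    // Invalidate unreliable depths and fill the resulting holes by repeatedly
    // averaging valid neighbours (a simple dilation-diffusion process).
    void cleanDepthMap(std::vector<float>& depth,           // w*h depth values
                       const std::vector<int>& linkCount,   // images linked per pixel
                       int w, int h,
                       int minLinks = 3,                    // illustrative threshold
                       int iterations = 50)                 // illustrative iteration count
    {
        const float INVALID = -1.0f;
        for (int i = 0; i < w * h; ++i)
            if (linkCount[i] < minLinks) depth[i] = INVALID;

        for (int it = 0; it < iterations; ++it) {
            std::vector<float> next = depth;
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x) {
                    if (depth[y * w + x] != INVALID) continue;
                    float sum = 0.0f; int cnt = 0;
                    for (int dy = -1; dy <= 1; ++dy)
                        for (int dx = -1; dx <= 1; ++dx) {
                            int nx = x + dx, ny = y + dy;
                            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                            float d = depth[ny * w + nx];
                            if (d != INVALID) { sum += d; ++cnt; }
                        }
                    if (cnt > 0) next[y * w + x] = sum / cnt;   // grow inward from the valid border
                }
            depth.swap(next);
        }
    }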
Secondly, in order to compact the information given by the depth-maps, we convert and store them as restricted quadtrees (RQT), using the efficient depth-image representation of [12]. An example of such a quadtree can be seen in figure 3. This piece-wise linear hierarchical approximation of the depth-maps allows real-time adaptive generation of view-dependent, crack-free, triangulated depth-meshes. The so-called rubber-sheet triangles, which are introduced by surface interpolation over depth discontinuities, are eliminated, and a per-vertex quality is stored for each vertex. This quality measure gives a high weight to depth-map regions perpendicular to the view direction, as they are adequately sampled from the current view. The construction of these quadtrees is computationally intensive, but can be done beforehand in an offline step. For more in-depth information about these criteria we refer the reader to [12].

Figure 2. Schematic overview of the offline (left) and online (right) processing steps of our IBR system.

3. Online processing in the IBR Pipeline

3.1. Memory Management

At start-up, only the intrinsic and extrinsic camera parameters of all available cameras are loaded into the system, since our system needs to handle complex scenes with a large number (over 200) of captured images. This allows for a fast initialization of the rendering system. Upon generating a novel view, the necessary resources, being the textures and depth-image data, have to be read from disk. To decrease the load time, the reconstructed quadtrees are stored in a binary format, while the textures are available in a compressed format which can be uploaded directly to the graphics card. As memory can be a limiting factor when dealing with a large number of reference views, memory management has been included in the rendering pipeline, which deletes the resources of the least recently used cameras if not enough free memory is available in RAM or on the graphics board.

3.2. Pipeline overview

We now give a short description of each step in the online part of the pipeline; a compact sketch of the resulting per-frame loop follows the list.

1. From the set of available cameras, select the n most suitable cameras based on the given virtual viewpoint from which a novel view has to be synthesized (see section 3.3).

2. Adaptively triangulate the depth-meshes of the selected cameras and generate segmented triangle strips (see section 3.4).

3. Render all triangle strips without illumination and texturing to obtain the ε-Z-buffer of the scene, seen from the virtual viewpoint (see section 3.5).

4. Render each tessellated depth-mesh a second time from the virtual viewpoint, with its texture and per-vertex quality stored in the alpha channel. During rendering, the Z-buffer is set to read-only and the fixed graphics pipeline is altered by enabling a shader program which computes the per-pixel final weights. Each resulting image is copied into a separate RGBA image, where the warped texture is stored in RGB and the alpha channel contains the final weight (see section 3.6).

5. Synthesize the novel view by rendering all obtained images onto a full-screen quad using multi-texturing under orthographic projection, while blending is performed via a second normalization shader program. This fragment shader normalizes the alpha values of all textures in each pixel and blends the RGB values accordingly (see section 3.7).
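The sketch below shows how these five steps might fit together in a per-frame loop. The types and helper functions are placeholders for illustration only and do not correspond to an actual interface.

    #include <vector>
    #include <cstddef>

    // Illustrative placeholder types standing in for the renderer's internals.
    struct Camera {};        // calibrated reference view (intrinsics/extrinsics, texture, RQT)
    struct DepthMesh {};     // view-dependent tessellation stored as triangle strips
    struct ViewPoint {};     // requested virtual viewpoint
    struct RGBATexture {};   // warped texture in RGB, blending weight in alpha

    // Hypothetical helpers corresponding to sections 3.3-3.7 (stubs for illustration).
    static std::vector<Camera*> selectCameras(const ViewPoint&, std::size_t n) { return std::vector<Camera*>(n, nullptr); }
    static DepthMesh tessellate(Camera*, float /*pixelErrorTolerance*/) { return DepthMesh(); }
    static void renderEpsilonZBuffer(const std::vector<DepthMesh>&, const ViewPoint&) {}
    static RGBATexture warpAndWeight(Camera*, const DepthMesh&, const ViewPoint&) { return RGBATexture(); }
    static void blendNormalized(const std::vector<RGBATexture>&) {}

    // Per-frame loop mirroring steps 1-5 of section 3.2.
    void renderNovelView(const ViewPoint& v, std::size_t n, float tolerance)
    {
        std::vector<Camera*> cams = selectCameras(v, n);                 // 1. camera selection

        std::vector<DepthMesh> meshes;
        for (std::size_t i = 0; i < cams.size(); ++i)
            meshes.push_back(tessellate(cams[i], tolerance));            // 2. adaptive tessellation

        renderEpsilonZBuffer(meshes, v);                                 // 3. relaxed visibility pass

        std::vector<RGBATexture> layers;
        for (std::size_t i = 0; i < cams.size(); ++i)
            layers.push_back(warpAndWeight(cams[i], meshes[i], v));      // 4. warp + blending weights

        blendNormalized(layers);                                         // 5. normalize and blend
    }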
3.3. Camera Selection

From the whole set of cameras at our disposal, not all are equally suited for rendering from a certain virtual viewpoint, and therefore a subset of cameras is selected. In [4] the authors developed a ranking criterion for ordering the reference views with respect to the requested virtual viewpoint. This criterion combines the proximity and the difference in viewing angle of the real cameras with respect to the virtual viewpoint. As the cameras are then listed in order of descending quality, we can select the n best cameras from which to generate the novel view.

Loading the data of several cameras at once, however, can lead to a temporary drop in the frame rate. We therefore propose a more opportunistic selection process. From the ordered list of cameras, we first select the n best cameras. We then loop backwards over this subset of cameras and swap each unloaded camera with the next best, already loaded, camera that is not yet in our subset. This procedure is repeated until our subset of cameras includes at most one camera whose information has not been loaded. This kind of opportunistic selection avoids serious frame-rate drops when the user moves quickly to a viewpoint for which no or few real cameras have been loaded. It may, however, result in a temporarily less ideal rendering, as cameras with a lower quality are used. Fortunately, this problem resolves itself automatically after a few frames.
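A minimal sketch of this opportunistic selection is given below, assuming the cameras have already been ordered by the ranking criterion of [4] (best first) and that it is known for each camera whether its resources are resident; the data layout is illustrative.

    #include <vector>
    #include <algorithm>

    struct RankedCamera {
        int  id;        // index into the full camera set
        bool loaded;    // textures and depth-image data already resident?
    };

    // 'ranked' is the full camera list, best first. Returns at most n camera ids,
    // preferring already-loaded cameras so that at most one camera still has to
    // be streamed in for this frame.
    std::vector<int> opportunisticSelect(const std::vector<RankedCamera>& ranked, std::size_t n)
    {
        n = std::min(n, ranked.size());
        std::vector<std::size_t> subset(n);
        for (std::size_t i = 0; i < n; ++i) subset[i] = i;       // start from the n best

        std::size_t nextCandidate = n;                           // next best camera outside the subset
        // Loop backwards over the subset and swap unloaded entries for loaded ones,
        // until at most one unloaded camera remains in the subset.
        for (std::size_t i = n; i-- > 0; ) {
            std::size_t unloadedInSubset = 0;
            for (std::size_t j = 0; j < n; ++j)
                if (!ranked[subset[j]].loaded) ++unloadedInSubset;
            if (unloadedInSubset <= 1) break;

            if (!ranked[subset[i]].loaded) {
                while (nextCandidate < ranked.size() && !ranked[nextCandidate].loaded)
                    ++nextCandidate;                             // find next best loaded camera
                if (nextCandidate >= ranked.size()) break;       // nothing loaded left to swap in
                subset[i] = nextCandidate++;
            }
        }

        std::vector<int> ids;
        for (std::size_t j = 0; j < n; ++j) ids.push_back(ranked[subset[j]].id);
        return ids;
    }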
3.4. Depth-mesh tessellation

For each selected camera, we tessellate the corresponding restricted quadtree according to an image-space geometric error tolerance [12]. In case view-independent tessellations are used, this only has to be done if none is currently loaded in memory. The tessellation is stored as segmented triangle strips to speed up rendering and save memory. No texture coordinates are saved, as we use automatic texture coordinate generation to project the textures onto the triangle strips. The color values of the vertices are set to (1.0, 1.0, 1.0, ρ), where ρ is the per-vertex quality. This approach differs from [12], where the color values of each vertex are set to (ρ, ρ, ρ, ρ), which in some cases suffers from color quantization issues, as explained in section 3.7.

Figure 3. Left: a reference view from the "Car Crash" sequence. Right: wireframe of the computed restricted quadtree. Bottom: the calibrated camera path.

3.5. Rendering the ε-Z-buffer

In order to blend between depth-meshes, the standard Z-buffer visibility test cannot be used, since all depth-meshes are noisy to some degree. We therefore relax the visibility test by generating an ε-Z-buffer in the space of the novel view, which has slightly lower z-values. This step is identical to Pajarola [12] and is therefore not elaborated further here.

3.6. Warping textures while evaluating the blending field

The key property of image-based rendering techniques is the combination of different images to render a new virtual view. Not every image that is used for the rendering has the same impact in all areas of the new image. To this end a blending field is sampled that describes the relative importance of each camera for every pixel in the virtual view.

3.6.1. Optimized blending field sampling. In the original proposal, Buehler et al. [1] triangulate the virtual image plane using a regular grid in combination with the projection of a manually created low-resolution geometric proxy, using constrained Delaunay triangulation to capture the most interesting spatial variations. We, however, already have a good approximation of the scene. The obtained quadtree triangulations are piece-wise linear approximations of the surface and therefore have a denser sampling of points in areas with high curvature. This makes the vertices of these tessellations ideal sampling positions. The non-regular, adaptive sampling of blending fields was also mentioned as future work in [1].

An alternative way to sample the blending field thus presents itself. Instead of uniformly sampling the blending field once, we resample the blending field for each camera, each time using the vertices of the tessellation of its corresponding depth-mesh as sample positions. Using different sample positions for each camera is not a problem if the depth-maps, from which the depth-meshes were created, are good approximations of the scene surface. Using adaptive triangulation via restricted quadtrees, we can safely assume that the tessellations are a good description of the real scene surface, provided the obtained depth-maps are consistent. This emphasizes the importance of cleaning up the depth-maps in the post-processing step discussed in section 2. Hence, although we sample the blending field at different positions for each camera, the resulting blending fields (after Gouraud shading) will be almost identical. We want to draw the reader's attention to the fact that, in this case, the whole virtual view is not necessarily sampled. However, the parts of the virtual view which are left unsampled are those where no scene is present.

3.6.2. Computation of blending weights. The downside of re-sampling the blending field for each selected camera is more than compensated by the fact that we no longer have to recreate a (uniform) proxy for each frame. Furthermore, the sampling can be performed fully on the GPU using a shader program. The implementation of the shader itself is discussed in more detail in section 4.

For each selected reference view, we render the tessellated depth-mesh from the virtual viewpoint together with its corresponding texture using projective texture coordinate generation. At the same time, the color values of the vertices are set to (1.0, 1.0, 1.0, ρ), where ρ is the per-vertex quality. During rendering, the previously computed ε-Z-buffer is set to read-only and the blending field shader is enabled. The shader computes a per-pixel weight w_k^final (see section 4.2) for the current reference view C_k. The result is rendered into a separate RGBA texture where each pixel contains (tex_R(s,t), tex_G(s,t), tex_B(s,t), w_k^final).

Notice that, in areas which are not seen by all depth-meshes, it can sometimes happen that a low alpha value is strongly amplified by the normalization. Since the RGB values of the warped 8-bit textures are not pre-multiplied with the blending weights, as proposed in [12], our method does not suffer from color quantization in these cases.
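As an illustration of the warping pass just described, the host side could be organized roughly as follows, using OpenGL 2.0-era calls (assuming the GL 2.0 entry points are available, e.g., through GLEW). The identifiers are illustrative, the texture-matrix set-up is omitted, and a render-to-texture extension could replace the framebuffer copy.

    #include <GL/glew.h>

    // Hypothetical externals standing in for the renderer's state.
    extern GLuint blendFieldProgram;                 // compiled/linked blending-field shader
    extern GLint  uCameraID;                         // uniform location of 'cameraID'
    extern void   drawDepthMeshStrips(int camera);   // issues the segmented triangle strips

    // Warp reference view k into an RGBA texture: RGB = projected texture,
    // ALPHA = blending-field weight times per-vertex quality.
    void warpReferenceView(int k, GLuint resultTexture, int width, int height)
    {
        glDepthMask(GL_FALSE);                 // epsilon-Z-buffer is read-only during this pass
        glUseProgram(blendFieldProgram);       // replaces the fixed pipeline
        glUniform1i(uCameraID, k);             // tell the shader which camera's weight to output

        drawDepthMeshStrips(k);                // projective texturing via texture matrix k

        // Copy the framebuffer (RGB = warped texture, A = final weight) into a texture.
        glBindTexture(GL_TEXTURE_2D, resultTexture);
        glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);

        glUseProgram(0);
        glDepthMask(GL_TRUE);
    }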
Figure 4. Top: Flow of the online processing steps of our shader-based IBR system. Bottom: Detailed visualization of the two shader programs, showing the (limited) data transfer between CPU and GPU.

3.7. Synthesizing the novel view

Once the textures for all selected cameras have been computed, they are orthographically projected onto a full-screen quad using multi-texturing, while a second shader program is enabled to blend the textures based on their respective weights (stored in the alpha component). The weights need to be renormalized, as their sum is not necessarily equal to one. The two reasons are the multiplication of the weights by the per-vertex quality and the fact that not all pixels are covered by the depth-meshes of all selected cameras.

4. Shader-based blending field computations

As discussed in section 3.6, we separately render the tessellated depth-mesh from the virtual view C_v for each selected camera C_k, together with its corresponding texture.

4.1. Shader initialization

At the beginning of this stage, before warping all textures, the blending field shader is initialized and enabled. The camera positions of the selected reference views and the position of the virtual viewpoint are passed to the shader, together with some penalty factors set by the user. These parameters stay fixed for the current rendering loop. Before warping each individual texture, the ID of the current real camera is also set. This tells the shader for which camera we want to store the weights.

4.2. Blending weight computations

For each vertex p of the depth-mesh tessellation belonging to camera C_k (1 ≤ k ≤ n), the normalized blending weights w̃_k = [w̃_k^1 w̃_k^2 ... w̃_k^n] for all n selected cameras C_i (1 ≤ i ≤ n) are calculated in the vertex shader. To this end, we first compute for each vertex p of camera C_k the combined penalty pen_k^i with respect to each camera C_i. The penalty of a camera C_i at the position of a vertex belonging to view C_k is computed by combining the angular, resolution and field-of-view penalties, as described in [1]:

    pen_k^i = α pen_k^{i,angular} + β pen_k^{i,resolution} + γ pen_k^{i,fov}    (1)

The angular penalty pen_k^{i,angular} is defined from the angular difference between the rays pC_i and pC_v, which are respectively the rays from the vertex p to the position of camera C_i and to the virtual camera C_v:

    pen_k^{i,angular} = 1 − (pC_i · pC_v) / (||pC_i|| ||pC_v||)    (2)

We approximate the resolution penalty pen_k^{i,resolution}, similarly to [1], by only penalizing under-sampling:

    pen_k^{i,resolution} = max(0, ||pC_i|| − ||pC_v||)    (3)

The field-of-view penalty pen_k^{i,fov} is especially important when we want to render whole scenes. We do not want to give importance to cameras for which the point p is projected outside their field-of-view, as this leads to highly visible artifacts at the borders of depth-meshes (see figure 5). In order to keep the resulting blending field smooth, a penalty has been implemented in the shader which goes towards infinity as the vertex p is projected closer to the border of the field-of-view of camera C_i. By using texture coordinate generation, we can use the cameras' texture projection matrices to calculate the texture coordinates of vertex p within each camera view. These texture coordinates can then be used to determine the direction of the ray through vertex p with respect to the field-of-view.

The angular and resolution penalties are calculated based on the positions of the virtual camera and the n selected cameras. Since the positions of the cameras are not known to a vertex shader by default, we initialize these shader variables at the beginning of the rendering loop. The texture projection matrices are always accessible from the vertex shader and do not have to be passed on via additional variables.

Once the penalties of all selected reference views at a certain vertex have been determined, we can compute how each view is weighted to reconstruct the final image:

    w̃_k^i = w_k^i / Σ_{i=1..n} w_k^i    (4)

Finally, the normalized weight w̃_k^k of the current camera C_k for vertex p is passed to the fragment shader, together with the per-vertex quality ρ. During this stage, Gouraud shading is enabled to obtain both a per-pixel quality ρ and a per-pixel blending field weight w_k^k. In the fragment shader, the alpha value of each pixel in the virtual view is set to the final weight w_k^final for the corresponding camera C_k:

    w_k^final = ρ · w_k^k    (5)

    [vertex shader]
    const int n;                      // number of selected cameras used for IBR of current frame
    uniform int cameraID;             // ID of currently selected camera
    uniform vec3 virtualCamera;       // position of virtual camera
    uniform vec3 cameras[n];          // array containing positions of all selected cameras
    uniform float alphas[3];          // relative weights for the different penalties
    varying vec2 ProjCoord;           // texture coordinate via tex. coord. generation

    void main(void)
    {
        vec4 texcoord = gl_TextureMatrix[cameraID] * gl_Vertex;
        ProjCoord = texcoord.xy / texcoord.w;     // generate texture coordinates
        // 1. calculate the different penalties (angular, field-of-view, resolution)
        // 2. convert them to weights, enforce epipolar consistency and normalize
        gl_FrontColor.r = weight[cameraID];       // store weight[cameraID] in red channel
        gl_FrontColor.a = gl_Color.a;             // store the vertex quality in alpha channel
        gl_Position = ftransform();
    }

    [fragment shader]
    uniform sampler2D texUnit[n];
    varying vec2 ProjCoord;

    void main(void)
    {
        vec4 texColor = texture2D(texUnit[cameraID], ProjCoord);
        // store texture color in RGB channels
        // store (interpolated) blending field weight * vertex quality in alpha channel
        float weightedquality = gl_Color.r * gl_Color.a;
        gl_FragColor = vec4(texColor.rgb, weightedquality);
    }

    Listing 1. Lay-out of the OGLSL code of the blending field shader.
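To make the data transfer of section 4.1 concrete, the host side could set the uniforms of Listing 1 roughly as sketched below (an illustrative OpenGL 2.0 sketch, assuming the entry points are available, e.g., through GLEW; only the camera ID changes between individual warps).

    #include <GL/glew.h>

    // Pass the per-frame parameters of section 4.1 to the blending-field shader.
    // 'program' is the linked OGLSL program of Listing 1; 'n' is the number of
    // selected cameras; positions are given as x, y, z triples.
    void initBlendFieldShader(GLuint program, int n,
                              const float* cameraPositions,   // n * 3 floats
                              const float* virtualCameraPos,  // 3 floats
                              const float alphas[3])          // penalty weights alpha, beta, gamma
    {
        glUseProgram(program);
        glUniform3fv(glGetUniformLocation(program, "cameras"), n, cameraPositions);
        glUniform3fv(glGetUniformLocation(program, "virtualCamera"), 1, virtualCameraPos);
        glUniform1fv(glGetUniformLocation(program, "alphas"), 3, alphas);
        // These stay fixed for the whole rendering loop.
    }

    // Called once before warping each individual reference view.
    void setCurrentCamera(GLuint program, int cameraID)
    {
        glUniform1i(glGetUniformLocation(program, "cameraID"), cameraID);
    }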
The RGB values obtained by projecting the camera's texture onto the depth-mesh are kept unaltered, and the resulting image is copied into an RGBA texture. The general lay-out of the shader is given in listing 1. The GPU version of the pipeline is shown in figure 4, which is deliberately depicted in a similar fashion as in [12] for easy comparison.

5. Experiments

We have tested our image-based rendering pipeline on a variety of footage from scenes acquired by different methods. One scene was recorded using a hand-held video camera, while another was taken with a hand-held digital still camera. The "Skull" has been captured using our dome. An overview of the footage, together with some additional information, is given in table 1. All examples were rendered using an nVidia GeForce FX 5200 graphics card at frame rates of 5 to 10 fps. The dimensions of the rendered novel views are also listed in table 1.

Table 1. List of acquired scenes.

    Scene        Scene diameter   Acquisition        # Images   Resolution    Novel view
    Car Crash    5 m              hand-held video    91         720 x 576     720 x 576
    Ename Site   20 m             hand-held stills   14         3072 x 2048   768 x 512
    Skull        25 cm            stills (dome)      237        1600 x 1200   800 x 600

Figure 5. Top to bottom: cropped view of the reference image; novel view generated from 4 viewpoints with the original viewpoint excluded, using the proposed method; visualization of the weights; novel view using one relative weight per viewpoint as in [10] (combined with per-vertex quality); visualization of the weights.

Figure 6. Left: original image. Right: virtual image, generated from 3 viewpoints with the original viewpoint excluded.

We evaluate the proposed algorithm using the "Car Crash" scene, which has been recorded with a Sony 3CCD digital video camera. As the scene has been recorded from close by, each view only sees part of the scene. The different positions from which images were taken, together with a rough visualization of the scene, are shown in figure 3. The scene is furthermore geometrically complex and specular in the area of the headlights. In figure 5, a partial view of a reference image is shown together with the reconstruction from 4 neighboring views. For the visualization of the weights, each camera is assigned a color; the resulting pixel color is obtained by blending these colors based on their respective weights. Although the PSNR of the novel view created using only one weight per camera (as done in [10]) is hardly different from that of our proposed method (around 22.5 dB in both cases), clear artifacts (marked by ellipses in figure 5) can be noticed. To the left, a clear discontinuity appears because part of the scene visible in the novel view lies outside a camera's field-of-view. Secondly, the dense sampling of the blending field, in contrast to a single weight per camera, allows for a more faithful reconstruction of the specular highlights in the headlights.

In figure 6, a reference view and a reconstruction using three viewpoints with the original viewpoint excluded are shown for the "Car Crash" and "Ename Site" scenes. Finally, in figure 7, a novel view of the "Skull" sequence is shown using three cameras. An original image is shown together with its rendered counterpart and the computed blending weights, where each of the 3 cameras is assigned a primary color. Notice the gradual changes, due to a shift of camera importance, together with the local perturbations due to the per-vertex quality.
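For reference, the PSNR values quoted above follow the usual definition for 8-bit images (this reminder is ours, not part of the original evaluation):

    PSNR = 10 · log10( 255^2 / MSE )   [dB]

where MSE is the mean squared error between the rendered novel view and the corresponding reference image.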
6. Conclusion

We have described a new image-based rendering approach that can handle uncalibrated images taken from hand-held video or digital stills. The combination of depth-map computation from stereo, efficient depth-image warping and non-uniformly sampled blending field weights results in an IBR framework that handles scenes which are only partly visible from each camera, as well as more specular scenes, in a visually pleasing way. The pipeline can handle complex scenes and a large number of reference views, while high-level shader programs are used to allow for the real-time rendering of novel views.

However, the accuracy of the reconstruction is of course dependent on the quality of the depth-maps. Furthermore, although a good blending between cameras is assured in the case of large scenes, there still exists a "popping effect" due to the camera selection procedure when a new set of cameras is selected for the creation of the next novel view. One solution would be to include a criterion that tries to maximize the "coverage" of the scene. Another possibility is to select a larger set of cameras and to decide in the shader, by sorting the cameras on a vertex-per-vertex basis, which n cameras should be used for blending. Such implementations are currently not possible due to the limited size of shader programs, but next-generation hardware will most likely enable this possibility in the near future.

Figure 7. Left to right: original view; virtual image generated from 3 viewpoints with the original viewpoint excluded; computed blending weights.

Acknowledgements

The authors gratefully acknowledge support from the European IST project INVIEW (IST-2000-28459) and the European Network of Excellence EPOCH (IST-2002-507382). Co-funding by the K.U. Leuven Research Council GOA project 'MARVEL' is also gratefully acknowledged.

References

[1] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. Proc. SIGGRAPH 2001, pages 425-432, 2001.
[2] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum. Plenoptic sampling. Proc. SIGGRAPH 2000, pages 307-318, 2000.
[3] P. Debevec, Y. Yu, and G. Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. Proc. Eurographics Rendering Workshop 1998, pages 105-116, 1998.
[4] J.-F. Evers-Senne and R. Koch. Image based interactive rendering with view dependent geometry. Proc. Eurographics 2003, Computer Graphics Forum, pages 573-582, 2003.
[5] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. Proc. SIGGRAPH 1996, pages 43-54, 1996.
[6] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 1998.
[7] B. Heigl, R. Koch, M. Pollefeys, J. Denzler, and L. Van Gool. Plenoptic modeling and rendering from image sequences taken by hand-held camera. Proc. DAGM'99, pages 94-101, 1999.
[8] R. Koch, M. Pollefeys, B. Heigl, L. Van Gool, and H. Niemann. Calibration of hand-held camera sequences for plenoptic modeling. Proc. ICCV (1), pages 585-591, 1999.
[9] M. Levoy and P. Hanrahan. Light field rendering. Proc. SIGGRAPH 1996, pages 31-42, 1996.
[10] K. Mueller, A. Smolic, P. Merkle, B. Kaspar, P. Eisert, and T. Wiegand. 3D reconstruction of natural scenes with view-adaptive multi-texturing. Proc. 3DPVT, pages 116-123, 2004.
[11] M. Sainz, R. Pajarola, and A. Susin. Photorealistic image-based objects from uncalibrated images. Posters of IEEE Visualization Conference (VIS'03), 2003.
[12] R. Pajarola, M. Sainz, and Y. Meng. DMesh: Fast depth-image meshing and warping. Int. J. Image Graphics, 4(4):653-681, 2004.
[13] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207-232, 2004.
[14] R. J. Rost, J. M. Kessenich, B. Lichtenbelt, and M. Olano. OpenGL Shading Language. Addison-Wesley Professional, 2004.
[15] P. Torr, A. Fitzgibbon, and A. Zisserman. Maintaining multiple motion model hypotheses over many views to recover matching and structure. Proc. International Conference on Computer Vision, pages 485-491, 1998.
[16] M. Pollefeys, L. Van Gool, M. Vergauwen, K. Cornelis, F. Verbiest, and J. Tops. Video-to-3D. ISPRS Commission V Symposium, Corfu, Greece, September 2002.
[17] R. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. Proc. Computer Vision and Pattern Recognition, 1986.
[18] M. Pollefeys, R. Koch, and L. Van Gool. A simple and efficient rectification method for general motion. Proc. ICCV'99, Corfu, Greece, pages 496-501, 1999.
[19] G. Van Meerbergen, M. Vergauwen, M. Pollefeys, and L. Van Gool. A hierarchical symmetric stereo algorithm using dynamic programming. International Journal of Computer Vision, 47(1-3):275-285, 2002.
[20] R. Koch, M. Pollefeys, and L. Van Gool. Multi viewpoint stereo from uncalibrated video sequences. Proc. European Conference on Computer Vision, Freiburg, Germany, pages 55-71, 1998.