DUSt3R: Geometric 3D Vision Made Easy
Shuzhe Wang∗, Vincent Leroy†, Yohann Cabon†, Boris Chidlovskii† and Jerome Revaud†
∗Aalto University †Naver Labs Europe
[email protected] [email protected]
arXiv:2312.14132v1 [cs.CV] 21 Dec 2023
Figure 1. Overview: Given an unconstrained image collection, i.e. a set of photographs with unknown camera poses and intrinsics, our
proposed method DUSt3R outputs a set of corresponding pointmaps, from which we can straightforwardly recover a variety of geometric
quantities normally difficult to estimate all at once, such as the camera parameters, pixel correspondences, depthmaps, and fully-consistent
3D reconstruction. Note that DUSt3R also works for a single input image (e.g. achieving in this case monocular reconstruction). We also
show qualitative examples on the DTU, Tanks and Temples and ETH-3D datasets [1, 50, 107] obtained without known camera parameters.
For each sample, from left to right: input image, colored point cloud, and rendered with shading for a better view of the underlying geometry.
1. Introduction
Unconstrained image-based dense 3D reconstruction from multiple views is one of a few long-researched end-goals of computer vision [24, 71, 89]. In a nutshell, the task aims at estimating the 3D geometry and camera parameters of a particular scene, given a set of photographs of this scene. Not only does it have numerous applications, such as mapping [13, 72], navigation [15], archaeology [86, 132], cultural heritage preservation [38] and robotics [78], but, perhaps more importantly, it holds a fundamentally special place among all 3D vision tasks. Indeed, it subsumes nearly all of the other geometric 3D vision tasks. Thus, modern approaches for 3D reconstruction consist in assembling the fruits of decades of advances in various sub-fields such as keypoint detection [26, 28, 62, 96] and matching [10, 59, 99, 119], robust estimation [3, 10, 180], Structure-from-Motion (SfM) and Bundle Adjustment (BA) [20, 58, 105], dense Multi-View Stereo (MVS) [106, 138, 157, 175], etc.
In the end, modern SfM and MVS pipelines boil down to solving a series of minimal problems: matching points, finding essential matrices, triangulating points, sparsely reconstructing the scene, estimating cameras and finally performing dense reconstruction. Considering recent advances, this rather complex chain is of course a viable solution in some settings [31, 70, 76, 142, 145, 147, 162], yet we argue it is quite unsatisfactory: each sub-problem is not solved perfectly and adds noise to the next step, increasing the complexity and the engineering effort required for the pipeline to work as a whole. In this regard, the absence of communication between each sub-problem is quite telling: it would seem more reasonable if they helped each other, i.e. dense reconstruction should naturally benefit from the sparse scene that was built to recover camera poses, and vice-versa. On top of that, key steps in this pipeline are brittle and prone to break in many cases [58]. For instance, the crucial stage of SfM that serves to estimate all camera parameters is typically known to fail in many common situations, e.g. when the number of scene views is low [108], for objects with non-Lambertian surfaces [16], in case of insufficient camera motion [13], etc. This is concerning, because in the end, “an MVS algorithm is only as good as the quality of the input images and camera parameters” [32].
In this paper, we present DUSt3R, a radically novel approach for Dense Unconstrained Stereo 3D Reconstruction from un-calibrated and un-posed cameras. The main component is a network that can regress a dense and accurate scene representation solely from a pair of images, without prior information about the scene or the cameras (not even the intrinsic parameters). The resulting scene representation is based on 3D pointmaps with rich properties: they simultaneously encapsulate (a) the scene geometry, (b) the relation between pixels and scene points and (c) the relation between the two viewpoints. From this output alone, practically all scene parameters (i.e. cameras and scene geometry) can be straightforwardly extracted. This is possible because our network jointly processes the input images and the resulting 3D pointmaps, thus learning to associate 2D structures with 3D shapes, and having the opportunity to solve multiple minimal problems simultaneously, enabling internal ‘collaboration’ between them.
Our model is trained in a fully-supervised manner using a simple regression loss, leveraging large public datasets for which ground-truth annotations are either synthetically generated [68, 103], reconstructed with SfM software [55, 161] or captured using dedicated sensors [25, 93, 121, 165]. We drift away from the trend of integrating task-specific modules [164], and instead adopt a fully data-driven strategy based on a generic transformer architecture, not enforcing any geometric constraints at inference, but being able to benefit from powerful pretraining schemes. The network learns strong geometric and shape priors, which are reminiscent of those commonly leveraged in MVS, like shape from texture, shading or contours [110].
To fuse predictions from multiple image pairs, we revisit bundle adjustment (BA) for the case of pointmaps, thereby achieving full-scale MVS. We introduce a global alignment procedure that, contrary to BA, does not involve minimizing reprojection errors. Instead, we optimize the camera pose and geometry alignment directly in 3D space, which is fast and shows excellent convergence in practice. Our experiments show that the reconstructions are accurate and consistent between views in real-life scenarios with various unknown sensors. We further demonstrate that the same architecture can handle real-life monocular and multi-view reconstruction scenarios seamlessly. Examples of reconstructions are shown in Fig. 1 and in the accompanying video.
In summary, our contributions are fourfold. First, we present the first holistic end-to-end 3D reconstruction pipeline from un-calibrated and un-posed images, which unifies monocular and binocular 3D reconstruction. Second, we introduce the pointmap representation for MVS applications, which enables the network to predict the 3D shape in a canonical frame, while preserving the implicit relationship between pixels and the scene. This effectively drops many constraints of the usual perspective camera formulation. Third, we introduce an optimization procedure to globally align pointmaps in the context of multi-view 3D reconstruction. Our procedure can effortlessly extract all usual intermediary outputs of the classical SfM and MVS pipelines. In a sense, our approach unifies all 3D vision tasks and considerably simplifies the traditional reconstruction pipeline, making DUSt3R seem simple and easy in comparison. Fourth, we demonstrate promising performance on a range of 3D vision tasks. In particular, our all-in-one model achieves state-of-the-art results on monocular and multi-view depth benchmarks, as well as multi-view camera pose estimation.
2. Related Work
For the sake of space, we summarize here the most related works in 3D vision, and refer the reader to the appendix in Sec. C for a more comprehensive review.
Structure-from-Motion (SfM) [20, 21, 44, 47, 105] aims at reconstructing sparse 3D maps while jointly determining camera parameters from a set of images. The traditional pipeline starts from pixel correspondences obtained from keypoint matching [4, 5, 42, 62, 98] between multiple images to determine geometric relationships, followed by bundle adjustment to optimize 3D coordinates and camera parameters jointly. Recently, the SfM pipeline has undergone substantial enhancements, particularly with the incorporation of learning-based techniques into its subprocesses. These improvements encompass advanced feature description [26, 28, 96, 134, 166], more accurate image matching [3, 17, 59, 81, 99, 119, 125, 144], featuremetric refinement [58], and neural bundle adjustment [57, 152]. Despite these advancements, the sequential structure of the SfM pipeline persists, making it vulnerable to noise and errors in each individual component.
Multi-View Stereo (MVS) is the task of densely reconstructing visible surfaces, which is achieved via triangulation between multiple viewpoints. In the classical formulation of MVS, all camera parameters are supposed to be provided as inputs. The fully handcrafted [32, 34, 106, 146, 174], the more recent scene optimization based [31, 70, 75, 76, 142, 145, 147, 162], or learning based [52, 64, 85, 160, 163, 179] approaches all depend on camera parameter estimates obtained via complex calibration procedures, either during the data acquisition [1, 23, 108, 165] or using Structure-from-Motion approaches [47, 105] for in-the-wild reconstructions. Yet, in real-life scenarios, inaccurate pre-estimated camera parameters can prevent these algorithms from working properly. In this work, we propose instead to directly predict the geometry of visible surfaces without any explicit knowledge of the camera parameters.
Direct RGB-to-3D. Recently, some approaches aiming at directly predicting 3D geometry from a single RGB image have been proposed. Since the problem is by nature ill-posed without introducing additional assumptions, these methods leverage neural networks that learn strong 3D priors from large datasets to solve ambiguities. These methods can be classified into two groups. The first group leverages class-level object priors. For instance, Pavllo et al. [82–84] propose to learn a model that can fully recover shape, pose, and appearance from a single image, given a large collection of 2D images. While this type of approach is powerful, it cannot infer the shape of objects from unseen categories. A second group of works, closest to our method, focuses instead on general scenes. These methods systematically build on or re-use existing monocular depth estimation (MDE) networks [6, 90, 168, 170]. Depth maps indeed encode a form of 3D information and, combined with camera intrinsics, can straightforwardly yield pixel-aligned 3D point-clouds. SynSin [150], for example, performs new viewpoint synthesis from a single image by rendering feature-augmented depthmaps knowing all camera parameters. Without camera intrinsics, one solution is to infer them by exploiting temporal consistency in video frames, either by enforcing a global alignment [155] or by leveraging differentiable rendering with a photometric reconstruction loss [36, 116]. Another way is to explicitly learn to predict camera intrinsics, which makes it possible to perform metric 3D reconstruction from a single image when combined with MDE [167, 169]. All these methods are, however, intrinsically limited by the quality of depth estimates, and depth estimation is arguably ill-posed in monocular settings.
In contrast, our network processes two viewpoints simultaneously in order to output depthmaps, or rather, pointmaps. In theory, at least, this makes triangulation between rays from different viewpoints possible. Multi-view networks for 3D reconstruction have been proposed in the past. They are essentially based on the idea of building a differentiable SfM pipeline, replicating the traditional pipeline but training it end-to-end [130, 135, 183]. For that, however, ground-truth camera intrinsics are required as input, and the output is generally a depthmap and a relative camera pose [135, 183]. In contrast, our network has a generic architecture and outputs pointmaps, i.e. dense 2D fields of 3D points, which handle camera poses implicitly and make the regression problem much better posed.
Pointmaps. Using a collection of pointmaps as shape representation is quite counter-intuitive for MVS, but its usage is widespread for Visual Localization tasks, either in scene-dependent optimization approaches [8, 9, 11] or scene-agnostic inference methods [95, 123, 158]. Similarly, view-wise modeling is a common theme in monocular 3D reconstruction works [56, 112, 126, 140] and in view synthesis works [150]. The idea is to store the canonical 3D shape in multiple canonical views to work in image space. These approaches usually leverage explicit perspective camera geometry, via rendering of the canonical representation.
3. Method
Before delving into the details of our method, we introduce below the essential concept of pointmaps.
Pointmap. In the following, we denote a dense 2D field of 3D points as a pointmap $X \in \mathbb{R}^{W \times H \times 3}$. In association with its corresponding RGB image $I$ of resolution $W \times H$, $X$ forms a one-to-one mapping between image pixels and 3D scene points, i.e. $I_{i,j} \leftrightarrow X_{i,j}$, for all pixel coordinates $(i, j) \in \{1 \ldots W\} \times \{1 \ldots H\}$. We assume here that each camera ray hits a single 3D point, i.e. ignoring the case of translucent surfaces.
Cameras and scene. Given the camera intrinsics $K \in \mathbb{R}^{3 \times 3}$,
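[Figure 2 diagram: each image $I^1, I^2 \in \mathbb{R}^{W \times H \times 3}$ is patchified and passed through a weight-sharing ViT encoder (tokens $F^1$, $F^2$), then through Transformer Decoder1/Decoder2, which share information, and Head1/Head2, which output the pointmaps $X^{1,1}, X^{2,1} \in \mathbb{R}^{W \times H \times 3}$ and the confidence maps $C^1, C^2 \in \mathbb{R}^{W \times H}$. Both pointmaps live in the common coordinate frame of camera 1 (image $I^1$): Camera1 is at the origin, Camera2 at an unknown position.]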
Figure 2. Architecture of the network F. Two views of a scene (I 1 , I 2 ) are first encoded in a Siamese manner with a shared ViT encoder.
The resulting token representations F 1 and F 2 are then passed to two transformer decoders that constantly exchange information via
cross-attention. Finally, two regression heads output the two corresponding pointmaps and associated confidence maps. Importantly, the two
pointmaps are expressed in the same coordinate frame of the first image I 1 . The network F is trained using a simple regression loss (Eq. (4)).
the pointmap $X$ of the observed scene can be straightforwardly obtained from the ground-truth depthmap $D \in \mathbb{R}^{W \times H}$ as $X_{i,j} = K^{-1} [i D_{i,j}, j D_{i,j}, D_{i,j}]^\top$. Here, $X$ is expressed in the camera coordinate frame. In the following, we denote as $X^{n,m}$ the pointmap $X^n$ from camera $n$ expressed in camera $m$’s coordinate frame:
$$X^{n,m} = P_m P_n^{-1} h(X^n) \tag{1}$$
with $P_m, P_n \in \mathbb{R}^{3 \times 4}$ the world-to-camera poses for images $n$ and $m$, and $h : (x, y, z) \to (x, y, z, 1)$ the homogeneous mapping.
3.1. Overview
We wish to build a network that solves the 3D reconstruction task for the generalized stereo case through direct regression. To that aim, we train a network $\mathcal{F}$ that takes as input 2 RGB images $I^1, I^2 \in \mathbb{R}^{W \times H \times 3}$ and outputs 2 corresponding pointmaps $X^{1,1}, X^{2,1} \in \mathbb{R}^{W \times H \times 3}$ with associated confidence maps $C^{1,1}, C^{2,1} \in \mathbb{R}^{W \times H}$. Note that both pointmaps are expressed in the same coordinate frame of $I^1$, which radically differs from existing approaches but offers key advantages (see Secs. 1, 2, 3.3 and 3.4). For the sake of clarity and without loss of generality, we assume that both images have the same resolution $W \times H$, but naturally in practice their resolution can differ.
Network architecture. The architecture of our network $\mathcal{F}$ is inspired by CroCo [149], making it straightforward to heavily benefit from CroCo pretraining [148]. As shown in Fig. 2, it is composed of two identical branches (one for each image), each comprising an image encoder, a decoder and a regression head. The two input images are first encoded in a Siamese manner by the same weight-sharing ViT encoder [27], yielding two token representations $F^1$ and $F^2$:
$$F^1 = \text{Encoder}(I^1), \quad F^2 = \text{Encoder}(I^2).$$
The network then reasons over both of them jointly in the decoder. Similarly to CroCo [149], the decoder is a generic transformer network equipped with cross attention. Each decoder block thus sequentially performs self-attention (each token of a view attends to tokens of the same view), then cross-attention (each token of a view attends to all other tokens of the other view), and finally feeds tokens to an MLP. Importantly, information is constantly shared between the two branches during the decoder pass. This is crucial in order to output properly aligned pointmaps. Namely, each decoder block attends to tokens from the other branch:
$$G_i^1 = \text{DecoderBlock}_i^1(G_{i-1}^1, G_{i-1}^2),$$
$$G_i^2 = \text{DecoderBlock}_i^2(G_{i-1}^2, G_{i-1}^1),$$
for $i = 1, \ldots, B$ for a decoder with $B$ blocks and initialized with encoder tokens $G_0^1 := F^1$ and $G_0^2 := F^2$. Here, $\text{DecoderBlock}_i^v(G^1, G^2)$ denotes the $i$-th block in branch $v \in \{1, 2\}$, $G^1$ and $G^2$ are the input tokens, with $G^2$ the tokens from the other branch. Finally, in each branch a separate regression head takes the set of decoder tokens and outputs a pointmap and an associated confidence map:
$$X^{1,1}, C^{1,1} = \text{Head}^1(G_0^1, \ldots, G_B^1),$$
$$X^{2,1}, C^{2,1} = \text{Head}^2(G_0^2, \ldots, G_B^2).$$
Discussion. The output pointmaps $X^{1,1}$ and $X^{2,1}$ are regressed up to an unknown scale factor. Also, it should be noted that our generic architecture never explicitly enforces any geometrical constraints. Hence, pointmaps do not necessarily correspond to any physically plausible camera model. Rather, we let the network learn all relevant priors present in the training set, which only contains geometrically consistent pointmaps. Using a generic architecture makes it possible to leverage strong pretraining techniques, ultimately surpassing what existing task-specific architectures can achieve. We detail the learning process in the next section.
3.2. Training Objective
3D Regression loss. Our sole training objective is based on regression in the 3D space. Let us denote the ground-truth pointmaps as $\bar{X}^{1,1}$ and $\bar{X}^{2,1}$, obtained from Eq. (1) along with two corresponding sets of valid pixels $\mathcal{D}^1, \mathcal{D}^2 \subseteq$
$\{1 \ldots W\} \times \{1 \ldots H\}$ on which the ground-truth is defined. The regression loss for a valid pixel $i \in \mathcal{D}^v$ in view $v \in \{1, 2\}$ is simply defined as the Euclidean distance:
$$\ell_{\text{regr}}(v, i) = \left\| \frac{1}{z} X_i^{v,1} - \frac{1}{\bar{z}} \bar{X}_i^{v,1} \right\|. \tag{2}$$
[...]
the principal point is approximately centered and pixels are square, hence only the focal $f_1^*$ remains to be estimated:
$$f_1^* = \arg\min_{f_1} \sum_{i=0}^{W} \sum_{j=0}^{H} C_{i,j}^{1,1} \left\| (i', j') - f_1 \frac{(X_{i,j,0}^{1,1}, X_{i,j,1}^{1,1})}{X_{i,j,2}^{1,1}} \right\|,$$
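The focal objective above is a confidence-weighted sum of unsquared norms, so one plausible way to minimize it is a Weiszfeld-style iteratively reweighted least-squares loop. The sketch below is only an illustration under stated assumptions: the function name `estimate_focal` is hypothetical, the pointmap is stored as an array of shape (H, W, 3), and $(i', j')$ is assumed to be the pixel coordinate relative to the image center, $(i - W/2, j - H/2)$, in line with the centered principal point mentioned in the text. It is not taken from the authors' implementation.

```python
import numpy as np

def estimate_focal(X, C, n_iter=10, eps=1e-8):
    """Estimate a single focal length from a pointmap X (H, W, 3) and a
    confidence map C (H, W) by minimizing sum C * ||p - f * q||.

    Assumptions (not from the source): array layout (H, W, 3), pixel
    coordinates measured from the image center, Weiszfeld-style solver.
    """
    H, W, _ = X.shape
    i, j = np.meshgrid(np.arange(W), np.arange(H))       # i: column, j: row
    p = np.stack([i - W / 2, j - H / 2], axis=-1)        # assumed (i', j')
    q = X[..., :2] / X[..., 2:3]                         # (X0, X1) / X2
    pq = (p * q).sum(axis=-1)
    qq = (q * q).sum(axis=-1)
    # Initialize with the confidence-weighted least-squares solution.
    f = (C * pq).sum() / (C * qq).sum()
    for _ in range(n_iter):
        # Reweight each pixel by the inverse of its current residual norm.
        r = np.linalg.norm(p - f * q, axis=-1)
        w = C / np.maximum(r, eps)
        f = (w * pq).sum() / (w * qq).sum()
    return f
```

On a pointmap generated by an ideal centered pinhole camera, the residuals vanish at the true focal, so the loop returns it after the initialization step; the reweighting only matters when the predicted pointmap is noisy.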
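As a concrete illustration of the pointmap definitions in Sec. 3, the following sketch back-projects a depthmap with $X_{i,j} = K^{-1} [i D_{i,j}, j D_{i,j}, D_{i,j}]^\top$ and re-expresses a pointmap in another camera frame as in Eq. (1). The function names are hypothetical, arrays are stored as (H, W, ...) rather than W × H, and the 3×4 poses are extended to 4×4 matrices so that $P_n^{-1}$ is well defined; these are assumptions of the sketch, not details given in the text.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Back-project a depthmap (H, W) into a pointmap (H, W, 3) expressed
    in the camera frame, using X_ij = K^-1 [i*D_ij, j*D_ij, D_ij]^T."""
    H, W = depth.shape
    i, j = np.meshgrid(np.arange(W), np.arange(H))       # i: column, j: row
    pix = np.stack([i * depth, j * depth, depth], axis=-1)
    return pix @ np.linalg.inv(K).T                      # K^-1 applied per pixel

def to_4x4(P):
    """Extend a 3x4 world-to-camera pose to a 4x4 homogeneous matrix."""
    return np.vstack([P, [0.0, 0.0, 0.0, 1.0]])

def change_frame(X_n, P_n, P_m):
    """Eq. (1): express the pointmap of camera n in the frame of camera m,
    X^{n,m} = P_m P_n^{-1} h(X^n)."""
    X_h = np.concatenate([X_n, np.ones_like(X_n[..., :1])], axis=-1)  # h(X)
    T = P_m @ np.linalg.inv(to_4x4(P_n))                 # (3, 4) transform
    return X_h @ T.T
```

With identical poses the transform is the identity, and with a pure translation between the two cameras every 3D point is simply shifted by that translation, which gives a quick sanity check of the convention.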
Figure 3. Reconstruction examples on two scenes never seen during training. From left to right: RGB, depth map, confidence map,
reconstruction. The left scene shows the raw result output from F(I 1 , I 2 ). The right scene shows the outcome of global alignment (Sec. 3.4).
and measure their overlap based on the average confidence in both pairs, then we filter out low-confidence pairs.
Global optimization. We use the connectivity graph $\mathcal{G}$ to recover globally aligned pointmaps $\{\chi^n \in \mathbb{R}^{W \times H \times 3}\}$ for all cameras $n = 1 \ldots N$. To that aim, we first predict, for each image pair $e = (n, m) \in \mathcal{E}$, the pairwise pointmaps $X^{n,n}, X^{m,n}$ and their associated confidence maps $C^{n,n}, C^{m,n}$. For the sake of clarity, let us define $X^{n,e} := X^{n,n}$ and $X^{m,e} := X^{m,n}$. Since our goal is to rotate all pairwise predictions into a common coordinate frame, we introduce a pairwise pose $P_e \in \mathbb{R}^{3 \times 4}$ and scaling $\sigma_e > 0$ associated with each pair $e \in \mathcal{E}$. We then formulate the following optimization problem:
$$\chi^* = \arg\min_{\chi, P, \sigma} \sum_{e \in \mathcal{E}} \sum_{v \in e} \sum_{i=1}^{HW} C_i^{v,e} \left\| \chi_i^v - \sigma_e P_e X_i^{v,e} \right\|. \tag{5}$$
Here, we abuse notation and write $v \in e$ for $v \in \{n, m\}$ if $e = (n, m)$. The idea is that, for a given pair $e$, the same rigid transformation $P_e$ should align both pointmaps $X^{n,e}$ and $X^{m,e}$ with the world-coordinate pointmaps $\chi^n$ and $\chi^m$, since $X^{n,e}$ and $X^{m,e}$ are by definition both expressed in the same coordinate frame. To avoid the trivial optimum where $\sigma_e = 0, \forall e \in \mathcal{E}$, we enforce that $\prod_e \sigma_e = 1$.
Recovering camera parameters. A straightforward extension to this framework makes it possible to recover all camera parameters. By simply replacing $\chi_{i,j}^n := P_n^{-1} h(K_n^{-1} [i D_{i,j}^n; j D_{i,j}^n; D_{i,j}^n])$ (i.e. enforcing a standard camera pinhole model as in Eq. (1)), we can thus estimate all camera poses $\{P_n\}$, associated intrinsics $\{K_n\}$ and depthmaps $\{D^n\}$ for $n = 1 \ldots N$.
Discussion. We point out that, contrary to traditional bundle adjustment, this global optimization is fast and simple to perform in practice. Indeed, we are not minimizing 2D reprojection errors, as bundle adjustment normally does, but 3D projection errors. The optimization is carried out using standard gradient descent and typically converges after a few hundred steps, requiring mere seconds on a standard GPU.
4. Experiments with DUSt3R
Training data. We train our network with a mixture of eight datasets: Habitat [103], MegaDepth [55], ARKitScenes [25], Static Scenes 3D [68], Blended MVS [161], ScanNet++ [165], CO3Dv2 [93] and Waymo [121]. These datasets feature diverse scene types: indoor, outdoor, synthetic, real-world, object-centric, etc. When image pairs are not directly provided with the dataset, we extract them based on the method described in [148]. Specifically, we utilize off-the-shelf image retrieval and point matching algorithms to match and verify image pairs. All in all, we extract 8.5M pairs.
Training details. During each epoch, we randomly sample an equal number of pairs from each dataset to equalize disparities in dataset sizes. We wish to feed relatively high-resolution images to our network, say 512 pixels in the largest dimension. To mitigate the high cost associated with such input, we train our network sequentially, first on 224×224 images and then on larger 512-pixel images. We randomly select the image aspect ratios for each batch (e.g. 16/9, 4/3, etc.), so that at test time our network is familiar with different image shapes. We simply crop images to the desired aspect ratio, and resize so that the largest dimension is 512 pixels.
We use standard data augmentation techniques and training set-up overall. Our network architecture comprises a ViT-Large for the encoder [27], a ViT-Base for the decoder and a DPT head [90]. We refer to the appendix in Sec. F for more details on the training and architecture. Before training, we initialize our network with the weights of an off-the-shelf CroCo pretrained model [148]. Cross-View Completion (CroCo) is a recently proposed pretraining paradigm inspired by MAE [45] that has been shown to excel on various downstream 3D vision tasks, and is thus particularly suited to our framework. We ablate in Sec. 4.6 the impact of CroCo pretraining and of the increase in image resolution.
Evaluation. In the remainder of this section, we benchmark DUSt3R on a representative set of classical 3D vision tasks, each time specifying datasets and metrics and comparing performance with existing state-of-the-art approaches. We emphasize that all results are obtained with the same DUSt3R model (our default model is denoted as ‘DUSt3R 512’, other DUSt3R models serve for the ablations in Sec. 4.6), i.e. we never finetune our model on a particular downstream task. At test time, all test images are rescaled to 512px while preserving their aspect ratio. Since there may exist different ‘routes’ to extract task-specific outputs from DUSt3R, as described in Sec. 3.3 and Sec. 3.4, we specify each time the
employed method.
Qualitative results. DUSt3R yields high-quality dense 3D reconstructions even in challenging situations. We refer the reader to the appendix in Sec. B for non-cherrypicked visualizations of pairwise and multi-view reconstructions.
4.1. Visual Localization
Datasets and metrics. We first evaluate DUSt3R for the task of absolute pose estimation on the 7Scenes [113] and Cambridge-Landmarks [48] datasets. 7Scenes contains 7 indoor scenes with RGB-D images from videos and their 6-DOF camera poses. Cambridge-Landmarks contains 6 outdoor scenes with RGB images and their associated camera poses, which are obtained via SfM. We report the median translation and rotation errors in (cm/°), respectively.
Protocol and results. To compute camera poses in world coordinates, we use DUSt3R as a 2D-2D pixel matcher (see Section 3.3) between a query and the most relevant database images obtained using off-the-shelf image retrieval AP-GeM [94]. In other words, we simply use the raw pointmaps output from $\mathcal{F}(I^Q, I^B)$ without any refinement, where $I^Q$ is the query image and $I^B$ is a database image. We use the top 20 retrieved images for Cambridge-Landmarks and top 1 for 7Scenes and leverage the known query intrinsics. For results obtained without using ground-truth intrinsic parameters, refer to the appendix in Sec. E.
We compare our results against the state of the art in Table 1 for each scene of the two datasets. Our method obtains accuracy comparable to existing approaches, whether feature-matching ones [100, 102] or end-to-end learning-based methods [11, 54, 101, 124, 151], even managing to outperform strong baselines like HLoc [100] in some cases. We believe this to be significant for two reasons. First, DUSt3R was never trained for visual localization in any way. Second, neither query images nor database images were seen during DUSt3R’s training.
4.2. Multi-view Pose Estimation
We now evaluate DUSt3R on multi-view relative pose estimation after the global alignment from Sec. 3.4.
Datasets. Following [139], we use two multi-view datasets, CO3Dv2 [93] and RealEstate10K [185], for the evaluation. CO3Dv2 contains 6 million frames extracted from approximately 37k videos, covering 51 MS-COCO categories. The ground-truth camera poses are annotated using COLMAP from 200 frames in each video. RealEstate10K is an indoor/outdoor dataset with 10 million frames from about 80K video clips on YouTube, the camera poses being obtained by SLAM with bundle adjustment. We follow the protocol introduced in [139] to evaluate DUSt3R on 41 categories from CO3Dv2 and 1.8K video clips from the test set of RealEstate10K. For each sequence, we randomly select 10 frames and feed all possible 45 pairs to DUSt3R.
Baselines and metrics. We compare DUSt3R pose estimation results, obtained either from PnP-RANSAC or global alignment, against the learning-based RelPose [176], PoseReg [139] and PoseDiffusion [139], and the structure-based PixSfM [58] and COLMAP+SPSG (COLMAP [106] extended with SuperPoint [26] and SuperGlue [99]). Similar to [139], we report the Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) for each image pair to evaluate the relative pose error and select a threshold $\tau = 15$ to report RTA@15 and RRA@15. Additionally, we calculate the mean Average Accuracy (mAA)@30, defined as the area under the accuracy curve of the angular differences at min(RRA@30, RTA@30).
Results. As shown in Table 2, DUSt3R with global alignment achieves the best overall performance on the two datasets and significantly surpasses the state-of-the-art PoseDiffusion [139]. Moreover, DUSt3R with PnP also demonstrates superior performance over both learning-based and structure-based existing methods. It is worth noting that the RealEstate10K results reported for PoseDiffusion are from the model trained on CO3Dv2. Nevertheless, we assert that our comparison is justified considering that RealEstate10K is not used during DUSt3R’s training either. We also report performance with fewer input views (between 3 and 10) in the appendix (Sec. D), in which case DUSt3R also yields excellent performance on both benchmarks.
4.3. Monocular Depth
For this monocular task, we simply feed the same input image $I$ to the network as $\mathcal{F}(I, I)$. By design, depth prediction is simply the $z$ coordinate in the predicted 3D pointmap.
Datasets and metrics. We benchmark DUSt3R on two outdoor (DDAD [40], KITTI [35]) and three indoor (NYUv2 [114], BONN [79], TUM [118]) datasets. We compare DUSt3R’s performance to state-of-the-art methods categorized in supervised, self-supervised and zero-shot settings, this last category corresponding to DUSt3R. We use two metrics commonly used in monocular depth evaluations [6, 116]: the absolute relative error AbsRel between target $y$ and prediction $\hat{y}$, $\text{AbsRel} = |y - \hat{y}| / y$, and the prediction threshold accuracy, $\delta_{1.25} = \max(\hat{y}/y, y/\hat{y}) < 1.25$.
Results. In the zero-shot setting, the state of the art is represented by the recent SlowTv [116]. This approach collected a large mixture of curated datasets with urban, natural, synthetic and indoor scenes, and trained one common model. For every dataset in the mixture, camera parameters are known or estimated with COLMAP. As Table 2 shows, DUSt3R adapts well to outdoor and indoor environments. It outperforms the self-supervised baselines [6, 37, 120] and performs on par with state-of-the-art supervised baselines [90, 173].
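The two monocular depth metrics defined in this section can be sketched as follows. This is a minimal illustration, assuming that both quantities are averaged over pixels with a valid (positive) ground-truth depth and that the values in Table 2 are these averages expressed as percentages; the function name `depth_metrics` and the validity mask are assumptions, not details stated in the text.

```python
import numpy as np

def depth_metrics(pred, target, threshold=1.25):
    """Return (AbsRel, delta) for a predicted depthmap against a target.

    AbsRel = mean(|y - y_hat| / y)
    delta  = fraction of pixels with max(y_hat / y, y / y_hat) < threshold
    Only pixels with a positive target depth are used (assumed mask).
    """
    valid = target > 0
    y, y_hat = target[valid], pred[valid]
    abs_rel = np.mean(np.abs(y - y_hat) / y)
    ratio = np.maximum(y_hat / y, y / y_hat)
    delta = np.mean(ratio < threshold)
    return abs_rel, delta
```

Multiplying both outputs by 100 gives values on the same scale as the Rel and δ1.25 columns of the table.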
                           7Scenes (Indoor) [113]                                         Cambridge (Outdoor) [48]
     Methods               Chess    Fire     Heads    Office   Pumpkin  Kitchen  Stairs   S. Facade   O. Hospital K. College  St.Mary’s   G. Court
FM   AS [102]              4/1.96   3/1.53   2/1.45   9/3.61   8/3.10   7/3.37   3/2.22   4/0.21      20/0.36     13/0.22     8/0.25      24/0.13
     HLoc [100]            2/0.79   2/0.87   2/0.92   3/0.91   5/1.12   4/1.25   6/1.62   4/0.2       15/0.3      12/0.20     7/0.21      11/0.16
E2E  DSAC* [11]            2/1.10   2/1.24   1/1.82   3/1.15   4/1.34   4/1.68   3/1.16   5/0.3       15/0.3      15/0.3      13/0.4      49/0.3
     HSCNet [54]           2/0.7    2/0.9    1/0.9    3/0.8    4/1.0    4/1.2    3/0.8    6/0.3       19/0.3      18/0.3      9/0.3       28/0.2
     PixLoc [101]          2/0.80   2/0.73   1/0.82   3/0.82   4/1.21   3/1.20   5/1.30   5/0.23      16/0.32     14/0.24     10/0.34     30/0.14
     SC-wLS [151]          3/0.76   5/1.09   3/1.92   6/0.86   8/1.27   9/1.43   12/2.80  11/0.7      42/1.7      14/0.6      39/1.3      164/0.9
     NeuMaps [124]         2/0.81   3/1.11   2/1.17   3/0.98   4/1.11   4/1.33   4/1.12   6/0.25      19/0.36     14/0.19     17/0.53     6/0.10
     DUSt3R 224-NoCroCo    5/1.76   6/2.02   3/1.75   5/1.54   9/2.35   6/1.82   34/7.81  24/1.33     79/1.17     69/1.15     46/1.51     143/1.32
     DUSt3R 224            3/0.96   3/1.02   1/1.00   4/1.04   5/1.26   4/1.36   21/4.08  9/0.38      26/0.46     20/0.32     11/0.38     36/0.24
     DUSt3R 512            3/0.97   3/0.95   2/1.37   3/1.01   4/1.14   4/1.34   11/2.84  6/0.26      17/0.33     11/0.20     7/0.24      38/0.16
Table 1. Absolute camera pose on 7Scenes [113] and Cambridge-Landmarks [48] datasets. We report the median translation and rotation
errors (cm/°) and compare to feature matching (FM) based and end-to-end (E2E) learning-based methods. The best results in each category are in bold.
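The per-scene numbers in Table 1 are median translation and rotation errors of the estimated absolute camera poses. A minimal sketch of how such errors are commonly computed is given below; it assumes each pose is a pair (R, t) of a 3×3 rotation and a camera position in world coordinates, measures rotation error as the angle of the relative rotation, and uses hypothetical function names. The exact evaluation code behind the table is not described in the text, so this is an assumed convention.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Translation error (same unit as t) and rotation error (degrees)
    between an estimated and a ground-truth camera pose."""
    t_err = np.linalg.norm(t_est - t_gt)
    # Angle of the relative rotation R_est^T R_gt from its trace.
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err, r_err

def median_pose_errors(est_poses, gt_poses):
    """Median translation and rotation errors over a list of (R, t) pairs."""
    errs = np.array([pose_errors(Re, te, Rg, tg)
                     for (Re, te), (Rg, tg) in zip(est_poses, gt_poses)])
    return np.median(errs[:, 0]), np.median(errs[:, 1])
```

If positions are stored in metres, the translation error has to be multiplied by 100 to match the centimetre scale used in the table.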
                             Outdoor                             Indoor
Methods               Train  DDAD [40]         KITTI [35]        BONN [79]         NYUD-v2 [114]     TUM [118]
                             Rel↓     δ1.25↑   Rel↓     δ1.25↑   Rel↓     δ1.25↑   Rel↓     δ1.25↑   Rel↓     δ1.25↑
DPT-BEiT [90]         D      10.70    84.63    9.45     89.27    -        -        5.40     96.54    10.45    89.68
NeWCRFs [173]         D      9.59     82.92    5.43     91.54    -        -        6.22     95.58    14.63    82.95
Monodepth2 [37]       SS     23.91    75.22    11.42    86.90    56.49    35.18    16.19    74.50    31.20    47.42
SC-SfM-Learners [6]   SS     16.92    77.28    11.83    86.61    21.11    71.40    13.79    79.57    22.29    64.30
SC-DepthV3 [120]      SS     14.20    81.27    11.79    86.39    12.58    88.92    12.34    84.80    16.28    79.67
MonoViT [181]         SS     -        -        9.92     90.01    -        -        -        -        -        -
RobustMIX [91]        T      -        -        18.25    76.95    -        -        11.77    90.45    15.65    86.59
SlowTv [116]          T      12.63    79.34    (6.84)   (56.17)  -        -        11.59    87.23    15.02    80.86
DUSt3R 224-NoCroCo    T      19.63    70.03    20.10    71.21    14.44    86.00    14.51    81.06    22.14    66.26
DUSt3R 224            T      16.32    77.58    16.97    77.89    11.05    89.95    10.28    88.92    17.61    75.44
DUSt3R 512            T      13.88    81.17    10.74    86.60    8.08     93.56    6.50     94.09    14.17    79.89

                         CO3Dv2 [93]                  RealEstate10K
Methods                  RRA@15   RTA@15   mAA(30)    mAA(30)
RelPose [176]            57.1     -        -          -
COLMAP+SPSG [26, 99]     36.1     27.3     25.3       45.2
PixSfM [58]              33.7     32.9     30.1       49.4
PoseReg [139]            53.2     49.1     45.0       -
PoseDiffusion [139]      80.5     79.8     66.5       48.0
DUSt3R 512 (w/ PnP)      94.3     88.4     77.2       61.2
DUSt3R 512 (w/ GA)       96.2     86.8     76.7       67.7
Table 2. Top: Monocular depth estimation on multiple benchmarks. D: supervised, SS: self-supervised, T: transfer (zero-shot). (Parentheses)
refers to training on the same set. Bottom: Multi-view pose regression on the CO3Dv2 [93] and RealEstate10K [185] datasets with 10 random frames.
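The pose metrics reported in Table 2 (RRA@15, RTA@15 and mAA(30)) can be sketched from per-pair angular errors as follows. The sketch assumes that the rotation and translation-direction errors of every image pair (45 pairs for 10 frames) have already been computed in degrees, and it approximates the area under the accuracy curve by averaging the accuracy at integer thresholds 1 to 30, taking for each pair the worse of its two errors. The function names and this discretization are assumptions; the text only defines the metrics in words.

```python
import numpy as np

def accuracy_at(errors_deg, tau):
    """Fraction of pairs whose angular error is below tau degrees."""
    return float(np.mean(np.asarray(errors_deg, dtype=float) < tau))

def pose_metrics(rot_err_deg, trans_err_deg, tau=15, max_tau=30):
    """Return (RRA@tau, RTA@tau, mAA@max_tau) from per-pair angular errors."""
    rot = np.asarray(rot_err_deg, dtype=float)
    trans = np.asarray(trans_err_deg, dtype=float)
    rra = accuracy_at(rot, tau)
    rta = accuracy_at(trans, tau)
    # A pair counts as correct at a threshold only if both errors are below it.
    worst = np.maximum(rot, trans)
    maa = float(np.mean([accuracy_at(worst, t) for t in range(1, max_tau + 1)]))
    return rra, rta, maa
```

For example, two pairs with rotation errors of 5° and 20° and translation errors of 10° and 40° give RRA@15 = RTA@15 = 0.5, and an mAA(30) of 1/3, since only the first pair is counted, and only at thresholds above 10°.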
                                GT    GT     GT          Align  KITTI               ScanNet             ETH3D               DTU                 T&T                 Average
     Methods                    Pose  Range  Intrinsics         rel↓      τ↑        rel↓      τ↑        rel↓      τ↑        rel↓      τ↑        rel↓      τ↑        rel↓      τ↑        time (s)↓
(a)  COLMAP [105, 106]          ✓     ×      ✓           ×      12.0      58.2      14.6      34.2      16.4      55.1      0.7       96.5      2.7       95.0      9.3       67.8      ≈ 3 min
     COLMAP Dense [105, 106]    ✓     ×      ✓           ×      26.9      52.7      38.0      22.5      89.8      23.2      20.8      69.3      25.7      76.4      40.2      48.8      ≈ 3 min
(b)  MVSNet [160]               ✓     ✓      ✓           ×      22.7      36.1      24.6      20.4      35.4      31.4      (1.8)     (86.0)    8.3       73.0      18.6      49.4      0.07
     MVSNet Inv. Depth [160]    ✓     ✓      ✓           ×      18.6      30.7      22.7      20.9      21.6      35.6      (1.8)     (86.7)    6.5       74.6      14.2      49.7      0.32
     Vis-MVSNet [175]           ✓     ✓      ✓           ×      9.5       55.4      8.9       33.5      10.8      43.3      (1.8)     (87.4)    4.1       87.2      7.0       61.4      0.70
     MVS2D ScanNet [159]        ✓     ✓      ✓           ×      21.2      8.7       (27.2)    (5.3)     27.4      4.8       17.2      9.8       29.2      4.4       24.4      6.6       0.04
     MVS2D DTU [159]            ✓     ✓      ✓           ×      226.6     0.7       32.3      11.1      99.0      11.6      (3.6)     (64.2)    25.8      28.0      77.5      23.1      0.05
(c)  DeMoN [135]                ✓     ×      ✓           ×      16.7      13.4      75.0      0.0       19.0      16.2      23.7      11.5      17.6      18.3      30.4      11.9      0.08
     DeepV2D KITTI [130]        ✓     ×      ✓           ×      (20.4)    (16.3)    25.8      8.1       30.1      9.4       24.6      8.2       38.5      9.6       27.9      10.3      1.43
     DeepV2D ScanNet [130]      ✓     ×      ✓           ×      61.9      5.2       (3.8)     (60.2)    18.7      28.7      9.2       27.4      33.5      38.0      25.4      31.9      2.15
     MVSNet [160]               ✓     ×      ✓           ×      14.0      35.8      1568.0    5.7       507.7     8.3       (4429.1)  (0.1)     118.2     50.7      1327.4    20.1      0.15
     MVSNet Inv. Depth [160]    ✓     ×      ✓           ×      29.6      8.1       65.2      28.5      60.3      5.8       (28.7)    (48.9)    51.4      14.6      47.0      21.2      0.28
     Vis-MVSNet [175]           ✓     ×      ✓           ×      10.3      54.4      84.9      15.6      51.5      17.4      (374.2)   (1.7)     21.1      65.6      108.4     31.0      0.82
     MVS2D ScanNet [159]        ✓     ×      ✓           ×      73.4      0.0       (4.5)     (54.1)    30.7      14.4      5.0       57.9      56.4      11.1      34.0      27.5      0.05
     MVS2D DTU [159]            ✓     ×      ✓           ×      93.3      0.0       51.5      1.6       78.0      0.0       (1.6)     (92.3)    87.5      0.0       62.4      18.8      0.06
     Robust MVD Baseline [109]  ✓     ×      ✓           ×      7.1       41.9      7.4       38.4      9.0       42.6      2.7       82.0      5.0       75.1      6.3       56.0      0.06
(d)  DeMoN [135]                ×     ×      ✓           ∥t∥    15.5      15.2      12.0      21.0      17.4      15.4      21.8      16.6      13.0      23.2      16.0      18.3      0.08
     DeepV2D KITTI [130]        ×     ×      ✓           med    (3.1)     (74.9)    23.7      11.1      27.1      10.1      24.8      8.1       34.1      9.1       22.6      22.7      2.07
     DeepV2D ScanNet [130]      ×     ×      ✓           med    10.0      36.2      (4.4)     (54.8)    11.8      29.3      7.7       33.0      8.9       46.4      8.6       39.9      3.57
     DUSt3R 224-NoCroCo         ×     ×      ×           med    15.14     21.16     7.54      40.00     9.51      40.07     3.56      62.83     11.12     37.90     9.37      40.39     0.05
     DUSt3R 224                 ×     ×      ×           med    15.39     26.69     (5.86)    (50.84)   4.71      61.74     2.76      77.32     5.54      56.38     6.85      54.59     0.05
     DUSt3R 512                 ×     ×      ×           med    9.11      39.49     (4.93)    (60.20)   2.91      76.91     3.52      69.33     3.17      76.68     4.73      64.52     0.13
Table 3. Multi-view depth evaluation with different settings: a) Classical approaches; b) with poses and depth range, without alignment; c)
absolute scale evaluation with poses, without depth range and alignment; d) without poses and depth range, but with alignment. (Parentheses)
denote training on data from the same domain. The best results for each setting are in bold.
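To make the "med" alignment and the two depth metrics concrete, here is a minimal sketch assuming the usual definitions: the prediction is rescaled so that its median matches the ground-truth median, the relative error is the mean of |d - d*| / d*, and the inlier ratio is the share of pixels whose ratio max(d/d*, d*/d) is below a threshold. The function names and the `valid` mask are illustrative assumptions, and the threshold is left as a parameter because this section does not state the value used for τ.

```python
import numpy as np

def median_align(pred, gt, valid):
    """Rescale the predicted depth so its median matches the ground truth."""
    scale = np.median(gt[valid]) / np.median(pred[valid])
    return pred * scale

def abs_rel(pred, gt, valid):
    """Absolute relative error in percent, over valid pixels only."""
    return 100.0 * np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid])

def inlier_ratio(pred, gt, valid, thresh):
    """Percentage of valid pixels with max(pred/gt, gt/pred) < thresh."""
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return 100.0 * np.mean(ratio < thresh)

# Toy example: the prediction is correct up to a global scale of 0.5.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])   # 0.0 marks a pixel without ground truth
valid = gt > 0
pred = np.array([[0.5, 1.0], [2.0, 1.0]])

aligned = median_align(pred, gt, valid)
err_before = abs_rel(pred, gt, valid)
err_after = abs_rel(aligned, gt, valid)
inliers_after = inlier_ratio(aligned, gt, valid, thresh=1.25)
```

In this toy case the scale-only discrepancy disappears after alignment, which is why up-to-scale predictions can still be compared against metric ground truth.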
Figure 4. Example of 3D reconstruction of an unseen MegaDepth scene from two images (top-left). Note this is the raw output of the network, i.e. we show the output depthmaps (top-center, see Eq. (8)) and confidence maps (top-right), as well as two different viewpoints on the colored pointcloud (middle and bottom). Camera parameters are recovered from the raw pointmaps, see Sec. 3.3 in the main paper. DUSt3R handles strong viewpoint and focal changes without apparent problems.

Figure 5. Example of 3D reconstruction of an unseen MegaDepth [55] scene from two images only. Note this is the raw output of the network, i.e. we show the output depthmaps (top-center) and confidence maps (top-right), as well as different viewpoints on the colored pointcloud (middle and bottom). Camera parameters are recovered from the raw pointmaps, see Sec. 3.3 in the main paper. DUSt3R handles strong viewpoint and focal changes without apparent problems.
Figure 6. Examples of 3D reconstruction from two images only of unseen scenes, namely KingsCollege (top-left), OldHospital (top-middle), StMarysChurch (top-right), ShopFacade (bottom-left) and GreatCourt (bottom-right). Note this is the raw output of the network, i.e. we show new viewpoints on the colored pointclouds. Camera parameters are recovered from the raw pointmaps, see Sec. 3.3 in the main paper.
Figure 7. Example of 3D reconstruction from two images only of unseen scenes, namely Chess, Fire, Heads, Office (Top-Row), Pumpkin,
Kitchen, Stairs (Bottom-Row). Note this is the raw output of the network, i.e. we show new viewpoints on the colored pointclouds. Camera
parameters are recovered from the raw pointmaps, see Sec. 3.3 in the main paper.
Figure 8. Examples of 3D reconstructions from nearly opposite viewpoints. For each of the 4 cases (motorcycle, toaster, bench, stop
sign), we show the two input images (top-left) and the raw output of the network: output depthmaps (top-center) and confidence maps
(top-right), as well as two different views on the colored point-clouds (middle and bottom). Camera parameters are recovered from the raw
pointmaps, see Sec. 3.3 in the main paper. DUSt3R handles drastic viewpoint changes without apparent issues, even when there is almost no
overlapping visual content between images, e.g. for the stop sign and motorcycle. Note that these example cases are not cherry-picked; they
are randomly chosen from the set of unseen CO3D v2 sequences. Please refer to the video for animated visualizations.
Figure 9. Reconstruction example from 4 random frames of a RealEstate10K indoor sequence, after global alignment. On the left-hand side,
we show the 4 input frames, and on the right-hand side the resulting point-cloud and the recovered camera intrinsics and poses.
Appendix

This appendix provides additional details and qualitative results of DUSt3R. We first present in Sec. B qualitative pairwise predictions of the presented architecture on challenging real-life datasets. This section also contains the description of the video accompanying this material. We then propose an extended related work section in Sec. C, encompassing a wider range of methodological families and geometric vision tasks. Sec. D provides auxiliary ablative results on multi-view pose estimation that did not fit in the main paper. We then report in Sec. E results on an experimental visual localization task, where the camera intrinsics are unknown. Finally, we detail the training and data augmentation procedures in Sec. F.

B. Qualitative results

Point-cloud visualizations. We present some visualizations of DUSt3R's pairwise results in Figs. 4 to 8. Note that these scenes were never seen during training and were not cherry-picked. Also, we did not post-process these results, except for filtering out low-confidence points (based on the output confidence) and removing sky regions for the sake of visualization, i.e. these figures accurately represent the raw output of DUSt3R. Overall, the proposed network is able to perform highly accurate 3D reconstruction from just two images. In Fig. 9, we show the output of DUSt3R after the global alignment stage. In this case, the network has processed all pairs of the 4 input images, and outputs 4 spatially consistent pointmaps along with the corresponding camera parameters.
Note that, for the case of image sequences captured with the same camera, we never enforce the fact that camera intrinsics must be identical for every frame, i.e. all intrinsic parameters are optimized independently. This remains true for all results reported in this appendix and in the main paper, e.g. on multi-view pose estimation with the CO3Dv2 [93] and RealEstate10K [185] datasets.

Supplementary Video. We attach to this appendix a video showcasing the different steps of DUSt3R. In the video, we demonstrate dense 3D reconstruction from a small set of raw RGB images, without using any ground-truth camera parameters (i.e. unknown intrinsic and extrinsic parameters). We show that our method can seamlessly handle monocular predictions, and is able to perform reconstruction and camera pose estimation in extreme binocular cases, where the cameras are facing each other. In addition, we show some qualitative reconstructions of rather large scale scenes from the ETH3D dataset [108].

C. Extended Related Work

For the sake of exposition, Section 2 of the main paper covered only some (but not all) of the most related works. Because this work covers a large variety of geometric tasks, we complete it in this section with a few equally important topics.

Implicit Camera Models. In our work, we do not explicitly output camera parameters. Likewise, there are several works aiming to express 3D shapes in a canonical space that is not directly related to the input viewpoint. Shapes can be stored as occupancy in regular grids [19, 97, 111, 117, 137, 153, 154], octree structures [127], collections of parametric surface elements [39], point cloud encoders [29, 65, 66], free-form deformation of template meshes [88] or per-view depthmaps [53]. While these approaches arguably perform classification and not actual 3D reconstruction [128], all in all, they work only in very constrained setups, usually on ShapeNet [14], and have trouble generalizing to natural scenes with non object-centric views [186]. The question of how to express a complex scene with several object instances in a single canonical frame had yet to be answered: in this work, we also express the reconstruction in a canonical reference frame, but thanks to our scene representation (pointmaps), we still preserve a relationship between image pixels and the 3D space, and we are thus able to perform 3D reconstruction consistently.

Dense Visual SLAM. In visual SLAM, early works on dense 3D reconstruction and ego-motion estimation utilized active depth sensors [73, 135, 183]. Recent works on dense visual SLAM from RGB video streams are able to produce high-quality depth maps and camera trajectories [7, 22, 115, 121, 129, 131], but they inherit the traditional limitations of SLAM, e.g. noisy predictions, drifts and outliers in the pixel correspondences. To make the 3D reconstruction more robust, R3D3 [104] jointly leverages multi-camera constraints and monocular depth cues. Most recently, GO-SLAM [178] proposed real-time global pose optimization by considering the complete history of input frames and continuously aligning all poses, which enables instantaneous loop closures and correction of global structure. Still, all SLAM methods assume that the input consists of a sequence of closely related images, e.g. with identical intrinsics, nearby camera poses and small illumination variations. In comparison, our approach handles completely unconstrained image collections.

3D reconstruction from implicit models has undergone significant advancements, largely fueled by the integration of neural networks [60, 71, 80, 147, 172]. Earlier approaches [60, 74, 80] utilize Multi-Layer Perceptrons (MLPs) to generate continuous surface outputs with only posed RGB images. Innovations like NeRF [71] and its follow-ups [46, 67, 69, 93, 143, 177] have pioneered density-based volume rendering to represent scenes as continuous 5D functions for both occupancy and color, showing exceptional ability in synthesizing novel views of complex scenes. To handle large-scale scenes, recent approaches [41, 172, 187, 188] introduce geometry priors to the implicit model, leading to
much more detailed reconstructions. In contrast to implicit 3D reconstruction, our work focuses on explicit 3D reconstruction and shows that DUSt3R not only produces detailed 3D reconstructions but also provides rich geometry for multiple downstream 3D tasks.

RGB-pairs-to-3D takes its roots in two-view geometry [43] and is considered either as a stand-alone task or as an intermediate step towards multi-view reconstruction. This process typically involves estimating a dense depth map and determining the relative camera pose from two different views. Recent learning-based approaches formulate this problem either as pose and monocular depth regression [92, 171, 184] or pose and stereo matching [122, 130, 135, 141, 182]. The ultimate goal is to achieve 3D reconstruction from the predicted geometry [2]. In addition to reconstruction tasks, learning from two views also benefits unsupervised pretraining; the recently proposed CroCo [148, 149] introduces a pretext task of cross-view completion from a large set of image pairs to learn 3D geometry from unlabeled data and to apply this learned implicit representation to various downstream 3D vision tasks. Our method draws inspiration from the CroCo pipeline, but diverges in its application. Instead of focusing on model pretraining, our approach leverages this pipeline to directly generate 3D pointmaps from the image pair. In this context, the depth map and camera poses are only by-products in our pipeline.

D. Multi-view Pose Estimation

We include additional results for the multi-view pose estimation task from the main paper (in Sec. 4.2). Namely, we compute the pose accuracy for a smaller number of input images (they are randomly selected from the entire test sequences). Tab. 5 reports our performance and compares with the state of the art. Numbers for state-of-the-art methods are borrowed from the recent PoseDiffusion [139] paper's tables and plots, hence some numbers are only approximate. Our method consistently outperforms all other methods on the CO3Dv2 dataset by a large margin, even for a small number of frames. As can be observed in Fig. 8 and in the attached video, DUSt3R handles opposite viewpoints (i.e. nearly 180° apart) seemingly without much trouble. In the end, DUSt3R obtains relatively stable performance, regardless of the number of input views. When comparing with PoseDiffusion [139] on RealEstate10K, we report performances with and without training on the same dataset. Note that DUSt3R's training data include a small subset of CO3Dv2 (we used 50 sequences for each category, i.e. less than 7% of the full training set) but no data from RealEstate10K whatsoever. An example of reconstruction on RealEstate10K is shown in Fig. 9. Our network outputs a consistent pointcloud despite wide baseline viewpoint changes between the first and last pairs of frames.

Methods          N Frames   Co3Dv2 [93]                   RealEstate10K [185]
                            RRA@15    RTA@15    mAA(30)   mAA(30)
COLMAP+SPSG      3          ∼22       ∼14       ∼15       ∼23
PixSfM           3          ∼18       ∼8        ∼10       ∼17
RelPose          3          ∼56       -         -         -
PoseDiffusion    3          ∼75       ∼75       ∼61       - (∼77)
DUSt3R 512       3          95.3      88.3      77.5      69.5
COLMAP+SPSG      5          ∼21       ∼17       ∼17       ∼34
PixSfM           5          ∼21       ∼16       ∼15       ∼30
RelPose          5          ∼56       -         -         -
PoseDiffusion    5          ∼77       ∼76       ∼63       - (∼78)
DUSt3R 512       5          95.5      86.7      76.5      67.4
COLMAP+SPSG      10         31.6      27.3      25.3      45.2
PixSfM           10         33.7      32.9      30.1      49.4
RelPose          10         57.1      -         -         -
PoseDiffusion    10         80.5      79.8      66.5      48.0 (∼80)
DUSt3R 512       10         96.2      86.8      76.7      67.7

Table 5. Comparison with the state of the art for multi-view pose regression on the CO3Dv2 [93] and RealEstate10K [185] with 3, 5 and 10 random frames. (Parentheses) indicate results obtained after training on RealEstate10K. In contrast, we report results for DUSt3R after global alignment without training on RealEstate10K.

E. Visual localization

We include additional results of visual localization on the 7-Scenes and Cambridge-Landmarks datasets [48, 113]. Namely, we experiment with a scenario where the focal parameter of the query camera is unknown. In this case, we feed the query image and a database image into DUSt3R, and get an un-scaled 3D reconstruction. We then scale the resulting pointmap according to the ground-truth pointmap of the database image, and extract the pose as described in Sec. 3.3 of the main paper. Tab. 6 shows that this method performs reasonably well on the 7-Scenes dataset, where the median translation error is on the order of a few centimeters. On the Cambridge-Landmarks dataset, however, we obtain considerably larger errors. After inspection, we find that the ground-truth database pointmaps are sparse, which prevents any reliable scaling of our reconstruction. In contrast, 7-Scenes provides dense ground-truth pointmaps. We conclude that further work is necessary for "in-the-wild" visual localization with unknown intrinsics.

F. Training details

F.1. Training data

Ground-truth pointmaps. Ground-truth pointmaps $\bar{X}^{1,1}$ and $\bar{X}^{2,1}$ for images $I^1$ and $I^2$, respectively, from Eq. (2) in the main paper are obtained from the ground-truth camera intrinsics $K_1, K_2 \in \mathbb{R}^{3 \times 3}$, camera poses $P_1, P_2 \in \mathbb{R}^{3 \times 4}$ and depthmaps $D_1, D_2 \in \mathbb{R}^{W \times H}$. Specifically, we simply
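As an illustration of how a pointmap can be built from intrinsics, poses and a depthmap, the following sketch uses standard pinhole back-projection followed by a rigid change of frame. It is not taken from the paper: it assumes that each pose is a 3×4 world-to-camera matrix [R | t], that depth is measured along the optical axis, and that arrays are stored as (H, W); the function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Depthmap (H, W) -> pointmap (H, W, 3) in the camera's own frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                        # K^-1 [u, v, 1]^T per pixel
    return rays * depth[..., None]                         # scale each ray by its depth

def to_4x4(P):
    """Promote a 3x4 pose [R | t] to a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3] = P
    return T

def pointmap_in_frame1(depth2, K2, P1, P2):
    """Pointmap of view 2 expressed in the camera frame of view 1.

    Assumes P1 and P2 are world-to-camera poses (an assumption of this sketch).
    """
    X2 = backproject(depth2, K2)                           # points in camera 2
    T_2_to_1 = to_4x4(P1) @ np.linalg.inv(to_4x4(P2))      # camera 2 -> world -> camera 1
    return X2 @ T_2_to_1[:3, :3].T + T_2_to_1[:3, 3]

# Toy example: constant depth, camera 2 shifted along x relative to camera 1.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])

X_cam = backproject(depth, K)
X_21 = pointmap_in_frame1(depth, K, P1, P2)
```

With a camera-to-world pose convention the composition would be inverted, so the convention should be checked against the dataset before reusing this sketch.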
                                  GT      7Scenes (Indoor) [113]                                           Cambridge (Outdoor) [48]
Methods                           Focals  Chess    Fire     Heads    Office   Pumpkin  Kitchen  Stairs     S. Facade  O. Hospital  K. College  St.Mary’s  G. Court
DUSt3R 512 from 2D-matching       ✓       3/0.97   3/0.95   2/1.37   3/1.01   4/1.14   4/1.34   11/2.84    6/0.26     17/0.33      11/0.20     7/0.24     38/0.16
DUSt3R 512 from scaled rel-pose   ×       5/1.08   5/1.18   4/1.33   6/1.05   7/1.25   6/1.37   26/3.56    64/0.97    151/0.88     102/0.88    79/1.46    245/1.08
Table 6. Absolute camera pose on 7Scenes [113] (top 1 image) and Cambridge-Landmarks [48] (top 20 images) datasets. We report the median translation and rotation errors (cm/°).
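The "scaled rel-pose" setting requires a scale factor between the un-scaled prediction and the ground-truth pointmap of the database image. The exact estimator is not specified here, so the sketch below is only one plausible choice: a median of per-pixel distance ratios over pixels that have ground truth. The function name and the `valid` mask are assumptions, and both pointmaps are assumed to be expressed in the database camera frame.

```python
import numpy as np

def scale_from_gt(pred_points, gt_points, valid):
    """Median ratio of ground-truth to predicted point distances.

    pred_points, gt_points: (H, W, 3) pointmaps in the database camera frame.
    valid: (H, W) boolean mask of pixels that have ground truth.
    """
    pred_norm = np.linalg.norm(pred_points[valid], axis=-1)
    gt_norm = np.linalg.norm(gt_points[valid], axis=-1)
    return float(np.median(gt_norm / np.maximum(pred_norm, 1e-12)))

# Toy example: the prediction equals the ground truth up to a factor of 3.
rng = np.random.default_rng(0)
gt_points = rng.uniform(1.0, 5.0, size=(4, 5, 3))
pred_points = gt_points / 3.0
valid = np.ones((4, 5), dtype=bool)
valid[0, :2] = False                      # simulate missing ground truth

scale = scale_from_gt(pred_points, gt_points, valid)
scaled_points = pred_points * scale       # prediction brought to metric scale
```

A median is robust to a few outliers, but with very few valid pixels the estimate rests on little evidence, which is consistent with the larger errors reported for the sparse Cambridge-Landmarks pointmaps.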
Hyperparameters          low-resolution training              high-resolution training                       DPT training
Prediction Head          Linear                               Linear                                         DPT [90]
Optimizer                AdamW [61]                           AdamW [61]                                     AdamW [61]
Base learning rate       1e-4                                 1e-4                                           1e-4
Weight decay             0.05                                 0.05                                           0.05
Adam β                   (0.9, 0.95)                          (0.9, 0.95)                                    (0.9, 0.95)
Pairs per Epoch          700k                                 70k                                            70k
Batch size               128                                  64                                             64
Epochs                   50                                   100                                            90
Warmup epochs            10                                   20                                             15
Learning rate scheduler  Cosine decay                         Cosine decay                                   Cosine decay
Input resolutions        224×224                              512×384, 512×336, 512×288, 512×256, 512×160    512×384, 512×336, 512×288, 512×256, 512×160
Image Augmentations      Random centered crop, color jitter   Random centered crop, color jitter             Random centered crop, color jitter
Initialization           CroCo v2 [148]                       low-resolution training                        high-resolution training
Table 7. Detailed training hyper-parameters. To save training time, training proceeds in three stages: first a low-resolution training with a linear head, then a higher-resolution training still with a linear head, and finally a higher-resolution training with a DPT head.
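For readers who want to reproduce the schedule in Table 7, here is a minimal sketch of a per-epoch learning rate with warmup followed by cosine decay. The table only gives the base rate, the number of warmup epochs, the total epochs and the scheduler type; the linear shape of the warmup, the final value `min_lr` and per-epoch (rather than per-iteration) updates are assumptions of this sketch.

```python
import math

def learning_rate(epoch, base_lr=1e-4, warmup_epochs=10, total_epochs=50, min_lr=0.0):
    """Learning rate at a given (possibly fractional) epoch.

    Linear warmup from 0 to base_lr, then cosine decay down to min_lr.
    Defaults follow the low-resolution column of Table 7.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Low-resolution stage: 50 epochs with 10 warmup epochs.
schedule_low = [learning_rate(e) for e in range(51)]

# High-resolution stage: 100 epochs with 20 warmup epochs.
schedule_high = [learning_rate(e, warmup_epochs=20, total_epochs=100) for e in range(101)]
```

The rate peaks at the end of warmup and reaches `min_lr` at the last epoch; the DPT stage would use `warmup_epochs=15` and `total_epochs=90` under the same assumptions.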
Datasets               Type                  N Pairs
Habitat [103]          Indoor / Synthetic    1000k
CO3Dv2 [93]            Object-centric        941k
ScanNet++ [165]        Indoor / Real         224k
ARKitScenes [25]       Indoor / Real         2040k
Static Thing 3D [68]   Object / Synthetic    337k
MegaDepth [55]         Outdoor / Real        1761k
BlendedMVS [161]       Outdoor / Synthetic   1062k
Waymo [121]            Outdoor / Real        1100k

Table 8. Dataset mixture and sample sizes for DUSt3R training.

Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], 2015. 14
[15] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. arXiv preprint arXiv:2004.05155, 2020. 2
[16] Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, and Kwan-Yee K. Wong. Deep photometric stereo for non-lambertian surfaces. PAMI, 2022. 2
[17] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer. ECCV, 2022. 3
[18] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In CVPR, 2020. 9
[19] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016. 14
[20] David Crandall, Andrew Owens, Noah Snavely, and Daniel Huttenlocher. SfM with MRFs: Discrete-continuous optimization for large-scale structure from motion. PAMI, 2013. 2, 3
[21] Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017. 3
[22] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics Autom. Lett., 5(2):721–728, 2020. 14
[23] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 3, 8
[24] Amaury Dame, Victor A Prisacariu, Carl Y Ren, and Ian Reid. Dense reconstruction using 3d object shape priors. In CVPR, pages 1288–1295, 2013. 2
[25] Afshin Dehghan, Gilad Baruch, Zhuoyuan Chen, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In NeurIPS Datasets and Benchmarks, 2021. 2, 6, 16, 17
[26] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshops, pages 224–236, 2018. 2, 3, 7, 8
[27] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021. 4, 6
[28] Mihai Dusmanu, Ignacio Rocco, Tomás Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable CNN for joint description and detection of local features. In CVPR, pages 8092–8101, 2019. 2, 3
[29] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point
set generation network for 3d object reconstruction from a single image. In CVPR, 2017. 14
[30] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981. 5
[31] Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. In NeurIPS, 2022. 2, 3
[32] Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Found. Trends Comput. Graph. Vis., 2015. 2, 3
[33] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. PAMI, 2010. 9
[34] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, June 2015. 3, 9
[35] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. Int. J. Robotics Res., 32(11):1231–1237, 2013. 7, 8
[36] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 3
[37] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In ICCV, pages 3827–3837. IEEE, 2019. 7, 8
[38] Leonardo Gomes, Olga Regina Pereira Bellon, and Luciano Silva. 3d reconstruction methods for digital preservation of cultural heritage: A survey. Pattern Recognit. Lett., 2014. 2
[39] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. Atlasnet: A papier-mâché approach to learning 3d surface generation. CVPR, 2018. 14
[40] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In CVPR, pages 2482–2491, 2020. 7, 8
[41] Haoyu Guo, Sida Peng, Haotong Lin, Qianqian Wang, Guofeng Zhang, Hujun Bao, and Xiaowei Zhou. Neural 3d scene reconstruction with the manhattan-world assumption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5511–5520, 2022. 14
[42] Chris Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988. 3
[43] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003. 15
[44] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004. 3, 5
[45] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. 6
[46] Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Vitor Guizilini, Thomas Kollar, Adrien Gaidon, Zsolt Kira, and Rares Ambrus. Neo 360: Neural fields for sparse view synthesis of outdoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9187–9198, 2023. 14
[47] Nianjuan Jiang, Zhaopeng Cui, and Ping Tan. A global linear method for camera pose registration. In ICCV, 2013. 3
[48] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: a Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, 2015. 7, 8, 15, 16
[49] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017. 8
[50] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples Online Benchmark. https://www.tanksandtemples.org/leaderboard/, 2017. [Online; accessed 19-October-2023]. 1
[51] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate O(n) solution to the pnp problem. IJCV, 2009. 5
[52] Vincent Leroy, Jean-Sébastien Franco, and Edmond Boyer. Volume sweeping: Learning photoconsistency for multi-view shape reconstruction. IJCV, 2021. 3
[53] Kejie Li, Trung Pham, Huangying Zhan, and Ian D. Reid. Efficient dense point cloud object reconstruction using deformation vector fields. In ECCV, 2018. 14
[54] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In CVPR, 2020. 7, 8
[55] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, pages 2041–2050, 2018. 2, 6, 10, 16, 17
[56] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. In AAAI, 2018. 3
[57] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: bundle-adjusting neural radiance fields. In ICCV, 2021. 3
[58] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In ICCV, 2021. 2, 3, 7, 8
[59] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In ICCV, 2023. 2, 3
[60] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2019–2028, 2020. 14
[61] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 17
[62] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 2, 3
[63] Bin Luo and Edwin R. Hancock. Procrustes alignment with the EM algorithm. In Computer Analysis of Images and Patterns, CAIP, volume 1689 of Lecture Notes in Computer Science, pages 623–631. Springer, 1999. 5
[64] Zeyu Ma, Zachary Teed, and Jia Deng. Multiview stereo with cascaded epipolar raft. In ECCV, 2022. 3, 9
[65] Priyanka Mandikal, Navaneet K. L., Mayank Agarwal, and
Venkatesh Babu Radhakrishnan. 3d-lmnet: Latent embed- Giguère, and Cyrill Stachniss. Refusion: 3d reconstruction
ding matching for accurate and diverse 3d point cloud recon- in dynamic environments for RGB-D cameras exploiting
struction from a single image. In BMVC, 2018. 14 residuals. In 2IEEE/RSJ International Conference on Intelli-
[66] Priyanka Mandikal and Venkatesh Babu Radhakrishnan. gent Robots and Systems (IROS), pages 7855–7862, 2019.
Dense 3d point cloud reconstruction using a deep pyramid 7, 8
network. In WACV, 2019. 14 [80] Jeong Joon Park, Peter Florence, Julian Straub, Richard
[67] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Newcombe, and Steven Lovegrove. Deepsdf: Learning con-
Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duck- tinuous signed distance functions for shape representation.
worth. Nerf in the wild: Neural radiance fields for uncon- In The IEEE Conference on Computer Vision and Pattern
strained photo collections. In Proceedings of the IEEE/CVF Recognition (CVPR), June 2019. 14
Conference on Computer Vision and Pattern Recognition, [81] Rémi Pautrat, Iago Suárez, Yifan Yu, Marc Pollefeys, and
pages 7210–7219, 2021. 14 Viktor Larsson. GlueStick: Robust image matching by stick-
[68] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. ing points and lines together. In ICCV, 2023. 3
Dosovitskiy, and T. Brox. A large dataset to train convolu- [82] Dario Pavllo, Jonas Kohler, Thomas Hofmann, and Aurélien
tional networks for disparity, optical flow, and scene flow Lucchi. Learning generative models of textured 3d meshes
estimation. In CVPR, 2016. 2, 6, 16, 17 from real-world images. In ICCV, 2021. 3
[69] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao [83] Dario Pavllo, Graham Spinks, Thomas Hofmann, Marie-
Su, Lan Xu, Xuming He, and Jingyi Yu. Gnerf: Gan-based Francine Moens, and Aurélien Lucchi. Convolutional gener-
neural radiance field without posed camera. In ICCV, 2021. ation of textured 3d meshes. In NeurIPS, 2020.
14 [84] Dario Pavllo, David Joseph Tan, Marie-Julie Rakotosaona,
[70] Xiaoxu Meng, Weikai Chen, and Bo Yang. Neat: Learning and Federico Tombari. Shape, pose, and appearance from a
neural implicit surfaces with arbitrary topologies from multi- single image via bootstrapped radiance field inversion. In
view images. In CVPR, 2023. 2, 3 CVPR, 2023. 3
[71] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, [85] Rui Peng, Rongjie Wang, Zhenyu Wang, Yawen Lai, and
Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Ronggang Wang. Rethinking depth estimation for multi-
Representing scenes as neural radiance fields for view syn- view stereo: A unified representation. In CVPR, 2022. 3
thesis. In ECCV, 2020. 2, 14 [86] MV Peppa, JP Mills, KD Fieber, I Haynes, S Turner, A
[72] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Turner, M Douglas, and PG Bryan. Archaeological feature
Tardos. Orb-slam: a versatile and accurate monocular slam detection from archive aerial photography with a sfm-mvs
system. IEEE transactions on robotics, 2015. 2 and image enhancement pipeline. The International Archives
[73] Richard A. Newcombe, Steven Lovegrove, and Andrew J. of the Photogrammetry, Remote Sensing and Spatial Infor-
Davison. DTAM: dense tracking and mapping in real-time. mation Sciences, 42:869–875, 2018. 2
In ICCV, pages 2320–2327, 2011. 14 [87] Frank Plastria. The Weiszfeld Algorithm: Proof, Amend-
[74] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and ments, and Extensions, pages 357–389. Springer US, 2011.
Andreas Geiger. Differentiable volumetric rendering: Learn- 5
ing implicit 3d representations without 3d supervision. In [88] Jhony K. Pontes, Chen Kong, Sridha Sridharan, Simon
Proceedings of the IEEE/CVF Conference on Computer Vi- Lucey, Anders P. Eriksson, and Clinton Fookes. Im-
sion and Pattern Recognition, pages 3504–3515, 2020. 14 age2mesh: A learning framework for single image 3d recon-
[75] Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, struction. In ACCV, 2018. 14
and Andreas Geiger. Differentiable volumetric rendering: [89] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and
Learning implicit 3d representations without 3d supervision. Leonidas J. Guibas. Pointnet: Deep learning on point sets for
In CVPR, 2020. 3 3d classification and segmentation. In CVPR, pages 77–85,
[76] Michael Oechsle, Songyou Peng, and Andreas Geiger. 2017. 2
UNISURF: unifying neural implicit surfaces and radiance [90] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi-
fields for multi-view reconstruction. In ICCV, 2021. 2, 3 sion transformers for dense prediction. In ICCV, 2021. 3, 6,
[77] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy 7, 8, 17
Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, [91] René Ranftl, Katrin Lasinger, David Hafner, Konrad
Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- Schindler, and Vladlen Koltun. Towards robust monocular
moud Assran, Nicolas Ballas, Wojciech Galuba, Rus- depth estimation: Mixing datasets for zero-shot cross-dataset
sell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, transfer. CoRR, 1907.01341/abs, 2020. 8
Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, [92] Anurag Ranjan, Varun Jampani, Lukas Balles, Kihwan Kim,
Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive
and Piotr Bojanowski. Dinov2: Learning robust visual fea- collaboration: Joint unsupervised learning of depth, camera
tures without supervision. arXiv preprint arXiv:2304.07193, motion, optical flow and motion segmentation. In Proceed-
2023. 9 ings of the IEEE/CVF conference on computer vision and
[78] Onur Özyeşil, Vladislav Voroninski, Ronen Basri, and Amit pattern recognition, pages 12240–12249, 2019. 15
Singer. A survey of structure from motion. Acta Numerica, [93] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler,
26:305–364, 2017. 2 Luca Sbordone, Patrick Labatut, and David Novotný. Com-
[79] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe mon objects in 3d: Large-scale learning and evaluation of
real-life 3d category reconstruction. In ICCV, pages 10881– Andreas Geiger. A multi-view stereo benchmark with high-
10891, 2021. 2, 6, 7, 8, 14, 15, 16, 17 resolution images and multi-camera videos. In CVPR, 2017.
[94] Jerome Revaud, Jon Almazán, Rafael S Rezende, and Ce- 2, 3, 8, 14
sar Roberto de Souza. Learning with average precision: [109] Philipp Schröppel, Jan Bechtold, Artemij Amiranashvili,
Training image retrieval with a listwise loss. In ICCV, 2019. and Thomas Brox. A benchmark and a baseline for robust
7 multi-view depth estimation. In 3DV, pages 637–645, 2022.
[95] Jerome Revaud, Yohann Cabon, Romain Brégier, Jong- 8, 9
Min Lee, and Philippe Weinzaepfel. SACReg: Scene- [110] Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernan-
agnostic coordinate regression for visual localization. CoRR, dez, and Steven M Seitz. Occluding contours for multi-view
abs/2307.11702, 2023. 3 stereo. In CVPR, pages 4002–4009, 2014. 2
[96] Jérôme Revaud, César Roberto de Souza, Martin Humen- [111] Zai Shi, Zhao Meng, Yiran Xing, Yunpu Ma, and Roger
berger, and Philippe Weinzaepfel. R2D2: reliable and repeat- Wattenhofer. 3d-retr: End-to-end single and multi-view 3d
able detector and descriptor. In NeurIPS, pages 12405–12415, reconstruction with transformers. In BMVC, page 405, 2021.
2019. 2, 3 14
[97] Stephan R. Richter and Stefan Roth. Matryoshka networks: [112] Daeyun Shin, Charless C. Fowlkes, and Derek Hoiem. Pix-
Predicting 3d geometry via nested shape layers. In CVPR, els, voxels, and views: A study of shape representations for
2018. 14 single view 3d object shape prediction. In CVPR, 2018. 3
[98] Edward Rosten and Tom Drummond. Machine learning for [113] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram
high-speed corner detection. In ECCV. Springer, 2006. 3 Izadi, Antonio Criminisi, and Andrew W. Fitzgibbon. Scene
[99] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, coordinate regression forests for camera relocalization in
and Andrew Rabinovich. Superglue: Learning feature match- RGB-D images. In CVPR, pages 2930–2937, 2013. 7, 8, 15,
ing with graph neural networks. In CVPR, pages 4937–4946, 16
2020. 2, 3, 7, 8 [114] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob
[100] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Fergus. Indoor segmentation and support inference from
Marcin Dymczyk. From coarse to fine: Robust hierarchical RGBD images. In ECCV, pages 746–760, 2012. 7, 8
localization at large scale. In CVPR, 2019. 7, 8 [115] Cameron Smith, Yilun Du, Ayush Tewari, and Vincent Sitz-
[101] Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, mann. Flowcam: Training generalizable 3d radiance fields
Hugo Germain, Carl Toft, Victor Larsson, Marc Pollefeys, without camera poses via pixel-aligned scene flow, 2023. 14
Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and [116] Jaime Spencer, Chris Russell, Simon Hadfield, and Richard
Torsten Sattler. Back to the Feature: Learning Robust Cam- Bowden. Kick back & relax: Learning to reconstruct the
era Localization from Pixels to Pose. In CVPR, 2021. 7, world by watching slowtv. In ICCV, 2023. 3, 7, 8
8 [117] Riccardo Spezialetti, David Joseph Tan, Alessio Tonioni,
[102] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & Keisuke Tateno, and Federico Tombari. A divide et impera
effective prioritized matching for large-scale image-based approach for 3d shape reconstruction from multiple views.
localization. IEEE trans. PAMI, 2017. 7, 8 In 3DV, 2020. 14
[103] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, [118] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram
Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Burgard, and Daniel Cremers. A benchmark for the evalua-
Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv tion of RGB-D SLAM systems. In IEEE/RSJ International
Batra. Habitat: A platform for embodied ai research. In Conference on Intelligent Robots and Systems (IROS), pages
ICCV, 2019. 2, 6, 16, 17 573–580. IEEE, 2012. 7, 8
[104] Aron Schmied, Tobias Fischer, Martin Danelljan, Marc [119] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and
Pollefeys, and Fisher Yu. R3D3: dense 3d reconstruc- Xiaowei Zhou. LoFTR: Detector-free local feature matching
tion of dynamic scenes from multiple cameras. CoRR, with transformers. CVPR, 2021. 2, 3
abs/2308.14713, 2023. 14 [120] Libo Sun, Jia-Wang Bian, Huangying Zhan, Wei Yin,
[105] Johannes Lutz Schönberger and Jan-Michael Frahm. Ian Reid, and Chunhua Shen. Sc-depthv3: Robust self-
Structure-from-motion revisited. In Conference on Com- supervised monocular depth estimation for dynamic scenes.
puter Vision and Pattern Recognition (CVPR), 2016. 2, 3, 8, CoRR, 2211.03660, 2022. 7, 8
9 [121] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
[106] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
and Jan-Michael Frahm. Pixelwise view selection for un- Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han,
structured multi-view stereo. In ECCV, 2016. 2, 3, 7, 8, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et-
9 tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang,
[107] Thomas Schöps, Johannes L. Schönberger, Silvano Gal- Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov.
liani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, Scalability in perception for autonomous driving: Waymo
and Andreas Geiger. ETH3D Online Benchmark. https: open dataset. In CVPR, June 2020. 2, 6, 14, 16, 17
//www.eth3d.net/high_res_multi_view, 2017. [122] Chengzhou Tang and Ping Tan. Ba-net: Dense bundle adjust-
[Online; accessed 19-October-2023]. 1 ment network. Proceedings of the International Conference
[108] Thomas Schöps, Johannes L. Schönberger, Silvano Gal- on Learning Representations, 2018. 15
liani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and [123] Shitao Tang, Chengzhou Tang, Rui Huang, Siyu Zhu, and
Ping Tan. Learning camera localization via dense scene Kaihao Zhang, Nikolai Smolyanskiy, and Hongdong Li.
matching. In CVPR, 2021. 3 Deep two-view structure-from-motion revisited. In Pro-
[124] Shitao Tang, Sicong Tang, Andrea Tagliasacchi, Ping Tan, ceedings of the IEEE/CVF conference on Computer Vision
and Yasutaka Furukawa. Neumap: Neural coordinate map- and Pattern Recognition, pages 8953–8962, 2021. 15
ping by auto-transdecoder for camera localization. In CVPR, [142] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku
2023. 7, 8 Komura, and Wenping Wang. Neus: Learning neural im-
[125] Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. plicit surfaces by volume rendering for multi-view recon-
Quadtree attention for vision transformers. ICLR, 2022. struction. In NeurIPS, 2021. 2, 3
3 [143] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku
[126] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Komura, and Wenping Wang. Neus: Learning neural im-
Multi-view 3d models from single images with a convolu- plicit surfaces by volume rendering for multi-view recon-
tional network. In ECCV, 2016. 3 struction. NeurIPS, 2021. 14
[127] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. [144] Shuzhe Wang, Juho Kannala, Marc Pollefeys, and Daniel
Octree generating networks: Efficient convolutional archi- Barath. Guiding local feature matching with surface curva-
tectures for high-resolution 3d outputs. In ICCV, 2017. 14 ture. In Proceedings of the IEEE/CVF International Con-
[128] Maxim Tatarchenko, Stephan R. Richter, René Ranftl, ference on Computer Vision (ICCV), pages 17981–17991,
Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do October 2023. 3
single-view 3d reconstruction networks learn? In CVPR, [145] Yiqun Wang, Ivan Skorokhodov, and Peter Wonka. Hf-
2019. 14 neus: Improved surface reconstruction using high-frequency
[129] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir details. In NeurIPS, 2022. 2, 3
Navab. CNN-SLAM: real-time dense monocular SLAM [146] Yuesong Wang, Zhaojie Zeng, Tao Guan, Wei Yang, Zhuo
with learned depth prediction. In CVPR, 2017. 14 Chen, Wenkai Liu, Luoyuan Xu, and Yawei Luo. Adaptive
[130] Zachary Teed and Jia Deng. Deepv2d: Video to depth with patch deformation for textureless-resilient multi-view stereo.
differentiable structure from motion. In ICLR, 2020. 3, 9, In CVPR, 2023. 3
15 [147] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu,
[131] Zachary Teed and Jia Deng. DROID-SLAM: deep visual and Jie Zhou. Nerfingmvs: Guided optimization of neural
SLAM for monocular, stereo, and RGB-D cameras. In radiance fields for indoor multi-view stereo. In ICCV, 2021.
NeurIPS, pages 16558–16569, 2021. 14 2, 3, 14
[132] Sebastian Thrun. Probabilistic robotics. Communications of [148] Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy,
the ACM, 45(3):52–57, 2002. 2 Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela
[133] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme
large-scale multi-view stereo for ultra high-resolution image Revaud. CroCo v2: Improved Cross-view Completion Pre-
sets. Mach. Vis. Appl., 2012. 9 training for Stereo Matching and Optical Flow. In ICCV,
[134] Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. Disk: 2023. 4, 6, 9, 15, 17
Learning local features with policy gradient. Advances in [149] Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas,
Neural Information Processing Systems, 33:14254–14265, Romain Brégier, Yohann Cabon, Vaibhav Arora,
2020. 3 Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka,
[135] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Niko- and Jérôme Revaud. CroCo: Self-
laus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Supervised Pre-training for 3D Vision Tasks by Cross-View
DeMoN: Depth and motion network for learning monocular Completion. In NeurIPS, 2022. 4, 15
stereo. In CVPR, pages 5622–5631, 2017. 3, 9, 14, 15 [150] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin
[136] Sheng Wan, Tung-Yu Wu, Wing H. Wong, and Chen-Yi Lee. Johnson. Synsin: End-to-end view synthesis from a single
Confnet: Predict with confidence. In IEEE Intern. Conf. on image. In CVPR, 2020. 3
Acoustics, Speech and Signal Processing (ICASSP), pages [151] Xin Wu, Hao Zhao, Shunkai Li, Yingdian Cao, and Hongbin
2921–2925, 2018. 5 Zha. Sc-wls: Towards interpretable feed-forward camera
[137] Dan Wang, Xinrui Cui, Xun Chen, Zhengxia Zou, Tianyang re-localization. In ECCV, 2022. 7, 8
Shi, Septimiu Salcudean, Z. Jane Wang, and Rabab Ward. [152] Yuxi Xiao, Nan Xue, Tianfu Wu, and Gui-Song Xia. Level-
Multi-view 3d reconstruction with transformers. In ICCV, S²fM: Structure From Motion on Neural Level Set of Im-
pages 5702–5711, 2021. 14 plicit Surfaces. In Proceedings of the IEEE/CVF Conference
[138] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo on Computer Vision and Pattern Recognition, 2023. 3
Speciale, and Marc Pollefeys. Patchmatchnet: Learned [153] Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen
multi-view patchmatch stereo. In CVPR, pages 14194– Zhou, and Shengping Zhang. Pix2vox: Context-aware 3d
14203, 2021. 2, 9 reconstruction from single and multi-view images. In ICCV,
[139] Jianyuan Wang, Christian Rupprecht, and David Novotný. 2019. 14
Posediffusion: Solving pose estimation via diffusion-aided [154] Haozhe Xie, Hongxun Yao, Shengping Zhang, Shangchen
bundle adjustment. In ICCV, 2023. 7, 8, 15 Zhou, and Wenxiu Sun. Pix2vox++: Multi-scale context-
[140] Jinglu Wang, Bo Sun, and Yan Lu. Mvpnet: Multi-view aware 3d object reconstruction from single and multiple
point regression networks for 3d object reconstruction from images. IJCV, 2020. 14
a single image. In AAAI, 2019. 3 [155] Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai
[141] Jianyuan Wang, Yiran Zhong, Yuchao Dai, Stan Birchfield,
Cheng, and Feng Zhao. Frozenrecon: Pose-free 3d scene [171] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learn-
reconstruction with frozen depth models. In ICCV, 2023. 3 ing of dense depth, optical flow and camera pose. In Pro-
[156] Qingshan Xu and Wenbing Tao. Learning inverse depth re- ceedings of the IEEE conference on computer vision and
gression for multi-view stereo with correlation cost volume. pattern recognition, pages 1983–1992, 2018. 15
In AAAI, 2020. 9 [172] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sat-
[157] Jiayu Yang, Wei Mao, José M. Álvarez, and Miaomiao Liu. tler, and Andreas Geiger. Monosdf: Exploring monocular
Cost volume pyramid based depth inference for multi-view geometric cues for neural implicit surface reconstruction. Ad-
stereo. In CVPR, pages 4876–4885, 2020. 2, 9 vances in neural information processing systems, 35:25018–
[158] Luwei Yang, Ziqian Bai, Chengzhou Tang, Honghua Li, 25032, 2022. 14
Yasutaka Furukawa, and Ping Tan. Sanet: Scene agnostic [173] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and
network for camera localization. In ICCV, 2019. 3 Ping Tan. Neural window fully-connected crfs for monocular
[159] Zhenpei Yang, Zhile Ren, Qi Shan, and Qixing Huang. depth estimation. In CVPR, pages 3906–3915, 2022. 7, 8
MVS2D: efficient multiview stereo via attention-driven 2d [174] Zhaojie Zeng. OpenMVS. https://github.com/
convolutions. In CVPR, pages 8564–8574, 2022. 9 cdcseacave/openMVS, 2015. [Online; accessed 19-
[160] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. October-2023]. 3
Mvsnet: Depth inference for unstructured multi-view stereo. [175] Jingyang Zhang, Shiwei Li, Zixin Luo, Tian Fang, and Yao
In ECCV, 2018. 3, 9 Yao. Vis-mvsnet: Visibility-aware multi-view stereo net-
[161] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, work. Int. J. Comput. Vis., 131(1):199–214, 2023. 2, 9
Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large- [176] Jason Y. Zhang, Deva Ramanan, and Shubham Tulsiani.
scale dataset for generalized multi-view stereo networks. In Relpose: Predicting probabilistic relative rotation for single
CVPR, pages 1787–1796, 2020. 2, 6, 16, 17 objects in the wild. In Shai Avidan, Gabriel J. Brostow,
[162] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner,
Atzmon, Ronen Basri, and Yaron Lipman. Multiview neu- editors, ECCV, pages 592–611, 2022. 7, 8
ral surface reconstruction by disentangling geometry and [177] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen
appearance. In NeurIPS, 2020. 2, 3 Koltun. Nerf++: Analyzing and improving neural radiance
[163] Xinyi Ye, Weiyue Zhao, Tianqi Liu, Zihao Huang, Zhiguo fields. arXiv preprint arXiv:2010.07492, 2020. 14
Cao, and Xin Li. Constraining depth map geometry for multi- [178] Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo
view stereo: A dual-depth approach with saddle-shaped Poggi. GO-SLAM: Global optimization for consistent 3d
depth cells. ICCV, 2023. 3 instant reconstruction. In ICCV, pages 3727–3737, October
[164] Zhichao Ye, Chong Bao, Xin Zhou, Haomin Liu, Hujun Bao, 2023. 14
and Guofeng Zhang. Ec-sfm: Efficient covisibility-based [179] Zhe Zhang, Rui Peng, Yuxi Hu, and Ronggang Wang. Ge-
structure-from-motion for both sequential and unordered omvsnet: Learning multi-view stereo with geometry percep-
images. CoRR, abs/2302.10544, 2023. 2 tion. In CVPR, 2023. 3, 9
[165] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, [180] Chen Zhao, Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li,
and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- and Mathieu Salzmann. Progressive correspondence pruning
door scenes. In Proceedings of the International Conference by consensus learning. In ICCV, 2021. 2
on Computer Vision (ICCV), 2023. 2, 3, 6, 16, 17 [181] Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi,
[166] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Xianda Guo, Zheng Zhu, Guan Huang, Yang Tang, and
Fua. Lift: Learned invariant feature transform. In Computer Stefano Mattoccia. MonoViT: Self-supervised monocular
Vision–ECCV 2016: 14th European Conference, Amsterdam, depth estimation with a vision transformer. In International
The Netherlands, October 11-14, 2016, Proceedings, Part Conference on 3D Vision (3DV), sep 2022. 8
VI 14, pages 467–483. Springer, 2016. 3 [182] Yunhan Zhao, Connelly Barnes, Yuqian Zhou, Eli Shecht-
[167] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, man, Sohrab Amirghodsi, and Charless C. Fowlkes. Geofill:
Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Reference-based image inpainting with better geometric un-
Towards zero-shot metric 3d prediction from a single image. derstanding. In WACV, pages 1776–1786, 2023. 15
In ICCV, 2023. 3 [183] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox.
[168] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Si- DeepTAM: Deep tracking and mapping with convolutional
mon Chen, Yifan Liu, and Chunhua Shen. Towards accurate neural networks. Int. J. Comput. Vis., 128(3):756–769, 2020.
reconstruction of 3d scene shape from a single monocular 3, 14
image, 2022. 3 [184] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G
[169] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Si- Lowe. Unsupervised learning of depth and ego-motion from
mon Chen, Yifan Liu, and Chunhua Shen. Towards accurate video. In Proceedings of the IEEE conference on computer
reconstruction of 3d scene shape from a single monocular vision and pattern recognition, pages 1851–1858, 2017. 15
image. IEEE Transactions on Pattern Analysis and Machine [185] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe,
Intelligence (TPAMI), 2022. 3 and Noah Snavely. Stereo magnification: Learning view
[170] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, synthesis using multiplane images. ACM Trans. Graph.
Long Mai, Simon Chen, and Chunhua Shen. Learning to (Proc. SIGGRAPH), 37, 2018. 7, 8, 14, 15
recover 3d scene shape from a single image. In CVPR, 2020. [186] Rui Zhu, Chaoyang Wang, Chen-Hsuan Lin, Ziyan Wang,
3 and Simon Lucey. Semantic photometric bundle adjustment
on natural sequences. CoRR, 2017. 14
[187] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui,
Martin R Oswald, Andreas Geiger, and Marc Pollefeys.
Nicer-slam: Neural implicit scene encoding for rgb slam.
arXiv preprint arXiv:2302.03594, 2023. 14
[188] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu,
Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc
Pollefeys. Nice-slam: Neural implicit scalable encoding
for slam. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 12786–
12796, 2022. 14