Unsupervised Learning of Depth and Ego-Motion From Video


Tinghui Zhou∗ (UC Berkeley), Matthew Brown (Google), Noah Snavely (Google), David G. Lowe (Google)

Abstract

We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work [10, 14, 16], we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multi-view pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.

[Figure 1 panels: (a) Training: unlabeled video clips. (b) Testing: single-view depth (target view, Depth CNN) and multi-view pose estimation (nearby views, Pose CNN, output R, t).]

Figure 1. The training data to our system consists solely of unlabeled image sequences capturing scene appearance from different viewpoints, where the poses of the images are not provided. Our training procedure produces two models that operate independently, one for single-view depth prediction, and one for multi-view camera pose estimation.

1. Introduction

Humans are remarkably capable of inferring ego-motion and the 3D structure of a scene even over short timescales. For instance, in navigating along a street, we can easily locate obstacles and react quickly to avoid them. Years of research in geometric computer vision has failed to recreate similar modeling capabilities for real-world scenes (e.g., where non-rigidity, occlusion and lack of texture are present). So why do humans excel at this task? One hypothesis is that we develop a rich, structural understanding of the world through our past visual experience that has largely consisted of moving around and observing vast numbers of scenes and developing consistent modeling of our observations. From millions of such observations, we have learned about the regularities of the world: roads are flat, buildings are straight, cars are supported by roads etc., and we can apply this knowledge when perceiving a new scene, even from a single monocular image.

In this work, we mimic this approach by training a model that observes sequences of images and aims to explain its observations by predicting likely camera motion and the scene structure (as shown in Fig. 1). We take an end-to-end approach in allowing the model to map directly from input pixels to an estimate of ego-motion (parameterized as 6-DoF transformation matrices) and the underlying scene structure (parameterized as per-pixel depth maps under a reference view). We are particularly inspired by prior work that has suggested view synthesis as a metric [44] and recent work that tackles the calibrated, multi-view 3D case in an end-to-end framework [10]. Our method is unsupervised, and can be trained simply using sequences of images with no manual labeling or even camera motion information.

Our approach builds upon the insight that a geometric view synthesis system only performs consistently well when its intermediate predictions of the scene geometry and the camera poses correspond to the physical ground-truth. While imperfect geometry and/or pose estimation can cheat with reasonable synthesized views for certain types of scenes (e.g., textureless), the same model would fail miserably when presented with another set of scenes with more diverse layout and appearance structures. Thus, our goal is to formulate the entire view synthesis pipeline as the inference procedure of a convolutional neural network, so that by training the network on large-scale video data for the ‘meta’-task of view synthesis the network is forced to learn about intermediate tasks of depth and camera pose estimation in order to come up with a consistent explanation of the visual world. Empirical evaluation on the KITTI [15] benchmark demonstrates the effectiveness of our approach on both single-view depth and camera pose estimation. Our code will be made available at https://github.com/tinghuiz/SfMLearner.

∗ The majority of the work was done while interning at Google.

2. Related work

Structure from motion. The simultaneous estimation of structure and motion is a well studied problem with an established toolchain of techniques [12, 50, 38]. Whilst the traditional toolchain is effective and efficient in many cases, its reliance on accurate image correspondence can cause problems in areas of low texture, complex geometry/photometry, thin structures, and occlusions. To address these issues, several of the pipeline stages have been recently tackled using deep learning, e.g., feature matching [18], pose estimation [26], and stereo [10, 27, 53]. These learning-based techniques are attractive in that they are able to leverage external supervision during training, and potentially overcome the above issues when applied to test data.

Warping-based view synthesis. One important application of geometric scene understanding is the task of novel view synthesis, where the goal is to synthesize the appearance of the scene seen from novel camera viewpoints. A classic paradigm for view synthesis is to first either estimate the underlying 3D geometry explicitly or establish pixel correspondence among input views, and then synthesize the novel views by compositing image patches from the input views (e.g., [4, 55, 43, 6, 9]). Recently, end-to-end learning has been applied to reconstruct novel views by transforming the input based on depth or flow, e.g., DeepStereo [10], Deep3D [51] and Appearance Flows [54]. In these methods, the underlying geometry is represented by quantized depth planes (DeepStereo), probabilistic disparity maps (Deep3D) and view-dependent flow fields (Appearance Flows), respectively. Unlike methods that directly map from input views to the target view (e.g., [45]), warping-based methods are forced to learn intermediate predictions of geometry and/or correspondence. In this work, we aim to distill such geometric reasoning capability from CNNs trained to perform warping-based view synthesis.

Learning single-view 3D from registered 2D views. Our work is closely related to a line of recent research on learning single-view 3D inference from registered 2D observations. Garg et al. [14] propose to learn a single-view depth estimation CNN using projection errors to a calibrated stereo twin for supervision. Concurrently, Deep3D [51] predicts a second stereo viewpoint from an input image using stereoscopic film footage as training data. A similar approach was taken by Godard et al. [16], with the addition of a left-right consistency constraint, and a better architecture design that led to impressive performance. Like our approach, these techniques only learn from image observations of the world, unlike methods that require explicit depth for training, e.g., [20, 42, 7, 27, 30].

These techniques bear some resemblance to direct methods for structure and motion estimation [22], where the camera parameters and scene depth are adjusted to minimize a pixel-based error function. However, rather than directly minimizing the error to obtain the estimation, the CNN-based methods only take a gradient step for each batch of input instances, which allows the network to learn an implicit prior from a large corpus of related imagery. Several authors have explored building differentiable rendering operations into their models that are trained in this way, e.g., [19, 29, 34].

While most of the above techniques (including ours) are mainly focused on inferring depth maps as the scene geometry output, recent work (e.g., [13, 41, 46, 52]) has also shown success in learning 3D volumetric representations from 2D observations based on similar principles of projective geometry. Fouhey et al. [11] further show that it is even possible to learn 3D inference without 3D labels (or registered 2D views) by utilizing scene regularity.

Unsupervised/Self-supervised learning from video. Another line of related work to ours is visual representation learning from video, where the general goal is to design pretext tasks for learning generic visual features from video data that can later be re-purposed for other vision tasks such as object detection and semantic segmentation. Such pretext tasks include ego-motion estimation [2, 24], tracking [49], temporal coherence [17], temporal order verification [36], and object motion mask prediction [39]. While we focus on inferring the explicit scene geometry and ego-motion in this work, intuitively, the internal representation learned by the deep network (especially the single-view depth CNN) should capture some level of semantics that could generalize to other tasks as well.

Concurrent to our work, Vijayanarasimhan et al. [48] independently propose a framework for joint training of depth, camera motion and scene motion from videos. While both methods are conceptually similar, ours is focused on the unsupervised aspect, whereas their framework adds the capability to incorporate supervision (e.g., depth, camera motion or scene motion). There are significant differences in how scene dynamics are modeled during training, in which they explicitly solve for object motion whereas our explainability mask discounts regions undergoing motion, occlusion and other factors.

3. Approach

Here we propose a framework for jointly training a single-view depth CNN and a camera pose estimation CNN from unlabeled video sequences. Despite being jointly trained, the depth model and the pose estimation model can be used independently during test-time inference. Training examples to our model consist of short image sequences of scenes captured by a moving camera. While our training procedure is robust to some degree of scene
motion, we assume that the scenes we are interested in are mostly rigid, i.e., the scene appearance change across different frames is dominated by the camera motion.

[Figure 2 diagram labels: Depth CNN, Pose CNN, $I_t$, $I_{t-1}$, $I_{t+1}$, $\hat{D}_t(p)$, $\hat{T}_{t \to t-1}$, $\hat{T}_{t \to t+1}$, Project.]

Figure 2. Overview of the supervision pipeline based on view synthesis. The depth network takes only the target view as input, and outputs a per-pixel depth map $\hat{D}_t$. The pose network takes both the target view ($I_t$) and the nearby/source views (e.g., $I_{t-1}$ and $I_{t+1}$) as input, and outputs the relative camera poses ($\hat{T}_{t \to t-1}$, $\hat{T}_{t \to t+1}$). The outputs of both networks are then used to inverse warp the source views (see Sec. 3.2) to reconstruct the target view, and the photometric reconstruction loss is used for training the CNNs. By utilizing view synthesis as supervision, we are able to train the entire framework in an unsupervised manner from videos.

3.1. View synthesis as supervision

The key supervision signal for our depth and pose prediction CNNs comes from the task of novel view synthesis: given one input view of a scene, synthesize a new image of the scene seen from a different camera pose. We can synthesize a target view given a per-pixel depth in that image, plus the pose and visibility in a nearby view. As we will show next, this synthesis process can be implemented in a fully differentiable manner with CNNs as the geometry and pose estimation modules. Visibility can be handled, along with non-rigidity and other non-modeled factors, using an “explainability” mask, which we discuss later (Sec. 3.3).

Let us denote $\langle I_1, \ldots, I_N \rangle$ as a training image sequence with one of the frames $I_t$ being the target view and the rest being the source views $I_s$ ($1 \le s \le N$, $s \ne t$). The view synthesis objective can be formulated as

$$\mathcal{L}_{vs} = \sum_{s} \sum_{p} \left| I_t(p) - \hat{I}_s(p) \right| , \tag{1}$$

where $p$ indexes over pixel coordinates, and $\hat{I}_s$ is the source view $I_s$ warped to the target coordinate frame based on a depth image-based rendering module [8] (described in Sec. 3.2), taking the predicted depth $\hat{D}_t$, the predicted $4 \times 4$ camera transformation matrix [footnote 1] $\hat{T}_{t \to s}$ and the source view $I_s$ as input.

Note that the idea of view synthesis as supervision has also been recently explored for learning single-view depth estimation [14, 16] and multi-view stereo [10]. However, to the best of our knowledge, all previous work requires posed image sets during training (and testing too in the case of DeepStereo), while our framework can be applied to standard videos without pose information. Furthermore, it predicts the poses as part of the learning framework. See Figure 2 for an illustration of our learning pipeline for depth and pose estimation.

3.2. Differentiable depth image-based rendering

As indicated in Eq. 1, a key component of our learning framework is a differentiable depth image-based renderer that reconstructs the target view $I_t$ by sampling pixels from a source view $I_s$ based on the predicted depth map $\hat{D}_t$ and the relative pose $\hat{T}_{t \to s}$.

Let $p_t$ denote the homogeneous coordinates of a pixel in the target view, and $K$ denote the camera intrinsics matrix. We can obtain $p_t$'s projected coordinates onto the source view $p_s$ by [footnote 2]

$$p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t \tag{2}$$

Notice that the projected coordinates $p_s$ are continuous values. To obtain $I_s(p_s)$ for populating the value of $\hat{I}_s(p_t)$ (see Figure 3), we then use the differentiable bilinear sampling mechanism proposed in the spatial transformer networks [23] that linearly interpolates the values of the 4-pixel neighbors (top-left, top-right, bottom-left, and bottom-right) of $p_s$ to approximate $I_s(p_s)$, i.e. $\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\}, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})$, where $w^{ij}$ is linearly proportional to the spatial proximity between $p_s$ and $p_s^{ij}$, and $\sum_{i,j} w^{ij} = 1$. A similar strategy is used in [54] for learning to directly warp between different views, while here the coordinates for pixel warping are obtained through projective geometry that enables the factorization of depth and camera pose.

[Figure 3 diagram labels: $I_t$, $I_s$, $\hat{I}_s$, $p_t$, $p_s$, $p_s^{tl}$, $p_s^{tr}$, $p_s^{bl}$, $p_s^{br}$, Project, Warp.]

Figure 3. Illustration of the differentiable image warping process. For each point $p_t$ in the target view, we first project it onto the source view based on the predicted depth and camera pose, and then use bilinear interpolation to obtain the value of the warped image $\hat{I}_s$ at location $p_t$.

3.3. Modeling the model limitation

Note that when applied to monocular videos the above view synthesis formulation implicitly assumes 1) the scene is static without moving objects; 2) there is no occlusion/disocclusion between the target view and the source views; 3) the surface is Lambertian so that the photo-consistency error is meaningful. If any of these assumptions are violated in a training sequence, the gradients could be corrupted and potentially inhibit training. To improve the robustness of our learning pipeline to these factors, we additionally train an explainability prediction network (jointly and simultaneously with the depth and pose networks) that outputs a per-pixel soft mask $\hat{E}_s$ for each target-source pair, indicating the

Footnote 1. In practice, the CNN estimates the Euler angles and the 3D translation vector, which are then converted to the transformation matrix.

Footnote 2. For notation simplicity, we omit showing the necessary conversion to homogeneous coordinates along the steps of matrix multiplication.
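
The conversions mentioned in footnotes 1 and 2, the projection of Eq. 2, and the bilinear lookup of Sec. 3.2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' TensorFlow implementation: the function names are hypothetical, the rotation order R = Rz·Ry·Rx is an assumed convention (the text does not fix one), and clamping out-of-range samples to the image border is a choice made here.

```python
import numpy as np

def pose_vec_to_mat(rx, ry, rz, tx, ty, tz):
    """Footnote 1: Euler angles (radians) + translation -> 4x4 transform.
    Assumed convention: R = Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [tx, ty, tz]
    return T

def project_to_source(depth, K, T):
    """Eq. 2 for every target pixel, with the homogeneous conversions
    of footnote 2 written out. depth: (H, W), K: (3, 3), T: (4, 4).
    Returns (H, W, 2) continuous source coordinates (x, y)."""
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    p_t = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1).astype(float)
    cam = (np.linalg.inv(K) @ p_t) * depth.reshape(1, -1)  # back-project to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])   # homogeneous 3D points
    src = (T @ cam_h)[:3]                                   # target -> source frame
    pix = K @ src
    pix = pix[:2] / pix[2:3]                                # perspective divide
    return pix.T.reshape(H, W, 2)

def bilinear_sample(img, coords):
    """Interpolate img (H, W) or (H, W, C) at coords (..., 2) given as (x, y).
    Assumption: coordinates outside the image are clamped to the border."""
    H, W = img.shape[:2]
    x = np.clip(coords[..., 0], 0, W - 1)
    y = np.clip(coords[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    if img.ndim == 3:                                       # broadcast over channels
        wx, wy = wx[..., None], wy[..., None]
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

# Usage: an identity pose must reproduce the source image.
K = np.array([[100.0, 0.0, 8.0], [0.0, 100.0, 6.0], [0.0, 0.0, 1.0]])
depth = np.full((12, 16), 5.0)
source = np.arange(12 * 16, dtype=float).reshape(12, 16)
warped = bilinear_sample(source, project_to_source(depth, K, pose_vec_to_mat(0, 0, 0, 0, 0, 0)))
```

With a constant depth of 5 and a focal length of 100 pixels, a sideways translation of 0.05 shifts every projected x-coordinate by 100 · 0.05 / 5 = 1 pixel, which is a quick sanity check of the geometry.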
[Figure 4 legend: Input, Conv, Deconv, Concat, Upsample + Concat, Prediction. Panels: (a) Single-view depth network; (b) Pose/explainability network.]

Figure 4. Network architecture for our depth/pose/explainability prediction modules. The width and height of each rectangular block indicate the output channels and the spatial dimension of the feature map at the corresponding layer respectively, and each reduction/increase in size indicates a change by the factor of 2. (a) For single-view depth, we adopt the DispNet [35] architecture with multi-scale side predictions. The kernel size is 3 for all the layers except for the first 4 conv layers with 7, 7, 5, 5, respectively. The number of output channels for the first conv layer is 32. (b) The pose and explainability networks share the first few conv layers, and then branch out to predict 6-DoF relative pose and multi-scale explainability masks, respectively. The number of output channels for the first conv layer is 16, and the kernel size is 3 for all the layers except for the first two conv and the last two deconv/prediction layers where we use 7, 5, 5, 7, respectively. See Section 3.5 for more details.

network’s belief in where direct view synthesis will be successfully modeled for each target pixel. Based on the predicted $\hat{E}_s$, the view synthesis objective is weighted correspondingly by

$$\mathcal{L}_{vs} = \sum_{\langle I_1, \ldots, I_N \rangle \in \mathcal{S}} \sum_{p} \hat{E}_s(p) \left| I_t(p) - \hat{I}_s(p) \right| . \tag{3}$$

Since we do not have direct supervision for $\hat{E}_s$, training with the above loss would result in a trivial solution of the network always predicting $\hat{E}_s$ to be zero, which perfectly minimizes the loss. To resolve this, we add a regularization term $\mathcal{L}_{reg}(\hat{E}_s)$ that encourages nonzero predictions by minimizing the cross-entropy loss with constant label 1 at each pixel location. In other words, the network is encouraged to minimize the view synthesis objective, but allowed a certain amount of slack for discounting the factors not considered by the model.

3.4. Overcoming the gradient locality

One remaining issue with the above learning pipeline is that the gradients are mainly derived from the pixel intensity difference between $I(p_t)$ and the four neighbors of $I(p_s)$, which would inhibit training if the correct $p_s$ (projected using the ground-truth depth and pose) is located in a low-texture region or far from the current estimation. This is a well known issue in motion estimation [3]. Empirically, we found two strategies to be effective for overcoming this issue: 1) using a convolutional encoder-decoder architecture with a small bottleneck for the depth network that implicitly constrains the output to be globally smooth and facilitates gradients to propagate from meaningful regions to nearby regions; 2) explicit multi-scale and smoothness loss (e.g., as in [14, 16]) that allows gradients to be derived from larger spatial regions directly. We adopt the second strategy in this work as it is less sensitive to architectural choices. For smoothness, we minimize the L1 norm of the second-order gradients for the predicted depth maps (similar to [48]).

Our final objective becomes

$$\mathcal{L}_{final} = \sum_{l} \mathcal{L}^{l}_{vs} + \lambda_s \mathcal{L}^{l}_{smooth} + \lambda_e \sum_{s} \mathcal{L}_{reg}(\hat{E}^{l}_{s}) , \tag{4}$$

where $l$ indexes over different image scales, $s$ indexes over source images, and $\lambda_s$ and $\lambda_e$ are the weighting for the depth smoothness loss and the explainability regularization, respectively.

3.5. Network architecture

Single-view depth. For single-view depth prediction, we adopt the DispNet architecture proposed in [35] that is mainly based on an encoder-decoder design with skip connections and multi-scale side predictions (see Figure 4). All conv layers are followed by ReLU activation except for the prediction layers, where we use $1 / (\alpha \cdot \mathrm{sigmoid}(x) + \beta)$ with $\alpha = 10$ and $\beta = 0.1$ to constrain the predicted depth to be always positive within a reasonable range. We also experimented with using multiple views as input to the depth network, but did not find this to improve the results. This is in line with the observations in [47], where optical flow constraints need to be enforced to utilize multiple views effectively.
Pose. The input to the pose estimation network is the target view concatenated with all the source views (along the color channels), and the outputs are the relative poses between the target view and each of the source views. The network consists of 7 stride-2 convolutions followed by a 1 × 1 convolution with $6 \cdot (N - 1)$ output channels (corresponding to 3 Euler angles and 3-D translation for each source view). Finally, global average pooling is applied to aggregate predictions at all spatial locations. All conv layers are followed by ReLU except for the last layer where no nonlinear activation is applied.

Explainability mask. The explainability prediction network shares the first five feature encoding layers with the pose network, followed by 5 deconvolution layers with multi-scale side predictions. All conv/deconv layers are followed by ReLU except for the prediction layers with no nonlinear activation. The number of output channels for each prediction layer is $2 \cdot (N - 1)$, with every two channels normalized by softmax to obtain the explainability prediction for the corresponding source-target pair (the second channel after normalization is $\hat{E}_s$ and used in computing the loss in Eq. 3).

4. Experiments

Here we evaluate the performance of our system, and compare with prior approaches on single-view depth as well as ego-motion estimation. We mainly use the KITTI dataset [15] for benchmarking, but also use the Make3D dataset [42] for evaluating cross-dataset generalization ability.

Training Details. We implemented the system using the publicly available TensorFlow [1] framework. For all the experiments, we set $\lambda_s = 0.5/l$ ($l$ is the downscaling factor for the corresponding scale) and $\lambda_e = 0.2$. During training, we used batch normalization [21] for all the layers except for the output layers, and the Adam [28] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, learning rate of 0.0002 and mini-batch size of 4. The training typically converges after about 150K iterations. All the experiments are performed with image sequences captured with a monocular camera. We resize the images to 128 × 416 during training, but both the depth and pose networks can be run fully-convolutionally for images of arbitrary size at test time.

4.1. Single-view depth estimation

We train our system on the split provided by [7], and exclude all the frames from the testing scenes as well as static sequences with mean optical flow magnitude less than 1 pixel for training. We fix the length of image sequences to be 3 frames, and treat the central frame as the target view and the ±1 frames as the source views. We use images captured by both color cameras, but treat them independently when forming training sequences. This results in a total of 44,540 sequences, out of which we use 40,109 for training and 4,431 for validation.

To the best of our knowledge, no previous systems exist that learn single-view depth estimation in an unsupervised manner from monocular videos. Nonetheless, here we provide comparison with prior methods with depth supervision [7] and recent methods that use calibrated stereo images (i.e. with pose supervision) for training [14, 16]. Since the depth predicted by our method is defined up to a scale factor, for evaluation we multiply the predicted depth maps by a scalar $\hat{s}$ that matches the median with the ground-truth, i.e. $\hat{s} = \mathrm{median}(D_{gt}) / \mathrm{median}(D_{pred})$.

Similar to [16], we also experimented with first pre-training the system on the larger Cityscapes dataset [5] (sample predictions are shown in Figure 5), and then fine-tuning on KITTI, which results in a slight performance improvement.

[Figure 5 columns: Input image, Our prediction.]

Figure 5. Our sample predictions on the Cityscapes dataset using the model trained on Cityscapes only.

[Figure 7 columns: Input image, Ours (CS), Ours (CS + KITTI).]

Figure 7. Comparison of single-view depth predictions on the KITTI dataset by our initial Cityscapes model and the final model (pre-trained on Cityscapes and then fine-tuned on KITTI). The Cityscapes model sometimes makes structural mistakes (e.g. holes on car body) likely due to the domain gap between the two datasets.

KITTI. Here we evaluate the single-view depth performance on the 697 images from the test split of [7]. As shown in Table 1, our unsupervised method performs comparably with several supervised methods (e.g. Eigen et al. [7] and Garg et al. [14]), but falls short of concurrent work by Godard et al. [16] that uses calibrated stereo images (i.e. with pose supervision) with left-right cycle consistency loss for training. For future work, it would be interesting to see if incorporating the similar cycle consistency loss into our framework could further improve the results. Figure 6
[Figure 6 columns: Input, Ground-truth, Eigen et al. (depth sup.), Garg et al. (pose sup.), Ours (unsupervised).]

Figure 6. Comparison of single-view depth estimation between Eigen et al. [7] (with ground-truth depth supervision), Garg et al. [14] (with ground-truth pose supervision), and ours (unsupervised). The ground-truth depth map is interpolated from sparse measurements for visualization purposes. The last two rows show typical failure cases of our model, which sometimes struggles in vast open scenes and with objects close to the front of the camera.

provides examples of visual comparison between our results and some supervised baselines over a variety of examples. One can see that although trained in an unsupervised manner, our results are comparable to those of the supervised baselines, and sometimes preserve the depth boundaries and thin structures such as trees and street lights better.

We show sample predictions made by our initial Cityscapes model and the final model (pre-trained on Cityscapes and then fine-tuned on KITTI) in Figure 7. Due to the domain gap between the two datasets, our Cityscapes model sometimes has difficulty in recovering the complete shape of the car/bushes, and mistakes them for distant objects.

We also performed an ablation study of the explainability modeling (see Table 1), which turns out to offer only a modest performance boost. This is likely because 1) most of the KITTI scenes are static without significant scene motions, and 2) the occlusion/visibility effects only occur in small regions in sequences across a short time span (3 frames), which makes the explainability modeling less essential to the success of training. Nonetheless, our explainability prediction network does seem to capture factors like scene motion and visibility well (see Sec. 4.3), and could potentially be more important for other more challenging datasets.

Make3D. To evaluate the generalization ability of our single-view depth model, we directly apply our model trained on Cityscapes + KITTI to the Make3D dataset unseen during training. While there still remains a significant performance gap between our method and others supervised using Make3D ground-truth depth (see Table 2), our predictions are able to capture the
Error metrics (lower is better): Abs Rel, Sq Rel, RMSE, RMSE log. Accuracy metrics (higher is better): δ < 1.25, δ < 1.25², δ < 1.25³. Supervision: ✓ = used during training, - = not used.

Method | Dataset | Depth sup. | Pose sup. | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
Train set mean | K | ✓ | - | 0.403 | 5.530 | 8.709 | 0.403 | 0.593 | 0.776 | 0.878
Eigen et al. [7] Coarse | K | ✓ | - | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957
Eigen et al. [7] Fine | K | ✓ | - | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958
Liu et al. [32] | K | ✓ | - | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965
Godard et al. [16] | K | - | ✓ | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964
Godard et al. [16] | CS + K | - | ✓ | 0.124 | 1.076 | 5.311 | 0.219 | 0.847 | 0.942 | 0.973
Ours (w/o explainability) | K | - | - | 0.221 | 2.226 | 7.527 | 0.294 | 0.676 | 0.885 | 0.954
Ours | K | - | - | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
Ours | CS | - | - | 0.267 | 2.686 | 7.580 | 0.334 | 0.577 | 0.840 | 0.937
Ours | CS + K | - | - | 0.198 | 1.836 | 6.565 | 0.275 | 0.718 | 0.901 | 0.960
Garg et al. [14] cap 50m | K | - | ✓ | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962
Ours (w/o explainability) cap 50m | K | - | - | 0.208 | 1.551 | 5.452 | 0.273 | 0.695 | 0.900 | 0.964
Ours cap 50m | K | - | - | 0.201 | 1.391 | 5.181 | 0.264 | 0.696 | 0.900 | 0.966
Ours cap 50m | CS | - | - | 0.260 | 2.232 | 6.148 | 0.321 | 0.590 | 0.852 | 0.945
Ours cap 50m | CS + K | - | - | 0.190 | 1.436 | 4.975 | 0.258 | 0.735 | 0.915 | 0.968

Table 1. Single-view depth results on the KITTI dataset [15] using the split of Eigen et al. [7] (Baseline numbers taken from [16]). For training, K = KITTI, and CS = Cityscapes [5]. All methods we compare with use some form of supervision (either ground-truth depth or calibrated camera pose) during training. Note: results from Garg et al. [14] are capped at 50m depth, so we break these out separately in the lower part of the table.
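
For readers unfamiliar with the column names in Table 1, the sketch below shows how such numbers are commonly computed, including the median scaling described in Sec. 4.1. The metric formulas are the ones conventionally used with the split of [7]; they are not spelled out in this paper, so their exact form here is an assumption, as are the function name and the choice to ignore pixels without ground truth. The 50m cap used in the lower part of the table is not modeled.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Scale-align pred to gt by the ratio of medians, then compute
    the error and accuracy metrics on pixels with valid ground truth."""
    gt = np.asarray(gt, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()
    valid = gt > 0                       # assumption: 0 marks missing ground truth
    gt, pred = gt[valid], pred[valid]

    # s_hat = median(D_gt) / median(D_pred), as in Sec. 4.1
    pred = pred * (np.median(gt) / np.median(pred))

    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }

# Usage: a prediction that is correct up to a global scale gets zero error.
gt_depth = np.array([1.0, 2.0, 4.0, 8.0])
perfect = depth_metrics(gt_depth, 2.0 * gt_depth)
```

Because of the median alignment, any prediction that differs from the ground truth only by a constant factor scores perfectly, which is exactly the scale ambiguity the evaluation is meant to factor out.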

[Figure 8 columns: Input, Ground-truth, Ours.]

Error metrics (lower is better). Supervision: ✓ = used during training, - = not used.

Method | Depth sup. | Pose sup. | Abs Rel | Sq Rel | RMSE | RMSE log
Train set mean | ✓ | - | 0.876 | 13.98 | 12.27 | 0.307
Karsch et al. [25] | ✓ | - | 0.428 | 5.079 | 8.389 | 0.149
Liu et al. [33] | ✓ | - | 0.475 | 6.562 | 10.05 | 0.165
Laina et al. [31] | ✓ | - | 0.204 | 1.840 | 5.683 | 0.084
Godard et al. [16] | - | ✓ | 0.544 | 10.94 | 11.76 | 0.193
Ours | - | - | 0.383 | 5.321 | 10.47 | 0.478

Table 2. Results on the Make3D dataset [42]. Similar to ours, Godard et al. [16] do not utilize any of the Make3D data during training, and directly apply the model trained on KITTI+Cityscapes to the test set. Following the evaluation protocol of [16], the errors are only computed where depth is less than 70 meters in a central image crop.

global scene layout reasonably well without any training on the Make3D images (see Figure 8).

Figure 8. Our sample predictions on the Make3D dataset. Note that our model is trained on KITTI + Cityscapes only, and directly tested on Make3D.

4.2. Pose estimation

To evaluate the performance of our pose estimation network, we applied our system to the official KITTI odometry split (containing 11 driving sequences with ground truth odometry obtained through the IMU/GPS readings, which we use for evaluation purpose only), and used sequences 00-08 for training and 09-10 for testing. In this experiment, we fix the length of input image sequences to our system to 5 frames. We compare our ego-motion estimation with two variants of monocular ORB-SLAM [37] (a well-established SLAM system): 1) ORB-SLAM (full), which recovers odometry using all frames of the driving sequence (i.e. allowing loop closure and re-localization), and 2) ORB-SLAM (short), which runs on 5-frame snippets (same as our input setting). Another baseline we compare with is the dataset mean of car motion (using ground-truth odometry) for 5-frame snippets. To resolve scale ambiguity during evaluation, we first optimize the scaling factor for the predictions made by each method to best align with the ground truth, and then measure the Absolute Trajectory Error (ATE) [37] as the metric. ATE is computed on 5-frame snippets and averaged over the full sequence [footnote 3]. As shown in Table 3 and Fig. 9, our method outperforms both baselines (mean odometry and ORB-SLAM (short)) that share the same input setting as ours, but falls short of ORB-SLAM (full), which leverages whole sequences (1591 for seq. 09 and 1201 for seq. 10) for loop closure and re-localization.

For better understanding of our pose estimation results, we

Footnote 3. For evaluating ORB-SLAM (full) we break down the trajectory of the full sequence into 5-frame snippets with the reference coordinate frame adjusted to the central frame of each snippet.
[Figure 10 columns: Target view, Explainability mask, Source view.]

Method | Seq. 09 | Seq. 10
ORB-SLAM (full) | 0.014 ± 0.008 | 0.012 ± 0.011
ORB-SLAM (short) | 0.064 ± 0.141 | 0.064 ± 0.130
Mean Odom. | 0.032 ± 0.026 | 0.028 ± 0.023
Ours | 0.021 ± 0.017 | 0.020 ± 0.015

Table 3. Absolute Trajectory Error (ATE) on the KITTI odometry split averaged over all 5-frame snippets (lower is better). Our method outperforms baselines with the same input setting, but falls short of ORB-SLAM (full) that uses strictly more data.
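
The scale alignment and per-snippet ATE described in Sec. 4.2 can be sketched as follows. This is a simplified illustration, not the evaluation script behind Table 3: it assumes both trajectories are already expressed in the same reference frame, uses the closed-form least-squares scale, and reports the root-mean-square position error; the function name is hypothetical.

```python
import numpy as np

def snippet_ate(gt_xyz, pred_xyz):
    """ATE for one snippet of camera positions, each of shape (n_frames, 3).
    The prediction is first rescaled by the factor that best aligns it
    with the ground truth in the least-squares sense."""
    gt = np.asarray(gt_xyz, dtype=float)
    pred = np.asarray(pred_xyz, dtype=float)
    denom = np.sum(pred ** 2)
    scale = np.sum(gt * pred) / denom if denom > 0 else 1.0
    err = gt - scale * pred
    return np.sqrt(np.sum(err ** 2) / len(gt))

# Usage: forward motion predicted at half the true scale has zero ATE
# once the scale is optimized.
gt_snippet = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
ate_scaled = snippet_ate(gt_snippet, 0.5 * gt_snippet)
```

A sequence-level number like those in Table 3 would then be the mean and standard deviation of `snippet_ate` over all 5-frame snippets of a test sequence.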

Figure 10. Sample visualizations of the explainability masks. Highlighted pixels are predicted to be unexplainable by the network due to motion (rows 1–3), occlusion/visibility (rows 4–5), or other factors (rows 7–8).

[Figure 9 plot. Vertical axis: Absolute Translation Error (m), 0 to 0.1. Horizontal axis: Left/right turning magnitude (m), 0 to 0.5. Curves: Mean Odom., ORB-SLAM (full), ORB-SLAM (short), Ours.]
Figure 9. Absolute Trajectory Error (ATE) at different left/right turning magnitude (coordinate difference in the side-direction between the start and ending frame of a testing sequence). Our method performs significantly better than ORB-SLAM (short) when side rotation is small, and is comparable with ORB-SLAM (full) across the entire spectrum.
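The breakdown plotted in Figure 9 groups snippets by how far the car moves sideways between the first and last frame. A minimal sketch of that grouping follows, assuming each snippet is an array of camera positions whose first coordinate is the lateral (side) direction and that per-snippet ATE values are already available. The function names, the choice of axis, and the bin edges are hypothetical and are not taken from the paper.

```python
import numpy as np


def turning_magnitude(snippet_xyz, side_axis=0):
    """Absolute side-direction offset between the first and last frame.

    Assumption: `side_axis` indexes the lateral coordinate of the positions.
    """
    snippet_xyz = np.asarray(snippet_xyz, dtype=float)
    return float(abs(snippet_xyz[-1, side_axis] - snippet_xyz[0, side_axis]))


def ate_by_turning(magnitudes, errors, bin_edges):
    """Average the per-snippet ATE inside each turning-magnitude bin.

    Bins are half-open, [edge_i, edge_{i+1}); values outside the edges are
    ignored. Bins that receive no snippet are reported as NaN.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    errors = np.asarray(errors, dtype=float)
    bin_index = np.digitize(magnitudes, bin_edges) - 1
    means = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        in_bin = bin_index == b
        if in_bin.any():
            means[b] = errors[in_bin].mean()
    return means
```

With bin edges such as `np.linspace(0.0, 0.5, 6)`, the returned vector gives one averaged error per 0.1 m interval of turning magnitude, which is the kind of curve the figure compares across methods.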

show in Figure 9 the ATE curve with varying amount of side-rotation by the car between the beginning and the end of a sequence. Figure 9 suggests that our method is significantly better than ORB-SLAM (short) when the side-rotation is small (i.e. car mostly driving forward), and comparable to ORB-SLAM (full) across the entire spectrum. The large performance gap between ours and ORB-SLAM (short) suggests that our learned ego-motion could potentially be used as an alternative to the local estimation modules in monocular SLAM systems.

4.3. Visualizing the explainability prediction

We visualize example explainability masks predicted by our network in Figure 10. The first three rows suggest that the network has learned to identify dynamic objects in the scene as unexplainable by our model, and similarly, rows 4–5 are examples of objects that disappear from the frame in subsequent views. The last two rows demonstrate the potential downside of explainability-weighted loss: the depth CNN has low confidence in predicting thin structures well, and tends to mask them as unexplainable.

5. Discussion

We have presented an end-to-end learning pipeline that utilizes the task of view synthesis for supervision of single-view depth and camera pose estimation. The system is trained on unlabeled videos, and yet performs comparably with approaches that require ground-truth depth or pose for training. Despite good performance on the benchmark evaluation, our method is by no means close to solving the general problem of unsupervised learning of 3D scene structure inference. A number of major challenges are yet to be addressed: 1) our current framework does not explicitly estimate scene dynamics and occlusions (although they are implicitly taken into account by the explainability masks), both of which are critical factors in 3D scene understanding. Direct modeling of scene dynamics through motion segmentation (e.g. [48, 40]) could be a potential solution; 2) our framework assumes the camera intrinsics are given, which forbids the use of random Internet videos with unknown camera types/calibration – we plan to address this in future work; 3) depth maps are a simplified representation of the underlying 3D scene. It would be interesting to extend our framework to learn full 3D volumetric representations (e.g. [46]).

Another interesting area for future work would be to investigate in more detail the representation learned by our system. In particular, the pose network likely uses some form of image correspondence in estimating the camera motion, whereas the depth estimation network likely recognizes common structural features of scenes and objects. It would be interesting to probe these, and investigate the extent to which our network already performs, or could be re-purposed to perform, tasks such as object detection and semantic segmentation.

Acknowledgments: We thank our colleagues, Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki for their help. We also thank the anonymous reviewers for their valuable comments. TZ would like to thank Shubham Tulsiani for helpful discussions, and Clement Godard for sharing the evaluation code. This work is also partially funded by Intel/NSF VEC award IIS-1539099.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016. 5
[2] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Int. Conf. Computer Vision, 2015. 2
[3] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In Computer Vision ECCV'92, pages 237–252. Springer, 1992. 4
[4] S. E. Chen and L. Williams. View interpolation for image synthesis. In Proceedings of the 20th annual conference on Computer graphics and interactive techniques, pages 279–288. ACM, 1993. 2
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016. 5, 7
[6] P. E. Debevec, C. J. Taylor, and J. Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 11–20. ACM, 1996. 2
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, 2014. 2, 5, 6, 7
[8] C. Fehn. Depth-image-based rendering (dibr), compression, and transmission for a new approach on 3d-tv. In Electronic Imaging 2004, pages 93–104. International Society for Optics and Photonics, 2004. 3
[9] A. Fitzgibbon, Y. Wexler, and A. Zisserman. Image-based rendering using image-based priors. Int. Journal of Computer Vision, 63(2):141–151, 2005. 2
[10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Computer Vision and Pattern Recognition, 2016. 1, 2, 3
[11] D. F. Fouhey, W. Hussain, A. Gupta, and M. Hebert. Single image 3D without a single 3D image. In Proceedings of the IEEE International Conference on Computer Vision, pages 1053–1061, 2015. 2
[12] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In Computer Vision and Pattern Recognition, pages 1434–1441. IEEE, 2010. 2
[13] M. Gadelha, S. Maji, and R. Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016. 2
[14] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conf. Computer Vision, 2016. 1, 2, 3, 4, 5, 6, 7
[15] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012. 2, 5, 7
[16] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Computer Vision and Pattern Recognition, 2017. 1, 2, 3, 4, 5, 7
[17] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4086–4093, 2015. 2
[18] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Computer Vision and Pattern Recognition, pages 3279–3286, 2015. 2
[19] A. Handa, M. Bloesch, V. Patraucean, S. Stent, J. McCormac, and A. Davison. gvnn: Neural network library for geometric computer vision. arXiv preprint arXiv:1607.07405, 2016. 2
[20] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In Proc. SIGGRAPH, 2005. 2
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 5
[22] M. Irani and P. Anandan. About direct methods. In International Workshop on Vision Algorithms, pages 267–277. Springer, 1999. 2
[23] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015. 3
[24] D. Jayaraman and K. Grauman. Learning image representations tied to egomotion. In Int. Conf. Computer Vision, 2015. 2
[25] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE transactions on pattern analysis and machine intelligence, 36(11):2144–2158, 2014. 7
[26] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Int. Conf. Computer Vision, pages 2938–2946, 2015. 2
[27] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. arXiv preprint arXiv:1703.04309, 2017. 2
[28] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[29] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 2539–2547. Curran Associates, Inc., 2015. 2
[30] Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. arXiv preprint arXiv:1702.02706, 2017. 2
[31] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016. 7
[32] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE transactions on pattern analysis and machine intelligence, 38(10):2024–2039, 2016. 7
[33] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–723, 2014. 7
[34] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In European Conf. Computer Vision, pages 154–169. Springer, 2014. 2
[35] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016. 4
[36] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016. 2
[37] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5), 2015. 7
[38] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In Int. Conf. Computer Vision, pages 2320–2327. IEEE, 2011. 2
[39] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017. 2
[40] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4058–4066, 2016. 8
[41] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances In Neural Information Processing Systems, pages 4997–5005, 2016. 2
[42] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. Pattern Analysis and Machine Intelligence, 31(5):824–840, May 2009. 2, 5, 7
[43] S. M. Seitz and C. R. Dyer. View morphing. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 21–30. ACM, 1996. 2
[44] R. Szeliski. Prediction error as a quality metric for motion and stereo. In Int. Conf. Computer Vision, volume 2, pages 781–788. IEEE, 1999. 1
[45] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016. 2
[46] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Computer Vision and Pattern Recognition, 2017. 2, 8
[47] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. arXiv preprint arXiv:1612.02401, 2016. 4
[48] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint, 2017. 2, 4, 8
[49] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015. 2
[50] C. Wu. VisualSFM: A visual structure from motion system. http://ccwu.me/vsfm, 2011. 2
[51] J. Xie, R. B. Girshick, and A. Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In European Conf. Computer Vision, 2016. 2
[52] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016. 2
[53] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016. 2
[54] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286–301. Springer, 2016. 2, 3
[55] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. In ACM Transactions on Graphics (TOG), volume 23, pages 600–608. ACM, 2004. 2
