Unsupervised Learning of Depth and Ego-Motion From Video
truth. While imperfect geometry and/or pose estimation can cheat with reasonable synthesized views for certain types of scenes (e.g., textureless), the same model would fail miserably when presented with another set of scenes with more diverse layout and appearance structures. Thus, our goal is to formulate the entire view synthesis pipeline as the inference procedure of a convolutional neural network, so that by training the network on large-scale video data for the 'meta'-task of view synthesis the network is forced to learn about intermediate tasks of depth and camera pose estimation in order to come up with a consistent explanation of the visual world. Empirical evaluation on the KITTI [15] benchmark demonstrates the effectiveness of our approach on both single-view depth and camera pose estimation. Our code will be made available at https://github.com/tinghuiz/SfMLearner.

2. Related work

Structure from motion  The simultaneous estimation of structure and motion is a well studied problem with an established toolchain of techniques [12, 50, 38]. Whilst the traditional toolchain is effective and efficient in many cases, its reliance on accurate image correspondence can cause problems in areas of low texture, complex geometry/photometry, thin structures, and occlusions. To address these issues, several of the pipeline stages have been recently tackled using deep learning, e.g., feature matching [18], pose estimation [26], and stereo [10, 27, 53]. These learning-based techniques are attractive in that they are able to leverage external supervision during training, and potentially overcome the above issues when applied to test data.

Warping-based view synthesis  One important application of geometric scene understanding is the task of novel view synthesis, where the goal is to synthesize the appearance of the scene seen from novel camera viewpoints. A classic paradigm for view synthesis is to first either estimate the underlying 3D geometry explicitly or establish pixel correspondence among input views, and then synthesize the novel views by compositing image patches from the input views (e.g., [4, 55, 43, 6, 9]). Recently, end-to-end learning has been applied to reconstruct novel views by transforming the input based on depth or flow, e.g., DeepStereo [10], Deep3D [51] and Appearance Flows [54]. In these methods, the underlying geometry is represented by quantized depth planes (DeepStereo), probabilistic disparity maps (Deep3D) and view-dependent flow fields (Appearance Flows), respectively. Unlike methods that directly map from input views to the target view (e.g., [45]), warping-based methods are forced to learn intermediate predictions of geometry and/or correspondence. In this work, we aim to distill such geometric reasoning capability from CNNs trained to perform warping-based view synthesis.

Learning single-view 3D from registered 2D views  Our work is closely related to a line of recent research on learning single-view 3D inference from registered 2D observations. Garg et al. [14] propose to learn a single-view depth estimation CNN using projection errors to a calibrated stereo twin for supervision. Concurrently, Deep3D [51] predicts a second stereo viewpoint from an input image using stereoscopic film footage as training data. A similar approach was taken by Godard et al. [16], with the addition of a left-right consistency constraint, and a better architecture design that led to impressive performance. Like our approach, these techniques only learn from image observations of the world, unlike methods that require explicit depth for training, e.g., [20, 42, 7, 27, 30].

These techniques bear some resemblance to direct methods for structure and motion estimation [22], where the camera parameters and scene depth are adjusted to minimize a pixel-based error function. However, rather than directly minimizing the error to obtain the estimation, the CNN-based methods only take a gradient step for each batch of input instances, which allows the network to learn an implicit prior from a large corpus of related imagery. Several authors have explored building differentiable rendering operations into their models that are trained in this way, e.g., [19, 29, 34].

While most of the above techniques (including ours) are mainly focused on inferring depth maps as the scene geometry output, recent work (e.g., [13, 41, 46, 52]) has also shown success in learning 3D volumetric representations from 2D observations based on similar principles of projective geometry. Fouhey et al. [11] further show that it is even possible to learn 3D inference without 3D labels (or registered 2D views) by utilizing scene regularity.

Unsupervised/Self-supervised learning from video  Another line of related work to ours is visual representation learning from video, where the general goal is to design pretext tasks for learning generic visual features from video data that can later be re-purposed for other vision tasks such as object detection and semantic segmentation. Such pretext tasks include ego-motion estimation [2, 24], tracking [49], temporal coherence [17], temporal order verification [36], and object motion mask prediction [39]. While we focus on inferring the explicit scene geometry and ego-motion in this work, intuitively, the internal representation learned by the deep network (especially the single-view depth CNN) should capture some level of semantics that could generalize to other tasks as well.

Concurrent to our work, Vijayanarasimhan et al. [48] independently propose a framework for joint training of depth, camera motion and scene motion from videos. While both methods are conceptually similar, ours is focused on the unsupervised aspect, whereas their framework adds the capability to incorporate supervision (e.g., depth, camera motion or scene motion). There are also significant differences in how scene dynamics are modeled during training: they explicitly solve for object motion, whereas our explainability mask discounts regions undergoing motion, occlusion and other factors.

3. Approach

Here we propose a framework for jointly training a single-view depth CNN and a camera pose estimation CNN from unlabeled video sequences. Despite being jointly trained, the depth model and the pose estimation model can be used independently during test-time inference. Training examples to our model consist of short image sequences of scenes captured by a moving camera. While our training procedure is robust to some degree of scene motion, we assume that the scenes we are interested in are mostly rigid, i.e., the scene appearance change across different frames is dominated by the camera motion.
Figure 2. Overview of the supervision pipeline based on view synthesis. The depth network takes only the target view as input, and outputs a per-pixel depth map D̂t. The pose network takes both the target view (It) and the nearby/source views (e.g., It−1 and It+1) as input, and outputs the relative camera poses (T̂t→t−1, T̂t→t+1). The outputs of both networks are then used to inverse warp the source views (see Sec. 3.2) to reconstruct the target view, and the photometric reconstruction loss is used for training the CNNs. By utilizing view synthesis as supervision, we are able to train the entire framework in an unsupervised manner from videos.

Figure 3. Illustration of the differentiable image warping process. For each point pt in the target view, we first project it onto the source view based on the predicted depth and camera pose, and then use bilinear interpolation to obtain the value of the warped image Îs at location pt.

3.1. View synthesis as supervision

The key supervision signal for our depth and pose prediction CNNs comes from the task of novel view synthesis: given one input view of a scene, synthesize a new image of the scene seen from a different camera pose. We can synthesize a target view given a per-pixel depth in that image, plus the pose and visibility in a nearby view. As we will show next, this synthesis process can be implemented in a fully differentiable manner with CNNs as the geometry and pose estimation modules. Visibility can be handled, along with non-rigidity and other non-modeled factors, using an "explainability" mask, which we discuss later (Sec. 3.3).

Let us denote <I1, . . . , IN> as a training image sequence with one of the frames It being the target view and the rest being the source views Is (1 ≤ s ≤ N, s ≠ t). The view synthesis objective can be formulated as

    Lvs = Σ_s Σ_p |It(p) − Îs(p)| ,    (1)

where p indexes over pixel coordinates, and Îs is the source view Is warped to the target coordinate frame based on a depth image-based rendering module [8] (described in Sec. 3.2), taking the predicted depth D̂t, the predicted 4×4 camera transformation matrix¹ T̂t→s and the source view Is as input.

Note that the idea of view synthesis as supervision has also been recently explored for learning single-view depth estimation [14, 16] and multi-view stereo [10]. However, to the best of our knowledge, all previous work requires posed image sets during training (and testing too in the case of DeepStereo), while our framework can be applied to standard videos without pose information. Furthermore, it predicts the poses as part of the learning framework. See Figure 2 for an illustration of our learning pipeline for depth and pose estimation.

¹ In practice, the CNN estimates the Euler angles and the 3D translation vector, which are then converted to the transformation matrix.
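As a concrete illustration of Eq. 1, the following NumPy sketch computes the photometric view synthesis loss for one target view. The function and array names are ours, and it assumes the warped views Îs have already been produced by the rendering module of Sec. 3.2; it is not the original implementation.

```python
import numpy as np

def view_synthesis_loss(target, warped_sources):
    """Eq. 1: sum over source views s and pixels p of the absolute
    difference between the target view I_t and each source view
    inverse-warped into the target frame (I_hat_s).

    target:         (H, W, 3) array, the target view I_t
    warped_sources: list of (H, W, 3) arrays, the warped views I_hat_s
    """
    loss = 0.0
    for warped in warped_sources:
        loss += np.abs(target - warped).sum()  # L1 photometric error
    return loss
```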
3.2. Differentiable depth image-based rendering

As indicated in Eq. 1, a key component of our learning framework is a differentiable depth image-based renderer that reconstructs the target view It by sampling pixels from a source view Is based on the predicted depth map D̂t and the relative pose T̂t→s.

Let pt denote the homogeneous coordinates of a pixel in the target view, and K denote the camera intrinsics matrix. We can obtain pt's projected coordinates onto the source view ps by²

    ps ∼ K T̂t→s D̂t(pt) K⁻¹ pt .    (2)

Notice that the projected coordinates ps are continuous values. To obtain Is(ps) for populating the value of Îs(pt) (see Figure 3), we then use the differentiable bilinear sampling mechanism proposed in the spatial transformer networks [23] that linearly interpolates the values of the 4-pixel neighbors (top-left, top-right, bottom-left, and bottom-right) of ps to approximate Is(ps), i.e., Îs(pt) = Is(ps) = Σ_{i∈{t,b}, j∈{l,r}} w^ij Is(ps^ij), where w^ij is linearly proportional to the spatial proximity between ps and ps^ij, and Σ_{i,j} w^ij = 1. A similar strategy is used in [54] for learning to directly warp between different views, while here the coordinates for pixel warping are obtained through projective geometry that enables the factorization of depth and camera pose.

² For notation simplicity, we omit showing the necessary conversion to homogeneous coordinates along the steps of matrix multiplication.
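The projection of Eq. 2 and the bilinear sampling step can be sketched as follows in NumPy. This is an illustrative forward pass only (the function and variable names are ours); in the actual framework the same operations are implemented with differentiable ops so that gradients flow back to the depth and pose networks.

```python
import numpy as np

def inverse_warp(source_img, depth_t, K, T_t_to_s):
    """Reconstruct I_hat_s(p_t) = I_s(p_s), with p_s given by Eq. 2:
    p_s ~ K T_{t->s} D_t(p_t) K^{-1} p_t (in homogeneous coordinates).

    source_img: (H, W, 3) source view I_s
    depth_t:    (H, W) predicted depth map D_t of the target view
    K:          (3, 3) camera intrinsics
    T_t_to_s:   (4, 4) predicted relative camera pose T_{t->s}
    """
    H, W, _ = source_img.shape

    # Homogeneous pixel coordinates of the target view, shape (3, H*W).
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    p_t = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])

    # Back-project target pixels to 3D using the predicted depth.
    cam_t = (np.linalg.inv(K) @ p_t) * depth_t.ravel()     # (3, H*W)
    cam_t = np.vstack([cam_t, np.ones((1, H * W))])        # homogeneous

    # Rigid transform into the source frame, then project with K (Eq. 2).
    cam_s = (T_t_to_s @ cam_t)[:3]
    p_s = K @ cam_s
    x_s = p_s[0] / (p_s[2] + 1e-10)                        # continuous coords
    y_s = p_s[1] / (p_s[2] + 1e-10)

    # Bilinear sampling over the 4 pixel neighbors of p_s
    # (top-left, top-right, bottom-left, bottom-right).
    x0, y0 = np.floor(x_s).astype(int), np.floor(y_s).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x_s - x0, y_s - y0

    def sample(x, y):
        return source_img[np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)]

    warped = ((1 - wx) * (1 - wy))[:, None] * sample(x0, y0) \
           + (wx * (1 - wy))[:, None] * sample(x1, y0) \
           + ((1 - wx) * wy)[:, None] * sample(x0, y1) \
           + (wx * wy)[:, None] * sample(x1, y1)
    return warped.reshape(H, W, 3)
```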
Figure 4. Network architecture diagram (legend: Input, Conv, Deconv, Concat, Upsample + Concat, Prediction).
3.3. Modeling the model limitation

Note that when applied to monocular videos the above view synthesis formulation implicitly assumes 1) the scene is static without moving objects; 2) there is no occlusion/disocclusion between the target view and the source views; 3) the surface is Lambertian so that the photo-consistency error is meaningful. If any of these assumptions are violated in a training sequence, the gradients could be corrupted and potentially inhibit training. To improve the robustness of our learning pipeline to these factors, we additionally train an explainability prediction network (jointly and simultaneously with the depth and pose networks) that outputs a per-pixel soft mask Ês for each target-source pair, indicating the network's belief in where direct view synthesis will be successfully modeled for each target pixel. Based on the predicted Ês, the view synthesis objective is weighted correspondingly by

    Lvs = Σ_{<I1,...,IN>∈S} Σ_p Ês(p) |It(p) − Îs(p)| .    (3)

Since we do not have direct supervision for Ês, training with the above loss would result in a trivial solution of the network always predicting Ês to be zero, which perfectly minimizes the loss. To resolve this, we add a regularization term Lreg(Ês) that encourages nonzero predictions by minimizing the cross-entropy loss with constant label 1 at each pixel location. In other words, the network is encouraged to minimize the view synthesis objective, but allowed a certain amount of slack for discounting the factors not considered by the model.
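A minimal sketch of the explainability-weighted objective (Eq. 3) and the regularization term Lreg is given below, assuming the soft mask Ês takes values in (0, 1); the cross-entropy with a constant label of 1 then reduces to −log Ês at each pixel. The names are illustrative, not the authors' implementation.

```python
import numpy as np

def explainability_weighted_loss(target, warped, exp_mask):
    """Eq. 3: per-pixel photometric error weighted by the predicted
    soft explainability mask E_hat_s (values in (0, 1))."""
    per_pixel_err = np.abs(target - warped).sum(axis=-1)  # (H, W)
    return (exp_mask * per_pixel_err).sum()

def explainability_regularization(exp_mask, eps=1e-8):
    """Cross-entropy against a constant label of 1 at every pixel,
    which penalizes the trivial solution E_hat_s = 0 everywhere."""
    return -np.log(exp_mask + eps).sum()
```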
3.4. Overcoming the gradient locality

One remaining issue with the above learning pipeline is that the gradients are mainly derived from the pixel intensity difference between I(pt) and the four neighbors of I(ps), which would inhibit training if the correct ps (projected using the ground-truth depth and pose) is located in a low-texture region or far from the current estimation. This is a well known issue in motion estimation [3]. Empirically, we found two strategies to be effective for overcoming this issue: 1) using a convolutional encoder-decoder architecture with a small bottleneck for the depth network that implicitly constrains the output to be globally smooth and facilitates gradients to propagate from meaningful regions to nearby regions; 2) explicit multi-scale and smoothness loss (e.g., as in [14, 16]) that allows gradients to be derived from larger spatial regions directly. We adopt the second strategy in this work as it is less sensitive to architectural choices. For smoothness, we minimize the L1 norm of the second-order gradients for the predicted depth maps (similar to [48]).

Our final objective becomes

    Lfinal = Σ_l L^l_vs + λs L^l_smooth + λe Σ_s Lreg(Ê^l_s) ,    (4)

where l indexes over different image scales, s indexes over source images, and λs and λe are the weighting for the depth smoothness loss and the explainability regularization, respectively.
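The smoothness term and the combined objective of Eq. 4 can be sketched as below. The pure second differences along x and y are one straightforward reading of "L1 norm of the second-order gradients" (a cross term could also be included), and the per-scale losses are assumed to be precomputed; both choices are ours for illustration.

```python
import numpy as np

def depth_smoothness_loss(depth):
    """L1 norm of second-order spatial differences of a predicted
    depth map (the smoothness term used in Eq. 4)."""
    dxx = depth[:, 2:] - 2 * depth[:, 1:-1] + depth[:, :-2]   # along x
    dyy = depth[2:, :] - 2 * depth[1:-1, :] + depth[:-2, :]   # along y
    return np.abs(dxx).sum() + np.abs(dyy).sum()

def final_objective(vs_losses, smooth_losses, reg_losses, lam_s, lam_e):
    """Eq. 4: sum over scales l of the view synthesis loss, the weighted
    smoothness loss, and the weighted explainability regularization
    (the latter already summed over source views at each scale)."""
    return sum(vs + lam_s * sm + lam_e * reg
               for vs, sm, reg in zip(vs_losses, smooth_losses, reg_losses))
```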
3.5. Network architecture

Single-view depth  For single-view depth prediction, we adopt the DispNet architecture proposed in [35] that is mainly based on an encoder-decoder design with skip connections and multi-scale side predictions (see Figure 4). All conv layers are followed by ReLU activation except for the prediction layers, where we use 1/(α · sigmoid(x) + β) with α = 10 and β = 0.1 to constrain the predicted depth to be always positive within a reasonable range. We also experimented with using multiple views as input to the depth network, but did not find this to improve the results. This is in line with the observations in [47], where optical flow constraints need to be enforced to utilize multiple views effectively.
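The depth prediction activation described above can be written out directly; with α = 10 and β = 0.1 the output lies roughly in (0.1, 10) in the network's scale-ambiguous units, keeping the predicted depth positive and bounded. A minimal sketch:

```python
import numpy as np

def depth_from_logits(x, alpha=10.0, beta=0.1):
    """Depth activation 1 / (alpha * sigmoid(x) + beta). With the
    defaults, the denominator lies in (0.1, 10.1), so the output is
    always positive and bounded (roughly 0.1 to 10)."""
    sig = 1.0 / (1.0 + np.exp(-x))
    return 1.0 / (alpha * sig + beta)
```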
Pose  The input to the pose estimation network is the target view concatenated with all the source views (along the color channels), and the outputs are the relative poses between the target view and each of the source views. The network consists of 7 stride-2 convolutions followed by a 1 × 1 convolution with 6 · (N − 1) output channels (corresponding to 3 Euler angles and a 3-D translation for each source view). Finally, global average pooling is applied to aggregate predictions at all spatial locations. All conv layers are followed by ReLU except for the last layer, where no nonlinear activation is applied.
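As noted in footnote 1, each 6-D pose output (3 Euler angles plus a 3-D translation) is converted into a 4×4 transformation matrix. A sketch of such a conversion is shown below; the element order of pose_vec and the R = Rz · Ry · Rx Euler convention are our assumptions for illustration, not a statement of the exact convention used in the original implementation.

```python
import numpy as np

def pose_vec_to_mat(pose_vec):
    """Convert (rx, ry, rz, tx, ty, tz) into a 4x4 rigid transformation.
    rx, ry, rz are Euler angles (radians); tx, ty, tz is the translation."""
    rx, ry, rz, tx, ty, tz = pose_vec
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # assumed Euler composition order
    T[:3, 3] = [tx, ty, tz]
    return T
```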
Figure 6. Comparison of single-view depth estimation between Eigen et al. [7] (with ground-truth depth supervision), Garg et al. [14] (with ground-truth pose supervision), and ours (unsupervised). The ground-truth depth map is interpolated from sparse measurements for visualization purposes. The last two rows show typical failure cases of our model, which sometimes struggles in vast open scenes and with objects close to the front of the camera.
Figure 6 provides examples of visual comparison between our results and some supervised baselines over a variety of examples. One can see that although trained in an unsupervised manner, our results are comparable to those of the supervised baselines, and sometimes preserve the depth boundaries and thin structures such as trees and street lights better.

We show sample predictions made by our initial Cityscapes model and the final model (pre-trained on Cityscapes and then fine-tuned on KITTI) in Figure 7. Due to the domain gap between the two datasets, our Cityscapes model sometimes has difficulty in recovering the complete shape of the cars/bushes, and mistakes them for distant objects.

We also performed an ablation study of the explainability modeling (see Table 1), which turns out to offer only a modest performance boost. This is likely because 1) most of the KITTI scenes are static without significant scene motions, and 2) the occlusion/visibility effects only occur in small regions in sequences across a short time span (3 frames), which makes the explainability modeling less essential to the success of training. Nonetheless, our explainability prediction network does seem to capture factors like scene motion and visibility well (see Sec. 4.3), and could potentially be more important for other more challenging datasets.

Make3D  To evaluate the generalization ability of our single-view depth model, we directly apply our model trained on Cityscapes + KITTI to the Make3D dataset unseen during training. While there still remains a significant performance gap between our method and others supervised using Make3D ground-truth depth (see Table 2), our predictions are able to capture the
Method                              Dataset  Supervision  Abs Rel  Sq Rel  RMSE   RMSE log  δ<1.25  δ<1.25²  δ<1.25³
Train set mean                      K        Depth        0.403    5.530   8.709  0.403     0.593   0.776    0.878
Eigen et al. [7] Coarse             K        Depth        0.214    1.605   6.563  0.292     0.673   0.884    0.957
Eigen et al. [7] Fine               K        Depth        0.203    1.548   6.307  0.282     0.702   0.890    0.958
Liu et al. [32]                     K        Depth        0.202    1.614   6.523  0.275     0.678   0.895    0.965
Godard et al. [16]                  K        Pose         0.148    1.344   5.927  0.247     0.803   0.922    0.964
Godard et al. [16]                  CS + K   Pose         0.124    1.076   5.311  0.219     0.847   0.942    0.973
Ours (w/o explainability)           K        –            0.221    2.226   7.527  0.294     0.676   0.885    0.954
Ours                                K        –            0.208    1.768   6.856  0.283     0.678   0.885    0.957
Ours                                CS       –            0.267    2.686   7.580  0.334     0.577   0.840    0.937
Ours                                CS + K   –            0.198    1.836   6.565  0.275     0.718   0.901    0.960
Garg et al. [14] cap 50m            K        Pose         0.169    1.080   5.104  0.273     0.740   0.904    0.962
Ours (w/o explainability) cap 50m   K        –            0.208    1.551   5.452  0.273     0.695   0.900    0.964
Ours cap 50m                        K        –            0.201    1.391   5.181  0.264     0.696   0.900    0.966
Ours cap 50m                        CS       –            0.260    2.232   6.148  0.321     0.590   0.852    0.945
Ours cap 50m                        CS + K   –            0.190    1.436   4.975  0.258     0.735   0.915    0.968

Table 1. Single-view depth results on the KITTI dataset [15] using the split of Eigen et al. [7] (baseline numbers taken from [16]). Abs Rel, Sq Rel, RMSE, and RMSE log are error metrics; δ < 1.25, δ < 1.25², and δ < 1.25³ are accuracy metrics. For training, K = KITTI and CS = Cityscapes [5]. All methods we compare with use some form of supervision (either ground-truth depth or calibrated camera pose) during training. Note: results from Garg et al. [14] are capped at 50m depth, so we break these out separately in the lower part of the table.
Figure 9. Absolute Trajectory Error (ATE) at different left/right turning magnitude (coordinate difference in the side direction between the start and ending frame of a testing sequence); the plot shows Absolute Translation Error (m) versus left/right turning magnitude (m) for Mean Odom., ORB-SLAM (full), ORB-SLAM (short), and Ours. Our method performs significantly better than ORB-SLAM (short) when side rotation is small, and is comparable with ORB-SLAM (full) across the entire spectrum.

Figure 10. Sample visualizations of the explainability masks. Highlighted pixels are predicted to be unexplainable by the network due to motion (rows 1–3), occlusion/visibility (rows 4–5), or other factors (rows 7–8).