NeRF-Supervision
https://yenchenlin.me/nerf-supervision/
arXiv:2203.01913v2 [cs.RO] 27 Apr 2022
Fig. 1: Overview. We present a new, RGB-sensor-only, self-supervised pipeline for learning object-centric dense descriptors,
based on neural radiance fields (NeRFs) [1]. The pipeline consists of three stages: (a) We collect RGB images of the object
of interest and optimize a NeRF for that object; (b) The recovered NeRF’s density field is then used to automatically generate
a dataset of dense correspondences; (c) We use the generated dataset to train a model to estimate dense object descriptors,
and evaluate that model on previously-unobserved real images. Click the image to play the overview video in a browser.
supervision into NeRF [22] improves geometry accuracy for our purposes. Though Deng et al. [22] focus on the few-image setting (i.e., ∼5 images), in our investigations we found that even in the many-view (i.e., ∼60 images) setting, adding depth supervision is beneficial. Specifically, we find NeRF's density prediction often deteriorates in real-world 360° inward-facing scenes due to the transient shadows cast by the photographer or robot on the scene. Because these shadows appear in some images but not others, NeRF tends to explain them away by introducing artifacts in the optimized density field. Incorporating the depth supervision appears to effectively mitigate this issue.
Though NeRF's primary goal is to perform view synthesis by rendering RGB images, the volumetric rendering equation in (2) can be modified slightly to produce the expected termination depth of each ray (as was done in [1, 22]) by simply replacing the predicted color c_k with the distance t_k:

D̂(r) = Σ_{k=1}^{K} T_k (1 − exp(−σ_k (t_{k+1} − t_k))) t_k.   (3)

Because T_k represents the probability of the ray transmitting through interval k, the resulting depth D̂(r) is the expected distance that ray r will travel when cast into the scene.
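As a concrete illustration of (3), the following is a minimal NumPy sketch (not the authors' implementation) that converts one ray's sampled densities into its expected termination depth; the names sigmas and ts are assumptions for this sketch, holding the K densities σ_k and the K+1 sample distances t_1..t_{K+1}.

```python
import numpy as np

def expected_depth(sigmas, ts):
    """Return D_hat(r) = sum_k T_k * (1 - exp(-sigma_k * (t_{k+1} - t_k))) * t_k."""
    deltas = ts[1:] - ts[:-1]                    # interval lengths t_{k+1} - t_k
    alphas = 1.0 - np.exp(-sigmas * deltas)      # per-interval termination probability
    # T_k: probability the ray reaches interval k without terminating earlier.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                     # compositing weights
    return float(np.sum(weights * ts[:-1]))      # expected distance travelled
```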
We can obtain a ground-truth depth D(r) by first transforming the 3D keypoint k(r) that is associated with the ray r to the camera frame with camera pose G ∈ SE(3) and then extracting its coordinate along the camera's z-axis: D(r) = ⟨G^{−1} k(r), [0, 0, 1]⟩. The depth-supervision loss

L_depth = Σ_{r∈R} ‖D̂(r) − D(r)‖₂²

is defined as the squared distance between the predicted depth D̂(r) and the "ground-truth" depth D(r) (which in our case is the partial depth map generated by COLMAP's structure from motion). Note that this supervision is only sparse, not dense: the loss is not imposed for pixels where the depth supervisor does not return a valid depth. The final combined loss for training DS-NeRFs is L = L_photo + L_depth.
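Below is a hedged NumPy sketch of this sparse supervision term as defined above: ground-truth depths come from COLMAP keypoints transformed into the camera frame, and rays without a usable keypoint contribute nothing. The function and variable names (depth_supervision_loss, keypoints_world, cam_to_world) are illustrative and not from the released code; pred_depths are the rendered depths for the rays through those keypoints' pixels.

```python
import numpy as np

def depth_supervision_loss(pred_depths, keypoints_world, cam_to_world):
    """Sparse L_depth: squared error between rendered depth and the z-coordinate
    of each COLMAP keypoint expressed in the camera frame."""
    world_to_cam = np.linalg.inv(cam_to_world)                  # G^-1
    pts_h = np.hstack([keypoints_world, np.ones((len(keypoints_world), 1))])
    gt_depths = (world_to_cam @ pts_h.T).T[:, 2]                # <G^-1 k(r), [0, 0, 1]>
    valid = np.isfinite(gt_depths) & (gt_depths > 0)            # illustrative validity filter
    return float(np.sum((pred_depths[valid] - gt_depths[valid]) ** 2))
```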
C. Depth-Map Dense Correspondences from NeRF

The first approach we investigate in order to generate correspondence training data from NeRF is to render pairs of RGB-D images, and effectively treat NeRF as a traditional depth sensor by extracting a depth map D ∈ R^{w×h} with a single-valued depth at each discrete pixel. In this case, the single-valued depth estimate for each dense pixel is computed using (3). Each training image pair consists of one rendered RGB-D image (Î_s, D̂_s) with camera pose G_s and another rendered RGB-D image (Î_t, D̂_t) with camera pose G_t. Below, we slightly abuse the notation and use D̂_s(u_s) to represent the predicted depth at pixel u_s.

Given these depth maps rendered by NeRF, and assuming known camera intrinsics K, we can then generate the target pixel u_t in Î_t given a query pixel u_s in Î_s:

u_t = π(K G_t^{−1} G_s K^{−1} D̂_s(u_s) u_s)   (4)

where π(·) represents the projection operation. We will refer to this data generation method as depth-map, as it uses the mean of NeRF's distribution of depths at each pixel to render a depth map.
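For illustration, here is a small NumPy sketch of the warp in (4), assuming G_s and G_t are 4×4 camera-to-world poses and K is the 3×3 intrinsics matrix; it is a sketch of the operation, not the paper's code.

```python
import numpy as np

def reproject(u_s, depth_s, K, G_s, G_t):
    """Map query pixel u_s = (x, y) in the source view to u_t in the target view,
    using the NeRF-rendered depth D_hat_s(u_s), per Eq. (4)."""
    # Back-project to a 3D point in the source camera frame.
    p_cam_s = depth_s * (np.linalg.inv(K) @ np.array([u_s[0], u_s[1], 1.0]))
    # Source camera frame -> world -> target camera frame.
    p_world = (G_s @ np.append(p_cam_s, 1.0))[:3]
    p_cam_t = (np.linalg.inv(G_t) @ np.append(p_world, 1.0))[:3]
    # pi(.): perspective projection with the intrinsics.
    uvw = K @ p_cam_t
    return uvw[:2] / uvw[2]
```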
D. Generating Probabilistic Dense Correspondences from NeRF's Density Field

While using NeRF's depth map to generate dense correspondences may work well when the distribution of density along the ray has only a single mode, it may produce an incorrect depth when the density distribution is multi-modal along the ray. In Fig. 4, we show two examples of this case, where NeRF's depth map generates incorrect correspondences. To resolve this issue, we propose to treat correspondence generation not via a single depth for each pixel, but via a distribution of depths, which as shown in Fig. 4 can have modes which correctly recover correspondences where the depth map failed.

Specifically, we can sample depth values based on the alpha compositing weights w:

w(D̂(u_s) = t_k) = T_k (1 − exp(−σ_k (t_{k+1} − t_k)))   (5)

Rather than reducing the depth distribution into its mean by rendering out depth maps and sampling the correspondences deterministically, this formulation retains a complete distribution over depths and samples correspondences probabilistically. In practice, we first sample K points along each ray and get {w(D̂(u_s) = t_k), t_k}_{k=1}^{K} from NeRF. Then, we normalize {w(D̂(u_s) = t_k)}_{k=1}^{K} to sum to 1 and treat it as a probability distribution for sampling t.

We hypothesize the probabilistic formulation can produce more precise downstream neural correspondence networks, since as depicted in Fig. 4, the modes of the density, rather than the mean, can be closer to the ground truth. Furthermore, when combined with a self-consistency check (Sec. III-E) during descriptor learning, the probability of sampling false positives is reduced. This hypothesis is tested in our Results section.
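To summarize this section in code, the NumPy sketch below (again an illustration, not the released implementation) draws a depth from the normalized weights of (5) for a query pixel and warps it with the reproject helper sketched in Sec. III-C; rng is assumed to be a numpy.random.Generator.

```python
import numpy as np

def sample_correspondence(u_s, sigmas, ts, K, G_s, G_t, rng):
    """Draw one target pixel u_t for query pixel u_s by sampling a depth from the
    normalized weights of Eq. (5) instead of using the mean depth."""
    deltas = ts[1:] - ts[:-1]
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                      # w(D_hat(u_s) = t_k), Eq. (5)
    probs = weights / weights.sum()               # normalize to sum to 1
    depth = rng.choice(ts[:-1], p=probs)          # t ~ p(t)
    return reproject(u_s, depth, K, G_s, G_t)     # warp with Eq. (4) (sketch above)
```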
E. Additional Correspondence Learning Details

Self-consistency. After obtaining u_t from u_s, we perform a self-consistency check by starting from u_t and identifying its probabilistic correspondence û_s in I_s. We only adopt the pair of pixels (u_s, u_t) if the distance between u_s and û_s is smaller than a certain threshold. This is our probabilistic analogue to the deterministic visibility check in [13, 32].
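A minimal sketch of this check, written around the probabilistic sampler above; the callable interface and the threshold name tau_px are assumptions, since the section does not specify them.

```python
import numpy as np

def consistent_pair(u_s, sample_s_to_t, sample_t_to_s, tau_px):
    """Keep (u_s, u_t) only if sampling back from u_t lands within tau_px of u_s.
    The two callables map a pixel in one view to a sampled pixel in the other,
    e.g. the probabilistic sampler sketched in Sec. III-D."""
    u_t = sample_s_to_t(u_s)
    u_s_hat = sample_t_to_s(u_t)
    dist = np.linalg.norm(np.asarray(u_s, dtype=float) - np.asarray(u_s_hat, dtype=float))
    if dist < tau_px:
        return u_s, u_t
    return None   # reject as a likely false positive
```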
Sampling from mask. We acquire object masks for the training images through a finetuned Mask R-CNN [33]. Similar to Dense Object Nets [13], masks are used to sample pixels of the object during descriptor learning.
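As a simple illustration of this step, a sketch of mask-restricted pixel sampling is below; the boolean mask array and the helper name are assumptions for the sketch.

```python
import numpy as np

def sample_object_pixels(mask, num_samples, rng):
    """Draw query pixels only from the object mask (boolean H x W array)."""
    rows, cols = np.nonzero(mask)
    idx = rng.integers(0, len(rows), size=num_samples)   # sample with replacement
    return np.stack([rows[idx], cols[idx]], axis=1)      # (num_samples, 2) as (row, col)
```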
IV. RESULTS

We execute a series of experiments using real-world images for training and evaluation. We evaluate dense descriptors learned with correspondences generated with different approaches. The goals of the experiments are four-fold: (i) to investigate whether the 3D geometry predicted by NeRF is sufficient for training precise descriptors, particularly on challenging thin and reflective objects, (ii) to compare our proposed method to existing off-the-shelf descriptors, (iii) to investigate whether the distribution-of-depth formulation is effective, and (iv) to test the generalization ability of visual descriptors produced by our pipeline.
A. Settings

Datasets. We evaluate our approach and baseline methods using 8 objects (3 distinct classes). For each object, we captured 60 input images using an iPhone 12 with locked auto exposure and auto focus. The images are resized to 504 × 378. We use COLMAP [14] to estimate both camera poses and a sparse point cloud for each object. To construct the test set, 8 images are randomly selected and held out during training. We manually annotate (for evaluation only) 100 correspondences using these test images for each object.
Metrics. We employ the Average End Point Error (AEPE) and the Percentage of Correct Keypoints (PCK) as the evaluation metrics. AEPE is computed as the average Euclidean distance, in pixel space, between estimated and ground-truth correspondences. PCK@δ is defined as the percentage of estimated correspondences with a pixel-wise Euclidean distance < δ w.r.t. the ground truths.
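Both metrics translate directly into a few lines of NumPy; the sketch below assumes pred and gt are (N, 2) arrays of pixel coordinates.

```python
import numpy as np

def aepe(pred, gt):
    """Average End Point Error: mean pixel-space Euclidean distance."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def pck(pred, gt, delta):
    """PCK@delta: fraction of estimates within delta pixels of the ground truth."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1) < delta))
```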
B. Methods

First, we consider several off-the-shelf learned descriptors that attain state-of-the-art results on commonly used dense correspondence benchmarks (e.g., ETH3D [34]).
• GLU-Net [6] is a model architecture that integrates both global and local correlation in a feature pyramid-based network for estimating dense correspondences.
• GOCor [12] improves GLU-Net [6]'s feature correlation layers to disambiguate similar regions in the scene.
• PDC-Net [7] adopts the architecture of GOCor [12] and further parametrizes the predictive distribution as a constrained mixture model for estimating dense correspondences and their uncertainties.

Next, we train Dense Object Nets (DONs) [13] for learning dense visual descriptors; a sketch of this style of supervision appears after the list below. In practice, we set the dimensionality of visual descriptors to d = 3. We consider using COLMAP or NeRF to generate training correspondences to supervise DONs.
• COLMAP [16] is a widely-used classical Multi-view Stereo (MVS) method. We use the estimated depth maps to generate correspondences.
• NeRF [1] is a volume-rendering approach, which we either use via depth maps (Sec. III-C) or probabilistically through the density field (Sec. III-D) to generate correspondences.
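As a hedged illustration of how the generated correspondences could supervise a descriptor network in the spirit of Dense Object Nets [13], the PyTorch-style sketch below pulls matched pixels together in descriptor space and pushes sampled non-matches beyond a margin. The margin value and all names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               nonmatches_a, nonmatches_b, margin=0.5):
    """desc_*: (D, H, W) descriptor maps; the pixel index tensors are (N, 2) long
    tensors of (row, col) locations produced by the correspondence generator."""
    def gather(desc, px):
        return desc[:, px[:, 0], px[:, 1]].t()            # (N, D) descriptors
    d_match = F.pairwise_distance(gather(desc_a, matches_a),
                                  gather(desc_b, matches_b))
    d_non = F.pairwise_distance(gather(desc_a, nonmatches_a),
                                gather(desc_b, nonmatches_b))
    match_loss = (d_match ** 2).mean()                                   # pull matches together
    nonmatch_loss = (torch.clamp(margin - d_non, min=0.0) ** 2).mean()   # push non-matches apart
    return match_loss + nonmatch_loss
```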
C. Comparisons

We evaluate dense descriptors and show quantitative results in Table I, Table II, and Table III. We find the off-the-shelf dense descriptors do not work well on object-centric scenes, potentially because they are trained on images with synthetic warps and have not seen the target objects from a wide range of viewing angles. In contrast, Dense Object Nets trained with the target objects perform much better. This suggests the need for a data collection pipeline to generate object-centric training data for robot manipulation. Among the three correspondence generation approaches, COLMAP has the highest error compared to the other methods. Using the density field of NeRF to sample correspondences attains the best performance: it outperforms Dense Object Nets trained with COLMAP by 29% and off-the-shelf descriptors by 106% on the PCK@3px metric.

D. Generalization

We evaluate the trained Dense Object Nets on novel scenes and objects not present in the training data. Fig. 5 shows examples of Whisks and Strainers and their visual descriptors. We follow the same visualization method as in [13].
Noisy background and lighting. In Fig. 5a, we show results of our learned descriptors when the objects are placed on a different background or in different lighting conditions. The results demonstrate that our learned descriptors can be deployed in environments different from the training scenes.
Multiple objects. We show the learned descriptors when the input image contains multiple objects in Fig. 5b. The results demonstrate that the descriptors are consistent for objects of different sizes.
Category-level generalization. We further test our model on unseen objects of the same category. Fig. 5c shows unseen objects not in the training set. The learned visual descriptors can robustly generalize to these unseen objects and estimate view-invariant descriptors.
[Fig. 5 panels: (a) Different background & shadows; (b) Multiple objects; (c) Unseen objects]
Fig. 5: Qualitative results of generalization to novel scenes and objects. (a) We show that the learned object descriptors can be consistent across significant 1) viewpoint, 2) background, and 3) lighting variations. (b) We visualize the learned descriptors for multiple objects, even though the models have never seen multiple objects during training. (c) We test our model on objects that are not seen during training. The visual descriptors are shown to be consistent with previously-seen objects in the category.
TABLE I: Average End Point Error (AEPE) in pixels, ↓ lower is better.
Strainer-S Strainer-M Strainer-L Whisk-S Whisk-M Whisk-L Fork-S Fork-L Mean
Off-the-shelf: GLU-Net [6] 33.25 28.09 28.92 16.06 15.36 39.04 17.12 18.28 24.52
Off-the-shelf: GOCor [12] 34.23 26.89 20.92 10.80 7.04 31.95 10.20 13.86 19.49
Off-the-shelf: PDC-Net [7] 32.48 13.70 23.77 7.82 5.81 19.94 8.30 8.76 15.07
DON [13] via Depth map, COLMAP MVS 8.91 5.52 7.65 4.50 4.10 8.90 5.31 5.87 6.35
DON [13] via Depth map, NeRF (ours) 5.64 4.31 5.24 3.82 3.52 6.84 3.73 4.19 4.66
DON [13] via Density field, NeRF (ours) 4.53 4.08 3.93 3.28 3.19 4.96 3.42 3.66 3.88
TABLE II: Percentage of Correct Keypoints (PCK@3px), ↑ higher is better.
Strainer-S Strainer-M Strainer-L Whisk-S Whisk-M Whisk-L Fork-S Fork-L Mean
Off-the-shelf: GLU-Net [6] 0.04 0.04 0.07 0.24 0.26 0.00 0.16 0.14 0.12
Off-the-shelf: GOCor [12] 0.10 0.05 0.07 0.26 0.33 0.03 0.18 0.16 0.15
Off-the-shelf: PDC-Net [7] 0.14 0.19 0.11 0.48 0.51 0.19 0.42 0.38 0.30
DON [13] via Depth map, COLMAP MVS 0.32 0.41 0.38 0.57 0.64 0.44 0.55 0.51 0.48
DON [13] via Depth map, NeRF (ours) 0.52 0.56 0.51 0.62 0.66 0.50 0.67 0.63 0.58
DON [13] via Density field, NeRF (ours) 0.58 0.59 0.61 0.64 0.66 0.58 0.69 0.64 0.62
TABLE III: Percentage of Correct Keypoints (PCK@5px), ↑ higher is better.
Strainer-S Strainer-M Strainer-L Whisk-S Whisk-M Whisk-L Fork-S Fork-L Mean
Off-the-shelf: GLU-Net [6] 0.09 0.09 0.10 0.37 0.44 0.06 0.26 0.21 0.20
Off-the-shelf: GOCor [12] 0.13 0.10 0.11 0.47 0.63 0.09 0.29 0.28 0.26
Off-the-shelf: PDC-Net [7] 0.29 0.25 0.16 0.53 0.68 0.26 0.57 0.51 0.41
DON [13] via Depth map, COLMAP MVS 0.62 0.72 0.64 0.79 0.80 0.48 0.60 0.55 0.65
DON [13] via Depth map, NeRF (ours) 0.82 0.84 0.75 0.82 0.81 0.56 0.79 0.76 0.77
DON [13] via Density field, NeRF (ours) 0.84 0.87 0.79 0.82 0.82 0.64 0.82 0.78 0.80
E. Example Application: 6-DoF Robotic Pick and Place

We demonstrate accurate 6-DoF pick and place of thin and reflective objects. After learning the dense descriptors, we specify a set of semantic keypoints which encode an SE(3) grasp pose for each category. Before any grasp, we track the keypoints' 2D locations using the descriptors and move the robot to capture two RGB images of the scene using the camera mounted on the robot arm. Then, we use triangulation to derive the keypoints' 3D locations and execute the encoded SE(3) grasp pose. For more details, please see Sec. A.
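To make the triangulation step concrete, here is a standard DLT (direct linear transform) sketch for recovering one keypoint's 3D location from the two wrist-camera views; the projection-matrix inputs and function name are assumptions, since the paper defers details to Sec. A.

```python
import numpy as np

def triangulate(u1, u2, P1, P2):
    """Linear (DLT) triangulation of one keypoint from two views; P1, P2 are 3x4
    projection matrices K [R | t], and u1, u2 are the tracked pixel locations."""
    A = np.stack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)        # least-squares null-space solution
    X = vt[-1]
    return X[:3] / X[3]                # homogeneous -> Euclidean 3D point
```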
V. CONCLUSION
We introduce NeRF-Supervision as a pipeline to generate
data for learning object-centric dense descriptors. Compared
to previous approaches based on RGB-D cameras or MVS,
our method enables learning dense descriptors of thin, reflec-
tive objects. We believe these results chart forward a general
paradigm in which NeRF may be leveraged as an untapped
representational format for supervising robot vision systems.