NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields


Lin Yen-Chen1 Pete Florence2 Jonathan T. Barron2
Tsung-Yi Lin3∗ Alberto Rodriguez1 Phillip Isola1
1 MIT 2 Google 3 Nvidia
∗ Work done while at Google.

https://yenchenlin.me/nerf-supervision/

Fig. 1: Overview. We present a new, RGB-sensor-only, self-supervised pipeline for learning object-centric dense descriptors,
based on neural radiance fields (NeRFs) [1]. The pipeline consists of three stages: (a) We collect RGB images of the object
of interest and optimize a NeRF for that object; (b) The recovered NeRF’s density field is then used to automatically generate
a dataset of dense correspondences; (c) We use the generated dataset to train a model to estimate dense object descriptors,
and evaluate that model on previously-unobserved real images. Click the image to play the overview video in a browser.

Abstract— Thin, reflective objects such as forks and whisks are common in our daily lives, but they are particularly challenging for robot perception because it is hard to reconstruct them using commodity RGB-D cameras or multi-view stereo techniques. While traditional pipelines struggle with objects like these, Neural Radiance Fields (NeRFs) have recently been shown to be remarkably effective for performing view synthesis on objects with thin structures or reflective materials. In this paper we explore the use of NeRF as a new source of supervision for robust robot vision systems. In particular, we demonstrate that a NeRF representation of a scene can be used to train dense object descriptors. We use an optimized NeRF to extract dense correspondences between multiple views of an object, and then use these correspondences as training data for learning a view-invariant representation of the object. NeRF's usage of a density field allows us to reformulate the correspondence problem with a novel distribution-of-depths formulation, as opposed to the conventional approach of using a depth map. Dense correspondence models supervised with our method significantly outperform off-the-shelf learned descriptors by 106% (PCK@3px metric, more than doubling performance) and outperform our baseline supervised with multi-view stereo by 29%. Furthermore, we demonstrate that the learned dense descriptors enable robots to perform accurate 6-degree of freedom (6-DoF) pick and place of thin and reflective objects.

I. INTRODUCTION

Designing robust visual descriptors that are invariant to scale, illumination, and pose is a long-standing problem in computer vision [2, 3, 4]. Recently, learning-based visual descriptors, supervised by dense correspondences between images, have demonstrated superior performance compared to hand-crafted descriptors [5, 6, 7, 8]. However, producing the ground-truth dense correspondence data required for training these models is challenging, as the geometry of the scene and the poses of the cameras must somehow be estimated from an image (or known a priori). As a result, learning-based methods typically rely on either synthetically rendering an object from multiple views [9, 10], or on augmenting a non-synthetic image with random affine transformations from which "ground truth" correspondences can be obtained [6, 11, 12]. While effective, these approaches have their limitations: the gap between real data and synthetic data may hinder performance, and data augmentation approaches may fail to identify correspondences involving out-of-plane rotation (which occur often in robot manipulation).

To learn a dense correspondence model, Florence et al. propose a self-supervised data collection approach based on robot motion in conjunction with a depth camera [13]. Their method generates dense correspondences given a set of posed RGB-D images and then uses them to supervise visual descriptors. However, this method works poorly for objects that contain thin structures or highly specular materials, as commodity depth cameras fail in these circumstances. An object exhibiting thin structures or shiny reflectance, well-exemplified by objects such as forks and whisks, will result in a hole-riddled depth map (shown in Fig. 2b), which prevents the reprojection operation from generating high quality correspondences.
Multi-view stereo (MVS) methods present an alternative approach for solving this problem, as they do not rely on direct depth sensors and instead estimate depth using only RGB images. However, conventional stereo techniques typically rely on patch-based photometric consistency, which implicitly assumes that the world is made of large and Lambertian objects. The performance of MVS is therefore limited in the presence of thin or shiny objects: thin structures mean image patches may not reoccur across input images (as any patch will likely contain some part of the background, which may vary), and specularities mean that photometric consistency may be violated (as the object may look different when viewed from different angles). Figure 3 shows a failure case when applying COLMAP [14], a widely-used MVS method, on a strainer. Because COLMAP produces an incorrect depth map, the estimated correspondences are also incorrect.

Fig. 2: Motivation. (a) Here we show the objects used in the work. Annotating dense correspondences for these objects is challenging because existing pipelines [13, 15] rely on depth cameras, and therefore cannot capture thin or reflective objects. (b) This can be observed by visualizing the depth image from a commodity RGB-D camera (a RealSense D415), where the pixels colored black indicate where the depth sensor failed to produce a depth estimate.

Fig. 3: Baselines. Multi-view stereo represents a potential alternative to depth cameras. However, the depth maps estimated by COLMAP (a widely-used MVS method) [16] exhibit significant artifacts on thin or reflective objects, which leads to incorrect correspondences between pixels (shown in red).

To address the limitations of depth sensors and conventional stereo techniques, we introduce NeRF-Supervision for learning object-centric dense correspondences: an RGB-only, self-supervised pipeline based on neural radiance fields (NeRF) [1]. Unlike approaches based on RGB-D sensors or MVS, it can handle reflective objects because the view direction is taken as input for color prediction. Another advantage of using NeRF-Supervision over depth sensors or MVS is that the density field predicted by NeRF provides a mechanism for handling ambiguity in photometric consistency: given a trained NeRF, the predicted density field can be used to sample a dataset of dense correspondences probabilistically. See Fig. 1 for an overview of our method. In our experiments, we consider 8 challenging objects (shown in Fig. 2a) and demonstrate that our pipeline can produce robust dense visual descriptors for all of them. Our approach significantly outperforms all off-the-shelf descriptors as well as our baseline method supervised with multi-view stereo. Furthermore, we demonstrate that the learned dense descriptors enable robots to perform accurate 6-degree of freedom (6-DoF) pick and place of thin and reflective objects.

Our contributions are as follows: (i) a new, RGB-sensor-only, self-supervised pipeline for learning object-centric dense descriptors, based on neural radiance fields; (ii) a novel distribution-of-depths formulation, enabled by the estimated density field, which treats correspondence generation not via a single depth for each pixel, but rather via a distribution of depths; (iii) experiments showing that our pipeline can (a) enable training accurate object-centric correspondences without depth sensors, and (b) succeed on thin, reflective objects on which depth sensors typically fail; and (iv) experiments showing that the distribution-of-depths formulation can improve the downstream precision of correspondence models trained on this data, when compared to the single-depth alternatives.

II. RELATED WORK

Neural radiance fields. NeRF is a powerful technique for novel view synthesis: taking as input a set of images of an object, and producing novel views of that object [1]. A central component of NeRF is the use of coordinate-based MLPs (neural networks that take as input a 3D coordinate in space) to estimate volumetric density and color in 3D. This MLP is embedded within a volumetric rendering engine, and gradient descent is used to optimize the weights of the MLP so that its renderings reproduce the input images, thereby resulting in an MLP that maps any input coordinate to a density (and color). Though NeRF has primarily been used for vision or graphics tasks such as appearance interpolation [17] and portrait photography [18], it has also been adopted for robotic applications such as pose estimation [19] and SLAM [20]. In this work, we propose using NeRF as a data generator for learning visual descriptors.

Note that NeRF represents all scene content as a volumetric quantity: everything is assumed to be semi-transparent to some degree, and "hard" surfaces are simulated using a very dense (but not infinitely dense) field [21]. Though the use of volumetric rendering provides significant benefits (most notably, smooth gradient-based optimization), it does present some difficulties when attempting to use NeRF in a robotics context, as NeRF does not directly estimate the boundaries of objects, nor does it directly produce depth maps. However, the density field estimated by NeRF can be used to synthesize depth maps by computing the expected termination depth of a ray: a ray is cast from the camera, and the density field is used to determine how "deep" into the volumetric object that ray is expected to penetrate, and that distance is then used as a depth map [1].
Some recent work has explored improving these depth maps, such as Deng et al. [22], who use the depths estimated by COLMAP to directly supervise these depth maps.

Dense descriptors. Dense visual descriptors play an important role in 3D scene reconstruction, localization, object pose estimation, and robot manipulation [13, 23, 24, 25, 26, 27]. Modern approaches rely on machine learning to learn a visual descriptor: first, image pairs with annotated correspondences are obtained, either by a generative approach or through manual labeling. Then these correspondences are used as training data to learn pixel-level descriptors such that the feature embeddings of corresponding pixels are similar. A common approach for generating data is to use synthetic warping with large image collections, as is done by GLU-Net [6]. Despite the benefit of being trained with many examples, these methods often fail to predict correspondences in images that exhibit out-of-plane rotation, as image-space warping only demonstrates in-plane rotation. Other approaches leverage explicit 3D geometry to supervise correspondences [23, 28]. Within this category, Florence et al. [13] demonstrate a self-supervised learning approach for collecting training correspondences using motion and depth sensors on robots. This approach is prone to failure whenever the depth sensor fails to measure the correct depth, which occurs often for thin or reflective structures. Methods that use only RGB inputs face the challenge of ambiguous visual correspondences in regions with no texture or drastic depth variations. Other approaches have demonstrated simulation-based descriptor training [26, 27], which is an attractive approach due to its flexibility. However, it requires significant engineering effort to configure accurate and realistic simulations. Our work uses NeRF to generate training correspondences from only real-world, non-synthetic RGB images captured in uncontrolled settings, thereby avoiding the shortcomings of depth sensors and addressing ambiguity by modeling correspondence with a density field, which we interpret as a probability distribution over possible depths.

III. METHOD

Our approach introduces an RGB-sensor-only framework to provide training data for supervising dense correspondence models. In particular, the framework provides the fundamental unit of training data required for training such models, which is a tuple of the form:

(I_s, u_s, I_t, u_t)   (1)

that consists of a pair of RGB images I_s and I_t, each in R^{w×h×3}, and a pair of pixel-space coordinates u_s and u_t, each in R^2, whose image-forming rays intersect the same point in 3D space. Rather than proposing a specific correspondence model for using these tuples, our focus is on an approach for generating this training data.

Given this ground-truth correspondence data (1), a variety of learning-based correspondence approaches can be trained, but our experiments focus on object-centric dense descriptor models [13], which have been shown to be useful in enabling generalizable robot manipulation [13, 24, 25, 26, 27]. With a descriptor-based correspondence model, a neural network f_θ with parameters θ maps an input RGB image I to a dense visual descriptor image f_θ(I) ∈ R^{h×w×d}, where each pixel is encoded by a d-dimensional feature vector, and closeness (small Euclidean distance) in the descriptor space indicates correspondence despite viewpoint changes, lighting changes, and potentially category-level variation [13, 23, 28].
A. NeRF Preliminaries

NeRF [1] uses a neural network to represent a scene as a volumetric field of density σ and RGB color c. The weights of a NeRF are initialized randomly and optimized for an individual scene using a collection of input RGB images as supervision (the camera poses of the images are assumed to be known, and are often recovered via COLMAP [14]). After optimization, the density field modeled by the NeRF captures the geometry of the scene (where a large density indicates an occupied region) and the color field models the view-dependent appearance of those occupied regions. A multilayer perceptron (MLP) parameterized by weights Θ is used to predict the density σ and RGB color c of each point as a function of that point's 3D position x = (x, y, z) and unit-norm viewing direction d as input. To overcome the spectral bias that neural networks exhibit in low-dimensional spaces [29], each input is encoded using a positional encoding γ(·), giving us (σ, c) ← F_Θ(γ(x), γ(d)). To render a pixel, NeRF casts a camera ray r(t) = o + td from the camera center o along the direction d passing through that pixel on the image plane. Along the ray, K discrete points {x_k = r(t_k)}_{k=1}^{K} are sampled for use as input to the MLP, which outputs a set of densities and colors {σ_k, c_k}_{k=1}^{K}. These values are then used to estimate the color Ĉ(r) of that pixel following volume rendering [30], using a numerical quadrature approximation [31]:

Ĉ(r) = Σ_{k=1}^{K} T_k (1 − exp(−σ_k (t_{k+1} − t_k))) c_k,   with   T_k = exp(−Σ_{k'<k} σ_{k'} (t_{k'+1} − t_{k'})),   (2)

where T_k can be interpreted as the probability that the ray successfully transmits to point r(t_k). NeRF is then trained to minimize a photometric loss L_photo = Σ_{r∈R} ||Ĉ(r) − C(r)||_2^2, using some sampled set of rays r ∈ R, where C(r) is the observed RGB value of the pixel corresponding to ray r in some image. For more details, we refer readers to Mildenhall et al. [1].
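The sketch below illustrates the quadrature in (2) for a single ray, assuming the per-sample densities and colors have already been produced by the MLP; the function name composite_color and the sample spacing are illustrative, and a real implementation would operate on batches of rays inside the training loop.

```python
import numpy as np

def composite_color(sigmas, colors, t_vals):
    """Numerical quadrature of Eq. (2): alpha-composite per-sample densities and
    colors along one ray. sigmas: (K,), colors: (K, 3), t_vals: (K+1,) distances."""
    deltas = t_vals[1:] - t_vals[:-1]                  # t_{k+1} - t_k, shape (K,)
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each interval
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas[:-1])]))  # T_k
    weights = trans * alphas                           # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0), weights

# Example: 64 samples along a ray with random densities and colors.
rng = np.random.default_rng(0)
K = 64
t_vals = np.linspace(2.0, 6.0, K + 1)
sigmas = rng.uniform(0.0, 5.0, size=K)
colors = rng.uniform(0.0, 1.0, size=(K, 3))
rgb, w = composite_color(sigmas, colors, t_vals)
print(rgb, w.sum())
```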
Fig. 4: Generating correspondences from NeRF's density field vs. depth map. We show two example objects, (a) fork and (b) strainer, with epipolar lines drawn in the target images. We denote the query pixel u_s as +; the correspondence found in the other image using NeRF's depth map and the correspondences found by NeRF's density field are marked as points, where each point's radius is scaled by its corresponding weight. The correspondence implied by NeRF's depth map is incorrect, but by using NeRF's density field directly, the correct correspondence can be sampled probabilistically.

B. Sparse Depth Supervision for NeRF

For objects and scenes with particularly challenging geometry (in particular, thin and reflective structures), we find that leveraging recent work on incorporating depth supervision into NeRF [22] improves geometry accuracy for our purposes.

Though Deng et al. [22] focus on the few-image setting (i.e., ∼5 images), in our investigations we found that even in the many-view (i.e., ∼60 images) setting, adding depth supervision is beneficial. Specifically, we find NeRF's density prediction often deteriorates in real-world 360° inward-facing scenes due to the transient shadows cast by the photographer or robot on the scene. Because these shadows appear in some images but not others, NeRF tends to explain them away by introducing artifacts in the optimized density field. Incorporating the depth supervision appears to effectively mitigate this issue.

Though NeRF's primary goal is to perform view synthesis by rendering RGB images, the volumetric rendering equation in (2) can be modified slightly to produce the expected termination depth of each ray (as was done in [1, 22]) by simply replacing the predicted color c_k with the distance t_k:

D̂(r) = Σ_{k=1}^{K} T_k (1 − exp(−σ_k (t_{k+1} − t_k))) t_k.   (3)

Because T_k represents the probability of the ray transmitting through interval k, the resulting depth D̂(r) is the expected distance that ray r will travel when cast into the scene. We can obtain a ground-truth depth D(r) by first transforming the 3D keypoint k(r) that is associated with the ray r to the camera frame with camera pose G ∈ SE(3) and then extracting its coordinate along the camera's z-axis: D(r) = ⟨G^{-1} k(r), [0, 0, 1]⟩. The depth-supervision loss L_depth = Σ_{r∈R} ||D̂(r) − D(r)||_2^2 is defined as the squared distance between the predicted depth D̂(r) and the "ground-truth" depth D(r) (which in our case is the partial depth map generated by COLMAP's structure from motion). Note that this supervision is only sparse, not dense: this loss is not imposed for pixels where the depth supervisor does not return a valid depth. The final combined loss for training DS-NeRF is: L = L_photo + L_depth.
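A minimal sketch of the expected termination depth in (3) and of the sparse depth loss, under the assumption that rendered depths, COLMAP depths, and a per-ray validity mask are already available; the function names are illustrative.

```python
import numpy as np

def expected_depth(sigmas, t_vals):
    """Eq. (3): reuse the compositing weights of Eq. (2) but composite the sample
    distances t_k instead of colors, giving the expected ray termination depth."""
    deltas = t_vals[1:] - t_vals[:-1]
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas[:-1])]))
    weights = trans * alphas
    return (weights * t_vals[:-1]).sum()

def sparse_depth_loss(pred_depths, gt_depths, valid):
    """L_depth: squared error between rendered and COLMAP depths, imposed only on
    rays where the sparse supervisor returned a valid depth."""
    return ((pred_depths - gt_depths) ** 2)[valid].sum()

# Toy example: one ray, then a small batch of rays with partially valid supervision.
rng = np.random.default_rng(0)
t_vals = np.linspace(2.0, 6.0, 65)
print(expected_depth(rng.uniform(0.0, 5.0, size=64), t_vals))
pred = rng.uniform(2.0, 6.0, size=8)
gt = rng.uniform(2.0, 6.0, size=8)
valid = rng.uniform(size=8) > 0.5       # e.g. pixels where COLMAP produced a point
print(sparse_depth_loss(pred, gt, valid))
```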
C. Depth-Map Dense Correspondences from NeRF

The first approach we investigate in order to generate correspondence training data from NeRF is to render pairs of RGB-D images, and effectively treat NeRF as a traditional depth sensor by extracting a depth map D ∈ R^{w×h} with a single-valued depth at each discrete pixel. In this case, the single-valued depth estimate for each dense pixel is computed using (3). Each training image pair consists of one rendered RGB-D image (Î_s, D̂_s) with camera pose G_s and another rendered RGB-D image (Î_t, D̂_t) with camera pose G_t. Below, we slightly abuse the notation and use D̂_s(u_s) to represent the predicted depth at pixel u_s.

Given these depth maps rendered by NeRF, and assuming known camera intrinsics K, we can then generate the target pixel u_t in Î_t given a query pixel u_s in Î_s:

u_t = π(K G_t^{-1} G_s K^{-1} D̂_s(u_s) u_s),   (4)

where π(·) represents the projection operation. We will refer to this data generation method as depth-map, as it uses the mean of NeRF's distribution of depths at each pixel to render a depth map.
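The reprojection in (4) can be sketched as follows, assuming camera-to-world poses and shared intrinsics across the pair; the (column, row) pixel convention, the function name reproject, and the toy calibration values are illustrative assumptions, not the paper's code.

```python
import numpy as np

def reproject(u_s, depth_s, K, G_s, G_t):
    """Eq. (4): lift pixel u_s = (u, v) to 3D using its rendered depth and the
    source camera pose G_s (camera-to-world, 4x4), then project into the target
    camera with pose G_t and shared intrinsics K (3x3)."""
    uv1 = np.array([u_s[0], u_s[1], 1.0])
    x_cam_s = depth_s * (np.linalg.inv(K) @ uv1)       # 3D point in source camera frame
    x_world = G_s @ np.append(x_cam_s, 1.0)            # to world frame
    x_cam_t = np.linalg.inv(G_t) @ x_world             # to target camera frame
    uvw = K @ x_cam_t[:3]
    return uvw[:2] / uvw[2]                            # perspective division = pi(.)

# Toy example: identity source pose, target camera translated 10 cm along x.
K = np.array([[500.0, 0.0, 252.0], [0.0, 500.0, 189.0], [0.0, 0.0, 1.0]])
G_s = np.eye(4)
G_t = np.eye(4); G_t[0, 3] = 0.1
print(reproject((100.0, 80.0), 1.5, K, G_s, G_t))
```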
D. Generating Probabilistic Dense Correspondences from NeRF's Density Field

While using NeRF's depth map to generate dense correspondences may work well when the distribution of density along the ray has only a single mode, it may produce an incorrect depth when the density distribution is multi-modal along the ray. In Fig. 4, we show two examples of this case, where NeRF's depth map generates incorrect correspondences. To resolve this issue, we propose to treat correspondence generation not via a single depth for each pixel, but via a distribution of depths, which as shown in Fig. 4 can have modes which correctly recover correspondences where the depth map failed.

Specifically, we can sample depth values based on the alpha-compositing weights w:

w(D̂(u_s) = t_k) = T_k (1 − exp(−σ_k (t_{k+1} − t_k))).   (5)

Rather than reducing the depth distribution into its mean by rendering out depth maps and sampling the correspondences deterministically, this formulation retains a complete distribution over depths and samples correspondences probabilistically. In practice, we first sample K points along each ray and get {w(D̂(u_s) = t_k), t_k}_{k=1}^{K} from NeRF. Then, we normalize {w(D̂(u_s) = t_k)}_{k=1}^{K} to sum to 1 and treat it as a probability distribution for sampling t.
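A minimal sketch of this sampling step, reusing the compositing weights in (5) as a categorical distribution over candidate depths along the ray through u_s; the bimodal toy density is only meant to mimic the thin-structure case in Fig. 4 and is not data from the paper.

```python
import numpy as np

def sample_correspondence_depth(sigmas, t_vals, rng):
    """Eq. (5): normalize the compositing weights along the ray to sum to 1 and
    sample one candidate depth t from the resulting categorical distribution."""
    deltas = t_vals[1:] - t_vals[:-1]
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas[:-1])]))
    weights = trans * alphas
    probs = weights / weights.sum()
    return rng.choice(t_vals[:-1], p=probs)

# Toy example: a bimodal density (a thin structure in front of a background surface).
rng = np.random.default_rng(0)
t_vals = np.linspace(2.0, 6.0, 65)
sigmas = np.zeros(64)
sigmas[10] = 10.0    # first mode: thin structure
sigmas[40] = 50.0    # second mode: surface behind it
print([round(float(sample_correspondence_depth(sigmas, t_vals, rng)), 2) for _ in range(5)])
```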
We hypothesize that the probabilistic formulation can produce more precise downstream neural correspondence networks since, as depicted in Fig. 4, the modes of the density, rather than the mean, can be closer to the ground truth. Furthermore, when combined with a self-consistency check (Sec. III-E) during descriptor learning, the probability of sampling false positives is reduced. This hypothesis is tested in our Results section.

E. Additional Correspondence Learning Details

Self-consistency. After obtaining u_t from u_s, we perform a self-consistency check by starting from u_t and identifying its probabilistic correspondence û_s in I_s. We only adopt the pair of pixels (u_s, u_t) if the distance between u_s and û_s is smaller than a certain threshold. This is our probabilistic analogue to the deterministic visibility check in [13, 32].

Sampling from mask. We acquire object masks for the training images through a finetuned Mask R-CNN [33]. Similar to Dense Object Nets [13], masks are used to sample pixels of the object during descriptor learning.
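A hedged sketch of this check: the forward and backward correspondence samplers are passed in as callables standing in for the NeRF-based generator, and the 2-pixel threshold is an illustrative value rather than the one used in the paper.

```python
import numpy as np

def cycle_consistent(u_s, forward, backward, thresh_px=2.0):
    """Self-consistency check: map u_s to u_t with the forward (source->target)
    sampler, map u_t back with the backward sampler, and keep the pair only if
    the round trip lands within thresh_px of the original pixel."""
    u_t = forward(u_s)
    u_s_back = backward(u_t)
    keep = np.linalg.norm(np.asarray(u_s) - np.asarray(u_s_back)) < thresh_px
    return keep, u_t

# Toy samplers: the forward map shifts pixels by 5 columns, the backward map
# undoes the shift with a little noise, standing in for probabilistic sampling.
rng = np.random.default_rng(0)
forward = lambda u: (u[0] + 5.0, u[1])
backward = lambda u: (u[0] - 5.0 + rng.normal(0.0, 0.5), u[1])
print(cycle_consistent((100.0, 80.0), forward, backward))
```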
IV. RESULTS

We execute a series of experiments using real-world images for training and evaluation. We evaluate dense descriptors learned with correspondences generated with different approaches. The goals of the experiments are four-fold: (i) to investigate whether the 3D geometry predicted by NeRF is sufficient for training precise descriptors, particularly on challenging thin and reflective objects, (ii) to compare our proposed method to existing off-the-shelf descriptors, (iii) to investigate whether the distribution-of-depths formulation is effective, and (iv) to test the generalization ability of visual descriptors produced by our pipeline.

A. Settings

Datasets. We evaluate our approach and baseline methods using 8 objects (3 distinct classes). For each object, we captured 60 input images using an iPhone 12 with locked auto exposure and auto focus. The images are resized to 504 × 378. We use COLMAP [14] to estimate both the camera poses and a sparse point cloud for each object. To construct the test set, 8 images are randomly selected and held out during training. We manually annotate (for evaluation only) 100 correspondences using these test images for each object.

Metrics. We employ the Average End Point Error (AEPE) and the Percentage of Correct Keypoints (PCK) as the evaluation metrics. AEPE is computed as the average Euclidean distance, in pixel space, between estimated and ground-truth correspondences. PCK@δ is defined as the percentage of estimated correspondences with a pixel-wise Euclidean distance < δ w.r.t. the ground truths.
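Both metrics reduce to a few lines; the sketch below assumes predicted and annotated correspondences are stacked as N×2 pixel arrays, and the simulated predictions are for illustration only.

```python
import numpy as np

def aepe_and_pck(pred, gt, delta=3.0):
    """AEPE: mean Euclidean pixel error between predicted and annotated
    correspondences. PCK@delta: fraction of predictions within delta pixels."""
    errors = np.linalg.norm(pred - gt, axis=-1)        # (N,)
    return errors.mean(), (errors < delta).mean()

# Toy example with 100 annotated correspondences, as in the evaluation setup.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 378, size=(100, 2))
pred = gt + rng.normal(0.0, 2.0, size=(100, 2))        # simulated predictions
aepe, pck3 = aepe_and_pck(pred, gt, delta=3.0)
print(f"AEPE={aepe:.2f}px  PCK@3px={pck3:.2f}")
```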
B. Methods

First, we consider several off-the-shelf learned descriptors that attain state-of-the-art results on commonly used dense correspondence benchmarks (e.g., ETH3D [34]).

• GLU-Net [6] is a model architecture that integrates both global and local correlation in a feature pyramid-based network for estimating dense correspondences.
• GOCor [12] improves GLU-Net [6]'s feature correlation layers to disambiguate similar regions in the scene.
• PDC-Net [7] adopts the architecture of GOCor [12] and further parametrizes the predictive distribution as a constrained mixture model for estimating dense correspondences and their uncertainties.

Next, we train Dense Object Nets (DONs) [13] for learning dense visual descriptors. In practice, we set the dimensionality of visual descriptors to d = 3. We consider using COLMAP or NeRF to generate training correspondences to supervise DONs.

• COLMAP [16] is a widely-used classical multi-view stereo (MVS) method. We use the estimated depth maps to generate correspondences.
• NeRF [1] is a volume-rendering approach, which we either use via depth maps (Sec. III-C) or probabilistically through the density field (Sec. III-D) to generate correspondences.

C. Comparisons

We evaluate dense descriptors and show quantitative results in Table I, Table II, and Table III. We find that the off-the-shelf dense descriptors do not handle object-centric scenes well, potentially because they are trained on images with synthetic warps and have not seen the target objects from a wide range of viewing angles. In contrast, Dense Object Nets trained with the target objects perform much better. This suggests the need for a data collection pipeline that generates object-centric training data for robot manipulation. Among the three correspondence generation approaches, COLMAP has the highest error. Using the density field of NeRF to sample correspondences attains the best performance: it outperforms Dense Object Nets supervised with COLMAP by 29% and off-the-shelf descriptors by 106% on the PCK@3px metric.

D. Generalization

We evaluate the trained Dense Object Nets on novel scenes and objects not present in the training data. Fig. 5 shows examples of Whisks and Strainers and their visual descriptors. We follow the same visualization method as in [13].

Noisy background and lighting. In Fig. 5a, we show results of our learned descriptors when the objects are placed on a different background or in different lighting conditions. The results demonstrate that our learned descriptors can be deployed in environments different from the training scenes.

Multiple objects. We show the learned descriptors when the input image contains multiple objects in Fig. 5b. The results demonstrate that the descriptors are consistent for objects of different sizes.

Category-level generalization. We further test our model on unseen objects of the same category. Fig. 5c shows unseen objects not in the training set. The learned visual descriptors can robustly generalize to these unseen objects and estimate view-invariant descriptors.
(a) Different background & shadows   (b) Multiple objects   (c) Unseen objects

Fig. 5: Qualitative results of generalization to novel scenes and objects. (a) We show that the learned object descriptors can be consistent across significant 1) viewpoint, 2) background, and 3) lighting variations. (b) We visualize the learned descriptors for multiple objects, even though the model has never seen multiple objects during training. (c) We test our model on objects that are not seen during training. The visual descriptors are shown to be consistent with previously-seen objects in the category.

TABLE I: Average End Point Error (AEPE), ↓ lower is better.

                                           Strainer-S  Strainer-M  Strainer-L  Whisk-S  Whisk-M  Whisk-L  Fork-S  Fork-L   Mean
Off-the-shelf   GLU-Net [6]                     33.25       28.09       28.92    16.06    15.36    39.04   17.12   18.28  24.52
                GOCor [12]                      34.23       26.89       20.92    10.80     7.04    31.95   10.20   13.86  19.49
                PDC-Net [7]                     32.48       13.70       23.77     7.82     5.81    19.94    8.30    8.76  15.07
DON [13] via    Depth map, COLMAP MVS            8.91        5.52        7.65     4.50     4.10     8.90    5.31    5.87   6.35
                Depth map, NeRF (ours)           5.64        4.31        5.24     3.82     3.52     6.84    3.73    4.19   4.66
                Density field, NeRF (ours)       4.53        4.08        3.93     3.28     3.19     4.96    3.42    3.66   3.88

TABLE II: Percentage of Correct Keypoints (PCK@3px), ↑ higher is better.

                                           Strainer-S  Strainer-M  Strainer-L  Whisk-S  Whisk-M  Whisk-L  Fork-S  Fork-L   Mean
Off-the-shelf   GLU-Net [6]                      0.04        0.04        0.07     0.24     0.26     0.00    0.16    0.14   0.12
                GOCor [12]                       0.10        0.05        0.07     0.26     0.33     0.03    0.18    0.16   0.15
                PDC-Net [7]                      0.14        0.19        0.11     0.48     0.51     0.19    0.42    0.38   0.30
DON [13] via    Depth map, COLMAP MVS            0.32        0.41        0.38     0.57     0.64     0.44    0.55    0.51   0.48
                Depth map, NeRF (ours)           0.52        0.56        0.51     0.62     0.66     0.50    0.67    0.63   0.58
                Density field, NeRF (ours)       0.58        0.59        0.61     0.64     0.66     0.58    0.69    0.64   0.62

TABLE III: Percentage of Correct Keypoints (PCK@5px), ↑ higher is better.

                                           Strainer-S  Strainer-M  Strainer-L  Whisk-S  Whisk-M  Whisk-L  Fork-S  Fork-L   Mean
Off-the-shelf   GLU-Net [6]                      0.09        0.09        0.10     0.37     0.44     0.06    0.26    0.21   0.20
                GOCor [12]                       0.13        0.10        0.11     0.47     0.63     0.09    0.29    0.28   0.26
                PDC-Net [7]                      0.29        0.25        0.16     0.53     0.68     0.26    0.57    0.51   0.41
DON [13] via    Depth map, COLMAP MVS            0.62        0.72        0.64     0.79     0.80     0.48    0.60    0.55   0.65
                Depth map, NeRF (ours)           0.82        0.84        0.75     0.82     0.81     0.56    0.79    0.76   0.77
                Density field, NeRF (ours)       0.84        0.87        0.79     0.82     0.82     0.64    0.82    0.78   0.80

E. Example Application: 6-DoF Robotic Pick and Place

We demonstrate accurate 6-DoF pick and place of thin and reflective objects. After learning the dense descriptors, we specify a set of semantic keypoints which encode an SE(3) grasp pose for each category. Before any grasp, we track the keypoints' 2D locations using the descriptors and move the robot to capture two RGB images of the scene using the camera mounted on the robot arm. Then, we use triangulation to derive the keypoints' 3D locations and execute the encoded SE(3) grasp pose. For more details, please see Sec. A.
grasp pose for each category. Before any grasp, we track
V. CONCLUSION

We introduce NeRF-Supervision as a pipeline to generate data for learning object-centric dense descriptors. Compared to previous approaches based on RGB-D cameras or MVS, our method enables learning dense descriptors of thin, reflective objects. We believe these results chart forward a general paradigm in which NeRF may be leveraged as an untapped representational format for supervising robot vision systems.

Acknowledgements. We thank Felix Yanwei Wang, Anthony Simeonov, Wei-Chiu Ma, Rachel Holladay, and Maria Bauza for helpful feedback on the draft. This work was supported by a grant from Amazon.

APPENDIX

A. Robotic Pick And Place

We use a UR5 robot with a Robotiq 2F-85 parallel jaw gripper. A RealSense D415 camera is mounted on the robot arm and precisely calibrated for both intrinsics and extrinsics. We illustrate the grasping pipeline in Fig. 6, and we show the pick and place in action in Fig. 7.

Fig. 6: Grasping pipeline. Panels: (a) input image, (b) segmented image, (c) output descriptors, (d) tracked keypoints. We feed the input image (a) into a segmentation model to generate the segmented image (b), which is then taken as input to predict dense descriptors (c). We manually define a set of semantic keypoints (d) and track them using the descriptors. Finally, we perform triangulation on stereo image pairs to derive the keypoints' 3D locations and the corresponding grasp pose. Click the image to play the video in a browser.

Fig. 7: 6-DoF pick and place. We show that our robot can accurately grasp objects that are not in the training data with SE(3) grasp poses. Click the image to play the video in a browser.

REFERENCES

[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," in ECCV, 2020.
[2] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in ICCV, 2011.
[3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, 2008.
[5] D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperPoint: Self-supervised interest point detection and description," in Computer Vision and Pattern Recognition Workshops, 2018.
[6] P. Truong, M. Danelljan, and R. Timofte, "GLU-Net: Global-local universal network for dense flow and correspondences," in CVPR, 2020.
[7] P. Truong, M. Danelljan, L. V. Gool, and R. Timofte, "Learning accurate dense correspondences and when to trust them," in CVPR, 2021.
[8] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, "COTR: Correspondence Transformer for matching across images," in ICCV, 2021.
[9] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in ICCV, 2015.
[10] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in CVPR, 2016.
[11] I. Rocco, R. Arandjelović, and J. Sivic, "Convolutional neural network architecture for geometric matching," in CVPR, 2017.
[12] P. Truong, M. Danelljan, L. V. Gool, and R. Timofte, "GOCor: Bringing globally optimized correspondence volumes into your neural network," in NeurIPS, 2020.
[13] P. R. Florence, L. Manuelli, and R. Tedrake, "Dense object nets: Learning dense visual object descriptors by and for robotic manipulation," in Conference on Robot Learning, 2018.
[14] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in CVPR, 2016.
[15] L. Manuelli, W. Gao, P. Florence, and R. Tedrake, "kPAM: Keypoint affordances for category-level robotic manipulation," arXiv preprint arXiv:1903.06684, 2019.
[16] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, "Pixelwise view selection for unstructured multi-view stereo," in ECCV, 2016.
[17] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, "NeRF in the wild: Neural radiance fields for unconstrained photo collections," in CVPR, 2021.
[18] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, "Nerfies: Deformable neural radiance fields," in ICCV, 2021.
[19] L. Yen-Chen, P. Florence, J. T. Barron, A. Rodriguez, P. Isola, and T.-Y. Lin, "iNeRF: Inverting neural radiance fields for pose estimation," in IROS, 2021.
[20] E. Sucar, S. Liu, J. Ortiz, and A. Davison, "iMAP: Implicit mapping and positioning in real-time," in ICCV, 2021.
[21] R. A. Drebin, L. Carpenter, and P. Hanrahan, "Volume rendering," ACM SIGGRAPH Computer Graphics, 1988.
[22] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, "Depth-supervised NeRF: Fewer views and faster training for free," arXiv preprint arXiv:2107.02791, 2021.
[23] T. Schmidt, R. Newcombe, and D. Fox, "Self-supervised visual descriptor learning for dense correspondence," IEEE Robotics and Automation Letters, 2016.
[24] P. Florence, L. Manuelli, and R. Tedrake, "Self-supervised correspondence in visuomotor policy learning," IEEE Robotics and Automation Letters, 2019.
[25] L. Manuelli, Y. Li, P. Florence, and R. Tedrake, "Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning," arXiv preprint arXiv:2009.05085, 2020.
[26] P. Sundaresan, J. Grannen, B. Thananjeyan, A. Balakrishna, M. Laskey, K. Stone, J. E. Gonzalez, and K. Goldberg, "Learning rope manipulation policies using dense object descriptors trained on synthetic depth data," in ICRA, 2020.
[27] A. Ganapathi, P. Sundaresan, B. Thananjeyan, A. Balakrishna, D. Seita, J. Grannen, M. Hwang, R. Hoque, J. E. Gonzalez, N. Jamali et al., "Learning to smooth and fold real fabric using dense object descriptors trained on synthetic color images," arXiv preprint arXiv:2003.12698, 2020.
[28] C. B. Choy, J. Y. Gwak, S. Savarese, and M. Chandraker, "Universal correspondence network," in NeurIPS, 2016.
[29] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng, "Fourier features let networks learn high frequency functions in low dimensional domains," in NeurIPS, 2020.
[30] J. T. Kajiya and B. P. V. Herzen, "Ray tracing volume densities," SIGGRAPH, 1984.
[31] N. Max, "Optical models for direct volume rendering," IEEE TVCG, 1995.
[32] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[33] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.
[34] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, "A multi-view stereo benchmark with high-resolution images and multi-camera videos," in CVPR, 2017.
