
Vox-E: Text-guided Voxel Editing of 3D Objects


Figure 1. Given multiview images of an object (left), our technique generates volumetric edits from target text prompts, allowing for significant geometric and appearance changes, while faithfully preserving the input object. The objects can be edited either locally (center) or globally (right), depending on the nature of the user-provided text prompt.

Etai Sella1, Gal Fiebelman1, Peter Hedman2, Hadar Averbuch-Elor1
1 Tel Aviv University   2 Google Research
arXiv:2303.12048v3 [cs.CV], 19 Sep 2023

Abstract

Large scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits which cannot be achieved by prior works¹.

1. Introduction

Creating and editing 3D models is a cumbersome task. While template models are readily available from online databases, tailoring one to a specific artistic vision often requires extensive knowledge of specialized 3D editing software. In recent years, neural field-based representations (e.g., NeRF [30]) demonstrated expressive power in faithfully capturing fine details, while offering effective optimization schemes through differentiable rendering. Their applicability has recently expanded also to a variety of editing tasks. However, research in this area has mostly focused on either appearance-only manipulations, which change the object's texture [47, 49] and style [51, 45], or geometric editing via correspondences with an explicit mesh representation [13, 50, 48]—linking these representations to the rich literature on mesh deformations [19, 41]. Unfortunately, these methods still require placing user-defined control points on the explicit mesh representation, and cannot allow for adding new structures or significantly adjusting the geometry of the object. In this work, we are interested in enabling more flexible and localized object edits, guided only by textual prompts, which can be expressed through both appearance and geometry modifications.

¹ Our code can be reached through our project page at http://vox-e.github.io/
To do so, we leverage the incredible competence of pretrained 2D diffusion models in editing images to conform with target textual descriptions. We carefully apply a score distillation loss, as recently proposed in the unconditional text-driven 3D generation setting [34]. Our key idea is to regularize the optimization in 3D space. We achieve this by coupling two volumetric fields, providing the system with more freedom to comply with the text guidance on the one hand, while preserving the input structure on the other hand. Rather than using neural fields, we base our method on lighter voxel-based representations which learn scene features over a sparse voxel grid. This explicit grid structure not only allows for faster reconstruction and rendering times, but also for achieving a tight coupling between the volumetric fields representing the 3D object before and after applying the desired edit, using a novel volumetric correlation loss over the density features.

To further refine the spatial extent of the edits, we utilize 2D cross-attention maps which roughly capture regions associated with the target edit, and lift them to volumetric grids. This approach is built on the premise that, while independent 2D internal features of generative models can be noisy, unifying them into a single 3D representation allows for better distilling the semantic knowledge. We then use these 3D cross-attention grids as a signal for a binary volumetric segmentation algorithm that splits the reconstructed volume into edited and non-edited regions, allowing for merging the features of the volumetric grids to better preserve regions that should not be affected by the textual edit.

Our approach, coined Vox-E, provides an intuitive voxel editing interface, where the user only provides a simple target text prompt (see Figure 1). We compare our method to existing 3D object editing techniques, and demonstrate that our approach can facilitate local and global edits involving appearance and geometry changes over a variety of objects and text prompts, which are extremely challenging for current methods. Explicitly stated, our contributions are:

• A coupled volumetric representation tied using 3D regularization, allowing for editing 3D objects using diffusion models as guidance while preserving the appearance and geometry of the input object.
• A 3D cross-attention based volumetric segmentation technique that defines the spatial extent of textual edits.
• Results that demonstrate that our proposed framework can perform a wide array of editing tasks which could not previously be achieved.

2. Related Work

Text-driven Object Editing. Computational methods targeting text-driven image generation and manipulation have seen tremendous progress with the emergence of CLIP [35] and diffusion models [17], advancing from specific domains [1, 40, 33, 12] to more generic ones [31, 4, 11, 22]. Several recent methods allow for performing convincing localized edits on real images without requiring mask guidance [5, 15, 8, 43, 32]. However, these methods all operate on single images and cannot facilitate consistent editing of 3D objects. While less common, methods for manipulating 3D objects are also gaining increasing interest. Methods such as LADIS [18] and ChangeIt3D [2] aim at learning the relations between 3D shape parts and text directly, using datasets composed of edit descriptions and shape pairs. These works allow for geometric edits but fail to generalize to out-of-distribution shapes and cannot modify appearance.
Alternatively, several methods have proposed leveraging 2D image projections, matching these to a driving text. Text2Mesh [29] uses CLIP for stylizing 3D meshes based on textual prompts. Tango [10] also styles meshes with CLIP, additionally enabling stylization of lighting conditions, reflectance properties and local geometric variations. TEXTure [36] uses a depth-to-image diffusion model for texturing 3D meshes. Unlike our work, these methods focus mostly on texturing meshes, and cannot be used for generating significant geometric modifications, such as adding glasses or other types of accessories.

Neural Field Editing. Neural fields (e.g., NeRF [30]), which can be effectively learned from multi-view images through differentiable rendering, have recently shown great promise for representing objects and scenes. Prior works have demonstrated that these fields can be adapted to express different forms of manipulations. ARF [51] transfers the style of an exemplar image to a NeRF. NeRF-Art [45] performs a text-driven style transfer. Distilled Feature Fields [24] distill the knowledge of 2D image feature extractors into a 3D feature field and use this feature field to localize edits performed by CLIP-NeRF [44], which optimizes a radiance field so that its rendered images match with a text prompt via CLIP. Several works have shown that neural fields can be edited by editing selected 2D images [26, 49]. NeuTex [47] uses 2D texture maps, which can be edited directly, to represent the surface appearance. Other works demonstrated geometric editing of shapes represented with neural fields via correspondences with an explicit mesh representation [13, 50, 48], which can be edited using as-rigid-as-possible deformations [41]. However, these cannot easily allow for modifying the 3D mesh to incorporate additional parts according to the user's provided description. Concurrently to our work, Instruct-NeRF2NeRF [14] uses an image editing model to iteratively edit multi-pose images from which an edited 3D scene is reconstructed. Unlike our work, which optimizes the underlying 3D representation, they optimize the input images directly. Furthermore, our method is based on grid-based representations rather than neural fields, in particular ReLU Fields [21], which do not require any neural networks and instead model the scene as a voxel grid where each voxel contains learned features. We show that having an explicit grid structure is beneficial for editing 3D objects as it enables fast reconstruction and rendering times as well as powerful volumetric regularization.

Text-to-3D. Following the great success of text-to-image generation, we are witnessing increasing interest in unconditional text-driven generation of 3D objects and scenes. CLIP-Forge [38] uses CLIP guidance to generate coarse object shapes from text. Dream Fields [20], DreamFusion [34], Score Jacobian Chaining [46] and Latent-NeRF [28] optimize radiance fields to generate the geometry and color of objects driven by the text. While Dream Fields relies on CLIP, the other three methods instead use a score distillation loss, which enables the use of a pretrained 2D diffusion model. Magic3D [25] proposes a two-stage optimization technique to overcome DreamFusion's slow optimization. Unlike these works, we focus on the conditional setting. In our case, a 3D object is provided, and the desired edit should preserve the object's geometry and appearance. Still, we compare with Latent-NeRF in the experiments, as it can use rough 3D shapes as guidance.
3. Method

In this work, we consider the problem of editing 3D objects given a captured set of posed multiview images describing this object and a text prompt expressing the desired edit. We first represent the input object with a grid-based volumetric representation (Section 3.1). We then optimize a coupled voxel grid, such that it resembles the input grid on the one hand while conforming to the target text on the other hand (Section 3.2). To further refine the spatial extent of the edits, we perform an (optional) refinement step (Section 3.3). Figure 2 provides an overview of our approach.

Figure 2. An overview of our approach. Given a set of posed images depicting an object, we optimize an initial feature grid (left). We then perform text-guided object editing using a generative SDS loss and a volumetric regularization, optimizing an edited grid Ge. To localize the edits, we optimize 3D cross-attention grids which define probability distributions over the object and the edit regions. We obtain a volumetric mask from these grids using an energy minimization problem over all the voxels. Finally, we merge the initial and edited grid to obtain a refined volumetric grid (right).

3.1. Grid-Based Volumetric Representation

Our volumetric representation is based on the voxel grid model first introduced in DVGO [42] and later simplified in ReLU Fields [21]. We use a 3D grid G, where each voxel holds a 4D feature vector. We model the object's geometry using a single feature channel which represents spatial density values when passed through a ReLU nonlinearity. The three additional feature channels represent the object's appearance, and are mapped to RGB colors when passed through a sigmoid function. Note that in contrast to most recent neural 3D scene representations (including ReLU Fields), we do not model view-dependent appearance effects, as we found it leads to undesirable artifacts when guided with 2D diffusion-based models.

To represent the input object with our grid-based representation, we use images and associated camera poses to perform volumetric rendering as described in NeRF [30]. However, in contrast to NeRF, we do not use any positional encoding and instead sample our grid at each location query to obtain interpolated density and color values, which are then accumulated along each ray. We use a simple L1 loss between our rendered outputs and the input images to learn a grid-based volume Gi that represents the input object.
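To make this representation concrete, the following is a minimal PyTorch sketch (not the authors' code) of a 4-channel feature grid that is trilinearly sampled and volume-rendered. It assumes a scene normalized to [-1, 1]^3, uniform ray sampling with fixed near/far bounds, and omits details such as sparsity and batching; all names (FeatureGrid, render_rays, etc.) are hypothetical.

```python
import torch
import torch.nn.functional as F

class FeatureGrid(torch.nn.Module):
    """ReLU-Field-style voxel grid: 4 features per voxel
    (1 density channel + 3 color channels)."""
    def __init__(self, resolution=160):
        super().__init__()
        # Shape (1, C=4, D, H, W) so F.grid_sample can be used directly.
        self.features = torch.nn.Parameter(
            torch.zeros(1, 4, resolution, resolution, resolution))

    def query(self, points):
        """points: (N, 3) in [-1, 1]^3 -> density (N,), rgb (N, 3)."""
        grid = points.reshape(1, 1, 1, -1, 3)
        feats = F.grid_sample(self.features, grid,
                              mode='bilinear', align_corners=True)
        feats = feats.reshape(4, -1).t()                 # (N, 4)
        density = F.relu(feats[:, 0])                    # ReLU non-linearity
        rgb = torch.sigmoid(feats[:, 1:])                # sigmoid -> RGB
        return density, rgb

def render_rays(grid, origins, dirs, n_samples=128, near=2.0, far=6.0):
    """Accumulate sampled colors along each ray (standard volume rendering)."""
    t = torch.linspace(near, far, n_samples, device=origins.device)
    pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]  # (R,S,3)
    density, rgb = grid.query(pts.reshape(-1, 3))
    density = density.reshape(-1, n_samples)
    rgb = rgb.reshape(-1, n_samples, 3)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-density * delta)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)          # (R, 3)
```

Fitting the initial grid Gi then amounts to minimizing an L1 loss between render_rays outputs and the corresponding input pixels, as described above.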
3.2. Text-guided Object Editing

Equipped with the initial voxel grid Gi described in the previous section, we perform text-guided object editing by optimizing Ge, a grid representing the edited object which is initialized from Gi. Our optimization scheme combines a generative component, guided by the target text prompt, and a pullback term that encourages the new grid not to deviate too strongly from its initial values. As we later show, our coupled volumetric representation provides added flexibility to our system, allowing for better balancing between the two objectives by regularizing directly in 3D space. Next, we describe these two optimization objectives.

Generative Text-guided Objective. To encourage our feature grid to respect the desired edit provided via a textual prompt, we use a Score Distillation Sampling (SDS) loss applied over Latent Diffusion Models (LDMs). SDS was first introduced in DreamFusion [34], and consists of minimizing the difference between noise injected to a generator's output and noise predicted by a pre-trained Denoising Diffusion Probabilistic Model (DDPM). Formally, at each optimization iteration, noise is added to a generated image x using a random time-step t,

x_t = x + \epsilon_t,   (1)

where \epsilon_t is the output of a noising function Q(t) at time-step t. The score distillation gradients (computed per pixel) can be expressed as:

\nabla_x \mathcal{L}_{SDS} = w(t)\,(\epsilon_t - \epsilon_\phi(x_t, t, s)),   (2)

where w(t) is a weighting function, s is an input guidance text, and \epsilon_\phi(x_t, t, s) is the noise predicted by a pre-trained DDPM with weights \phi given x_t, t and s. As suggested by Lin et al. [25], we use an annealed SDS loss which gradually decreases the maximal time-step we draw t from, allowing SDS to focus on high-frequency information after the outline of the edit has formed. We empirically found that this often leads to higher quality outputs.

Volumetric Regularization. Regularization is key in our problem setting, as we want to avoid over-fitting to specific views and also not to deviate too far from the original 3D representation. Therefore, we propose a volumetric regularization term, which couples our edited grid Ge with the initial grid Gi. Specifically, we incorporate a loss term which encourages correlation between the density features of the input grid f_i^\sigma and the density features of the edited grid f_e^\sigma:

\mathcal{L}_{reg3D} = 1 - \frac{\mathrm{Cov}(f_i^\sigma, f_e^\sigma)}{\sqrt{\mathrm{Var}(f_i^\sigma)\,\mathrm{Var}(f_e^\sigma)}}   (3)

This volumetric loss has a significant edge over image-space losses as it allows for decoupling the appearance of the scene from its structure, thereby connecting the volumetric representations in 3D space rather than treating it as a multiview optimization problem.
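The two objectives of this section can be summarized in a few lines of PyTorch. The sketch below is illustrative rather than the authors' implementation: `noise_pred_fn` stands in for a frozen diffusion noise predictor conditioned on the text s (not a real API), it operates directly on the rendered image rather than on VAE latents for brevity, the noising keeps the simplified additive form of Eq. (1), and the sign convention follows Eq. (2) as written.

```python
import torch

def volumetric_correlation_loss(density_init, density_edit, eps=1e-8):
    """Eq. (3): one minus the Pearson correlation between the density
    channels of the initial grid G_i and the edited grid G_e."""
    x, y = density_init.flatten(), density_edit.flatten()
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return 1.0 - cov / (x.var(unbiased=False).sqrt()
                        * y.var(unbiased=False).sqrt() + eps)

def sds_surrogate_loss(image, noise_pred_fn, t, w_t, prompt_embedding):
    """Score distillation step (Eqs. 1-2): add noise at time-step t,
    let the frozen diffusion model predict it, and use w(t)*(eps - eps_pred)
    as the gradient w.r.t. the rendered image."""
    eps = torch.randn_like(image)
    noisy = image + eps                                  # Eq. (1), simplified
    with torch.no_grad():
        eps_pred = noise_pred_fn(noisy, t, prompt_embedding)
    grad = w_t * (eps - eps_pred)                        # Eq. (2)
    # Surrogate whose gradient w.r.t. `image` equals `grad`
    # (the diffusion model itself is never backpropagated through).
    return (grad.detach() * image).sum()

# Total editing objective, with lambda_reg balancing the two terms:
# loss = sds_surrogate_loss(render, eps_fn, t, w_t, s) \
#        + lambda_reg * volumetric_correlation_loss(f_i_sigma, f_e_sigma)
```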
3.3. Spatial Refinement via 3D Cross-Attention

While our optimization framework described in the previous section can mostly preserve the shape and the identity of a 3D object, for local edits it is usually desirable to only change specific local regions, while keeping other regions completely fixed. Therefore, we add an (optional) refinement step which leverages the signal from cross-attention layers to produce a volumetric binary mask M that marks the voxels which should be edited. We then obtain the refined grid Gr by merging the input grid Gi and the edited grid Ge as

G_r = M \cdot G_e + (1 - M) \cdot G_i.   (4)

In the context of 2D image editing with diffusion models, the outputs of the cross-attention layers roughly capture the spatial regions associated with each word (or token) in the text. More concretely, these cross-attention maps can be interpreted as probability distributions over tokens for each image patch [15, 8]. We elevate these 2D probability maps to a 3D grid by using them as supervision for training a ReLU field. We initialize the density values from the ReLU field trained in Section 3.1 and keep these fixed, while using probability maps in place of color images and optimizing for the probability values in the grid using an L1 loss. As shown in Figures 3 and 4, optimizing for a volumetric representation allows for ultimately refining the 2D probability maps, for instance by resolving inconsistent 2D observations (as illustrated in Figure 4).

Figure 3. Optimizing 3D cross-attention grids for edit localization (columns, left to right: input image, edited grid, 2D cross-attention map, 3D cross-attention grid, segmentation mask, output). We leverage rough 2D cross-attention maps (third column) for supervising the training of 3D cross-attention grids (fourth column). Provided with cross-attention grids associated with the edit (as demonstrated above for "christmas sweater" and "crown") and object regions, we formulate an energy minimization problem, which outputs a volumetric binary segmentation mask (fifth column). We then merge the features of the input (first column) and edited (second column) grids using this volumetric mask to obtain our final output (rightmost column). Note that warmer colors correspond to higher activations in the cross-attention maps and edited regions are colored in gray in the binary segmentation mask.

Figure 4. Cross-attention 2D maps and rendered 3D grids over multiple viewpoints, obtained for the token associated with the word "rollerskates" (from the "kangaroo on rollerskates" text prompt). While 2D cross-attention may yield inconsistent observations, such as high probabilities over the tail region in the rightmost column, our 3D grids can more accurately localize the region of interest (effectively smoothing out such inconsistencies).

We then convert these 3D probability fields to our binary mask M using a seam-hiding segmentation algorithm based on energy minimization [3]. Specifically, we extract a segmentation grid that minimizes an energy function composed of two terms: a unary term, which penalizes disagreements with the label probabilities, and a smoothness term, which penalizes large pairwise color differences within similarly-labeled voxels. We define the label probabilities for each voxel cell as the element-wise softmax of two cross-attention grids Ae and Aobj, where:

• Ae is the cross-attention grid associated with the token describing the edit (e.g., sunglasses), and
• Aobj is the grid associated with the object, defined as the maximum probability over all other tokens in the prompt.

We compute the smoothness term from local color differences in the edited grid Ge. That is, we sum

w_{pq} = \exp\!\left(\frac{-(c_p - c_q)^2}{2\sigma^2}\right)   (5)

for each pair of same-labeled neighboring voxels p and q, where c_p and c_q are RGB colors from Ge. In our experiments, we use \sigma = 0.1 and balance the data and smoothness terms with a parameter \lambda = 5 (strengthening the smoothness term). Finally, we solve this energy minimization problem via graph cuts [7], resulting in the high-quality segmentation masks shown in Figure 3.
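As a concrete illustration of the refinement step, the snippet below sketches (in PyTorch, under assumed tensor layouts: attention grids of shape (D, H, W), feature grids of shape (D, H, W, 4)) the per-voxel label probabilities, the pairwise weight of Eq. (5), and the merge of Eq. (4). It is a sketch, not the authors' code; the graph-cut solve itself is detailed in Appendix A.1.

```python
import torch

def label_probabilities(a_edit, a_obj):
    """Per-voxel softmax over the two cross-attention grids A_e and A_obj,
    yielding P_e(v) and P_obj(v) for the unary term."""
    probs = torch.softmax(torch.stack([a_edit, a_obj]), dim=0)
    return probs[0], probs[1]

def smoothness_weight(c_p, c_q, sigma=0.1):
    """Eq. (5): capacity between same-labeled neighboring voxels, here
    interpreting (c_p - c_q)^2 as the squared distance over RGB channels."""
    return torch.exp(-((c_p - c_q) ** 2).sum(dim=-1) / (2.0 * sigma ** 2))

def merge_grids(grid_init, grid_edit, mask):
    """Eq. (4): G_r = M * G_e + (1 - M) * G_i, broadcast over the four
    feature channels; `mask` is the binary volumetric segmentation."""
    m = mask.float().unsqueeze(-1)
    return m * grid_edit + (1.0 - m) * grid_init
```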
4. Experiments

We show qualitative editing results over diverse 3D objects and various edits in Figures 1, 5, 6 and 8. Please refer to the supplementary material for many fly-through visualizations demonstrating that our results are indeed consistent across different views. To assess the quality of our object editing approach, we conduct several sets of experiments, quantifying the extent to which the rendered images conform to the target text prompt. We provide comparisons with prior 3D object editing methods in Section 4.2, and comparisons to 2D editing methods in Section 4.3. An additional comparison to an unconditional text-to-3D method is presented in Section 4.4. Results over real scenes are illustrated in Section 4.5. We show ablations in Section 4.6. Finally, we discuss limitations in Section 4.7. Additional results, visualizations, ablations and comparisons can be found in the supplementary material.

Synthetic Object Dataset. We assembled a dataset using freely available meshes found on the internet. Each mesh was rendered from 100 views in Blender. For a quantitative evaluation, we paired each object in our dataset with a number of both local and global edit prompts, including:

• "A ⟨object⟩ wearing sunglasses".
• "A ⟨object⟩ wearing a party hat".
• "A ⟨object⟩ wearing a Christmas sweater".
• "A yarn doll of a ⟨object⟩".
• "A wood carving of a ⟨object⟩".

We separately evaluate local and global edits, using our spatial refinement step over local edits only. For instance, the first three prompts above are considered local edits (where regions that are not associated with the text prompt should remain unchanged) and the last two as edits that should produce global changes. We provide additional details in the supplementary material.

Runtime. All experiments were performed on a single RTX A5000 GPU (24GB VRAM). The training time for our method is approx. 50 minutes for the editing stage and 15 minutes for the optional refinement stage.

4.1. Metrics

Edit Fidelity. We evaluate how well the generated results capture the target text prompt using two metrics. CLIP Similarity (CLIPSim) measures the semantic similarity between the output objects and the target text prompts. We encode both the prompt and images rendered from our 3D outputs using CLIP's text and image encoders, respectively, and measure the cosine distance between these encodings. CLIP Direction Similarity (CLIPDir) evaluates the quality of the edit in regards to the input by measuring the directional CLIP similarity first introduced by Gal et al. [12]. This metric measures the cosine distance between the direction of the change from the input to the output rendered images and the direction of the change from an input prompt (i.e., "a dog") to the one describing the edit (i.e., "a dog wearing a hat").

Edit Magnitude. For ablating components in our model, we use the Frechét Inception Distance (FID) [16, 39] to measure the difference in visual appearance between: (i) the output and input images (FIDInput) and (ii) the output and images generated by the initial reconstruction grid (FIDRec). We show both to demonstrate to what extent the appearance is affected by the edit versus the expressive power of our framework.

Table 1. Quantitative Evaluation. We compare against the 3D object editing techniques Text2Mesh [29], two variants of Latent-NeRF [28]: SketchShape (Sketch) and Latent-Paint (Paint), and DFF+CN [24, 44], over local (top) and global (bottom) edits. *Note that Text2Mesh and DFF+CN explicitly train to minimize a CLIP loss, and thus directly comparing them is uninformative over these metrics.

Local edits:
  Method                         CLIPSim ↑      CLIPDir ↑
  DFF+CN                         0.32*          0.01*
  Text2Mesh                      0.34*          0.03*
  Latent-NeRF (Sketch / Paint)   0.30 / 0.31    0.01 / 0.01
  Ours                           0.34           0.02

Global edits:
  Method                         CLIPSim ↑      CLIPDir ↑
  DFF+CN                         0.34*          0.05*
  Text2Mesh                      0.36*          0.08*
  Latent-NeRF (Sketch / Paint)   0.32 / 0.31    0.01 / 0.01
  Ours                           0.36           0.07

Figure 5. Results obtained by our method over different objects and prompts (with the inputs displayed on the left), including "A bulldozer on a magic carpet", "A cactus in a pot", "A rainbow colored microphone", "A dog in low-poly video game style", "An alien wearing a tuxedo", "A duck with a wizard hat", "A dog wearing a christmas sweater" and "A cat made of wood". Please refer to the supplementary material for additional qualitative results.
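For reference, the two fidelity metrics can be computed with an off-the-shelf CLIP model as sketched below, using OpenAI's `clip` package with the ViT-B/32 weights reported in the appendix. The helper names and the exact preprocessing are illustrative assumptions rather than the evaluation code used in the paper.

```python
import torch
import clip  # OpenAI CLIP; the paper reports evaluating with ViT-B/32

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(pil_img):
    with torch.no_grad():
        f = model.encode_image(preprocess(pil_img).unsqueeze(0).to(device))
    return f / f.norm(dim=-1, keepdim=True)

def embed_text(prompt):
    with torch.no_grad():
        f = model.encode_text(clip.tokenize([prompt]).to(device))
    return f / f.norm(dim=-1, keepdim=True)

def clip_sim(render, target_prompt):
    """CLIP_Sim: cosine similarity between an output render and the target text."""
    return (embed_image(render) @ embed_text(target_prompt).T).item()

def clip_dir(render_in, render_out, prompt_in, prompt_out):
    """CLIP_Dir: cosine similarity between the image-space change and the
    text-space change (directional CLIP similarity, Gal et al. [12])."""
    d_img = embed_image(render_out) - embed_image(render_in)
    d_txt = embed_text(prompt_out) - embed_text(prompt_in)
    d_img = d_img / d_img.norm(dim=-1, keepdim=True)
    d_txt = d_txt / d_txt.norm(dim=-1, keepdim=True)
    return (d_img @ d_txt.T).item()
```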
4.2. 3D Object Editing Comparisons

To the best of our knowledge, there is no prior work that can directly perform our task of text-guided localized edits for 3D objects given a set of posed input images. Thus, we consider Distilled Feature Fields [24] combined with CLIP-NeRF [44] (DFF+CN), Text2Mesh [29] and Latent-NeRF [28], which can be applied in a similar setting to ours. These experiments highlight the differences between prior works and our proposed editing technique. Distilled Feature Fields [24] distills 2D image features into a 3D feature field to enable query-based local editing of 3D scenes. CLIP-NeRF edits a neural radiance field by optimizing the CLIP score of the input query and the rendered image. Combining these two methods allows editing only the relevant parts of the 3D scene. Text2Mesh [29] aims at editing the style of a given input mesh to conform to a target prompt with a style transfer network that predicts color and a displacement along the normal direction. As it only predicts displacements along the normal direction, the geometric edits enabled by Text2Mesh are limited mostly to small changes.
Latent-Paint and SketchShape are two applications introduced in Latent-NeRF [28] which operate on input meshes. SketchShape generates shape and appearance from coarse input geometry, while Latent-Paint only edits the appearance of an existing mesh. Note that Text2Mesh and Latent-NeRF are designed for slightly more constrained inputs than our approach. While our focus is on editing 3D models with arbitrary textures (as depicted from associated imagery), they only operate on uncolored meshes.

We show a qualitative comparison in Figure 6 over an uncolored mesh (its geometry can be observed on the second row from the top, as Latent-Paint keeps the input geometry fixed). As illustrated in the figure, Text2Mesh cannot produce significant geometric edits (e.g., adding a Santa hat to the horse or turning the horse into a donkey). Even SketchShape, which is designed to allow geometric edits, cannot achieve significant localized edits. Furthermore, it fails to preserve the geometry of the input—although, we again note that this method is not intended to preserve the input geometry. DFF+CN seems generally less suitable for our problem setting, particularly for prompts that require geometric modifications (i.e., "A donkey"). Our method, in contrast to prior works, succeeds in conforming to the target text prompt, while preserving the input geometry, allowing for semantically meaningful changes to both geometry and appearance.

Figure 6. Comparison to other 3D object editing techniques, over the prompts "A wood carving of a horse", "Horse wearing a santa hat", "A donkey" and "A carousel horse". We show qualitative results obtained using Text2Mesh [29], two applications of Latent-NeRF [28] (Latent-Paint and SketchShape) and DFF+CN [24, 44], and compare to our method. To accommodate their problem setting, the top three methods are provided with uncolored meshes. Note that the input meshes are visible on the second row from the top (as Latent-Paint does not edit the object's geometry). As illustrated above, prior methods struggle at achieving semantic localized edits. Our method succeeds, while maintaining high fidelity to the input object.

We perform a quantitative evaluation in Table 1 on our dataset. To perform a fair comparison where all methods operate within their training domain, we use meshes without texture maps as input for Text2Mesh and Latent-NeRF. As illustrated in the table, our method outperforms all baselines over both local and global edits in terms of CLIP similarity, but Text2Mesh yields slightly higher CLIP direction similarity.
We note that Text2Mesh as well as DFF+CN are advantaged in terms of the CLIP metrics as they explicitly optimize CLIP similarities, and thus their scores are not entirely indicative.

4.3. 2D Image Editing Comparisons

An underlying assumption in our work is that editing 3D geometry cannot easily be done by reconstructing edited 2D images depicting the scene. To test this hypothesis, we modified images rendered from various viewpoints using the diffusion-based image editing methods InstructPix2Pix [8] and SDEdit [27]. We show two variants of these methods in Figure 7, one with added backgrounds, as we observe that it also affects performance. In both cases, as illustrated in the figure, 2D methods often struggle to produce meaningful results from less canonical views (e.g., adding sunglasses on the dog's back) and also produce highly view-inconsistent results. Concurrently to us, Instruct-NeRF2NeRF [14] explores how to best use these 2D methods to learn view-consistent 3D representations.

Figure 7. Comparison to 2D image editing techniques. We compare to the text-guided image editing techniques InstructPix2Pix (IPix2Pix) [8] and SDEdit [27] by providing them with images from different viewpoints and a target instruction text prompt ("put sunglasses on the dog" for IPix2Pix and "a dog with sunglasses" for SDEdit and our method). We show one input image on the left, and three outputs on the right (side, front and back views), where the leftmost output corresponds to the input viewpoint. We show two variants, one with added backgrounds (top rows), as we observe that it allows for better preserving the object's appearance. As illustrated above, 2D techniques cannot easily achieve 3D-consistent edit results (illustrated, for instance, by the sunglasses added on the dog's back).

4.4. Comparisons to an Unconditional Text-to-3D Model

In Figure 9 we compare to the unconditional text-to-3D model proposed in Latent-NeRF, to show that such unconditional models are also not guaranteed to generate a consistent object over different prompts. We also note that this result (as well as our edits) would certainly look better if fueled with a proprietary big diffusion model [37], but nonetheless, these models cannot preserve identity.

Figure 9. Comparison to unconditional text-to-3D generation. We compare to unconditional text-to-3D methods by comparing to Latent-NeRF [28], providing it with the two target prompts displayed above ("Kangaroo on rollerskates" and "Kangaroo on skis"). We display these alongside our results (Latent-NeRF on the left, ours on the right). As illustrated above, unconditional methods cannot easily match an input object, and are also not guaranteed to generate a consistent object over different prompts.

4.5. Real Scenes

In Figure 8, we demonstrate that our method also succeeds in modeling and editing real scenes using the 360° Real Scenes made available by Mildenhall et al. [30]. As illustrated in the figure, we can locally edit the foreground (e.g., turning the pinecone into a pineapple) as well as globally edit the scene (e.g., turning the scene into a Van-Gogh painting). For these more complex and computationally demanding scenes, we also experiment with implementing our method on top of DVGO [42] (bottom row), in addition to ReLU-Fields which we exclusively focus on in all other experiments (top row), as it offers additional features such as scene contraction, a more expressive color feature space and complex ray sampling. These make this underlying representation better suited for editing and reconstructing these real scenes (as illustrated in the columns labeled as 'Initial'). This experiment also demonstrates that our method is agnostic to the underlying 3D representation of the scene and can readily operate over different grid-based representations.

Figure 8. Editing real scenes with different underlying 3D representations. We show results obtained when using DVGO [42] (bottom row) and ReLU-Fields (RF, top row). We show samples from the input image dataset (leftmost columns), initial scene reconstructions (second columns), results over local edits (third columns) and results over global edits (rightmost columns); the edit prompts shown are "A pineapple on the ground", "A pinecone floating in a pond", "A vase full of sunflowers" and "A van-Gogh painting of a flower vase".

4.6. Ablations

We provide an ablation study in Table 2 and Figure 10. Specifically, we ablate our volumetric regularization (Lreg3D) and our 3D cross-attention-based spatial refinement module (SR). When ablating our volumetric regularization, we use a single volumetric grid and regularize the SDS objective with an image-based L2 regularization loss. More details and additional ablations are provided in the supplementary material, including alternative regularization objectives (such as an image-based L1 loss, or volumetric regularization over RGB features) and results using higher order spherical harmonics coefficients.

The benefit of using our volumetric regularization is further illustrated in Figure 10, which shows that image-space regularization leads to very noisy results, and often complete failures (see, for instance, the cat result, where the output is not at all correlated with the input object). Quantitatively, we can also observe that images rendered from these models are of significantly different appearance (as measured using the FID metrics). Regarding the SR module, as expected, it increases similarity to the inputs (reflected in lower FID scores). This is also clearly visible in Figure 10—for example, geometric differences are apparent by looking at the animals' paws. The output textures after refinement are also more similar to the input textures. However, we also see that this module slightly hinders CLIP similarity to the edit and text prompt. This is also somewhat expected as we are further constraining the output to stay similar to the input, sometimes at the expense of the editing signal.

Table 2. Ablation study, evaluating the effect of the volumetric regularizer between our coupled grids (Lreg3D, Section 3.2) and the 3D cross-attention-based spatial refinement module (SR, Section 3.3) over a set of metrics (detailed in Section 4).

  Lreg3D   SR   CLIPSim ↑   CLIPDir ↑   FIDRec ↓   FIDInput ↓
  ×        ×    0.29        0.05        367.53     384.55
  ✓        ×    0.37        0.08        240.37     288.26
  ✓        ✓    0.36        0.06        119.44     236.32

Figure 10. Qualitative ablation results, obtained for the target prompt "A <object> wearing sunglasses" over three different objects (columns: Input, w/o Lreg3D, w/o SR, Output). Image-space regularization (denoted by "w/o Lreg3D") leads to extremely noisy results. The edited grid before refinement (denoted by "w/o SR") respects the target prompt, but some of the fidelity to the geometry and appearance of the input object is lost. In contrast, our refined grid successfully combines the edited and input regions to output a result that complies with the target text and also preserves the input object.

4.7. Limitations

Our method applies a wide range of edits with high fidelity to 3D objects; however, there are several limitations to consider. As shown in Figure 11, since we optimize over different views, our method attempts to edit the same object in differing spatial locations, thus failing on certain prompts. Moreover, the figure shows that some of our edits fail due to incorrect attribute binding, where the model binds attributes to the wrong subjects, which is a common challenge in large-scale diffusion-based models [9]. Finally, we inherit the limitations of our volumetric representation. Thus, the quality of real scenes, for instance, could be significantly improved by borrowing ideas from works such as [6] (e.g., scene contraction to model the background).

Figure 11. Limitations. Above, we present several failure cases for the prompts "A horse with a pig tail", "A pink unicorn" and "A horse riding on a magic carpet" (when provided with rendered images of the uncolored mesh displayed in Figure 6, top row). These likely result from incorrect attribute binding (the horse's nose turning into a pig's nose), inconsistencies across views (two horns on the unicorn) or excessive regularization to the input object (carpet on the horse, not below).
5. Conclusion

In this work, we presented Vox-E, a new framework that leverages the expressive power of diffusion models for text-guided voxel editing of 3D objects. Technically, we demonstrated that by combining a diffusion-based image-space objective with volumetric regularization we can achieve fidelity to the target prompt and to the input 3D object. We also illustrated that 2D cross-attention maps can be elevated for performing localization in 3D space. We showed that our approach can generate both local and global edits, which are challenging for existing techniques. Our work makes it easy for non-experts to modify 3D objects using just text prompts as input, bringing us closer to the goal of democratizing 3D content creation and editing.

6. Acknowledgments

We thank Rinon Gal, Gal Metzer and Elad Richardson for their insightful feedback. This work was supported by a research gift from Meta, the Alon Fellowship and the Yandex Initiative in AI.

References

[15] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. 2022. 2, 5 [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017. 6 [17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020. 2 [18] Ian Huang, Panos Achlioptas, Tianyi Zhang, Sergey Tulyakov, Minhyuk Sung, and Leonidas Guibas. Ladis: Language disentanglement for 3d shape editing. arXiv preprint arXiv:2212.05011, 2022. 2 [19] Takeo Igarashi, Tomer Moscovich, and John F Hughes. As-rigid-as-possible shape manipulation. ACM Transactions on Graphics (TOG), 24(3):1134–1141, 2005. 2 [20] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In CVPR, 2022. 3 [21] Animesh Karnewar, Tobias Ritschel, Oliver Wang, and Niloy Mitra. Relu fields: The little non-linearity that could.
In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022. 3, 12 [22] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 2 [23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 12 [24] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems, 35:23311–23330, 2022. 2, 6, 7 [25] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: Highresolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022. 3, 4 [26] Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, and Bryan Russell. Editing conditional radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5773–5783, 2021. 3 [27] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. 7, 8 [28] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv preprint arXiv:2211.07600, 2022. 3, 6, 7, 8 [29] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In CVPR, 2022. 2, 6, 7 [30] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: [1] Rameen Abdal, Peihao Zhu, Niloy J. Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Trans. Graph., 40(3), May 2021. 2 [2] Panos Achlioptas, Ian Huang, Minhyuk Sung, Sergey Tulyakov, and Leonidas Guibas. Changeit3d: Languageassisted 3d shape edits and deformations, 2022. 2 [3] Aseem Agarwala, Mira Dontcheva, Maneesh Agrawala, Steven Drucker, Alex Colburn, Brian Curless, David Salesin, and Michael Cohen. Interactive digital photomontage. ACM Trans. Graph., 23(3):294–302, 2004. 5 [4] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022. 2 [5] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In ECCV, 2022. 2 [6] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022. 9 [7] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222– 1239, 2001. 5, 13 [8] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. November 2022. 2, 5, 7, 8 [9] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. 
arXiv preprint arXiv:2301.13826, 2023. 9 [10] Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, and Kui Jia. Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition. arXiv preprint arXiv:2210.11277, 2022. 2 [11] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022. 2 [12] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators, 2021. 2, 6 [13] Stephan J Garbin, Marek Kowalski, Virginia Estellers, Stanislaw Szymanowicz, Shideh Rezaeifar, Jingjing Shen, Matthew Johnson, and Julien Valentin. Voltemorph: Realtime, controllable and generalisable animation of volumetric representations. arXiv preprint arXiv:2208.00949, 2022. 2, 3 [14] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. arXiv preprint arXiv:2303.12789, 2023. 3, 8 10 [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] Can Wang, Ruixiang Jiang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Nerf-art: Text-driven neural radiance fields stylization. arXiv preprint arXiv:2212.08070, 2022. 2 [46] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. ArXiv, abs/2212.00774, 2022. 3 [47] Fanbo Xiang, Zexiang Xu, Milos Hasan, Yannick HoldGeoffroy, Kalyan Sunkavalli, and Hao Su. Neutex: Neural texture mapping for volumetric neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7128, 2021. 2, 3 [48] Tianhan Xu and Tatsuya Harada. Deforming radiance fields with cages. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pages 159–175. Springer, 2022. 2, 3 [49] Bangbang Yang, Chong Bao, Junyi Zeng, Hujun Bao, Yinda Zhang, Zhaopeng Cui, and Guofeng Zhang. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, pages 597–614. Springer, 2022. 2, 3 [50] Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, Yuewen Ma, Rongfei Jia, and Lin Gao. Nerf-editing: geometry editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18353–18364, 2022. 2, 3 [51] Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields, 2022. 2 Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 1, 2, 4, 8 Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022. 2 Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. ArXiv, abs/2302.03027, 2023. 2 Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, 2021. 2 Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. 
arXiv preprint arXiv:2209.14988, 2022. 2, 3, 4 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 2 Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721, 2023. 2 Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 8 Aditya Sanghi, Hang Chu, Joseph G Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. In CVPR, 2022. 3 Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, 08 2020. Version 0.2.1. 6, 13 Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In CVPR, 2020. 2 Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, volume 4, pages 109–116, 2007. 2, 3 Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022. 3, 8 Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022. 2 Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3835–3844, 2022. 2, 6, 7

Appendix

A. Additional Details

A.1. Implementation Details

Below we provide all the implementation details of our method, detailed in Section 3 in the main paper.

Grid-Based Volumetric Representation. We use 100 images uniformly sampled from upper-hemisphere poses, along with corresponding camera intrinsic and extrinsic parameters, to train our initial grid. We follow the standard ReLU Fields [21] training process using their default settings aside from two modifications:
1. We change the result grid size from the standard 128³ to 160³ to increase the output render quality.
2. As detailed in the main paper, we limit the order of spherical harmonics to be zero order only to avoid undesirable view-dependent effects (we further illustrate these effects in Section B.2).

Text-guided Object Editing. We perform 8000 training iterations during the object editing optimization stage. During each iteration, a random pose is uniformly sampled from an upper hemisphere and an image is rendered from our edited grid Ge according to the sampled pose and the rendering process described in ReLU Fields [21]. Noise is then added to the rendered image according to the time-step sampled from the fitting distribution. We use an annealed SDS loss which gradually decreases the maximal time-step we draw t from. Formally, this annealed SDS loss introduces three additional hyper-parameters to our system: a starting iteration i_start, an annealing frequency f_a and an annealing factor \gamma_a. With these hyper-parameters set, we change our time-step distribution to be:

t \sim \mathcal{U}[t_0 + \varepsilon,\; t_{final} \cdot k_i + \varepsilon],   (6)

k_i = \begin{cases} 1, & \text{if } i < i_{start} \\ k_{i-1} \cdot \gamma_a, & \text{else if } i \bmod f_a = 0 \\ k_{i-1}, & \text{otherwise} \end{cases}   (7)

In all our experiments, the values we use for \varepsilon, i_start, f_a and \gamma_a are 0.02, 4000, 600, and 0.75. Additionally, we stop annealing the time-step when it reaches a value of 0.35. The latent diffusion model we use in our experiments is "StableDiffusion 2.1" by Stability AI. We use a weight of 200 to balance the two terms (multiplying Lreg3D by this weight value). The volumetric regularization term operates only on the density features of the editing grid. The optimizer we used in this (and all other stages) is the Adam optimizer [23] with a learning rate of 0.03 and betas 0.9, 0.999. The resolution of the images rendered from our grid is 266×266. We add an "a render of" prefix to all of our editing prompts, as we found that this produced more coherent results (and the images the LDM receives are indeed renders).
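A small sketch of this annealed time-step schedule (Eqs. 6-7) is given below. It assumes t_0 = 0 and folds the 0.35 stopping value in as a floor on k_i, which is one way to read the rule stated above; the function and variable names are hypothetical.

```python
import random

def annealed_timestep(i, k_prev, i_start=4000, f_a=600, gamma_a=0.75,
                      eps=0.02, t_final=1.0, k_min=0.35):
    """Annealed SDS time-step sampling (Eqs. 6-7): the upper bound of the
    uniform distribution shrinks by gamma_a every f_a iterations once
    i >= i_start, and stops shrinking once k reaches k_min."""
    if i < i_start:
        k = 1.0
    elif i % f_a == 0:
        k = max(k_prev * gamma_a, k_min)
    else:
        k = k_prev
    t = random.uniform(eps, t_final * k + eps)   # Eq. (6), with t_0 = 0
    return t, k
```

Inside the optimization loop this would be called as `t, k = annealed_timestep(i, k)`, with k initialized to 1.0.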
Spatial Refinement via 3D Cross-Attention. The diffusion model we use for this stage is "StableDiffusion 1.4" by CompVis, and it consists of several cross-attention layers at resolutions 32, 16, and 8. To extract a single attention map for each token, we interpolate each cross-attention map from each layer and attention head to our image resolution (266×266) and take an average per token. The time-step we use to generate the attention maps is 0.2 (the actual step being 0.2 * N_steps = 200).

The cross-attention grids Ae and Aobj contain a density feature and an additional one-dimensional feature a, which represents the cross-attention value at a given voxel and can be interpreted and rendered as a grayscale luma value. We initialize the density features in these grids to the density features of the editing grid (the former stage's output) and freeze them. At each refinement iteration we generate two 2D cross-attention maps from the LDM, one for the object and one for the edit. After obtaining the 2D cross-attention maps, we render gray-scale heatmaps from Ae and Aobj and use an L1 loss to encourage similarity between the rendered attention images and their corresponding attention maps extracted from the diffusion model. We repeat this process for 1500 iterations, sampling a random upper-hemisphere pose each time. As in the former optimization stage, we use the Adam optimizer with a learning rate of 0.03 and betas 0.9 and 0.999, and generate images in 266×266 resolution.

After obtaining the two grids Ae and Aobj, we perform element-wise softmax on their a values to obtain probabilities for each voxel belonging to either the object, denoted by Pobj(v), or the edit, denoted by Pe(v). We then proceed to calculate the binary refinement volumetric mask. To do this we define a graph in which each non-zero density voxel in our edited grid Ge is a node. We define "edit" and "object" labels as the source and drain nodes, such that a node connected to the source node is marked as an "edit" node and a node connected to the drain node is marked as an "object" node. We rank the nodes according to their Pe(v) values and connect the top N_init-edit nodes to the source node. We then rank the nodes according to their Pobj(v) values and connect the top N_init-object nodes to the drain node. We then connect the non-terminal nodes to each other in a 6-neighborhood, with the capacity of each edge being w_pq as detailed in the main paper. We set the hyper-parameters N_init-edit and N_init-object to 300 and 200.
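The graph construction just described can be written down compactly with the max-flow / min-cut package referenced in the next paragraph. The sketch below is a simplified illustration only: it adds seed edges for the top-ranked voxels plus 6-neighborhood smoothness edges, omits the unary data term and all performance optimizations (a real implementation would vectorize or use grid-edge helpers), and the segment-to-label convention should be double-checked against the library.

```python
import numpy as np
import maxflow  # PyMaxflow

def segment_edit_voxels(p_edit, p_obj, colors, occupied,
                        n_init_edit=300, n_init_obj=200, lam=5.0, sigma=0.1):
    """p_edit / p_obj: per-voxel probabilities; colors: RGB values from G_e
    (shape (D, H, W, 3)); occupied: boolean mask of non-zero-density voxels.
    Returns a boolean 'edit' mask of the same shape as `occupied`."""
    shape = occupied.shape
    idx = np.flatnonzero(occupied)
    node_of = -np.ones(occupied.size, dtype=np.int64)
    node_of[idx] = np.arange(len(idx))

    g = maxflow.Graph[float]()
    nodes = g.add_nodes(len(idx))
    big = 1e9

    # Seeds: hard-wire the top-ranked voxels to the terminals
    # (source = "edit", sink = "object" / drain).
    pe, po = p_edit.ravel()[idx], p_obj.ravel()[idx]
    for i in np.argsort(-pe)[:n_init_edit]:
        g.add_tedge(nodes[i], big, 0.0)
    for i in np.argsort(-po)[:n_init_obj]:
        g.add_tedge(nodes[i], 0.0, big)

    # Pairwise smoothness edges over the 6-neighborhood, Eq. (5) capacities.
    coords = np.stack(np.unravel_index(idx, shape), axis=1)
    flat_colors = colors.reshape(-1, 3)
    for a, c in enumerate(coords):
        for axis in range(3):
            nb = c.copy(); nb[axis] += 1
            if nb[axis] >= shape[axis]:
                continue
            flat_nb = np.ravel_multi_index(nb, shape)
            j = node_of[flat_nb]
            if j < 0:
                continue
            diff = flat_colors[idx[a]] - flat_colors[flat_nb]
            w = lam * np.exp(-(diff ** 2).sum() / (2.0 * sigma ** 2))
            g.add_edge(nodes[a], nodes[j], w, w)

    g.maxflow()
    mask = np.zeros(shape, dtype=bool)
    # Segment 0 is the source side ("edit") in PyMaxflow's convention.
    mask.ravel()[idx] = np.array([g.get_segment(n) == 0 for n in nodes])
    return mask
```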
To perform graph-cut [7], we used the PyMaxflow implementation of the max-flow / min-cut algorithm. the techniques we compare against use only an un-textured mesh and an editing prompt as input. As such, we used the meshes our inputs were rendered from as input for the editing methods. Additionally, we tested an additional scenario in which we imported the ’horse’ mesh from the Text2Mesh GitHub repository to blender, added a grey-matte material to it and rendered images of it to use as input for our system. This scenario used four prompts: (1) A wood carving of a horse, (2) A horse wearing a Santa hat, (3) A donkey, (4) A carousel horse, and was used for qualitative comparisons only. A.2. Evaluation Protocol To evaluate our results quantitatively, we constructed a test set composed of eight scenes: ’White Dog’, ’Grey Dog’, ’White Cat’, ’Ginger Cat’, ’Kangaroo’, ’Alien’, ’Duck’ and ’Horse’, and six editing prompts: (1) A ⟨object⟩ wearing big sunglasses, (2) A ⟨object⟩ wearing a Christmas sweater, (3) A ⟨object⟩ wearing a birthday party hat, (4) A yarn doll of a ⟨object⟩, (5) A wood carving of a ⟨object⟩, (6) A claymation ⟨object⟩. This yields 18 edited scenes in total. We render each edited scene from 100 different poses distributed evenly along a 360◦ ring. In addition to these 18 scenes we also render 100 images from the same poses on the initial (reconstruction) grid Gi for each input scene. When comparing our result with other 3D textual editing papers we evaluate our results using two CLIP-based metrics. The CLIP model we used for both of these metrics is ViT-B/32 and the input image text prompts used to calculate the directional CLIP metric is “A render of a ⟨object⟩". CLIPDir is calculated for each edited image in relation to the corresponding image in the reconstruction scene. To quantitatively evaluate ablations we use two additional metrics using FID [39]. For this we use the pytorch implementation given by the authors with the standard settings. Text2Mesh When comparing to Text2Mesh we used the code provided by the authors and the input settings given in the "run_horse.sh" demo file. SketchShape In this comparison we again use the code provided by the authors. And the input parameters used are the default parameters in the ’train_latent_nerf.py’ script ’train_latent_nerf.py’ script with 10,000 training steps (as opposed to the default 5,000). Latent-Paint We compared our method to Latent-Paint only qualitatively as this method outputs edits that transform only the appearance of the input mesh, rather than appearance and geometry. As in SketchShape we used the code provided by the authors and used the default input settings provided for latent paint, which are given in the ’train_latent_paint.py’ script. 360◦ Real Scenes For the 360◦ Real Scenes edits we follow the same implementation details as outlined previously, with three modifications: DFF + CN In this comparison we use the code provided by the authors and the default input parameters provided for this method. 1. We alternate between using the DVGO model or the ReLU-Fields model as our 3D representation. Results for both models are presented in Figure 6 of the main paper. A.4. 2D Image Editing Techniques When comparing to InstructPix2Pix and SDEdit we constructed two image sets for each scene / prompt combination we wanted to test. Both sets were created by rendering one of our inputs in evenly spaced poses along a 360◦ ring, one set was rendered over a white background and the other over a ’realistic’ image of a brick wall. 
We used these sets as input for each 2D editing method along with an editing prompt and compared the results to rendered outputs from our result grids. When comparing to InstructPix2Pix we used the standard InstructPix2Pix pipeline with 16bit floating point precision and 20 inference steps. We used the default guidance scale (1.0) for the images rendered over the ’realistic’ background and increased the guidance scale 2. Our input poses are created in a spherical manner and when rendering we sample linearly in inverse depth rather than in depth as seen in the official implementation of NeRF . 3. We perform 5000 training iterations during the object editing optimization stage and the values we use for ε, istart , fa and γa are 0.02, 3000, 400, and 0.75. A.3. 3D Object Editing Techniques Below we provide additional details on the alternative 3D object editing techniques we compare against. All of 13 to 3.0 for the images rendered over a white background, as we found it to produce higher quality results specifically for these more ’synthetic’ images. When giving prompts to InstructPix2Pix we rephrased our prompts as instructions, for example turning "a dog wearing sunglasses" to "put sunglasses on this dog". When comparing to SDEdit we used the standard SDEdit pipeline with guidance scale of 0.75 and a strength of 0.6. 2D 3D In this section, we show a more detailed ablation study which evaluates the effect of our volumetric regularization loss (Section B.1) and an additional experiment, demonstrating the effect of using high order spherical harmonics coefficients (Section B.2). Lreg3D xxxxx Lreg3D++ xxxxxx ”A dog wearing a christmas sweater" FIDInput ↓ 415.96 437.68 437.09 467.14 L1 L2 Lreg3D++ Lreg3D 0.36 0.35 0.34 0.36 0.05 0.05 0.02 0.06 222.91 240.50 210.46 223.89 284.86 284.83 242.73 272.73 Alternative Volumetric Regularization Functions In this setting we replace our correlation-based regularization with other functions that encourage similarity between the density features of the grids using the same balancing weight. Namely we compare against L1 and L2 volumetric loss functions, both penalizing the distance between the density features of Gi and those of Ge . We additionally compare against an alternative version of Lreg3D in which we penalize the miscorrelation between both density and color features, formally: Table 3 shows a quantitative comparison over different image-space and volumetric regularizations. Only the image-space L1 loss also appears in the main paper. Below we provide additional details on these ablations. Input FIDRec ↓ 0.02 0.02 is set to 200, as detailed in Section A.1). We evaluate this ablation using L1 and L2 image space loss functions. B.1. Alternative Regularization Objectives ”A wood carving of a duck" CLIPDir ↑ 0.26 0.25 Table 3. Detailed ablation study, evaluating the effect of different regularization objectives. We compare the performance using Lreg3D , with image-space (top rows) and volumetric (bottom rows) L1 and L2 losses, as well as Lreg3D++ , which also penalizes miscorrelations between color features. B. Ablations Input CLIPSim ↑ L1 L2 Loss Function Cov(firgb , fergb ) ) (8) Lreg3D++ = Lreg3D +(1− q V ar(firgb )V ar(fergb ) We find that using this loss yields better reconstruction scores, at the expense of significantly lower CLIP-based scores (e.g., CLIPDir scores drop from 0.08 to 0.02). Qualitatively, constraining RGB values as well as density features appears too limiting for our purposes. 
B.2. Ablating the Color Representation

As mentioned in Section 3.1 of the main paper, we do not model view-dependent effects using higher-order spherical harmonics, as that leads to undesirable effects. We demonstrate this by observing these effects in examples rendered with 1st and 2nd order spherical harmonic coefficients as color features. These results can be seen in videos available on our project page.

When observing these results, we can clearly see how view-dependent colors yield undesirable effects, such as the feet of the "yarn kangaroo" varying from green to yellow across views or the head of the dog becoming a birthday party hat when it faces away from the camera. We additionally see the colors become over-saturated, especially when using second-order spherical harmonic coefficients. It is also evident that the added expressive capability of the model allows it to over-fit more easily to specific views, creating unrealistic results such as the "cat wearing glasses" in the first- and second-order coefficient models, where glasses are scattered along various parts of its body. We note that while this expressive power currently produces undesirable effects, it does potentially enable higher-quality and more realistic renders, and therefore we believe that constraining this power is an interesting topic for future research.
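For context, the snippet below sketches how view-dependent color is typically decoded from spherical harmonic coefficients in grid-based representations; the constants follow the standard real SH basis up to degree 2 used in common open-source implementations. Our default configuration corresponds to keeping only the degree-0 (view-independent) term, while the ablation above adds the degree-1 and degree-2 terms. This is an illustrative sketch rather than our exact code; the final sigmoid is one common choice for mapping decoded values to valid colors.

import torch

C0 = 0.28209479177387814
C1 = 0.4886025119029199
C2 = [1.0925484305920792, -1.0925484305920792, 0.31539156525252005,
      -1.0925484305920792, 0.5462742152960396]

def sh_to_rgb(sh, view_dir, degree):
    # sh: (..., 3, (degree + 1) ** 2) per-channel SH coefficients.
    # view_dir: (..., 3) unit viewing directions.
    x, y, z = view_dir[..., 0:1], view_dir[..., 1:2], view_dir[..., 2:3]
    rgb = C0 * sh[..., 0]                                   # degree 0: view-independent color
    if degree >= 1:                                         # degree 1: linear view dependence
        rgb = rgb - C1 * y * sh[..., 1] + C1 * z * sh[..., 2] - C1 * x * sh[..., 3]
    if degree >= 2:                                         # degree 2: quadratic view dependence
        xx, yy, zz = x * x, y * y, z * z
        xy, yz, xz = x * y, y * z, x * z
        rgb = (rgb + C2[0] * xy * sh[..., 4] + C2[1] * yz * sh[..., 5]
               + C2[2] * (2.0 * zz - xx - yy) * sh[..., 6]
               + C2[3] * xz * sh[..., 7] + C2[4] * (xx - yy) * sh[..., 8])
    return torch.sigmoid(rgb)                               # map to [0, 1]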
B.3. Cross-attention Grid Supervision

As explained in Section A.1, we use a constant timestep of 0.2 when extracting attention maps for training our attention grids Ae and Aobj. This value was chosen empirically, as we found that higher timesteps tend to be noisier and less focused, while lower timesteps varied greatly from pose to pose, producing inferior attention grids. This can be seen qualitatively in Figure 13. As illustrated in the figure, the attention values for the edit region become gradually more smeared and unfocused as the timesteps increase. This is evident, for instance, in the warmer regions around the kangaroo's tail or the head of the duck. While perhaps less visually distinct, we can also observe that at lower timesteps the warm regions denoting high attention values cover a smaller area of the region which should be edited. We empirically find that this makes it more challenging to separate the object and edit regions.

Figure 13. Visualizing 2D cross-attention maps and 3D cross-attention grids over different diffusion timesteps. We visualize the trained 3D cross-attention grids and the corresponding 2D cross-attention maps used as supervision across different diffusion timesteps (t = 1, 200, 400, 600, 800 and 999). We show them for the edit region corresponding to the token associated with the word "rollerskates" (top two rows) and "hat" (bottom two rows).

C. Additional Visualizations and Results

Visualizing 2D cross-attention maps and images rendered from our 3D cross-attention grids. While the attention maps used as ground-truth are inherently unfocused (as they are up-sampled from very low resolutions) and are not guaranteed to be view-consistent, we show that learning the projection of these attention maps onto our object's density produces view-consistent heat maps for object and edit regions (Figure 14).

Figure 14. Visualizing 2D cross-attention maps and 3D cross-attention grids over multiple viewpoints. We visualize the optimized 3D cross-attention grids and the corresponding 2D cross-attention maps used as supervision. We show them for the edit region corresponding to the token associated with the word "rollerskates" (top two rows) and "hat" (fifth and sixth rows), and the object region (third and fourth rows for the kangaroo and bottom two rows for the duck).
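To make the supervision scheme visualized in Figures 13 and 14 concrete, the following is a schematic sketch of projecting 2D cross-attention maps onto a learnable 3D attention grid: per-sample attention values are accumulated along each ray with the same volume-rendering weights derived from the density grid, and the rendered map is regressed against the (upsampled) 2D attention map. Tensor names, shapes and the use of a simple MSE objective are illustrative assumptions rather than our exact implementation.

import torch
import torch.nn.functional as F

def render_attention(attention_grid, sample_points, render_weights):
    # attention_grid: (1, 1, D, H, W) learnable grid of per-voxel attention values.
    # sample_points:  (num_rays, num_samples, 3) coordinates normalized to [-1, 1].
    # render_weights: (num_rays, num_samples) volume-rendering weights from the density grid.
    num_rays, num_samples, _ = sample_points.shape
    coords = sample_points.view(1, num_rays * num_samples, 1, 1, 3)
    values = F.grid_sample(attention_grid, coords, align_corners=True)
    values = values.view(num_rays, num_samples)
    return (render_weights * values).sum(dim=-1)            # one attention value per ray / pixel

def attention_grid_loss(attention_grid, sample_points, render_weights, target_map):
    # target_map: (num_rays,) per-pixel values of the upsampled 2D cross-attention map.
    rendered = render_attention(attention_grid, sample_points, render_weights)
    return F.mse_loss(rendered, target_map)

The optimized grids Ae and Aobj are then used, together with the graph-cut step mentioned in Section A.1, to separate the object and edit regions.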