Texture: Text-Guided Texturing of 3D Shapes

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

TEXTure: Text-Guided Texturing of 3D Shapes

Elad Richardson* Gal Metzer* Yuval Alaluf Raja Giryes Daniel Cohen-Or

Tel Aviv University


arXiv:2302.01721v1 [cs.CV] 3 Feb 2023

”A hand carved “A full-color photograph


”A white rabbit” ”An orangutan”
wood turtle” of Albert Einstein”

Figure 1. Texturing results. TEXTure takes an input mesh and a conditioning text prompt and paints the mesh with high-quality textures.

Abstract 1. Introduction

In this paper, we present TEXTure, a novel method for The ability to paint pictures with words has long been a
text-guided generation, editing, and transfer of textures for sign of a master storyteller, and with recent advancements
3D shapes. Leveraging a pretrained depth-to-image diffu- in text-to-image models, this has become a reality for us all.
sion model, TEXTure applies an iterative scheme that paints Given a textual description, these new models are able to
a 3D model from different viewpoints. Yet, while depth-to- generate highly detailed imagery that captures the essence
image models can create plausible textures from a single and intent of the input text. Despite the rapid progress
viewpoint, the stochastic nature of the generation process in text-to-image generation, painting 3D objects remains a
can cause many inconsistencies when texturing an entire significant challenge as it requires to consider the specific
3D object. To tackle these problems, we dynamically de- shape of the surface being painted. Recent works have be-
fine a trimap partitioning of the rendered image into three gun making significant progress in painting and texturing
progression states, and present a novel elaborated diffu- 3D objects by using language-image models as guidance
sion sampling process that uses this trimap representation [9, 28, 30, 51]. Yet, these methods still fall short in terms of
to generate seamless textures from different views. We then quality compared to their 2D counterparts.
show that one can transfer the generated texture maps to In this paper, we focus on texturing 3D objects, and
new 3D geometries without requiring explicit surface-to- present TEXTure, a technique that leverages diffusion mod-
surface mapping, as well as extract semantic textures from els [38] to seamlessly paint a given 3D input mesh. Unlike
a set of images without requiring any explicit reconstruc- previous texturing approaches [25, 28] that apply score dis-
tion. Finally, we show that TEXTure can be used to not tillation [33] to indirectly utilize Stable Diffusion [38] as
only generate new textures but also edit and refine existing a texturing prior, we opt to directly apply a full denoising
textures using either a text prompt or user-provided scrib- process on rendered images using a depth-conditioned dif-
bles. We demonstrate that our TEXTuring method excels fusion model [38].
at generating, transferring, and editing textures through
At its core, our method iteratively renders the object
extensive evaluation, and further close the gap between
from different viewpoints, applies a depth-based painting
2D image generation and 3D texturing. Code is avail-
scheme, and projects it back to the mesh vertices or atlas.
able on: https://texturepaper.github.io/
We show that our approach can result in a significant boost
TEXTurePaper/
in both running time and generation quality. However, ap-
plying this process naı̈vely would result in highly inconsis-
* Denotes equal contribution tent texturing with noticeable seams due to the stochastic

1
can directly apply an edit on a texture map, where we refine
the texture to fuse the user-applied edits into the 3D shape.
We evaluate TEXTure and show its effectiveness for tex-
ture generation, transfer, and editing. We demonstrate that
TEXTure offers a significant speedup compared to previ-
ous approaches, and more importantly, offers significantly
A B C D
higher-quality generated textures.
Figure 2. Ablation of our different components: (A) is a naı̈ve
painting scheme. In (B) we introduce “keep” region. (C) is our
improved scheme for “generate” regions, and (D) is our complete
2. Related Work
scheme with “refine” regions. Text-to-Image Diffusion Models. The past year has seen
the development of multiple large diffusion models [31, 36,
nature of the generation process (see Figure 2 (A)). 38, 40] capable of producing impressive images with pris-
To alleviate these inconsistencies, we introduce a dy- tine details guided by an input text prompt. The widely pop-
namic partitioning of the rendered view to a trimap of ular Stable Diffusion [38], is trained on a rich text-image
“keep”, “refine”, and “generate” regions, which is esti- dataset [43] and is conditioned on CLIP’s [35] frozen text
mated before each diffusion process. The “generate” re- encoder. Beyond simple text-conditioning, Stable Diffusion
gions are areas in the rendered viewpoint that are viewed has multiple extensions that allow conditioning its denois-
for the first time and need to be painted. A “refine” re- ing network on additional input modalities such as a depth
gion is an area that was already painted in previous itera- map or an inpainting mask. Given a guiding prompt and an
tions, but is now seen from a better angle and should be estimated depth image [37], the depth-conditioning model
repainted. Finally, “keep” regions are painted regions that is tasked with generating images that follow the same depth
should not be repainted from the current view. We then pro- values while being semantically faithful with respect to the
pose a modified diffusion process that takes into account text. Similarly, the inpainting model completes missing im-
our trimap partitioning. By freezing “keep” regions dur- age regions, given a masked image.
ing the diffusion process we attain more consistent outputs, Although current text-to-image models generate high-
but the newly generated regions still lack global consistency quality results when conditioned on a text prompt or depth
(see Figure 2 (B)). To encourage better global consistency map, editing an existing image or injecting objects speci-
in the “generate” regions, we further propose to incorpo- fied by a few exemplar images remains challenging [2, 47].
rate both depth-guided and mask-guided diffusion models To introduce a user-specific concept to a pre-trained text-to-
into the sampling process (see Figure 2 (C)). Finally, for image model, [15] introduce Textual Inversion to map a few
“refine” regions, we design a novel process that repaints exemplar images into a learned pseudo-tokens in the em-
these regions but takes into account their existing texture. bedding space of the frozen text-to-image model. Dream-
Together these techniques allow the generation of highly- Booth [39] further fine-tune the entire diffusion model on
realistic results in mere minutes (see Figure 2 (D) and Fig- the set of input images to achieve more faithful composi-
ure 1 for results). tions. The learned token or fine-tuned model can then be
Next, we show that our method can be used not only used to generate novel images using the custom token in
to texture meshes guided by a text prompt, but also based new user-specified text prompts.
on an existing textures from some other colored mesh or
even from a small set of images. Our method requires no Texture and Content Transfer. Early works [11, 12, 18,
surface-to-surface mapping or any intermediate reconstruc- 54] focus on 2D texture synthesis through probabilistic
tion step. Instead, we propose to learn semantic tokens models, while more recent works [13,45,50,53] take a data-
that represent a specific texture by building on Textual In- driven approach to generate textures using deep neural net-
version [15] and DreamBooth [39], while extending them works. Generating textures over 3D surfaces is a more chal-
to depth-conditioned models and introducing learned view- lenging problem, as it requires attention to both colors and
point tokens. We show that we can successfully capture the geometry. For geometric texture synthesis, [17] applies a
essence of a texture even from a few unaligned images and similar statistical method to [18] while [6] extended non-
use it to paint a 3D mesh based on its semantic texture. parametric sampling proposed by [12] to 3D meshes. [5] in-
Finally, in the spirit of diffusion-based image edit- troduces a metric learning approach to transfer details from
ing [20–22, 48], we show that one can further refine and a source to a target shape, while [19] use an internal learning
edit textures. We propose two editing techniques. First, technique to transfer geometric texture. For 3D color tex-
we present a text-only refinement where an existing texture ture synthesis, given an exemplar colored mesh, [8, 26, 27]
map is modified using a guiding prompt to better match the use the relation between geometric features and color values
semantics of the new text. Second, we illustrate how users to synthesize new textures on target shapes.

2
TEXTure Overview Iteration 1 Iteration t Iteration T
3D Shape and Texture Generation. Generating shapes
v0 vt
and textures in 3D has recently gained significant inter-
est. Text2Mesh [30], Tango [9], and CLIP-Mesh [23] ... ...
v1
use CLIP-space similarities as an optimization objective Input Mesh

to generate novel shapes and textures. CLIP-Mesh de-


forms an initial sphere with UV texture parameterization.
Tango optimizes per-vertex colors and focuses on generat- Texture Creation (iteration t)
Textured Texture Viewpoint Textured Texture Viewpoint
ing novel textures. Text2Mesh optimizes per-vertex color Mesh Map Cache Mesh Map Cache
vt vt
attributes, while allowing small geometric displacements.
Get3D [16] is trained to generate shape and texture through Tt Nt Tt+1 Nt+1
Refine Generate
a DMTet [46] mesh extractor and 2D adversarial losses. Rendering
Keep
Mesh
Recently, DreamFusion [33] introduced the use of pre- Projection

trained image diffusion models for generating 3D NeRF


Modified
models conditioned on a text prompt. The key compo- Dt Nt
Diffusion Process
Qt nz
nent in DreamFusion is the Score-Distillation loss which Depth
Map
Rendered
Image
Camera Viewpoint
Normals cache
Trimap
”An orangutan”
enables the use of a pretrained 2D diffusion model as a cri-
tique for optimizing the 3D NeRF scene. Recently, Latent- Modified Diffusion Process
NeRF [28] showed how the same Score-Distillation loss can
be used in Stable Diffusion’s latent space to generate latent Dt Qt .. ..
3D NeRF models. In the context for texture generation, [28]
present Latent-Paint, a texture generation technique, where Initial Noise Trimap Mask Denoising Iterations
latent texture maps are painted using Score-Distillation and
Figure 3. Our schematic texturing process. A mesh is iteratively
are then decoded to RGB for the final colorization output. painted from different viewpoints. In each painting iteration, we
Similarly, [25] uses score-distillation to texture and refine render the mesh alongside its depth and normal maps. We then
a coarse initial shape. Both methods suffer from relatively calculate a trimap partitioning of the image to three distinct areas
slow convergence and less defined textures compared to our based on the camera normals and a viewpoint cache representing
proposed approach. previously viewed angles. These inputs are then fed into a modi-
fied diffusion process alongside a given text prompt which gener-
3. Method ates an updated image. This image is then projected back to the
texture map for the next iteration.
We first lay the foundation for our text-guided mesh
texturing scheme, illustrated in Figure 3. Our TEXTure
scheme performs an incremental texturing of a given 3D We start from an arbitrary initial viewpoint v0 = (r =
mesh, where at each iteration we paint the currently visible 1.25, φ0 = 0, θ = 60) where r is the radius of the camera, φ
regions of the mesh as seen from a single viewpoint. To en- is the azimuth camera angle, and θ is the camera elevation.
courage both local and global consistency, we segment the We then use Mdepth to generate an initial colored image
mesh into a trimap of “keep”, “refine”, “generate” regions. I0 of the mesh as viewed from v0 , conditioned on the ren-
A modified depth-to-image diffusion process is presented to dered depth map D0 . The generated image I0 is then pro-
incorporate this information into the denoising steps. jected back to the texture atlas T0 to color the shape’s visible
We then propose two extensions of TEXTure. First, we parts from v0 . Following this initialization step, we begin a
present a texture transfer scheme (Section 3.2) that transfers process of incremental colorization, illustrated in Figure 3,
the texture of a given mesh to a new mesh, by learning a cus- where we iterate through a fixed set of viewpoints. For each
tom concept that represents the given texture. Finally, we viewpoint, we then render the mesh using a renderer R [14]
present a texture editing technique that allows users to edit to obtain Dt and Qt , where Qt is the rendering of the mesh
a given texture map, either through a guiding text prompt or as seen from the viewpoint vt that considers all previous
a user-provided scribble (Section 3.3). colorization steps. Finally, we generate the next image It
and project It back to the updated texture atlas Tt while
3.1. Text-Guided Texture Synthesis taking into account Qt .
Our texture generation method relies on a pretrained Once a single view has been painted, the generation task
depth-to-image diffusion model Mdepth and a pretrained becomes more challenging due to the need for local and
inpainting diffusion model Mpaint , both based on Stable global consistency along the generated texture. Below we
Diffusion [38] and with a shared latent space. During the consider a single iteration t of our incremental painting pro-
generation process, the texture is represented as an atlas cess and elaborate on our proposed techniques to handle
through a UV mapping that is calculated using XAtlas [52]. these challenges.

3
Trimap Creation. Given a viewpoint vt , we first apply troduce an interleaved process where we alternate between
a partitioning of the rendered image into three regions: the two models during the initial sampling steps. Specifi-
“keep”, “refine”, and “generate”. The “generate” regions cally, during sampling, the next noised latent zi−1 is com-
are rendered areas that are viewed for the first time and need puted as:
to be painted to match the previously painted regions. The 
distinction between “keep” and “refine” regions is slightly Mdepth (zi , Dt )
 0 ≤ i < 10
more nuanced and is based on the fact that coloring a mesh zi−1 = Mpaint (zi , “generate”) 10 ≤ i < 20
from an oblique angle can result in high distortion. This 
Mdepth (zi , Dt ) 20 ≤ i < 50

is because the cross-section of a triangle with the screen
is low, resulting in a low-resolution update to the mesh tex-
When applying Mdepth , the noised latent is guided by the
ture image Tt . Specifically, we measure the triangle’s cross-
current depth Dt while when applying Mpaint , the sam-
section as the z component of the face normal nz in the
pling process is tasked with completing the “generate” re-
camera’s coordinate system.
gions in a globally-consistent manner.
Ideally, if the current view provides a better colorization
angle for some of the previously-painted regions, we would
like to “refine” their existing texture. Otherwise, we should Refining Regions. To handle “refine” regions we use an-
“keep” the original texture and avoid modifying it to en- other novel modification to the diffusion process that gen-
sure consistency with previous views. To keep track of seen erates new textures while taking into account their previous
regions and the cross-section at which they were previously values. Our key observation is that by using an alternating
colored from, we use an additional meta-texture map N that checkerboard-like mask in the first steps of the sampling
is updated at every iteration. This additional map can be ef- process, we can guide the noise towards values that locally
ficiently rendered together with the texture map at each it- align with previous completions.
eration and is used to define the current trimap partitioning. The granularity of this process can be controlled by
changing the resolution of the checkerboard mask and the
Masked Generation. As the depth-to-image diffusion number of constrained steps. In practice, we apply the
process was trained to generate an entire image, we must mask for the first 25 sampling steps. Namely, the masked
modify the sampling process to “keep” part of the image mblended applied in Equation (1) is set as,
fixed. Following Blended Diffusion [2, 3], in each denois- 
ing step we explicitly inject a noised versions of Qt , i.e. 
 0 “keep”

zQt , at the “keep” regions into the diffusion sampling pro- checkerboard “refine” ∧ i ≤ 25
cess, such that these areas are seamlessly blended into the mblended = (2)

 1 “refine” ∧ i > 25
final generated result. Specifically, the latent at the current 
1

“generate”
sampling timestep i is computed as

zi ← zi mblended + zQt (1 − mblended ) (1) where a value of 1 indicates that this region should be
painted and kept otherwise. Our blending mask is visual-
where the mask mblended is defined in Equation 2. That is, ized in Figure 3.
for “keep” regions, we simply set zi fixed according to their
original values. Texture Projection. To project It back to the texture atlas
Tt , we apply gradient-based optimization for Lt over the
Consistent Texture Generation. Injecting “keep” re- values of Tt when rendered through the differential renderer
gions into the diffusion process results in better blending R. That is,
with “generate” regions. Still, when moving away from
the “keep” boundary and deeper into the “generate” re- ∂R ms
∇Tt Lt = [(R(mesh, Tt , vt ) − It ) ms ]
gions, the generated output is mostly governed by the sam- ∂Tt
pled noise and is not consistent with previously painted re-
gions. We first opt to use the same sampled noise from each To achieve smoother texture seams of the projections
viewpoint, this sometimes improves consistency, but is still from different views, a soft mask ms is applied at the
very sensitive to the change in viewpoint. We observe that boundaries of the “refine” and “generate” region:
applying an inpainting diffusion model Mpaint that was di- (
rectly trained to complete masked regions, results in more 0 “keep”
ms = mh ∗ g mh =
consistent generations. However, this in turn deviates from 1 “refine” ∪ “generate”
the conditioning depth Dt and may generate new geome-
tries. To benefit from the advantages of both models we in- where g is a 2D Gaussian blur kernel.

4
Data Creation input texture. Doing so disentangles the texture from its
specific geometry and improves the generalization of the
fine-tuned diffusion model. Inspired by the concept of sur-
face caricaturization [44], we propose a novel spectral aug-
mentation technique. In our case, we propose random low-
Input Mesh Spectral Augmented Meshes frequency geometric deformations to the textured source
mesh, regularized by the mesh Laplacian’s spectrum [29].
Modulating random deformations over the spectral
... eigenbasis results in smooth deformations that keep the in-
A <Dfront> view of a A <Dleft> view of a A <Dbottom> view of a
tegrity of the input shape. Empirically, we choose to apply
<Stexture> < Stexture> < Stexture>
random inflations or deflations to the mesh, with a magni-
Diffusion Model Fine-Tuning tude proportional to a randomly selected eigenfunction. We
Input View
& Depth Noised Sample Reconstruction provide examples of such augmentations in Figure 4, and
additional details in the supplementary materials.
+ noise

Depth-
Conditioned
Diffusion Model

A <Dfront> view of a <Stexture> Texture Learning. Applying our spectral augmentation


Reconstruction Loss technique, we generate a large set of images with corre-
Figure 4. Fine-Tuning for texture transfer. (top) Given an input sponding depth maps of the input shape. We render the
mesh, spectral augmentations are applied to generate a variety of images from several viewpoints (left, right, overhead, bot-
textured geometries. The geometries are then rendered from ran- tom, front, and back) and paste the rendered object onto a
dom viewpoints, where each image is coupled with a sentence, randomly colored background (see Figure 4).
based on the viewpoint. (bottom) a depth-conditioned diffusion Given the set of rendered images, we follow [15] and op-
model is then fine-tuned with a set of learnable tokens, a fixed timize an embedding vector representing our texture using
hStexture i token representing the object, and an additional view- prompts of the form “a hDv i photo of a hStexture i” where
point token, hDv i. The tuned model is used to paint new objects. hDv i is a learned token representing the view direction of
the rendered image and hStexture i is the token representing
Additional Details. Our texture is represented as a our texture. Observe, that we have six learned directional
1024 × 1024 atlas, where the rendering resolution is 1200 × tokens Dv , shared within images from the same view, and a
1200. For the diffusion process, we segment the inner re- single token Stexture representing the texture, shared across
gion, resize it to 512 × 512 and mat it onto a realistic back- all images. Additionally, to better capture the input tex-
ground. All shapes are rendered with 8 viewpoints around ture we fine-tune the diffusion model itself as well, as done
the object, and two additional top/bottom views. We show in [39]. Our texture learning scheme is illustrated in Fig-
that viewpoint order can also affect the end results. ure 4. After training, we use TEXTure (Section 3.1), to
color the target shape, swapping the original Stable Diffu-
3.2. Texture Transfer sion model with our fine-tuned model.
Having successfully generated a new texture on a given
3D mesh, we now turn to describe how to transfer a given Texture from Images. Next, we explore the more chal-
texture to a new, untextured target mesh. We show how to lenging task of texture generation based on a small set of
capture textures from either a painted mesh, or from a small sample images. While we cannot expect the same quality
set of input images. Our texture transfer approach builds given only a few images, we can still potentially learn con-
on previous work on concept learning over diffusion mod- cepts that represent different textures. Unlike standard tex-
els [15, 39], by fine-tuning a pretrained diffusion model and tual inversion techniques [15, 39] our learned concepts rep-
learning a pseudo-token representing the generated texture. resent mostly texture and not structure as they are trained
The fine-tuned model is then used for texturing a new geom- on a depth-conditioned model. This potentially makes them
etry. To improve the generalization of the fine-tuned model more suitable for texturing other 3D shapes.
to new geometries, we further propose a novel spectral aug- For this task, we segment the prominent object from the
mentation technique, described next. We then discuss our image using a pretrained saliency network [34], apply stan-
concept learning scheme from meshes or images. dard scale and crop augmentations, and paste the result onto
a randomly-colored background. Our results show that one
Spectral Augmentations. Since we are interested in can successfully learn semantic concepts from images and
learning a token representing the input texture and not the apply them to 3D shapes without any explicit reconstruction
original input geometry itself, we should ideally learn a stage in between. We believe this creates new opportunities
common token over a range of geometries containing the for creating captivating textures inspired by real objects.

5
“A turtle” “A hand carved wood elephant” “A wooden klein bottle”
Figure 5. Texturing results. Our method generates a high-quality texture for a collection of prompts and geometries.

3.3. Texture-Editing Method Overall Text Runtime


Quality (↑) Fidelity (↑) (minutes) (↓)
We show that our trimap-based TEXTuring can be used
to easily extend 2D editing techniques to a full mesh. For Text2Mesh 2.57 3.62 32 (6.4×)
text-based editing, we wish to alter an existing texture map Latent-Paint 2.95 4.01 46 (9.2×)
guided by a textual prompt. To this end, we define the en- TEXTure 3.93 4.44 5
tire texture map as a “refine” region and apply our TEX-
Turing process to modify the texture to align with the new Table 1. User study results conducted with 30 respondents. We
ask respondents to rate the results on a scale of 1 to 5 with respect
text prompt. We additionally provide scribble-based editing
to the overall quality of the results and the level at which the result
where a user can directly edit a given texture map (e.g. to reflects the text prompt. Results are averaged across all responses
define a new color scheme over a desired region). To al- and text prompts.
low this, we simply define the altered regions as “refine”
regions during the TEXTuring process and “keep” the re-
Paint [28], the results are more plausible but still lack in
maining texture fixed.
terms of quality. For example, Latent-Paint often struggles
in achieving visibly sharp textures such as those shown on
4. Experiments “’a desktop iMac’. We attribute this shortcoming due to its
We now turn to validate the robustness and effectiveness reliance on score distillation [33], which tends to omit high-
on our proposed method through a set of experiments. frequency details.
The last row in Figure 6, depicting a statue of Napoleon
4.1. Text-Guided Texturing Bonaparte, showcases our method’s ability to produce fine
details compared to alternative methods that struggle to pro-
Qualitative Results. We first demonstrate results
duce matching quality. Notably, our results were achieved
achieved with TEXTure across several geometries and
significantly faster than the alternative methods. Generat-
driving text prompts. Figure 5, Figure 1 and Figure 11
ing a single texture with TEXTure takes approximately 5
show highly-detailed realistic textures that are conditioned
minutes compared to 19 through 45 minutes for alternative
on a single text prompt. Observe, for example, how
the generated textures nicely align with the geometry of
the turtle and elephant shapes. Moreover, the generated
textures are consistent both on a local scale (e.g., along
the shell of the turtle) and a global scale (e.g., the shell is
“A 90’s boombox”
consistent across different views). Furthermore, TEXTure
can successfully generate textures of an individual (e.g., of
Albert Einstein in Figure 1). Observe how the generated
texture captures fine details of Albert Einstein’s face.
“A desktop iMac”
Finally, TEXTure can successfully handle challenging
geometric shapes, including the non-orientable Klein
bottle, shown in Figure 5.

Qualitative Comparisons. In Figure 6 we compare our


method to several state-of-the-art methods for 3D texturing.
First, we observe that Clip-Mesh [23] and Text2Mesh [30] “A photo of Napoleon Bonaparte”
struggle in achieving globally-consistent results due to their Clip-Mesh Text2Mesh Latent-Paint Ours
heavy reliance CLIP-based guidance. See the ”90’s boom- Figure 6. Visual comparison of text-guided texture generation. For
box” example at the top of Figure 6 where speakers are each input prompt, we show results for a single viewpoint. Best
placed sporadically in competing methods. For Latent- viewed zoomed in. Meshes obtained from ModelNet40 [49].

6
methods (See Table 1). We provide additional qualitative
results in 11 and in the supplementary materials.

User Study. Finally, we conduct a user study to analyze


the fidelity and overall quality of the generated textures.
We select 10 text prompts and corresponding 3D meshes
and texture the meshes using TEXTure and two baselines:
Text2Mesh [30] and Latent-Paint [28]. For each prompt and
method, we ask each respondent to evaluate the result with
respect to two aspects: (1) its overall quality and (2) the
level at which it reflects the text prompt, on a scale of 1 to
5. Results are presented in Table 1 where we show the av-
erage results across all prompts for each method. As can
be seen, TEXTure outperforms both baselines in terms of A B C D
both overall quality and text fidelity by a significant mar- Figure 7. Ablation of our different stages: (A) is a naı̈ve paint-
gin. Importantly, our method’s improved quality and fi- ing scheme that paints the entire viewpoint. (B) takes into account
delity are attained with a significant decrease in runtime. “keep” region. (C) is our inpainting-based scheme for “gener-
Specifically, TEXTure achieves a decrease of 6.4× in run- ate” regions, and (D) is our complete scheme with “refine” re-
ning time compared to Text2Mesh and a decrease of 9.2× gions. Car model obtained from [49], Teddy Bear model obtained
relative to Latent-Paint. from [32].
In addition to the above evaluation setting, we ask re-
spondents to rank the methods relative to each other. Specif- Text2Mesh Latent-Paint TEXTure
ically, for each of the 10 prompts, we show the results of all Average Rank (↓) 2.44 2.24 1.32
methods side-by-side (in a random order) and ask respon-
dents to rank the results. In Table 2 we present the average Table 2. Additional user study results. Each respondent is asked
rank of each method, averaged across the 10 prompts and to rank the results of the different methods with respect to overall
across all responses. Note that a lower rank is preferable in quality.
this setting. As can be seen, TEXTure has a significantly
better average rank relative to the two baselines, as desired.
This further demonstrates the effectiveness of TEXTure in
generating high-quality, semantically-accurate textures.

Ablation Study. An ablation validating the different


components of our TEXTure scheme is shown in Figure 7.
One can see that each component is needed for achieving
high-quality generations and improving our method’s ro-
bustness to the sensitivity of the generation process. Specif-
ically, without differentiating between “keep” and “gener- Textured Input Exact ”...in the style of hStexture i”
ate” regions the generated textures are unsatisfactory. For Figure 8. Token-based Texture Transfer. The learned token
both the teddy bear and the sports car one can clearly see hStexture i is applied to different geometries. Einstein’s head is
that the texture presented in A fails to achieve local and shown with both exact and “style” prompts, demonstrating the ef-
global consistency, with inconsistent patches visible across fect on the transfer’s semantics. Both Spot and Ogre are taken
the texture. In contrast, thanks to our blending technique from Keenan’s Repository [10].
for “keep” regions, B achieves local consistency between
views. Still, observe that the texture of the teddy bear legs identifying “refine” regions and applying our full TEXTure
in the top example, do not match the texture of its back. By scheme, we are able to effectively address these problems
incorporating our improved “generate” method, we achieve and produce sharper textures, see D.
more consistent results across the entire shape, as shown in
C. These results tend to have smeared regions as can be ob- 4.2. Texture Capturing
served in the fur of the teddy bear, or the text written across
the hood of the car. We attribute this to the fact that some We next validate our proposed texture capture and trans-
regions are painted from oblique angles at early viewpoints, fer technique that can be applied over both 3D meshes and
which are not ideal for texturing, and are not refined. By images.

7
Texture From Mesh As mentioned in Section 3.2 we
are able to capture a texture of a given mesh by combin-
ing concept learning techniques [15, 39] with view-specific
directional tokens. We can then use the learned token
with TEXTure to color new meshes accordingly. Figure 8
presents texture transfer results from two existing meshes
onto various target geometries. While the training of the
new texture token is performed using prompts of the form
“A hDv i photo of hStexture i”, we can generate new textures
by forming different texture prompts when transferring the Images Set Textured Meshes
learned texture. Specifically, the ”Exact” results in Figure 8 Figure 9. Token-based Texture Transfer from images. New
were generated using the specific text prompt used for fine- meshes are presented only once. All meshes are textured with the
tuning, and the rest of the outputs were generated ”in the exact prompt. The teddy bear is also presented with the “..in the
style of hStexture i”. Observe how a single eye is placed style of” prompt, one can see that this results in a semantic mix
on Einstein’s face when the exact prompt is used, while two between the image set and a teddy bear.
eyes are generated when only using a prompt ”a hDv i photo
of Einstein that looks like hStexture i”. A key component of
the texture-capturing technique is our novel spectral aug-
mentations scheme. We refer the reader to the supplemen-
tary materials for an ablation study over this component. Depth Inconsistencies Viewpoint Inconsistencies
Figure 10. Limitations. Left: input mesh, where the diffusion out-
put was inconsistent with the geometry of the model face. These
Texture From Image In practice, it is often more prac- inconsistencies are clearly visible from other views and affect the
tical to learn a texture for a set of images rather than from painting process. Right: two different viewpoints of “a goldfish”
a 3D mesh. As such, in Figure 9, we demonstrate the re- where the diffusion process added inconsistent eyes, as the previ-
sults of our transferring scheme where the texture is cap- ously painted eyes were not visible.
tured from a collection of images depicting the source ob-
ject. One can see that even when given only several images, 5. Discussion, Limitations and Conclusions
our method can successfully texture different shapes with a This paper presents TEXTure, a novel method for text-
semantically similar texture. We find these results exciting, guided generation, transfer, and editing of textures for 3D
as it means one can easily paint different 3D shapes using shapes. There have been many models for generating and
textures derived from real-world data, even when the texture editing high-quality images using diffusion models. How-
itself is only implicitly represented in the fine-tuned model ever, leveraging these image-based models for generating
and concept tokens. seamless textures on 3D models is a challenging task, in
particular for non-stochastic and non-stationary textures.
4.3. Editing Our work addresses this challenge by introducing an iter-
ative painting scheme that leverages a pretrained depth-to-
Finally, in Figure 15 we show the editing results of exist- image diffusion model. Instead of using a computationally
ing textures. The top row of Figure 15 shows scribble-based demanding score distillation approach, we propose a mod-
results where a user manually edits the texture atlas image. ified image-to-image diffusion process that is applied from
Then we “refine” the texture atlas to seamlessly blend the a small set of viewpoints. This results in a fast process ca-
edited region into the final texture. Observe how the man- pable of generating high-quality textures in mere minutes.
ually annotated white spot on the bunny on the right turns With all its benefits, there are still some limitations to our
into a realistic-looking patch of white fur. proposed scheme, which we discuss next.
The bottom of Figure 15 shows text-based editing re- While our painting technique is designed to be spatially
sults, where an existing texture is “refined” according to coherent, it may sometimes result in inconsistencies on a
a new text prompt. The bottom right example illustrates global scale, caused by occluded information from other
a texture generated on a similar geometry with the same views. See Figure 10, where different looking eyes are
prompt from scratch. Observe that the texture generated added from different viewpoints. Another caveat is view-
from scratch significantly differs from the original texture. point selection. We use eight fixed viewpoints around the
In contrast, when applying editing, the textures are able to object, which may not fully cover adversarial geometries.
remain semantically close to the input texture while gener- This issue can possibly be solved by finding a dynamic
ating novel details to match the target text. set of viewpoints that maximize the coverage of the given

8
mesh. Furthermore, the depth-guided model sometimes de- 3d stylization via lighting decomposition. arXiv preprint
viates from the input depth, and may generate images that arXiv:2210.11277, 2022. 1, 3
are not consistent with the geometry (See Figure 10 left). [10] Keenan Crane. Keenan’s 3d model
This in turn may result in conflicting projections to the repository. https://www.cs.cmu.edu/ km-
mesh, that cannot be fixed in later painting iterations. crane/Projects/ModelRepository/, 2022. 7, 9
[11] Jeremy S De Bonet. Multiresolution sampling procedure for
With that being said, we believe that TEXTure takes an
analysis and synthesis of texture images. In Proceedings of
important step toward revolutionizing the field of graphic
the 24th annual conference on Computer graphics and inter-
design and further opens new possibilities for 3D artists, active techniques, pages 361–368, 1997. 2
game developers, and modelers who can use these tools to [12] Alexei A Efros and Thomas K Leung. Texture synthesis
generate high-quality textures in a fraction of the time of by non-parametric sampling. In Proceedings of the sev-
existing techniques. Additionally, our trimap partitioning enth IEEE international conference on computer vision, vol-
formulation provides a practical and useful framework that ume 2, pages 1033–1038. IEEE, 1999. 2
we hope will be utilized and “refined” in future studies. [13] Anna Frühstück, Ibraheem Alhashim, and Peter Wonka. Ti-
legan: Synthesis of large-scale non-homogeneous textures.
Acknowledgements ACM Trans. Graph., 38:58:1–58:11, 2019. 2
[14] Clement Fuji Tsang, Maria Shugrina, Jean Francois
We thank Dana Cohen and Or Patashnik for their early Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop,
feedback and insightful comments. We would also like Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward
to thank Harel Richardson for generously allowing us to Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun
use his toys throughout this paper. The beautiful meshes Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xi-
throughout this paper are taken from [1, 4, 7, 10, 24, 30, 32, ang, Jianing Li, Michael Li, and Rev Lebaredian. Kaolin:
A pytorch library for accelerating 3d deep learning re-
41, 42, 49].
search. https://github.com/NVIDIAGameWorks/
kaolin, 2022. 3
References [15] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash-
[1] Arafurisan. Pikachu 3d sculpt free 3d model. nik, Amit H Bermano, Gal Chechik, and Daniel Cohen-
https://www.cgtrader.com/free-3d-models/ Or. An image is worth one word: Personalizing text-to-
character / child / pikachu - 3d - model - free, image generation using textual inversion. arXiv preprint
2022. Model ID: 3593494. 9 arXiv:2208.01618, 2022. 2, 5, 8
[2] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended [16] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen,
latent diffusion. arXiv preprint arXiv:2206.02779, 2022. 2, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and
4 Sanja Fidler. Get3d: A generative model of high quality
3d textured shapes learned from images. arXiv preprint
[3] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended
arXiv:2209.11163, 2022. 3
latent diffusion. arXiv preprint arXiv:2206.02779, 2022. 4
[17] Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister,
[4] behnam.aftab. Free 3d turtle model. https :
Szymon Rusinkiewicz, and Thomas Funkhouser. A statis-
/ / 3dexport . com / free - 3dmodel - free - 3d -
tical model for synthesis of detailed facial geometry. ACM
turtle-model-355284.htm, 2021. Item ID: 355284.
Transactions on Graphics (TOG), 25(3):1025–1034, 2006. 2
9
[18] David J Heeger and James R Bergen. Pyramid-based texture
[5] Sema Berkiten, Maciej Halber, Justin Solomon, Chongyang analysis/synthesis. In Proceedings of the 22nd annual con-
Ma, Hao Li, and Szymon Rusinkiewicz. Learning detail ference on Computer graphics and interactive techniques,
transfer based on geometric features. In Computer Graphics pages 229–238, 1995. 2
Forum, volume 36, pages 361–373. Wiley Online Library, [19] Amir Hertz, Rana Hanocka, Raja Giryes, and Daniel Cohen-
2017. 2 Or. Deep geometric texture synthesis. ACM Trans. Graph.,
[6] Toby P Breckon and Robert B Fisher. A hierarchical exten- 39(4), 2020. 2
sion to 3d non-parametric surface relief completion. Pattern [20] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Recognition, 45(1):172–185, 2012. 2 Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im-
[7] RAMIRO CASTRO. Elephant natural history mu- age editing with cross attention control. arXiv preprint
seum. https://www.cgtrader.com/free- 3d- arXiv:2208.01626, 2022. 2
print- models/miniatures/other/elephant- [21] Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi,
natural - history - museum - 1, 2020. Model ID: Xihui Liu, Maka Karalashvili, Anna Rohrbach, and Trevor
2773650. 9 Darrell. Shape-guided diffusion with inside-outside atten-
[8] Xiaobai Chen, Tom Funkhouser, Dan B Goldman, and Eli tion. arXiv e-prints, pages arXiv–2212, 2022. 2
Shechtman. Non-parametric texture transfer using mesh- [22] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen
match. Technical Report Technical Report 2012-2, 2012. 2 Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic:
[9] Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, and Text-based real image editing with diffusion models. arXiv
Kui Jia. Tango: Text-driven photorealistic and robust preprint arXiv:2210.09276, 2022. 2

9
[23] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, eration with clip latents. arXiv preprint arXiv:2204.06125,
and Popa Tiberiu. Clip-mesh: Generating textured meshes 2022. 2
from text using pretrained image-text models. SIGGRAPH [37] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi-
Asia 2022 Conference Papers, 2022. 3, 6 sion transformers for dense prediction. ICCV, 2021. 2
[24] Oliver Laric. Three d scans. https://threedscans.com/, 2022. [38] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
9 Patrick Esser, and Björn Ommer. High-resolution image syn-
[25] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, thesis with latent diffusion models, 2022. 1, 2, 3
Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fi- [39] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,
dler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High- Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine
resolution text-to-3d content creation. arXiv preprint tuning text-to-image diffusion models for subject-driven
arXiv:2211.10440, 2022. 1, 3 generation. arXiv preprint arXiv:2208.12242, 2022. 2, 5,
[26] Jianye Lu, Athinodoros S Georghiades, Andreas Glaser, 8
Hongzhi Wu, Li-Yi Wei, Baining Guo, Julie Dorsey, and [40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
Holly Rushmeier. Context-aware textures. ACM Transac- Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
tions on Graphics (TOG), 26(1):3–es, 2007. 2 Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi,
[27] Tom Mertens, Jan Kautz, Jiawen Chen, Philippe Bekaert, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J
and Frédo Durand. Texture transfer using geometry corre- Fleet, and Mohammad Norouzi. Photorealistic text-to-image
lation. Rendering Techniques, 273(10.2312):273–284, 2006. diffusion models with deep language understanding, 2022. 2
2 [41] Laura Salas. Klein bottle 2. https : / /
[28] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and sketchfab.com/3d-models/klein-bottle-2-
Daniel Cohen-Or. Latent-nerf for shape-guided generation eecfee26bbe54bfc9cb8e1904f33c0b7, 2016. 9
of 3d shapes and textures. arXiv preprint arXiv:2211.07600, [42] savagerus. Orangutan free 3d model. https :
2022. 1, 3, 6, 7 / / www . cgtrader . com / free - 3d - models /
[29] Mark Meyer, Mathieu Desbrun, Peter Schröder, and Alan H animals/mammal/orangutan- fe49896d- 8c60-
Barr. Discrete differential-geometry operators for triangu- 46d8 - 9ecb - 36afa9af49f6, 2019. Model ID:
lated 2-manifolds. In Visualization and mathematics III, 2142893. 9
pages 35–57. Springer, 2003. 5 [43] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
[30] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
Rana Hanocka. Text2mesh: Text-driven neural stylization Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
for meshes. In Proceedings of the IEEE/CVF Conference man, et al. Laion-5b: An open large-scale dataset for
on Computer Vision and Pattern Recognition, pages 13492– training next generation image-text models. arXiv preprint
13502, 2022. 1, 3, 6, 7, 9 arXiv:2210.08402, 2022. 2
[31] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav [44] Matan Sela, Yonathan Aflalo, and Ron Kimmel. Compu-
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and tational caricaturization of surfaces. Computer Vision and
Mark Chen. Glide: Towards photorealistic image generation Image Understanding, 141:1–17, 2015. 5
and editing with text-guided diffusion models. arXiv preprint [45] Omry Sendik and Daniel Cohen-Or. Deep correlations for
arXiv:2112.10741, 2021. 2 texture synthesis. ACM Transactions on Graphics (TOG),
[32] paffin. Teddy bear free 3d model. https : / / 36:1 – 15, 2017. 2
www.cgtrader.com/free-3d-models/animals/ [46] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and
other / teddy - bear - 715514f6 - a1ab - 4aae - Sanja Fidler. Deep marching tetrahedra: a hybrid represen-
98d0- 80b5f55902bd, 2019. Model ID: 2034275. 7, tation for high-resolution 3d shape synthesis. In Advances in
9 Neural Information Processing Systems (NeurIPS), 2021. 3
[33] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- [47] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price,
hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. Ob-
preprint arXiv:2209.14988, 2022. 1, 3, 6 jectstitch: Generative object compositing. arXiv preprint
[34] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood De- arXiv:2212.00932, 2022. 2
hghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Go- [48] Narek Tumanyan, Michal Geyer, Shai Bagon, and
ing deeper with nested u-structure for salient object detec- Tali Dekel. Plug-and-play diffusion features for text-
tion. Pattern recognition, 106:107404, 2020. 5 driven image-to-image translation. arXiv preprint
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya arXiv:2211.12572, 2022. 2
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, [49] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Lin-
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- guang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d
ing transferable visual models from natural language super- shapenets: A deep representation for volumetric shapes. In
vision. In International Conference on Machine Learning, Proceedings of the IEEE conference on computer vision and
pages 8748–8763. PMLR, 2021. 2 pattern recognition, pages 1912–1920, 2015. 6, 7, 9
[36] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, [50] Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang,
and Mark Chen. Hierarchical text-conditional image gen- Fisher Yu, and James Hays. Texturegan: Controlling deep

10
image synthesis with texture patches. 2018 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
8456–8465, 2017. 2
[51] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying
Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot
text-to-3d synthesis using 3d shape prior and text-to-image
diffusion models. arXiv preprint arXiv:2212.14704, 2022. 1
[52] Jonathan Young. Xatlas: Mesh parameterization / uv un-
wrapping library. https://github.com/jpcy/xatlas, 2022. 3
[53] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel
Cohen-Or, and Hui Huang. Non-stationary texture synthesis
by adversarial expansion. arXiv preprint arXiv:1805.04487,
2018. 2
[54] Song Chun Zhu, Yingnian Wu, and David Mumford. Fil-
ters, random fields and maximum entropy (frame): Towards
a unified theory for texture modeling. International Journal
of Computer Vision, 27(2):107–126, 1998. 2

11
“A plush pikachu” “A red porcelain dragon”

“A yellow Lamborghini” “An ancient vase” “An ikea wardrobe”

Input Mesh “A goldfish” “A puffer fish” “A piranha fish”

Input Mesh “A 90s boombox” under various seeds

Figure 11. Additional texturing results achieved with TEXTure.


Text2Mesh

“A black boot”
Latent-Paint

“A blue converse allstar shoe”


TEXTure

“An UGG boot”


“A wooden “A toyota “A circus “A desktop
cabinet” land cruiser” tent” iMac”
Tango CLIPMesh Latent-Paint Ours

Figure 12. Additional qualitative comparison for text-guided texture generation. Best viewed zoomed in.

12
Input Mesh “A photo of ironman” “A photo of spiderman” “A photo of batman”

Figure 13. Additional texturing results achieved with TEXTure. Our method generates a high-quality texture for a collection of prompts
and geometries.

Images Set Input Mesh Textured Meshes Input Mesh Textured Meshes
Figure 14. Token-based Texture Transfer from images. All meshes are textured with the exact prompt “A photo of a hS∗ i using the
fine-tuned diffusion model.

Input Edit Result Input Edit Result Input Texture Input Edit Result

Input Texture “A spotted turtle“ “A plush turtle” “A plush turtle” Input Texture “A plush goldfish“ Input Texture “A plush
Refinement Refinement orangutan”
Figure 15. Texture Editing. The first row presents results for localized scribble-based editing, the original prompt was also used for the
refinement step. The second row shows global text-based edits, with the last result showing a texture generated using the same prompt
without conditioning on the input texture which clearly results in a completely new texture.

13

You might also like