TEXTure: Text-Guided Texturing of 3D Shapes
Elad Richardson* Gal Metzer* Yuval Alaluf Raja Giryes Daniel Cohen-Or
Figure 1. Texturing results. TEXTure takes an input mesh and a conditioning text prompt and paints the mesh with high-quality textures.
* Denotes equal contribution

Abstract

In this paper, we present TEXTure, a novel method for text-guided generation, editing, and transfer of textures for 3D shapes. Leveraging a pretrained depth-to-image diffusion model, TEXTure applies an iterative scheme that paints a 3D model from different viewpoints. Yet, while depth-to-image models can create plausible textures from a single viewpoint, the stochastic nature of the generation process can cause many inconsistencies when texturing an entire 3D object. To tackle these problems, we dynamically define a trimap partitioning of the rendered image into three progression states, and present a novel elaborated diffusion sampling process that uses this trimap representation to generate seamless textures from different views. We then show that one can transfer the generated texture maps to new 3D geometries without requiring explicit surface-to-surface mapping, as well as extract semantic textures from a set of images without requiring any explicit reconstruction. Finally, we show that TEXTure can be used to not only generate new textures but also edit and refine existing textures using either a text prompt or user-provided scribbles. We demonstrate that our TEXTuring method excels at generating, transferring, and editing textures through extensive evaluation, further closing the gap between 2D image generation and 3D texturing. Code is available at: https://texturepaper.github.io/TEXTurePaper/

1. Introduction

The ability to paint pictures with words has long been a sign of a master storyteller, and with recent advancements in text-to-image models, this has become a reality for us all. Given a textual description, these new models are able to generate highly detailed imagery that captures the essence and intent of the input text. Despite the rapid progress in text-to-image generation, painting 3D objects remains a significant challenge, as it requires considering the specific shape of the surface being painted. Recent works have begun making significant progress in painting and texturing 3D objects by using language-image models as guidance [9, 28, 30, 51]. Yet, these methods still fall short in terms of quality compared to their 2D counterparts.

In this paper, we focus on texturing 3D objects and present TEXTure, a technique that leverages diffusion models [38] to seamlessly paint a given 3D input mesh. Unlike previous texturing approaches [25, 28] that apply score distillation [33] to indirectly utilize Stable Diffusion [38] as a texturing prior, we opt to directly apply a full denoising process on rendered images using a depth-conditioned diffusion model [38].

At its core, our method iteratively renders the object from different viewpoints, applies a depth-based painting scheme, and projects the result back to the mesh vertices or atlas. We show that our approach can result in a significant boost in both running time and generation quality. However, applying this process naïvely would result in highly inconsistent texturing with noticeable seams due to the stochastic nature of the generation process (see Figure 2 (A)).
Figure 2. Ablation of our different components: (A) is a naïve painting scheme. In (B) we introduce the "keep" regions. (C) is our improved scheme for "generate" regions, and (D) is our complete scheme with "refine" regions.

To alleviate these inconsistencies, we introduce a dynamic partitioning of the rendered view into a trimap of "keep", "refine", and "generate" regions, which is estimated before each diffusion process. The "generate" regions are areas in the rendered viewpoint that are viewed for the first time and need to be painted. A "refine" region is an area that was already painted in previous iterations, but is now seen from a better angle and should be repainted. Finally, "keep" regions are painted regions that should not be repainted from the current view. We then propose a modified diffusion process that takes into account our trimap partitioning. By freezing "keep" regions during the diffusion process we attain more consistent outputs, but the newly generated regions still lack global consistency (see Figure 2 (B)). To encourage better global consistency in the "generate" regions, we further propose to incorporate both depth-guided and mask-guided diffusion models into the sampling process (see Figure 2 (C)). Finally, for "refine" regions, we design a novel process that repaints these regions while taking into account their existing texture. Together, these techniques allow the generation of highly realistic results in mere minutes (see Figure 2 (D) and Figure 1 for results).

Next, we show that our method can be used not only to texture meshes guided by a text prompt, but also based on an existing texture from some other colored mesh, or even from a small set of images. Our method requires no surface-to-surface mapping or any intermediate reconstruction step. Instead, we propose to learn semantic tokens that represent a specific texture by building on Textual Inversion [15] and DreamBooth [39], while extending them to depth-conditioned models and introducing learned viewpoint tokens. We show that we can successfully capture the essence of a texture even from a few unaligned images and use it to paint a 3D mesh based on its semantic texture.

Finally, in the spirit of diffusion-based image editing [20–22, 48], we show that one can further refine and edit textures. We propose two editing techniques. First, we present a text-only refinement where an existing texture map is modified using a guiding prompt to better match the semantics of the new text. Second, we illustrate how users can directly apply an edit on a texture map, where we refine the texture to fuse the user-applied edits into the 3D shape.

We evaluate TEXTure and show its effectiveness for texture generation, transfer, and editing. We demonstrate that TEXTure offers a significant speedup compared to previous approaches and, more importantly, offers significantly higher-quality generated textures.

2. Related Work

Text-to-Image Diffusion Models. The past year has seen the development of multiple large diffusion models [31, 36, 38, 40] capable of producing impressive images with pristine details guided by an input text prompt. The widely popular Stable Diffusion [38] is trained on a rich text-image dataset [43] and is conditioned on CLIP's [35] frozen text encoder. Beyond simple text-conditioning, Stable Diffusion has multiple extensions that allow conditioning its denoising network on additional input modalities such as a depth map or an inpainting mask. Given a guiding prompt and an estimated depth image [37], the depth-conditioned model is tasked with generating images that follow the same depth values while being semantically faithful with respect to the text. Similarly, the inpainting model completes missing image regions, given a masked image.

Although current text-to-image models generate high-quality results when conditioned on a text prompt or depth map, editing an existing image or injecting objects specified by a few exemplar images remains challenging [2, 47]. To introduce a user-specific concept to a pre-trained text-to-image model, [15] introduce Textual Inversion, which maps a few exemplar images into learned pseudo-tokens in the embedding space of the frozen text-to-image model. DreamBooth [39] further fine-tunes the entire diffusion model on the set of input images to achieve more faithful compositions. The learned token or fine-tuned model can then be used to generate novel images using the custom token in new user-specified text prompts.
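For reference, the depth-conditioned extension described above is available off the shelf. The following minimal sketch (ours, not part of TEXTure) shows how it can be invoked through the Hugging Face diffusers library; the checkpoint name, prompt, and strength value are illustrative choices:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

# Depth-conditioned Stable Diffusion: the generated image follows the depth of
# the input image (estimated internally if no depth map is supplied) while
# matching the text prompt.
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("rendered_view.png")  # hypothetical rendered view
result = pipe(
    prompt="a hand carved wood elephant",
    image=init_image,
    strength=1.0,  # ignore the input colors entirely; only its depth is kept
).images[0]
result.save("painted_view.png")
```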
Texture and Content Transfer. Early works [11, 12, 18, 54] focus on 2D texture synthesis through probabilistic models, while more recent works [13, 45, 50, 53] take a data-driven approach to generate textures using deep neural networks. Generating textures over 3D surfaces is a more challenging problem, as it requires attention to both color and geometry. For geometric texture synthesis, [17] applies a statistical method similar to [18], while [6] extends the non-parametric sampling proposed by [12] to 3D meshes. [5] introduces a metric learning approach to transfer details from a source to a target shape, while [19] uses an internal learning technique to transfer geometric texture. For 3D color texture synthesis, given an exemplar colored mesh, [8, 26, 27] use the relation between geometric features and color values to synthesize new textures on target shapes.
3D Shape and Texture Generation. Generating shapes and textures in 3D has recently gained significant interest. Text2Mesh [30], Tango [9], and CLIP-Mesh [23] use CLIP-space similarities as an optimization objective [...]

[Figure: TEXTure overview — starting from an input mesh, the texture is painted iteratively over viewpoints v_0, v_1, ..., v_t across iterations 1 through T.]
Trimap Creation. Given a viewpoint v_t, we first apply a partitioning of the rendered image into three regions: "keep", "refine", and "generate". The "generate" regions are rendered areas that are viewed for the first time and need to be painted to match the previously painted regions. The distinction between "keep" and "refine" regions is slightly more nuanced and is based on the fact that coloring a mesh from an oblique angle can result in high distortion. This is because the cross-section of a triangle with the screen is small, resulting in a low-resolution update to the mesh texture image T_t. Specifically, we measure the triangle's cross-section as the z component of the face normal, n_z, in the camera's coordinate system.

Ideally, if the current view provides a better colorization angle for some of the previously painted regions, we would like to "refine" their existing texture. Otherwise, we should "keep" the original texture and avoid modifying it to ensure consistency with previous views. To keep track of seen regions and the cross-section at which they were previously colored, we use an additional meta-texture map N that is updated at every iteration. This additional map can be efficiently rendered together with the texture map at each iteration and is used to define the current trimap partitioning.
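To make the partitioning concrete, here is a minimal sketch (ours, not the authors' code) of how the trimap could be derived from the rendered cross-section of the current view and the values cached in the meta-texture map N; the tensor names and the refine_margin threshold are assumptions.

```python
import torch

def build_trimap(nz_current, nz_cached, painted_mask, refine_margin=0.1):
    """Partition a rendered view into "generate" / "refine" / "keep" regions.

    nz_current:   (H, W) z-component of the face normal in camera coordinates
                  for the current view (a proxy for the triangle cross-section).
    nz_cached:    (H, W) best cross-section seen so far, rendered from the
                  meta-texture map N.
    painted_mask: (H, W) bool mask of pixels whose texels were already painted.
    Returns an (H, W) uint8 map with 0 = "generate", 1 = "refine", 2 = "keep".
    """
    better_angle = nz_current > nz_cached + refine_margin   # noticeably straighter view
    refine = painted_mask & better_angle                    # repaint from the better angle
    keep = painted_mask & ~better_angle                     # preserve the existing texture

    trimap = torch.zeros_like(nz_current, dtype=torch.uint8)  # default 0 = "generate"
    trimap[refine] = 1
    trimap[keep] = 2
    return trimap
```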
Masked Generation. As the depth-to-image diffusion process was trained to generate an entire image, we must modify the sampling process to "keep" part of the image fixed. Following Blended Diffusion [2, 3], in each denoising step we explicitly inject a noised version of Q_t, i.e., z_{Q_t}, at the "keep" regions into the diffusion sampling process, such that these areas are seamlessly blended into the final generated result. Specifically, the latent at the current sampling timestep i is computed as

    z_i ← z_i ⊙ m_blended + z_{Q_t} ⊙ (1 − m_blended),    (1)

where the mask m_blended is defined in Equation (2). That is, for "keep" regions, we simply set z_i fixed according to their original values.
Consistent Texture Generation. Injecting "keep" regions into the diffusion process results in better blending with "generate" regions. Still, when moving away from the "keep" boundary and deeper into the "generate" regions, the generated output is mostly governed by the sampled noise and is not consistent with previously painted regions. We first opt to use the same sampled noise for each viewpoint; this sometimes improves consistency, but is still very sensitive to the change in viewpoint. We observe that applying an inpainting diffusion model M_paint, which was directly trained to complete masked regions, results in more consistent generations. However, this in turn deviates from the conditioning depth D_t and may generate new geometries. To benefit from the advantages of both models, we introduce an interleaved process where we alternate between the two models during the initial sampling steps. Specifically, during sampling, the next noised latent z_{i−1} is computed as

    z_{i−1} = M_depth(z_i, D_t)          for 0 ≤ i < 10,
              M_paint(z_i, "generate")   for 10 ≤ i < 20,
              M_depth(z_i, D_t)          for 20 ≤ i < 50.

When applying M_depth, the noised latent is guided by the current depth D_t, while when applying M_paint, the sampling process is tasked with completing the "generate" regions in a globally consistent manner.

Refining Regions. To handle "refine" regions, we use another novel modification to the diffusion process that generates new textures while taking into account their previous values. Our key observation is that by using an alternating checkerboard-like mask in the first steps of the sampling process, we can guide the noise towards values that locally align with previous completions.

The granularity of this process can be controlled by changing the resolution of the checkerboard mask and the number of constrained steps. In practice, we apply the mask for the first 25 sampling steps. Namely, the mask m_blended applied in Equation (1) is set as

    m_blended = 0              for "keep",
                checkerboard   for "refine" ∧ i ≤ 25,
                1              for "refine" ∧ i > 25,        (2)
                1              for "generate",

where a value of 1 indicates that this region should be painted, and kept otherwise. Our blending mask is visualized in Figure 3.

Texture Projection. To project I_t back to the texture atlas T_t, we apply gradient-based optimization of L_t over the values of T_t when rendered through the differentiable renderer R. That is,

    ∇_{T_t} L_t = [(R(mesh, T_t, v_t) − I_t) ⊙ m_s] · ∂R/∂T_t.

To achieve smoother texture seams between the projections from different views, a soft mask m_s is applied at the boundaries of the "refine" and "generate" regions:

    m_s = m_h ∗ g,    where m_h = 0 on "keep" regions and 1 on "refine" ∪ "generate" regions,

and g is a 2D Gaussian blur kernel.
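Putting the pieces together, one painting pass from a single viewpoint could look roughly like the sketch below (ours, not the released implementation). The wrappers depth_step and inpaint_step, the scheduler object, and the latent-resolution masks are assumptions, and the step-index convention simply mirrors the ranges in the equations above.

```python
import torch

def paint_view(z, depth_step, inpaint_step, depth_map, trimap, z_keep0,
               scheduler, timesteps, checkerboard):
    """One modified denoising pass for a single viewpoint v_t.

    depth_step(z, t, depth)  -> one denoising step of the depth model M_depth.
    inpaint_step(z, t, mask) -> one denoising step of the inpainting model M_paint.
    trimap       : (h, w) at latent resolution; 0 = "generate", 1 = "refine", 2 = "keep".
    z_keep0      : clean latent of the previously painted render Q_t.
    checkerboard : (h, w) precomputed 0/1 checkerboard mask.
    """
    generate_mask = (trimap == 0).float()
    for i, t in enumerate(timesteps):
        # Equation (2): build the per-step blending mask.
        m = torch.ones_like(checkerboard)            # "generate" regions: paint
        m[trimap == 2] = 0.0                         # "keep" regions: freeze
        refine = trimap == 1
        m[refine] = checkerboard[refine] if i <= 25 else 1.0

        # Equation (1): re-inject the noised known content at frozen pixels.
        z_q_t = scheduler.add_noise(z_keep0, torch.randn_like(z_keep0), t)
        z = z * m + z_q_t * (1.0 - m)

        # Interleave M_depth and M_paint according to the schedule above.
        if 10 <= i < 20:
            z = inpaint_step(z, t, mask=generate_mask)
        else:
            z = depth_step(z, t, depth=depth_map)
    return z
```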
[Figure: Texture capture pipeline. Data Creation: the input mesh is spectrally augmented into several deformed meshes, which are rendered with prompts such as "A <Dfront> view of a <Stexture>", "A <Dleft> view of a <Stexture>", and "A <Dbottom> view of a <Stexture>". Diffusion Model Fine-Tuning: each input view and depth is noised and reconstructed by the depth-conditioned diffusion model.]

[...] input texture. Doing so disentangles the texture from its specific geometry and improves the generalization of the fine-tuned diffusion model. Inspired by the concept of surface caricaturization [44], we propose a novel spectral augmentation technique, applying random low-frequency geometric deformations to the textured source mesh, regularized by the mesh Laplacian's spectrum [29]. Modulating random deformations over the spectral eigenbasis results in smooth deformations that keep the integrity of the input shape. Empirically, we choose to apply random inflations or deflations to the mesh, with a magnitude proportional to a randomly selected eigenfunction. We provide examples of such augmentations in Figure 4 and additional details in the supplementary materials.
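The augmentation can be sketched as follows. This is our simplified illustration, not the paper's implementation: it uses a uniform graph Laplacian rather than the cotangent Laplacian of [29], and the number of eigenvectors and the displacement scale are arbitrary choices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_inflate(vertices, faces, vertex_normals, k=20, scale=0.05):
    """Randomly inflate or deflate a mesh along a low-frequency eigenfunction."""
    n = len(vertices)
    # Vertex adjacency assembled from the triangle list.
    i = np.concatenate([faces[:, 0], faces[:, 1], faces[:, 2]])
    j = np.concatenate([faces[:, 1], faces[:, 2], faces[:, 0]])
    adj = sp.coo_matrix((np.ones(len(i)), (i, j)), shape=(n, n))
    adj = ((adj + adj.T) > 0).astype(np.float64)
    laplacian = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj

    # Low end of the spectrum -> smooth functions over the surface.
    _, eigvecs = eigsh(laplacian + 1e-8 * sp.identity(n), k=k, sigma=0, which="LM")
    phi = eigvecs[:, np.random.randint(1, k)]      # skip the constant eigenvector
    amplitude = np.random.uniform(-scale, scale)   # negative values deflate

    # Displace every vertex along its normal, modulated by the eigenfunction.
    return vertices + amplitude * phi[:, None] * vertex_normals
```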
[Figure 5 panels: "A turtle", "A hand carved wood elephant", "A wooden klein bottle".]
Figure 5. Texturing results. Our method generates a high-quality texture for a collection of prompts and geometries.
[...] methods (see Table 1). We provide additional qualitative results in Figure 11 and in the supplementary materials.
Texture From Mesh. As mentioned in Section 3.2, we are able to capture the texture of a given mesh by combining concept learning techniques [15, 39] with view-specific directional tokens. We can then use the learned token with TEXTure to color new meshes accordingly. Figure 8 presents texture transfer results from two existing meshes onto various target geometries. While the training of the new texture token is performed using prompts of the form "A <Dv> photo of <Stexture>", we can generate new textures by forming different texture prompts when transferring the learned texture. Specifically, the "Exact" results in Figure 8 were generated using the specific text prompt used for fine-tuning, and the rest of the outputs were generated "in the style of <Stexture>". Observe how a single eye is placed on Einstein's face when the exact prompt is used, while two eyes are generated when only using the prompt "a <Dv> photo of Einstein that looks like <Stexture>". A key component of the texture-capturing technique is our novel spectral augmentation scheme. We refer the reader to the supplementary materials for an ablation study over this component.
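As an illustration of the view-specific directional tokens, below is a small helper (ours, not the released code) that selects the <Dv> token from the rendering angles of a training view; the angle thresholds are assumptions.

```python
def directional_prompt(azimuth_deg, elevation_deg, texture_token="<Stexture>"):
    """Build a fine-tuning prompt of the form "A <Dv> photo of <Stexture>".

    Azimuth is assumed to lie in [-180, 180] degrees, elevation in [-90, 90].
    """
    if elevation_deg > 60:
        direction = "<Dtop>"
    elif elevation_deg < -60:
        direction = "<Dbottom>"
    elif -45 <= azimuth_deg <= 45:
        direction = "<Dfront>"
    elif azimuth_deg >= 135 or azimuth_deg <= -135:
        direction = "<Dback>"
    else:
        direction = "<Dleft>" if azimuth_deg > 0 else "<Dright>"
    return f"A {direction} photo of {texture_token}"
```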
Texture From Image. In practice, it is often more practical to learn a texture from a set of images rather than from a 3D mesh. As such, in Figure 9, we demonstrate the results of our transferring scheme where the texture is captured from a collection of images depicting the source object. One can see that even when given only several images, our method can successfully texture different shapes with a semantically similar texture. We find these results exciting, as it means one can easily paint different 3D shapes using textures derived from real-world data, even when the texture itself is only implicitly represented in the fine-tuned model and concept tokens.

[Figure 9 panels: Images Set, Textured Meshes.]
Figure 9. Token-based Texture Transfer from images. New meshes are presented only once. All meshes are textured with the exact prompt. The teddy bear is also presented with the "..in the style of" prompt; one can see that this results in a semantic mix between the image set and a teddy bear.
4.3. Editing

Finally, in Figure 15 we show editing results for existing textures. The top row of Figure 15 shows scribble-based results, where a user manually edits the texture atlas image. We then "refine" the texture atlas to seamlessly blend the edited region into the final texture. Observe how the manually annotated white spot on the bunny on the right turns into a realistic-looking patch of white fur.

The bottom of Figure 15 shows text-based editing results, where an existing texture is "refined" according to a new text prompt. The bottom-right example illustrates a texture generated from scratch on a similar geometry with the same prompt. Observe that the texture generated from scratch differs significantly from the original texture. In contrast, when applying editing, the textures remain semantically close to the input texture while generating novel details to match the target text.
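A rough sketch of the scribble-based workflow (ours; the mask-from-edit heuristic and function names are assumptions): the user-edited texels are marked as "refine" so that the next painting pass blends them with the surrounding texture.

```python
import torch

def refine_user_edit(original_atlas, edited_atlas, repaint_fn, threshold=1e-3):
    """Blend a user-edited texture atlas back into the shape via a "refine" pass.

    original_atlas, edited_atlas : (C, H, W) texture images before/after scribbling.
    repaint_fn(texture, trimap)  : runs the painting scheme with a trimap override,
                                   e.g. built on top of the sampling sketch above.
    """
    # Texels the user touched are marked "refine"; everything else stays "keep".
    edited = (edited_atlas - original_atlas).abs().sum(dim=0) > threshold
    trimap = torch.full(edited.shape, 2, dtype=torch.uint8)   # 2 = "keep"
    trimap[edited] = 1                                        # 1 = "refine"
    return repaint_fn(edited_atlas, trimap)
```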
[Figure 10 panels: Depth Inconsistencies, Viewpoint Inconsistencies.]
Figure 10. Limitations. Left: an input mesh where the diffusion output was inconsistent with the geometry of the model's face. These inconsistencies are clearly visible from other views and affect the painting process. Right: two different viewpoints of "a goldfish", where the diffusion process added inconsistent eyes, as the previously painted eyes were not visible.

5. Discussion, Limitations and Conclusions

This paper presents TEXTure, a novel method for text-guided generation, transfer, and editing of textures for 3D shapes. Many diffusion-based models have been proposed for generating and editing high-quality images. However, leveraging these image-based models for generating seamless textures on 3D models is a challenging task, in particular for non-stochastic and non-stationary textures. Our work addresses this challenge by introducing an iterative painting scheme that leverages a pretrained depth-to-image diffusion model. Instead of using a computationally demanding score distillation approach, we propose a modified image-to-image diffusion process that is applied from a small set of viewpoints. This results in a fast process capable of generating high-quality textures in mere minutes. With all its benefits, there are still some limitations to our proposed scheme, which we discuss next.

While our painting technique is designed to be spatially coherent, it may sometimes result in inconsistencies on a global scale, caused by occluded information from other views; see Figure 10, where different-looking eyes are added from different viewpoints. Another caveat is viewpoint selection. We use eight fixed viewpoints around the object, which may not fully cover adversarial geometries. This issue could possibly be solved by finding a dynamic set of viewpoints that maximizes the coverage of the given mesh. Furthermore, the depth-guided model sometimes deviates from the input depth and may generate images that are not consistent with the geometry (see Figure 10, left). This in turn may result in conflicting projections onto the mesh that cannot be fixed in later painting iterations.

With that being said, we believe that TEXTure takes an important step toward revolutionizing the field of graphic design and further opens new possibilities for 3D artists, game developers, and modelers, who can use these tools to generate high-quality textures in a fraction of the time of existing techniques. Additionally, our trimap partitioning formulation provides a practical and useful framework that we hope will be utilized and "refined" in future studies.

Acknowledgements

We thank Dana Cohen and Or Patashnik for their early feedback and insightful comments. We would also like to thank Harel Richardson for generously allowing us to use his toys throughout this paper. The beautiful meshes throughout this paper are taken from [1, 4, 7, 10, 24, 30, 32, 41, 42, 49].
References

[1] Arafurisan. Pikachu 3d sculpt free 3d model. https://www.cgtrader.com/free-3d-models/character/child/pikachu-3d-model-free, 2022. Model ID: 3593494.
[2] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[3] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[4] behnam.aftab. Free 3d turtle model. https://3dexport.com/free-3dmodel-free-3d-turtle-model-355284.htm, 2021. Item ID: 355284.
[5] Sema Berkiten, Maciej Halber, Justin Solomon, Chongyang Ma, Hao Li, and Szymon Rusinkiewicz. Learning detail transfer based on geometric features. In Computer Graphics Forum, volume 36, pages 361–373. Wiley Online Library, 2017.
[6] Toby P Breckon and Robert B Fisher. A hierarchical extension to 3d non-parametric surface relief completion. Pattern Recognition, 45(1):172–185, 2012.
[7] Ramiro Castro. Elephant natural history museum. https://www.cgtrader.com/free-3d-print-models/miniatures/other/elephant-natural-history-museum-1, 2020. Model ID: 2773650.
[8] Xiaobai Chen, Tom Funkhouser, Dan B Goldman, and Eli Shechtman. Non-parametric texture transfer using meshmatch. Technical Report 2012-2, 2012.
[9] Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, and Kui Jia. Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition. arXiv preprint arXiv:2210.11277, 2022.
[10] Keenan Crane. Keenan's 3d model repository. https://www.cs.cmu.edu/~kmcrane/Projects/ModelRepository/, 2022.
[11] Jeremy S De Bonet. Multiresolution sampling procedure for analysis and synthesis of texture images. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 361–368, 1997.
[12] Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1033–1038. IEEE, 1999.
[13] Anna Frühstück, Ibraheem Alhashim, and Peter Wonka. TileGAN: Synthesis of large-scale non-homogeneous textures. ACM Trans. Graph., 38:58:1–58:11, 2019.
[14] Clement Fuji Tsang, Maria Shugrina, Jean Francois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. Kaolin: A PyTorch library for accelerating 3d deep learning research. https://github.com/NVIDIAGameWorks/kaolin, 2022.
[15] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[16] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv preprint arXiv:2209.11163, 2022.
[17] Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister, Szymon Rusinkiewicz, and Thomas Funkhouser. A statistical model for synthesis of detailed facial geometry. ACM Transactions on Graphics (TOG), 25(3):1025–1034, 2006.
[18] David J Heeger and James R Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 229–238, 1995.
[19] Amir Hertz, Rana Hanocka, Raja Giryes, and Daniel Cohen-Or. Deep geometric texture synthesis. ACM Trans. Graph., 39(4), 2020.
[20] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[21] Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, and Trevor Darrell. Shape-guided diffusion with inside-outside attention. arXiv e-prints, pages arXiv–2212, 2022.
[22] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
[23] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia 2022 Conference Papers, 2022.
[24] Oliver Laric. Three d scans. https://threedscans.com/, 2022.
[25] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
[26] Jianye Lu, Athinodoros S Georghiades, Andreas Glaser, Hongzhi Wu, Li-Yi Wei, Baining Guo, Julie Dorsey, and Holly Rushmeier. Context-aware textures. ACM Transactions on Graphics (TOG), 26(1):3–es, 2007.
[27] Tom Mertens, Jan Kautz, Jiawen Chen, Philippe Bekaert, and Frédo Durand. Texture transfer using geometry correlation. Rendering Techniques, 273(10.2312):273–284, 2006.
[28] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv preprint arXiv:2211.07600, 2022.
[29] Mark Meyer, Mathieu Desbrun, Peter Schröder, and Alan H Barr. Discrete differential-geometry operators for triangulated 2-manifolds. In Visualization and Mathematics III, pages 35–57. Springer, 2003.
[30] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022.
[31] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[32] paffin. Teddy bear free 3d model. https://www.cgtrader.com/free-3d-models/animals/other/teddy-bear-715514f6-a1ab-4aae-98d0-80b5f55902bd, 2019. Model ID: 2034275.
[33] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[34] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[37] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[39] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
[41] Laura Salas. Klein bottle 2. https://sketchfab.com/3d-models/klein-bottle-2-eecfee26bbe54bfc9cb8e1904f33c0b7, 2016.
[42] savagerus. Orangutan free 3d model. https://www.cgtrader.com/free-3d-models/animals/mammal/orangutan-fe49896d-8c60-46d8-9ecb-36afa9af49f6, 2019. Model ID: 2142893.
[43] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[44] Matan Sela, Yonathan Aflalo, and Ron Kimmel. Computational caricaturization of surfaces. Computer Vision and Image Understanding, 141:1–17, 2015.
[45] Omry Sendik and Daniel Cohen-Or. Deep correlations for texture synthesis. ACM Transactions on Graphics (TOG), 36:1–15, 2017.
[46] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: A hybrid representation for high-resolution 3d shape synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[47] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. Objectstitch: Generative object compositing. arXiv preprint arXiv:2212.00932, 2022.
[48] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
[49] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[50] Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Texturegan: Controlling deep image synthesis with texture patches. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2017.
[51] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. arXiv preprint arXiv:2212.14704, 2022.
[52] Jonathan Young. Xatlas: Mesh parameterization / UV unwrapping library. https://github.com/jpcy/xatlas, 2022.
[53] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487, 2018.
[54] Song Chun Zhu, Yingnian Wu, and David Mumford. Filters, random fields and maximum entropy (frame): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.
[Figure 12 panels: "A plush pikachu", "A red porcelain dragon", "A black boot"; comparison row: Latent-Paint.]
Figure 12. Additional qualitative comparison for text-guided texture generation. Best viewed zoomed in.

[Figure 13 panels: Input Mesh, "A photo of ironman", "A photo of spiderman", "A photo of batman".]
Figure 13. Additional texturing results achieved with TEXTure. Our method generates a high-quality texture for a collection of prompts and geometries.

[Figure 14 panels: Images Set, Input Mesh, Textured Meshes.]
Figure 14. Token-based Texture Transfer from images. All meshes are textured with the exact prompt "A photo of a <S*>" using the fine-tuned diffusion model.

[Figure 15 panels: input textures, user edits, and refined results for prompts such as "A spotted turtle", "A plush turtle", "A plush goldfish", and "A plush orangutan".]
Figure 15. Texture Editing. The first row presents results for localized scribble-based editing; the original prompt was also used for the refinement step. The second row shows global text-based edits, with the last result showing a texture generated using the same prompt without conditioning on the input texture, which clearly results in a completely new texture.