GVGEN: Text-to-3D Generation with Volumetric Representation
Project page: https://GVGEN.github.io
1 Introduction
The development of 3D models is a pivotal task in computer graphics, garnering increased attention across various industries, including video game design,
2 Related Works
Fig. 1: Overview of GVGEN. Our framework comprises two stages. In the data pre-processing phase, we fit GaussianVolumes (Sec. 3.1) and extract a coarse-geometry Gaussian Distance Field (GDF) as training data. In the generation stage (Sec. 3.2), we first generate the GDF via a diffusion model and then feed it into a 3D U-Net to predict the attributes of the GaussianVolume.
Since the proposal of Neural Radiance Fields (NeRF) [28], various differentiable neural rendering methods [2, 3, 5, 30] have emerged, demonstrating remarkable capabilities in scene reconstruction and novel view synthesis. These methods are also widely used in 3D generation tasks. Instant-NGP [30] employs feature volumes for acceleration by querying features only at the corresponding spatial positions. Other works [3, 5] decompose features into lower dimensions for faster training and lower memory consumption. However, they still fall short of point-based rendering methods [4, 45] in terms of rendering speed and explicit manipulability.
In recent research, 3D Gaussian Splatting [18] has received widespread attention. It adopts anisotropic Gaussians to represent scenes, achieving real-time rendering.
[Figure: GaussianVolume fitting pipeline. Renderings from the rasterizer are supervised against ground-truth images, while the Candidate Pool Strategy prunes, clones, and splits Gaussians and updates the per-point offsets and 3DGS features.]
3 Methodology
Our text-to-3D generation framework (Fig. 1), GVGEN, is delineated into two pivotal stages: GaussianVolume fitting (Sec. 3.1) and text-to-3D generation (Sec. 3.2). In the GaussianVolume fitting stage, we propose a structured, volumetric form of 3D Gaussians, termed GaussianVolume, and fit GaussianVolumes as training data for the generation stage. This stage is crucial for arranging disorganized, point-cloud-like Gaussian points into a format more amenable to neural network processing. To this end, we use a fixed number of 3D Gaussians (i.e., a fixed volume resolution) to form our GaussianVolume (Sec. 3.1), thereby facilitating ease of processing. Furthermore, we introduce the Candidate Pool Strategy (CPS), which dynamically prunes, clones, and splits Gaussian points to enhance the fidelity of the fitted assets. Our GaussianVolume maintains high rendering quality with only a small number (32,768) of Gaussians.
In the generation phase (Sec. 3.2), we first use a diffusion model conditioned on the input text to generate a coarse geometry volume, termed the Gaussian Distance Field (GDF), which represents the geometry of the generated object. Subsequently, a 3D U-Net-based reconstruction model takes the GDF and the text input to predict the attributes of the final GaussianVolume, thus achieving the generation of detailed 3D objects from text descriptions.
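To make the two-stage pipeline concrete, the following is a minimal inference sketch under stated assumptions: the module handles (text_encoder, gdf_diffusion, attr_unet3d), their interfaces, and the tensor shapes are illustrative and do not correspond to a released implementation.

```python
import torch

def generate(prompt, text_encoder, gdf_diffusion, attr_unet3d, N=32, C=14):
    """Two-stage GVGEN-style inference sketch (assumed interfaces)."""
    cond = text_encoder(prompt)                       # (1, D) text embedding
    # Stage 1: sample the coarse geometry volume (GDF) conditioned on text.
    gdf = gdf_diffusion.sample(cond, shape=(1, 1, N, N, N))
    # Stage 2: predict the C = 14 per-voxel GaussianVolume attributes.
    gaussian_volume = attr_unet3d(gdf, cond)          # (1, C, N, N, N)
    # Flatten to N^3 Gaussian points for splatting-based rendering.
    points = gaussian_volume.reshape(1, C, -1).permute(0, 2, 1)  # (1, N^3, C)
    return points
```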
During the fitting phase, we apply the backward pass only to the offsets ∆µ, which allows the expression of more fine-grained 3D assets while imposing a restriction that encourages better structure. Because the number of Gaussians is fixed, we cannot directly apply the original strategy for densification control. Instead, we propose the Candidate Pool Strategy for effective pruning and cloning.
Candidate Pool Strategy We cannot directly apply the original interleaved optimization/densification strategy to move 3D Gaussians in our approach, because it dynamically changes the number of Gaussians, which is incompatible with our fixed-size volume. Freely densifying or pruning Gaussians is not allowed, since Gaussians are in one-to-one correspondence with their assigned grid points. A naive way to avoid this problem is to adjust only the offsets ∆µ via gradient back-propagation. Unfortunately, we experimentally found that without a pruning and densification strategy, the movement range of Gaussian centers becomes severely limited, which leads to lower rendering quality and inaccurate geometry (see Fig. 7(a)). For this reason, we propose a novel strategy (Algorithm 1) to densify and prune a fixed number of Gaussians. The key idea of our strategy is to store pruned points in a candidate pool P for later densification.
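To make the mechanism concrete, the following is a simplified sketch of one densification-control step with a candidate pool. The thresholds, the gradient statistic, and the way reactivated points are placed are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import torch

def candidate_pool_step(active, opacity, grad_norm, offsets, positions,
                        prune_opacity=0.005, densify_grad=0.0002):
    """One densification-control step with a fixed number of Gaussians.

    active:    (M,) bool mask of currently activated Gaussians; the candidate
               pool P is simply the set of deactivated indices (~active).
    opacity:   (M,) per-Gaussian opacity.
    grad_norm: (M,) accumulated positional gradient magnitude.
    offsets:   (M, 3) learnable offsets from the assigned grid positions.
    positions: (M, 3) fixed, assigned grid positions.
    """
    # Prune: move low-opacity Gaussians into the candidate pool (deactivate).
    prune = active & (opacity < prune_opacity)
    active = active & ~prune

    # Densify: for each high-gradient active Gaussian, reactivate one pooled
    # candidate and initialize it at the source point's current location.
    sources = torch.nonzero(active & (grad_norm > densify_grad)).flatten()
    pool = torch.nonzero(~active).flatten()
    k = min(len(sources), len(pool))
    if k > 0:
        src, dst = sources[:k], pool[:k]
        offsets = offsets.clone()
        offsets[dst] = positions[src] + offsets[src] - positions[dst]
        active = active.clone()
        active[dst] = True
    return active, offsets
```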
We initially align the Gaussian centers µ with their assigned positions p and set the offsets ∆µ = 0 (Line 2 in Algorithm 1). The candidate pool P is initialized as an empty set (Line 3). This pool stores "deactivated" points, i.e., points pruned during optimization; they are not involved in the forward and backward passes. We optimize each asset for T iterations. Following the original 3D Gaussian
Training Loss The final loss for supervision is the original loss used in 3D Gaussian Splatting plus a regularization loss on the offsets:

\mathcal{L}_{offsets} = \mathrm{Mean}\left(\mathrm{ReLU}\left(|\Delta\mu| - \epsilon_{offsets}\right)\right), (3)

\mathcal{L}_{fitting} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_{SSIM} + \lambda_3 \mathcal{L}_{offsets}. (4)
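As a reference, a minimal PyTorch sketch of Eqs. (3)-(4) is given below. The SSIM implementation is passed in rather than assumed, and the default weights mirror those reported in the implementation details (Sec. 6); treating the SSIM term as 1 - SSIM is an assumption following common 3DGS practice.

```python
import torch
import torch.nn.functional as F

def fitting_loss(render, gt, offsets, ssim_fn,
                 eps_offsets=1.5 / (32 - 1),
                 lam1=0.8, lam2=0.2, lam3=20.0):
    """L_fitting = lam1*L1 + lam2*L_SSIM + lam3*L_offsets (Eqs. 3-4)."""
    l1 = F.l1_loss(render, gt)
    # SSIM term taken as 1 - SSIM (assumption; not spelled out in this excerpt).
    l_ssim = 1.0 - ssim_fn(render, gt)
    # Penalize only the part of each offset that exceeds the threshold (Eq. 3).
    l_offsets = F.relu(offsets.abs() - eps_offsets).mean()
    return lam1 * l1 + lam2 * l_ssim + lam3 * l_offsets
```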
After training, the points are sorted in spatial order according to their volume coordinates and used as training data for the generation stage. Once training is complete, 2D images of the target object can be rendered extremely quickly, since each GaussianVolume is lightweight. More implementation details can be found in the supplementary materials.
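For concreteness, a small sketch of this sorting step is shown below; the x-major flattening order and the tensor layout are assumptions, since the text only states that points are sorted by their volume coordinates.

```python
import torch

def sort_into_volume(grid_idx, features, N=32):
    """Arrange per-Gaussian features into a (C, N, N, N) volume.

    grid_idx: (N^3, 3) integer grid coordinates assigned to each Gaussian.
    features: (N^3, C) concatenated 3DGS attributes (offsets, scale,
              rotation, opacity, color).
    """
    flat = grid_idx[:, 0] * N * N + grid_idx[:, 1] * N + grid_idx[:, 2]
    order = torch.argsort(flat)
    volume = features[order].T.reshape(-1, N, N, N)   # (C, N, N, N)
    return volume
```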
[Figure: ground-truth renderings vs. GaussianVolume reconstructions.]
\mathcal{L} = \lambda_{3D} \mathcal{L}_{3D} + \lambda_{2D} \mathcal{L}_{2D} (6)
Using a multi-modal loss balances global semantics and local details and keeps the training process more stable. More implementation details about the model architecture and data processing can be found in the supplementary materials.
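A minimal sketch of this multi-modal objective (Eq. 6) is given below; the concrete forms of L_3D and L_2D (shown here as simple L1 terms) and the weights are assumptions, since this excerpt does not specify them.

```python
import torch
import torch.nn.functional as F

def generation_loss(pred_volume, gt_volume, pred_render, gt_render,
                    lam_3d=1.0, lam_2d=1.0):
    """Multi-modal supervision (Eq. 6): a 3D loss against the fitted
    GaussianVolume plus a 2D loss on rendered views (both L1 here)."""
    loss_3d = F.l1_loss(pred_volume, gt_volume)   # direct supervision on attributes
    loss_2d = F.l1_loss(pred_render, gt_render)   # supervision on rendered images
    return lam_3d * loss_3d + lam_2d * loss_2d
```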
[Figure: text-conditioned generation results for prompts including "A burger with lettuce, tomatoes, and various toppings.", "A rusty, bronze faucet with a handle.", "A small metal lantern.", and "A yellow gramophone on a wooden box."]
4 Experiments
4.1 Baseline Methods and Dataset
Baseline Methods We compare with both feed-forward-based methods and
optimization-based methods. In the feed-forward category, we evaluate Shap-
E [17] and VolumeDiffusion [40], both of which directly generate 3D assets. For
optimization-based methods, we consider DreamGaussian [39] as the baseline,
where coarse 3D Gaussians undergo optimization with SDS loss [33] and are sub-
sequently converted into meshes for further refinement. For VolumeDiffusion, we
report results generated in the feed-forward process, without post-optimization
using SDS via pretrained text-to-image models, to ensure fair comparisons with
other methods.
Dataset Our training dataset comprises the Objaverse-LVIS dataset [11], which
contains ∼ 46,000 3D models in 1,156 categories. For text prompts for training,
we use captions from Cap3D [26], which leverages BLIP-2 [20] to caption multi-
view images of objects and consolidates them into single captions through GPT-
4 [1]. For evaluation, we generate 100 assets and render 8 views for each asset.
The camera poses for these views are uniformly sampled around the object. Our method excels in both visual quality and semantic alignment.
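One plausible way to realize the evaluation cameras is sketched below, assuming eight azimuths uniformly spaced at a fixed elevation and radius; the text only states that the poses are uniformly sampled around the object, so the elevation, radius, and axis conventions are assumptions.

```python
import numpy as np

def eval_cameras(num_views=8, radius=2.4, elevation_deg=15.0):
    """Camera-to-world matrices uniformly spaced in azimuth around the object."""
    poses = []
    elev = np.deg2rad(elevation_deg)
    for azim in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        cam_pos = radius * np.array([np.cos(azim) * np.cos(elev),
                                     np.sin(azim) * np.cos(elev),
                                     np.sin(elev)])
        # Look-at rotation: camera z-axis points from the object toward the camera.
        forward = cam_pos / np.linalg.norm(cam_pos)
        right = np.cross(np.array([0.0, 0.0, 1.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        c2w = np.eye(4)
        c2w[:3, :3] = np.stack([right, up, forward], axis=1)
        c2w[:3, 3] = cam_pos
        poses.append(c2w)
    return poses
```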
Fig. 7: Qualitative Results for Ablation Studies. (a) shows visual comparisons of different GaussianVolume fitting methods. (b) shows results of training the GaussianVolume prediction model with different losses.
and geometries. Such comparisons accentuate the critical difference between our
GVGEN and the reconstruction methods.
Table 2: Quantitative Metrics for Ablation Studies. The left table analyzes GaussianVolume fitting strategies, while the right table compares different losses for training the GaussianVolume attribute-prediction model. For qualitative comparisons, see Fig. 7.
4.4 Limitations
GVGEN has shown encouraging results in generating 3D objects. However, its
performance is constrained when dealing with input texts significantly divergent
from the domain of training data, as illustrated in Fig. 12 in the supplementary.
Since we need to fit a GaussianVolume per object to prepare training data, it is time-consuming to scale up to millions of objects for better diversity.
Additionally, the volume resolution N is set to 32 (i.e., only N^3 = 32,768 Gaussian points) to save computational resources, which limits the rendering quality of 3D assets with very complex textures. In the future, we will
further explore how to generate higher-quality 3D assets in more challenging
scenarios.
5 Conclusions
In conclusion, this paper explores the feed-forward generation of explicit 3D
Gaussians conditioned on texts. We innovatively organize disorganized 3D Gaus-
sian points into a structured volumetric form, termed GaussianVolume, enabling
feed-forward generation of 3D Gaussians via a coarse-to-fine generation pipeline.
References
1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida,
D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv
preprint arXiv:2303.08774 (2023)
2. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srini-
vasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance
fields. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. pp. 5855–5864 (2021)
3. Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion. pp. 130–141 (2023)
4. Chang, J.H.R., Chen, W.Y., Ranjan, A., Yi, K.M., Tuzel, O.: Pointersect: Neural
rendering with cloud-ray intersection. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition. pp. 8359–8369 (2023)
5. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In:
European Conference on Computer Vision. pp. 333–350. Springer (2022)
6. Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry
and appearance for high-quality text-to-3d content creation. arXiv preprint
arXiv:2303.13873 (2023)
7. Chen, Y., Chen, Z., Zhang, C., Wang, F., Yang, X., Wang, Y., Cai, Z., Yang, L.,
Liu, H., Lin, G.: Gaussianeditor: Swift and controllable 3d editing with gaussian
splatting. arXiv preprint arXiv:2311.14521 (2023)
8. Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. arXiv preprint
arXiv:2309.16585 (2023)
9. Cheng, Y.C., Lee, H.Y., Tulyakov, S., Schwing, A.G., Gui, L.Y.: Sdfusion: Multi-
modal 3d shape completion, reconstruction, and generation. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4456–
4465 (2023)
10. Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A.,
Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d
objects. Advances in Neural Information Processing Systems 36 (2024)
11. Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E.,
Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of
annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 13142–13153 (2023)
12. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances
in neural information processing systems 34, 8780–8794 (2021)
13. He, Z., Wang, T.: Openlrm: Open-source large reconstruction models. https://github.com/3DTopia/OpenLRM (2023)
14. Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K.,
Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv
preprint arXiv:2311.04400 (2023)
15. Huang, Z., Wen, H., Dong, J., Wang, Y., Li, Y., Chen, X., Cao, Y.P., Liang,
D., Qiao, Y., Dai, B., et al.: Epidiff: Enhancing multi-view synthesis via localized
epipolar-constrained diffusion. arXiv preprint arXiv:2312.06725 (2023)
16. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided
object generation with dream fields. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 867–876 (2022)
17. Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv
preprint arXiv:2305.02463 (2023)
18. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)
19. Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K.,
Shakhnarovich, G., Bi, S.: Instant3d: Fast text-to-3d with sparse-view generation
and large reconstruction model. arXiv preprint arXiv:2311.06214 (2023)
20. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597 (2023)
21. Li, W., Chen, R., Chen, X., Tan, P.: Sweetdreamer: Aligning geometric priors in
2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596 (2023)
22. Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards
high-fidelity text-to-3d generation via interval score matching (2023)
23. Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer:
Generating multiview-consistent images from a single-view image. arXiv preprint
arXiv:2309.03453 (2023)
24. Liu, Z., Li, Y., Lin, Y., Yu, X., Peng, S., Cao, Y.P., Qi, X., Huang, X., Liang,
D., Ouyang, W.: Unidream: Unifying diffusion priors for relightable text-to-3d
generation. arXiv preprint arXiv:2312.08754 (2023)
25. Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H.,
Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-
domain diffusion. arXiv preprint arXiv:2310.15008 (2023)
26. Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pretrained
models. Advances in Neural Information Processing Systems 36 (2024)
27. Melas-Kyriazi, L., Rupprecht, C., Vedaldi, A.: Pc2: Projection-conditioned
point cloud diffusion for single-image 3d reconstruction. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12923–
12932 (2023)
28. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Commu-
nications of the ACM 65(1), 99–106 (2021)
29. Mohammad Khalid, N., Xie, T., Belilovsky, E., Popa, T.: Clip-mesh: Generating
textured meshes from text using pretrained image-text models. In: SIGGRAPH
Asia 2022 conference papers. pp. 1–8 (2022)
30. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with
a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41(4), 1–
15 (2022)
31. Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for
generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751
(2022)
32. Ntavelis, E., Siarohin, A., Olszewski, K., Wang, C., Gool, L.V., Tulyakov, S.: Au-
todecoding latent 3d diffusion models. Advances in Neural Information Processing
Systems 36 (2024)
33. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using
2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
34. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning. pp.
8748–8763. PMLR (2021)
35. Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d:
Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023)
36. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
37. Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.:
Zero123++: a single image to consistent multi-view diffusion base model. arXiv
preprint arXiv:2310.15110 (2023)
38. Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion
for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
39. Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian
splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
40. Tang, Z., Gu, S., Wang, C., Zhang, T., Bao, J., Chen, D., Guo, B.: Volumediffusion:
Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint
arXiv:2312.11459 (2023)
41. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer:
High-fidelity and diverse text-to-3d generation with variational score distillation.
Advances in Neural Information Processing Systems 36 (2024)
42. Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang,
X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint
arXiv:2310.08528 (2023)
43. Wu, Z., Wang, Y., Feng, M., Xie, H., Mian, A.: Sketch and text guided diffu-
sion model for colored point cloud generation. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 8929–8939 (2023)
44. Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.:
Agg: Amortized generative 3d gaussians for single image to 3d. arXiv preprint
arXiv:2401.04099 (2024)
45. Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf:
Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 5438–5448 (2022)
46. Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaus-
siandreamer: Fast generation from text to 3d gaussian splatting with point cloud
priors. arXiv preprint arXiv:2310.08529 (2023)
47. Yin, Y., Xu, D., Wang, Z., Zhao, Y., Wei, Y.: 4dgen: Grounded 4d content gener-
ation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225 (2023)
48. Yu, X., Guo, Y.C., Li, Y., Liang, D., Zhang, S.H., Qi, X.: Text-to-3d with classifier
score distillation. arXiv preprint arXiv:2310.19415 (2023)
49. Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane
meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with
transformers. arXiv preprint arXiv:2312.09147 (2023)
6 Implementation Details
In the GaussianVolume fitting stage, we fit each object using 96 rendered images covering various camera poses. The initial 72 images are rendered with camera poses
uniformly distributed around a camera-to-world-center distance of 2.4 units. The
remaining 24 images are rendered from a closer distance of 1.6 units to provide
a more detailed view. We set the volume resolution at N = 32 and the spherical
harmonics (SH) order at 0, resulting in each Gaussian point having a feature
channel number C = 14.
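The C = 14 channels correspond to the per-Gaussian 3DGS attributes plus the positional offset; one possible split consistent with SH order 0 is sketched below (3 offset + 3 scale + 4 rotation + 1 opacity + 3 color = 14), though the exact channel ordering is an assumption.

```python
import torch

# Illustrative channel layout for a GaussianVolume with N = 32 and C = 14.
N, C = 32, 14
volume = torch.zeros(C, N, N, N)
offset, scale, rotation, opacity, color = torch.split(
    volume, [3, 3, 4, 1, 3], dim=0)  # 3 + 3 + 4 + 1 + 3 = 14 channels
```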
The fitting volume is assumed to be a cube with a side length of 1 world unit,
centered within the global coordinate system. Inside this cube, Gaussian points
are uniformly distributed to ensure a comprehensive coverage of the volume.
Initially, the offset for each Gaussian point ∆µ is set to zero. To optimize each
instance, we conduct training against a white background for 20,000 iterations. A
densification strategy is employed from iteration 500 to 15,000, with subsequent
operations executed every 100 iterations to incrementally enhance the model’s
density. After this densification phase, we periodically clip features to predefined
ranges every 100 iterations to maintain consistency and prevent outliers. We
refrain from resetting the opacity of each Gaussian point during the training
process. This decision is made to avoid introducing instability into the model’s
learning and adaptation phases.
For the loss weights in L_fitting, we set λ1 = 0.8, λ2 = 0.2, λ3 = 20.0, and the offsets threshold to 1.5 voxel distances, i.e., ε_offsets = 1.5/(32−1), where 1/(32−1) is the distance between adjacent grid points in our defined volume space. For the other parameters, we adhere to the default configuration of 3D Gaussian Splatting [18].
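For reference, the fitting hyper-parameters stated in this section can be collected as follows; the dictionary itself is only an illustrative summary, not a configuration file from the released code.

```python
# Fitting hyper-parameters as stated in Sec. 6 (structure is for illustration only).
N = 32
fitting_config = {
    "num_views": 96,                 # 72 at radius 2.4, 24 at radius 1.6
    "volume_resolution": N,
    "sh_order": 0,
    "feature_channels": 14,
    "iterations": 20_000,
    "densify_from_iter": 500,
    "densify_until_iter": 15_000,
    "densify_interval": 100,
    "lambda_l1": 0.8,
    "lambda_ssim": 0.2,
    "lambda_offsets": 20.0,
    "eps_offsets": 1.5 / (N - 1),    # 1.5 voxel distances
}
```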
7 Explorational Experiments
In our early exploration, we investigated the efficacy of various generative mod-
els, initially training three unconditional models: a point cloud diffusion model,
and a primal 3D U-Net-based diffusion model, and compared them with our full
model. We specifically utilized 8 fitted GaussianVolumes belonging to the chair
LVIS-category as our training data and trained each model for 4,000 epochs.
The qualitative results are shown in Fig. 8.
Our initial foray involved adapting the point cloud diffusion model, inspired
by prior work [27], to facilitate the generation of 3D Gaussians. Unfortunately,
this approach struggled with convergence issues, yielding unsatisfactory results.
The inherent limitations of point cloud diffusion models in capturing and gen-
erating the nuanced structures of 3D Gaussians became apparent, prompting us
to explore alternative strategies. Employing a plain one-stage diffusion model offered a glimpse into generating basic 3D structures. However, this model predominantly produced coarse outputs characterized by spiny shapes, highlighting
[Figure 9 prompts: "A pink flower in a pot.", "A white vase filled with red roses.", "A yellow wooden barrel with white stripes."]
Fig. 9: GSGEN [8] Results Initialized with Our Method and Point-E [31]. The left column shows rendering results with the different initializations, and the right column shows the corresponding results after optimization with GSGEN.
the need for a more nuanced generation technique to achieve plausible 3D geom-
etry. Our full model, incorporating a coarse-to-fine generation pipeline, demon-
strated significant improvements in generating 3D Gaussians. This approach not
only simplified the generation process but also empowered the model to produce
instances with more accurate and realistic 3D geometry. By sequentially refining
the generated outputs, the coarse-to-fine pipeline effectively addresses the limi-
tations observed in the earlier models, showcasing its superiority in generating
complex 3D structures.
8 Application
To demonstrate the superiority of our method, we show GVGEN’s capability
to integrate with optimization-based methods, like GSGEN [8], for further re-
finement (see Fig. 9). We use the generated GaussianVolume as initialization to replace the point clouds from Point-E with random colors; low-opacity Gaussian points are filtered out before they are passed to GSGEN.
Fig. 10: Visual Results of Rendered Images and Positions of the Gaussian Centers. The fitted assets are optimized with different offset thresholds ε_offsets.
This shift enables GSGEN to further optimize the 3D Gaussians, achieving better alignment with the text descriptions in both texture and geometry. The enhancement stems from avoiding the adverse impact of the color attributes from Point-E on GSGEN, as the features produced by GVGEN are more compatible with and beneficial for the optimization process.
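A minimal sketch of the low-opacity filtering mentioned above is shown below; the opacity threshold value is an illustrative assumption, as the text only states that low-opacity points are filtered.

```python
import torch

def filter_for_gsgen_init(points, opacity, threshold=0.05):
    """Keep only sufficiently opaque Gaussians before handing them to GSGEN.

    points:  (M, C) per-Gaussian features.
    opacity: (M,) activated opacities.
    """
    keep = opacity > threshold
    return points[keep], opacity[keep]
```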
for the network to learn effectively. To strike a balance between flexibility and maintaining a well-defined structure, we select ε_offsets = 1.5/(32−1), equivalent to a 1.5-voxel distance, as our threshold.
10 Failure Cases
As illustrated in Fig. 12, some of the objects generated by our model suffer from
blurred textures and imprecise geometry. This phenomenon largely stems from
the relatively small size of our training dataset, which comprises around 46,000
instances. Such a dataset size limits the model’s capacity to produce varied
outputs in response to a broad spectrum of text inputs. In the future, our work
will focus on two main avenues: improving the model architecture and enhancing
the data quality. By addressing these aspects, we aim to scale up the model for
application in large-scale scenarios, which is expected to improve the generation
diversity and lead to better-rendered results.
[Figure: text prompts for the generated examples shown, including "A pink flower in a pot.", "A yellow wooden barrel with white stripes.", "Black umbrella with wooden handle.", "A Lego man wearing a red shirt and various pants colors.", "A blue and yellow toy tricycle with a pink and blue seat and wheels.", "a black sectional sofa", "a small blue teapot", "Red and black devil mask with horns.", "Yellow rubber duck.", "a light blue, plastic child's chair.", "pink frosted donut with white sprinkles.", "a pink, purple, and blue children's chandelier.", "a wooden bedside side table.", "a chessboard with checkered pattern.", "a light blue ottoman with a metal frame.", "Thor's wooden hammer", "a red, blue, and white cube floating in water.", "a pair of glasses.", "a wooden stool with seat and legs.", and "a garlic bulb."]