GVGEN

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

arXiv:2403.12957v1 [cs.

CV] 19 Mar 2024


GVGEN: Text-to-3D Generation with
Volumetric Representation

Xianglong He1,2∗ , Junyi Chen1,3∗ , Sida Peng4 , Di Huang1 , Yangguang Li5 ,


Xiaoshui Huang1 , Chun Yuan2† , Wanli Ouyang1 , and Tong He1†
1
Shanghai AI Lab
2
Tsinghua Shenzhen International Graduate School
3
Shanghai Jiao Tong University
4
Zhejiang University
5
VAST

https://GVGEN.github.io

Abstract. In recent years, 3D Gaussian splatting has emerged as a pow-


erful technique for 3D reconstruction and generation, known for its fast
and high-quality rendering capabilities. Nevertheless, these methods of-
ten come with limitations, either lacking the ability to produce diverse
samples or requiring prolonged inference times. To address these short-
comings, this paper introduces a novel diffusion-based framework, GV-
GEN, designed to efficiently generate 3D Gaussian representations from
text input. We propose two innovative techniques: (1) Structured Volu-
metric Representation. We first arrange disorganized 3D Gaussian points
as a structured form GaussianVolume. This transformation allows the
capture of intricate texture details within a volume composed of a fixed
number of Gaussians. To better optimize the representation of these de-
tails, we propose a unique pruning and densifying method named the
Candidate Pool Strategy, enhancing detail fidelity through selective op-
timization. (2) Coarse-to-fine Generation Pipeline. To simplify the gen-
eration of GaussianVolume and empower the model to generate instances
with detailed 3D geometry, we propose a coarse-to-fine pipeline. It ini-
tially constructs a basic geometric structure, followed by the prediction
of complete Gaussian attributes. Our framework, GVGEN, demonstrates
superior performance in qualitative and quantitative assessments com-
pared to existing 3D generation methods. Simultaneously, it maintains a
fast generation speed (∼7 seconds), effectively striking a balance between
quality and efficiency.

Keywords: Text-to-3D Generation · Feed-forward Generation · 3D Gaus-


sians

1 Introduction
The development of 3D models is a pivotal task in computer graphics, garner-
ing increased attention across various industries, including video game design,
∗ Equal Contribution.
† Corresponding Authors.
2 He et al.

film production, and AR/VR technologies. Among the different aspects of 3D


modeling, generating 3D models from text descriptions has emerged as a partic-
ularly intriguing area of research due to its accessibility and ease of use. Various
methods [16, 31, 33] have been proposed to handle the task. Still, it continues
to present difficulties owing to the ambiguity of texts and intrinsic domain gap
between text description and corresponding 3D assets.
Previous text-to-3D approaches can be broadly classified into two categories:
optimization-based generation [8, 33, 41, 48] and feed-forward generation [19, 31,
32,40]. Optimization-based methods have become rather popular recently, due to
the rapid development of text-to-image diffusion models [12, 36]. These methods
usually optimize 3D objects conditioned on texts or images through Score Distil-
lation Sampling (SDS) [33], distilling rich knowledge from 2D image generation
models. Despite yielding impressive results, optimization-based methods face
the Janus problem [33], manifesting as multiple faces or over-saturation prob-
lems. Additionally, the optimization of a single object can be prohibitively time-
consuming, requiring extensive computational effort. Contrarily, feed-forward
approaches strive to generate 3D assets directly from text descriptions, thus
sidestepping the Janus problem and significantly hastening the generation pro-
cess. Our work is closely related to feed-forward-based methods. However, feed-
forward methods that utilize multi-view generation models often create lower-
resolution 3D assets than their multi-view image counterparts. Moreover, models
directly generating 3D objects from texts often encounter difficulties with seman-
tics when complex prompts are used.
Different from previous feed-forward-based methods like [19] that follow a
text-2D-3D framework, our method proposes to generate 3D assets via directly
learning 3D representation. In this study, we introduce an innovative and stream-
lined coarse-to-fine generation pipeline, GVGEN, for generating 3D Gaussians di-
rectly from text descriptions. Leveraging the highly expressive and fast-rendering
capabilities of 3D Gaussians, our method achieves not only promising results but
also maintains rapid text-to-3D generation and rendering. As shown in Fig. 1,
our method consists of two stages: GaussianVolume fitting and text-to-3D gener-
ation. In the first stage, we introduce GaussianVolume, a structured volumetric
form composed of 3D Gaussians. Achieving this is challenging due to the sparse
and unstructured nature of optimizing original 3D Gaussians. To address this,
we introduce a novel Candidate Pool Strategy for pruning and densification.
This approach allows for fitting high-quality volumetric representation of Gaus-
sians, rather than unordered Gaussian points, making the generation process
more conducive for a diffusion-based framework, as is utilized in the following
step.
Despite the GaussianVolume establishing a structured volumetric framework
that integrates seamlessly with existing diffusion pipelines, the intrinsic complex-
ity of rich features of 3D Gaussians presents significant challenges. Specifically,
capturing the distribution of a vast amount of training data effectively becomes
difficult, resulting in hard convergence for the diffusion model. Addressing these
challenges, we partition the text-to-3D generation into two steps: coarse geom-
GVGEN: Text-to-3D Generation with Volumetric Representation 3

etry generation and Gaussian attributes prediction. To be more specific, in the


first step, we employ a diffusion model to generate the coarse geometry of ob-
jects, termed the Gaussian Distance Field (GDF) - an isotropic representation
outlining the proximity of each grid point to the nearest Gaussian point’s center.
Following this, the generated GDF, in conjunction with text inputs, is processed
through a 3D U-Net-based model to predict the attributes of GaussianVolumes,
ensuring enhanced control and model convergence.
To the best of our knowledge, this is the first study to directly feed-forward
generate 3D Gaussians from texts, exploring new avenues for rapid 3D content
creation and applications. Our main contributions are summarized as follows:

– We introduce GaussianVolume, a structured, volumetric form consisting of


3D Gaussians. Through the innovative Candidate Pool Strategy for pruning
and cloning, we accommodate high-quality GaussianVolume fitting within a
fixed volumetric resolution. This framework seamlessly integrates with exist-
ing generative networks, leveraging the inherent advantages of 3D Gaussians
for explicit and efficient representation.
– We propose GVGEN, an efficient text-to-3D coarse-to-fine generation pipeline
that first generates geometry volume and then predicts detailed 3D Gaussian
attributes, better controlling diverse geometry and appearances of generated
assets. GVGEN achieves a fast generation speed (∼7 seconds) compared
with baseline methods, effectively striking a balance between quality and
efficiency.
– Compared with existing baselines, GVGEN demonstrates competitive capa-
bilities in both quantitative and qualitative aspects.

2 Related Works

2.1 Text-to-3D Generation

Generating 3D objects conditioned on texts has become a challenging yet promi-


nent research area in recent years. Previous approaches [16, 29] utilize CLIP [34]
as a prior for 3D asset optimization but lacked realism and fidelity. With the rise
of text-to-image generative models [12,36], Dreamfusion [33] leverages Score Dis-
tillation Sampling (SDS) to generate diverse 3D objects, drawing on the rich prior
knowledge embedded in these models. Subsequently, some works [6, 21, 24, 25]
focus on modeling and learning multi-modal attributes of objects (e.g. colors,
albedo, normals, and depths), to enhance consistency. ProlificDreamer [41] and
LucidDreamer [22] introduce novel losses to improve the quality. Some stud-
ies [15, 23, 37, 38] explore predicting multi-views for objects, followed by using
feed-forward methods or SDS-based optimization to generate 3D objects. More-
over, integrating the recently proposed 3D Gaussian Splatting [18], works [8, 39,
46] lead to improvements in convergence speed for text-to-3D object generation.
However, these methods still require significant generation time (∼hours) per
object or face 3D inconsistency of generated objects.
4 He et al.

GaussianVolume Fitting
Stage 1:

Fit Extract Add noise

Caption

Condition c
Text-to-3D Generation

Rasterizer
Stage 2:

Unet 3D Unet 3D

GaussianVolume GDF

Fig. 1: Overview of GVGEN. Our framework comprises two stages. In the data
pre-processing phase, we fit GaussianVolumes (Sec. 3.1) and extract coarse geometry
Gaussian Distance Field (GDF) as training data. For the generation stage (Sec. 3.2),
we first generate GDF via a diffusion model, and then send it into a 3D U-Net to
predict attributes of GaussianVolumes.

As the emergence of large-scale 3D datasets [10,11], Feed-forward models are


trained to reduce the generation time. Point-E [31] and Shap-E [17], trained on
millions of 3D assets, generate point clouds and neural radiance fields respec-
tively. 3D VADER [32] trains auto-decoders to fit latent volumes and uses them
to train the diffusion model. Similarly, VolumeDiffusion [40] trains an efficient
volumetric encoder to produce training data for the diffusion model. Despite
these advancements, text-to-3D Gaussian generation remains largely unexplored,
primarily due to the complexity of organizing disorganized Gaussians. Our work
introduces a coarse-to-fine pipeline to feed-forward generate 3D Gaussians from
texts, by first generating the object’s coarse geometry and then predicting its
explicit attributes, pioneering in the direct generation of 3D Gaussians from text
descriptions.

2.2 Differentiable 3D Representation

Since the proposal of Neural Radiance Field (NeRF) [28], various differentiable
neural rendering methods [2, 3, 5, 30] have emerged, which demonstrate remark-
able capabilities in scene reconstruction and novel view synthesis. These methods
are also widely used in 3D generation tasks. Instant-NGP [30] employs feature
volumes for accelerations by querying features only at the corresponding spa-
tial positions. Works [3, 5] decompose features into lower dimensions for faster
training and less memory storage. However, they still have some shortcomings
compared with point-based rendering methods [4,45] in the aspects of rendering
speed and explicit manipulability.
In recent research, 3D Gaussian Splatting [18] has received widespread at-
tention. It adopts anisotropic Gaussians to represent scenes, achieving real-time
GVGEN: Text-to-3D Generation with Volumetric Representation 5

Ground-truth

Pool Pool

Supervise
Clone & Split

Rasterizer
Prune
Pool

( , )
offset 3DGS features

Fig. 2: Illustration of GaussianVolume Fitting. We organize a fixed number of


3D Gaussians in a volumetric form, termed GaussianVolume. By using position offsets
to express slight movements from grid points to Gaussian centers, we can capture the
details of objects. The proposed Candidate Pool Strategy (CPS) (Sec. 3.1) enables
effective pruning and densification with a pool storing pruned points.

rendering and facilitating downstream tasks like 3D generation [35,44,47], scene


editing [7] and dynamic scene rendering [42]. We introduce a strategy to fit 3D
Gaussians as a structured volumetric form called GaussianVolume, which al-
lows for good integration with the existing generation pipeline, leveraging the
advantages of 3D Gaussians and achieving fast generation and rendering speeds.

3 Methodology
Our text-to-3D generation framework (Fig. 1), GVGEN, is delineated into two
pivotal stages: GaussianVolume fitting (Sec. 3.1) and text-to-3D generation (Sec. 3.2).
Initially, in the GaussianVolume fitting stage, we propose a structured, volumet-
ric form of 3D Gaussians, termed GaussianVolume. We fit the GaussianVolume
as training data for the generation stage. This stage is crucial for arranging dis-
organized point-cloud-like Gaussian points as a format more amenable to neural
network processing. To address this, We use a fixed number (i.e. fixed volume res-
olution for the volume) of 3D Gaussians to form our GaussianVolume (Sec. 3.1),
thereby facilitating ease of processing. Furthermore, we introduce the Candidate
Pool Strategy (CPS) for dynamic pruning, cloning, and splitting of Gaussian
points to enhance the fidelity of fitted assets. Our GaussianVolume maintains
high rendering quality with only a small number (32,768 points) of Gaussians.
In the phase of generation Sec. 3.2, we first use a diffusion model conditioned
on input texts to generate the coarse geometry volume, termed Gaussian Dis-
tance Field (GDF), which represents the geometry of generated objects. Sub-
sequently, a 3D U-Net-based reconstruction model utilizes the GDF and text
inputs to predict the attributes of the final GaussianVolume, thus achieving the
generation of detailed 3D objects from text descriptions.

3.1 Stage1: GaussianVolume Fitting


The original 3D Gaussian Splatting technique offers fast optimization and real-
time rendering. However, its lack of structure makes it challenging for 3D neural
6 He et al.

Algorithm 1: GaussianVolume Fitting with Candidate Pool Strategy


Input: The number of Gaussian points N with assigned positions p, total
iteration times T , multi-view images I with camera poses V , prune
and densify conditions τp , τd
Output: GaussianVolume G, a volumetric form composed of a fixed number
of 3D Gaussians
1 t ← 0; // Iteration Time
2 G ← Initialize(N, p); // Initialize Gaussian Points
3 P ← ∅; // Initialize Candidate Pool
4 while t < T do
5 Iˆ ← Rasterize(G, V ); // Rendering
6 L ← Loss(I, I);ˆ // Loss
7 G ← Optimize(∇L); // Update Gaussian Attributes
8 if IsRefinementIteration(t) then
9 Gp ← PrunePoints(G, τp ); // Determine Pruned Points
10 G, P ← AddPointsToPool(Gp , G, P ); // Make Gp "Deactivated"
11 Gd ← DensifyPoints(G, τd ); // Determine Densified Points in G
12 Gnew ← FindPointsInPool(Gd , P ); // Find Added Points in P
13 G, P ← RemovePointsFromPool(Gnew , G, P ) ; // Make Gnew
"Activated"
14 end
15 if EndRefinementIteration(t) then
16 ReleasePool(G, P ); // Make All Points "Activated"
17 end
18 end

networks to process effectively. Directly generating 3D Gaussians through net-


works involves integrating existing point cloud networks, treating Gaussians as
carriers of semantically rich features. However, previous work [43] has pointed
out the difficulty of directly generating point clouds with only 3-dimensional
color features. Extending such methods, especially for high-dimensional Gaus-
sians, may make the learning process very challenging. We also attempted to
extend previous work [27] using point cloud diffusion to generate 3D Gaussians,
while the results were disappointing.
Recent works [44,49] propose hybrid representations combining 3D Gaussians
with triplane-NeRF, utilizing point clouds for geometry and triplane features for
texture querying, to finally decode 3D Gaussians. While these methods demon-
strate effectiveness, the reconstructed results are heavily constrained by the ac-
curacy of predicted point clouds and lose the direct operability advantage of 3D
Gaussians. In light of these challenges, we introduce GaussianVolume, which is a
volume composed of a fixed number of 3D Gaussians. This innovation facilitates
the processing of 3D Gaussians by existing generation and reconstruction mod-
els, retaining the efficiency benefits of Gaussians. The illustration of the fitting
process can be found in Fig. 2.
GVGEN: Text-to-3D Generation with Volumetric Representation 7

GaussianVolume In 3D Gaussian Splatting [18], features of 3D Gaussians G


include the following attributes: a center position µ ∈ R3 , covariance matrix
Σ, color information c ∈ R3 (when SH order=0) and opacity α ∈ R. To better
optimize the covariance matrix, Σ is analogous to describing the configuration
of an ellipsoid via a scaling matrix S ∈ R3 and a rotation matrix R ∈ R3×3 ,
satisfying:
\label {eq:eqsigma} \Sigma = R S S^T R^T. (1)
In the implementation, S, R could be stored as a 3D vector s ∈ R3 and a quater-
nion q ∈ R4 , respectively. With these differentiable features, 3D Gaussians can
be easily projected to 2D splats and employ neural point-based α-blending tech-
nique to render 2D-pixel colors. An efficient tile-based rasterizer is used for fast
rendering and backpropagation.
In this work, we represent each object with a volume V ∈ RC×N ×N ×N com-
posed of 3D Gaussians, which is trained from multi-view images of the object.
Here, C, N represent the number of feature channels and volume resolution re-
spectively, and the number of 3D Gaussians is N 3 . We follow the convention
and rendering methods in original works, but use position offsets ∆µ to express
slight movements between the position p of each grid point in the volume and
the center µ of the Gaussian point it represents:

\mu = p + \Delta \mu . (2)

During the fitting phase, we only apply the backward operation on the offsets
∆µ, which allows the expression of more fine-grained 3D assets and also imposes
a restriction to form better structures. Due to the fixed number of Gaussians,
we cannot directly apply the original strategy for densification control. Instead,
we propose the Candidate Pool Strategy for effectively pruning and cloning.

Candidate Pool Strategy We can not directly apply the original interleaved
optimization/density strategy to move 3D Gaussians in our approach, due to the
dynamical changes in the number of Gaussians not suitable for a fixed number of
Gaussians. Densifying or pruning Gaussians freely is not allowed since Gaussians
are bijective with assigned grid points. A naive way to avoid the problem is to
only adjust the offsets ∆µ relying on gradient back-propagation. Unfortunately,
we experimentally found that the movement range of Gaussian centers becomes
largely limited without a pruning and densification strategy, which leads to lower
rendering quality and inaccurate geometry (see Fig. 7(a)). For this reason, we
propose a novel strategy (Algorithm 1) to densify and prune the fixed number of
Gaussians. The key point to our strategy is storing pruned points in a candidate
pool P for later densification.
We initially align Gaussian centers, µ, with assigned positions, p and set
offsets ∆µ = 0 (Line 2 in Algorithm 1). The candidate pool P is initialized as an
empty set (Line 3). This pool stores "deactivated" points, which refer to pruned
points during optimization – they are not involved in the forward and backward
process. We optimize each asset for T iterations. Following original 3D Gaussian
8 He et al.

Splatting [18], thresholds τp , τd on view-space position gradients are used for


pruning and densifying every fixed iterations (IsRefinementIteration(t) in Line
8).
During the refining process, once Gaussian points Gp are pruned with pre-
defined threshold τp , they are added into candidate pool P , making them "deac-
tivated" for rendering and optimization (Lines 9-10). For densification, we first
determine points Gd that should be densified in the "activated" set of Gaus-
sians G. Then newly added points Gnew are selected from candidate pool P via
some criterion (e.g., we use the nearest point within a certain distance ϵof f sets
to the densified points Gd ). The corresponding coordinate offsets for the added
points are calculated. They become "activated" for both forward and backward
processes and are subsequently removed from the candidate pool P (Lines 11-
13). At the end of optimization/density, all points in the candidate pool are
reintroduced into the optimization for further refinement (Lines 15-17).
This strategy ensures that Gaussian points can adaptively move during the
optimization process, to represent more intricate object shapes. Simultaneously,
the resulting structured volumetric form maintains its physical meaning, demon-
strating a balance between adaptability and well-defined structure.

Training Loss The final loss for supervision is the original loss used in 3D
Gaussian Splatting adding a regularization loss:

\label {eq:eqoffsets} \mathcal {L}_{offsets}=Mean(\mathrm {ReLU}(| \Delta \mu - \epsilon _{offsets}|)), (3)

where ϵof f sets is a hyper-parameter, to restrict the center of 3D Gaussians not


too far from their corresponding grid points,

\label {eq:eq1} \mathcal {L}_{fitting}=\lambda _1 \mathcal {L}_1+\lambda _2 \mathcal {L}_{SSIM} + \lambda _3 \mathcal {L}_{offsets} (4)

After training, the points are sorted in spatial order according to volume co-
ordinates as training data for the generation stage. Once the training is over,
2D images of target objects could be rendered at an ultra-fast speed since each
GaussianVolume is rather lightweight. More implementation details can be found
in the supplemental materials.

3.2 Stage2: Text-to-3D Generation


As discussed in work [40], 3D volume space is high-dimensional, which poten-
tially poses challenges in training the diffusion model. Empirically, we found that
even though we form 3D Gaussians as a volumetric structure, it still takes too
long for model convergence when training with a large amount of data. More-
over, learning the data distribution of GaussianVolumes from diverse categories
via a single diffusion model causes it to collapse to the single-step reconstruction,
leading to the degradation of generation diversity. To overcome these challenges,
we introduce a coarse-to-fine pipeline that first generates coarse geometry vol-
umes (Gaussian Distance Field, GDF) and then predicts the attributes of the
GVGEN: Text-to-3D Generation with Volumetric Representation 9

Ground truth

Reconstruction

Fig. 3: Visualization of GaussianVolume Fitting. The rendering results demon-


strate excellent reconstruction performance of GaussianVolumes.

GaussianVolume. In the first step, we adopt a diffusion model conditioned on


input texts to generate GDF. In the second step, we employ a 3D U-Net-based
model to predict the attributes of Gaussians by inputting the GDF along with
the text condition, leading to the final GaussianVolume.

Gaussian Distance Field Generation The Gaussian Distance Field (GDF)


1×N ×N ×N
F ∈ R+ 0 stores an isotropic feature representing the fundamental geom-
etry of a GaussianVolume. It measures the distance between each grid coordinate
and its nearest Gaussian point, similar to the definition of Unsigned Distance
Field. This attribute can be easily extracted from fitted GaussianVolumes via
sorting algorithms. After obtaining the ground truth GDF, we train a diffusion
model conditioned on texts to generate GDF, creating the coarse geometry of
objects. The model is supervised by minimizing Mean Square Error (MSE) loss
L3D between the ground truth GDF and the generated GDF, which is equivalent
to predicting added noises in the diffusion process. Using diffusion models, the
generation diversity with respect to object shape is introduced.

GaussianVolume Prediction After generating GDF, we send it into a U-Net-


based model modified from SDFusion [9], along with text descriptions, to predict
all attributes of the GaussianVolume. The reason why a reconstruction model
is adopted is that we empirically found that the single-step model produces
comparable predictions of Gaussian attributes with a diffusion model.
We use two kinds of losses in this phase: MSE loss L3D between ground truth
GaussianVolume and the predicted one, and rendering loss L2D :
\label {eq:eqrenderloss} \mathcal {L}_{2D}=\lambda \mathcal {L}_1+(1-\lambda )\mathcal {L}_{SSIM} (5)
for rendered images, which are composed of L1 and SSIM loss. The total loss is,

\label {eq:eq2} \mathcal {L}=\lambda _{3D} \mathcal {L}_{3D}+\lambda _{2D} \mathcal {L}_{2D} (6)
Using a multi-modal loss balances global semantic and local details, and keeps
the training process more stable. More implementation details about model ar-
chitectures and data processing can be found in the supplemental materials.
10 He et al.

A burger with
lettuce, tomatoes,
and various
toppings.

A clear glass jar


with a wooden lid.

A rusty, bronze
faucet with a
handle.

A small metal
lantern.

A yellow and black


forklift truck.

A yellow
gramophone on a
wooden box.

condition Ours Shap·E VolumeDiffusion DreamGaussian

Fig. 4: Comparisons with State-of-the-art Text-to-3D Methods. Our method


achieves competitive visual results with better alignments with text conditions.

4 Experiments
4.1 Baseline Methods and Dataset
Baseline Methods We compare with both feed-forward-based methods and
optimization-based methods. In the feed-forward category, we evaluate Shap-
E [17] and VolumeDiffusion [40], both of which directly generate 3D assets. For
optimization-based methods, we consider DreamGaussian [39] as the baseline,
where coarse 3D Gaussians undergo optimization with SDS loss [33] and are sub-
sequently converted into meshes for further refinement. For VolumeDiffusion, we
report results generated in the feed-forward process, without post-optimization
using SDS via pretrained text-to-image models, to ensure fair comparisons with
other methods.

Dataset Our training dataset comprises the Objaverse-LVIS dataset [11], which
contains ∼ 46,000 3D models in 1,156 categories. For text prompts for training,
GVGEN: Text-to-3D Generation with Volumetric Representation 11

White chandelier with A gray sectional sofa with A brown leather


A chicken on a plate.
frosted glass shades. pillows and wooden legs. backpack.

A computer monitor with a blue A beige bunk bed. A green and red Christmas A red, white, and blue Lego
screen and a black cube on top. gift box with a red bow. airplane with wheels.

A small red metal post A white and red Lego A wooden piano with
A red bell pepper.
box. toy chicken. stool and cabinet.

Pixelated model of a A red and yellow fire A shopping cart with


White scooter model.
wooden ship with sails. hydrant. wheels.

Fig. 5: Text-to-3D Generation Results by GVGEN.

we use captions from Cap3D [26], which leverages BLIP-2 [20] to caption multi-
view images of objects and consolidates them into single captions through GPT-
4 [1]. For evaluation, we generate 100 assets and render 8 views for each asset.
The camera poses for these views are uniformly sampled around the object. Our
method excels in both visualization and semantic alignment.

4.2 Qualitative and Quantitative Results

GaussianVolume Fitting We present the reconstruction results of Gaussian-


Volume Fitting in Fig. 3. These results demonstrate that the proposed Gaus-
sianVolume can express high-quality 3D assets with a small number of Gaussian
points under the volume resolution N = 32. Higher resolutions yield better
rendering effects but require more computational resources for the generation
stage. Additional ablation studies about fitting GaussianVolume could be found
in Sec. 4.3 and the supplemental materials.
12 He et al.

Condition
View 1

View 2

A yellow chair. Ours OpenLRM TGS


(a) (b)
Fig. 6: Display of Generation Diversity. (a) displays diverse generated assets ac-
cording to the same text prompt via GVGEN. (b) shows the comparisons among our
image-conditioned GVGEN, OpenLRM [13] and TGS [49]. The single-view reconstruc-
tion models suffer from the problem of average patterns, leading to implausible shapes
and textures in unseen regions, while GVGEN produces reasonable appearances and
geometries.

Text-to-3D Generation We provide visual comparisons and quantitative anal-


yses of our method and existing baselines in Fig. 4 and Tab. 1, respectively.
As depicted in the figure, our model generates reasonable geometry and plau-
sible textures. Shap-E [17] can produce rough shapes but sometimes result in
misaligned appearances with respect to the input texts. VolumeDiffusion [40]
tends to generate unrealistic textures. And DreamGaussian [39], which adopts an
optimization-based approach, always produces over-saturated results. For quan-
titative results, we compare the CLIP score between rendered images and the
corresponding texts, and also the generation time of these methods. We randomly
select 300 text prompts for evaluation from the 40 LVIS categories containing
the highest number of objects. For each object, we render 8 views in uniformly
sampled camera poses. To measure generation time, we set the sampling steps
to 100 for diffusion-based methods. The configuration of DreamGaussian [39]
follows the default settings. Refer to Fig. 5 for more generation results.

Generation Diversity As depicted in Fig. 6(a), GVGEN can generate diverse


assets conditioned on the same prompt. The generative diversity of our method
not only differentiates it from reconstruction approaches but also heightens the
imaginative capabilities of users. Furthermore, we develop an image-to-3D model
conditioned on CLIP image embeddings and compare the results with the re-
cently popular single-image reconstruction models TGS [49] and LRM [14] (See
Fig. 6(b)). Since we do not have access to the close-sourced LRM, we use a
re-implementation of it, namely OpenLRM [13]. The single-view reconstruction
models suffer from the problem of average patterns, leading to implausible shapes
and textures in unseen regions, while GVGEN produces reasonable appearances
GVGEN: Text-to-3D Generation with Volumetric Representation 13

Table 1: Quantitative Comparisons against Baseline Methods on CLIP


Score and Inference Speed. The metrics demonstrate the superior performance
of our method.

Metrics CLIP score ↑ Time ↓


Ours 28.53 7 sec
Shap-E 28.48 11 sec
VolumeDiffusion 25.09 7 sec
DreamGaussian 23.60 ∼3 min

View 1 View 1

View 2 View 2

Ground truth Full w/o CPS w/o offsets Ground truth Full w/o w/o
(a) (b)

Fig. 7: Qualitative Results for Ablation Studies. (a) represents visual compar-
isons of different GaussianVolume fitting methods. (b) stands for results using different
losses to train the GaussianVolume prediction model.

and geometries. Such comparisons accentuate the critical difference between our
GVGEN and the reconstruction methods.

4.3 Ablation Studies

GaussianVolume Fitting We first study the effects of strategies fitting Gaus-


sianVolume via visual comparisons. In these ablation experiments, 3D assets are
trained with 72 images, with camera poses uniformly distributed at a distance
of 2.4 from the world center, and evaluated with 24 images posed uniformly at a
distance of 1.6. We set volume resolution N = 32, i.e. number of Gaussian points
N 3 = 32, 768. The full method, as used in the data preparation stage, performs
the best in terms of rendering results, even with only 32, 768 Gaussians. "w/o
CPS" refers to optimizing coordinate offsets ∆µ only via backpropagation, while
"w/o offsets" means fixing Gaussian coordinates µ = p without optimizing point
positions. Fig. 7(a) shows that using the full method produces the best rendering
results. Tab. 2 reports the average PSNR, SSIM, and LPIPS scores, supporting
14 He et al.

Table 2: Quantitative Metrics for Ablation Studies. The left table analyzes
GaussianVolume fitting strategies, while the right table compares different losses train-
ing the GaussianVolume attributes prediction model. For qualitative comparisons, see
Fig. 7.

Metrics PSNR ↑ SSIM ↑ LPIPS ↓ Metrics PSNR ↑ SSIM ↑ LPIPS ↓


Full 30.122 0.963 0.038 Full 35.03 0.9872 0.0236
w/o CPS 29.677 0.958 0.049 w/o L3D 35.21 0.9846 0.0268
w/o offsets 27.140 0.936 0.084 w/o L2D 29.55 0.9654 0.0444

the effectiveness of the proposed strategy quantitatively. More results can be


found in the supplemental materials.

Text-to-3D Generation Due to the computational resources required, it is


impractical to evaluate each part of our design with models trained on the full
Objaverse-LVIS dataset for our ablation study. Therefore, we train the models
on a small subset of Objaverse-LVIS to validate the effectiveness of losses in
the stage of predicting GaussianVolume attributes. Specifically, we pick 1,336
assets themed animals from 24 LVIS categories for training. We randomly select
∼20% of the data from the training data and evaluate quantitative metrics with
rendered novel views of each asset. Fig. 7(b) demonstrates that the proposed
multi-modal losses produce plausible results with fine-grained details, and quan-
titative results are listed in Tab. 2. The combination of two types of losses leads
to more detailed textures while keeping the entire geometry smooth.

4.4 Limitations
GVGEN has shown encouraging results in generating 3D objects. However, its
performance is constrained when dealing with input texts significantly divergent
from the domain of training data, as illustrated in Fig. 12 in the supplementary.
Since we need to fit the GaussianVolume per object to prepare training data,
it is time-consuming to scale up to millions of object data for better diversity.
Additionally, the volume resolution N is set as 32 (For 3D Gaussians, the point
number is only N 3 =32,768) to save computational resources, which limits the
rendering effects of 3D assets with very complex textures. In the future, we will
further explore how to generate higher-quality 3D assets in more challenging
scenarios.

5 Conclusions
In conclusion, this paper explores the feed-forward generation of explicit 3D
Gaussians conditioned on texts. We innovatively organize disorganized 3D Gaus-
sian points into a structured volumetric form, termed GaussianVolume, enabling
feed-forward generation of 3D Gaussians via a coarse-to-fine generation pipeline.
GVGEN: Text-to-3D Generation with Volumetric Representation 15

To facilitate the fitting of high-quality GaussianVolumes for training, we propose


a novel pruning and densifying strategy, i.e. Candidate Pool Strategy. Our pro-
posed framework, GVGEN, demonstrates remarkable efficiency in generating 3D
Gaussians from texts. Experimental results demonstrate the competitive capa-
bilities of GVGEN in both qualitative and quantitative terms. This progress
suggests potential extensions of our approach in tackling a broader spectrum of
challenges within the field.

References

1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida,
D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv
preprint arXiv:2303.08774 (2023)
2. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srini-
vasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance
fields. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. pp. 5855–5864 (2021)
3. Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion. pp. 130–141 (2023)
4. Chang, J.H.R., Chen, W.Y., Ranjan, A., Yi, K.M., Tuzel, O.: Pointersect: Neural
rendering with cloud-ray intersection. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition. pp. 8359–8369 (2023)
5. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In:
European Conference on Computer Vision. pp. 333–350. Springer (2022)
6. Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry
and appearance for high-quality text-to-3d content creation. arXiv preprint
arXiv:2303.13873 (2023)
7. Chen, Y., Chen, Z., Zhang, C., Wang, F., Yang, X., Wang, Y., Cai, Z., Yang, L.,
Liu, H., Lin, G.: Gaussianeditor: Swift and controllable 3d editing with gaussian
splatting. arXiv preprint arXiv:2311.14521 (2023)
8. Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. arXiv preprint
arXiv:2309.16585 (2023)
9. Cheng, Y.C., Lee, H.Y., Tulyakov, S., Schwing, A.G., Gui, L.Y.: Sdfusion: Multi-
modal 3d shape completion, reconstruction, and generation. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4456–
4465 (2023)
10. Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A.,
Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d
objects. Advances in Neural Information Processing Systems 36 (2024)
11. Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E.,
Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of
annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 13142–13153 (2023)
12. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances
in neural information processing systems 34, 8780–8794 (2021)
13. He, Z., Wang, T.: Openlrm: Open-source large reconstruction models. https://
github.com/3DTopia/OpenLRM (2023)
16 He et al.

14. Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K.,
Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv
preprint arXiv:2311.04400 (2023)
15. Huang, Z., Wen, H., Dong, J., Wang, Y., Li, Y., Chen, X., Cao, Y.P., Liang,
D., Qiao, Y., Dai, B., et al.: Epidiff: Enhancing multi-view synthesis via localized
epipolar-constrained diffusion. arXiv preprint arXiv:2312.06725 (2023)
16. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided
object generation with dream fields. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 867–876 (2022)
17. Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv
preprint arXiv:2305.02463 (2023)
18. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)
19. Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K.,
Shakhnarovich, G., Bi, S.: Instant3d: Fast text-to-3d with sparse-view generation
and large reconstruction model. arXiv preprint arXiv:2311.06214 (2023)
20. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597 (2023)
21. Li, W., Chen, R., Chen, X., Tan, P.: Sweetdreamer: Aligning geometric priors in
2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596 (2023)
22. Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards
high-fidelity text-to-3d generation via interval score matching (2023)
23. Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer:
Generating multiview-consistent images from a single-view image. arXiv preprint
arXiv:2309.03453 (2023)
24. Liu, Z., Li, Y., Lin, Y., Yu, X., Peng, S., Cao, Y.P., Qi, X., Huang, X., Liang,
D., Ouyang, W.: Unidream: Unifying diffusion priors for relightable text-to-3d
generation. arXiv preprint arXiv:2312.08754 (2023)
25. Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H.,
Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-
domain diffusion. arXiv preprint arXiv:2310.15008 (2023)
26. Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pretrained
models. Advances in Neural Information Processing Systems 36 (2024)
27. Melas-Kyriazi, L., Rupprecht, C., Vedaldi, A.: Pc2: Projection-conditioned
point cloud diffusion for single-image 3d reconstruction. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12923–
12932 (2023)
28. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Commu-
nications of the ACM 65(1), 99–106 (2021)
29. Mohammad Khalid, N., Xie, T., Belilovsky, E., Popa, T.: Clip-mesh: Generating
textured meshes from text using pretrained image-text models. In: SIGGRAPH
Asia 2022 conference papers. pp. 1–8 (2022)
30. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with
a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41(4), 1–
15 (2022)
31. Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for
generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751
(2022)
GVGEN: Text-to-3D Generation with Volumetric Representation 17

32. Ntavelis, E., Siarohin, A., Olszewski, K., Wang, C., Gool, L.V., Tulyakov, S.: Au-
todecoding latent 3d diffusion models. Advances in Neural Information Processing
Systems 36 (2024)
33. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using
2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
34. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning. pp.
8748–8763. PMLR (2021)
35. Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d:
Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023)
36. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
37. Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.:
Zero123++: a single image to consistent multi-view diffusion base model. arXiv
preprint arXiv:2310.15110 (2023)
38. Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion
for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
39. Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian
splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
40. Tang, Z., Gu, S., Wang, C., Zhang, T., Bao, J., Chen, D., Guo, B.: Volumediffusion:
Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint
arXiv:2312.11459 (2023)
41. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer:
High-fidelity and diverse text-to-3d generation with variational score distillation.
Advances in Neural Information Processing Systems 36 (2024)
42. Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang,
X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint
arXiv:2310.08528 (2023)
43. Wu, Z., Wang, Y., Feng, M., Xie, H., Mian, A.: Sketch and text guided diffu-
sion model for colored point cloud generation. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 8929–8939 (2023)
44. Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.:
Agg: Amortized generative 3d gaussians for single image to 3d. arXiv preprint
arXiv:2401.04099 (2024)
45. Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf:
Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 5438–5448 (2022)
46. Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaus-
siandreamer: Fast generation from text to 3d gaussian splatting with point cloud
priors. arXiv preprint arXiv:2310.08529 (2023)
47. Yin, Y., Xu, D., Wang, Z., Zhao, Y., Wei, Y.: 4dgen: Grounded 4d content gener-
ation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225 (2023)
48. Yu, X., Guo, Y.C., Li, Y., Liang, D., Zhang, S.H., Qi, X.: Text-to-3d with classifier
score distillation. arXiv preprint arXiv:2310.19415 (2023)
49. Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane
meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with
transformers. arXiv preprint arXiv:2312.09147 (2023)
18 He et al.

6 Implemantation Details

6.1 Stage 1: GaussianVolume Fitting

In this stage, we precisely fit each object using 96 uniformly rendered images,
capturing various poses. The initial 72 images are rendered with camera poses
uniformly distributed around a camera-to-world-center distance of 2.4 units. The
remaining 24 images are rendered from a closer distance of 1.6 units to provide
a more detailed view. We set the volume resolution at N = 32 and the spherical
harmonics (SH) order at 0, resulting in each Gaussian point having a feature
channel number C = 14.
The fitting volume is assumed to be a cube with a side length of 1 world unit,
centered within the global coordinate system. Inside this cube, Gaussian points
are uniformly distributed to ensure a comprehensive coverage of the volume.
Initially, the offset for each Gaussian point ∆µ is set to zero. To optimize each
instance, we conduct training against a white background for 20,000 iterations. A
densification strategy is employed from iteration 500 to 15,000, with subsequent
operations executed every 100 iterations to incrementally enhance the model’s
density. After this densification phase, we periodically clip features to predefined
ranges every 100 iterations to maintain consistency and prevent outliers. We
refrain from resetting the opacity of each Gaussian point during the training
process. This decision is made to avoid introducing instability into the model’s
learning and adaptation phases.
For weights selection in Lf itting , we set λ1 = 0.8, λ2 = 0.2, λ3 = 20.0, and
the offsets threshold as 1.5 voxel distances. Mathematically, this is expressed as
1.5
ϵof f sets = 32−1 , signifying the calculated distance between adjacent grid points
in our defined volume space. For other parameters, we adhere to the default
configurations in 3D Gaussian Splatting [18].

6.2 Stage 2: Text-to-3D Generation

Data Pre-processing In the generation phase, we first preprocess Gaussian-


Volume data to improve the stability and convergence of the learning process.
First, we standardize the decomposition of the covariance matrix by imposing
an ordering constraint on the scaling vector s ∈ R3 of each Gaussian, ensuring
they are arranged in ascending order. This organization, while maintaining the
orthogonality of the eigenvectors, does not modify the resultant covariance ma-
trix. The rotation vector q ∈ R4 is also adjusted to align with the sequence of
transformations introduced by scaling. Secondly, we represent quaternions using
spherical coordinates. This conversion reduces the rotation vector q to a three-
dimensional representation, enhancing the uniqueness and consistency of each
Gaussian’s parameters. By normalizing quaternions in this manner, we ensure
that identical shapes are represented by identical parameters. Finally, we nor-
malize each feature channel individually based on its mean and variance. This
normalization process equilibrates the scales across different channels.
GVGEN: Text-to-3D Generation with Volumetric Representation 19

Point Cloud Diffusion One-stage Diffusion Ours


Fig. 8: Visual Results of Generated Assets in Explorational Experiments.

Model Architecture and Training In the generation phase of Gaussian Den-


sity Fields (GDF), we utilize a 3D U-Net modified from [9]. The text condition
is 77 × 768 embeddings extracted with CLIP ViT-L/14 [34] text encoder, and it
is injected into the 3D U-Net using cross-attention mechanisms. An additional
MLP layer is employed to bridge the inter-modal gap. The architecture of the
prediction model mirrors the 3D U-Net, with adjustments including a modifica-
tion in the input channel number and the omission of timestep blocks.
When training the generative model predicting GDF, we only use MSE loss
L3D as the supervision. For training the prediction model, we set λ = 0.2 for
the rendering loss L2D through the whole process. Then, for the total loss L =
λ3D L3D + λ2D L2D , we set λ3D = 1.0, λ2D = 0 for the first 100 epochs, providing
a robust initialization that focuses solely on 3D consistency. Subsequently, the
weights are adjusted to λ3D = 0.2 and λ2D = 0.8, shifting the focus towards
optimizing 2D rendering and further refining.

7 Explorational Experiments
In our early exploration, we investigated the efficacy of various generative mod-
els, initially training three unconditional models: a point cloud diffusion model,
and a primal 3D U-Net-based diffusion model, and compared them with our full
model. We specifically utilized 8 fitted GaussianVolumes belonging to the chair
LVIS-category as our training data and trained each model for 4,000 epochs.
The qualitative results are shown in Fig. 8.
Our initial foray involved adapting the point cloud diffusion model, inspired
by prior work [27], to facilitate the generation of 3D Gaussians. Unfortunately,
this approach struggled with convergence issues, yielding unsatisfactory results.
The inherent limitations of point cloud diffusion models in capturing and gen-
erating the nuanced structures of 3D Gaussians became apparent, prompting us
to explore alternative strategies. Employing a primal one-stage diffusion model
offered a glimpse into generating basic 3D structures. However, this model pre-
dominantly produced coarse outputs characterized by spiny shapes, highlighting
20 He et al.

A pink flower
in a pot.

A red bell pepper.

A white vase
filled with
red roses.

A yellow wooden
barrel with
white stripes.

Condition GSGEN w/ Ours GSGEN w/ Point-E

Fig. 9: GSGEN [8] Results Initialized with Our Method and Point-E [31].
The left column represents rendering results initialized with different methods, and the
right column stands for rendering results after optimization with GSGEN.

the need for a more nuanced generation technique to achieve plausible 3D geom-
etry. Our full model, incorporating a coarse-to-fine generation pipeline, demon-
strated significant improvements in generating 3D Gaussians. This approach not
only simplified the generation process but also empowered the model to produce
instances with more accurate and realistic 3D geometry. By sequentially refining
the generated outputs, the coarse-to-fine pipeline effectively addresses the limi-
tations observed in the earlier models, showcasing its superiority in generating
complex 3D structures.

8 Application
To demonstrate the superiority of our method, we show GVGEN’s capability
to integrate with optimization-based methods, like GSGEN [8], for further re-
finement (see Fig. 9). Using the generated GaussianVolume6 as initialization to
6
We filter low-opacity Gaussian points for GSGEN initialization.
GVGEN: Text-to-3D Generation with Volumetric Representation 21

Novel
View 1

Novel
View 2

Gaussians
Visualization

GT 0-dist 1.5-dist 3-dist ∞-dist

Fig. 10: Visual Results of Rendered Images and Positions of The Gaussian
Center. The fitted assets are optimized with different offsets threshold ϵof f sets .

replace point clouds from Point-E with random colors. This shift enables GS-
GEN to further optimize 3D Gaussians, achieving better alignment with the text
descriptions in both texture and geometry. This enhancement stems from avoid-
ing the adverse impact of color attributes from Point-E on GSGEN’s utilization,
as the features produced by GVGEN are more compatible and beneficial for the
optimization process.

9 Additional Ablation Studies on GaussianVolume


Fitting Stage
9.1 Effects of offsets threshold ϵof f sets
In Tab. 3 and Fig. 10, we present the effects of varying the offset threshold,
ϵof f sets during GaussianVolume fitting. The term "0-dist" indicates the absence
of an offsets term, whereas "∞-dist" denotes the omission of offsets regular-
ization. Our observations reveal that minimal regularization leads to improved
rendering performance. However, this comes at the cost of Gaussian points within
the volume becoming more unstructured and deviating from grid points. Without
this regularization, the offset term becomes overly flexible, making it challenging
22 He et al.

Table 3: Quantitative results for different offsets threshold selection.

Metrics PSNR ↑ SSIM ↑ LPIPS ↓


0-dist 18.189 0.859 0.176
1.5-dist 30.437 0.966 0.046
3-dist 30.395 0.969 0.038
∞-dist 30.007 0.969 0.034

Table 4: Quantitative Results for Different Settings of GaussianVolume Resolution.

Metrics PSNR ↑ SSIM ↑ LPIPS ↓ Num of Gaussians Time(s)


16-res 23.079 0.887 0.122 4,096 169
32-res 30.437 0.966 0.046 32,768 179
64-res 30.395 0.972 0.032 262,144 195
3DGS 29.789 0.969 0.030 175,532 191

for the network to learn effectively. To strike a balance between flexibility and
1.5
maintaining a well-defined structure, we have selected ϵof f sets = 32−1 , equiva-
lent to a 1.5 voxel distance, as our optimal threshold.

9.2 Effects of GaussianVolume Resolution


We present a comparison of rendering results and quantitative metrics in Tab. 4
and Fig. 11, utilizing the "Chunky knight" asset from the Objaverse dataset. An
increase in resolution N -res of the GaussianVolume correlates to a higher number
of Gaussian points, N 3 . To accommodate the varying volume resolutions N ,
adjustments are made to the initialization and the offsets threshold, specifically
ϵof f sets = N1.5
−1 . The term "3DGS" denotes the process of fitting the object using
the original 3D Gaussian Splatting [18], with iteration settings matched to ours
and other parameters set to default. The analysis reveals that a greater number of
Gaussian points leads to improved fitting quality under identical settings, albeit
at the expense of increased memory usage. Notably, even when utilizing fewer
Gaussian points, our method delivers comparable visual quality to the original
3DGS approach and achieves higher PSNR values in a structured representation.
This underscores the efficiency and effectiveness of our method in reconstructing
high-quality, memory-efficient representations.

10 Failure Cases
As illustrated in Fig. 12, some of the objects generated by our model suffer from
blurred textures and imprecise geometry. This phenomenon largely stems from
the relatively small size of our training dataset, which comprises around 46,000
GVGEN: Text-to-3D Generation with Volumetric Representation 23

Novel
View 1

Novel
View 2

GT 16-res 32-res 64-res 3DGS

Fig. 11: Qualitative Comparisons among different GaussianVolume resolution set-


tings and original 3DGS.

Soccer ball a bowl of vegetables a schoolbus

Fig. 12: Failure Cases.

instances. Such a dataset size limits the model’s capacity to produce varied
outputs in response to a broad spectrum of text inputs. In the future, our work
will focus on two main avenues: improving the model architecture and enhancing
the data quality. By addressing these aspects, we aim to scale up the model for
application in large-scale scenarios, which is expected to improve the generation
diversity and lead to better-rendered results.

11 Additional Qualitative Results


We provide more visual results in Fig. 13 and Fig. 14.
24 He et al.

A pink flower A yellow wooden barrel Black umbrella with A Lego man
in a pot. with white stripes. wooden handle. wearing a red shirt
and various pants colors.

A rooster. A table with A Yellow Chevrolet Low-poly model of a palm


a red and white Camaro sports car. tree.
checkered tablecloth.

a white horned Pink Flamingo. An old-fashioned A white fan


bull head lantern. with blue blades
on a stand.

A blue and yellow toy a black sectional sofa a small blue teapot Red and black devil
tricycle with a pink and with horns. mask with horns.
blue seat and wheels.

Fig. 13: More Qualitative Results.


GVGEN: Text-to-3D Generation with Volumetric Representation 25

an hourglass with a reindeer with antlers. a chocolate muffin White crane


blue sand on a plate.

Yellow rubber duck. a light blue, plastic pink frosted donut with
a pink, purple, and blue
child's chair. white sprinkles.
children's chandelier.

a wooden bedside a chessboard with a light blue ottoman Thor's wooden hammer
side table. checkered pattern. with a metal frame.

a red, blue, and white a pair of glasses. a wooden stool with a garlic bulb.
cube floating in water. seat and legs.

Fig. 14: More Qualitative Results.

You might also like