A GPU-Parallel Construction of Volumetric Tree

Mohammad M. Hossain
College of Computing
Georgia Tech
Atlanta, GA, USA

Thomas M. Tucker
Tucker Innovations
Charlotte, NC, USA

Thomas R. Kurfess
George W. Woodruff School of Mechanical Engineering
Georgia Tech
Atlanta, GA, USA

Richard W. Vuduc
School of Computational Science and Engineering
Georgia Tech
Atlanta, GA, USA
ABSTRACT
Volume data has extensive uses in medical imaging (e.g., MRI scans), in visual effects production (volume rendering and fluid simulation), and in computer-aided design and manufacturing (CAD/CAM) for advanced prototyping such as 3D printing, among other areas.
This work presents a compact hierarchical data structure, dubbed HDT, for extreme-scale volume modeling. The name is shorthand for hybrid dynamic tree. HDT is hybrid in that it fuses two contrasting structures, namely the octree and the tiled grid. This fusion of two opposing data layouts alleviates the limitations inherent to each structure.
We describe the HDT construction algorithm for generating volumetric representations of triangle meshes on GPUs. While HDT mirrors much of the existing work on sparse 3D data modeling, our evaluation and comparative studies with prior research demonstrate that HDT is arguably the most storage-effective representation. The geometric topology in HDT consumes just under two bits per voxel, five times more compact than the current state-of-the-art volumetric structure.
Categories and Subject Descriptors
J.6.6 [Volumetric models]: Hybrid Data Layout—GPU
accelerated algorithms; J.2.1 [Massively Parallel Algorithms]; K.3.6 [Computer-Aided Design]
Keywords
Octree, Voxels, GPGPU Acceleration, Rapid Prototyping.
1. INTRODUCTION
While the computational demands of 3D data processing have historically confined its use to medical imaging and visual effects production, with the emergence of advanced
rapid prototyping technologies and 3D printing, volumetric representation of complex free-form objects has become a central part of the computer-aided design and manufacturing (CAD/CAM) industry. Algorithms operating on a volume are generally more robust than their counterparts dealing with explicit or parametric representations, such as triangle meshes or NURBS. In volume modeling, the embedding space is split into cubical cells (voxels), and the state of each voxel is characterized as lying inside, outside, or exactly on the surface, depending on its intersection with the solid boundary. Generally, voxels are discretized on a uniform grid, partly because a 3D matrix is easy to manipulate, but more importantly because volume processing tools, such as convolution kernels and interpolation, are inclined toward uniform sampling [7]. A fundamental limitation of uniform sampling is that the memory footprint grows in proportion to the volume of the bounding space, which makes such a scheme suitable only for low resolutions.
Many approaches attempt to mitigate these high memory requirements by constructing a sparse voxel representation, typically an octree structure. With an extremely small branching factor of two along each coordinate, the octree is clearly optimal for adaptive grid sampling. However, a standard sparse voxel octree (SVO) presents two major challenges for parallelization on GPUs. First, being a hierarchical structure, it requires extra storage to maintain the geometric topology. Indexing is implicit in a 3D grid, and hence no extra storage is needed to encode the topology of uniformly sampled voxels. The efficient SVO work by Laine et al. [6] reports a minimum of one byte per voxel for storing the voxel topology in the sparse data layout. To the best of our knowledge, this is the least storage needed for SVO traversal; yet one byte of memory per voxel may use up a significant share of the limited memory on a GPU. A second fundamental challenge for any tree-based data structure is that it is not amenable to efficient traversal and dynamic topological adjustment on the GPU's parallel architecture. As octrees are usually constructed in level order on GPUs, the deep hierarchies required to model volumes at extreme scales become a bottleneck.
In this study, we present the hybrid dynamic tree (HDT), which combines an octree with a two-level grid structure. HDT is designed to minimize the memory footprint: each voxel accounts for just 1.66 bits on average for topology data, squeezing the relevant storage by a factor of 5x. Fusing the two contrasting data layouts also restrains the tree hierarchy from growing too tall. The HDT layout is noticeably shallower than an octree, often spanning just 4-5 levels of hierarchy, and allows efficient GPU-parallel processing of the voxels lying on the solid boundary.
2. BACKGROUND AND RELATED WORK
Parallel volume construction on GPUs, in either dense or sparse representations, has been extensively studied. For accelerated ray tracing, an efficient construction algorithm for uniform grids on modern GPUs was presented in [5]. As uniform grids impose limited resolution, Kalojanov et al. [4] presented a CUDA implementation for ray tracing with a two-level nested grid structure, often termed a tiled grid. In a tiled grid, the volume is first partitioned into uniform blocks (called tiles or bricks), and each intersecting block is then partitioned into voxels. Being a uniform sampling technique, tiled grids inherit the same constraints as uniform 3D grids when it comes to extreme-scale volume representation.
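To make the two-level layout concrete, the following sketch (our own illustration, not code from [4]) shows how a voxel coordinate could be split into a tile index and an intra-tile offset for a cubic volume partitioned into 16³-voxel tiles; the names and storage order are assumptions.

    // Illustrative two-level (tiled grid) indexing; names and layout are our own
    // assumptions, not taken from [4]. A dim^3-voxel volume is split into tiles
    // of TILE^3 voxels each, stored tile by tile.
    #define TILE 16

    __host__ __device__ inline void tiledGridIndex(int vx, int vy, int vz, int dim,
                                                   int *tileId, int *voxelId)
    {
        int tilesPerAxis = dim / TILE;                       // tiles along one axis
        int tx = vx / TILE, ty = vy / TILE, tz = vz / TILE;  // tile coordinates
        int lx = vx % TILE, ly = vy % TILE, lz = vz % TILE;  // offset inside the tile
        *tileId  = (tz * tilesPerAxis + ty) * tilesPerAxis + tx;
        *voxelId = (lz * TILE + ly) * TILE + lx;
    }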
At the other extreme from a dense regular grid, with full dyadic refinement and data stored only in leaves, we have a complete octree. While the octree is space efficient, its small branching factor results in a very deep hierarchy that is difficult to traverse and update dynamically on a GPU. In between, we can choose different branching factors at each level, resulting in a hierarchical grid or N³-tree, which has been used in volume rendering and in volume reconstruction from 3D scanning [3, 7, 2]. Depending on specific application characteristics, an N³-tree can be configured to trade off memory efficiency (low N, deep tree) against traversal efficiency (large N, shallow tree).
Our goal is to design a memory-efficient data structure that compactly embeds the geometric topology of the active voxels¹, and simultaneously lends itself to parallel construction and algorithmic processing on GPU accelerators. While prior research proposals have targeted similar objectives, some fall short of extreme-scale resolutions: GigaVoxels [3] achieves a maximum resolution of 512³, and the work by Chen et al. [2] demonstrates the capability of handling up to 1024³. Among these studies, VDB [7] is structurally the most closely related to our work, exhibiting results at 4096³ resolution. However, VDB is designed for CPU platforms, which differs strikingly from the design principle behind HDT. Due to the memory overhead of custom-tailored software-level caching and of large branching factors mirroring a B-tree-like layout, among others, each active voxel in VDB accounts for 7.5 bytes of storage on average, as reported for the two test datasets in [7], which is an order of magnitude higher than in HDT.
The efficient sparse voxel octree of Laine et al. [6] restricts the per-voxel topology storage overhead to a maximum of two bytes. The exact storage depends on the number of child nodes branching out from each parent, over which the allocation for the per-node child descriptor is amortized. Assuming four descendants on average, the theoretical memory requirement comes to 1.33 bytes, which closely matches their experimental observation. By contrast, HDT keeps the geometric topology at just under two bits per active voxel. The compact storage in HDT is due to its unique data layout: topology storage at intermediate nodes is amortized over a large number of active voxels, while efficient spatial subdivision with the octree limits the proliferation of intermediate nodes.

¹Cells that lie on the solid boundary are termed active voxels. The remaining voxels in a tile, located either inside or outside the object boundary, are termed inactive voxels.
3. VOLUMETRIC TREE FOR GPU
While HDT reflects much of the prior research, its novelty lies in gluing together two contrasting data representations. The proposed hybrid dynamic tree (HDT) combines the characteristics of both the tiled grid and the hierarchical octree. At the topmost level, similar to a tiled grid, HDT comprises a 3D grid of blocks (the root grid). An element of the root grid is subdivided if the solid boundary intersects with the corresponding index space. Such an intersecting root element becomes the root node of an octree that sparsely represents the respective 3D space. This adaptive sub-partitioning continues as in a regular octree until the element size reaches the target resolution. Finally, at the lowest level of the hierarchy, like a tiled grid, each leaf in HDT groups a block of voxels (the leaf grid). To sum up, HDT sandwiches a hierarchical octree structure between two levels of dense grids.
Fig. 1 depicts an HDT with a root grid of size 16³. An octree hierarchy is rooted at each intersecting, green-colored element. Blue and white denote nodes with a uniform state, either full or empty, which are not subdivided. Each leaf grid represents 16³, i.e., 4096 voxels at the target resolution. In this example, two small grids and a three-level octree jointly represent a volumetric resolution of 2048³, much shallower than the eleven-level hierarchy of a standard octree alternative.
Figure 1: Illustration of HDT representation: a root volume of 16 x 16 x 16 elements, octree levels 1-3, and leaf volumes of 16 x 16 x 16 voxels.
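As a quick check of the stated resolution (an elaboration of the example above, not additional data), the three layers in Fig. 1 compose as

\[
  \underbrace{16}_{\text{root grid}} \times \underbrace{2^{3}}_{\text{3 octree levels}} \times \underbrace{16}_{\text{leaf grid}} = 2048,
  \qquad \log_{2} 2048 = 11 \ \text{levels for a pure octree.}
\]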
Conceptually, HDT is composed of two types of primitives, the tree element and the leaf grid, encapsulating a tree node and a leaf volume, respectively. Each element stores its X, Y, and Z coordinates (four bytes each) in the 3D space, and the depth in the HDT (30 bits) at which it is located. Two more bits distinguish the four possible states: full, empty, branch, and boundary. While full and empty are terminal states, an element in the branch or boundary state points (four bytes) to a descendant tree element or to a leaf volume, respectively. Each tree element thus consumes a total of 20 bytes of storage. A branch element overlapping the surface boundary points to a memory address where its eight child nodes are allocated contiguously, whereas each boundary element is mapped to a leaf grid, typically corresponding to a block of 16³, i.e., 4096 cells. As each voxel's state can be one of three possibilities (inside, outside, or surface boundary), two bits are needed to codify the state information. Thus, the states of all voxels in a leaf volume are compactly stored in a chunk of 4096 x 2 bits, i.e., 1024 bytes.
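The layout above can be summarized with the following C-style sketch. Field names and the exact packing are our own reading of the description rather than the paper's declarations, but the sizes match the text: 20 bytes per tree element and 1024 bytes per 16³ leaf grid.

    // Sketch of the two HDT primitives as described in the text.
    // Field names are illustrative; sizes match the paper.
    enum ElementState { EMPTY = 0, FULL = 1, BRANCH = 2, BOUNDARY = 3 };

    struct TreeElement {               // 20 bytes in total
        int x, y, z;                   // 3 x 4 B: coordinates in the 3D index space
        unsigned int depthAndState;    // 30-bit depth packed with the 2-bit state
        unsigned int child;            // 4 B: index of 8 contiguous children (BRANCH)
                                       // or of a leaf grid (BOUNDARY); unused otherwise
    };

    struct LeafGrid {                  // 16^3 = 4096 voxels, 2 bits per voxel
        unsigned int states[256];      // 256 x 4 B = 1024 bytes
    };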
3.1 Parallel HDT Construction on GPU
The HDT construction procedure takes an indexed triangle mesh as input and outputs a hierarchical layout that represents the mesh geometry with a set of tree elements and a set of leaf grids. This conversion of geometric information into volume data is called voxelization. Like the majority of applications in volume graphics that operate only on the surface of the solid, our work focuses on the surface voxelization problem, which identifies only the cells lying on the solid boundary as active voxels. Our GPU-parallel HDT construction consists of three CUDA kernels that are briefly detailed below.
Triangles Mapping to Root Elements. The first kernel maps the triangles of the mesh to the root elements of the HDT based on the triangle-box overlap test [1]. At first sight, it looks as if every triangle in the mesh needs to be intersection-tested against every root element in the HDT. A simple way to reduce this complexity is to compute the bounding box of each triangle first, and then check only the root elements that lie within that bounding region, as depicted in Fig. 2(a). The output of the first kernel is a list of root elements that need to be further subdivided, which is passed to the second kernel as input. For each such element, the output list contains (1) the index of the root element in the buffer of tree elements, and (2) a list of indexes of the overlapping triangles in the triangle pool.
Figure 2: Triangles Mapping and Leaf Processing. (a) Mapping a triangle to the overlapping 3D cells; (b) a thread block of size 16 x 16, one CUDA thread per run of cells in a leaf grid.
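The first kernel could be organized as in the sketch below: one thread per triangle computes the triangle's bounding box in root-grid coordinates and runs the exact overlap test [1] only against the covered elements. This is a simplified sketch under our own assumptions; triBoxOverlap stands in for the test of [1], and for brevity it only counts overlaps per root element rather than building the per-element triangle index lists described above.

    // One thread per triangle: compute its bounding box in root-grid coordinates
    // and test only the covered root elements (sketch; helper names are assumed).
    __global__ void mapTrianglesToRoots(const float3 *verts, const int3 *tris,
                                        int numTris, float3 volMin, float rootSize,
                                        int rootDim, int *overlapCount)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= numTris) return;

        float3 a = verts[tris[t].x], b = verts[tris[t].y], c = verts[tris[t].z];

        // Triangle bounding box, clamped to the root-grid index range.
        int lo[3], hi[3];
        float mn[3]  = { fminf(a.x, fminf(b.x, c.x)), fminf(a.y, fminf(b.y, c.y)), fminf(a.z, fminf(b.z, c.z)) };
        float mx[3]  = { fmaxf(a.x, fmaxf(b.x, c.x)), fmaxf(a.y, fmaxf(b.y, c.y)), fmaxf(a.z, fmaxf(b.z, c.z)) };
        float org[3] = { volMin.x, volMin.y, volMin.z };
        for (int d = 0; d < 3; ++d) {
            lo[d] = max(0, (int)floorf((mn[d] - org[d]) / rootSize));
            hi[d] = min(rootDim - 1, (int)floorf((mx[d] - org[d]) / rootSize));
        }

        // Exact triangle-box overlap test [1] against the candidate elements only.
        for (int z = lo[2]; z <= hi[2]; ++z)
            for (int y = lo[1]; y <= hi[1]; ++y)
                for (int x = lo[0]; x <= hi[0]; ++x)
                    if (triBoxOverlap(a, b, c, volMin, rootSize, x, y, z))  // assumed helper
                        atomicAdd(&overlapCount[(z * rootDim + y) * rootDim + x], 1);
    }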
HDT Branching. Like most hierarchical construction on GPUs, the branching phase constructs the HDT in level order. At each iteration, the CUDA kernel processes every tree element of the current depth in parallel and extends the tree hierarchy by one level. The level-order construction ensures that there are no data dependencies. At successive tree levels, the size of the elements is halved along each of the three dimensions. Once the size of a tree element reaches the target resolution, the hierarchical tree expansion terminates, and each lowest-level tree element is connected to a leaf grid corresponding to the 3D index space represented by the respective element.
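A host-side driver for the level-order branching could look like the following outline. This is schematic only; refineLevel, compactChildren, and the DeviceBuffers type are assumed placeholders, not the paper's actual code.

    // Host-side driver for level-order HDT branching (schematic; helpers assumed).
    // At each level every current element is examined in parallel and, if it
    // overlaps the surface, is split into 8 children with half the edge length.
    void buildHDT(DeviceBuffers &buf, int rootElemSize, int leafSize /* e.g. 16 */)
    {
        int elemSize = rootElemSize;          // edge length covered by one element
        while (elemSize > leafSize) {         // stop when elements reach leaf resolution
            int n = buf.numCurrentElements;
            int threads = 256, blocks = (n + threads - 1) / threads;

            // One thread per element of the current level: classify it as
            // FULL / EMPTY / BRANCH and emit 8 children for BRANCH elements.
            refineLevel<<<blocks, threads>>>(buf.elements, buf.triLists, n, elemSize);

            // Compact the newly created children into the element buffer for the
            // next iteration (e.g. with a parallel prefix sum / stream compaction).
            buf.numCurrentElements = compactChildren(buf);

            elemSize /= 2;                    // elements are halved along each axis
        }
        // Lowest-level elements touching the surface are then linked to leaf grids.
    }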
Leaf Processing. The third kernel processes each 3D cell in the leaf volumes to determine its state: inside, outside, or exactly on the surface bounding the solid. As detailed previously, for a typical HDT configuration with a leaf grid size of 16³, all voxels of a leaf volume are stored in a contiguous 1024-byte block of memory. In our GPU implementation of leaf processing, a thread block comprises 16 x 16 threads, and each CUDA thread sets the states of 16 consecutive voxels, as shown by the red-colored block of cells in Fig. 2(b). As each voxel's three possible states are codified in 2 bits, the state information of 16 contiguous cells is stored in a block of 16 x 2 bits, i.e., 4 bytes, in GPU memory. Hence, each thread in a CUDA warp reads a 4-byte word, works on the corresponding 16 cells, and writes the modified word back to memory. A CUDA thread block thus processes a total of 256 x 4 bytes, i.e., the 1024 bytes of data for the 16³ cells of a grid. Such coalesced memory access utilizes the off-chip memory bandwidth optimally, which is crucial for high throughput on GPUs.
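The access pattern described above can be sketched as follows. In this simplified version each thread assembles the 2-bit states of its 16 voxels in a register and issues a single coalesced 4-byte store (rather than a read-modify-write); classifyVoxel and LeafMeta are assumed placeholders for the actual inside/outside/surface test and the per-leaf metadata.

    // One thread per 32-bit word; each word packs the 2-bit states of 16 voxels,
    // so a 16 x 16 thread block covers one 16^3 leaf grid (256 x 4 B = 1024 B).
    __global__ void processLeaves(unsigned int *leafWords, const LeafMeta *meta)
    {
        int leaf = blockIdx.x;                               // one block per leaf grid
        int word = threadIdx.y * blockDim.x + threadIdx.x;   // 0..255 within the leaf

        unsigned int packed = 0;
        for (int i = 0; i < 16; ++i) {
            int v = word * 16 + i;                           // voxel index inside the leaf
            int x = v & 15, y = (v >> 4) & 15, z = v >> 8;   // 16 x 16 x 16 decomposition

            // classifyVoxel() is an assumed helper returning 0..2 for
            // inside / outside / surface.
            unsigned int state = classifyVoxel(meta[leaf], x, y, z);
            packed |= (state & 0x3u) << (2 * i);
        }
        leafWords[leaf * 256 + word] = packed;               // coalesced 4-byte store
    }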
4. RESULTS AND ANALYSIS
4.1 CAD Benchmarking with HDT
HDT is implemented in Visual Studio 2010 with CUDA 7.0, running on a quad-core Intel Core i7 3.5 GHz CPU and an NVIDIA GTX 780 Ti graphics card. Detailed geometric properties of the selected input meshes and the volumetric statistics of the resultant HDT layouts are listed in Table 1. Each model name is accompanied by the number of triangles comprising the respective CAD model. The first row shows the mesh surface area, followed by two rows presenting the model's physical dimensions and bounding volume, respectively. As a first step toward understanding the complexity of volumetric representation with the HDT structure, four HDT configurations are evaluated for each model, targeting geometric resolutions between 1024³ (1K) and 8192³ (8K). The HDT spans a depth between 2 and 5 for these target resolutions, as shown in the following entry of Table 1.
The next two rows of Table 1 report the tree element counts and the number of leaf grids for the different HDT configurations at the various resolutions. In contrast to dense representations, which scale with the bounding volume of the solid, the number of leaf grids in HDT correlates with the surface area of the triangle mesh. For instance, although the Head model has six times as many triangles as the Candle Holder, the mesh surface area of the latter is 22% larger than that of the former. Hence, the leaf grid count of the Candle Holder is commensurately (20%) higher than that of Head at 8K resolution. Moreover, as the number of surface voxels of a solid geometry grows quadratically when the grid resolution doubles, the number of HDT leaf grids grows by a factor of four at each successively higher resolution.

The following entry in Table 1 reports the total number of leaf voxels in the HDT representation, which is the product of the leaf grid count and the voxels per grid (i.e., 4096 cells for a leaf dimension of 16). As discussed in Section 1, minimizing the extra storage overhead needed to maintain the geometric topology in a sparse layout is a key challenge for any extreme-scale volume modeling. The prior state-of-the-art research on volume rendering [6] consumes at least 1 byte per active voxel to track each cell's topology in the sparse voxel octree (SVO) hierarchy. To study the storage effectiveness of HDT, the succeeding rows in Table 1 present the number of active voxels (i.e., cells lying on the surface boundary) and the storage per active voxel, respectively. As shown, at 8K resolution each boundary cell in HDT accounts for 1.46 to 1.66 bits of storage on average, which is five times more compact than the memory overhead reported in [6].
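One reading that is consistent with the per-voxel figures in Table 1 (our interpretation; the paper does not spell out the exact accounting) is to charge only the 20-byte tree elements as topology overhead and amortize them over the active voxels. For Head at 8K, for example,

\[
  \frac{581 \times 10^{3}\ \text{elements} \times 20\ \text{B} \times 8\ \text{bits/B}}
       {56 \times 10^{6}\ \text{active voxels}} \approx 1.66\ \text{bits per active voxel},
\]

and the same accounting reproduces the other 8K entries, e.g., roughly 1.46 bits for Turbine.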
4.2 HDT Construction Phases
Fig. 3 summarizes the GPU execution times of the HDT construction steps at 4K and 8K resolutions. For convenience of discussion, let us denote the triangles mapping time by T1, the branching time by T2, and the leaf processing time by T3. As demonstrated in Fig. 3, T1 grows in proportion to the number of triangles of the respective CAD models. For instance, T1 consumes about 2% of the total HDT construction time for Head and Dragon at an effective resolution of 8192³, whereas for Turbine and Candle Holder T1 is less than 1% of the total GPU time. Fig. 3 further illustrates that, for a fixed configuration of the root volume, the triangles mapping time is independent of the target resolution. Hence, T1 remains unchanged across the two HDT configurations.

Among the three construction steps, the HDT branching time (T2) is the largest contributor to the total GPU time. As shown in Fig. 3, the relative magnitudes of T2 for the different models correlate with the number of tree elements reported in Table 1. For instance, at 8K resolution Dragon, Turbine, and Candle Holder have 0.62, 0.68, and 0.70 million tree elements, respectively; this trend in HDT element counts matches the corresponding T2 values of 1.93 seconds, 2.17 seconds, and 2.39 seconds. Similarly, the HDT leaf processing times (T3) correspond to the numbers of leaf grids of the benchmark inputs documented in Table 1, although in Fig. 3 the differences in T3 across the models are hardly distinguishable. As the leaf processing step is massively parallel, the GPU crunches several hundred million voxels per second. Hence, the T3 values lie in a tight range; for example, at resolution 8192³ the T3 measurements for the four models fall between 1.23 and 1.29 seconds.
Model (#triangles)              Head (230K)              Dragon (173K)            Turbine (58K)            Candle Holder (38K)
Mesh Surface Area (mm^2)        8,956                    10,747                   12,346                   10,987
Dimensions XYZ (mm)             48.6 x 46.0 x 64.4       46.7 x 74.6 x 50.4       48.9 x 48.9 x 31.1       48.4 x 48.9 x 57.7
Bounding Volume (mm^3)          143,923                  175,577                  74,367                   136,515
Effective Resolution            1024 2048 4096 8192      1024 2048 4096 8192      1024 2048 4096 8192      1024 2048 4096 8192
Height of HDT                   2    3    4    5         2    3    4    5         2    3    4    5         2    3    4    5
HDT Elements (x 10^3)           10   37   146  581       10   36   150  621       8    36   153  684       11   44   172  698
HDT Leaf Grids (x 10^3)         3    14   54   218       3    14   60   240       3    15   66   291       4    17   65   262
HDT Leaf Voxels (x 10^6)        14   56   223  894       14   59   245  985       12   61   271  1193      16   68   267  1074
Active Voxels in HDT (x 10^6)   0.87 3.51 14.0 56.0      0.98 3.96 15.8 63.4      1.17 4.67 18.7 74.7      1.05 4.20 16.8 67.2
Bits per Active Voxel           1.84 1.67 1.66 1.66      1.60 1.45 1.52 1.57      1.08 1.22 1.30 1.46      1.62 1.66 1.63 1.66

Table 1: Statistics of the 3D CAD benchmarks.
Figure 3: HDT construction time on GPU, broken down into T1, T2, and T3 per model. (a) Resolution 4096³; (b) Resolution 8192³.
5. CONCLUSIONS
We have presented the HDT structure for compact modeling of volumes at extreme resolutions on the GPU. HDT is designed to address simultaneously the two key challenges pertinent to standard octree-based sparse volume representations. By construction, HDT is a fusion of two opposing data representations: the octree and the tiled grid. This merging of two contrasting data structures unleashes the merits of each while synergistically subduing the limitations of the respective layouts.
6. ACKNOWLEDGMENTS
This work was supported by the National Science Foundation (NSF) under CPS Synergy award number 1329742.
7. REFERENCES
[1] T. Akenine-Möller. Fast 3D Triangle-Box Overlap Testing.
Journal of Graphics Tools, 6(1), Jan. 2002.
[2] J. Chen, D. Bautembach, and S. Izadi. Scalable
Real-time Volumetric Surface Reconstruction. ACM
Trans. Graph., 32(4), July 2013.
[3] C. Crassin, F. Neyret, S. Lefebvre, and E. Eisemann.
GigaVoxels: Ray-guided Streaming for Efficient and
Detailed Voxel Rendering. In Proceedings of Interactive
3D Graphics and Games, I3D, 2009.
[4] J. Kalojanov, M. Billeter, and P. Slusallek. Two-Level
Grids for Ray Tracing on GPUs. Computer Graphics
Forum, 2011.
[5] J. Kalojanov and P. Slusallek. A Parallel Algorithm for
Construction of Uniform Grids. In Proceedings of the
Conference on High Performance Graphics, 2009.
[6] S. Laine and T. Karras. Efficient Sparse Voxel Octrees.
In Proceedings of Interactive 3D Graphics and Games,
I3D, 2010.
[7] K. Museth. VDB: High-resolution Sparse Volumes with
Dynamic Topology. ACM Trans. Graph., 32(3), July
2013.