A GPU-Parallel Construction of Volumetric Tree

Mohammad M. Hossain, College of Computing, Georgia Tech, Atlanta, GA, USA
Thomas M. Tucker, Tucker Innovations, Charlotte, NC, USA
Thomas R. Kurfess, George W. Woodruff School of Mechanical Engineering, Georgia Tech, Atlanta, GA, USA
Richard W. Vuduc, School of Computational Science and Engineering, Georgia Tech, Atlanta, GA, USA

ABSTRACT

Volume data has extensive uses in medical imaging (e.g., MRI scans), in visual effects production (volume rendering and fluid simulation), and in computer-aided design and manufacturing (CAD/CAM) for advanced prototyping such as 3D printing. This work presents a compact hierarchical data structure, dubbed HDT, for extreme-scale volume modeling. The name is shorthand for hybrid dynamic tree. HDT is hybrid in that it fuses two contrasting structures, namely the octree and the tiled grid. Fusing these two opposing data layouts alleviates the limitations inherent to each. We describe the HDT construction algorithm for generating a volumetric representation of a triangle mesh on GPUs. While HDT mirrors much of the existing work on sparse 3D data modeling, our evaluation and comparative studies with prior research demonstrate that HDT is arguably the most storage-effective representation. The geometric topology in HDT consumes just short of two bits per voxel, five times more compact than the current state-of-the-art volumetric structure.

Categories and Subject Descriptors

J.6.6 [Volumetric models]: Hybrid Data Layout, GPU accelerated algorithms; J.2.1 [Massively Parallel Algorithms]; K.3.6 [Computer-Aided Design]

Keywords

Octree, Voxels, GPGPU Acceleration, Rapid Prototyping

1. INTRODUCTION

While the computational demand of 3D data processing has historically confined its use to medical imaging and visual effects production, with the emergence of advanced rapid prototyping technologies and 3D printing, the volumetric representation of complex free-form objects has become central to the computer-aided design and manufacturing (CAD/CAM) industry. Algorithms operating on a volume are generally more robust than their counterparts that deal with explicit or parametric representations such as triangle meshes or NURBS. In volume modeling, the embedding space is split into cubical cells (voxels), and each voxel is characterized as lying inside, outside, or exactly on the surface, depending on its intersection with the solid boundary. Voxels are usually discretized on a uniform grid, partly because a 3D matrix is easy to manipulate, but more importantly because volume processing tools such as convolution kernels and interpolation are geared toward uniform sampling [7]. A fundamental limitation of uniform sampling is that the memory footprint grows in proportion to the volume of the bounding space, which makes the scheme suitable only for low resolutions.
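To make this limitation concrete, consider a back-of-the-envelope estimate (ours, assuming the two-bit per-voxel state encoding adopted later in Section 3) at the highest resolution targeted in this paper:

    8192^3 voxels x 2 bits/voxel = 2^39 x 2 bits = 2^40 bits = 128 GiB,

far beyond the memory of a single GPU, whereas the sparse layouts reported in Section 4 need only a few hundred megabytes of leaf storage (leaf grid counts in Table 1 times 1024 bytes per grid).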
Many approaches attempt to mitigate these high memory requirements by constructing a sparse voxel representation, typically an octree. With a branching factor of only two along each coordinate, the octree offers the finest-grained adaptive sampling. However, a standard sparse voxel octree (SVO) presents two major challenges for parallelization on GPUs. First, being a hierarchical structure, it requires extra storage to maintain the geometric topology; in a uniform 3D grid, by contrast, indexing is implicit and no extra storage is needed to encode the voxel topology. The efficient SVO work by Laine et al. [6] reports a minimum of one byte per voxel for storing the voxel topology in the sparse data layout. To the best of our knowledge, this is the least storage needed for SVO traversal; yet one byte per voxel can still consume a significant share of the limited memory on a GPU. A second fundamental challenge for any tree-based data structure is that it does not lend itself to efficient traversal and dynamic topological adjustment on a GPU's parallel architecture. Since an octree is usually constructed in level order on GPUs, the deep hierarchies required for modeling volumes at extreme scales become a bottleneck.

In this study, we present the hybrid dynamic tree (HDT), which combines an octree with a two-level grid structure. HDT is designed to minimize the memory footprint: each voxel accounts for just 1.66 bits of topology data on average, squeezing the relevant storage by a factor of 5x. Fusing the two contrasting data layouts also restrains the tree hierarchy from growing too tall. The HDT layout is noticeably shallower than an octree, often spanning just 4-5 levels, and allows efficient GPU-parallel processing of the voxels on the solid boundary.

2. BACKGROUND AND RELATED WORK

Parallel volume construction on GPUs, in either dense or sparse representation, has been studied extensively. For accelerated ray tracing, an efficient construction algorithm for uniform grids on modern GPUs was presented in [5]. As uniform grids impose limited resolution, Kalojanov et al. [4] presented a CUDA implementation for ray tracing with a two-level nested grid structure, often termed a tiled grid. In a tiled grid, the volume is first partitioned into uniform blocks (called tiles or bricks), and each intersecting block is then partitioned into voxels (see the indexing sketch below). Being a uniform sampling technique, tiled grids inherit the same constraints as uniform 3D grids for extreme-scale volume representation. At the other extreme of a dense regular grid, with full dyadic refinement and data stored only in leaves, we have a complete octree. While the octree is space efficient, its small branching factor results in a very deep hierarchy that is difficult to traverse and update dynamically on a GPU. In between, we can choose different branching factors at each level, resulting in a hierarchical grid or N^3-tree, which has been used in volume rendering and volume reconstruction from 3D scanning [3, 7, 2]. Depending on application characteristics, an N^3-tree can be configured to trade off memory efficiency (low N, deep tree) against traversal efficiency (large N, shallow tree).
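To make the implicit-indexing point concrete, the following sketch (ours, not taken from [4] or [5]; the tile edge of 16 and the top-level grid size are assumptions, chosen to match the leaf grids used later in HDT) shows how a two-level tiled grid locates a voxel purely by arithmetic on its coordinates, with no stored topology:

    #include <cstdint>

    // Illustrative two-level addressing in a tiled grid: the tile owning a voxel and the
    // voxel's slot inside that tile follow from its coordinates alone, so a dense or tiled
    // grid needs no explicit topology storage. TILE and GRID_TILES are assumed values.
    constexpr int TILE = 16;        // tile edge length (16^3 voxels per tile)
    constexpr int GRID_TILES = 16;  // hypothetical 16^3 top-level grid of tiles

    __host__ __device__ inline void locateVoxel(int x, int y, int z,
                                                uint32_t* tileIdx, uint32_t* cellIdx)
    {
        const int tx = x / TILE, ty = y / TILE, tz = z / TILE;   // which tile
        const int lx = x % TILE, ly = y % TILE, lz = z % TILE;   // offset inside the tile
        *tileIdx = (uint32_t)((tz * GRID_TILES + ty) * GRID_TILES + tx);
        *cellIdx = (uint32_t)((lz * TILE + ly) * TILE + lx);
    }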
Our goal is to design a memory-efficient data structure that compactly embeds the geometric topology of the active voxels (cells that lie on the solid boundary; the remaining voxels of a tile, lying entirely inside or outside the object, are termed inactive), and at the same time lends itself to parallel construction and algorithmic processing on GPU accelerators.

Prior research proposals have targeted similar objectives, but some fall short of extreme-scale resolutions: GigaVoxels [3] reaches a maximum of 512^3, and the work by Chen et al. [2] demonstrates resolutions up to 1024^3. Among these studies, VDB [7] is structurally the most closely related to our work, reporting results at 4096^3 resolution. However, VDB is designed for CPU platforms, a design point that differs strikingly from the principles behind HDT. Due to the memory overhead of custom software-level caching and the large branching factors of its B-tree-like layout, each active voxel in VDB accounts for 7.5 bytes of storage on average, as reported for the two test datasets in [7], an order of magnitude more than in HDT. The efficient sparse voxel octree of Laine et al. [6] limits the per-voxel topology overhead to at most two bytes. The exact storage depends on the number of child nodes branching out from a parent, over which the per-node child descriptor allocation is amortized. Assuming four descendants on average, the theoretical memory requirement comes to 1.33 bytes per voxel, which closely matches their experimental observations. In contrast, HDT keeps the geometric topology at just short of two bits per active voxel. This compact storage follows from HDT's data layout: topology storage at intermediate nodes is amortized over a large number of active voxels, while efficient spatial subdivision with the octree limits the proliferation of intermediate nodes.

3. VOLUMETRIC TREE FOR GPU

While HDT reflects much of this prior research, its novelty lies in gluing together two contrasting data representations. The proposed hybrid dynamic tree (HDT) combines the characteristics of both the tiled grid and the hierarchical octree. At the topmost level, similar to a tiled grid, HDT comprises a 3D grid of blocks (the root grid). An element of the root grid is subdivided if the solid boundary intersects its index space. Such an intersecting root element becomes the root node of an octree that sparsely represents the corresponding 3D space. This adaptive subdivision continues as in a regular octree until the element size reaches the target resolution. Finally, at the lowest level, as in a tiled grid, each leaf of the HDT groups a block of voxels (the leaf grid). In short, HDT sandwiches a hierarchical octree between two levels of dense grids. Fig. 1 depicts an HDT with a root grid of size 16^3. An octree hierarchy is rooted at each intersecting (green-colored) element. Blue and white denote nodes of uniform state, either full or empty, which are not subdivided. Each leaf grid represents 16^3 = 4096 voxels at the target resolution. In this example, the two small grids and a three-level octree jointly represent a volumetric resolution of 2048^3, with a hierarchy much shallower than the eleven levels a standard octree would require.

Figure 1: Illustration of HDT representation (root volume of 16 x 16 x 16, three octree levels, and leaf volumes of 16 x 16 x 16).

Conceptually, HDT is composed of two types of primitives, the Tree Element and the Leaf Grid, encapsulating a tree node and a leaf volume, respectively. Each element stores its X, Y, and Z coordinates in the 3D space (four bytes each) and the depth in the HDT at which it is located (30 bits). Two more bits distinguish the four possible states: full, empty, branch, and boundary. While full and empty are terminal states, an element in the branch or boundary state holds a four-byte pointer to its descendant tree elements or its leaf volume, respectively. Each tree element thus consumes 20 bytes of storage in total. A branch element overlapping the surface boundary points to a memory address where its eight children are allocated contiguously, whereas each boundary element maps to a leaf grid, typically a block of 16^3 = 4096 cells. As each voxel's state can take one of three values (inside, outside, or on the surface boundary), two bits suffice to encode it; the states of all voxels in a leaf volume are therefore stored compactly in a chunk of 4096 x 2 bits, i.e., 1024 bytes.
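The layout just described can be written down as a pair of plain structures. The sketch below is ours: the identifiers are illustrative and the 30/2-bit packing of depth and state into one word is an assumed convention; the paper fixes only the field widths and the 20-byte total.

    #include <cstdint>

    // Sketch of the two HDT primitives described above (identifiers are ours).
    enum HdtState : uint32_t { EMPTY = 0, FULL = 1, BRANCH = 2, BOUNDARY = 3 };  // 2-bit state

    struct TreeElement {             // 20 bytes in total
        int32_t  x, y, z;            // 3 x 4 bytes: coordinates in the 3D index space
        uint32_t depthAndState;      // 30-bit depth packed with the 2-bit state (assumed packing)
        uint32_t child;              // 4 bytes: index of 8 contiguous children (BRANCH)
                                     //          or of a leaf grid (BOUNDARY); unused otherwise
    };

    struct LeafGrid {                // 16^3 = 4096 voxels at 2 bits each = 1024 bytes
        uint32_t states[256];        // 256 x 32 bits = 4096 x 2 bits
    };

    __host__ __device__ inline uint32_t depthOf(const TreeElement& e) { return e.depthAndState >> 2; }
    __host__ __device__ inline HdtState stateOf(const TreeElement& e) { return HdtState(e.depthAndState & 0x3u); }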
3.1 Parallel HDT Construction on GPU

The HDT construction procedure takes an indexed triangle mesh as input and outputs a hierarchical layout that represents the mesh geometry with a set of tree elements and a set of leaf grids. This conversion of geometric information into volume data is called voxelization. Like the majority of volume graphics applications, which operate only on the surface of the solid, our work focuses on the surface voxelization problem, identifying as active only the cells lying on the solid boundary. Our GPU-parallel HDT construction consists of three CUDA kernels, briefly detailed below.

Triangle Mapping to Root Elements. The first kernel maps the triangles of the mesh to the root elements of the HDT based on a triangle-box overlap test [1]. At first sight, it appears that every triangle in the mesh must be tested against every root element of the HDT. A simple way to reduce this complexity is to compute the bounding box of each triangle first, and then test only the root elements that lie within that bounding region, as depicted in Fig. 2(a). The output of the first kernel is a list of root elements that need to be further subdivided, which is passed to the second kernel as input. For each such element, the output list contains (1) the index of the root element in the buffer of tree elements, and (2) a list of indices of the overlapping triangles in the triangle pool.

Figure 2: Triangle mapping (a) and leaf processing (b); in (b), a thread block of 16 x 16 CUDA threads covers the 3D cells of one leaf grid.

HDT Branching. Like most hierarchical construction on GPUs, the branching phase builds the HDT in level order. At each iteration, the CUDA kernel processes every tree element of the current depth in parallel and extends the tree hierarchy by one level. The level-order construction ensures that there are no data dependencies. At successive tree levels, the element size is halved along each of the three dimensions. Once the element size reaches the target resolution, the hierarchical spanning terminates, and each lowest-level tree element is connected to a leaf grid corresponding to the 3D index space that the element represents.

Leaf Processing. The third kernel processes each 3D cell of the leaf volumes to determine its state: inside, outside, or exactly on the surface bounding the solid. As detailed previously, for a typical HDT configuration with a leaf grid size of 16^3, all voxels of a leaf volume are stored in a contiguous 1024-byte memory block. In our GPU implementation of leaf processing, a thread block comprises 16 x 16 threads, and each CUDA thread sets the states of 16 consecutive voxels, shown as the red block of cells in Fig. 2(b). As each voxel's three possible states are encoded in 2 bits, the state information of 16 contiguous cells occupies a block of 16 x 2 bits, i.e., 4 bytes of GPU memory. Hence, each thread in a CUDA warp reads a 4-byte word, works on the corresponding 16 cells, and writes the modified word back to memory. A CUDA thread block thus processes 256 x 4 bytes, i.e., the 1024 bytes of data for the 16^3 cells of one grid. Such coalesced memory access makes optimal use of off-chip memory bandwidth, which is crucial for high throughput on GPUs.
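A minimal sketch of this access pattern follows. It reuses the TreeElement and LeafGrid sketches above; classifyVoxel() is a stub standing in for the actual inside/outside/boundary test, whose details are not given here, and the assumption that elem.x/y/z hold the leaf's origin in voxel coordinates is ours.

    // Placeholder for the real surface classification; returns a 2-bit state code.
    __device__ uint32_t classifyVoxel(int x, int y, int z) { return 0u; }

    // One thread block (16 x 16 threads) owns one 16^3 leaf grid; each thread owns one
    // 32-bit word, i.e. the 2-bit states of 16 consecutive voxels along x.
    __global__ void processLeaves(const TreeElement* boundaryElems, LeafGrid* leafGrids)
    {
        const TreeElement elem = boundaryElems[blockIdx.x];   // one boundary element per block
        uint32_t* words = leafGrids[elem.child].states;

        const int w = threadIdx.y * 16 + threadIdx.x;         // word index 0..255; consecutive
        const int y = w & 15;                                 //   threads touch consecutive words,
        const int z = w >> 4;                                 //   giving coalesced 4-byte accesses

        uint32_t packed = words[w];                           // read the 4-byte word
        for (int x = 0; x < 16; ++x) {                        // update its 16 two-bit states
            const uint32_t s = classifyVoxel(elem.x + x, elem.y + y, elem.z + z) & 0x3u;
            packed = (packed & ~(0x3u << (2 * x))) | (s << (2 * x));
        }
        words[w] = packed;                                    // write the modified word back
    }

    // Hypothetical launch: processLeaves<<<numBoundaryElems, dim3(16, 16)>>>(elems, grids);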
4. RESULTS AND ANALYSIS

4.1 CAD Benchmarking with HDT

HDT is implemented in Visual Studio 2010 with CUDA 7.0, running on a quad-core Intel Core i7 3.5 GHz CPU and an NVIDIA GTX 780 Ti graphics card. Detailed geometric properties of the selected input meshes and volumetric statistics of the resulting HDT layouts are listed in Table 1. Each model name is accompanied by the number of triangles in the respective CAD model. The first row gives the mesh surface area, followed by two rows presenting the model's physical dimensions and bounding volume, respectively. As a first step toward understanding the complexity of volumetric representation with the HDT structure, four HDT configurations are evaluated for each model, targeting geometric resolutions between 1024^3 (1K) and 8192^3 (8K). The HDT spans a depth between 2 and 5 for these target resolutions, as shown in the corresponding entry of Table 1.

The next two rows of Table 1 report the tree element counts and the number of leaf grids for the different HDT configurations at the various resolutions. In contrast to dense representations, which scale with the bounding volume of the solid, the number of leaf grids in HDT correlates with the surface area of the triangle mesh. For instance, although the Head model has 6x as many triangles as the Candle Holder, the mesh surface area of the latter is 22% larger than that of the former; hence the leaf grid count of the Candle Holder is commensurately (20%) higher than that of the Head at 8K resolution. Moreover, as the number of surface voxels of a solid grows quadratically when the grid resolution doubles, the number of HDT leaf grids grows by a factor of four at each successively higher resolution. The following entry in Table 1 reports the total number of leaf voxels in the HDT representation, which is the product of the leaf grid count and the voxels per grid (4096 cells for a leaf dimension of 16).

As discussed in Section 1, minimizing the extra storage needed to maintain the geometric topology in a sparse layout is a key challenge for any extreme-scale volume model. The prior state-of-the-art work on volume rendering [6] consumes at least one byte per active voxel to track each cell's topology in the sparse voxel octree (SVO) hierarchy. To study the storage effectiveness of HDT, the last two rows of Table 1 present the number of active voxels (cells lying on the surface boundary) and the storage per active voxel, respectively. As shown, at 8K resolution each boundary cell in HDT accounts for 1.46-1.66 bits of storage on average, five times more compact than the memory overhead reported in [6].
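The bits-per-active-voxel figures can be reproduced from the element layout of Section 3: the 20-byte tree elements are the only topology overhead, and their cost is amortized over all active voxels. Taking the Head model at the 8192^3 setting as a worked example (our arithmetic, using the rounded counts from Table 1):

    (581 x 10^3 elements x 20 bytes x 8 bits/byte) / (56.0 x 10^6 active voxels) = 1.66 bits per active voxel,

which matches the last row of the table; the other models and resolutions follow the same pattern.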
Table 1: Statistics of the 3D CAD benchmarks.

Head (230K triangles): surface area 8,956 mm^2; dimensions 48.6 x 46.0 x 64.4 mm; bounding volume 143,923 mm^3
  Effective Resolution              1024   2048   4096   8192
  Height of HDT                        2      3      4      5
  HDT Elements (x 10^3)               10     37    146    581
  HDT Leaf Grids (x 10^3)              3     14     54    218
  HDT Leaf Voxels (x 10^6)            14     56    223    894
  Active Voxels in HDT (x 10^6)     0.87   3.51   14.0   56.0
  Bits per Active Voxel             1.84   1.67   1.66   1.66

Dragon (173K triangles): surface area 10,747 mm^2; dimensions 46.7 x 74.6 x 50.4 mm; bounding volume 175,577 mm^3
  Effective Resolution              1024   2048   4096   8192
  Height of HDT                        2      3      4      5
  HDT Elements (x 10^3)               10     36    150    621
  HDT Leaf Grids (x 10^3)              3     14     60    240
  HDT Leaf Voxels (x 10^6)            14     59    245    985
  Active Voxels in HDT (x 10^6)     0.98   3.96   15.8   63.4
  Bits per Active Voxel             1.60   1.45   1.52   1.57

Turbine (58K triangles): surface area 12,346 mm^2; dimensions 48.9 x 48.9 x 31.1 mm; bounding volume 74,367 mm^3
  Effective Resolution              1024   2048   4096   8192
  Height of HDT                        2      3      4      5
  HDT Elements (x 10^3)                8     36    153    684
  HDT Leaf Grids (x 10^3)              3     15     66    291
  HDT Leaf Voxels (x 10^6)            12     61    271   1193
  Active Voxels in HDT (x 10^6)     1.17   4.67   18.7   74.7
  Bits per Active Voxel             1.08   1.22   1.30   1.46

Candle Holder (38K triangles): surface area 10,987 mm^2; dimensions 48.4 x 48.9 x 57.7 mm; bounding volume 136,515 mm^3
  Effective Resolution              1024   2048   4096   8192
  Height of HDT                        2      3      4      5
  HDT Elements (x 10^3)               11     44    172    698
  HDT Leaf Grids (x 10^3)              4     17     65    262
  HDT Leaf Voxels (x 10^6)            16     68    267   1074
  Active Voxels in HDT (x 10^6)     1.05   4.20   16.8   67.2
  Bits per Active Voxel             1.62   1.66   1.63   1.66

4.2 HDT Construction Phases

Fig. 3 summarizes the GPU execution time of the HDT construction steps at the 4K and 8K resolutions. For convenience of discussion, let the triangle mapping time be T1, the branching time T2, and the leaf processing time T3. As demonstrated in Fig. 3, T1 grows in proportion to the number of triangles of the respective CAD model. For instance, T1 consumes about 2% of the total HDT construction time for Head and Dragon at an effective resolution of 8192^3, whereas for Turbine and Candle Holder T1 is less than 1% of the total GPU time. Fig. 3 further illustrates that, for a fixed configuration of the root volume, the triangle mapping time is independent of the target resolution; hence T1 remains unchanged across the two HDT configurations.

Figure 3: HDT construction time on GPU (time in seconds for T1, T2, and T3 across the four models; (a) resolution 4096^3, (b) resolution 8192^3).

Among the three construction steps, the HDT branching time (T2) is the largest contributor to the total GPU time. As shown in Fig. 3, the relative magnitudes of T2 for the different models correlate with the numbers of tree elements reported in Table 1. For instance, at 8K resolution Dragon, Turbine, and Candle Holder have 0.62, 0.68, and 0.70 million tree elements, respectively, and this trend in HDT element counts matches the corresponding T2 values of 1.93 seconds, 2.17 seconds, and 2.39 seconds. Similarly, the HDT leaf processing times (T3) track the numbers of leaf grids of the benchmark inputs documented in Table 1, although in Fig. 3 the differences in T3 across the models are hardly distinguishable. Because the leaf processing step is massively parallel, the GPU crunches several hundred million voxels per second; hence the T3 values lie within a tight range. For example, at resolution 8192^3, the T3 measurements for the four models fall between 1.23 and 1.29 seconds.

5. CONCLUSIONS

We have presented the HDT structure for compact modeling of volumes at extreme resolutions on the GPU. HDT is designed to address simultaneously the two key challenges of standard octree-based sparse volume representations. By design, HDT is a fusion of two opposing data representations, the octree and the tiled grid. Merging these contrasting data structures unleashes their merits while synergistically subduing the limitations of the respective layouts.

6. ACKNOWLEDGMENTS

This work was supported by the National Science Foundation (NSF) under CPS Synergy award number 1329742.

7. REFERENCES
[1] T. Akenine-Möller. Fast 3D Triangle-Box Overlap Testing. Journal of Graphics Tools, 6(1), Jan. 2002.
[2] J. Chen, D. Bautembach, and S. Izadi. Scalable Real-time Volumetric Surface Reconstruction. ACM Trans. Graph., 32(4), July 2013.
[3] C. Crassin, F. Neyret, S. Lefebvre, and E. Eisemann. GigaVoxels: Ray-guided Streaming for Efficient and Detailed Voxel Rendering. In Proceedings of Interactive 3D Graphics and Games (I3D), 2009.
[4] J. Kalojanov, M. Billeter, and P. Slusallek. Two-Level Grids for Ray Tracing on GPUs. Computer Graphics Forum, 2011.
[5] J. Kalojanov and P. Slusallek. A Parallel Algorithm for Construction of Uniform Grids. In Proceedings of the Conference on High Performance Graphics, 2009.
[6] S. Laine and T. Karras. Efficient Sparse Voxel Octrees. In Proceedings of Interactive 3D Graphics and Games (I3D), 2010.
[7] K. Museth. VDB: High-resolution Sparse Volumes with Dynamic Topology. ACM Trans. Graph., 32(3), July 2013.