
GPU-based Ray Tracing of Dynamic Scenes

Martin Reichl, Robert Dünger, Alexander Schiewe, Thomas Klemmer, Markus Hartleb, Christopher Lux, Bernd Fröhlich

Prof. Dr. rer. nat. Bernd Fröhlich
Fakultät Medien, Bauhaus-Universität Weimar
Bauhausstr. 11, 99423 Weimar
Tel.: +49 (0) 3643 - 58 3732, Fax: +49 (0) 3643 - 58 3709
E-Mail: [email protected]

Abstract: This paper presents the design and implementation of a GPU-based ray tracing system for dynamic scenes consisting of a set of individual, non-deformable objects. The triangles of each object are organized in a separate Kd-tree. A bounding volume hierarchy (BVH) is built on top of these Kd-trees. The BVH is updated and uploaded into GPU memory on a frame-by-frame basis, whereas the Kd-trees are uploaded only once. We use a single ray tracing kernel to handle all ray generations. Code execution path divergence is limited by a ray stack, which treats all ray types in the same way. We make effective use of the very limited high-bandwidth memory of the GPU's multiprocessors by using a smart stack for the acceleration structure traversals. The results show that our GPU implementation of a two-level acceleration approach performs between 40 and 105 percent as fast as a single Kd-tree containing the entire scene.

Keywords: ray tracing, multi-level hierarchy, animation, dynamic scenes

1 Introduction

Interactive ray tracing of non-trivial scenes is just becoming feasible on single graphics processing units (GPUs). Recent work in this area focuses on building effective acceleration structures that work well under the constraints of current GPUs. Most approaches target static scenes and only allow navigation through the virtual scene. So far, support for dynamic scenes has not been considered for GPU implementations.

We have developed a GPU-based ray tracing system for dynamic scenes consisting of a set of individual objects. Each object may move independently, but its geometry and topology are static. We use a two-level acceleration structure for this constrained scenario, similar to the approach taken by Wald et al. [WBS03]. Instead of using Kd-trees for both levels, we organize the individual objects in a bounding volume hierarchy (BVH), which is built on top of per-object Kd-trees. On the CPU, the BVH is rebuilt every time the position of an object changes and is then updated in GPU memory. Our implementation uses a single kernel on the GPU to handle all ray generations, except the primary rays, which are handled by a separate rasterization pass.

Figure 1: Our work is motivated by assembly planning and product evaluation applications. In these scenarios the interactive manipulation of individual objects is an important requirement.

The implementation on the GPU requires particular attention to limiting the divergence of code execution paths across multiple rays. We use a ray stack to avoid branching into special cases for the treatment of reflection, refraction and shadow rays. The GPU's multiprocessors provide only a very limited amount of extremely fast on-chip memory, which is important for the stack-based traversal of the BVH and the Kd-trees. Our smart stack uses on-chip memory as long as possible and overflows into the slower global memory only when necessary.

Ray tracing is an important technology for virtual environments since it greatly improves visual quality and enhances depth perception.
In virtual environments, we mostly deal with a set of individual objects, which can be translated, rotated and otherwise manipulated. This paper reports on the experiences with our GPU-based ray tracing system for such dynamic scenes consisting of individual objects or parts. We analyze the behavior of our implementation in detail and discuss the advantages, disadvantages and limitations of a two-level acceleration structure. We have also compared the performance of our BVH/Kd-tree combination to a single Kd-tree implementation, which cannot be rebuilt on a frame-by-frame basis and thus does not support dynamic objects. The results show that the two-level hierarchy performs between 40 and 105 percent as fast as the pure Kd-tree implementation. While this is an encouraging result, it still leaves room for improvement of the data structures and traversal approaches for dynamic scenes.

2 Related Work

First approaches to utilizing GPUs for accelerating ray tracing were limited by the capabilities of the early generations of programmable graphics processors. With the Ray Engine, Carr et al. [CHH02] were the first to implement a real-time ray tracing technique augmented by the GPU. In their approach the GPU was only used for ray-triangle intersections. The performance of this approach was limited by the required memory transfers for geometry download and intersection-result upload. At the same time, Purcell et al. [PBMH02] simulated the use of the GPU as a stream processor to enable the implementation of a complete ray tracing algorithm on a GPU. This approach decomposed the ray tracing algorithm into smaller subtasks (i.e. ray generation, traversal, intersection tests, shading), which were processed in multiple rendering passes. A regular grid was chosen as the acceleration structure due to its simple traversal computations.

For static scenes the Kd-tree is one of the most efficient acceleration structures [Hav00], in particular if its construction is based on a surface area heuristic (SAH) [?] to minimize traversal costs. Traversal of such advanced hierarchical data structures requires the use of a stack, which is still difficult to implement efficiently on today's GPUs. Foley and Sugerman [FS05] presented two techniques for stackless Kd-tree traversal (kd-restart, kd-backtrack), which require a number of redundant traversal steps. To reduce this traversal overhead, Horn et al. [HSHH07] added a small fixed-size stack. Recently, Popov et al. [PGSS07] introduced a stackless, GPU-based Kd-tree traversal algorithm which requires significantly fewer traversal steps than stack-based methods or kd-restart. With the additional storage of links to adjacent nodes, called ropes, a large number of down-traversal steps is avoided since traversal may start at a leaf node.

Besides Kd-trees, bounding volume hierarchies (BVHs) can be used as acceleration structures on the GPU. Thrane and Simonsen [TS05] presented a stackless BVH traversal algorithm which can be efficiently implemented on the GPU. Their approach outperformed the stackless kd-restart and kd-backtrack traversals on moderately sized scenes. Recently, Günther et al. [GPSS07] presented a BVH-based packet traversal algorithm using a shared stack. They were able to achieve near real-time results for large static scenes.

Recent work on ray tracing of animated and interactive scenes has mostly focused on acceleration structures that can be quickly built or rebuilt [WMG+07].
Wald and Havran [WH06] showed how to rebuild a Kd-tree for the entire scene on a per-frame basis. Yoon et al. [YCM07] presented a technique for locally restructuring parts of a BVH instead of rebuilding the entire structure, which works well if only small portions of the scene are manipulated. Other approaches use nested or multi-level hierarchies, where a top-level hierarchy maintains only the movable scene objects while an efficient acceleration structure is used as a low-level hierarchy holding the object geometries. This allows for the manipulation of scene objects while avoiding unnecessary reconstruction of the acceleration structures for the non-deformable scene objects. Lext and Akenine-Möller [LAM01] showed how to use a grid as a top-level hierarchy, which allows very fast rebuilds. A similar approach was taken by Wald et al. [WBS03] using Kd-trees for both the top-level and low-level hierarchies. While all these techniques for dynamic or animated scenes focus on CPU implementations, we have explored the use of multi-level hierarchies for dynamic scenes in the context of GPU-based real-time ray tracing.

3 Two-Level Hierarchy

Common virtual reality applications allow the interactive manipulation of objects or parts of objects in the scene. Most manipulations do not change the shape of objects, and in most cases only rigid body transformations are applied to individual objects. Our goal is to support such scenarios. Assembly planning in the automotive industry is a typical example, which requires the manipulation of individual car and engine parts.

Figure 2: A two-level hierarchy consisting of a BVH maintaining the top-level scene structure while the static scene geometries in the leaf nodes are organized in Kd-trees.

In a typical scene graph-based scenario, objects are organized in hierarchical structures to achieve hierarchical transformations. In the following we will only consider scenes consisting of a set of individual objects, where each object may be associated with an affine transformation. Each object is static with respect to its mesh connectivity and geometry. Most scene graph hierarchies can be flattened to create the required structure. This allows the use of efficient acceleration structures on the level of static scene objects while maintaining the dynamic scene structure in a separate top-level acceleration structure. This approach of using multiple nested acceleration structures of potentially different types is typically called a multi-level hierarchy [LAM01, WBS03]. In our implementation we use only a two-level hierarchy (TLH) of acceleration structures, which also allows single-level instancing, but does not support multi-level instancing schemes.

Using a two-level hierarchy allows combining acceleration structures with different characteristics regarding their creation time. Choosing an acceleration structure that allows for the most efficient ray traversals at the cost of an extended build time is favorable for the static object geometries. The acceleration structure for each of these objects remains unchanged after the initial build process and therefore needs to be constructed only once. This enables us to employ algorithms generating traversal-cost optimized structures in a preprocess for each interactive scene object. In contrast, due to user interaction with the scene, the top-level structure potentially needs to be rebuilt on a frame-to-frame basis.
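To make this organization concrete, the following C-style sketch shows one possible layout of the two levels. The type and field names are illustrative assumptions, not the paper's actual ones; the real node layouts are described in section 4.2.1.

    // Hypothetical sketch of the two-level hierarchy; names are illustrative.
    struct KdTree;                        // opaque: per-object SAH Kd-tree, built once

    struct TLHLeaf {
        float         worldToObject[16];  // the object's current rigid transformation
        const KdTree* tree;               // reference to the object's static Kd-tree
    };

    // The top level is rebuilt whenever an object moves; the Kd-trees are not.
    // Several leaves may reference the same KdTree, which yields instancing.

Because a leaf stores only a transformation and a reference, instancing the same object several times does not replicate any geometry.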
Such a top-level structure requires an acceleration structure with very fast reconstruction or restructuring times, at the expense of being less optimized for ray traversals.

For the GPU implementation of the two-level hierarchy we chose to combine a bounding volume hierarchy (BVH) of axis-aligned bounding boxes (AABBs), used as the dynamic top-level hierarchy, with Kd-trees used for organizing the static object geometries (cf. figure 2). Both structures are constructed using a surface area heuristic (SAH) cost estimation to minimize the costs for traversal and intersection tests. The top-level BVH is reconstructed from scratch every time the scene structure is changed through user input or animation. On these events, we employ a method for fast BVH rebuilds based on the technique presented by Wald et al. [WBS07]. Note that the reconstruction is done entirely on the CPU.

Figure 3: Initial transformations are computed for all static scene objects, which results in more tightly fitting bounding boxes in the objects' coordinate systems. Figure (a) shows the object orientation in global space, which is computed in a preprocess. Figure (b) shows the built Kd-tree and AABB in object space. The resulting Kd-tree OBB and the BVH AABB in global space are shown in figure (c).

In a preprocess we compute oriented bounding boxes (OBBs) for the individual scene objects using Gottschalk's fitting technique [GLM96] and assign an initial transformation matrix to each object to place it appropriately in the global coordinate system. The Kd-trees for the individual scene objects are built in the local coordinate system of each object. The BVH is built in top-down order based on axis-aligned bounding boxes around the OBBs of the individual objects (cf. figure 3) on a frame-by-frame basis. Each leaf node of the BVH holds a transformation matrix and a reference to the contained Kd-tree. Thus instancing is directly supported, which avoids unnecessary geometry replication.

The top-level BVH structure needs frequent updates in GPU-side memory, whereas the static Kd-tree structures containing the actual scene geometries are uploaded only once. We chose a full binary BVH layout where each inner node has exactly two descendants. For a fixed scene size the BVH data structure has a constant memory footprint. Thus the memory has to be allocated only once on the GPU side, avoiding memory fragmentation and minimizing memory management during run-time. The following section describes the details of our implementation.

4 Two-Level Hierarchy on the GPU

For the GPU implementation we chose NVIDIA's CUDA API [NVI07] for directly accessing the compute features of the NVIDIA G80 family of GPUs. This enables us to take advantage of particular hardware features which are not accessible through common graphics APIs. This section introduces the characteristics of the Compute Unified Device Architecture (CUDA) before describing our TLH GPU implementation in detail. Data layout and algorithmic details (e.g. traversal, ray generation, stacks) are presented, as well as the memory access optimizations introduced by our approach.

4.1 G80 Architecture & CUDA

Today's GPUs may be viewed as highly parallel streaming architectures. More precisely, they are multi-processor/multi-ALU machines. Each multiprocessor can be classified as a Concurrent Read, Concurrent Write Parallel Random Access Machine (CRCW PRAM) [PGSS07].
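As a generic illustration of this execution model (this is not the paper's actual ray tracing kernel), work is expressed as kernels launched over a grid of thread blocks; each block runs on one multiprocessor and is executed warp by warp. The terminology is explained in the remainder of this section.

    // Generic CUDA example: one thread per pixel, grouped into blocks.
    __global__ void perPixel(float* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;   // guard against partial blocks
        out[y * width + x] = 0.0f;               // per-pixel work would go here
    }

    void launch(float* d_out, int width, int height)
    {
        dim3 block(8, 8);  // 64 threads per block: two warps of 32 on the G80
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        perPixel<<<grid, block>>>(d_out, width, height);
    }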
CUDA enables direct access to the compute capabilities of G80+ GPUs without the requirement to express algorithms in terms of graphics primitives such as triangles or textures. With CUDA, general memory scatter operations have finally become possible, while they are still very limited in current high-level shading languages. Scattering allows writing to arbitrary memory locations and therefore enables more flexible algorithm implementations, in particular the use of stacks and other dynamic data structures.

CUDA programs are expressed as so-called kernels, which are executed in chunks of threads running in parallel. These chunks are called warps (on current GPUs the warp size is 32 threads), which in turn are grouped into blocks running on individual multiprocessors. The blocks share all resources of a multiprocessor. However, the atomic scheduling unit remains a warp, which executes its threads in SIMD fashion.

The G80 architecture exposes different types of memory with very different bandwidth and latency characteristics. All multiprocessors share direct, uncached access to the global device memory. While this is suited for intra- and inter-processor communication, memory fetches suffer from a relatively high latency. An alternative way to access global memory is through the CUDA texture interfaces, which speed up recurring and spatially coherent fetches from global memory regions by using an on-chip cache. In addition, there are constant and shared memory regions attached to each multiprocessor. These memories exhibit high bandwidth and very small access latencies. Finally, threads have access to a set of dedicated registers. Each multiprocessor possesses only a limited set of overall resources for constant, shared and register memory. As a result, the kernel's requirements on multiprocessor resources limit the number of runnable (active) threads per block.

4.2 Ray Tracing Kernel

We employ a single kernel for implementing the complete ray tracing algorithm, including acceleration structure traversal, ray-triangle intersection, shading and secondary ray generation. For each pixel a thread is created, which executes the ray tracing kernel. To accelerate the ray tracing system, we use a rasterization pass to generate the intersections for the primary rays. The results of this rendering pass are used by the ray tracing kernel to compute secondary effects and shading.

4.2.1 Data Layout

Figure 4: Depth-first, left-ordered binary tree serialization.

For GPU ray tracing, the geometry as well as the acceleration structures are serialized and uploaded to the GPU's global memory. We employ the CUDA texture interface to benefit from cached memory access. The BVH and the Kd-tree structures are serialized in depth-first order as shown in figure 4. With this memory layout for binary trees, the first child of each inner node is located right next to its parent, requiring only a single link per node pointing to the second child. As a result, for any node fetched from global memory the first child is potentially already placed in the cache. Special handling of null pointers is not required since we use a full binary tree layout for the BVH as well as for the Kd-tree (i.e. each inner node has exactly two child nodes). The Kd-tree leaf nodes store only references to individual triangles to prevent triangle replication. The Kd-tree node layout strictly follows the 8-byte scheme proposed by Wald [Wal04].
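The depth-first layout makes child addressing largely implicit, as the following sketch of the indexing scheme shows. This is illustrative only; the field packing is simplified, and the actual 8-byte node layout follows Wald [Wal04].

    // With depth-first serialization the first child of the inner node stored
    // at index i always sits at index i + 1, so each node only stores an
    // explicit link to its second child.
    struct SerializedNode {
        unsigned second;    // array index of the second (right) child
        unsigned payload;   // split plane or leaf data (packing simplified here)
    };

    __device__ unsigned firstChild(unsigned i)  { return i + 1; }   // implicit
    __device__ unsigned secondChild(const SerializedNode* nodes, unsigned i)
    {
        return nodes[i].second;                                     // explicit link
    }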
A BVH node, in contrast, additionally needs to store references to a transformation matrix and an axis-aligned bounding box, requiring 32 bytes per node.

The CUDA-OpenGL interoperation layer provides only limited support for shared data usage. Unfortunately, the scene geometries need to be duplicated to be available both for the primary OpenGL rasterization pass and for the actual CUDA-based ray tracing. Furthermore, since the offscreen buffers holding the primary ray intersection results are updated each frame, they need to be transferred from dedicated OpenGL memory to CUDA-mapped memory every frame, which is an unexpected bottleneck with current drivers. (We measured an effective bandwidth of 7 GiB/s, which is around 10% of the maximum bandwidth of an NVIDIA GeForce 8800 GTX GPU.)

4.2.2 Primary Ray Generation

To take advantage of the rasterization power of current GPUs, we employ a rasterization pass to generate the triangle index and intersection point for each primary ray. The scene is rendered with OpenGL using a fragment program which writes out triangle indices, interpolated barycentric coordinates and a Kd-tree reference. We also use the BVH structure to render the scene objects in front-to-back order to take advantage of hardware-supported early-z culling. The CUDA ray tracing kernel reads back the per-pixel information from the rasterization pass to reconstruct the exact intersection point p. Since p is defined in the Kd-tree coordinate system, it is then transformed into world coordinate space.

4.2.3 Shading

After the reconstruction of the primary intersection point, the shading equation is applied to this point. This implies fetching the associated materials, normals and texture coordinates from global memory. As CUDA currently does not support indexed texture objects, we use a texture atlas and extend the material structures with a texture index. The texture coordinates are transformed on the fly into the actual atlas texture coordinates using a texture description table residing in on-chip constant memory.

4.2.4 Secondary Ray Traversal

After the primary intersection points have been shaded, secondary rays are generated and traced through the TLH. The SIMD GPU execution model leads to sets of implicitly generated ray packets (warps) of secondary rays, which are handled in parallel. Each thread in a warp executes the same instructions at the same time, since all threads of a warp are executed on a single multiprocessor in SIMD fashion. In the case of conditional branching, the whole warp executes all required execution paths by masking individual threads as inactive. Thus it is particularly important to avoid too many possible execution paths.

We use a ray stack in global memory to maximize the number of active threads. Each generated ray and the corresponding ray parameters are pushed onto the stack, which is then processed according to listing 1. Thus the code does not branch into different cases for the various secondary ray types. Instead, all secondary rays follow essentially the same execution path, which limits the number of inactive threads per warp. Nevertheless, if no reflective or transparent material is hit, these threads cannot do meaningful work.
    while (!stack.empty()) {                 // secondary ray stack
        Ray ray = stack.pop();
        Result hit = traverseTLH(ray);
        if (!hit) continue;
        if (isTransparent(hit)) stack.push(refract(ray));
        if (isReflective(hit))  stack.push(reflect(ray));
        Result shadow = traverseTLH(toLight(hit));
        color += shade(ray.intensity, shadow, hit);
    }

Listing 1: Secondary rays are processed using a ray stack instead of branching into different cases for shadow, reflection and refraction rays. The ray stack limits code execution divergence.

The secondary rays traverse the TLH structure in the same way as it would be implemented on the CPU. For each intersected BVH node, the intersections with the two child nodes are calculated and depth-sorted. The traversal then descends into the BVH following the first hit child node (cf. figure 5a). At BVH leaf nodes the ray is transformed into the coordinate system associated with the referenced Kd-tree. Before the actual Kd-tree is traversed, the axis-aligned bounding box of the Kd-tree is intersected to avoid unnecessary traversals. After returning from the Kd-tree traversal, the remaining sub-trees with potential intersections closer to the ray origin than a previously found intersection point are processed.

Figure 5: Detailed schematic view of the TLH traversal. Figure (a) shows the top-level BVH within the global coordinate system. During TLH traversal the ray intersects both BVH child nodes and therefore has to be transformed into both local Kd-tree coordinate systems. Figures (b) and (c) show the traversal of the Kd-trees in their local coordinate spaces.

4.3 Memory Access Optimization

We employ two techniques to reduce the cases where we have to wait for data to be fetched from global memory. First, we store the geometry data in separate lines of a 2D texture to increase texture cache utilization. But, like others, we observed that cache hits become more and more unlikely with subsequent ray generations. The performance advantage of the small 16 KB texture cache is split among all its uses, such as storing geometry data, acceleration structures and materials, and may thus be negligible in the end.

Table 1: Actual stack usage (average, maximum, variance) for the two acceleration structures used in our approach. Data is taken over all primary rays. The maximum allowed tree depth for the Kd-tree is 20; the depth of the BVH is log2(n), where n is the number of objects.

    Scene           BVH avg   BVH max   BVH var   Kd avg   Kd max   Kd var
    BART robots     1.0       6         1.33      1.40     7        1.09
    BART kitchen    0.01      1         0.01      3.63     10       1.14
    Chevy           0.72      5         1.1       0.22     5        0.37
    Chess           0.27      2         0.39      1.53     6        1.02

The second technique for memory access optimization is the use of a smart stack, which uses on-chip shared memory as long as possible and overflows into global memory only when necessary.

Figure 6: Per-thread mixed BVH/Kd-tree stack layout. The upper part of the stack is stored in fast shared device memory, while the oldest stack elements are stored in slow global device memory. Each BVH stack element takes 8 bytes of stack memory for its corresponding node index and the ray parameter of the nearest intersection point with the AABB. Kd-tree stack elements need an additional 4 bytes to store the ray parameter of the exit intersection point.
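A simplified per-thread sketch of such a smart stack is shown below (the names are hypothetical; the actual implementation mixes 8-byte BVH and 12-byte Kd-tree elements in one shared memory region, and its overflow behavior is described next). The newest entries live in a fixed-size window in shared memory, treated as a ring buffer, while the oldest entries spill into a global memory swap area.

    // Sketch of a smart stack: the newest CAP entries reside in on-chip shared
    // memory (used as a ring buffer); older entries spill to global memory.
    template <typename T, int CAP>
    struct SmartStack {
        T*  shm;      // per-thread window in shared memory, CAP elements
        T*  gswap;    // per-thread swap region in global memory
        int size;     // total number of elements on the stack
        int spilled;  // how many of them currently live in the swap region

        __device__ void push(const T& e) {
            if (size - spilled == CAP) {              // shared window full:
                gswap[spilled] = shm[spilled % CAP];  // evict the oldest entry
                ++spilled;
            }
            shm[size++ % CAP] = e;                    // ring-buffer indexing
        }
        __device__ T pop() {                          // caller checks !empty()
            T e = shm[--size % CAP];
            if (spilled > 0 && size == spilled) {     // window ran empty:
                --spilled;
                shm[spilled % CAP] = gswap[spilled];  // pull youngest back in
            }
            return e;
        }
        __device__ bool empty() const { return size == 0; }
    };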
The stack in global memory is used only as a swap region for the oldest stack elements once the maximum size of the shared memory region is reached. While a maximum stack size equal to the height of the tree can be required, the average stack utilization is significantly lower, as can be seen in Table 1. For this reason, Horn et al. [HSHH07] introduced a small, fixed-size stack called a short-stack. In their implementation, if the short-stack underflows, they fall back to a strategy called kd-restart, which restarts the tree traversal with a shortened ray. If the stack overflows, they simply overwrite the oldest entry.

We adapted the concept of the short-stack to create a fast-access version of the BVH and Kd-tree stacks, which resides in the on-chip shared memory of the multiprocessor. As our kernel implementation allows for an active thread count of 64 threads per block, 256 bytes of shared memory are available per thread. This would allow either a short BVH stack of 32 elements or a short Kd-tree stack of 21 elements. Since the two stacks are never used at the same time, we adapted them to work in the same memory region. This is possible because the Kd-tree stack is always empty while a thread runs the BVH traversal, and updates of the BVH stack do not occur during Kd-tree traversal. Figure 6 shows the memory layout and addressing scheme used.

If the short stack grows beyond the memory region dedicated to a thread, the thread uses a second memory region in global memory as a swap area for the oldest stack elements. When the size limit of the short stack is reached, the oldest element (or the two oldest elements, in the case of the larger Kd-tree entries) is transferred to the swap area, and the freed space in shared memory is used for the new stack entry. Once the shared memory stack runs empty, the youngest stack element from the swap area is popped back into the shared memory region. Note that if any thread of a warp has to fall back to global memory, all threads in the warp are punished with high memory latencies due to the swapping operations. However, these cases are extremely rare; for the examples in Table 1 they did not occur at all.

We also considered putting the ray stack for secondary ray handling into shared memory. As this memory region is clearly limited in size, this approach would constrain the memory dedicated to the smart stack even more. Since secondary ray stack operations occur considerably less often than traversal stack operations, the speedup of a smart ray stack was negligible.

5 Results and Discussion

In this section we present and discuss the results of a performance analysis conducted to evaluate the impact of the TLH approach compared to using a single Kd-tree for the entire scene. The chosen test scenes (cf. figure 7) exhibit different characteristics regarding rigid object count, polygon count and the spatial distribution of objects. The BART scenes in particular provide different stress scenarios for ray tracing of dynamic scenes, such as the "teapot in a stadium" problem, hierarchical animations and varying frame-to-frame coherence. Additionally, the chevy scene contains multiple objects with overlapping bounding volumes, which requires a large percentage of the rays to traverse more than one object Kd-tree. The chess scene was chosen because of its uniform distribution of objects, which show no bounding volume overlaps, and its nearly identical triangle count compared to the chevy scene.
The tests were conducted on an Intel Core 2 Duo 2.4 GHz workstation with 2 GiB RAM and an NVIDIA GeForce 8800 GTX graphics board running CUDA 1.1 (driver version 174.55) under Windows XP. All tests were conducted at an image resolution of 512x512 pixels.

5.1 Performance Results

Table 2 shows our performance measurements for the test scenes under different ray recursion depth configurations. We also show the primary ray shading performance to give an estimate of the baseline performance without ray traversals. These numbers include the OpenGL rasterization pass as well as the transfer of the results to CUDA-mapped memory and the intersection point reconstruction and shading performance of the CUDA kernel. For our scenes the rasterization approach is up to six times faster than tracing primary rays, which is due to the relatively small number of triangles (only up to 110k). As noted in section 4.2.1, this approach is still limited by CUDA-OpenGL interoperation constraints and surprisingly low device memory transfer performance, which is expected to improve with newer CUDA and driver revisions.

Table 2: Performance comparison for the different test scenes using either the TLH or a single Kd-tree for the entire scene. The results are measured in frames per second for different depths of the ray tree. Primary ray shading only involves the rasterization pass and a single kernel pass for reconstructing and shading the primary intersection point.

    Scene     Primary        Shadow Only      Reflection + Shadow   5 Bounces
              Shading Only   Kd      TLH      Kd       TLH          Kd      TLH
    Robots    123.21         11.45   8.83     6.02     4.32         1.18    0.77
    Kitchen   125.02         11.55   12.78    13.05    13.61        2.53    2.8
    Chevy     208.53         55.26   23.03    52.6     26.45        26.53   11.19
    Chess     230.51         45.9    38.43    27.08    22.26        7.89    6.55

Figure 7: Test scene overview. (a) BART robots, 162 objects, 110K triangles; (b) BART kitchen, 6 objects, 71K triangles; (c) Chevy, 79 objects, 43K triangles; (d) Chess, 34 objects, 46K triangles.

The rebuild time for the BVH is significantly influenced by the number of dynamic objects in the scene. Our SAH-based implementation, which is similar to Wald et al. [WBS07], exhibits O(n log² n) runtime behavior and allows us to build a BVH for 1000 objects in 9.8 ms. As the test scenes consist of considerably fewer dynamic objects, the BVH rebuild time is negligible for the performance considerations. Since we employ an already published construction algorithm, more comprehensive performance data can be found in the original publication.

The performance of the TLH approach turns out to be an almost constant fraction of the performance of the Kd-tree implementation for a given scene. While Wald et al. observed an overhead of 10 to 20% comparing a TLH to a static Kd-tree in a CPU environment [WBS03], we observe performance penalties of 10 to 60% for the TLH approach, depending on scene characteristics. Interestingly, the chevy model runs at only 42% of the Kd-tree performance, while the BART robots scene achieves 65% despite containing about twice as many dynamic objects. The reason for the limited performance of the chevy model is the significant number of overlapping bounding volumes of the various car parts. With overlapping bounding volumes the probability increases that more than one Kd-tree needs to be traversed to find the first intersection point of a ray.
Generally, two extreme cases can be observed. Without any overlap, the BVH is able to separate two children, and no traversal overhead is observed (as evident in the BART kitchen scenario). In the case of complete overlap, the BVH degenerates to a linear list whose children all need to be processed. The average case between these two extremes profits from the BVH early-out techniques outlined in section 4.2.4. The issue of overlapping bounding volumes could be reduced by using oriented bounding boxes; however, completely avoiding it is often not possible. The chevy model consists of densely arranged objects, which results in many overlapping bounding volumes and thus the lowest performance ratio compared to the other scenes, even those consisting of more dynamic objects.

Figure 8 illustrates the number of Kd-tree traversals measured for the primary shadow rays, comparing the chevy model and the BART robots scene. It is evident that the number of rays traversing four or more Kd-trees before finding the correct intersection is much higher for the chevy scene than for the robots scene. The average number of Kd-tree traversals for the chevy scene is 3.44, in contrast to 1.3 for the robots scene, which clearly explains the lowest performance ratio for the TLH approach compared to a single Kd-tree for the entire scene.

Figure 8: These images illustrate the number of Kd-tree traversals for the first shadow ray traversal for the chevy scene and the BART robots scene. Blue colors indicate up to three, green colors up to six, red colors up to nine and white more than nine Kd-tree traversals per ray.

5.2 GPU Utilization

We found three primary factors influencing the performance of our GPU ray tracing algorithm: the number of memory accesses, the active GPU occupancy and the code execution divergence.

The biggest influence on performance is the large number of memory accesses needed for the traversal of the TLH structure (e.g. bounding volumes, transformation matrices) as well as the accesses to the actual geometry data for intersection tests and shading (e.g. vertices, normals, texture coordinates). Due to the chosen serialized memory layout for binary trees (cf. section 4.2.1), access to the nodes of a BVH or a Kd-tree is more coherent than access to geometry data, since we chose a binary tree structure for the BVH and the Kd-tree and store the first child next to the parent node. Thus the acceleration structures make more efficient use of the texture cache than the geometry data. An experiment selectively disabling the texture cache interface confirmed this assumption: disabling the texture cache for the acceleration structures resulted in a 20% decrease in performance, while disabling the texture cache for the geometry data showed no noticeable effect. Thus there is clearly potential for improving the memory layout of the geometry data by optimizing it with respect to the layout of the Kd-trees.

We also analyzed the ratio of clock cycles spent on actual computations in comparison to waiting cycles. We found that on average 70 to 75% of the overall clock cycles are memory access waiting cycles, clearly dominating the actual computation time per thread. Typically, such waiting cycles can be hidden by the GPU scheduler by swapping blocked warps with runnable ones. The number of runnable warps depends on the number of active threads on a multiprocessor.
As stated before, the number of active threads on a multiprocessor is limited by the kernel's requirements on multiprocessor resources (i.e. registers and shared memory). The kernel of our TLH traversal approach requires 96 registers, limiting the number of active threads per block and multiprocessor to 64 (the G80 provides 8192 registers per multiprocessor, and 8192/96 ≈ 85 threads rounds down to two full warps of 32). This results in only two active warps per multiprocessor. Consequently, the memory access latencies cannot be hidden effectively and the multiprocessors stall. Experiments using an early beta version of the CUDA 2.0 API resulted in a reduction to 59 registers for the same kernel. This doubles the number of active warps per multiprocessor, enabling the scheduler to hide memory access latencies more effectively. First tests showed a performance increase of about 30% for all our test scenes.

The SIMD execution of all threads belonging to a warp further reduces GPU utilization, since individual threads may follow different code execution paths. While bad branching behavior is not a new phenomenon on programmable GPUs [HSHH07], the problem will get even worse, as growing warp sizes are an expected way to raise the computational power of future GPUs. We analyzed the actual code execution divergence of our TLH implementation, defined as the number of divergent branches divided by the total number of branches, using the NVIDIA Visual Profiler [NVI07]. We found that the divergence is around 10% for the BART robots scene and around 5% for the chevy model using a ray recursion depth of three. This indicates that TLH traversal using the global ray stack does not introduce large divergences. Unfortunately, the profiler tool does not provide any information about how large the run-time penalties for these divergence numbers are.

5.3 Scalability

Our approach rebuilds the top-level BVH on a frame-to-frame basis, which clearly limits the number of dynamic objects in the scene. When only small portions of the scene are manipulated, it might not be necessary to rebuild the entire BVH structure. Yoon et al. [YCM07] presented a technique to locally restructure parts of a BVH. Ize et al. [IWP07] showed how to asynchronously rebuild a BVH over multiple frames while relying on refitting for the intermediate BVHs. The use of a parallel build algorithm may increase the number of supported dynamic objects even further.

Since the BVH and the Kd-trees are binary trees, doubling the number of scene objects and triangles results in only one additional traversal step. While this behavior is quite acceptable, scalability is additionally limited by the size of the fast on-chip memory for stack usage and potentially also by less coherent access to tree nodes and geometry data.

Another problem is posed by large scene objects whose bounding volumes are frequently intersected by smaller objects. The BART kitchen and robots scenes are examples of such scenarios: they contain large background objects which enclose the smaller dynamic objects. This typically leads to unnecessary multiple descents into the Kd-tree hierarchies. The problem can be mitigated by subdividing the large scene objects into smaller parts, which are then sorted into the top-level hierarchy. Even if this does not completely remove the overlap of bounding volumes, it should result in significantly fewer unnecessary descents at the cost of more top-level objects.

During BVH construction we split the set of objects until every leaf contains exactly one object.
One benefit of using a SAH-based tree construction is that object subdivision is normally terminated when further subdivision would not improve the estimated intersection cost. Ignoring this termination criterion, as we do, may reduce the BVH quality. On the other hand, allowing more than one Kd-tree to be assigned to a leaf node would require each ray to iterate through a list and would result in a more complicated data layout. In addition, one more point of possible code execution divergence would be created due to the necessary loop over the list elements. Nevertheless, it is hard to predict the impact of this design decision.

6 Conclusion and Future Work

We have reported on the design and implementation of a two-level acceleration structure for ray tracing dynamic scenes on the GPU. We used a bounding volume hierarchy as the top level to organize the bounding boxes of the dynamic objects and encapsulated each object in a Kd-tree. The results indicate that our approach is feasible for a GPU implementation and that its performance may reach that of a single Kd-tree implementation for the entire scene. The BVH can be rebuilt very quickly, but performance may be reduced if there is much overlap between the bounding boxes of the individual objects in the scene.

Our approach is applicable to specific virtual reality applications such as assembly planning or rigid body physics, which do not require deformable objects or vertex animations. On desktop systems, users alternate between viewpoint manipulation and object manipulation, since both cannot be controlled at the same time. During viewpoint manipulation a single Kd-tree for the entire scene seems most appropriate, but viewpoint coherence should also be exploited. During object manipulation, often only small parts of the scene change, and thus efficient update techniques for a global acceleration structure would be required. At the same time, object manipulation may change only a small subset of the light exchange paths in the scene. Kurz et al. [KLSF08] have recently shown an approach which makes use of this fact by storing ray paths and only updating those that change. Such an approach should be combined with our two-level hierarchy to exploit frame-to-frame coherence.

Today, more and more CPUs and CPU cores are combined with two, three or even more graphics cards in a single machine. The challenge is to split the ray tracing algorithm into parts such that these abundant CPU and GPU resources are utilized in a balanced and parallel way. The increasing bandwidth between CPUs and GPUs, as well as approaches placing both processor types on a single die, make such approaches feasible and efficient. Nevertheless, appropriate data structures for handling large dynamic scenes on such systems still need to be developed.

References

[CHH02] Nathan A. Carr, Jesse D. Hall, and John C. Hart. The ray engine. In HWWS '02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 37–46. ACM, 2002.

[FS05] Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a GPU raytracer. In HWWS '05: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 15–22. ACM, 2005.

[GLM96] S. Gottschalk, M. C. Lin, and D. Manocha. OBBTree: A hierarchical structure for rapid interference detection. In Proceedings of SIGGRAPH '96, pages 171–180. ACM, 1996.

[GPSS07] Johannes Günther, Stefan Popov, Hans-Peter Seidel, and Philipp Slusallek. Realtime ray tracing on GPU with BVH-based packet traversal.
In Proceedings of the IEEE/Eurographics Symposium on Interactive Ray Tracing 2007, pages 113–118. IEEE, 2007.

[Hav00] Vlastimil Havran. Heuristic Ray Shooting Algorithms. Ph.D. thesis, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, November 2000.

[HSHH07] Daniel Reiter Horn, Jeremy Sugerman, Mike Houston, and Pat Hanrahan. Interactive k-d tree GPU raytracing. In I3D '07: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 167–174. ACM, 2007.

[IWP07] Thiago Ize, Ingo Wald, and Steven G. Parker. Asynchronous BVH construction for ray tracing dynamic scenes on parallel multi-core architectures. In Proceedings of the 2007 Eurographics Symposium on Parallel Graphics and Visualization. Eurographics, 2007.

[KLSF08] Daniel Kurz, Christopher Lux, Jan P. Springer, and Bernd Fröhlich. Improving interaction performance for ray tracing. In Eurographics '08, Annex to the Conference Proceedings, Short Papers, pages 283–286. Eurographics, 2008.

[LAM01] Jonas Lext and Thomas Akenine-Möller. Towards rapid reconstruction for animated ray tracing. In Eurographics 2001, Short Presentations, pages 311–318. Eurographics, 2001.

[NVI07] NVIDIA. The CUDA homepage, 2007.

[PBMH02] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics, 21(3):703–712, July 2002.

[PGSS07] Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek. Stackless Kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3). Eurographics, 2007.

[TS05] Niels Thrane and Lars Ole Simonsen. A comparison of acceleration structures for GPU assisted ray tracing. Master's thesis, University of Aarhus, 2005.

[Wal04] Ingo Wald. Realtime Ray Tracing and Interactive Global Illumination. Ph.D. thesis, Computer Graphics Group, Saarland University, 2004.

[WBS03] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Distributed interactive ray tracing of dynamic scenes. In PVG '03: Proceedings of the 2003 IEEE Symposium on Parallel and Large-Data Visualization and Graphics, page 11. IEEE, 2003.

[WBS07] Ingo Wald, Solomon Boulos, and Peter Shirley. Ray tracing deformable scenes using dynamic bounding volume hierarchies. ACM Transactions on Graphics, 26(1):6, January 2007.

[WH06] Ingo Wald and Vlastimil Havran. On building fast Kd-trees for ray tracing, and on doing that in O(N log N). In Proceedings of the IEEE Symposium on Interactive Ray Tracing 2006, pages 61–69. IEEE, 2006.

[WMG+07] I. Wald, W. R. Mark, J. Günther, S. Boulos, T. Ize, W. Hunt, S. G. Parker, and P. Shirley. State of the art in ray tracing animated scenes. In STAR Proceedings of Eurographics 2007, pages 89–116. Eurographics, 2007.

[YCM07] Sung-Eui Yoon, Sean Curtis, and Dinesh Manocha. Ray tracing dynamic scenes using selective restructuring. In SIGGRAPH '07: ACM SIGGRAPH 2007 Sketches, page 55. ACM, 2007.