HART: A Hybrid Architecture for Ray Tracing Animated Scenes

2015, IEEE Transactions on Visualization and Computer Graphics

Jae-Ho Nah, Jin-Woo Kim, Junho Park, Won-Jong Lee, Jeong-Soo Park, Seok-Yoon Jung, Woo-Chan Park, Dinesh Manocha, and Tack-Don Han

J.-H. Nah, J.-W. Kim, J. Park, and T.-D. Han (corresponding author) are with Yonsei University, E-mail: {jhnah, jwkim, bluedawn, hantack} (part of the work was done when J.-H. Nah was visiting UNC Chapel Hill). W.-J. Lee, J.-S. Park, and S.-Y. Jung are with Samsung Electronics, E-mail: {joe.w.lee, js1980.park, seokyoon.jung}. W.-C. Park is with Sejong University. D. Manocha is with the University of North Carolina at Chapel Hill.

Abstract: We present a hybrid architecture, inspired by asynchronous BVH construction [1], for ray tracing animated scenes. Our hybrid architecture utilizes heterogeneous hardware resources: dedicated ray-tracing hardware for BVH updates and ray traversal and a CPU for BVH reconstruction. We also present a traversal scheme using a primitive's axis-aligned bounding box (PrimAABB). This scheme reduces ray-primitive intersection tests by reusing existing BVH traversal units and the primAABB data for tree updates; it enables the use of shallow trees to reduce tree build times, tree sizes, and bus bandwidth requirements. Furthermore, we present a cache scheme that exploits consecutive memory access by reusing data in an L1 cache block. We perform cycle-accurate simulations to verify our architecture, and the simulation results indicate that the proposed architecture can achieve real-time Whitted ray tracing of animated scenes at 1920×1200 resolution. This result comes from our high-performance hardware architecture and minimized resource requirements for tree updates.

Index Terms: Ray tracing, bounding volume hierarchy, dynamic scene, graphics hardware

1 INTRODUCTION

Recently, a great deal of research has been conducted to achieve ray tracing dynamic scenes at interactive rates [2]. In dynamic scenes, objects can be moved, added or deleted from a scene, or animated with topological changes. Because most ray-tracing systems are based on acceleration data structures, such as kd-trees, bounding volume hierarchies (BVHs), and grids, these acceleration data structures should be effectively updated for dynamic scenes. Many researchers have exploited CPUs [1], [3]–[11], GPUs [12]–[16], MIC (many integrated core) [11], [17], or dedicated ray-tracing hardware [18]–[20] to achieve this goal. However, most current real-time rendering engines (e.g. game engines) use techniques based on rasterization instead of ray tracing. This means that current ray-tracing systems still do not provide sufficient performance for the real-time rendering of dynamic scenes on commodity hardware.

To achieve ray-tracing in dynamic scenes at real-time rates, there are two requirements: to get high-quality effects, the ray traversal performance must be high; and there must be fast acceleration-data-structure updates that do not degrade the tree quality. To achieve these two goals, we present a hybrid ray-tracing architecture based on the BVH. In this architecture, dedicated ray-tracing hardware performs traversal and ray-triangle intersection tests because these two operations tend to be the main bottlenecks in ray tracing. In order to deal with dynamic scenes, we extend CPU-based asynchronous BVH construction schemes [1], [7]; tree construction is performed using a CPU, and bounding volume (BV) refitting is performed by geometry and tree update (GTU) units, as part of ray-tracing hardware. This approach greatly reduces the tree update cost because expensive BVH construction does not need to be performed during each frame.
When we utilize multiple hardware resources, the throughput of each hardware component and the communication cost between the hardware components determine overall performance. For our system, we present a traversal scheme using primitives' axis-aligned bounding boxes (primAABBs) with shallow trees. In this scheme, the traversal unit in the ray-tracing hardware performs both BVH traversal and ray-primAABB intersection tests. When this scheme is used with shallow trees, it reduces both tree build times and tree sizes by up to 44% without significant performance degradation. We also present a cache scheme for our traversal unit; we maintain two consecutive sets of shape data (node and primAABB) in an L1 cache block to reuse the data in the next iteration. This cache-data reuse scheme reduces cache misses caused by eviction and increases rendering performance by up to 21%.

We verify the performance of our architecture using a cycle-accurate simulator. According to the simulation results, our architecture could achieve significantly higher performance in ray-tracing dynamic scenes than other systems [17], [21], [22]. Consequently, our architecture can perform ray-tracing of dynamic scenes with full Whitted effects (e.g., specular reflection and hard shadows) at real-time rates. This result comes from the efficient use of each hardware component and from the maintenance of BVH quality. In static scenes, our architecture performs comparably to other ray-tracing hardware architectures designed for static scenes [23].

2 PREVIOUS WORK

In this section, we summarize prior work on asynchronous BVH construction. Next, we introduce related work, including ray-tracing hardware architectures, CPU-GPU hybrid ray tracers, and primitive culling algorithms.

2.1 Asynchronous BVH Construction

There are two representative BVH update methods: BV refitting and rebuilding.
First, BV refitting [3], [4] quickly updates the BVs without changing the topology of the BVHs. However, this method degrades the tree quality due to overlap between BVs. In contrast, BVH rebuilding methods [5] reconstruct the BVH from scratch during each frame. This method creates high-quality trees but takes longer than BV refitting.

The rebuild heuristic [3] detects tree-quality degradation after BV refitting to determine when the BVH needs to be rebuilt, but this method may cause a disruptive pause while the BVH is being rebuilt [7]. Selective restructuring [6] continuously reconstructs subtrees instead of rebuilding the entire tree at certain points, which prevents that disruptive pause. However, this method proceeds in a serial manner, so it is not suitable for our parallel architecture. Lauterbach et al. [24] proposed an oriented bounding box (OBB)-based refitting process for collision detection. This OBB-based process, however, cannot be directly used for our AABB-based ray-tracing system. Finally, a tree-rotation algorithm [9] performs additional tree-rotation operations after BV refitting to reduce tree-quality degradation.

Asynchronous BVH construction [1], [7], the base algorithm used in our architecture, asynchronously executes BV refitting and BVH rebuilding on a multicore CPU. While a new BVH is built on specific threads, the remaining threads perform BV refitting and rendering during each frame. This approach takes the best of both methods: it prevents BVH quality degradation from BV refitting while maintaining frame rates, unlike the rebuild heuristic.

2.2 Dedicated Ray-Tracing Hardware Architecture

We classify the studies on dedicated ray-tracing hardware architectures into two categories: SIMD (single instruction, multiple data) approaches and MIMD (multiple instructions, multiple data) approaches.

SaarCOR [25], RPU [26], and D-RPU [19] traverse four rays together by exploiting packet tracing and the four-wide SIMD architecture.
These approaches are suitable for coherent rays, but they are inefficient for incoherent rays. To increase SIMD efficiency, StreamRay [27] filters active rays in a large packet.

In single-ray-based approaches, each ray is treated as an independent thread. Thus, these architectures have higher hardware utilization than packet-based SIMD approaches in the case of incoherent sets of rays. TRaX [28] and MIMD TM [29] distribute each ray to light-weight programmable cores. In contrast, the T&I engine [23] includes fixed pipelines for traversal and intersection operations. This architecture consists of traversal units using an ordered depth-first layout and three-phase intersection units. These two types of units commonly include a ray accumulation buffer for latency hiding. Aila and Karras [30] proposed a GPU-based single-instruction, multiple-threads (SIMT) architecture using treelets and a stack top cache to minimize memory traffic when tracing incoherent rays. SGRT [31], [32] is a mobile ray-tracing hardware architecture designed for static scenes. It combines dedicated T&I units and SRPs (Samsung reconfigurable processors). RayCore [33] is another mobile ray-tracing hardware architecture based on unified MIMD T&I units.

Many other ray tracing hardware architectures have been proposed for dynamic scenes. SaarCOR [25] includes a transformation unit for ray transformation. This hardware architecture does not update a kd-tree, so it is limited to piecewise rigid motion [18]. D-RPU [19] supports skinning animation through the use of a BKD-tree update unit [19]. This approach is similar to BV refitting [4] and is prone to problems related to tree-quality degradation [34]. In contrast, our architecture based on asynchronous BVH construction can maintain tree quality. Finally, Doyle et al. [20] proposed a hardware architecture for binned SAH BVH construction.
2.3 CPU-GPU Hybrid Ray-Tracing System

Some researchers have tried to utilize both CPUs and GPUs in a cooperative way to render dynamic scenes. Budge et al. [35] combined CPU tree construction and GPU ray tracing using CUDA. Nah et al. [36] implemented an OpenGL ES-based ray tracer for mobile devices, which assigned kd-tree construction and ray tree management to CPUs and ray traversal and shading to GPUs. The Brigade renderer [37] is a rendering engine for path tracing dynamic scenes that exploits CPUs for game logic and BVH maintenance and GPUs for rendering.

Other researchers have focused on CPU-GPU hybrid path tracing of static scenes. Budge et al. [38] distributed the tasks to CPUs and GPUs via careful scheduling. Combinatorial bidirectional path-tracing [39] utilizes GPUs to link segments of camera and light paths and utilizes CPUs to avoid the limitations of pure GPU implementations. LuxRays [40] supports multiple OpenCL devices, so CPUs, GPUs, or both can be used for ray tracing. Unlike these studies, we focus on ray tracing dynamic scenes rather than static scenes.

2.4 Primitive Culling Algorithms for Ray Tracing

To reduce the computational costs of ray-primitive intersection tests, a few CPU-based primitive culling algorithms have been proposed. Vertex culling [41] substitutes a cheap ray-frustum test for unnecessary ray-triangle intersection tests. The ray box cull [42] creates a transient AABB using a ray's t interval in a grid cell and primAABB. Nah et al. [43] extended the ray box cull into kd-trees. These methods [41], [43] can be useful for ray tracing dynamic scenes when they are combined with shallow tree structures.

3 PROPOSED ARCHITECTURE

We start this section by describing the overall system architecture and design decisions. We then introduce the traversal scheme using ray-primAABB intersection tests. Next, we describe the hardware components of our system in detail.
3.1 System Organization

Figure 1 illustrates the organization of our proposed system. This system consists of CPUs, ray tracing acceleration units, and programmable shaders. The goal of the proposed system is to utilize heterogeneous hardware resources for fast ray tracing of dynamic scenes.

Fig. 1. Overall system architecture.

We chose asynchronous BVH construction [1], [7] for our system because we can easily distribute the BVH update process to a CPU and ray-tracing hardware. In our system, a CPU performs scene management and BVH construction. Because modern CPUs have multilevel cache hierarchies, surface area heuristic (SAH)-based tree construction [4], which requires random memory access, fits well in these hierarchies. On the other hand, geometry and tree update (GTU) units in the dedicated hardware perform key-frame animation, BV refitting, and the computation of triangle data in [44] (triAccel) because these need to be performed during each frame. The input key-frame geometry data are transferred in advance from the CPU to a memory in the ray-tracing hardware via a PCI Express bus, and they are used in the GTU unit. The computed data from the GTU unit are used for traversal and intersection (T&I) operations in ray tracing. We chose a fixed hardware unit for this process because of its high performance per area.

T&I units also consist of fixed pipelines for high performance per area because T&I operations can dominate the computation of ray tracing. We used a single-ray-based approach rather than a ray-packet-based SIMD approach for efficient processing of incoherent rays. Additionally, efficient multi-threading is performed with a ray accumulation buffer [23] in each traversal and intersection unit. This buffer is used to prevent pipeline stalls by storing rays that induce a cache miss, and it permits the efficient concurrent processing of multiple rays in deep pipeline stages.
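As a rough illustration of the asynchronous scheme adopted here, the following toy timeline model (not the scheduler used in the paper) tracks which frame's geometry the active BVH topology was built from when a CPU rebuild takes a fixed number of frames while BV refitting runs every frame. The function name and the `rebuild_frames` parameter are hypothetical, and the model assumes that a new rebuild starts as soon as the previous one finishes.

```python
def simulate_async_bvh(num_frames, rebuild_frames):
    """Toy model: for each frame, return the frame index whose geometry
    the active BVH topology was built from.

    rebuild_frames is a hypothetical parameter: the number of frames that
    one CPU-side rebuild takes to complete.
    """
    active_topology = 0             # initial tree, built from frame 0 geometry
    pending_source = 0              # geometry snapshot used by the in-flight rebuild
    pending_done = rebuild_frames   # frame at which the in-flight rebuild completes
    history = []
    for frame in range(num_frames):
        if frame == pending_done:
            # The CPU has finished a rebuild: swap the new topology in and
            # immediately start the next rebuild from the current geometry.
            active_topology = pending_source
            pending_source = frame
            pending_done = frame + rebuild_frames
        # Every frame, the bounding volumes of the active topology would be
        # refitted to the current geometry (the GTU's job in the text).
        history.append(active_topology)
    return history


timeline = simulate_async_bvh(num_frames=10, rebuild_frames=3)
```

In this model with a three-frame rebuild, the topology in use during steady state is between three and five frames old, which hints at why shorter build times help keep the tree from becoming outdated.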
The traversal unit (TRV) performs both the BVH traversal and the ray-primAABB intersection tests described in Section 3.2. In contrast to the prior T&I engine [23], the TRV unit is optimized for BVHs rather than kd-trees.

We limited the primitive type to a triangle for simple configuration. The ray-triangle intersection unit (IST) is based on Wald's algorithm [44]. According to [23], this algorithm has the lowest cost of all ray-triangle intersection algorithms for hit triangles. Because the ray-primAABB scheme filters rays before they reach the IST unit, a ray is more likely to hit the triangles it is tested against in a leaf node, so Wald's algorithm is a good choice for our architecture. Additionally, precomputation-based algorithms such as Wald's algorithm can be used to design an effective H/W intersection unit; consecutive memory access to precomputed data can simplify cache configuration and increase pipeline utilization compared to the Möller-Trumbore algorithm [45], which requires one index and three vertices. Although the precomputed triAccel data of Wald's algorithm increase memory footprints (40 bytes per triangle), we believe that the overhead is not high for scenes of moderate complexity.

Programmable shaders perform ray generation and shading to support various effects. We assume that these shaders are similar to unified shaders in commodity GPUs, and that ray data transmission between the T&I units and the programmable shaders uses small FIFO buffers to reduce memory traffic, as with [23], [32]. We will only focus on BVH updates and ray traversal in this paper; detailed descriptions of the ray generation and shading kernel are included in the SGRT paper [32].

3.2 Traversal Scheme using PrimAABBs

The proposed system in Section 3.1 has the following problems.
First, asynchronous BVH construction on CPUs and our ray-tracing hardware requires triple buffering of tree data (Figure 1): data transferred from CPUs, data for the GTU unit, and data for the T&I unit. Second, the BVH is sometimes too outdated by the time its construction is finished [7], and this situation occurs when the BVH construction time is long. Third, a big tree with a large number of nodes would create a bottleneck due to the limited bus bandwidth between the CPU and the ray-tracing hardware.

We deal with these issues by reusing the primAABBs for traversal. Originally, primAABBs are used for BVH building and BV refitting. We keep the primAABB data after BV refitting and reuse them for traversal. When a ray reaches a leaf node, we perform a ray-primAABB test using the existing traversal unit before sending the ray to the intersection (IST) unit. This method substitutes ray-AABB intersection tests for most of the expensive ray-primitive intersection tests. For example, a ray-triangle intersection test requires 11–29 multiplications and 1–2 reciprocal instructions [23], but a ray-AABB intersection test requires only 6 multiplications. We will explain the detailed traversal hardware architecture using this scheme in Section 3.4.

In combination with shallow trees, our traversal scheme using primAABBs solves all three of the problems listed above (Figure 2). Shallow trees with large leaf nodes have small memory footprints and also require less build time. However, these large leaf nodes incur more ray-primitive intersection tests. Like other culling methods [41], [43], our proposed scheme prevents the traversal cost from increasing when shallow trees are used. However, because we reuse the existing traversal unit and primAABB data, the additional culling stages required by other culling methods are unnecessary.

This method also permits an effective data layout.
Wald's intersection algorithm [44] requires 36 bytes per triangle for its triAccel data, and a reordered triangle in a BVH leaf node requires its original triangle index for shading (4 bytes). If the precomputed triangle data (40 bytes) and a primAABB (24 bytes) are combined, 32-byte alignment can be made available (Figure 3) without padding. This configuration increases cache efficiency. If a ray passes the ray-primAABB test, we transfer the 8-byte triAccel data to the IST unit.

Fig. 2. An example of a shallow tree configuration using primAABBs: Our scheme reduces the tree depth and adds the primAABBs of each primitive to the tree. Each leaf node in the right figure points to primAABBs instead of actual primitives.

Fig. 3. 32-byte alignment by combining a primAABB with the precomputed triangle data (triAccel).

3.3 Geometry and Tree Update Unit

The geometry and tree update (GTU) unit computes BVHs, primAABBs, and triAccel data during each frame for animated scenes. This unit is organized into five pipeline stages as illustrated in Figure 4. Vertex and index fetch units read the triangle index and vertex data, an interpolation unit performs key-frame animation, and the AABB/triAccel calculation unit calculates each triangle's AABB and triAccel data. These three units were designed as a half-pipelined architecture for reduced hardware requirements. Finally, a BV refit unit performs BV refitting [4], and this unit is fully pipelined.

The index fetch unit fetches the index of three vertices in a triangle from the memory. The order of the triangles used in this unit corresponds to the triangle order in the leaf nodes of the BVH. In other words, all triangles in a leaf node are stored consecutively. This order removes triangle list fetching upon ray traversal to simplify the design of the T&I unit. Each index fetch unit has two index buffers for the concurrent processing of the index fetch unit and the vertex fetch unit.
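The 64-byte per-triangle record from Section 3.2, together with the consecutive leaf ordering described above, could be modeled as follows. This is only a sketch: the text gives the field sizes (24-byte primAABB, 36-byte triAccel, 4-byte triangle index) but not the field order, so the order used here, the opaque treatment of the triAccel bytes, and the names `RECORD` and `record_offset` are assumptions.

```python
import struct

# Field sizes come from the text; the field order is an assumption.
PRIM_AABB = "6f"    # min x/y/z and max x/y/z as 32-bit floats: 24 bytes
TRI_ACCEL = "36s"   # precomputed triAccel data, kept opaque here: 36 bytes
TRI_INDEX = "I"     # original triangle index used for shading: 4 bytes

# "<" selects little-endian packing without implicit alignment padding.
RECORD = struct.Struct("<" + PRIM_AABB + TRI_ACCEL + TRI_INDEX)


def record_offset(leaf_first, i):
    """Byte offset of the i-th primitive of a leaf whose records start at
    index leaf_first. Because a leaf's records are stored consecutively,
    no per-leaf triangle list is needed."""
    return (leaf_first + i) * RECORD.size


# Pack one record with a unit-cube primAABB, zeroed triAccel, and index 7.
packed = RECORD.pack(0.0, 0.0, 0.0, 1.0, 1.0, 1.0, bytes(36), 7)
fields = RECORD.unpack(packed)
```

The record size is a multiple of 32 bytes with no padding, which is the property the text relies on for alignment.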
The vertex fetch unit reads vertices from the external memory. Each vertex fetch unit has a 32-entry FIFO buffer to hide memory latency. If the buffer is not full, another thread can generate a memory request without a pipeline stall. The size of the memory request is 32 bytes. This policy reduces memory requests by using spatial locality when three vertices are adjacently stored in the memory.

The vertex interpolation unit calculates the interpolated vertices by using the two key frames. An interpolation requires nine multiplications and nine additions for three axes and three vertices. Each vertex interpolation unit consists of five multipliers and five adders. Thus, the throughput of the unit is 0.5 triangles per cycle.

Fig. 4. The architecture of the geometry and tree update (GTU) unit.

The AABB/triAccel calculation unit calculates primAABBs and triAccel data using the interpolated vertices. Each AABB/triAccel calculation unit consists of 10 adders, 12 multipliers, and one reciprocal unit.

After the primAABBs are calculated, the BV refit unit performs BV refitting using the primAABBs. We used the breadth-first tree layout for BV refitting for easy parallelization. If the size of the node cache in the T&I units is 64 bytes, the breadth-first layout stores two child nodes in a cache block similar to the layout in [46]. Therefore, both layouts have the same cache efficiency in a 64-byte cache block. Each BV refit unit has a cache including node data and primAABB data; we also prefetch data into this cache to hide memory latency. The output data are transferred to the memory using the write-through policy.

The GTU procedure can be parallelized for large dynamic scenes.
In this case, the geometry-update part (index fetch, vertex fetch, interpolation, and AABB/TriAccel calculation units) and the tree-update part (a BV refit unit) are operated separately. In the geometry-update part, the number of triangles assigned to each parallel unit is the total number of triangles divided by the number of parallel units. In contrast, BV refit units exploit level-by-level parallelization in a bottom-up update manner [24]; at each level, the number of nodes assigned to each parallel unit is the number of nodes at the level divided by the number of parallel units. BV updates at the upper levels of the tree start only after all BV refit units have finished the updates at the current level.

3.4 Traversal and Intersection Unit

The traversal and intersection (T&I) unit (Figure 5) consists of one ray dispatcher (RD), 16 traversal (TRV) units, one TRV L2 cache, and one intersection (IST) unit.

The RD gets rays from the programmable shaders and dispatches the rays to the TRV units. The RD also calculates the inverse direction vector for TRV units. Ray tracing is basically "embarrassingly parallel," and the parallelization of ray traversal can be easily achieved as with [23]. In other words, if enough ray threads are supplied to T&I units, the RD in each T&I unit can perform traversals using multiple TRV units.

The TRV units perform both BVH traversal and ray-primAABB intersection tests. Each TRV unit includes a ray-AABB intersection routine, stack memory, a ray accumulation buffer for latency hiding [23], and an L1 cache. The ray-AABB intersection calculation part is fully pipelined and it consists of six floating-point adders, six floating-point multipliers, and 13 floating-point comparators to achieve a throughput of one ray-AABB intersection test per cycle.

The IST unit performs ray-triangle intersection tests. Each IST unit consists of 11 floating-point adders, 11 floating-point multipliers, one reciprocal unit, and four floating-point comparators.
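The ray-AABB test that each TRV unit pipelines can be sketched in software as a standard slab test. The version below is an illustrative assumption rather than the paper's hardware datapath: it takes the inverse direction as an input (in the text, the RD precomputes it), uses six subtractions and six multiplications per box, and does not handle zero direction components. The names `ray_aabb_hit` and `candidates_for_ist` are hypothetical.

```python
def ray_aabb_hit(origin, inv_dir, t_min, t_max, box_min, box_max):
    """Slab test: True if the ray overlaps the box within [t_min, t_max].

    inv_dir holds 1/direction per axis, computed once per ray.
    Zero direction components are not handled in this sketch.
    """
    for axis in range(3):
        # Two subtractions and two multiplications per axis.
        t0 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t1 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the valid interval to the overlap with this slab.
        t_min = max(t_min, t0)
        t_max = min(t_max, t1)
    return t_min <= t_max


def candidates_for_ist(origin, inv_dir, t_max, prim_aabbs):
    """Indices of leaf primitives whose primAABB the ray hits; only these
    would be forwarded to the ray-triangle intersection stage."""
    return [i for i, (lo, hi) in enumerate(prim_aabbs)
            if ray_aabb_hit(origin, inv_dir, 0.0, t_max, lo, hi)]


origin = (0.0, 0.0, 0.0)
inv_dir = (1.0, 1.0, 1.0)   # direction (1, 1, 1), inverted per axis
leaf = [((1.0, 1.0, 1.0), (2.0, 2.0, 2.0)),
        ((1.0, 3.0, 1.0), (2.0, 4.0, 2.0))]
passed = candidates_for_ist(origin, inv_dir, 100.0, leaf)
```

In this example only the first primitive passes the filter, so the second triangle would never reach the more expensive ray-triangle test.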
The IST unit, like the TRV unit, includes an L1 cache and a ray accumulation buffer for effective memory access. The ratio of TRV units to IST units is 16:1, which is different from that in the previous fixed ray-tracing pipelines [18], [19], [23], [25], [32] (3:1–4:1). The reason for this is that ray-primAABB tests in TRV units minimize the number of ray-triangle intersection tests.

Fig. 5. The architecture of the traversal (TRV) and intersection (IST) unit. In contrast to the prior T&I engine [23], the ratio between TRV and IST units is 16:1 with the ray-primAABB intersection scheme, which reduces the expense of IST units per T&I unit.

Figure 6 illustrates the finite-state machine for processing of both BVH traversal and ray-primAABB intersection tests. Each state is described as follows.

• STAT_TRV (0) represents the initial traversal stage to fetch data. If the parent node is an inner node, we fetch the child node's AABB and the next state is set to STAT_LCHD (1) to visit the left child node. If the parent node is a leaf node, the next state is STAT_PRIM (4). If the traversal is finished, the next state is STAT_SHD (6).
• STAT_LCHD (1) performs the left child traversal. After the ray-AABB intersection test, the next state is set to STAT_RCHD (2).
• STAT_RCHD (2) performs the right child traversal. After the ray-AABB intersection test, the next state is set to STAT_TRV_POST (3).
• STAT_TRV_POST (3) determines the next node to visit. In this state, stack operations based on the short-stack algorithm with the restart-trail method [47] are performed. Additionally, the SATO metric [48] is adapted to accelerate shadow ray traversal. First, if the ray intersects both child nodes, the next node is the nearer child for a non-shadow ray and the child with the lower cost according to the SATO metric for a shadow ray. The other node is pushed onto the stack and the restart-trail flag bit at the current tree level is updated. If the stack is full, the bottom-most entry is discarded. After that, the current tree level is increased for further restart-trail updates. Second, if the ray intersects only one of the two child nodes, only the current tree level is increased. Third, if the ray does not intersect either child node, the next node to visit is popped from the stack and the current level bit is decreased. If the stack is empty, the node traversal is restarted from the root node. In this case, the restart-trail flag prevents duplicated visits to already traversed sub-trees. Because the next traversal step corresponds to a node visit in all three cases, we set the next state to STAT_TRV (0).
• STAT_PRIM (4) performs ray-primAABB intersection tests. If the ray passes the test, the next state is STAT_IST (5). Additionally, if there are remaining primAABBs for further intersection in the leaf node, the processing is iterated with the current state (STAT_PRIM (4)). When we find the hit point of an occluded ray or have visited all primAABBs in the leaf, we change the state to STAT_TRV_POST (3).
• STAT_IST (5) passes the ray to the IST unit.
• STAT_SHD (6) passes the ray to the shaders when the final hit point of the ray is found.

Fig. 6. The finite-state machine of the traversal unit. A dashed arrow, such as 0→1, 2→3, and 4→3, means a state transition in the same iteration without an additional shape data fetch.

3.5 Cache-Data Reuse Scheme

Efficient memory access is important for high performance when rays are incoherent.
For incoherent ray tracing, we present a cache-data reuse scheme to exploit consecutive access. The block size of an L1 traversal cache is 64 bytes, so two sets of BVH node data or primAABB data can be stored in a cache block. In the case of BVH node data, left and right child nodes are stored consecutively. PrimAABBs in a leaf node are also stored consecutively. Therefore, we can reuse the cache-block data for the next iteration after the data are obtained.

From the L1 cache, we obtain two sets of shape data (node or primAABB data) in an entire cache block and continuously maintain these data in the next pipeline stages. After dozens of cycles, when the ray comes back to the top of the traversal pipeline for the next iteration, we can reuse the shape data. If the required shape data exist in the maintained cache-block data, a cache access for the shape data is bypassed and the processing of the ray is treated as a cache hit.

4 SIMULATION RESULTS AND ANALYSIS

In this section, we describe the experimental results using a cycle-accurate simulator. This simulator reports the cycles required for BV refitting and ray tracing, hardware utilization, average T&I operations per ray, cache and memory statistics, and simulated performance. We also compare our system to other approaches and describe the limitations of our work.

4.1 The Effect of the Ray-PrimAABB Test Scheme and Asynchronous BVH Construction

We used four dynamic test scenes for the experiment (Figure 7): UNC cloth simulation (92K triangles), Fairy forest (174K triangles), Exploding Dragon (252K triangles), and Lion (1.6M triangles). The Cloth scene has high frame-to-frame coherence, so it is suitable for BV refitting. The Fairy scene is used for game-like scene configuration. The Dragon scene has low frame-to-frame coherence due to fractures after a collision between a bunny and a dragon. The Lion scene is the largest scene in our benchmark and has features similar to the Dragon scene.
All scenes were rendered at 1920×1200 resolution. For our experiments, we used two different ray settings: ray casting with hard shadows with one light source and two-bounce forced specular reflection. The BVHs were constructed by using the binned SAH method [5]. All experiments were performed using a 3.5GHz Core i7 4770K CPU with 8GB of RAM.

Fig. 7. Four dynamic test scenes (from top to bottom): Cloth, Fairy, Dragon, and Lion. All captured images were rendered with two-bounce forced specular reflection and hard shadows. According to our simulation results, our architecture can render these scenes at 264, 36, 212, and 46 FPS at 1920×1200 resolution, respectively.

To construct shallow BVHs, we modified the ratio of the expected traversal cost (KT) to the expected intersection cost (KI) when we calculated the SAH cost in [5]. When we did not use the ray-primAABB test scheme, both the KT and KI values were set to 1. When we enabled the ray-primAABB test scheme, we varied the KT values (2, 5, and 10) and set the KI value to 1. The use of larger KT values produces shallower trees having larger leaf nodes. The first goal of the experiments in this section is to determine the optimal ratio of KT to KI for our system.

Table 1 describes the results of our experiments. To measure the BVH build time, we used the multithreaded BVH builder in Embree 2.3.2 [11]. We used four threads for the Lion scene and one thread for the other scenes. We believe this 1-4 thread setting is affordable in terms of resource utilization because modern CPUs, such as the Intel Core i7 used in our experiments, can support up to eight threads.

The results in Table 1 are described as follows. First, the BVH build time decreases by 17–20%, 36–44%, and 46–50% when KT is 2, 5, and 10, respectively.
When we consider a small overhead for key-frame animation and data transfer, the BVH can be rebuilt at 25, 12, 9, and 3 times per second for the Cloth, Fairy, Dragon, and Lion scenes, respectively. Next, the number of intersection tests decreases by 16–29% when the ray-primAABB intersection scheme is enabled, but when KT is 10, the number of traversal operations increases by up to 27%. Thus, we conclude that the optimal KT:KI ratio with the ray-primAABB test scheme is 5:1 because this ratio provides the best performance balance between tree construction and ray traversal. Because the sum of T&I operations with the KT values of 1 and 5 is similar, we think both settings will result in similar memory traffic.

TABLE 1
The experimental results for the ray-primAABB intersection scheme. For the statistics in this table, we selected a middle frame of each scene and rendered the scene with two-bounce reflection and shadows. The BVHs of the Cloth (92K triangles), Fairy (174K triangles), and Dragon (252K triangles) scenes were built with a single thread; the BVH of the Lion scene (1.6M triangles) was built in parallel with four threads.

| Scene  | KT:KI | Avg tris per leaf | BVH build time (ms) | Avg TRV steps per ray | Avg IST steps per ray |
|--------|-------|-------------------|---------------------|-----------------------|-----------------------|
| Cloth  | 1:1   | 1.75              | 65 (1.00×)          | 33.46 (1.00×)         | 1.93 (1.00×)          |
| Cloth  | 2:1   | 2.84              | 52 (0.80×)          | 33.00 (0.99×)         | 1.37 (0.71×)          |
| Cloth  | 5:1   | 7.45              | 36 (0.56×)          | 34.87 (1.04×)         | 1.38 (0.72×)          |
| Cloth  | 10:1  | 14.49             | 33 (0.51×)          | 40.04 (1.20×)         | 1.38 (0.72×)          |
| Fairy  | 1:1   | 2.06              | 120 (1.00×)         | 64.99 (1.00×)         | 6.82 (1.00×)          |
| Fairy  | 2:1   | 3.17              | 100 (0.83×)         | 63.73 (0.98×)         | 5.68 (0.83×)          |
| Fairy  | 5:1   | 7.35              | 77 (0.64×)          | 66.40 (1.02×)         | 5.72 (0.84×)          |
| Fairy  | 10:1  | 14.50             | 65 (0.54×)          | 78.90 (1.21×)         | 5.74 (0.84×)          |
| Dragon | 1:1   | 1.72              | 178 (1.00×)         | 51.63 (1.00×)         | 1.36 (1.00×)          |
| Dragon | 2:1   | 2.80              | 142 (0.80×)         | 51.31 (0.99×)         | 0.98 (0.72×)          |
| Dragon | 5:1   | 6.47              | 105 (0.59×)         | 52.63 (1.02×)         | 0.99 (0.73×)          |
| Dragon | 10:1  | 12.24             | 88 (0.50×)          | 57.04 (1.10×)         | 1.00 (0.74×)          |
| Lion   | 1:1   | 1.80              | 474 (1.00×)         | 85.39 (1.00×)         | 11.68 (1.00×)         |
| Lion   | 2:1   | 2.82              | 390 (0.82×)         | 84.97 (1.00×)         | 8.86 (0.76×)          |
| Lion   | 5:1   | 7.30              | 287 (0.61×)         | 91.27 (1.07×)         | 8.90 (0.76×)          |
| Lion   | 10:1  | 14.40             | 246 (0.52×)         | 108.17 (1.27×)        | 8.94 (0.77×)          |
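The effect of the KT:KI ratio on leaf size can be illustrated with a minimal sketch of the SAH termination test. The Python below is not the builder used in the experiments; it assumes the standard SAH cost form (a traversal cost plus an area-weighted intersection cost), and the names SplitCandidate, split_cost, leaf_cost, and should_split are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCandidate:
    """One candidate partition of a node (illustrative fields only)."""
    area_ratio_left: float   # surface area of left child / surface area of parent
    area_ratio_right: float  # surface area of right child / surface area of parent
    count_left: int          # primitives assigned to the left child
    count_right: int         # primitives assigned to the right child


def split_cost(c: SplitCandidate, k_t: float, k_i: float) -> float:
    """Expected cost of splitting: one traversal step plus weighted intersections."""
    weighted = c.area_ratio_left * c.count_left + c.area_ratio_right * c.count_right
    return k_t + k_i * weighted


def leaf_cost(num_prims: int, k_i: float) -> float:
    """Expected cost of keeping all primitives in a single leaf."""
    return k_i * num_prims


def should_split(c: SplitCandidate, k_t: float, k_i: float = 1.0) -> bool:
    """Split only if the SAH estimate is cheaper than making a leaf."""
    total = c.count_left + c.count_right
    return split_cost(c, k_t, k_i) < leaf_cost(total, k_i)


# Assumed example: 8 primitives split evenly into two children,
# each child covering half of the parent's surface area.
candidate = SplitCandidate(0.5, 0.5, 4, 4)
decisions = {k_t: should_split(candidate, k_t) for k_t in (1, 2, 5, 10)}
```

For this assumed candidate, the node is split when KT is 1 or 2 and kept as an eight-primitive leaf when KT is 5 or 10, which is the mechanism by which larger KT values yield shallower trees with larger leaf nodes.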
Fig. 8. Comparison of the tree sizes (in megabytes) for the Cloth, Fairy, Dragon, and Lion scenes, broken down into node, primAABB, and triangle index list data. Default: the ratio of KT to KI was 1:1 without the ray-primAABB intersection scheme. Ours: the ratio of KT to KI was 5:1 with the ray-primAABB intersection scheme.

Figure 8 depicts the tree sizes. Node data require triple buffering, and the size of a node is 32 bytes. Both primAABB data and a triangle index list require double buffering; primAABB data and a triangle index list do not need a buffer to store transferred data from the CPUs and a buffer in T&I units, respectively. Note that we do not count 8 bytes of primAABB data stored in the padding of TriAccel data (Figure 3) for these tree sizes in Figure 8. According to Figure 8, the proposed method achieves a reduction of 34–44% of the tree sizes; the number of nodes is 72–77% less than the default setting, and primAABB data are added. The result also means that the bus-bandwidth requirements to transfer node and triangle index data from the CPU to the ray-tracing hardware are reduced by 66–71%. If the data are asynchronously transferred from the CPU to the ray-tracing hardware, the required bus bandwidth for BVHs is very small (22–52 MB/s) because these data do not need to be transferred during each frame.

4.2 Hardware Complexity and Area Estimation

The hardware setup of the proposed architecture is structured as follows. The number of stacks per TRV is 32, and the number of TRVs in a T&I unit is 16; therefore, the highest number of executing rays in a T&I unit is 512. We configured the external memory as 1GHz, 8-channel GDDR3 memory. We assumed that six channels are connected to T&I units, and two channels are connected to the GTU units. The memory simulation was executed using a GDDR3 simulator in GPGPU-Sim [49]. As with [23], we assumed that the programmable shaders provide sufficient computing power for ray generation and shading.

Table 2 shows the hardware complexity of a GTU unit and a T&I unit. Each BV refit, TRV, and IST unit has an 8KB, 16KB, and 128KB 2-way set associative cache, respectively. All caches have one read-only port. An L2 TRV cache is a 512KB 4-way cache divided into eight banks. The BV refit cache has a block size of 256B for data prefetching with a sequential access pattern. In contrast, both TRV and IST caches have a block size of 64B. We set the latency of the L1 caches as one cycle and set the minimum latency of the L2 caches as 3 cycles.

TABLE 2
Complexity of each component by the number of floating-point units and the required on-chip memory. Abbreviations: ADD: adder, MUL: multiplier, RCP: reciprocal unit, CMP: comparator, RF: register file, L1: L1 cache, L2: L2 cache, idx.: index, vtx.: vertex, calc.: calculation.

| Unit     | Component           | ADD | MUL | RCP | CMP | RF    | L1    | L2    |
|----------|---------------------|-----|-----|-----|-----|-------|-------|-------|
| GTU unit | Idx. fetch          |     |     |     |     | 2KB   |       |       |
| GTU unit | Vtx. fetch          |     |     |     |     | 5KB   |       |       |
| GTU unit | Vtx. interp.        | 10  | 5   |     |     | 1KB   |       |       |
| GTU unit | TriAccel/AABB calc. | 10  | 12  | 1   | 7   | 20KB  |       |       |
| GTU unit | BV refit            |     |     |     | 11  | 2KB   | 8KB   |       |
| GTU unit | Total               | 20  | 17  | 1   | 18  | 30KB  | 8KB   |       |
| T&I unit | RD (×1)             |     |     | 3   |     | 2KB   |       |       |
| T&I unit | TRV (×16)           | 96  | 96  |     | 208 | 207KB | 256KB | 512KB |
| T&I unit | IST (×1)            | 11  | 11  | 1   | 3   | 13KB  | 128KB |       |
| T&I unit | I/O buffer          |     |     |     |     | 32KB  |       |       |
| T&I unit | Total               | 107 | 107 | 4   | 211 | 254KB | 384KB | 512KB |

TABLE 3
Area estimates of a GTU unit and a T&I unit. Abbreviations: FP: floating-point, INT: integer.

| Unit     | Type       | Component       | Area (mm²) | Total area (mm²) |
|----------|------------|-----------------|------------|------------------|
| GTU unit | Functional | FP ADD          | 0.003      | 0.06             |
| GTU unit | Functional | FP MUL          | 0.01       | 0.17             |
| GTU unit | Functional | FP RCP          | 0.11       | 0.11             |
| GTU unit | Functional | FP CMP          | 0.00072    | 0.01             |
| GTU unit | Functional | INT ADD         | 0.00066    | 0.02             |
| GTU unit | Memory     | BV refit        |            | 0.03             |
| GTU unit | Memory     | 4K RFs          | 0.019      | 0.14             |
| GTU unit |            | Control/Etc.    |            | 0.09             |
| GTU unit |            | Wiring overhead |            | 0.54             |
| GTU unit |            | Total           |            | 1.07             |
| T&I unit | Functional | FP ADD          | 0.003      | 0.32             |
| T&I unit | Functional | FP MUL          | 0.01       | 1.07             |
| T&I unit | Functional | FP RCP          | 0.11       | 0.44             |
| T&I unit | Functional | FP CMP          | 0.00072    | 0.08             |
| T&I unit | Functional | INT ADD         | 0.00066    | 0.03             |
| T&I unit | Memory     | TRV L1          | 0.037      | 0.60             |
| T&I unit | Memory     | TRV L2          |            | 1.23             |
| T&I unit | Memory     | IST             |            | 0.25             |
| T&I unit | Memory     | 4K RFs          | 0.019      | 1.18             |
| T&I unit |            | Control/Etc.    |            | 0.45             |
| T&I unit |            | Wiring overhead |            | 3.88             |
| T&I unit |            | Total           |            | 9.51             |
This configuration is the same as that in 500MHz MIMD TM [29] based on CACTI [50], and also corresponds to the cache latencies on AMD Opteron X4 (3-cycle L1 and 9-cycle L2 latencies at 2.5GHz) [51]. Additionally, caches in our architecture are read-only, in contrast to modern CPUs/GPUs, so we did not need to consider complex cache coherency issues. We also considered bank conflicts; if a bank conflict in an L2 cache occurs, the L2 cache access is delayed. As a result, the actual L2 cache latency is usually more than 10 cycles with coherent rays; if rays are very incoherent, the L2 cache latency can increase by up to dozens of cycles.

The latencies of a floating-point (FP) multiplier, an FP adder, an FP comparator, and a reciprocal unit were set to 2, 2, 1, and 16 cycles, respectively, similar to [29]. Register files (RFs) are needed for buffers between the units, ray accumulation buffers, 8-entry traversal stacks, I/O buffers to programmable shaders, and pipeline registers. The size of a ray accumulation buffer is 32 (2 (width) × 16 (height)).

To predict the performance of our system, we carefully estimated the area of a GTU unit and a T&I unit (Table 3), using an estimation metric similar to that in [23]. First, we assumed 65 nm technology, a 200 mm² die area, and a clock speed of 500 MHz, similar to TRaX [28]. Second, we assumed that the GTU unit and T&I units occupy less than 33% of the total area, similar to D-RPU [19]. The remaining area was used for programmable shaders and memory interfaces. Third, we used the area estimates for arithmetic units and caches obtained from [29] and CACTI 6.5 [50]. Fourth, we assumed that control parts require 23% of the total area for arithmetic units; this assumption is based on the ratio of the front-to-end area to that of the execution area in [52]. Fifth, we added 69% overhead into our estimation.
This overhead assumes two levels of wiring (arithmetic units → each component → a GTU unit and a T&I unit); the one-level overhead used in [52] is approximately 30%, and two such levels compound to 1.3 × 1.3 = 1.69. According to these estimates, four GTU units (4.3 mm²) and six T&I units (57.1 mm²) can be assigned into a ray-tracing core with a 200 mm² die area (31% of the total area).

4.3 Simulation Results in Dynamic Scenes

For experiments with dynamic scenes, we used the same test scenes and experimental setup as in Section 4.1. Table 4 describes the results: our system can achieve real-time frame rates at 1920×1200 resolution. The performance effect of two-bounce reflection differs in each scene because the cycles required for BV updates and for ray tracing differ in each scene (Figure 9). In the Fairy scene, frame rates with a ray recursion depth of 2 are 3× lower than those with a depth of 0 because almost all radiance rays (primary and reflection rays) hit some objects and generate additional rays. In contrast, many radiance rays in the other scenes do not hit any objects (background colors in Figure 7) and do not propagate additional rays. Thus, the differences in frame rates between the ray recursion depths of 0 and 2 in these scenes are smaller than those in the Fairy scene. In particular, the Lion scene shows almost the same frame rates regardless of the ray recursion depth because the geometry and tree update time is the bottleneck in this scene due to its high triangle count (1.6M). However, real-time frame rates (46 FPS) are still achieved.

Fig. 9. Required cycles per frame (in millions of cycles, plotted over time in seconds) for the Cloth, Fairy, Dragon, and Lion scenes. The plotted series are geometry/tree update, ray tracing (depth 0), and ray tracing (depth 2).

Memory traffic per frame is broadly proportional to the scene size because more triangles require more memory accesses for geometry and tree updates.
Additionally, two-bounce reflection increases memory traffic due to low cache hit rates; the varied directions of the normal vectors of each object can result in incoherent reflection rays, which can decrease cache hit rates.

TABLE 4
Simulation results in dynamic scenes.

| Scene (max ray depth) | # of rays per frame (M) | Cache hit rate (%) (BV refit / TRV L1 / TRV L2 / IST) | Average TRV/IST steps per ray | Memory traffic (MB/frame) | Simulated frames per second |
|------------|------|-------------------|------------|-------|-----|
| Cloth (0)  | 3.0  | 86 / 97 / 95 / 98 | 28.8 / 1.3 | 37.0  | 402 |
| Cloth (2)  | 3.9  | 86 / 95 / 95 / 97 | 33.6 / 1.4 | 51.3  | 264 |
| Fairy (0)  | 4.6  | 85 / 98 / 96 / 99 | 63.1 / 4.3 | 63.0  | 109 |
| Fairy (2)  | 11.9 | 85 / 96 / 94 / 98 | 66.5 / 5.6 | 187.7 | 36  |
| Dragon (0) | 2.6  | 85 / 98 / 93 / 95 | 28.9 / 0.5 | 80.0  | 312 |
| Dragon (2) | 3.3  | 85 / 93 / 84 / 85 | 41.0 / 0.8 | 166.5 | 212 |
| Lion (0)   | 2.7  | 85 / 92 / 97 / 98 | 54.2 / 4.7 | 763.3 | 46  |
| Lion (2)   | 3.6  | 85 / 88 / 94 / 96 | 86.4 / 8.1 | 923.2 | 46  |

In Table 5, we compare our system to other approaches. For a comparison with a CPU approach, we executed the Manta ray tracer [21] with the tree-rotation algorithm [9]. For ray traversal, we used the DynBVH traversal algorithm [4] with an 8×8 packet size. The result on a 3.5GHz Core i7 CPU is 8–23 FPS. For comparison with a GPU approach, we used NVIDIA OptiX 3.6.2 [22]. For key-frame animation, we modified the Sample6 code in the OptiX SDK. We measured performance with an NVIDIA GeForce GTX680 card, which gave a result of 7–30 FPS. In contrast to these CPU and GPU approaches, our architecture can achieve real-time frame rates in all the test scenes. These high frame rates are due to high ray traversal performance, the maintenance of tree quality, and a low tree-update overhead.

TABLE 5
Comparison of the performance for ray casting with shadows at 1920×1200 resolution. A dash marks a value that is not reported.

| | CPU (Manta) [21] | GPU (OptiX) [22] | MIC [17] | CPU-GPU hybrid [53] | Ours |
|---|---|---|---|---|---|
| Platform | Intel Core i7 4770K (4 cores) | NVIDIA GeForce GTX 680 (1536 cores) | Intel MIC (32 cores) | Intel i7 4770K + NVIDIA GTX 680 | RT H/W + CPU (1-4 cores) |
| Clock (MHz) | 3500 | 1006 | 1000 | 3500 (CPU) & 1006 (GPU) | 500 (RT H/W only) |
| Process (nm) | 22 | 28 | 45 | 22 (CPU) & 28 (GPU) | 65 (RT H/W only) |
| Area (mm²) | 177 | 294 | – | 177 (CPU) & 294 (GPU) | 200 (RT H/W only) |
| BVH update method | BV refitting [4] + tree rotation [9] | LBVH [13] + BVH refinement [16] | Binned SAH BVH construction [17] | Asynchronous BVH construction [1] | Asynchronous BVH construction [1] |
| FPS (Cloth) | 23 | 30 | 44 | 35 | 402 |
| FPS (Fairy) | 8 | 17 | 17 | 25 | 109 |
| FPS (Dragon) | 18 | 26 | 19 | 39 | 312 |
| FPS (Lion) | 8 | 7 | – | 18 | 46 |

We have also implemented a CPU-GPU hybrid ray tracer based on asynchronous BVH construction. The detailed description of the hybrid ray tracer is beyond the scope of this paper and is included in another technical report [53]. The results on a 3.5GHz Intel Core i7 CPU and an NVIDIA GTX680 GPU are 18-35 FPS, and our architecture is at least 2.5× faster than this CPU-GPU hybrid ray tracer.

Compared to full SAH BVH construction on the Intel MIC architecture [17], our approach takes advantage of asynchronous BVH construction and heterogeneous computing environments. According to [17], full BVH construction spent 41–65% of the total rendering time in the Cloth, Fairy, and Dragon scenes. In contrast, the GTU unit in our architecture occupies less than 3% of the total die area and an existing CPU performs tree reconstruction.
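The area figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that the control overhead (23%) is taken on the arithmetic-unit area and that the wiring overhead (69%) is applied to the sum of the arithmetic, memory, and control areas; this ordering is an assumption and reproduces the unit totals of Table 3 only approximately. The function and constant names are illustrative.

```python
CONTROL_RATIO = 0.23   # control logic relative to arithmetic-unit area (Section 4.2)
WIRING_RATIO = 0.69    # wiring overhead (Section 4.2)
DIE_AREA_MM2 = 200.0   # assumed die area (Section 4.2)


def estimate_unit_area(arithmetic_mm2: float, memory_mm2: float) -> float:
    """Estimate a unit's area under the assumed overhead model."""
    control = CONTROL_RATIO * arithmetic_mm2
    subtotal = arithmetic_mm2 + memory_mm2 + control
    return subtotal * (1.0 + WIRING_RATIO)


# Component totals taken from Table 3 (mm^2).
gtu_estimate = estimate_unit_area(
    arithmetic_mm2=0.06 + 0.17 + 0.11 + 0.01 + 0.02,
    memory_mm2=0.03 + 0.14,
)
tni_estimate = estimate_unit_area(
    arithmetic_mm2=0.32 + 1.07 + 0.44 + 0.08 + 0.03,
    memory_mm2=0.60 + 1.23 + 0.25 + 1.18,
)

# Core area from the unit totals reported in Table 3.
GTU_TOTAL_MM2 = 1.07
TNI_TOTAL_MM2 = 9.51
gtu_area = 4 * GTU_TOTAL_MM2          # four GTU units
tni_area = 6 * TNI_TOTAL_MM2          # six T&I units
core_fraction = (gtu_area + tni_area) / DIE_AREA_MM2
gtu_fraction = gtu_area / DIE_AREA_MM2
```

Using the Table 3 totals, four GTU units and six T&I units occupy about 61.3 mm², roughly 31% of the 200 mm² die, and the four GTU units alone account for about 2.1%, consistent with the statement that the GTU occupies less than 3% of the die area.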
4.4 Simulation Results in Static Scenes

Our ray-tracing system can also be used to accelerate the rendering of static scenes. To measure the performance of our architecture in static scenes, we set up the following experimental environment, similar to [23], [46]. We used the three scenes in Figure 10: Sibenik (80K triangles), Fairy Forest (174K triangles), and Conference (282K triangles). We obtained ray data from Aila's CUDA ray tracer [46]. The resolution is 1024×768 and the ray types are the primary ray (very coherent), the ambient occlusion (AO) ray (incoherent), and the diffuse inter-reflection ray (very incoherent). The number of samples per pixel is 32. We used AO cut-off values of 5.0, 0.3, and 5.0 for the Sibenik, Fairy, and Conference scenes, respectively. We used the same viewpoints as [46], and the performance values are averages from five representative viewpoints per scene. The BVHs were built by the split BVH build algorithm [54]. Note that we assumed that eight memory channels are connected to T&I units, in contrast to Section 4.3, because in this section we investigate ray traversal performance rather than dynamic scene performance.

Fig. 10. Sample images from the three static test scenes: Sibenik rendered with ray casting, Fairy rendered with ambient occlusion, and Conference rendered with diffuse inter-reflection.

Table 6 summarizes the results in the static test scenes. According to the results, our ray-tracing system achieves 351–969 Mrays/s. Compared to a kd-tree-based ray-tracing hardware architecture for static scenes [23], our proposed architecture achieves on average 94.4% of the performance of [23], which is comparable to that of the kd-tree-based architecture. Additionally, our architecture performs better than GPU ray tracing on GTX680 [55], even though our architecture requires fewer computational resources than modern desktop GPUs, as described in Table 5. In particular, when tracing incoherent rays, our MIMD architecture results in less performance degradation than modern SIMT-based GPU architectures.

TABLE 6
Simulation results in static scenes.

| Scene | Ray type | TRV/IST utilization (%) | Cache hit rate (%) (TRV L1 / TRV L2 / IST) | Average TRV/IST steps per ray | Memory traffic (GB/s) | Simulated Mrays/s | Relative performance compared to [23] |
|---|---|---|---|---|---|---|---|
| Sibenik (80K triangles) | Primary | 95 / 56 | 99 / 93 / 99 | 65.4 / 2.9 | 1.8 | 588 | 128% |
| Sibenik (80K triangles) | AO | 95 / 53 | 97 / 99 / 99 | 43.7 / 1.8 | 0.4 | 928 | 113% |
| Sibenik (80K triangles) | Diffuse | 65 / 50 | 87 / 92 / 89 | 76.9 / 4.3 | 27.6 | 351 | 80% |
| Fairy (174K triangles) | Primary | 73 / 72 | 98 / 93 / 99 | 89.7 / 8.7 | 2.7 | 383 | 106% |
| Fairy (174K triangles) | AO | 64 / 80 | 96 / 97 / 99 | 45.0 / 4.3 | 2.6 | 649 | 80% |
| Fairy (174K triangles) | Diffuse | 59 / 81 | 89 / 91 / 94 | 69.8 / 7.2 | 21.9 | 380 | 102% |
| Conference (282K triangles) | Primary | 82 / 73 | 99 / 93 / 99 | 58.9 / 3.7 | 1.9 | 602 | 76% |
| Conference (282K triangles) | AO | 90 / 51 | 97 / 98 / 99 | 38.8 / 1.6 | 1.2 | 969 | 82% |
| Conference (282K triangles) | Diffuse | 75 / 77 | 91 / 91 / 92 | 61.3 / 4.6 | 24.1 | 506 | 84% |

TABLE 7
The effect of the cache-data reuse scheme for L1 traversal caches.

| Configuration | Scene | TRV/IST utilization (%) | TRV L1 / L2 cache hit (%) | Simulated Mrays/s |
|---|---|---|---|---|
| Without our cache scheme | Sibenik | 53 / 41 | 84 / 92 | 290 |
| Without our cache scheme | Fairy | 55 / 75 | 87 / 91 | 349 |
| Without our cache scheme | Conference | 68 / 70 | 89 / 92 | 462 |
| With our cache scheme | Sibenik | 65 / 50 | 87 / 89 | 351 |
| With our cache scheme | Fairy | 59 / 81 | 89 / 91 | 380 |
| With our cache scheme | Conference | 75 / 77 | 91 / 91 | 506 |

We also investigated the efficiency of the cache scheme presented in Section 3.5. For this experiment, we traced diffuse inter-reflection rays, which are the most incoherent ray type in our benchmark. Under the cache-data reuse scheme, a ray that bypassed an L1 traversal cache was counted as a cache hit for the cache hit-rate calculation. According to the results shown in Table 7, the cache-data reuse scheme for L1 cache access improves the ray-tracing performance by 9–21% with increased cache hit rates. In terms of chip area, the cache-data reuse scheme requires additional register space to store a 64B cache block of data in each pipeline stage and buffer.
However, these additional registers for the T&I unit, which amount to 28KB, require only 0.13 mm².

4.5 Discussion and Limitations

More effective BVH update: Our system's performance may drop off in complex dynamic scenes because of a long tree-build time and large memory footprints. In addition, our system would not be suitable for very rapidly changing scenes (e.g., racing) because asynchronous BVH construction exploits frame-to-frame coherence. Additionally, object insertion, object deletion, or completely unstructured motion with topological changes can generate a frame drop in asynchronous BVH construction [1], since in these situations the entire BVH should be reconstructed in every frame. Finally, once triple buffering is used, the tree data stored in the first and second buffers are two frames and one frame old, respectively. The possibility of performance degradation caused by the outdated data in the first buffer was described above, but an early finish of BV refitting using the second buffer can also result in delayed rendering. For example, if a significant amount of time is still required for ray tracing during the current frame after BV refitting has finished, the BVH in the second buffer will be outdated during the idle time of GTU units. The outdated BVH can cause a perceptible delay if frame rates are very low.

We think there are four possible future improvements: partial update, tree rotation, faster BVH construction, and continuous BV refitting operations. First, if we divide the static parts and dynamic parts of the tree using multi-level hierarchies like gkDtrees [8], we can more effectively render dynamic scenes that mainly consist of static parts. Since we do not need to rebuild and refit static parts, and these static parts only need a single buffer, this method will alleviate the problem of the rapid BVH rebuild time outdating the rebuilt BVH.
In the case of object insertion, object deletion, or topological changes, the dynamic parts can be selectively restructured, in a similar manner to the method in [6]. Second, if the BV refit unit is extended to support the tree-rotation algorithm [9], the BVH update will be more robust for rapidly changing scenes. Because the tree-rotation algorithm can be easily integrated into BVH refitting, we believe this addition is feasible in our hardware architecture. Third, our approach can be combined with faster ways of constructing the BVH, such as the approximate agglomerative clustering algorithm on multi-core CPUs [10] or a dedicated hardware unit for BVH construction [20]. In this case, since the tree need not be reconstructed during each frame, our small GTU unit will help to reduce the required hardware resources for tree construction. Finally, continuous BV refitting operations can alleviate the delayed rendering problem. After the BV refitting computation is finished, we can update the BVH again at the current time if a long time remains to finish ray tracing at the current frame. This approach will generate a recently updated BVH.

Shading cost: Shading may or may not be a major cost in ray tracing [46], but we assumed that shading would not be a bottleneck due to the shading cost of Whitted ray tracing and the sufficient capability of programmable shaders. In the Embree system, complex shading for off-line rendering typically consumes 30 to 50% of the total frame time [11]; this indicates that simpler shading for real-time ray tracing of dynamic scenes would cost less than 30% of the total frame time. In fact, shading costs in simpler scenes occupy about 25% of the total rendering time in Wald's experiments (Table 7.6 in [44]). Additionally, a GPU ray tracer [55] on GTX680 shows very high ray generation and shading (RGS) performance (several hundred million rays per second).
Even a cycle-accurate simulation result on a state-of-the-art mobile processor (4-core SRP) exhibited RGS performance of up to 198 Mrays/s [32]. However, complex shading would bottleneck rendering. In this case, we think additional stream filter units [27] would be a suitable solution to maintain high SIMD utilization of the programmable shaders.

Animation and primitive types: Because we focused on a very small fixed unit for dynamic scenes, our architecture currently supports triangular primitives and key-frame animations. To support other animation and primitive types, appropriate programmable shaders would be needed. With regard to animation types, the usage of GTU units will be different. If hierarchical transformation or skinned animation is performed on shaders, the interpolation unit for key-frame animation in a GTU unit will not be used. If an object is inserted or deleted in a scene, or the geometry/tessellation shader increases the number of triangles in an object, the tree data of the object in the GTU unit are invalid and so should be reconstructed on a CPU. In this case, selective restructuring [6] or multi-level tree decomposition [8], [11], [56] can be used to effectively handle the dynamic objects as described above. With regard to primitive types, more complex interfaces between TRV units and programmable shaders will be required to prevent performance degradation caused by frequent communication or unbalanced workloads between traversal and intersection operations.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a hybrid ray-tracing architecture for dynamic scenes. Our approach achieves real-time frame rates using asynchronous BVH construction [1], [7] on a CPU, and dedicated ray-tracing hardware. We have also presented a novel traversal hardware architecture using ray-primAABB tests and an efficient cache scheme for the architecture.

There are many avenues for future work.
First, new acceleration data structures would help to increase the performance of our architecture. Our BVH-based hardware architecture is slower than a kd-tree-based hardware architecture [23] in some static scenes, as described in Section 4.4; we think that an extended architecture based on shared-plane BVHs (SPBVHs) [57] can compensate for the defect because SPBVHs have a lower traversal cost and memory footprints than BVHs. Second, we would like to prove the feasibility of our system on a register-transfer-level (RTL) implementation, after which we would like to run experiments on actual hardware. Integration with mobile GPU architectures [32] will be especially helpful for new killer mobile applications. Finally, we are interested in extending our architecture to accelerate ray-tracing-based sound rendering [58]. ACKNOWLEDGMENTS This work was supported by Samsung Electronics Co., Ltd. Jae-Ho was also supported by the National Research Foundation of Korea Grant funded by the Korean Government (Ministry of Education) [NRF-2012R1A6A3A03040332]. Dinesh Manocha was supported by ARO Contract W911NF-10-1-0506, and NSF awards 0917040 and 1320644. Models used are courtesy of the UNC Dynamic Scene Benchmarks (Cloth Simulation, Exploding Dragon, and Lion), the Utah 3D Animation Repository (Fairy Forest), Marko Dabrovic (Sibenik), and Anat Grynberg and Greg Ward (Conference). We would like to thank Tero Karras, Timo Aila, and Samuli Laine for releasing their GPU ray tracer, and to thank Tor Aamodt and his lab members for releasing GPGPU-Sim. R EFERENCES [1] [2] [3] [4] T. Ize, I. Wald, and S. G. Parker, “Asynchronous BVH construction for ray tracing dynamic scenes on parallel multi-core architectures,” in In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization, 2007, pp. 101–108. I. Wald, W. R. Mark, J. Gunther, S. Boulos, T. Ize, W. Hunt, S. G. Parker, and P. 
Shirley, “State of the art in ray tracing animated scenes,” Computer Graphics Forum, vol. 28, no. 6, pp. 1691–1722, 2009. C. Lauterbach, S.-E. Yoon, D. Tuft, and D. Manocha, “RTDEFORM: Interactive ray tracing of dynamic scenes using BVH,” in Proceedings of IEEE Symposium on Interactive Ray Tracing 2006, 2006, pp. 39–45. I. Wald, S. Boulos, and P. Shirley, “Ray tracing deformable scenes using dynamic bounding volume hierarchies,” ACM Transactions on Graphics, vol. 26, no. 1, pp. 6:1–6:18, 2007. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS , VOL. X, NO. X, 2014 [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] I. Wald, “On fast construction of SAH-based bounding volume hierarchies,” in Proceedings of IEEE Symposium on Interactive Ray Tracing 2007, 2007, pp. 33–40. S.-E. Yoon, S. Curtis, and D. Manocha, “Ray tracing dynamic scenes using selective restructuring,” in Proceedings of Eurographics symposium on rendering 2007, 2007, pp. 73–84. I. Wald, T. Ize, and S. G. Parker, “Fast, parallel, and asynchronous construction of BVHs for ray tracing animated scenes,” Computers & Graphics, vol. 32, no. 1, pp. 3–13, 2008. Y.-S. Kang, J.-H. Nah, W.-C. Park, and S.-B. Yang, “gkDtree: A group-based parallel update kd-tree for interactive ray tracing,” Journal of Systems Architecture, vol. 59, no. 3, pp. 166– 175, 2013. D. Kopta, T. Ize, J. Spjut, E. Brunvand, A. Davis, and A. Kensler, “Fast, effective BVH updates for animated scenes,” in Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 2012, pp. 197–204. Y. Gu, Y. He, K. Fatahalian, and G. Blelloch, “Efficient BVH construction via approximate agglomerative clustering,” in Proceedings of the 5th High-Performance Graphics Conference, 2013, pp. 81–88. I. Wald, S. Woop, C. Benthin, G. S. Johnson, and M. Ernst, “Embree - a kernel framework for efficient CPU ray tracing,” ACM Transactions on Graphics (SIGGRAPH 2014), vol. 33, no. 4, pp. 
143:1–143:8, 2014. K. Zhou, Q. Hou, R. Wang, and B. Guo, “Real-time KDtree construction on graphics hardware,” ACM Transactions on Graphics (SIGGRAPH Asia 2008), vol. 27, no. 5, pp. 1–11, 2008. C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, “Fast BVH construction on GPUs.” Computer Graphics Forum (EUROGRAPHICS 2008), vol. 28, no. 2, pp. 375–384, 2009. K. Garanzha, J. Pantaleoni, and D. McAllister, “Simpler and faster HLBVH with work queues,” in Proceedings of the Conference on High Performance Graphics, 2011, pp. 59–64. T. Karras, “Maximizing parallelism in the construction of BVHs, octrees, and k-d trees,” in Proceedings of the 4th conference on High-Performance Graphics, 2012, pp. 33–37. T. Karras and T. Aila, “Fast parallel construction of highquality bounding volume hierarchies,” in Proceedings of the 5th High-Performance Graphics Conference, 2013, pp. 89–99. I. Wald, “Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 1, pp. 47–57, 2012. S. Woop, G. Marmitt, and P. Slusallek, “B-KD trees for hardware accelerated ray tracing of dynamic scenes,” in GH ’06: Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, 2006, pp. 67–77. S. Woop, E. Brunvand, and P. Slusallek, “Estimating performance of a ray-tracing ASIC design,” in Proceedings of the 2006 IEEE/EG Symposium on Interactive Ray Tracing, 2006, pp. 7–14. M. J. Doyle, C. Fowler, and M. Manzke, “A hardware unit for fast SAH-optimised BVH construction,” ACM Transactions on Graphics (SIGGRAPH 2013), vol. 32, no. 4, pp. 66:1–66:13, 2013. J. Bigler, A. Stephens, and S. G. Parker, “Design for parallel interactive ray tracing systems,” in Proceedings of IEEE Symposium on Interactive Ray Tracing 2006, 2006, pp. 187–196. S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. 
Stich, “OptiX: a general purpose ray tracing engine,” ACM Transactions on Graphics (SIGGRAPH 2010), vol. 29, no. 4, pp. 66:1–66:13, 2010. J.-H. Nah, J.-S. Park, C. Park, J.-W. Kim, Y.-H. Jung, W.-C. Park, and T.-D. Han, “T&I engine: traversal and intersection engine for hardware accelerated ray tracing,” ACM Transactions on Graphics (SIGGRAPH Asia 2011), vol. 30, no. 6, pp. 160:1– 160:10, 2011. C. Lauterbach, Q. Mo, and D. Manocha, “gProximity: Hierarchical GPU-based operations for collision and distance queries,” Computer Graphics Forum (EUROGRAPHICS 2010), vol. 29, no. 2, pp. 419–428, 2010. J. Schmittler, S. Woop, D. Wagner, W. J. Paul, and P. Slusallek, “Realtime ray tracing of dynamic scenes on an FPGA chip,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, 2004, pp. 95–106. 13 [26] S. Woop, J. Schmittler, and P. Slusallek, “RPU: a programmable ray processing unit for realtime ray tracing,” ACM Transactions on Graphics (SIGGRAPH 2005), vol. 24, no. 3, pp. 434–444, 2005. [27] K. Ramani, C. P. Gribble, and A. Davis, “StreamRay: a stream filtering architecture for coherent ray tracing,” in ASPLOS ’09: Proceeding of the Architectural support for programming languages and operating systems, 2009, pp. 325–336. [28] J. Spjut, A. Kensler, D. Kopta, and E. Brunvand, “TRaX: a multicore hardware architecture for real-time ray tracing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 12, pp. 1802–1815, 2009. [29] D. Kopta, J. Spjut, E. Brunvand, and A. Davis, “Efficient MIMD architectures for high-performance ray tracing,” in ICCD 2010: Proceedings of the 28th IEEE International Conference on Computer Design, 2010, pp. 9–16. [30] T. Aila and T. Karras, “Architecture considerations for tracing incoherent rays,” in HPG’ 10: Proceedings of the Conference on High Performance Graphics, 2010, pp. 113–122. [31] W.-J. Lee, S.-H. Lee, J.-H. Nah, J.-W. Kim, Y. Shin, J. Lee, and S.-Y. 
Jung, “SGRT: a scalable mobile GPU architecture based on ray tracing,” in ACM SIGGRAPH 2012 Talks, 2012. [32] W.-J. Lee, Y. Shin, J. Lee, J.-W. Kim, J.-H. Nah, S.-Y. Jung, S.H. Lee, H.-S. Park, and T.-D. Han, “SGRT: A mobile GPU architecture for real-time ray tracing,” in Proceedings of the 5th High-Performance Graphics Conference, 2013, pp. 109–119. [33] J.-H. Nah, H.-J. Kwon, D.-S. Kim, C.-H. Jeong, J. Park, T.-D. Han, D. Manocha, and W.-C. Park, “RayCore: A ray-tracing hardware architecture for mobile devices,” ACM Transactions on Graphics, vol. 33, no. 5, pp. 162:1–162:15, 2014. [34] S. Woop, “A programmable hardware architecture for realtime ray tracing of coherent dynamic scenes,” Ph.D. dissertation, Sarrland University, 2007. [35] B. C. Budge, J. C. Anderson, C. Garth, and K. I. Joy, “A hybrid CPU-GPU implementation for interactive ray-tracing of dynamic scenes,” University of California, Davis Computer Science, Tech. Rep. CSE-2008-9, 2008. [36] J.-H. Nah, Y.-S. Kang, K.-J. Lee, S.-J. Lee, T.-D. Han, and S.B. Yang, “MobiRT: an implementation of OpenGL ES-based CPU-GPU hybrid ray tracer for mobile devices,” in ACM SIGGRAPH ASIA 2010 Sketches, 2010, pp. 50:1–50:2. [37] J. Bikker and J. van Schijndel, “The brigade renderer: A path tracer for real-time games,” International Journal of Computer Games Technology, 2013. [38] B. Budge, T. Bernardin, J. A. Stuart, S. Sengupta, K. I. Joy, and J. D. Owens, “Out-of-core data management for path tracing on hybrid resources,” Computer Graphics Forum, vol. 28, no. 2, pp. 385–396, 2009. [39] A. Pajot, L. Barthe, M. Paulin, and P. Poulin, “Combinatorial bidirectional path-tracing for efficient hybrid CPU/GPU rendering,” Computer Graphics Forum, vol. 30, no. 2, pp. 315–324, 2011. [40] LuxRender, “Luxrays.” [Online]. Available: http://www. [41] A. Reshetov, “Faster ray packets - triangle intersection through vertex culling,” in Proceedings of IEEE Symposium on Interactive Ray Tracing 2007, 2007, pp. 105–112. [42] J. 
Snyder and A. Barr, “Ray tracing complex models containing surface tessellations,” in ACM SIGGRAPH Computer Graphics, vol. 21, 1987, pp. 119–128.
[43] J.-H. Nah, W.-C. Park, Y.-S. Kang, and T.-D. Han, “Ray-box culling for tree structures,” Journal of Information Science and Engineering, vol. 29, no. 6, pp. 1211–1225, 2013.
[44] I. Wald, “Realtime ray tracing and interactive global illumination,” Ph.D. dissertation, Saarland University, 2004.
[45] T. Möller and B. Trumbore, “Fast, minimum storage ray-triangle intersection,” Journal of Graphics Tools, vol. 2, no. 1, pp. 21–28, 1997.
[46] T. Aila and S. Laine, “Understanding the efficiency of ray traversal on GPUs,” in HPG ’09: Proceedings of the Conference on High Performance Graphics, 2009, pp. 145–149.
[47] S. Laine, “Restart trail for stackless BVH traversal,” in HPG ’10: Proceedings of the Conference on High Performance Graphics, 2010, pp. 107–111.
[48] J.-H. Nah and D. Manocha, “SATO: Surface-area traversal order for shadow ray tracing,” Computer Graphics Forum, vol. 33, no. 6, pp. 167–177, 2014.
[49] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software 2009, 2009, pp. 163–174.
[50] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007, pp. 3–14.
[51] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th ed. Morgan Kaufmann Publishers Inc., 2008.
[52] A. Mahesri, D. Johnson, N. Crago, and S. J.
Patel, “Tradeoffs in designing accelerator architectures for visual computing,” in MICRO 41: Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, 2008, pp. 164–175.
[53] J.-W. Kim, J.-M. Kim, M. Lee, and T.-D. Han, “Asynchronous BVH reconstruction on CPU-GPU hybrid architecture,” in ACM SIGGRAPH 2014 Posters, 2014, pp. 91:1–91:1.
[54] M. Stich, H. Friedrich, and A. Dietrich, “Spatial splits in bounding volume hierarchies,” in HPG ’09: Proceedings of the Conference on High Performance Graphics, 2009, pp. 7–13.
[55] T. Aila, S. Laine, and T. Karras, “Understanding the efficiency of ray traversal on GPUs – Kepler and Fermi addendum,” NVIDIA Corporation, NVIDIA Technical Report NVR-201202, 2012.
[56] I. Wald, C. Benthin, and P. Slusallek, “Distributed interactive ray tracing of dynamic scenes,” in IEEE Symposium on Parallel and Large-Data Visualization and Graphics, 2003, pp. 77–86.
[57] M. Ernst and S. Woop, “Ray tracing with shared-plane bounding volume hierarchies,” Journal of Graphics, GPU, and Game Tools, vol. 15, no. 3, pp. 141–151, 2011.
[58] C. Schissler, R. Mehra, and D. Manocha, “High-order diffraction and diffuse reflections for interactive sound propagation in large environments,” ACM Transactions on Graphics (SIGGRAPH 2014), vol. 33, no. 4, pp. 39:1–39:12, 2014.

Jae-Ho Nah received the B.S., M.S., and Ph.D. degrees from the Department of Computer Science, Yonsei University in 2005, 2007, and 2012, respectively. Currently, he is a senior research engineer at LG Electronics. His research interests include ray tracing, rendering algorithms, and graphics hardware.
Jin-Woo Kim is a PhD candidate of the Media System Laboratory at Yonsei University, Seoul, Korea. He received the BS degree from Sangmyung University, Seoul, Korea, in 2006. His research interests include 2D/3D graphics hardware, real-time ray tracing, and GPGPU based parallel programming.
Junho Park currently works at Humax, Korea.
He received his BS and MS degrees from Soongsil University and Yonsei University, respectively. His main interests include ray tracing, GPGPU, and computer vision.
Won-Jong Lee is a senior researcher of the Processor Architecture Lab at Samsung Advanced Institute of Technology. He received his Ph.D. and M.S. degrees in computer science from Yonsei University, Seoul, Korea, in 2001 and his B.S. degree in computer engineering from Inha University, Incheon, Korea, in 1999. His research interests include mobile GPU, graphics hardware, ray tracing, and parallel and distributed rendering. Currently he is leading a project on designing a mobile GPU architecture based on ray tracing.
Jeong-Soo Park received the B.S., M.S., and Ph.D. degrees from the Department of Computer Science, Yonsei University in 2003, 2005, and 2014, respectively. Currently, he is a researcher of the Processor Architecture Lab at Samsung Advanced Institute of Technology. His research interests include 3D graphics hardware, programmable shader architecture, and mobile 3D graphics.
Seok-Yoon Jung received the B.S. and M.S. degrees in control and instrumentation engineering and the Ph.D. degree in electrical engineering from Seoul National University, Seoul, Korea, in 1987, 1989, and 1998, respectively. Since February 1989, he has been a Member of the Research Staff of Samsung Advanced Institute of Technology, Kyungki, Korea. His current research interests include image processing, image and video data compression, and three-dimensional graphics modeling and representation.
Woo-Chan Park received the M.S. and Ph.D. degrees in Computer Science from Yonsei University in 1995 and 2000, respectively. Currently, he is a professor at the School of Computer Engineering, Sejong University, Seoul, Korea. His research interests include 3D rendering processor architecture, ray tracing accelerator, parallel rendering, high performance computer architecture, computer arithmetic, and ASIC design.
Dinesh Manocha is currently the Phi Delta Theta/Mason Distinguished Professor of Computer Science at the University of North Carolina at Chapel Hill. He has co-authored more than 380 papers in the leading conferences and journals on computer graphics, robotics, and scientific computing. He has also served as program chair for many conferences and on the editorial boards of many leading journals. Some of the software systems related to collision detection, GPU-based algorithms, and geometric computing developed by his group have been downloaded by more than 150,000 users and are widely used in the industry. Manocha has received awards including the Alfred P. Sloan Fellowship, the NSF Career Award, the Office of Naval Research Young Investigator Award, and 14 best paper awards at the leading conferences. He is a Fellow of ACM, AAAS, and IEEE, and received the Distinguished Alumni Award from the Indian Institute of Technology, Delhi.
Tack-Don Han is a professor in the Department of Computer Science at Yonsei University, Korea. His research interests include high performance computer architecture, media system architecture, and wearable computing. He received his Ph.D. in Computer Engineering from the University of Massachusetts.