
Acceleration Data Structures for Ray Tracing on Mobile Devices

2019

Mobile devices are continuously becoming more efficient at performing computationally expensive tasks, such as ray tracing. A lot of research effort has been put into using acceleration data structures to minimize the computational cost of ray tracing and optimize the use of GPU resources. However, with the vast majority of research focusing on desktop GPUs, there is a lack of data regarding how such optimizations scale on mobile architectures, where there is a different set of challenges and limitations. Our work bridges the gap by providing a performance analysis of not only ray tracing as a whole, but also of different data structures and techniques. We implemented and profiled the performance of multiple acceleration data structures across different instrumentation tools using a set of representative test scenes. Our investigation concludes that a hybrid rendering approach is more suitable for current mobile environments, with greater performance benefits observed when using data structures that focus on reducing memory bandwidth and ALU usage.

Nuno Sousa¹, David Sena², Nikolaos Papadopoulos² and João Pereira¹
¹ Instituto Superior Técnico/Inesc-ID, Universidade de Lisboa, Lisboa, Portugal
² Samsung R&D UK, Staines, U.K.

Keywords: Ray Tracing, Acceleration Structures, Mobile Environment, Android, OpenGL ES.

Sousa, N., Sena, D., Papadopoulos, N. and Pereira, J. Acceleration Data Structures for Ray Tracing on Mobile Devices.

1 INTRODUCTION

The hardware of mobile devices has improved significantly over the past few years. There are, however, limitations, and developers are always searching for optimizations that allow them to make the best use of available hardware. Nevertheless, today a mobile device is capable of rendering graphically intensive applications with reasonable quality and performance.

Ray tracing is a rendering technique capable of producing highly realistic results at a higher computational cost than rasterization-based approaches. With the release of technologies like DirectX Raytracing (DXR), native support for hardware-accelerated ray tracing is starting to become more accessible to end users.

The high computational cost of ray tracing can be reduced with the use of acceleration data structures, a topic that has been primarily researched for desktop computers. Our main objective is to present a comparative study of the performance of these data structures on mobile platforms and document their characteristics.

2 PREVIOUS WORK

The idea of using ray shooting for the generation of images was first introduced by Appel (1968). Several other techniques have since been developed that provide much higher visual fidelity by simulating visual effects like reflections (Whitted, 1979), soft shadows (Cook et al., 1984), depth-of-field (Cook et al., 1984) and even global illumination (Kajiya, 1986). This, however, is outside the scope of our work. The focus of this research is the performance of acceleration data structures and not the visual fidelity achieved with different ray tracing techniques.

2.1 Acceleration Data Structures

Acceleration data structures can be used to reduce the number of ray-primitive intersection tests. An acceleration data structure algorithm transforms scene data into a format that minimizes the number of intersection tests at runtime and optimizes the use of hardware.

For our research we focused on the traversal performance of KD-Trees and Bounding Volume Hierarchies (BVHs). Both structures make use of the Surface Area Heuristic (SAH) (Goldsmith and Salmon, 1987) to determine the best splitting point for each node that is being subdivided.
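To make the role of the SAH more concrete, the following minimal Python sketch evaluates the commonly used formulation cost = C_traversal + C_intersection * (SA(L)/SA(P) * N_L + SA(R)/SA(P) * N_R) and compares it with the cost of leaving the node as a leaf. The function names, the box representation and the default cost constants are assumptions made for illustration only; they are not taken from the implementation described in this paper.

```python
def surface_area(box_min, box_max):
    """Surface area of an axis-aligned box given by two (x, y, z) corners."""
    dx, dy, dz = (hi - lo for lo, hi in zip(box_min, box_max))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_split_cost(parent, left, right, n_left, n_right,
                   c_traversal=1.0, c_intersection=1.0):
    """Expected cost of splitting `parent` into `left` and `right`.

    Each box is a (min_corner, max_corner) pair; n_left and n_right are the
    numbers of primitives assigned to each child. The cost constants are
    placeholder defaults (assumption).
    """
    sa_parent = surface_area(*parent)
    # Probability that a ray hitting the parent also hits each child.
    p_left = surface_area(*left) / sa_parent
    p_right = surface_area(*right) / sa_parent
    return c_traversal + c_intersection * (p_left * n_left + p_right * n_right)


def sah_leaf_cost(n_prims, c_intersection=1.0):
    """Cost of not splitting: every primitive in the node is tested."""
    return c_intersection * n_prims


# Hypothetical node: a 2x1x1 box holding 8 primitives, split in the middle.
parent = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
left = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
right = ((1.0, 0.0, 0.0), (2.0, 1.0, 1.0))

split = sah_split_cost(parent, left, right, 4, 4)
leaf = sah_leaf_cost(8)
should_split = split < leaf  # split only when it is expected to be cheaper
```

A builder would typically evaluate this cost for every candidate split plane and keep the cheapest one, falling back to a leaf when no split beats `sah_leaf_cost`. Raising `c_traversal` relative to `c_intersection` makes splitting less attractive and therefore tends to produce shallower trees.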
DOI: 10.5220/0007575403320339. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 332-339. ISBN: 978-989-758-354-4. Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
GRAPP 2019 - 14th International Conference on Computer Graphics Theory and Applications

2.1.1 Bounding Volume Hierarchies

Bounding Volume Hierarchies are based on bounding volumes (Kay and Kajiya, 1986). A BVH is a tree in which the root consists of a bounding volume that encloses the whole scene. Each internal node is a bounding volume of a subset of the objects of its parent node. The leaves contain the actual geometry to test against.

The BVH can be subdivided, for example, by using the median of the centroids of the enclosed objects or by using SAH.

In this work we chose to focus on the following traversal algorithms:

• Stack-less Parent-Link Traversal - a link-based algorithm that tries to provide the same traversal order as the stack-based algorithm while being stack-less (Hapala et al., 2011).
• Restart Trail Traversal - this algorithm adapts the KD-restart algorithm, used to traverse KD-trees, to Bounding Volume Hierarchies (Laine, 2010).

2.1.2 KD-Trees

First introduced as a method for searching points in a k-dimensional space (Bentley, 1975), KD-trees are a specific case of Binary Space Partitioning (BSP).

Just like with BVHs, spatial subdivision can be based on SAH. An optimised O(N log N) construction algorithm was introduced by Wald and Havran (2006), which uses an ordered event list with special list-splitting rules.

Our work focused on Graphics Processing Unit (GPU) based algorithms (Hapala and Havran, 2011), more specifically:

• Kd-Push-Down Traversal - this algorithm expands Kd-Restart (Horn et al., 2007), which works by moving a point along the ray and finding the leaf where the point is located. By keeping the lowest depth-wise node that contains the interval of intersection in its entirety, this node can then be used instead of the root node when restarting the search.
• Kd-Backtrack Traversal - this algorithm adds to each node the corresponding bounding box and a pointer to the parent node (Foley and Sugerman, 2005) to avoid restarting the search from the root node.

2.2 Mobile Environment

The system-on-chip architectures of mobile devices have restrictions on the amount of power they can draw and, due to the small form factor, the amount of heat they are able to dissipate. As a result, computational resources are more limited than on desktop computers.

3 IMPLEMENTATION

This research focuses on mobile environments running Android and was conducted in partnership with Samsung UK. The following sections describe how the application was implemented, which ray tracing algorithms were used and which data structures were implemented. We also describe the different rendering approaches used.

3.1 Ray Tracing Implementation

We implemented Whitted ray tracing (Whitted, 1979) with a ray spawned for each pixel of the framebuffer and a subsequent ray spawned for each light visibility query when an intersection is found. We implemented the ray-triangle intersection algorithm by Möller and Trumbore (2005), and for Axis Aligned Bounding Box (AABB) ray intersections we used the ray-box intersection algorithm by Williams et al. (2005).

We implemented different data packing arrangements of primitives using Shader Storage Buffer Objects (SSBOs). Our initial approach was to store each triangle vertex and each normal as a vec3 with padding. In our second approach we used the padding of the three vertices to store the first normal and minimize the size per primitive. Our last approach was based on the fact that normals are not used during intersection testing. We split the vertices and normals into separate SSBOs to reduce redundant memory accesses.

3.2 Implementation of Acceleration Data Structures

For the GPU rendering methods we chose to implement only KD-Trees and BVHs and not regular grids, because regular grids are consistently outperformed by BVHs and KD-Trees apart from very specific situations (Thrane and Simonsen, 2005).

3.2.1 KD-Tree Implementation

The construction of KD-Trees in our implementation is done using the SAH algorithm (Wald and Havran, 2006). Our implementation allows for the creation of empty leaf nodes but does not perform triangle clipping. By experimenting with different values for C_traversal and C_intersection, we concluded that mobile architectures tend to favour wider and shallower trees. We found a value of 3.0 for C_traversal and 1.5 for C_intersection to yield good results.

The memory layout for KD-Tree nodes varies according to which traversal algorithm is being used.

Figure 1: Layouts of KD-Tree nodes: (a) KD-Pushdown node layout, (b) KD-Backtrack node layout. vMin and vMax represent the node bounding box.

Another difference between the trees for the two traversal methods is that, while building the tree for the KD-Backtrack traversal method, we do not allow perfectly flat nodes, i.e. nodes that have zero length on one of the axes. This is done to avoid precision-related issues while traversing the tree. We implemented the KD-Pushdown and KD-Backtrack algorithms using the node layouts shown in Figure 1.

3.2.2 BVH Implementation

In our implementation, the construction of BVHs is done using an altered version of the construction algorithm (Wald and Havran, 2006) that was also used for the KD-Tree construction. As with KD-Trees, after experimenting with several values, we came to the conclusion that wider, shallower trees tend to perform best. As such, the values chosen for C_traversal and C_intersection were, again, 3.0 and 1.5 respectively.

The memory layout for BVH nodes also varies according to which traversal method is being used. The possible layouts are shown in Figure 2.

Figure 2: Layout of BVH nodes: (a) BVH Trail traversal node layout, (b) BVH Parent traversal node layout. vMin and vMax represent the node bounding box.

For GPU traversal we implemented Trail traversal along with the Parent-Link traversal algorithm.

3.3 GPU Rendering Methods

Our implementation used multiple rendering approaches. Regardless of the rendering method chosen, our implementation starts by constructing the selected acceleration structure along with the auxiliary structures for primitive storage. These structures are then copied to GPU memory as Shader Storage Buffer Objects (SSBOs). The application also creates and uploads a Vertex Array Object (VAO) containing a full-screen quad that is then used for every rendering method. The drawing process, however, changes according to which rendering approach is selected:

• Fragment Shaders - the application renders a full-screen quad using a very simple vertex shader. The fragment shader is then responsible for ray tracing the corresponding pixel. In this case, all the code for ray tracing and structure traversal is contained in the fragment shader.
• Compute Shaders - the application performs a two-step process. In the first step, the application dispatches the necessary compute workgroups so that each thread processes a pixel of the final image. The result of this first step is stored in an Image Buffer, which is then used in the second step as an input texture. The second pass simply draws a full-screen quad using the texture generated in the first step.
• Hybrid Shading - the application creates not only a VAO containing the full-screen quad, but also a second VAO containing the entire geometry of the scene being rendered. This second VAO is used in the first phase of the rendering process, where the application issues a draw call that rasterizes all primitives. This first phase stores the calculated normals in a color attachment. A depth buffer is also generated in this first step. These two buffers are then used in the second phase of the drawing process, where the full-screen quad is rendered. The values in the buffers are used to create and cast the shadow ray, which then triggers a structure traversal. For this rendering method, all the ray tracing logic is in the fragment shader of the second pass.

4 EVALUATION

To evaluate the performance of the different algorithms, we profiled our implementation by collecting in-app metrics and by using external instrumentation tools.

The application was developed using OpenGL ES and tested on a Samsung Galaxy S8 (Model SM-G950U) with a 64-bit Qualcomm Snapdragon 835 system-on-chip and a Qualcomm Adreno 540 GPU, provided by Samsung Research, UK.

4.1 Application Metrics

At the rendering stage, several measurements are collected in order to evaluate overall performance:

• Framerate - how many times the image is updated per second. Frametime, expressed in milliseconds, while more accurate, is harder to measure on mobile due to hardware and architectural restrictions.
• Rays per Second - expressed in millions of rays per second. Rays-per-Second (RPS) more accurately describes the raw ray tracing power of the underlying hardware.
• Structure Size - expressed in KBytes. Represents the overall size of the generated acceleration structure.

The application can also display a heatmap representing the number of node traversals for each pixel. To keep results consistent, we consider a node to be traversed when it is fetched from the acceleration structure SSBO.

4.2 External Tools

The Qualcomm Snapdragon Profiler allows developers to profile devices with Snapdragon processors. The profiler provides several metrics of interest to our work:

• SP Memory Read - number of bytes read from memory by the Shader Processors per second.
• % Shader ALU Capacity Utilized - percentage of the maximum shader ALU capacity that is being utilized.
• % Time ALUs Working - percentage of time the ALUs are working while the shaders are busy.
• ALU/Fragment - average number of ALU instructions performed per fragment.

4.3 Test Scenes and Test Methodology

For our tests we used a number of different scenes such as the Cornell Box, the Cornell Buddha, the Fairy Forest and the Crytek Sponza.

The application always renders the resulting image at 1024x1024 resolution. We also restart the application between every test.

5 RENDERING TECHNIQUES COMPARISON

We implemented three different rendering approaches, and for each approach we profiled the performance of different traversal methods across different scenes. While Fragment and Compute rendering had similar performance, as shown in Figure 3, the hybrid rendering approach distinguished itself by having better performance. This is due to the higher computational cost of ray tracing when compared with rasterization.

Figure 3: Graph comparing the performance of different rendering methods.

Overall, our results show that using hybrid rendering is the best approach when implementing ray tracing on mobile. However, different ray tracing algorithms may benefit from using Fragment and Compute Shader based rendering.

6 PRIMITIVE LAYOUT COMPARISON

As shown in Figure 4, we ran a series of tests to analyse the performance of the different primitive layouts. Results show that a compact layout does not always equate to better performance, as it needs a few extra instructions to retrieve the stored normal. The split layout, on the other hand, consistently yields either similar or better results.

Figure 4: Graph showing performance for different primitive layouts.

To better understand the impact of these changes when accessing memory, we performed a bandwidth analysis using the Cornell Box and Buddha scenes. The results are shown in Figures 5 and 6.

Figure 5: Memory bandwidth usage when varying primitive layout for the Cornell Box Scene.

Figure 6: Memory bandwidth usage when varying primitive layout for the Buddha Scene.

The results show that there is a reduction in memory bandwidth from using the compact and split layouts, with the split layout producing the best results. Having vertices and normals separated means that no bandwidth is wasted on normals that are not being used. This maximizes the number of vertices fetched each time, which means the overall number of fetches is reduced.

Scenes with higher complexity require more accesses to the data structures containing the geometry, and thus optimization of the primitive layout becomes more important because of its impact on memory bandwidth utilization.

7 ACCELERATION DATA STRUCTURE COMPARISON

Acceleration data structures are essential for the efficiency of ray tracing algorithms and as such have an impact on performance. In the following sections we present and discuss the results we obtained with regard to computation cost and memory utilization.

Figure 7: Performance values for acceleration structures.

Figure 8: Heatmap of the fairy scene for all structures: (a) KD-Backtrack, (b) KD-Pushdown, (c) BVH-Parent, (d) BVH-Trail.

7.1 Performance

To evaluate the performance of the different data structures, we conducted a series of tests for the traversal methods we chose across different scenes.
For these particular tests we only used the Fragment Shader renderer along with a normal primitive layout.

Figure 7 shows the results obtained from all the test runs. Whilst the performance of the KD-Pushdown traversal excels in scenes with a lower number of primitives, it quickly deteriorates in scenes with higher geometric complexity. In contrast, BVH Trail traversal performs better in more complex scenes.

To understand the performance difference between traversal methods, we created heatmaps for each traversal method and each scene. Figure 8 shows a subset of the generated heatmaps.

We obtained the best results using BVH Trail. KD-Backtrack and BVH-Parent-Link followed, yielding similar results to each other. This verifies our previous observations and shows that the number of traversed nodes correlates with the performance of the algorithm. Due to the high cost of bandwidth, we also measured its impact. The results are shown in Figure 9.

Figure 9: SP Memory Read values for each structure. Values for the Cornell Box were not visible at this scale.

Algorithms based on KD-Trees consume higher memory bandwidth than those based on BVHs. One explanation is that the increased number of nodes generated by KD-Trees boosts the probability of executing a memory fetch for each new traversed node. This is because each local memory fetch request is less likely to contain the next node that needs to be traversed. In Figure 11 we also analyse the ALU instructions per fragment.

Figure 11: ALU instructions per fragment.

Results show that the KD-Backtrack algorithm requires the least overall amount of ALU instructions. This is due to the fact that it performs no near-far classification and, consequently, uses fewer instructions per node traversed than other traversal methods.

At the opposite end, the traversal method with the most ALU instructions per fragment is KD-Pushdown, due to the increased number of nodes and the extra traversal steps it takes. This, combined with higher memory bandwidth utilization, results in the worst performance across all the traversal methods.

Results show that traversal methods with fewer ALU instructions per fragment have better performance. The exception is the KD-Backtrack traversal which, despite using fewer ALU instructions, has its traversal slowed down by its high memory bandwidth requirements.

According to these results, the performance of the traversal algorithms is limited by a combination of ALU, memory accesses and bandwidth. However, the strongest limitation appears to be the number of traversed nodes, i.e. the number of accesses to the SSBOs that contain the acceleration structure.

7.2 SAH Costs Comparison

One of the ways to optimize KD-Trees and BVHs built using the SAH is to tweak the values for traversal and intersection cost given to the construction function. Usually, having the intersection cost higher than the traversal cost yields better results. To test this we used four different cost values and a combination of all the scenes and traversal methods.

Figure 10: Performance comparison for different SAH cost values.

As shown in Figure 10, in our tests an intersection cost lower than the traversal cost provided the best performance. The results also show that it is best to keep the traversal cost only slightly higher than the intersection cost. Reducing the cost of traversal equates to a higher number of node traversals necessary per pixel. The performance numbers shown in Figure 10 corroborate the previous results showing that the number of traversals correlates with the performance of the traversal algorithm.

8 CONCLUSIONS

Our work focused on providing a performance analysis of different acceleration data structures for ray tracing on mobile devices. Our main goal was to establish a basis for future research into the potential of mobile environments. As for future work, we want to explore the performance of construction methods for dynamic scenes to provide further insight into current mobile hardware capabilities.

ACKNOWLEDGEMENTS

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2019.

REFERENCES

Appel, A. (1968). Some techniques for shading machine renderings of solids. In Proceedings of the AFIPS Conference, pages 37–45.
Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517.
Cook, R., Porter, T., and Carpenter, L. (1984). Distributed ray tracing. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 137–145.
Foley, T. and Sugerman, J. (2005). Kd-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 15–22.
Goldsmith, J. and Salmon, J. (1987). Automatic creation of object hierarchies for ray tracing. IEEE Comput. Graph. Appl., 7(5):14–20.
Hapala, M., Davidovič, T., Wald, I., Havran, V., and Slusallek, P. (2011). Efficient Stack-less BVH Traversal for Ray Tracing. In Proceedings of the 27th Spring Conference on Computer Graphics, pages 7–12.
Hapala, M. and Havran, V. (2011). Review: Kd-tree Traversal Algorithms for Ray Tracing. Computer Graphics Forum, 30(1):199–213.
Horn, D. R., Sugerman, J., Houston, M., and Hanrahan, P. (2007). Interactive kd tree GPU raytracing. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 167–174.
Kajiya, J. (1986). The rendering equation. In Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, pages 143–150.
Kay, T. L. and Kajiya, J. T. (1986). Ray tracing complex scenes. In ACM SIGGRAPH Computer Graphics, volume 20, pages 269–278.
Laine, S. (2010). Restart Trail For Stackless BVH Traversal. In Proceedings of the Conference on High Performance Graphics, pages 107–111.
Möller, T. and Trumbore, B. (2005). Fast, minimum storage ray/triangle intersection. In ACM SIGGRAPH 2005 Courses.
Thrane, N. and Simonsen, L. O. (2005). A comparison of acceleration structures for GPU assisted ray tracing. Master's thesis.
Wald, I. and Havran, V. (2006). On building fast kd-Trees for Ray Tracing, and on doing that in O(N log N). In Proceedings of the IEEE Symposium on Interactive Ray Tracing, pages 61–69.
Whitted, T. (1979). An improved illumination model for shaded display. In Proceedings of the 6th Annual Conference on Computer Graphics and Interactive Techniques.
Williams, A., Barrus, S., Morley, R. K., and Shirley, P. (2005). An efficient and robust ray-box intersection algorithm. In ACM SIGGRAPH 2005 Courses.
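As a supplementary illustration of the primitive layouts compared in Sections 3.1 and 6, the short Python sketch below estimates how many bytes each layout stores per triangle and how many of those bytes an intersection test would have to fetch. It rests on assumptions that are not stated in the paper: 4-byte floats, every vec3 padded to 16 bytes (as in the std430 buffer layout), and the whole per-triangle record being fetched whenever vertices and normals share a buffer. The layout names and the function are hypothetical.

```python
FLOAT_BYTES = 4
# Assumption: a padded vec3 occupies the same 16 bytes as a vec4.
PADDED_VEC3_BYTES = 4 * FLOAT_BYTES


def bytes_per_triangle(layout):
    """Return (bytes fetched for an intersection test, bytes stored) per triangle."""
    if layout == "padded":
        # 3 vertices + 3 normals, each stored as a padded vec3 in one buffer.
        stored = 6 * PADDED_VEC3_BYTES
        return stored, stored
    if layout == "compact":
        # The first normal lives in the padding of the 3 vertices,
        # so only 2 normals need their own padded slots.
        stored = 5 * PADDED_VEC3_BYTES
        return stored, stored
    if layout == "split":
        # Vertices and normals sit in separate buffers; intersection
        # testing only touches the vertex buffer.
        vertices = 3 * PADDED_VEC3_BYTES
        normals = 3 * PADDED_VEC3_BYTES
        return vertices, vertices + normals
    raise ValueError(f"unknown layout: {layout}")


for name in ("padded", "compact", "split"):
    fetched, stored = bytes_per_triangle(name)
    print(f"{name}: {fetched} bytes fetched per test, {stored} bytes stored")
```

Under these assumptions the split layout stores as much data as the padded layout but fetches half as much during intersection testing, which is consistent with the direction of the bandwidth results reported in Section 6, although the actual figures depend on the GPU's cache and fetch granularity.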