Acceleration Data Structures for Ray Tracing on Mobile Devices
Nuno Sousa¹, David Sena², Nikolaos Papadopoulos² and João Pereira¹
¹ Instituto Superior Técnico/Inesc-ID, Universidade de Lisboa, Lisboa, Portugal
² Samsung R&D UK, Staines, U.K.
Keywords: Ray Tracing, Acceleration Structures, Mobile Environment, Android, OpenGL ES.
Abstract: Mobile devices are continuously becoming more efficient at performing computationally expensive tasks,
such as ray tracing. A lot of research effort has been put into using acceleration data structures to minimize
the computational cost of ray tracing and optimize the use of GPU resources. However, with the vast majority
of research focusing on desktop GPUs, there is a lack of data regarding how such optimizations scale on
mobile architectures, where there is a different set of challenges and limitations. Our work bridges the gap
by providing a performance analysis of not only ray tracing as a whole, but also of different data structures
and techniques. We implemented and profiled the performance of multiple acceleration data structures across
different instrumentation tools using a set of representative test scenes. Our investigation concludes that a
hybrid rendering approach is more suitable for current mobile environments, with greater performance benefits
observed when using data structures that focus on reducing memory bandwidth and ALU usage.
1 INTRODUCTION
The hardware of mobile devices has improved significantly over the past few years. There are, however, limitations, and developers are always searching for optimizations that allow them to make the best use of available hardware. Nevertheless, today a mobile device is capable of rendering graphically intensive applications with reasonable quality and performance.
Ray tracing is a rendering technique capable of producing highly realistic results at higher computational cost than rasterization based approaches. With the release of technologies like DirectX Raytracing (DXR), native support for hardware accelerated ray tracing is starting to become more accessible to an end user.
The high computational cost of ray tracing can be reduced with the use of acceleration data structures, a topic that has been primarily researched for desktop computers. Our main objective is to present a comparative study of the performance of these data structures on mobile platforms and document their characteristics.
2 PREVIOUS WORK
The idea of using ray shooting for the generation of images was first introduced by (Appel, 1968). Several other techniques have since been developed that provide much higher visual fidelity by simulating visual effects like reflections (Whitted, 1979), soft shadows (Cook et al., 1984), depth-of-field (Cook et al., 1984) and even global illumination (Kajiya, 1986). This, however, is outside the scope of our work. The focus of this research is the performance of acceleration data structures and not the visual fidelity achieved with different ray tracing techniques.
2.1 Acceleration Data Structures
Acceleration data structures can be used to reduce the number of ray-primitive intersection tests. An acceleration data structure algorithm transforms scene data to a format that minimizes the number of intersection tests at runtime and optimizes the use of hardware.
For our research we focused on the traversal performance of KD-Trees and Bounding Volume Hierarchy (BVH). Both structures make use of Surface Area Heuristic (SAH) (Goldsmith and Salmon, 1987) to determine the best splitting point for each node that is being subdivided.
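To make the role of the SAH concrete, the following is a minimal sketch of the commonly used SAH cost estimate for one candidate split, not the construction code used in this work. The names `AABB`, `sah_split_cost` and `leaf_cost` are hypothetical, and the two cost constants are passed in explicitly as parameters.

```python
from dataclasses import dataclass


@dataclass
class AABB:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    def surface_area(self) -> float:
        dx, dy, dz = (h - l for l, h in zip(self.lo, self.hi))
        return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_split_cost(parent, left, right, n_left, n_right,
                   c_traversal, c_intersection):
    """Estimated cost of splitting `parent` into `left` and `right`.

    The chance that a ray hitting the parent also hits a child is
    approximated by the ratio of their surface areas.
    """
    area = parent.surface_area()
    p_left = left.surface_area() / area
    p_right = right.surface_area() / area
    return c_traversal + c_intersection * (p_left * n_left + p_right * n_right)


def leaf_cost(n_primitives, c_intersection):
    """Estimated cost of keeping the node as a leaf (test every primitive)."""
    return c_intersection * n_primitives


# Unit cube with four primitives, split in the middle of the x axis.
parent = AABB((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
left = AABB((0.0, 0.0, 0.0), (0.5, 1.0, 1.0))
right = AABB((0.5, 0.0, 0.0), (1.0, 1.0, 1.0))
split = sah_split_cost(parent, left, right, 2, 2,
                       c_traversal=1.0, c_intersection=1.0)
keep = leaf_cost(4, c_intersection=1.0)
```

A builder would typically evaluate this cost for every candidate plane and subdivide only when the cheapest split costs less than the leaf; raising the traversal cost relative to the intersection cost makes splitting less attractive.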
Sousa, N., Sena, D., Papadopoulos, N. and Pereira, J.
Acceleration Data Structures for Ray Tracing on Mobile Devices.
DOI: 10.5220/0007575403320339
In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 332-339
ISBN: 978-989-758-354-4
Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
2.1.1 Bounding Volume Hierarchies
Bounding Volume Hierarchies are based on bounding volumes (Kay and Kajiya, 1986). A BVH is a tree in which the root consists of a bounding volume that encloses the whole scene. Each internal node is a bounding volume of a subset of objects of its parent node. The leaves contain the actual geometry to test against.
The BVH can be subdivided, for example, by using the median of the centroids of the enclosed objects or by using SAH.
In this work we chose to focus on the following traversal algorithms:
• Stack-less Parent-Link Traversal - a link-based algorithm that tries to provide the same traversal order as the stack-based algorithm while being stack-less (Hapala et al., 2011).
• Restart Trail Traversal - this algorithm adapts the KD-restart algorithm, originally used to traverse KD-trees, to Bounding Volume Hierarchies (BVHs) (Laine, 2010).
2.1.2 KD-Trees
First introduced as a method for searching points in a k-dimensional space (Bentley, 1975), KD-trees are a specific case of Binary Space Partitioning (BSP).
Just like with BVHs, spatial subdivision can be based on SAH. An optimised O(N log N) construction algorithm was introduced by (Wald and Havran, 2006), which uses an ordered event list with special list splitting rules.
Our work focused on Graphics Processing Unit (GPU) based algorithms (Hapala and Havran, 2011), more specifically:
• Kd-Push-Down Traversal - this algorithm expands Kd-Restart (Horn et al., 2007), which works by moving a point along the ray and finding the leaf where the point is located. By keeping the lowest depth-wise node that contains the interval of intersection in its entirety, this node can then be used instead of the root node when restarting the search.
• Kd-Backtrack Traversal - this algorithm adds to each node the corresponding bounding box and a pointer to the parent node (Foley and Sugerman, 2005) to avoid restarting the search from the root node.
2.2 Mobile Environment
The system on chip architectures of mobile devices have restrictions on the amount of power they can draw and, due to the small form factor, the amount of heat they are able to dissipate. As a result, computational resources are more limited than on desktop computers.
3 IMPLEMENTATION
This research focuses on mobile environments running Android, and was conducted in partnership with Samsung UK. The following sections describe how the application was implemented, which ray tracing algorithms were used and which data structures were implemented. We also describe the different rendering approaches used.
3.1 Ray Tracing Implementation
We implemented Whitted ray tracing (Whitted, 1979) with a ray spawned for each pixel of the framebuffer and a subsequent ray spawned for each light visibility query when an intersection is found. We implemented the ray-triangle intersection algorithm by (Möller and Trumbore, 2005) and for Axis Aligned Bounding Box (AABB) ray intersections we used the ray-box intersection algorithm by (Williams et al., 2005).
We implemented different data packing arrangements of primitives using Shader Storage Buffer Objects (SSBOs). Our initial approach was to store each triangle vertex and each normal as a vec3 with padding. In our second approach we used the padding of the three vertices to store the first normal and minimize the size per primitive. Our last approach was based on the fact that normals are not used during intersection testing. We split the vertices and normals into separate SSBOs to reduce redundant memory accesses.
3.2 Implementation of Acceleration Data Structures
For the GPU rendering methods we chose to implement only KD-Trees and BVHs and not Regular Grids, because the latter are consistently outperformed by BVHs and KD-Trees apart from very specific situations (Thrane and Simonsen, 2005).
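Both structures rely on ray-AABB tests while traversing nodes that store a bounding box. As an illustration, the following is a simplified slab test; it is a sketch and not the exact variant of (Williams et al., 2005), which precomputes the sign of the inverse direction. The function name `ray_aabb_intersect` and its parameters are hypothetical.

```python
import math


def ray_aabb_intersect(origin, direction, box_min, box_max,
                       t_min=0.0, t_max=math.inf):
    """Return the (t_enter, t_exit) interval of the ray inside the box,
    or None when the ray misses it."""
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray parallel to this pair of planes: it can only hit the
            # box if the origin already lies between them.
            if o < lo or o > hi:
                return None
            continue
        inv_d = 1.0 / d
        t0 = (lo - o) * inv_d
        t1 = (hi - o) * inv_d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval to the overlap of all slabs seen so far.
        t_min = max(t_min, t0)
        t_max = min(t_max, t1)
        if t_min > t_max:
            return None
    return t_min, t_max


hit = ray_aabb_intersect((-1.0, 0.5, 0.5), (1.0, 0.0, 0.0),
                         (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
miss = ray_aabb_intersect((-1.0, 2.0, 0.5), (1.0, 0.0, 0.0),
                          (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
```

The returned interval is what restart-style traversals can use to advance along the ray from one node to the next.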
GRAPP 2019 - 14th International Conference on Computer Graphics Theory and Applications
3.2.1 KD-Tree Implementation
The construction of KD-Trees in our implementation is done using the SAH algorithm (Wald and Havran, 2006). Our implementation allows for the creation of empty leaf nodes but does not perform triangle clipping. By experimenting with different values for C_traversal and C_intersection, we concluded that mobile architectures tend to favour wider and shallower trees. We found a value of 3.0 for C_traversal and 1.5 for C_intersection to yield good results.
The memory layout for KD-Tree nodes varies according to which traversal algorithm is being used.
Figure 1: Layouts of KD-Tree nodes: (a) KD-Pushdown node layout, (b) KD-Backtrack node layout. vMin and vMax represent the node bounding box.
Another difference between trees for the two traversal methods is that, while building the tree for the KD-Backtrack traversal method, we do not allow for perfectly flat nodes, i.e. nodes that have zero length on one of the axes. This is done to avoid precision related issues while traversing the tree. We implemented the KD-Pushdown and KD-Backtrack algorithms using the node layouts shown in Figure 1.
3.2.2 BVH Implementation
In our implementation, the construction of BVHs is done using an altered version of the construction algorithm (Wald and Havran, 2006) that was also used in the KD-Tree construction. As with KD-Trees, after experimenting with several values, we came to the conclusion that wider, shallower trees tend to perform best. As such, the values chosen for C_traversal and C_intersection were, again, 3.0 and 1.5 respectively.
The memory layout for BVH nodes also varies according to which traversal method is being used. The possible layouts are shown in Figure 2.
Figure 2: Layout of BVH nodes: (a) BVH Trail traversal node layout, (b) BVH Parent traversal node layout. vMin and vMax represent the node bounding box.
For GPU traversal we implemented Trail traversal along with the Parent-Link traversal algorithm.
3.3 GPU Rendering Methods
Our implementation used multiple rendering approaches. Regardless of the rendering method chosen, our implementation starts by constructing the selected acceleration structure along with the auxiliary structures for primitive storage. These structures are then copied to GPU memory as Shader Storage Buffer Objects (SSBOs). The application also creates and uploads a Vertex Array Object (VAO) containing a full-screen quad that is then used for every rendering method. The drawing process, however, changes according to which rendering approach is selected:
• Fragment Shaders - the application renders a full-screen quad using a very simple vertex shader. The fragment shader is then responsible for ray tracing the corresponding pixel. In this case, all the code for ray tracing and structure traversal is contained in the fragment shader.
• Compute Shaders - the application performs a two-step process. In the first step, the application dispatches the necessary compute workgroups so that each thread processes a pixel of the final image. The result of this first step is stored in an Image Buffer which is then utilized in the second step as an input texture. The second pass simply draws a full-screen quad, using the texture generated in the first step.
• Hybrid Shading - the application not only creates a VAO containing the full-screen quad, but also a second VAO containing the entire geometry for the scene being rendered. This second VAO is used in the first phase of the rendering process, where the application issues a draw call that rasterizes all primitives. This first phase stores the calculated normals into a color attachment. From this first step a depth buffer is also generated. These two buffers are then used in the second phase of the drawing process, where the full-screen quad is rendered. The values in the buffers are used to create and cast the shadow ray, which then triggers a structure traversal. For this rendering method, all the ray tracing logic is in the fragment shader of the second pass.
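For the Compute Shaders approach, the number of workgroups to dispatch follows from the image size and the workgroup's local size. The sketch below shows that arithmetic on the host side; the 8x8 local size and the function names are assumptions made for illustration, not values taken from our implementation.

```python
def workgroup_counts(width, height, local_size_x=8, local_size_y=8):
    """Number of workgroups per axis so that every pixel gets one thread.

    Rounds up, so images whose size is not a multiple of the local size
    are still fully covered. The 8x8 default is a hypothetical choice.
    """
    groups_x = (width + local_size_x - 1) // local_size_x
    groups_y = (height + local_size_y - 1) // local_size_y
    return groups_x, groups_y


def pixel_for_invocation(group_id, local_id, local_size=(8, 8)):
    """Pixel handled by one thread: the same value a shader would obtain
    from its global invocation index."""
    return (group_id[0] * local_size[0] + local_id[0],
            group_id[1] * local_size[1] + local_id[1])


def in_bounds(pixel, width, height):
    """Threads in the rounded-up border groups must skip pixels outside
    the image."""
    return pixel[0] < width and pixel[1] < height


groups = workgroup_counts(1024, 1024)
last_pixel = pixel_for_invocation((127, 127), (7, 7))
```

With rounding up, threads in the last row or column of groups may fall outside the image, which is why the bounds check is needed before writing to the Image Buffer.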
Figure 3: Graph comparing the performance of different rendering methods.
4 EVALUATION
To evaluate the performance of the different algorithms, we profiled our implementation by collecting metrics in the application and by using external instrumentation tools.
The application was developed using OpenGL ES and tested on a Samsung Galaxy S8 (Model SM-G950U), provided by Samsung Research, UK, with a 64-bit Qualcomm Snapdragon 835 system-on-chip and a Qualcomm Adreno 540 GPU.
4.1 Application Metrics
At the rendering stage, several measurements are collected in order to evaluate overall performance:
• Framerate - how many times the image is updated per second. Frametime expressed in milliseconds, while more accurate, is harder to measure on mobile due to hardware and architectural restrictions.
• Rays per Second - expressed in millions of rays per second. Rays-per-Second (RPS) more accurately describes the raw ray tracing power of the underlying hardware.
• Structure Size - expressed in KBytes. Represents the overall size of the generated acceleration structure.
The application can also display a heatmap representing the number of node traversals for each pixel. To keep results consistent, we consider that a node is traversed when it is fetched from the acceleration structure SSBO.
4.2 External Tools
The Qualcomm Snapdragon Profiler allows developers to profile devices with Snapdragon processors. The profiler provides several metrics of interest to our work:
• SP Memory Read - number of bytes read from memory by the Shader Processors per second.
• % Shader ALU Capacity Utilized - % of maximum shader ALU capacity that is being utilized.
• % Time ALUs Working - % of time the ALUs are working while the shaders are busy.
• ALU/Fragment - average number of ALU instructions performed per fragment.
4.3 Test Scenes and Test Methodology
For our tests we used a number of different scenes such as the Cornell Box, the Cornell Buddha, the Fairy Forest and the Crytek Sponza.
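As an illustration of how the Framerate and Rays per Second metrics of Section 4.1 relate, the following sketch converts a framerate into a frametime and into millions of rays per second for a Whitted-style renderer with one primary ray per pixel and one shadow ray per light for every primary hit. The function names are hypothetical, and the framerate, hit fraction and light count used in the example are placeholder values, not measurements.

```python
def frame_time_ms(fps):
    """Frametime in milliseconds for a given framerate."""
    return 1000.0 / fps


def rays_per_frame(width, height, hit_fraction, lights=1):
    """Primary rays plus shadow rays cast in one frame.

    `hit_fraction` is the share of primary rays that hit geometry and
    therefore spawn one shadow ray per light (an assumed input here).
    """
    primary = width * height
    shadow = int(primary * hit_fraction) * lights
    return primary + shadow


def mrays_per_second(rays_in_frame, fps):
    """Throughput in millions of rays per second."""
    return rays_in_frame * fps / 1e6


# Placeholder inputs: 1024x1024 image, half the pixels hit geometry,
# one light, 10 frames per second.
rays = rays_per_frame(1024, 1024, hit_fraction=0.5, lights=1)
throughput = mrays_per_second(rays, fps=10.0)
```

Unlike the framerate, this figure accounts for how many rays each frame actually required, which makes runs on different scenes easier to compare.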
The application always renders the resulting image at 1024x1024 resolution. We also restart the application between every test.
5 RENDERING TECHNIQUES COMPARISON
We implemented three different rendering approaches, and for each approach we profiled the performance of different traversal methods across different scenes. While Fragment and Compute rendering had similar performance, as shown in Figure 3, the hybrid rendering approach distinguished itself by having better performance. This is because the hybrid approach resolves primary visibility through rasterization, which has a lower computational cost than ray tracing.
Overall, our results show that using hybrid rendering is the best approach when implementing ray tracing on mobile. However, different ray tracing algorithms may benefit from using Fragment and Compute Shader based rendering.
6 PRIMITIVE LAYOUT COMPARISON
As shown in Figure 4, we ran a series of tests to analyse the performance of the different primitive layouts. Results show that a compact layout does not always equate to better performance, as it needs a few extra instructions to retrieve the stored normal. The split layout, on the other hand, consistently yields either similar or better results.
Figure 4: Graph showing performance for different primitive layouts.
To better understand the impact of these changes when accessing memory, we performed a bandwidth analysis using the Cornell Box and Buddha scenes. The results are shown in Figures 5 and 6.
The results show that there is a reduction in memory bandwidth from using the compact and split layouts, with the split layout producing the best results. Having vertices and normals separated means that no bandwidth is wasted on normals that are not being used. This maximizes the number of vertices fetched each time, which means the number of overall fetches is reduced.
Scenes with higher complexity require more accesses to the data structures containing the geometry and thus optimization of the primitive layout becomes more important because of its impact on memory bandwidth utilization.
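The difference between the three primitive layouts can be illustrated by counting bytes per triangle. The sketch below assumes that every padded vec3 occupies 16 bytes (four 32-bit floats) and that the compact layout keeps the two remaining normals as padded vec3 values; both are assumptions about the exact packing, and the names are hypothetical.

```python
import struct

VEC4_BYTES = struct.calcsize("<4f")  # a vec3 padded to four floats

# Bytes of one triangle record that sit in the buffer read during
# intersection testing (assumed layouts, see lead-in).
INTERSECTION_BYTES = {
    "normal": 6 * VEC4_BYTES,   # 3 padded vertices + 3 padded normals
    "compact": 5 * VEC4_BYTES,  # 3 vertices carrying normal 0 + 2 normals
    "split": 3 * VEC4_BYTES,    # vertices only; normals live elsewhere
}


def pack_compact(vertices, normals):
    """Pack one triangle in the compact layout: the fourth lane of each
    vertex holds one component of the first normal."""
    n0, n1, n2 = normals
    data = b""
    for vertex, lane in zip(vertices, n0):
        data += struct.pack("<4f", *vertex, lane)
    for normal in (n1, n2):
        data += struct.pack("<4f", *normal, 0.0)
    return data


def unpack_first_normal(data):
    """Recover the first normal from the padding lanes; this is the extra
    work the compact layout needs at shading time."""
    return tuple(struct.unpack_from("<4f", data, VEC4_BYTES * i)[3]
                 for i in range(3))


triangle = pack_compact(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0)] * 3,
)
```

Under these assumptions the split layout halves the data touched per intersection test compared with the normal layout, which is consistent with the bandwidth reduction reported above, while the compact layout saves one vec4 at the price of reassembling the first normal.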
Figure 5: Memory bandwidth usage when varying primitive layout for the Cornell Box Scene.
Figure 6: Memory bandwidth usage when varying primitive layout for the Buddha Scene.
7 ACCELERATION DATA STRUCTURE COMPARISON
Acceleration data structures are essential for the efficiency of ray tracing algorithms and as such have a direct impact on performance. In the following sections we present and discuss the results we obtained with regard to computation cost and memory utilization.
7.1 Performance
To evaluate the performance of the different data structures, we conducted a series of tests for the traversal methods we chose across different scenes. For these particular tests we only used the Fragment Shader renderer along with a normal primitive layout.
Figure 7 shows the results obtained from all the test runs. Whilst the performance of the KD-Pushdown traversal excels in scenes with a lower number of primitives, it quickly deteriorates in scenes with higher geometric complexity. In contrast, BVH Trail traversal performs better in more complex scenes.
Figure 7: Performance values for acceleration structures.
To understand the performance difference between traversal methods, we created heatmaps for each traversal method and each scene. Figure 8 shows a subset of the generated heatmaps.
Figure 8: Heatmap of the fairy scene for all structures: (a) KD-Backtrack, (b) KD-Pushdown, (c) BVH-Parent, (d) BVH-Trail.
We obtained the best results using BVH Trail. KD-Backtrack and BVH-Parent-Link followed, yielding similar results to each other. This verifies our previous observations and shows that the number of traversed nodes correlates with the performance of the algorithm. Due to the high cost of bandwidth, we also measured its impact. The results are shown in Figure 9.
Figure 9: SP Memory Read values for each structure. Values for the Cornell Box were not visible at this scale.
Algorithms based on KD-Trees consume higher memory bandwidth than those based on BVHs. One explanation is that the increased number of nodes generated by KD-Trees boosts the probability of executing a memory fetch for each new traversed node. This is because each local memory fetch request is less likely to contain the next node that needs to be traversed. In Figure 11 we also analyse the ALU instructions per fragment.
Figure 11: ALU instructions per fragment.
Results show that the KD-Backtrack algorithm requires the least overall amount of ALU instructions. This is due to the fact that it performs no near-far classification and, consequently, uses fewer instructions per node traversed than other traversal methods.
On the opposite end, the traversal method with the most ALU instructions per fragment is KD-Pushdown, due to the increased number of nodes and extra traversal steps it takes. This, combined with higher memory bandwidth utilization, results in the worst performance across all the traversal methods.
Results show that traversal methods with fewer ALU instructions per fragment have better performance. The exception is the KD-Backtrack traversal which, despite using fewer ALU instructions, has its traversal slowed down by its high memory bandwidth requirements.
According to these results, the performance of the traversal algorithms is limited by a combination of ALU, memory accesses and bandwidth. However, the strongest limitation appears to be the number of traversed nodes, i.e. the number of accesses to the SSBOs that contain the acceleration structure.
7.2 SAH Costs Comparison
One of the ways to optimize KD-Trees and BVHs built using the SAH is to tweak the values for traversal and intersection cost given to the construction function. Usually, having the intersection cost higher than the traversal cost yields better results. To test this we used four different cost values and a combination of all the scenes and traversal methods.
Figure 10: Performance comparison for different SAH cost values.
As shown in Figure 10, however, in our tests having an intersection cost lower than the traversal cost provides the best performance. The results also show that it is best to keep the traversal cost only slightly higher than the intersection cost. Reducing the cost of traversal equates to a
higher number of node traversals necessary per pixel. The performance numbers shown in Figure 10 corroborate the previous results showing that the number of traversals correlates with the performance of the traversal algorithm.
8 CONCLUSIONS
Our work focused on providing a performance analysis of different acceleration data structures for ray tracing on mobile devices. Our main goal was to establish a basis for future research into the potential of mobile environments. As for future work, we want to explore the performance of construction methods for dynamic scenes to provide further insight on current mobile hardware capabilities.
ACKNOWLEDGEMENTS
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2019.
REFERENCES
Appel, A. (1968). Some techniques for shading machine renderings of solids. In Proceedings of the AFIPS Conference, pages 37–45.
Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517.
Cook, R., Porter, T., and Carpenter, L. (1984). Distributed ray tracing. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 137–145.
Foley, T. and Sugerman, J. (2005). Kd-tree acceleration structures for a gpu raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 15–22.
Goldsmith, J. and Salmon, J. (1987). Automatic creation of object hierarchies for ray tracing. IEEE Comput. Graph. Appl., 7(5):14–20.
Hapala, M., Davidovič, T., Wald, I., Havran, V., and Slusallek, P. (2011). Efficient Stack-less BVH Traversal for Ray Tracing. In Proceedings of the 27th Spring Conference on Computer Graphics, pages 7–12.
Hapala, M. and Havran, V. (2011). Review: Kd-tree Traversal Algorithms for Ray Tracing. Computer Graphics Forum, 30(1):199–213.
Horn, D. R., Sugerman, J., Houston, M., and Hanrahan, P. (2007). Interactive kd tree gpu raytracing. In Proceedings of the 2007 symposium on Interactive 3D graphics and games, pages 167–174.
Kajiya, J. (1986). The rendering equation. In Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, pages 143–150.
Kay, T. L. and Kajiya, J. T. (1986). Ray tracing complex scenes. In ACM SIGGRAPH computer graphics, volume 20, pages 269–278.
Laine, S. (2010). Restart Trail For Stackless BVH Traversal. In Proceedings of the Conference on High Performance Graphics, pages 107–111.
Möller, T. and Trumbore, B. (2005). Fast, minimum storage ray/triangle intersection. In ACM SIGGRAPH 2005 Courses.
Thrane, N. and Simonsen, L. O. (2005). A comparison of acceleration structures for gpu assisted ray tracing. Master's thesis.
Wald, I. and Havran, V. (2006). On building fast kd-Trees for Ray Tracing, and on doing that in O(N log N). In Proceedings of the IEEE Symposium on Interactive Ray Tracing, pages 61–69.
Whitted, T. (1979). An improved illumination model for shaded display. In Proceedings of the 6th Annual Conference on Computer Graphics and Interactive Techniques.
Williams, A., Barrus, S., Morley, R. K., and Shirley, P. (2005). An efficient and robust ray-box intersection algorithm. In ACM SIGGRAPH 2005 Courses.