Appl. Sci. 2022, 12, 9599
Article
RT Engine: An Efficient Hardware Architecture for Ray Tracing
Run Yan 1 , Libo Huang 1, *, Hui Guo 1 , Yashuai Lü 2 , Ling Yang 1 , Nong Xiao 1 , Yongwen Wang 1 , Li Shen 1
and Mengqiao Lan 1
Abstract: Ray tracing is becoming increasingly prominent in computer vision and industrial
applications because of the realism of its rendering effects. However, designing efficient
ray tracing hardware is challenging due to memory access issues, divergent branches, and daunting
computation intensity. This article presents a novel architecture, a RT engine (Ray Tracing engine),
that accelerates ray tracing. First, we set up multiple stacks to store information for each ray so that
the RT engine can process many rays in parallel. The information in these stacks can
effectively improve the performance of the system. Second, we choose the three-phase break method
during the triangle intersection test, which can make the loop break earlier. Third, the reciprocal
unit adopts the approximation method, which combines Parabolic Synthesis and Second-Degree
interpolation. Combined with these strategies, we implement our system at RTL level with agile chip
development. Simulation and experimental results show that our architecture achieves a performance
per area that is 2.4× greater than the best reported results for ray tracing on dedicated hardware.
Keywords: machine vision; computer graphics; hardware architecture; rendering; ray tracing;
graphics accelerators
performance of ray tracing [8]. This test occupies nearly 70–80% of total calculation time.
Therefore, many studies focus on speeding up the ray traversal and intersection tests to
achieve higher performance.
In the past few decades, many researchers worldwide have carried out significant
research into ray tracing software and hardware acceleration. Various platforms are used
at the hardware acceleration level, including central processing units (CPUs), GPUs, and
dedicated hardware. Due to the hardware features, GPUs and dedicated hardware show
significant performance advantages. There is a lot of research on optimizing GPUs’ accel-
eration. Ray tracing was challenging to implement on GPUs because early GPUs did not
support general-purpose computation. Nathan A. Carr et al. [9] made the first attempt
to implement ray tracing on a GPU. However, they only implemented the intersection
test. This unit reconfigured the geometry engine into a ray engine that efficiently inter-
sects caches of rays for many host-based rendering tasks. Timo Aila et al. [10] studied the
mapping of elementary acceleration structure traversal and primitive intersection onto
wide Single Instruction Multiple Data (SIMD)/Single Instruction Multiple Threads (SIMT)
machines. Yashuai Lü et al. [11] proposed the Dynamic Ray Shuffling (DRS) architecture
for GPUs to address ray tracing control flow divergences. The critical insight was that
the primary control flow divergences caused by inconsistent ray traversal states of a warp
could be eliminated by DRS. Experimental results show that, for an estimated 0.11% area
cost, DRS significantly improves the SIMD efficiency for the tested benchmarks from 41.06%
to 81.04% on average. Lufei Liu et al. [12] explored integrating the ray prediction strat-
egy into existing GPU pipelines and improving the predictor effectiveness by predicting
nodes higher in the tree and regrouping and scheduling traversal operations in a low-cost,
reasonable manner. These studies show that GPU platform optimization concentrates on
the computational part of the algorithm. Architectural strategies that reduce branch
processing and redundant operations, and thereby exploit the hardware more fully,
have achieved good results in academic research. In addition,
commercial GPUs have also introduced accelerated architectures for ray tracing. In 2018,
NVIDIA launched the first ray tracing GPU, with the first-generation RT Core, in the Turing
architecture [13]. In 2020, the Ampere architecture with the second-generation RT Core
was introduced [14]. The RT Cores replace software emulation by performing the tree
traversal and the ray/box and ray/triangle intersection tests in hardware. A ray query is sent from the
streaming multiprocessor (SM) to the RT Core. The RT Core uses dedicated evaluators to
test each ray against the box or, at the leaves of the tree, the triangles that make up the
scene. It does this repeatedly, optionally keeping track of the closest intersection found.
When the appropriate intersection point is determined, the result is returned to the SM for
further processing. In 2021, Imagination added the Ray Acceleration Cluster (RAC) of the
PowerVR Photon architecture to the C-series GPU to provide ray tracing IP technology
for the mobile phone market [15]. RAC uses a highly parallel Dual Triangle Tester Unit
to improve the efficiency of hardware computing. Its primary function is to perform the
intersection tests of two triangles simultaneously and send the results to the
next stage. The GPUs in this series come in single-core and multi-core models for
mobile and beyond-mobile configurations.
In recent years, dedicated hardware has also been widely referenced in ray tracing
studies. SaarCOR [16] is a ray tracing pipeline consisting of a ray generation/shading unit,
a four-wide SIMD traversing unit, a list unit, a transformation, and an intersection test unit.
The T&I engine [17] is a hardware acceleration architecture of ray traversal and intersection
tests. This architecture adopts an ordered depth-first layout to reduce memory bandwidth.
It proposes a three-phase ray-triangle intersection test and a latency-hiding architecture
called the ray accumulation unit. SGRT [18] is a real-time mobile ray tracing GPU.
It mainly includes two key features: (1) an area-efficient parallel pipelined traversal unit;
(2) flexible and high-performance kernels for shading and ray generation. RayCore [19]
mainly includes ray-tracing units (RTUs) based on a unified traversal and intersection
pipeline and a tree-building unit (TBU) for dynamic scenes. HART [20] utilizes hetero-
geneous hardware resources: dedicated ray-tracing hardware for BVH update and ray
traversal and a CPU for BVH reconstruction. It also uses PrimAABB for traversal scheduling.
Lee et al. [21] optimized the SGRT [18] and proposed two-AABB traversal architecture
with two ray-AABB testing units. The experimental results showed that two-AABB was
up to 2.9 times faster than the single-pipeline architecture. Kopta et al. proposed STRaTA
(Streaming Treelet Ray Tracing Architecture) [22] to decrease energy consumption on mas-
sively parallel graphics processors. Viitanen et al. applied MBVH (Multi Bounding Volume
Hierarchy) [23] to a fixed-function ray tracing accelerator architecture. With primary rays,
energy efficiency improves by 15% and performance per area improves by 20%. Another
implementation approach uses multiple streams. Konstantin Shkurko et al. [24] proposed
dual streaming, a different approach to hardware-accelerated ray tracing that begins by
modifying the order of rendering operations. The dual streaming approach organizes the
memory accesses of ray tracing into two predictable data streams. E. Vasiou et al. [25]
introduced Mach-RT (Many Chip-Ray Tracing), a new hardware architecture for accelerating
ray tracing. Its primary approach combines a ray ordering scheme that minimizes access
to the scene data with a sizeable on-chip buffer, acting as near-compute storage, that is
spread over multiple chips.
From the perspective of existing academic research and commercial GPUs, the computing
capability of ray tracing accelerators has improved steadily with progress in algorithms
and in industrial technology. Because different designs adopt different algorithms and
serve different purposes, their performance and hardware resource overhead differ
considerably. However, existing designs remain limited in performance per area, so we
focus on a more efficient hardware architecture for ray tracing.
Our main contributions to the literature are as follows:
(1) Three optimization methods for ray tracing memory access, branches, and functional
units, respectively;
(2) A new hardware architecture based on an area-efficient parallel pipelined ray traversal
and intersection unit;
(3) An implementation of the whole system at the RTL level and an assessment of its
hardware resource overhead.
The optimization methods are multiple stacks, the three-phase break, and an approximation
method for the reciprocal unit. Multiple stacks store information during ray traversal to
ensure system parallelism and reduce memory access. The three-phase break makes the
intersection of rays and primitives more efficient and allows an earlier exit from the loop.
The approximation method can significantly reduce the hardware overhead and can converge
faster. This paper is organized as follows: Section 2 describes the overall architecture
design and the data flow. Section 3 discusses the optimization and design methods for the
RT engine. Section 4 evaluates and analyzes the proposed architecture and estimates its
hardware consumption. Section 5 concludes the paper.
use of the memory hierarchy. Memory access has become much more costly than arithmetic
computations, as Horowitz [27] noted.
Acceleration structure: Most research on acceleration structures focuses on the kd-tree
and the BVH. In the last two decades, the bounding volume hierarchy (BVH) has become the
de facto standard acceleration structure for ray-tracing-based rendering algorithms [7].
BVH traversal usually consumes less memory bandwidth and has a compact traversal state.
In industry, NVIDIA [14] and Imagination [15] GPUs also use the BVH as their acceleration
structure. We choose a binary BVH in our design because it effectively reduces hardware
overhead and design cost compared with other kinds.
Per-ray traversal: There are two ways to traverse rays through the tree. In packet
traversal, a group of rays follows the same path through the tree. This is achieved by
sharing one traversal stack among the rays, which means that some rays traverse nodes
that they do not intersect. In per-ray traversal, each ray traverses the tree
independently and visits a node's children only when the ray intersects that node.
Each ray then requires a separate traversal stack to store its data. Our design chooses
the second way [7].
Primitive type: We use only triangles as the primitive type. This choice improves
system performance and simplifies the design because it eliminates the branches needed
to handle different primitive types. Other primitives must therefore be converted into
triangles before rendering, as in rasterization-based GPU processing.
First-hit traversal: This approach finds the nearest object in the direction of a ray
from its origin. It is the most widely used form of traversal and is indispensable for
computing the radiance at a shading point. In binary BVHs, it can be implemented
efficiently by pushing the farther intersected node onto a stack and visiting the closer one first [7].
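The per-ray, first-hit traversal described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the RT engine's Algorithm 1: the `Node` layout, the slab test `hit_aabb`, and the `intersect_leaf` callback are hypothetical names introduced here, and each leaf is assumed to hold a single primitive.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    lo: Vec3                        # AABB minimum corner
    hi: Vec3                        # AABB maximum corner
    left: Optional["Node"] = None   # children are None for a leaf
    right: Optional["Node"] = None
    prim: Optional[int] = None      # ASSUMPTION: one primitive id per leaf

def hit_aabb(origin, inv_dir, lo, hi, t_max):
    """Slab test. Returns the entry distance, or None if the box is missed.
    Sketch only: an origin lying exactly on a slab plane of an axis with
    zero direction is not handled."""
    t_near, t_far = 0.0, t_max
    for o, inv, a, b in zip(origin, inv_dir, lo, hi):
        t0, t1 = (a - o) * inv, (b - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near

def first_hit(root, origin, direction, intersect_leaf, t_max=math.inf):
    """Return (primitive id, distance) of the nearest hit, or (None, t_max)."""
    inv_dir = tuple(1.0 / d if d != 0.0 else math.inf for d in direction)
    best_t, best_prim = t_max, None
    stack = []                      # private traversal stack of this ray
    node = root
    while node is not None:
        if node.prim is not None:   # leaf: run the primitive test
            t = intersect_leaf(node.prim, origin, direction, best_t)
            if t is not None and t < best_t:
                best_t, best_prim = t, node.prim
            node = None
        else:                       # inner node: test both child boxes
            hits = []
            for child in (node.left, node.right):
                t_entry = hit_aabb(origin, inv_dir, child.lo, child.hi, best_t)
                if t_entry is not None:
                    hits.append((t_entry, child))
            hits.sort(key=lambda h: h[0])
            if len(hits) == 2:
                stack.append(hits[1])        # push the farther child
            node = hits[0][1] if hits else None
        # Pop until a node that can still beat the current best hit is found.
        while node is None and stack:
            t_entry, candidate = stack.pop()
            if t_entry < best_t:
                node = candidate
    return best_prim, best_t
```

Storing the entry distance with each pushed node lets the loop discard stacked nodes that lie behind the closest hit found so far, which is what makes visiting the nearer child first worthwhile.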
The Stack Management unit is the key to processing multiple rays in parallel in the
system. It corresponds to Stack.pop and Stack.push in Algorithm 1. Its primary role is to
operate the stacks according to the results of the Ray Traversal unit and the Triangle
Intersection unit. To keep the stacks accurate and in order, we added a LUT (Look-Up
Table), which ensures that the data of each ray stays matched to its stack.
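One way to picture the bookkeeping between rays and stacks is a small pool of stacks indexed through a lookup table. The sketch below is purely illustrative: the text does not specify the LUT's internal organization, so the class `StackManager`, its methods, and the retire-on-empty policy are all assumptions made for this example.

```python
class StackManager:
    """Hypothetical pool of per-ray traversal stacks addressed through a LUT."""

    def __init__(self, num_stacks, depth):
        self.stacks = [[] for _ in range(num_stacks)]
        self.depth = depth                    # assumed fixed stack capacity
        self.free = list(range(num_stacks))   # indices of unused stacks
        self.lut = {}                         # ray id -> stack index

    def admit(self, ray_id):
        """Bind a new ray to a free stack; False if every stack is in use."""
        if not self.free:
            return False
        self.lut[ray_id] = self.free.pop()
        return True

    def push(self, ray_id, node):
        """Push a node onto the stack that belongs to this ray."""
        stack = self.stacks[self.lut[ray_id]]
        if len(stack) >= self.depth:
            raise OverflowError("traversal stack is full")
        stack.append(node)

    def pop(self, ray_id):
        """Pop the next node for this ray. An empty stack means the ray has
        finished, so its stack is released and None is returned."""
        index = self.lut[ray_id]
        stack = self.stacks[index]
        if stack:
            return stack.pop()
        del self.lut[ray_id]
        self.free.append(index)
        return None
```

Because every push and pop goes through the table, results that come back from the traversal and intersection stages in any order are still applied to the correct ray's stack.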
The Triangle Intersection (IST) unit tests the ray against the triangles of a leaf node,
which corresponds to Intersection in Algorithm 1. For the design of this part, we refer
to the Woop algorithm [31]. This algorithm has three logical parts: the ray-plane test,
the barycentric test, and the final hit-point calculation. IST1 performs the ray-plane
test, IST2 performs the barycentric test, and IST3 performs the final hit-point
calculation. The design of this unit uses two optimization methods to improve system
performance, which we describe in detail in the subsequent sections.
Table 1 illustrates how rays move from unit to unit. The numbers in Table 1 correspond
to the numbers in Figure 1.
Figure 3. Triangle intersection test. (a) Ray-plane test, (b) barycentric test and (c) final hit point
calculation.
We set up IST1 to perform the ray-plane test. The t value calculated by this unit is
checked first; the result depends on whether the ray can intersect the triangle at this
t. If the test fails, the unit breaks out and sends a pop-stack request to the LUT. If
the test passes, the t value is passed to the IST2 unit. The IST2 unit performs the
barycentric test: it computes u and then checks it. If the test fails, a pop-stack
request is issued to the LUT. If the test passes, the u value is passed to IST3. The
IST3 phase computes the v value, which determines whether the ray intersects the
triangle, and returns the values of t, u, and v. The earlier a miss is detected, the
sooner the stack operation reaches the LUT.
This design allows the triangle intersection test to produce its result earlier.
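The early-exit structure of the three phases can be illustrated in software. The sketch below is an assumption-laden example rather than the IST hardware: it uses a plain plane-equation and cross-product formulation instead of the Woop transformation, and the function name and return labels are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect_three_phase(origin, direction, v0, v1, v2, t_max=float("inf")):
    """Return ("hit", (t, u, v)) or ("ISTn break", None) naming the phase
    at which the test was abandoned."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    n = cross(e1, e2)                    # unnormalized plane normal

    # Phase 1 (ray-plane test): compute t and reject early.
    denom = dot(n, direction)
    if denom == 0.0:                     # parallel ray or degenerate triangle
        return ("IST1 break", None)
    t = dot(n, sub(v0, origin)) / denom
    if t <= 0.0 or t >= t_max:
        return ("IST1 break", None)

    # Phase 2 (barycentric test): compute u only for rays that passed phase 1.
    p = tuple(o + t * d for o, d in zip(origin, direction))
    w = sub(p, v0)                       # p = v0 + u*e1 + v*e2
    nn = dot(n, n)
    u = dot(cross(w, e2), n) / nn
    if u < 0.0 or u > 1.0:
        return ("IST2 break", None)

    # Phase 3 (final check): compute v only for rays that passed phase 2.
    v = dot(cross(e1, w), n) / nn
    if v < 0.0 or u + v > 1.0:
        return ("IST3 break", None)
    return ("hit", (t, u, v))
```

For example, with the triangle (0,0,0), (1,0,0), (0,1,0) and a ray from (0.25, 0.25, 1) along (0, 0, -1), all three phases pass and the function returns t = 1, u = 0.25, v = 0.25; a ray parallel to the triangle plane is rejected in the first phase without any barycentric arithmetic.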
The simulation results, listed in Table 2, show that this method can effectively improve
performance. In the table, "IST1 break" means that the ray failed the ray-plane test and
exited Intersection at IST1; the other break entries are defined in the same way. The
test scenes are shown in Figure 4.
Figure 4. Test scenes: Conference, Fairyforest, Sibenik and Sanmiguel with primary rays.
The schematic diagram of the entire process is shown in Figure 5. The reciprocal unit has
been tested for every possible input. The maximum error is 1.18358967 × 10^−7 (≈ 2^−23.01),
which is smaller than the machine epsilon (ε), the upper bound for the error, which is
commonly defined as 2^−23 (by ISO C, Matlab, etc.) for the single-precision floating-point format.
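The relation between the reported maximum error and the single-precision machine epsilon can be checked with a few lines of arithmetic. The snippet only re-derives the numbers quoted above; it does not model the reciprocal unit itself.

```python
import math

max_err = 1.18358967e-7          # maximum error reported for the reciprocal unit
eps_fp32 = 2.0 ** -23            # single-precision machine epsilon

log2_err = math.log2(max_err)    # exponent of the error as a power of two
below_eps = max_err < eps_fp32   # True when the error stays under epsilon

print(f"log2(max_err) = {log2_err:.2f}, below epsilon: {below_eps}")
```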
Reduced precision has been studied extensively in ray tracing. For example, Vaidyanathan
et al. [34] reduce the precision of BVH traversal to improve performance. Their method
requires only a few hardware resources and reduces the complexity of traversal
calculations while maintaining robust image quality. Our approach is different: it keeps
the accuracy of the BVH unchanged and instead reduces precision and hardware resources
in the reciprocal operation.
To evaluate the reciprocal unit used by the RT engine, we compare its area overhead with
that of the same functional unit in other academic designs. The comparison data are given
in Table 3, which lists the reciprocal-unit area reported by several academic ray tracing
studies. SGRT and HART use a 65 nm process at a clock rate of 500 MHz. For a better
comparison, we convert the areas approximately to NAND2 equivalents under the same
clock rate.
Table 4. Complexity of our design in terms of the number of floating-point units (RCP: reciprocal unit,
ADD: adder, MUL: multiplier, MAC: multiply-accumulate, CMP: comparator, DIV: divider, SQR:
square root).
Table 5 summarizes the area estimation of the RT engine. Stacks, FIFOs, and functional
units require hardware resources. We estimate that arithmetic units account for 55.9% of
the total area of this design. In total, the RT engine occupies an area of 0.48 mm² in a
28 nm process at an 850 MHz clock rate.
| Architecture | Clock Rate | Acceleration Structure | Performance (MRPS) | Area (mm²) | Process (nm) | Performance/Area (MRPS/mm²) |
|---|---|---|---|---|---|---|
| T&I engine, SIGGRAPH'11 [17] | 500 MHz | Kd-tree | 198 | 9.04 | 65 | 21.90 |
| SGRT, SIGGRAPH'13 [18] | 500 MHz | BVH | 184 | 7.2 | 65 | 25.56 |
| RayCore, TOG'14 [19] | 500 MHz | Kd-tree | 193 | 18 | 28 | 10.72 |
| Two-AABB, SIGGRAPH'14 [21] | 500 MHz | BVH | 297.6 | 6.82 | 28 | 43.63 |
| HART, TVCG'15 [20] | 500 MHz | BVH | 602 | 7.68 | 65 | 78.39 |
| STRaTA, CGF'15 [22] | 1 GHz | BVH | 365.6 | 57.1 | 65 | 6.40 |
| MBVH, SIGGRAPH'16 [23] | 500 MHz | BVH | 88 | 3.12 | 45 | 28.21 |
| Dual Streaming, HPG'17 [24] | 1 GHz | BVH | 345.6 | 57.1 | 65 | 6.05 |
| Mach-RT, TVCG'20 [25] | 2 GHz | BVH | 284.25 | 52 | 65 | 5.47 |
| RT engine | 850 MHz | BVH | 92.74 | 0.48 | 28 | 193.21 |
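The performance-per-area column and the headline ratio follow directly from the performance and area figures in the comparison table. The short script below recomputes them from the tabulated values; the dictionary layout and variable names are chosen here for illustration only.

```python
# (performance in MRPS, area in mm^2), copied from the comparison table
designs = {
    "T&I engine": (198.0, 9.04),
    "SGRT": (184.0, 7.2),
    "RayCore": (193.0, 18.0),
    "Two-AABB": (297.6, 6.82),
    "HART": (602.0, 7.68),
    "STRaTA": (365.6, 57.1),
    "MBVH": (88.0, 3.12),
    "Dual Streaming": (345.6, 57.1),
    "Mach-RT": (284.25, 52.0),
    "RT engine": (92.74, 0.48),
}

# Performance per area in MRPS/mm^2 for every design
perf_per_area = {name: mrps / area for name, (mrps, area) in designs.items()}

# Best previously reported design, excluding the RT engine itself
best_prior = max((n for n in perf_per_area if n != "RT engine"),
                 key=perf_per_area.get)
ratio = perf_per_area["RT engine"] / perf_per_area[best_prior]

print(f"RT engine: {perf_per_area['RT engine']:.2f} MRPS/mm^2")
print(f"best prior ({best_prior}): {perf_per_area[best_prior]:.2f} MRPS/mm^2")
print(f"ratio: {ratio:.2f}x")
```

The ratio against the best prior entry comes out between 2.4 and 2.5, consistent with the roughly 2.4× improvement stated in the text.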
5. Conclusions
In this article, we present the RT engine, an efficient hardware architecture for ray
tracing, developed by analyzing the relevant literature and algorithms and implemented
at the RTL level. We adopt three optimization strategies to improve system efficiency
with respect to memory access, branching, and computation intensity in ray tracing.
Multiple stacks are used to store ray traversal information. The three-phase break
method lets the intersection loop exit earlier. The approximation method for the
reciprocal unit achieves hardware optimization and performance improvement. Based on
these three optimization strategies, we use an agile chip development method to
implement the design at the RTL level, verify the functional correctness of the system,
and evaluate performance through simulation. We use a synthesis tool to assess
the chip area. The experimental results show that the performance/area (MRPS/mm²)
of our architecture is about 2.4× higher than the best reported results of other academic
research. These results indicate that our architecture can achieve efficient ray tracing.
Author Contributions: Writing—original draft, review and editing, R.Y.; Conceptualization, L.H.;
hardware, H.G.; software, Y.L.; investigation, L.Y. and N.X.; analysis, L.S. and Y.W.; validation, M.L.
All authors have read and agreed to the published version of the manuscript.
Funding: This research is supported by the National Natural Science Foundation of China (No.
61872374/62102433).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Acknowledgments: The authors would like to thank everyone who contributed to the realization of
this research, either academically or through financial support.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Catmull, E. A Subdivision Algorithm for Computer Display of Curved Surfaces; The University of Utah: Salt Lake City, UT, USA, 1974.
2. Whitted, T. An improved illumination model for shaded display. ACM Siggraph Comput. Graph. 1979, 13, 14. [CrossRef]
3. Schmid, J.; Uludag, Y.; Deligiannis, J. It just works: Raytraced reflections in “Battlefield V”. In Proceedings of the GPU Technology
Conference, San Francisco, CA, USA, 19–22 March 2019.
4. Christensen, P.; Fong, J.; Shade, J.; Wooten, W.; Schubert, B.; Kensler, A.; Friedman, S.; Kilpatrick, C.; Ramshaw, C.; Bannister, M.
RenderMan: An Advanced Path-Tracing Architecture for Movie Rendering. ACM Trans. Graph. 2018, 37, 1–21. [CrossRef]
5. Velho, L.; da Silva, V.; Novello, T. Immersive visualization of the classical non-Euclidean spaces using real-time ray tracing in VR.
In Proceedings of the Graphics Interface Conference 2020, Toronto, ON, Canada, 28–29 May 2020.
6. Cao, Y.; Zhang, X.; Duan, B.; Zhao, W.; Wang, H. An improved method to build the KD tree based on presorted results. In
Proceedings of the 11th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 16–18
October 2020; pp. 71–75.
7. Meister, D.; Ogaki, S.; Benthin, C.; Doyle, M.J.; Guthe, M.; Bittner, J. A Survey on Bounding Volume Hierarchies for Ray Tracing.
In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2021; Volume 40, pp. 683–712.
8. Deng, Y.; Ni, Y.; Li, Z.; Mu, S.; Zhang, W. Toward Real-Time Ray Tracing: A Survey on Hardware Acceleration and Microarchitec-
ture Techniques. ACM Comput. Surv. 2017, 50, 58.1–58.41. [CrossRef]
9. Carr, N.A.; Hall, J.D.; Hart, J.C. The Ray Engine. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on
Graphics Hardware, Saarbrücken, Germany, 2–3 September 2002; pp. 37–46.
10. Aila, T.; Laine, S. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the Conference on High Performance
Graphics, New Orleans, LA, USA, 1–3 August 2009; pp. 145–149.
11. Lü, Y.; Huang, L.; Shen, L.; Wang, Z. Unleashing the power of GPU for physically-based rendering via dynamic ray shuffling. In
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Boston, MA, USA, 14–17
October 2017; pp. 560–573.
12. Liu, L.; Chang, W.; Demoullin, F.; Chou, Y.H.; Saed, M.; Pankratz, D.; Nowicki, T.; Aamodt, T.M. Intersection Prediction
for Accelerated GPU Ray Tracing. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on
Microarchitecture, Virtual, 18–22 October 2021; pp. 709–723.
13. Burgess, J. RTX ON—The NVIDIA TURING GPU. In Proceedings of the IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA,
USA, 18–20 August 2019; pp. 1–27. [CrossRef]
14. Corporation, N. NVIDIA Ampere GA102 GPU Architecture. Available online: https://www.nvidia.com/content/PDF/nvidia-
ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf/ (accessed on 31 May 2022).
15. Beets, K. Introduction to the PowerVR Photon Architecture. Available online: https://www.imaginationtech.com/products/
gpu/graphics-architecture/powervr-photon/ (accessed on 31 May 2022).
16. Schmittler, J. SaarCOR: A Hardware Architecture for Real-Time Ray Tracing; The Eurographics Association: Geneve, Switzerland, 2007.
17. Nah, J.H.; Park, J.S.; Park, C.; Kim, J.W.; Jung, Y.H.; Park, W.C.; Han, T.D. T&I engine: Traversal and intersection engine for
hardware accelerated ray tracing. In Proceedings of the 2011 SIGGRAPH Asia Conference, Hong Kong, China, 12–15 December
2011; pp. 1–10.
18. Lee, W.J.; Shin, Y.; Lee, J.; Lee, S.; Ryu, S.; Kim, J. Real-time ray tracing on future mobile computing platform. In Proceedings of
the SIGGRAPH Asia 2013 Symposium on Mobile Graphics and Interactive Applications, Hong Kong, China, 19–22 November
2013; pp. 1–5.
19. Nah, J.H.; Kwon, H.J.; Kim, D.S.; Jeong, C.H.; Park, J.; Han, T.D.; Manocha, D.; Park, W.C. RayCore: A ray-tracing hardware
architecture for mobile devices. ACM Trans. Graph. (TOG) 2014, 33, 1–15. [CrossRef]
20. Nah, J.H.; Kim, J.W.; Park, J.; Lee, W.J.; Park, J.S.; Jung, S.Y.; Park, W.C.; Manocha, D.; Han, T.D. HART: A hybrid architecture for
ray tracing animated scenes. IEEE Trans. Vis. Comput. Graph. 2014, 21, 389–401. [CrossRef]
21. Lee, J.; Lee, W.J.; Shin, Y.; Hwang, S.; Ryu, S.; Kim, J. Two-AABB traversal for mobile real-time ray tracing. In Proceedings of the
SIGGRAPH Asia 2014 Mobile Graphics and Interactive Applications, Shenzhen, China, 3–6 December 2014; pp. 1–5.
22. Kopta, D.; Shkurko, K.; Spjut, J.; Brunvand, E.; Davis, A. Memory considerations for low energy ray tracing. In Computer Graphics
Forum; Wiley Online Library: Hoboken, NJ, USA, 2015; Volume 34, pp. 47–59.
23. Viitanen, T.; Koskela, M.; Jääskeläinen, P.; Takala, J. Multi bounding volume hierarchies for ray tracing pipelines. In Proceedings
of the SIGGRAPH ASIA 2016 Technical Briefs, Macao, China, 5–8 December 2016; pp. 1–4.
24. Shkurko, K.; Grant, T.; Kopta, D.; Mallett, I.; Yuksel, C.; Brunvand, E. Dual streaming for hardware-accelerated ray tracing. In
Proceedings of High Performance Graphics, Vancouver, BC, Canada, 28–30 July 2017; pp. 1–11.
25. Vasiou, E.; Shkurko, K.; Brunvand, E.; Yuksel, C. Mach-RT: A many chip architecture for High Performance ray tracing. IEEE
Trans. Vis. Comput. Graph. 2020, 28, 1585–1596. [CrossRef] [PubMed]
26. Hennessy, J.L.; Patterson, D.A. A new golden age for computer architecture. Commun. ACM 2019, 62, 48–60. [CrossRef]
27. Horowitz, M. 1.1 computing’s energy problem (and what we can do about it). In Proceedings of the International Solid-State
Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
28. Hapala, M.; Davidovič, T.; Wald, I.; Havran, V.; Slusallek, P. Efficient stack-less bvh traversal for ray tracing. In Proceedings of the
27th Spring Conference on Computer Graphics, Smolenice Castle, Slovakia, 27–29 April 2011; pp. 7–12.
29. Binder, N.; Keller, A. Efficient stackless hierarchy traversal on GPUs with backtracking in constant time. In Proceedings of the
High Performance Graphics, Dublin, Ireland, 20–22 June 2016; pp. 41–50.
30. Vaidyanathan, K.; Woop, S.; Benthin, C. Wide BVH traversal with a short stack. In Proceedings of the Conference on High-
Performance Graphics, Strasbourg, France, 8–10 July 2019; pp. 15–19.
31. Woop, S. A Ray Tracing Hardware Architecture for Dynamic Scenes; Fachrichtung 6.2-Informatik Computer Graphik, Saarland
University: Saarbrücken, Germany, 2004.
32. Hertz, E.; Svensson, B.; Nilsson, P. Combining the parabolic synthesis methodology with second-degree interpolation. Microprocess.
Microsystems 2016, 42, 142–155. [CrossRef]
33. Hertz, E. Methodologies for Approximation of Unary Functions and Their Implementation in Hardware. Ph.D. Thesis, Halmstad
University Press, Halmstad, Sweden, 2016.
34. Vaidyanathan, K.; Akenine-Möller, T.; Salvi, M. Watertight ray traversal with reduced precision. In Proceedings of the High
Performance Graphics, Dublin, Ireland, 20–22 June 2016; pp. 33–40.
35. Bachrach, J.; Vo, H.; Richards, B.; Lee, Y.; Waterman, A.; Avižienis, R.; Wawrzynek, J.; Asanović, K. Chisel: Constructing hardware
in a scala embedded language. In Proceedings of the DAC Design Automation Conference 2012, San Francisco, CA, USA, 3–7
June 2012; pp. 1212–1221.
36. Wald, I. Fast construction of SAH BVHs on the Intel many integrated core (MIC) architecture. IEEE Trans. Vis. Comput. Graph.
2010, 18, 47–57. [CrossRef] [PubMed]
37. Kopta, D.; Spjut, J.; Brunvand, E.; Davis, A. Efficient MIMD architectures for high-performance ray tracing. In Proceedings of the
International Conference on Computer Design, Amsterdam, The Netherlands, 3–6 October 2010.