The Sphynx project was an exploratory study to discover what might be done to reduce the heavy replication of instructions in independent instruction caches for a massively parallel machine where a single program is executing across all of the cores. While a machine with only a modest number of cores (fewer than 50) might not have any issues replicating the instructions for each core, as we approach the era where thousands of cores can be placed on one chip, the overhead of instruction replication may become unacceptably large. We believe that a large amount of sharing should be possible when the machine is configured for all of the threads to issue from the same set of instructions. We propose a technique that allows sharing an instruction cache among a number of independent processor cores to allow for inter-thread sharing and reuse of instruction memory. While we do not have test cases to demonstrate the potential magnitude of performance gains that could be achieved, sharing has the potential to reduce the die area required for instruction storage on chip.
TRaX (Threaded Ray eXecution) is a highly parallel multi-threaded, multicore processor architecture designed for real-time ray tracing. The TRaX architecture consists of a set of thread processors that include commonly used functional units for each thread and that share larger functional units through a programmable interconnect. The memory system takes advantage of the application’s read-only access to the scene database and write-only access to the frame buffer output to provide efficient data delivery with a relatively simple memory system. One specific motivation behind TRaX is to accelerate single-ray performance instead of relying on ray-packets in SIMD mode to boost throughput, which can fail as packets become incoherent with respect to the objects in the scene database. In this paper we describe the TRaX architecture and our performance results compared to other architectures used for ray tracing. Simulated results indicate that a multicore version of the TRaX architecture ...
This dissertation presents computer architecture designs that are efficient for ray-tracing-based rendering algorithms. The primary observation is that ray tracing maps better to independent thread issue hardware designs than it does to the dependent thread and data designs used in most commercial architectures. While the independent thread issue causes extra overhead in the fetch and issue parts of the pipeline, the number of computation resources required can be reduced through the sharing of less frequently used execution units. Furthermore, since all the threads run a single program on multiple data (SPMD), thread processors can share instruction and data caches. Ray tracing needs read-only access to the scene data during each frame, so caches can be optimized for reading, and traditional cache coherence protocols are unnecessary for maintaining coherent memory access. The resultant image exists as a write-only frame buffer, allowing memory writes to avoid the cache entirely, preven...
Lower computer system input-to-output latency substantially reduces many task completion times. In fact, literature shows that reduction in targeting task completion time from decreased latency often exceeds the decrease in latency alone. However, for aiming in first person shooter (FPS) games, some prior work has demonstrated diminishing returns below 40 ms of local input-to-output computer system latency. In this paper, we review this prior art and provide an additional case study with data demonstrating the importance of local system latency improvement, even at latency values below 20 ms. Though other factors may determine victory in a particular esports challenge, ensuring balanced local computer latency among competitors is essential to fair competition.
Proceedings of the ACM on Computer Graphics and Interactive Techniques
End-to-end latency in remote-rendering systems can reduce user task performance. This notably includes aiming tasks on game streaming services, which are presently below the standards of competitive first-person desktop gaming. We evaluate the latency-induced penalty on task completion time in a controlled environment and show that it can be significantly mitigated by adopting and modifying image and simulation-warping techniques from virtual reality, eliminating up to 80% of the penalty from 80 ms of added latency. This has potential to enable remote rendering for esports and increase the effectiveness of remote-rendered content creation and robotic teleoperation. We provide full experimental methodology, analysis, implementation details, and source code.
Special Interest Group on Computer Graphics and Interactive Techniques Conference
Esports is a growing worldwide phenomenon now rivaling traditional sports, with a deep dependence on real-time graphics technology. Despite this, the SIGGRAPH research community has largely ignored it. This panel brings together esports experts in engineering, medicine as well as cognitive and data science to argue that this must change. Like film, esports is an important problem for computer graphics, and could give rise to technologies and techniques benefitting not only esports, but society more broadly. With a series of moderated and audience questions, this panel will sketch the research challenges and potential benefits of esports, while also considering its risks. CCS CONCEPTS • Applied computing → Computer games; • Human-centered computing → User studies; • Computing methodologies → Graphics systems and interfaces.
Special Interest Group on Computer Graphics and Interactive Techniques Conference Emerging Technologies
Figure 1: (left) In the cloud gaming paradigm, network latency is added to the game client, resulting in worse aiming performance from players. Late warp, a technique used to prevent simulator sickness in VR, can be applied to first person shooter (FPS) games to mitigate this latency penalty. Using a web-based FPS game (middle), players can test their skill against latency, as well as with late warp correction to see how much late warp helps, even when a naive implementation adds significant guard band artifacts (right). SIGGRAPH virtual conference attendees can run the web app for themselves at home.
Proceedings of the 5th High-Performance Graphics Conference (HPG '13), 2013
We propose two hardware mechanisms to decrease energy consumption on massively parallel graphics processors for ray tracing while keeping performance high. First, we use a streaming data model and configure part of the L2 cache into a ray stream memory to enable efficient data processing through ray reordering. This increases the L1 hit rate and reduces off-chip memory accesses substantially. Second, we employ reconfigurable special-purpose pipelines that are constructed dynamically under program control. These pipelines use shared execution units (XUs) that can be configured to support the common compute kernels that are the foundation of the ray tracing algorithm, such as acceleration structure traversal and triangle intersection. This reduces the overhead incurred by memory and register accesses. These two synergistic features yield a ray tracing architecture that significantly reduces both power consumption and off-chip memory traffic when compared to a more traditional cache-only approach.
Future large-scale multi-cores will likely be best suited for use within high-performance computing (HPC) domains. A large fraction of HPC workloads employ the message-passing interface (MPI), yet multi-cores continue to be optimized for shared-memory workloads. In this position paper, we put forth the design of a unique chip that is optimized for MPI workloads. It introduces
Papers by Josef Spjut