Josef Spjut

NVIDIA, Research, Research Scientist

Harvey Mudd College, Engineering, Faculty Member

Address: United States

Papers by Josef Spjut

Sphynx: A Shared Instruction Cache Exploratory Study

The Sphynx project was an exploratory study to discover what might be done to improve the heavy replication of instructions in independent instruction caches for a massively parallel machine where a single program is executing across all of the cores. While a machine with only many cores (fewer than 50) might not have any issues replicating the instructions for each core, as we approach the era where thousands of cores can be placed on one chip, the overhead of instruction replication may become unacceptably large. We believe that a large amount of sharing should be possible when the machine is configured for all of the threads to issue from the same set of instructions. We propose a technique that allows sharing an instruction cache among a number of independent processor cores to allow for inter-thread sharing and reuse of instruction memory. While we do not have test cases to demonstrate the potential magnitude of performance gains that could be achieved, the potential for sharing reduces the die area required for instruction storage on chip.

TRaX: A Multicore Hardware Architecture for Real-Time Ray

TRaX (Threaded Ray eXecution) is a highly parallel multi-threaded, multicore processor architecture designed for real-time ray tracing. The TRaX architecture consists of a set of thread processors that include commonly used functional units for each thread and that share larger functional units through a programmable interconnect. The memory system takes advantage of the application’s read-only access to the scene database and write-only access to the frame buffer output to provide efficient data delivery with a relatively simple memory system. One specific motivation behind TRaX is to accelerate single-ray performance instead of relying on ray-packets in SIMD mode to boost throughput, which can fail as packets become incoherent with respect to the objects in the scene database. In this paper we describe the TRaX architecture and our performance results compared to other architectures used for ray tracing. Simulated results indicate that a multicore version of the TRaX architecture ...

Efficient ray tracing architectures

This dissertation presents computer architecture designs that are efficient for ray tracing based rendering algorithms. The primary observation is that ray tracing maps better to independent thread issue hardware designs than it does to dependent thread and data designs used in most commercial architectures. While the independent thread issue causes extra overhead in the fetch and issue parts of the pipeline, the number of computation resources required can be reduced through the sharing of less frequently used execution units. Furthermore, since all the threads run a single program on multiple data (SPMD), thread processors can share instruction and data caches. Ray tracing needs read-only access to the scene data during each frame, so caches can be optimized for reading, and traditional cache coherence protocols are unnecessary for maintaining coherent memory access. The resultant image exists as a write only frame buffer, allowing memory writes to avoid the cache entirely, preven...

A Case Study of First Person Aiming at Low Latency for Esports

Lower computer system input-to-output latency substantially reduces many task completion times. In fact, literature shows that reduction in targeting task completion time from decreased latency often exceeds the decrease in latency alone. However, for aiming in first person shooter (FPS) games, some prior work has demonstrated diminishing returns below 40 ms of local input-to-output computer system latency. In this paper, we review this prior art and provide an additional case study with data demonstrating the importance of local system latency improvement, even at latency values below 20 ms. Though other factors may determine victory in a particular esports challenge, ensuring balanced local computer latency among competitors is essential to fair competition.

Latency of 30 ms Benefits First Person Targeting Tasks More Than Refresh Rate Above 60 Hz

SIGGRAPH Asia 2019 Technical Briefs (SA '19)

Post-Render Warp with Late Input Sampling Improves Aiming Under High Latency Conditions

Proceedings of the ACM on Computer Graphics and Interactive Techniques

End-to-end latency in remote-rendering systems can reduce user task performance. This notably includes aiming tasks on game streaming services, which are presently below the standards of competitive first-person desktop gaming. We evaluate the latency-induced penalty on task completion time in a controlled environment and show that it can be significantly mitigated by adopting and modifying image and simulation-warping techniques from virtual reality, eliminating up to 80% of the penalty from 80 ms of added latency. This has potential to enable remote rendering for esports and increase the effectiveness of remote-rendered content creation and robotic teleoperation. We provide full experimental methodology, analysis, implementation details, and source code.

Steerable application-adaptive near eye displays

ACM SIGGRAPH 2018 Emerging Technologies

A variable shape and variable stiffness controller for haptic virtual interactions

2018 IEEE International Conference on Soft Robotics (RoboSoft)

Esports as a Driving Problem in Computer Graphics

Special Interest Group on Computer Graphics and Interactive Techniques Conference

Esports is a growing worldwide phenomenon now rivaling traditional sports, with a deep dependence on real-time graphics technology. Despite this, the SIGGRAPH research community has largely ignored it. This panel brings together esports experts in engineering, medicine as well as cognitive and data science to argue that this must change. Like film, esports is an important problem for computer graphics, and could give rise to technologies and techniques benefitting not only esports, but society more broadly. With a series of moderated and audience questions, this panel will sketch the research challenges and potential benefits of esports, while also considering its risks. CCS Concepts: Applied computing → Computer games; Human-centered computing → User studies; Computing methodologies → Graphics systems and interfaces.

Gaming at Warp Speed: Improving Aiming with Late Warp

Special Interest Group on Computer Graphics and Interactive Techniques Conference Emerging Technologies

Figure 1: (left) In the cloud gaming paradigm, network latency is added to the game client, resulting in worse aiming performance from players. Late warp, a technique used to prevent simulator sickness in VR, can be applied to first person shooter (FPS) games to mitigate this latency penalty. Using a web-based FPS game (middle), players can test their skill against latency, as well as with late warp correction to see how much late warp helps, even when a naive implementation adds significant guard band artifacts (right). SIGGRAPH virtual conference attendees can run the web app for themselves at home.

Esports Arms Race: Latency and Refresh Rate for Competitive Gaming Tasks

Journal of Vision

A Deformable Interface for Human Touch Recognition Using Stretchable Carbon Nanotube Dielectric Elastomer Sensors and Deep Neural Networks

Soft Robotics

Foveated AR

ACM Transactions on Graphics

Fluidic Elastomer Actuators for Haptic Interactions in Virtual Reality

IEEE Robotics and Automation Letters

An Approach to Data Prefetching Using 2-Dimensional Selection Criteria

8-3: Hybrid Modulation for Near Zero Display Latency

SID Symposium Digest of Technical Papers, 2016

Build your own game controller

SIGGRAPH 2015: Studio (SIGGRAPH '15), 2015

Memory Considerations for Low Energy Ray Tracing

Computer Graphics Forum, 2014

An energy and bandwidth efficient ray tracing architecture

Proceedings of the 5th High-Performance Graphics Conference (HPG '13), 2013

We propose two hardware mechanisms to decrease energy consumption on massively parallel graphics processors for ray tracing while keeping performance high. First, we use a streaming data model and configure part of the L2 cache into a ray stream memory to enable efficient data processing through ray reordering. This increases the L1 hit rate and reduces off-chip memory accesses substantially. Second, we employ reconfigurable special-purpose pipelines that are constructed dynamically under program control. These pipelines use shared execution units (XUs) that can be configured to support the common compute kernels that are the foundation of the ray tracing algorithm, such as acceleration structure traversal and triangle intersection. This reduces the overhead incurred by memory and register accesses. These two synergistic features yield a ray tracing architecture that significantly reduces both power consumption and off-chip memory traffic when compared to a more traditional cache-only approach.

Optimizing a Multi-Core Processor for Message-Passing Workloads

Future large-scale multi-cores will likely be best suited for use within high-performance computing (HPC) domains. A large fraction of HPC workloads employ the message-passing interface (MPI), yet multi-cores continue to be optimized for shared-memory workloads. In this position paper, we put forth the design of a unique chip that is optimized for MPI workloads. It introduces
