TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing
By Yavuz Selim Tozlu, Anshul Naithani, Huiyang Zhou
North Carolina State University, Raleigh, USA

Abstract
Ray tracing (RT) is a 3D graphics technique that offers highly realistic visuals. It is becoming prominent and accessible as GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays through 3D scenes consisting of high numbers of triangles in real time is challenging and requires expensive hardware. The main bottleneck in RT workloads is the expensive Bounding Volume Hierarchy (BVH) traversal task, which is a large tree structure that encodes the 3D scene. BVH traversal is a memory-bound problem, as the GPU threads spend most of their time reading tree node data from memory.
In this work, we attack the memory latency bottleneck of ray tracing through prefetching. We propose a novel hardware prefetcher, named Tree Traversal Prefetcher (TTP), for ray tracing. The main idea is to leverage the existing tree traversal stack in the RT units for highly accurate prefetching. In particular, TTPprefetches nodes using the addresses already available on the hardware traversal stacks of each thread. For DFS (Depth-first search) based traversal, prefetches are generated when nodes are being popped consecutively from the traversal stack, potentially corresponding to upward traversal through the tree.
We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves 1.48x speedup on average (up to 1.89x) compared to the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, which is the ratio of the prefetched blocks being actually referenced by demand loads. The coverage, computed as the ratio of L1 miss reduction over baseline L1 misses, is 31.54%, correlating well with the achieved speedup.
Index Terms — GPU, Ray Tracing, Prefetching.
To read the full article, click here
Related Semiconductor IP
Related Articles
- ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design
- Integrating VESA DSC and MIPI DSI in a System-on-Chip (SoC): Addressing Design Challenges and Leveraging Arasan IP Portfolio
- It's Just a Jump to the Left, Right? Shift Left in IC Design Enablement
- Design and implementation of a hardened cryptographic coprocessor for a RISC-V 128-bit core
Latest Articles
- TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing
- Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA
- A Reconfigurable Multiplier Architecture for Error-Resilient Applications in RISC-V Core
- ObfAx: Obfuscation and IP Piracy Detection in Approximate Circuits
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges