A 16 nm 1.60TOPS/W High Utilization DNN Accelerator with 3D Spatial Data Reuse and Efficient Shared Memory Access
By Xiaoling Yi, Ryan Antonio, Yunhao Deng, Fanchen Kong, Joren Dumoulin, Jun Yin, Marian Verhelst
MICAS-ESAT, KU Leuven, Leuven, Belgium

Abstract
Achieving high compute utilization across a wide range of AI workloads is crucial for the efficiency of versatile DNN accelerators. This paper presents the Voltra chip and its utilization-optimized DNN accelerator architecture, which combines three-dimensional (3D) spatial data reuse with efficient and flexible shared memory access. The 3D spatial dataflow balances spatial data reuse across three dimensions, improving spatial utilization by up to 2.0x compared to a conventional 2D design. Within the shared memory access architecture, Voltra incorporates flexible data streamers that enable mixed-grained hardware data prefetching and dynamic memory allocation. These further improve temporal utilization by 2.12-2.94x over a non-prefetching architecture and deliver a 1.15-2.36x total latency speedup over a separated-memory architecture. Fabricated in 16 nm technology, the chip achieves a peak system energy efficiency of 1.60 TOPS/W and a system area efficiency of 1.25 TOPS/mm², which is competitive with state-of-the-art solutions while maintaining high utilization across diverse workloads.
Index Terms — DNN Accelerator, 3D Spatial Data Reuse, Flexible and Efficient Data Access, Shared Memory, High Utilization
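To make the notion of spatial utilization concrete, the following is a minimal sketch of how it is commonly estimated when a matrix multiplication of size (M, N, K) is tiled onto a fixed array of multiply-accumulate units. The array shapes `ARRAY_2D` and `ARRAY_3D`, the function names, and the example workload are hypothetical and chosen only for illustration; they are not Voltra's actual dimensions, and the result is not intended to reproduce the measured figures above.

```python
from math import ceil

def spatial_utilization(workload, array):
    """Fraction of processing elements doing useful work, averaged over all tiles.

    workload: loop sizes that are unrolled spatially, e.g. (M, N) or (M, N, K)
    array:    number of hardware lanes along each of those dimensions
    """
    useful = 1    # MACs the workload actually needs
    occupied = 1  # MAC slots the array spends, including padded lanes
    for size, lanes in zip(workload, array):
        useful *= size
        occupied *= ceil(size / lanes) * lanes
    return useful / occupied

# Hypothetical array shapes with the same total of 512 MAC units.
ARRAY_2D = (16, 32)   # unrolls M and N; K is iterated over time (assumption)
ARRAY_3D = (8, 8, 8)  # unrolls M, N, and K (assumption)

def util_2d(m, n, k):
    # K is temporal in this 2D model, so it does not affect spatial utilization.
    return spatial_utilization((m, n), ARRAY_2D)

def util_3d(m, n, k):
    return spatial_utilization((m, n, k), ARRAY_3D)

if __name__ == "__main__":
    # A "skinny" workload: small M, larger N and K (illustrative only).
    m, n, k = 8, 64, 64
    print(f"2D utilization: {util_2d(m, n, k):.2f}")
    print(f"3D utilization: {util_3d(m, n, k):.2f}")
```

In this toy model, a workload with a small M leaves half of the rows of the wide 2D array idle, while the more balanced 3D shape can fill every lane by also unrolling K. Real designs must additionally account for memory bandwidth and stalls, which this sketch ignores.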