SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation
Zicheng He, University of California, Los Angeles, USA
Anhao Zhao, Institute of Digital Twin, Eastern Institute of Technology, China
Xiaoyu Shen, Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, China
Chen Wu, Chiplet CAD and Manufacturing Engineering Research Center of Zhejiang Province, Ningbo Institute of Digital Twin, Eastern Institute of Technology, China
He Lei, Eastern Institute of Technology, China
Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their inference efficiency remains a critical bottleneck due to rapidly growing parameters. Recent advances in dynamic computation allocation address this challenge by exploiting the highly uneven contributions of different tokens and layers, enabling selective execution that significantly reduces redundant computation while preserving model accuracy. However, existing hardware platforms and accelerators are primarily optimized for uniform, static execution, limiting their ability to efficiently support such dynamic inference patterns. In this work, we propose SkipOPU, an FPGA-based overlay processor that dynamically allocates computation across tokens and layers with high flexibility through a lightweight routing mechanism. First, we decouple reduction operations from element-wise computation in nonlinear modules and perform reductions incrementally, which enables both stages to be fused with adjacent linear operations (router or matrix multiplication) for effective latency hiding. Second, motivated by asymmetric sensitivity to numerical precision between activation and weight, we design a PE array that efficiently supports float-fixed hybrid execution. A novel DSP overpacking technique is introduced to maximize hardware utilization while minimizing resource overhead. Finally, we develop a proactive on-chip KV history buffer that exploits cross-layer KV invariance of pruned tokens, eliminating irregular HBM accesses during decoding and supplementing off-chip bandwidth through high-locality on-chip reuse. Experimental results demonstrate that SkipOPU on an AMD U280 FPGA outperforms GPU and other FPGA-based accelerators by 1.23x-3.83x in bandwidth efficiency for LLMs inference with dynamic computation allocation and can reduce up to 25.4% KV storage overhead across varying sequence lengths.
Related Semiconductor IP
- ASIL B Compliant MIPI CSI-2 CSE2 Security Module
- SHA-256 Secure Hash Algorithm IP Core
- EdDSA Curve25519 signature generation engine
- DeWarp IP
- 6-bit, 12 GSPS Flash ADC - GlobalFoundries 22nm
Related Articles
- SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models
- RoMe: Row Granularity Access Memory System for Large Language Models
- An FPGA-Based SoC Architecture with a RISC-V Controller for Energy-Efficient Temporal-Coding Spiking Neural Networks
- An Example Verification Environment for Different Types of Processor Models
Latest Articles
- SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation
- TensorPool: A 3D-Stacked 8.4TFLOPS/4.3W Many-Core Domain-Specific Processor for AI-Native Radio Access Networks
- Assertain: Automated Security Assertion Generation Using Large Language Models
- VolTune: A Fine-Grained Runtime Voltage Control Architecture for FPGA Systems
- A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators