VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
By Zi-Wei Lin and Tian-Sheuan Chang
Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Abstract
We present VitaLLM, a mixed-precision accelerator that enables ternary-weight large language models to run efficiently on edge devices. The design combines two compute cores: a multiplier-free TINT core for ternary–INT projections, and a BoothFlex core that reuses a radix-4 Booth datapath for both INT8×INT8 attention and ternary–INT. Sharing the datapath sustains utilization without duplicating arrays. A predictive sparse attention mechanism employs a leading-one (LO) surrogate with a comparison-free top-K selector to prune key/value (KV) fetches by roughly 1−K/M for M cached tokens, confining exact attention to K candidates. System-level integration uses head-level pipelining and an absmax-based quantization barrier to standardize cross-core interfaces and overlap nonlinear reductions with linear tiles. A 16 nm silicon prototype at 1 GHz/0.8 V achieves 72.46 tokens/s in decode and 0.88 s prefill (64 tokens) within 0.214 mm² and 120 KB of on-chip memory, while reducing KV traffic and improving utilization in ablations. These results demonstrate practical BitNet b1.58 (3B) inference on edge-class platforms and provide a compact blueprint for future mixed-precision LLM accelerators.
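The predictive sparse attention idea can be illustrated with a small software sketch. This is not the authors' hardware or algorithm. It assumes the LO surrogate replaces each integer operand by the signed power of two at its leading-one bit, so that surrogate dot products need only shifts and adds. It also substitutes an ordinary sort for the comparison-free top-K selector. The names `leading_one`, `surrogate_score`, and `predictive_sparse_attention` are hypothetical.

```python
import math


def leading_one(x: int) -> int:
    """Signed power of two at the leading-one bit of |x| (0 stays 0).

    Assumption: this is one plausible reading of a "leading-one surrogate".
    """
    if x == 0:
        return 0
    magnitude = 1 << (abs(x).bit_length() - 1)
    return magnitude if x > 0 else -magnitude


def surrogate_score(q, k):
    """Approximate q.k using only power-of-two operands (shift-and-add in hardware)."""
    return sum(leading_one(a) * leading_one(b) for a, b in zip(q, k))


def predictive_sparse_attention(q, keys, values, top_k):
    """Rank M cached tokens by surrogate score, then run exact attention on top_k.

    Returns (output vector, selected token indices, fraction of tokens pruned).
    """
    m = len(keys)
    scores = [surrogate_score(q, k) for k in keys]

    # Assumption: a plain sort stands in for the comparison-free selector.
    candidates = sorted(range(m), key=lambda i: scores[i], reverse=True)[:top_k]
    candidates.sort()  # restore cache order for the exact pass

    # Exact scaled dot-product attention restricted to the K candidates.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in candidates]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)

    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, candidates)) / total
        for d in range(dim)
    ]

    # Only K of M tokens go through the exact pass: pruned fraction is 1 - K/M.
    pruned_fraction = 1.0 - top_k / m
    return output, candidates, pruned_fraction


if __name__ == "__main__":
    q = [4, -2]
    keys = [[8, 0], [0, 8], [-8, 0], [4, 4]]
    values = [[1, 0], [0, 1], [2, 2], [3, 3]]
    out, picked, pruned = predictive_sparse_attention(q, keys, values, top_k=2)
    print(picked, pruned)
```

With M = 4 cached tokens and K = 2, the sketch runs exact attention on half of the cache, matching the 1−K/M estimate in the abstract. How the real design stores and fetches the data needed for the surrogate pass is not described here, so the sketch makes no claim about it.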