Pie: Pooling CPU Memory for LLM Inference
By Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica (UC Berkeley)
The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput.
This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware like the NVIDIA GH200 Grace Hopper Superchip, Pie enables concurrent data swapping without affecting foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts CPU memory allocation based on real-time information, optimizing memory usage and performance under varying conditions.
Pie maintains low computation latency, high throughput, and high elasticity. Our experimental evaluation demonstrates that Pie achieves optimal swapping policy during cache warmup and effectively balances increased memory capacity with negligible impact on computation. With its extended capacity, Pie outperforms vLLM by up to 1.9X in throughput and 2X in latency. Additionally, Pie can reduce GPU memory usage by up to 1.67X while maintaining the same performance. Compared to FlexGen, an offline profiling-based swapping solution, Pie achieves magnitudes lower latency and 9.4X higher throughput.
To read the full article, click here
Related Semiconductor IP
- Configurable CPU tailored precisely to your needs
- Multi-core capable 64-bit RISC-V CPU with vector extensions
- Multi-core capable 32-bit RISC-V CPU with vector extensions
- Ultra-low power consumption out-of-order commercial-grade 64-bit RISC-V CPU IP
- CPU IP Following the RVA23 Profile, supporting RVV1.0 and all extensions of Vector Crypto
Related Articles
- Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
- SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
- VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
- CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST
Latest Articles
- ZK-Flex: A Flexible and Scalable Framework for Accelerating Zero-Knowledge Proofs
- ITP-STDP: An Intrinsic-Timing Power-of-Two Learning Engine for On-Chip SNN Training
- OpenEye: A Scalable Open-Source Hardware Accelerator for DNNs
- CHIMERA: A Flexible and Scalable 3.1 TOPS/W AI-MCU with Transformer Accelerator and 563 Gb/s Shared-L2 Memory Subsystem with QoS Guarantees
- CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST