Shattering the neural network memory wall with Checkmate
A recent paper published on arXiv by a team of UC Berkeley researchers observes that neural network training is increasingly constrained by the limited capacity of on-device GPU memory. Indeed, deep learning continually tests the limits of memory capacity on neural network accelerators, as models train on high-resolution images, 3D point clouds, and long Natural Language Processing (NLP) sequences.
“In these applications, GPU memory usage is dominated by the intermediate activation tensors needed for backpropagation. The limited availability of high bandwidth on-device memory creates a memory wall that stifles exploration of novel architectures,” the researchers explain. “One of the main challenges when training large neural networks is the limited capacity of high-bandwidth memory on accelerators such as GPUs and TPUs. Critically, the bottleneck for state-of-the-art model development is now memory rather than data and compute availability and we expect this trend to worsen in the near future.”
As the researchers point out, some initiatives to address this bottleneck drop intermediate activations during the forward pass and recompute them during backpropagation, a strategy known as rematerialization, in order to scale to larger neural networks under memory constraints. However, these heuristics assume uniform per-layer costs and are limited to simple architectures with linear graphs. The UC Berkeley team instead uses off-the-shelf numerical solvers to compute optimal rematerialization strategies for arbitrary deep neural networks in TensorFlow with non-uniform computation and memory costs. The team also demonstrates that optimal rematerialization enables larger batch sizes and substantially reduced memory usage, with minimal computational overhead, across a range of image classification and semantic segmentation architectures.
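For a concrete sense of the trade-off involved, the sketch below uses TensorFlow's built-in tf.recompute_grad to checkpoint a small block: its intermediate activations are discarded after the forward pass and recomputed during backpropagation, exchanging extra compute for lower peak memory. This illustrates plain, uniform checkpointing rather than the solver-optimized plans the Checkmate paper produces; the layer shapes, variables, and loss here are illustrative assumptions.

```python
import tensorflow as tf

# Hypothetical weights standing in for a model's parameters.
w1 = tf.Variable(tf.random.normal([1024, 4096]))
w2 = tf.Variable(tf.random.normal([4096, 1024]))

def block(x):
    # Two matmul + ReLU layers whose intermediate activations
    # would normally be kept in memory for backpropagation.
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.nn.relu(tf.matmul(h, w2))

# tf.recompute_grad frees the block's intermediates after the forward
# pass and rematerializes them during the backward pass, trading extra
# compute for a lower peak memory footprint.
checkpointed_block = tf.recompute_grad(block)

x = tf.random.normal([32, 1024])  # illustrative input batch
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(checkpointed_block(x)))
grads = tape.gradient(loss, [w1, w2])
```

The distinction matters: a uniform wrapper like this treats every activation in the block identically, whereas Checkmate's solver decides, tensor by tensor, which activations to keep and which to recompute so that total overhead is minimized under a given memory budget.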