Large Language Model (LLM) inference typically has two stages: prefill and decode. Prefill is compute bound; decode is memory bound and dominates tokens-per-second performance in low-batch-size, "online" interactive applications. Conventional LLM lossy compression approaches (quantization, pruning, weight sharing, low-rank factorization, knowledge distillation, sparse attention) used by foundational model providers result in information loss and reduced accuracy.
ZeroPoint’s AI-MX is an advanced memory compression IP designed to maximize the capacity and bandwidth of HBM and LPDDR in AI accelerators. By leveraging a high-performance compression algorithm, AI-MX enables near-instantaneous compression and decompression of AI workloads, significantly enhancing memory efficiency. AI-MX is capable of compressing models that have already been optimized.
For AI applications running large foundational models, AI-MX can provide up to 50% more effective memory capacity and bandwidth, translating into higher tokens-per-second throughput, which is critical for LLAMA3 BF16 and similar workloads.
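To see why extra effective bandwidth maps directly onto decode throughput, a simple roofline sketch helps. All numbers below are illustrative assumptions (a hypothetical 70B-parameter BF16 model and an assumed 3,300 GB/s of HBM bandwidth), not product specifications:

```python
# Sketch: memory-bound decode throughput and the effect of compression.
# Every generated token streams the full weight set from memory, so at
# batch size 1: tokens/s <= effective bandwidth / bytes per token.

def decode_tokens_per_s(mem_bw_gbps, model_bytes_gb, compression=1.0):
    """Roofline upper bound on tokens/s for memory-bound decode.
    Compression raises the effective bandwidth of the channel."""
    effective_bw = mem_bw_gbps * compression
    return effective_bw / model_bytes_gb

# Assumed example: 70e9 params * 2 bytes (BF16) = 140 GB of weights.
model_gb = 70e9 * 2 / 1e9
baseline = decode_tokens_per_s(3300, model_gb)        # no compression
with_mx = decode_tokens_per_s(3300, model_gb, 1.5)    # 1.5x effective bandwidth
print(f"baseline ~{baseline:.1f} tok/s, with 1.5x compression ~{with_mx:.1f} tok/s")
```

Because decode is bandwidth bound, a 1.5x compression ratio scales the tokens-per-second ceiling by the same 1.5x factor in this simplified model.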
Integration
AI-MX will be delivered in two generations.
- First generation: AI-MX G1 (Asymmetrical Architecture)
  - Foundational models are compressed in software before deployment.
  - In-line decompression in hardware occurs when loading data from HBM/LPDDR to the accelerator.
  - Achieves up to 1.5X expansion and bandwidth acceleration, with efficiency dependent on model layer resolutions.
- Second generation: AI-MX G2 (Symmetrical Architecture)
  - Full memory compression and decompression in hardware (in-line).
  - Extends benefits beyond foundational models to include KV-cache, activations, and runtime data.
  - Delivers up to 2X memory expansion and bandwidth acceleration, optimizing unstructured runtime data.
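The practical difference between the two generations is coverage: G1 compresses only the static model weights, while G2 also covers dynamic data such as the KV-cache. A quick blended-ratio sketch shows why that matters. The 70/30 weights-to-dynamic-data split is a hypothetical workload mix; the per-class ratios follow the figures quoted in this document (1.5x for model data, up to 2.0x for dynamic data):

```python
# Sketch: blended effective expansion when only part of memory traffic
# is compressible. Compressed footprint = sum(fraction / ratio), and the
# blended expansion is its reciprocal.

def blended_expansion(fractions_and_ratios):
    """fractions_and_ratios: list of (fraction_of_footprint, compression_ratio)."""
    compressed = sum(frac / ratio for frac, ratio in fractions_and_ratios)
    return 1.0 / compressed

# G1: weights compressed 1.5x in software, dynamic data left uncompressed.
g1 = blended_expansion([(0.7, 1.5), (0.3, 1.0)])
# G2: weights 1.5x, and dynamic data (KV-cache, activations) up to 2.0x.
g2 = blended_expansion([(0.7, 1.5), (0.3, 2.0)])
print(f"G1 ~{g1:.2f}x vs G2 ~{g2:.2f}x effective expansion")
```

Under this assumed mix, extending compression to runtime data lifts the blended expansion from roughly 1.3x to over 1.6x, which is why the symmetrical G2 architecture can approach its 2X headline figure on KV-cache-heavy workloads.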
Performance / KPI (Samsung 4nm / TSMC N5 technology)
| Feature | Performance |
| --- | --- |
| Compression ratio | 1.5x for LLM model data at BF16; 1.35x - 2.0x for dynamic data, e.g. KV cache and activations |
| Frequency | Up to 1.75 GHz |
| Throughput | Matches the throughput of the HBM and LPDDR memory channel |
| IP area | Decompressor area starting at 0.04 mm² (50 GB/s scalable per-engine throughput) |
| Memory technologies supported | HBM, GDDR, LPDDR, DDR, SRAM |
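The per-engine figures above (50 GB/s throughput, 0.04 mm² area) suggest a simple sizing rule of thumb: instantiate enough parallel engines to saturate the target memory channel. The 819 GB/s figure below is an assumed example channel bandwidth (roughly one HBM3 stack), not a product specification:

```python
import math

# Sketch: sizing decompressor engines to a memory channel, using the
# per-engine KPI figures from the table above.
ENGINE_GBPS = 50.0   # scalable per-engine throughput
ENGINE_MM2 = 0.04    # decompressor area per engine (starting point)

def engines_for_channel(channel_gbps):
    """Return (engine_count, total_area_mm2) so that aggregate decompression
    throughput is not the bottleneck for the given channel bandwidth."""
    n = math.ceil(channel_gbps / ENGINE_GBPS)
    return n, n * ENGINE_MM2

n, area = engines_for_channel(819)  # assumed ~819 GB/s HBM3 stack
print(f"{n} engines, ~{area:.2f} mm2 total decompressor area")
```

Under these assumptions, matching one such channel takes on the order of seventeen engines and well under 1 mm², which is consistent with the claim that throughput scales to match the HBM/LPDDR channel.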