AI inference performance is increasingly constrained by memory bandwidth and capacity rather than compute. This is especially true for large language models (LLMs), where the prefill stage is compute-bound but the decode stage, which is critical for low-latency, real-time applications, is memory-bound. Traditional approaches such as quantization, pruning, and distillation reduce memory usage but compromise accuracy.
AI-MX offers a complementary solution: lossless, nanosecond-latency memory compression that provides up to 1.5x more effective HBM or LPDDR capacity and bandwidth. AI-MX enhances AI accelerators by transparently compressing memory traffic, with no changes to DMA logic or SoC architecture.
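As a rough illustration of why the decode stage is memory-bound and what a 1.5x gain in effective bandwidth can mean, the sketch below estimates an upper bound on single-stream decode throughput from weight traffic alone. The model size, BF16 weight format, and HBM bandwidth are assumed round numbers for illustration, not vendor data.

```python
# Illustrative back-of-the-envelope estimate with assumed round numbers:
# in the memory-bound decode stage, each generated token must stream the
# full set of model weights from memory, so the token rate is roughly
# effective_bandwidth / bytes_read_per_token.

MODEL_PARAMS = 70e9        # assumed 70B-parameter model
BYTES_PER_PARAM = 2        # BF16 weights
HBM_BANDWIDTH = 3.35e12    # assumed ~3.35 TB/s HBM stack bandwidth

bytes_per_token = MODEL_PARAMS * BYTES_PER_PARAM  # weight bytes read per token

def decode_tokens_per_s(bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when weight traffic dominates."""
    return bandwidth_bytes_per_s / bytes_per_token

baseline = decode_tokens_per_s(HBM_BANDWIDTH)
with_mx = decode_tokens_per_s(HBM_BANDWIDTH * 1.5)  # 1.5x effective bandwidth

print(f"baseline decode bound:  {baseline:.1f} tok/s")
print(f"with 1.5x compression:  {with_mx:.1f} tok/s")
```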
Standards
- Compression: XZI (proprietary)
- Interface: AXI4, CHI
- Plug-in compatible with industry-standard memory controllers
Architecture
- Modular architecture enables seamless scalability
- Architectural configuration parameters are accessible to fine-tune performance (illustrated below)
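The actual configuration parameters are not listed here, so the sketch below is purely hypothetical: it shows how a modular, lane-based compression IP might be sized at integration time so that aggregate throughput keeps pace with the attached memory channel. None of the parameter names below are the real AI-MX interface.

```python
from dataclasses import dataclass

# Hypothetical illustration only: these parameter names are NOT the actual
# AI-MX configuration interface, just a sketch of sizing a modular,
# lane-based compression IP against a target memory channel.

@dataclass(frozen=True)
class CompressorConfig:
    lanes: int                  # parallel (de)compression engines
    bytes_per_lane_cycle: int   # data consumed per lane per clock
    clock_hz: float             # target clock after synthesis

    @property
    def peak_throughput_bytes_per_s(self) -> float:
        """Aggregate throughput the configuration can sustain."""
        return self.lanes * self.bytes_per_lane_cycle * self.clock_hz


# Example: size the IP so peak throughput covers an assumed ~1 TB/s channel.
cfg = CompressorConfig(lanes=16, bytes_per_lane_cycle=64, clock_hz=1.0e9)
assert cfg.peak_throughput_bytes_per_s >= 1.0e12
print(f"{cfg.peak_throughput_bytes_per_s / 1e12:.2f} TB/s peak")
```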
Integration
AI-MX is available in two versions:
AI-MX G1 - Asymmetrical Architecture
- Models are compressed in software before deployment
- Hardware performs in-line decompression at runtime
- Suitable for static model data
- Up to 1.5x expansion
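A minimal software model of the G1 flow described above, using zlib as a stand-in for the proprietary XZI codec: static model data is compressed once in software before deployment, and the runtime read path (performed in-line by hardware on AI-MX) is modeled in software here only to verify the lossless round trip. The ratio printed on synthetic random data says nothing about the figures quoted for AI-MX.

```python
import zlib
import numpy as np

# Sketch of the G1 (asymmetrical) flow. zlib stands in for the proprietary
# XZI codec purely for illustration; runtime decompression happens in
# hardware on AI-MX, not in Python.

def deploy_compressed(weights: np.ndarray) -> bytes:
    """Offline, software-side step: compress static model data before it
    is written to device memory."""
    return zlib.compress(weights.tobytes(), level=6)

def runtime_fetch(blob: bytes, dtype, shape) -> np.ndarray:
    """Runtime read path, done in-line by hardware on AI-MX; modeled in
    software so the round trip can be checked."""
    return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

weights = np.random.randn(1024, 1024).astype(np.float32)  # stand-in "model"
blob = deploy_compressed(weights)
restored = runtime_fetch(blob, weights.dtype, weights.shape)

assert np.array_equal(weights, restored)  # lossless round trip
print(f"expansion on this synthetic data: {weights.nbytes / len(blob):.2f}x")
```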
AI-MX G2 - Symmetrical Architecture
- Full in-line hardware compression + decompression
- Accelerates not just the model, but also KV-cache, activations, and runtime data
- Up to 2x expansion on unstructured data
- Ideal for dynamic, memory-bound workloads
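A corresponding sketch of the G2 flow just described: dynamic data such as KV-cache blocks is compressed on the write path and decompressed on the read path at runtime, in both directions. Again zlib stands in for XZI, the cache class and its synthetic data are illustrative only, and the real operations are performed in-line by hardware rather than in software.

```python
import zlib
import numpy as np

# Sketch of the G2 (symmetrical) flow: compress dynamic data on the write
# path, decompress on the read path. Software model only; zlib is a
# stand-in for the proprietary XZI codec.

class CompressedKVCache:
    def __init__(self):
        self._blocks = {}  # block_id -> (compressed bytes, shape, dtype)

    def write(self, block_id: int, kv: np.ndarray) -> None:
        """Write path: compress the block before it reaches memory."""
        self._blocks[block_id] = (zlib.compress(kv.tobytes()), kv.shape, kv.dtype)

    def read(self, block_id: int) -> np.ndarray:
        """Read path: decompress on the way back to the accelerator."""
        blob, shape, dtype = self._blocks[block_id]
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

    def footprint(self) -> int:
        """Physical bytes actually held for all stored blocks."""
        return sum(len(blob) for blob, _, _ in self._blocks.values())

cache = CompressedKVCache()
kv_block = np.zeros((16, 128, 64), dtype=np.float16)  # synthetic, highly compressible
cache.write(0, kv_block)
assert np.array_equal(cache.read(0), kv_block)  # lossless round trip
print(f"{kv_block.nbytes} logical bytes stored in {cache.footprint()} physical bytes")
```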
Performance / KPI
| Feature | Performance |
| --- | --- |
| Compression ratio | 1.5x for LLM model data at BF16; 1.35x to 2.0x for dynamic data, e.g. KV cache and activations |
| Throughput | Matches the throughput of the HBM or LPDDR memory channel |
| Frequency | Up to 1.75 GHz (5 nm TSMC) |
| IP area | Throughput dependent; contact for information |
| Memory technologies supported | HBM, GDDR, LPDDR, DDR, SRAM |
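To translate the ratios in the table above into effective on-package capacity, the worked example below uses assumed round numbers for physical HBM size and model footprint rather than vendor figures, and takes the conservative end of the dynamic-data range.

```python
# Worked example with assumed round numbers (not vendor figures): convert
# the compression ratios above into effective on-package capacity.

HBM_CAPACITY_GB = 96    # assumed physical HBM capacity
WEIGHTS_GB = 70         # assumed BF16 model footprint
WEIGHT_RATIO = 1.5      # model data ratio from the table
DYNAMIC_RATIO = 1.35    # conservative end of the 1.35x-2.0x dynamic range

physical_for_weights = WEIGHTS_GB / WEIGHT_RATIO          # HBM actually consumed
physical_for_dynamic = HBM_CAPACITY_GB - physical_for_weights
effective_dynamic = physical_for_dynamic * DYNAMIC_RATIO  # KV cache, activations

print(f"weights occupy {physical_for_weights:.1f} GB physical")
print(f"effective capacity: {WEIGHTS_GB + effective_dynamic:.1f} GB "
      f"vs {HBM_CAPACITY_GB} GB physical")
```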