High-Performance Memory Expansion IP for AI Accelerators

Overview

AI inference performance is increasingly constrained by memory bandwidth and capacity rather than compute. This is especially true for large language models (LLMs): the prefill stage is compute-bound, but the decode stage - critical for low-latency, real-time applications - is memory-bound. Traditional approaches such as quantization, pruning, and distillation reduce memory usage but compromise accuracy.

AI-MX offers a complementary solution: lossless, nanosecond-latency memory compression that enables up to 1.5x more effective HBM or LPDDR capacity and bandwidth. It enhances AI accelerators by transparently compressing memory traffic - without changes to DMA logic or SoC architecture.
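
Because the decode stage is memory-bound, its token rate scales roughly with effective memory bandwidth. The sketch below illustrates that relationship under stated assumptions: only the 1.5x ratio is the headline figure from this page, while the bandwidth and model-size numbers are hypothetical placeholders, not AI-MX specifications.

    # Illustrative decode-throughput estimate; only the 1.5x ratio comes from this page.

    def decode_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Memory-bound estimate: each generated token streams the model weights once."""
        return bandwidth_gb_s / model_size_gb

    hbm_bandwidth_gb_s = 3300.0   # hypothetical aggregate HBM bandwidth, GB/s
    model_size_gb = 140.0         # hypothetical 70B-parameter model stored in BF16
    ratio = 1.5                   # AI-MX headline expansion for BF16 model data

    baseline = decode_tokens_per_s(hbm_bandwidth_gb_s, model_size_gb)
    with_ai_mx = decode_tokens_per_s(hbm_bandwidth_gb_s * ratio, model_size_gb)

    print(round(baseline, 1))     # ~23.6 tokens/s (single stream, weight-streaming limited)
    print(round(with_ai_mx, 1))   # ~35.4 tokens/s, i.e. up to ~1.5x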

Standards

  • Compression: XZI (proprietary)
  • Interface: AXI4, CHI
  • Plug-in compatible with industry-standard memory controllers

Architecture

  • Modular architecture enables seamless scalability
  • Architectural configuration parameters accessible to fine-tune performance

Integration

AI-MX is available in two versions:

AI-MX G1 - Asymmetrical Architecture

  • Models are compressed in software before deployment
  • Hardware performs in-line decompression at runtime
  • Suitable for static model data
  • Up to 1.5x expansion

AI-MX G2 - Symmetrical Architecture

  • Full in-line hardware compression + decompression
  • Accelerates not just the model, but also KV-cache, activations, and runtime data
  • Up to 2x expansion on unstructured data
  • Ideal for dynamic, memory-bound workloads

Performance / KPI

  • Compression ratio: 1.5x for LLM model data at BF16; 1.35x - 2.0x for dynamic data such as KV cache and activations (see the KV-cache sizing sketch below)
  • Throughput: matches the throughput of the attached HBM or LPDDR memory channel
  • Frequency: up to 1.75 GHz (TSMC 5 nm)
  • IP area: throughput dependent - contact for information
  • Memory technologies supported: HBM, GDDR, LPDDR, DDR, SRAM
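
To make the dynamic-data ratio concrete, the sketch below estimates how many KV-cache tokens fit in a fixed HBM budget with and without compression, using the standard per-token size formula (2 x layers x KV heads x head dimension x bytes per value). The model dimensions, the 8 GB budget, and the 1.35x ratio taken from the range above are illustrative assumptions, not measured AI-MX results.

    # Illustrative KV-cache sizing; model dimensions and budget are hypothetical.

    def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
        # Keys and values are both stored, hence the factor of 2.
        return 2 * layers * kv_heads * head_dim * dtype_bytes

    # Hypothetical LLM: 32 layers, 8 KV heads, head dimension 128, BF16 (2 bytes).
    per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)

    budget_bytes = 8 * 1024**3        # 8 GB of HBM reserved for the KV cache
    ratio = 1.35                      # lower bound of the dynamic-data range above

    tokens_plain = budget_bytes // per_token
    tokens_compressed = int(budget_bytes * ratio) // per_token

    print(per_token)          # 131072 bytes per token (128 KiB)
    print(tokens_plain)       # 65536 tokens of context fit uncompressed
    print(tokens_compressed)  # 88473 tokens, ~1.35x more in the same footprint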

Key Features

  • Software model compression with in-line hardware-accelerated decompression (Gen 1)
  • Hardware-accelerated in-line compression and decompression (Gen 2)
  • On-the-fly multi-algorithm switching without recompression
  • Cache-line granularity for high throughput and low latency

Benefits

  • +50% Effective HBM Capacity - Compress model and runtime data to fit 150 GB of workload into 100 GB of physical memory (see the capacity sketch after this list).
  • +50% Bandwidth Efficiency - Transfer more data per memory transaction, accelerating model throughput (tokens per second) without increasing power.
  • +50% More Throughput from the Same Silicon - An AI accelerator with 4 HBM stacks behaves like it has 6 - without changing the memory controller.
  • Transparent System Integration - Includes a patented, real-time Address Translation Unit for seamless compressed memory access.
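
The capacity arithmetic behind these figures is simple multiplication by the expansion ratio; the sketch below reproduces it with the 1.5x headline ratio. The 24 GB per-stack capacity is a hypothetical example, not an AI-MX requirement.

    # Illustrative capacity arithmetic for the benefits above; 1.5x is the headline ratio.

    RATIO = 1.5

    # 100 GB of physical memory holds up to 150 GB of compressed workload.
    physical_gb = 100
    print(physical_gb * RATIO)         # 150.0 GB effective

    # 4 HBM stacks behave like 6: effective stacks = physical stacks * ratio.
    stacks = 4
    stack_gb = 24                      # hypothetical per-stack capacity
    print(stacks * RATIO)              # 6.0 effective stacks
    print(stacks * stack_gb * RATIO)   # 144.0 GB effective from 96 GB physical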

Block Diagram

High-Performance Memory Expansion IP for AI Accelerators Block Diagram

Applications

  • AI Inference Accelerators (Datacenter, Edge)
  • Generative AI (LLMs like LLaMA3, GPT, Claude)
  • Smart devices with LPDDR bottlenecks
  • Automotive & Embedded AI Systems
  • Any SoC using HBM, LPDDR, GDDR or DDR

Deliverables

  • Performance evaluation license: C++ compression model for integration into the customer's performance simulation model
  • HDL Source Licenses
    • Synthesizable SystemVerilog RTL (encrypted)
    • Implementation constraints
    • UVM testbench (self-checking)
    • Vectors for testbench and expected results
    • User Documentation
