High-Performance Memory Expansion IP for AI Accelerators

Overview

Large Language Model (LLM) inference typically comprises a prefill stage and a decode stage: prefill is compute bound, while decode is memory bound and dominates tokens-per-second performance at the low batch sizes typical of “online”, interactive applications. Conventional lossy compression approaches used by foundational model providers (quantization, pruning, weight sharing, low-rank factorization, knowledge distillation, sparse attention) result in information loss and reduced accuracy.

ZeroPoint’s AI-MX is an advanced memory compression IP designed to maximize the capacity and bandwidth of HBM and LPDDR in AI accelerators. By leveraging a high-performance compression algorithm, AI-MX enables near-instantaneous compression and decompression of AI workloads, significantly enhancing memory efficiency. AI-MX is capable of compressing models that have already been optimized with the techniques above.

For AI applications running large foundational models, AI-MX can provide up to 50% more effective memory capacity and bandwidth, translating into higher tokens-per-second throughput, which is critical for LLAMA3 BF16 and similar workloads.
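
To illustrate how a 1.5x ratio translates into capacity and decode throughput, the sketch below models the decode stage as purely memory-bandwidth bound. The C++ snippet, its variable names, and the 100GB / 3TB/s / 140GB-per-token figures are illustrative assumptions made for this example, not measured AI-MX results.

    // Illustrative sketch: effect of a 1.5x compression ratio on capacity
    // and on a bandwidth-bound decode stage. All figures below are
    // assumptions chosen for the example, not measured AI-MX numbers.
    #include <cstdio>

    int main() {
        const double compression_ratio  = 1.5;    // AI-MX ratio for LLM model data at BF16
        const double physical_hbm_gb    = 100.0;  // assumed physical HBM capacity
        const double hbm_bandwidth_gbps = 3000.0; // assumed aggregate HBM bandwidth, GB/s
        const double bytes_per_token_gb = 140.0;  // assumed weights read per token (~70B params at BF16)

        // Capacity: compressed model data occupies 1/ratio of its original footprint,
        // so the same physical HBM holds ratio-times more model data.
        const double effective_capacity_gb = physical_hbm_gb * compression_ratio;

        // Bandwidth-bound decode: each byte transferred carries ratio-times more model data.
        const double baseline_tok_s = hbm_bandwidth_gbps / bytes_per_token_gb;
        const double aimx_tok_s     = baseline_tok_s * compression_ratio;

        std::printf("Effective capacity: %.0f GB of model data in %.0f GB of HBM\n",
                    effective_capacity_gb, physical_hbm_gb);
        std::printf("Decode throughput : %.1f -> %.1f tokens/s\n", baseline_tok_s, aimx_tok_s);
        return 0;
    }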

Integration

AI-MX will be delivered in two generations; a simplified data-path sketch follows the list below.

  • First generation: AI-MX G1 (Asymmetrical Architecture)
    • Foundational models are compressed in software before deployment.
    • In-line decompression in hardware occurs when loading data from HBM/LPDDR to the accelerator.
    • Achieves up to 1.5X expansion and bandwidth acceleration, with efficiency dependent on model layer resolutions.
  • Second generation: AI-MX G2 (Symmetrical Architecture)
    • Full memory compression and decompression in hardware (in-line).
    • Extends benefits beyond foundational models to include KV-cache, activations, and runtime data.
    • Delivers up to 2X memory expansion and bandwidth acceleration, optimizing unstructured runtime data.
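
The sketch below contrasts the two data paths described above. The Buffer type and the xzi_compress/xzi_decompress names are placeholders invented for illustration (implemented here as identity stubs so the sketch compiles); they are not part of the AI-MX deliverables or API.

    // Minimal sketch contrasting the G1 (asymmetric) and G2 (symmetric) data paths.
    // xzi_compress/xzi_decompress are hypothetical placeholders for the XZI algorithm.
    #include <cstdint>
    #include <vector>

    using Buffer = std::vector<uint8_t>;

    // Identity stubs; the real transforms run in the AI-MX software tool (G1
    // compression) or in the in-line hardware engines.
    Buffer xzi_compress(const Buffer& raw)      { return raw; }
    Buffer xzi_decompress(const Buffer& packed) { return packed; }

    // G1 (asymmetric): model data is compressed once, in software, before
    // deployment; only decompression sits in the hardware read path.
    Buffer g1_deploy_weights(const Buffer& bf16_weights) {
        return xzi_compress(bf16_weights);      // offline, software
    }
    Buffer g1_read_from_hbm(const Buffer& packed) {
        return xzi_decompress(packed);          // in-line hardware, on load
    }

    // G2 (symmetric): writes are compressed in-line too, so runtime data such
    // as the KV-cache and activations also benefit.
    Buffer g2_write_to_hbm(const Buffer& kv_or_activations) {
        return xzi_compress(kv_or_activations); // in-line hardware, on store
    }
    Buffer g2_read_from_hbm(const Buffer& packed) {
        return xzi_decompress(packed);          // in-line hardware, on load
    }

    int main() { return 0; }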

Performance / KPI (Samsung 4nm / TSMC N5 technology)

Compression ratio: 1.5x for LLM model data at BF16; 1.35x - 2.0x for dynamic data, i.e. KV-cache and activations
Frequency: Up to 1.75 GHz
Throughput: Matches the throughput of the HBM and LPDDR memory channel
IP area: Decompressor area starting at 0.04 mm² (50 GB/s scalable per-engine throughput)
Memory technologies supported: HBM, GDDR, LPDDR, DDR, SRAM
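
The per-engine figures above imply a simple scaling rule for sizing a deployment. The sketch below estimates how many 50 GB/s decompressor engines, and how much total area, would be needed to match an assumed 3 TB/s aggregate HBM bandwidth; the bandwidth target is an assumption for the example, not a specification value.

    // Illustrative sizing sketch: engines and area needed to match a target
    // memory bandwidth. The 3 TB/s target is an assumed HBM configuration.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double target_bandwidth_gbps = 3000.0; // assumed aggregate HBM bandwidth
        const double per_engine_gbps       = 50.0;   // scalable per-engine throughput
        const double per_engine_area_mm2   = 0.04;   // starting decompressor area

        const int engines = static_cast<int>(std::ceil(target_bandwidth_gbps / per_engine_gbps));
        std::printf("Engines needed : %d\n", engines);
        std::printf("Total area     : ~%.2f mm^2\n", engines * per_engine_area_mm2);
        return 0;
    }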

Key Features

  • Standards: XZI (proprietary)
  • SW model compression and in-line hardware accelerated decompression (Gen 1)
  • Hardware accelerated in-line compression and decompression (Gen 2)
  • On-the-fly multi-algorithm switching capability without recompression
  • Architecture
    •  Modular architecture enables seamless scalability
    •  Architectural configuration parameters accessible to fine-tune performance

Benefits

  • Expand Effective HBM Capacity by up to 50%: Compress AI model data in memory, allowing, for example, 150GB of model data to fit into 100GB of physical HBM space. AI-MX’s modular architecture enables multiple instances of the IP to operate coherently to match HBM’s TB/s throughput and to boost LPDDR6 effective bandwidth closer to HBM levels.
  • Enhance AI Accelerator Throughput: An AI accelerator with 4 HBM stacks equipped with AI-MX effectively operates as if it had 6 HBM stacks.
  • Boost Effective HBM Bandwidth: Improve bandwidth utilization by transferring up to 50% more AI model data per transaction.
  • Integrated Address Translation and Memory Management: AI-MX includes a patented, hardware-accelerated, real-time address translation unit that manages the mapping between compressed and uncompressed address spaces. This makes the IP transparent to the SoC/ASIC and enables drop-in integration with no changes to the SoC fabric interface or DMA logic (a conceptual sketch follows this list).
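
The conceptual sketch below models the compressed-to-uncompressed address mapping as a per-page lookup table. The page size, table structure, and class/field names are assumptions made for illustration; they do not describe the actual AI-MX translation unit's microarchitecture.

    // Conceptual sketch of compressed/uncompressed address translation.
    // Granule size, table layout, and names are illustrative assumptions only.
    #include <cstdint>
    #include <unordered_map>

    constexpr uint64_t kPageBytes = 4096;   // assumed translation granule

    struct CompressedLocation {
        uint64_t compressed_base;           // where the compressed page starts in physical memory
        uint32_t compressed_bytes;          // size of the page after compression
    };

    class TranslationUnit {
    public:
        // The SoC issues ordinary (uncompressed-space) addresses; translation is
        // handled here, so fabric interfaces and DMA logic are unchanged.
        // Returns where the compressed page lives; the decompressor then
        // reconstructs the page and serves the requested offset.
        CompressedLocation translate(uint64_t uncompressed_addr) const {
            return table_.at(uncompressed_addr / kPageBytes);
        }

        void map(uint64_t page, CompressedLocation loc) { table_[page] = loc; }

    private:
        std::unordered_map<uint64_t, CompressedLocation> table_;
    };

    int main() { return 0; }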

Block Diagram

High-Performance Memory Expansion IP for AI Accelerators Block Diagram

Applications

AI-MX is tailored for high-performance AI acceleration across multiple domains:

  •  Data center & Edge AI accelerators
  •  Smart devices
  •  Autonomous and Intelligent Edge devices
  •  Embedded AI systems

As workloads increasingly shift towards large-scale foundational models, AI-MX delivers a scalable solution to overcome memory bottlenecks while maintaining ultra-low latency.

Deliverables

  • Performance evaluation license
    • C++ compression model for integration in customer performance simulation model 
    • FPGA evaluation license
    •  Encrypted IP delivery (Xilinx) 
  • HDL Source Licenses
    •  Synthesizable System Verilog RTL (encrypted)
    •  Implementation constraints
    •  UVM testbench (self-checking)
    •  Vectors for testbench and expected results
    •  User Documentation

Technical Specifications

Availability
Second half of 2025