Making Strong Error-Correcting Codes Work Effectively for HBM in AI Inference

By Rui Xie ¹, Yunhua Fang ¹, Asad Ul Haq ¹, Linsen Ma ², Sanchari Sen ³, Swagath Venkataramani ³, Liu Liu ¹, Tong Zhang ¹
¹ Rensselaer Polytechnic Institute, Troy, New York, USA
² ScaleFlux, Milpitas, California, USA
³ IBM T.J. Watson Research Center, Yorktown Heights, New York, USA

Abstract

LLM inference is increasingly memory-bound, and HBM cost per GB now dominates system cost. Today’s HBM stacks include short on-die ECC, which tightens binning, raises price, and locks reliability policy inside the device. This paper asks a simple question: can we tolerate a much higher raw HBM bit error rate (BER) and still keep end-to-end correctness and throughput, without changing the HBM PHY or the fixed 32B transaction size? We propose REACH (Reliability Extension Architecture for Cost-effective HBM), a controller-managed ECC design that keeps the HBM link and 32B transfers unchanged. REACH uses a two-level Reed- Solomon (RS) scheme: each 32B chunk uses an inner RS code to check and correct most faults locally, while chunks that cannot be fixed are marked as erasures. An outer RS code spans kilobytes and runs in erasure-only mode, repairing only the flagged chunks and avoiding the expensive locator step. For small random writes, REACHupdates outer parity with differential parity so it does not recompute parity over the whole span, and an optional importance adaptive bit-plane policy can protect onlycritical fields (for example, BF16 exponents) to reduce ECC work and traffic. On three LLMs at 8K context, REACH keeps ∼79% of the on-die ECC throughput at BER=0 and stays qualified up to raw BER = 10⁻³, extending toler able device error rates by about three orders of magnitude while keeping tokens/s nearly flat. In ASAP7, a full REACH controller occupies 15.2mm² and consumes 17.5W at 3.56TB/s (∼4.9pJ/byte), and it reduces ECC area and power by 11.6× and ∼60% compared to a naive long-RS baseline. By moving strong ECC into the controller, REACHturns long-code reliability into a system choice, enabling vendors to trade higher device BER for lower HBM $/GB under the same standard interface.

Keywords: HBM, DRAM,Memory, Error-Correcting Code, Reliability

To read the full article, click here

HBM IP Selector

Making Strong Error-Correcting Codes Work Effectively for HBM in AI Inference

Abstract

Related Semiconductor IP

Related Articles

Latest Articles

Related Articles

Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure

RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI

ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

From The Outside In Making Third-Party IP Work in Semiconductor Design

SNAP-V: A RISC-V SoC with Configurable Neuromorphic Acceleration for Small-Scale Spiking Neural Networks

An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS

A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA

VMXDOTP: A RISC-V Vector ISA Extension for Efficient Microscaling (MX) Format Acceleration

PDF: PUF-based DNN Fingerprinting for Knowledge Distillation Traceability

Making Strong Error-Correcting Codes Work Effectively for HBM in AI Inference

Abstract

Subscribe to the Semi IP Hub Newsletter

Related Semiconductor IP

Related Articles

Latest Articles