The Growing Importance of AI Inference and the Implications for Memory Technology

By Tim Messegee, Rambus

The rapid evolution of artificial intelligence (AI) over the past decade has reshaped the way we interact with technology and transformed industries across the board. From early predictive models and voice assistants to today’s highly complex generative AI systems, we are witnessing the dawn of a new era: AI 2.0. This shift, marked by the rise of multimodal generative AI systems such as large language models (LLMs) and other creative AI engines, brings not only a revolution in what AI can achieve but also a series of profound challenges for computer architecture and, in particular, memory technology. The growing demands of AI inference, particularly at the edge, and the rise of multi-chip architectures have placed unprecedented pressure on memory systems.

From AI 1.0 to AI 2.0: The Rise of Multimodal Generative AI

The transition from AI 1.0, which predominantly focused on predictive tasks like recommendation engines and speech recognition, to AI 2.0 has brought about a new class of AI models. AI 2.0 is characterized by generative capabilities: AI not only analyzes data but can also create new content. This is enabled by LLMs such as GPT-4, Claude 3, and other systems that are now crossing the trillion-parameter threshold. These models have expanded their capacity to handle multimodal inputs, such as text, audio, video, and code, and to generate sophisticated outputs across these media.

Multimodal AI models open up transformative applications, from assisting creative professionals to building more intuitive and personalized digital assistants. The ability of these systems to process complex inputs and deliver more human-like, adaptable responses is bringing us closer to the long-term goal of Artificial General Intelligence (AGI). However, the power of these models comes at a cost: they demand enormous compute resources and, critically, significant advances in memory bandwidth, capacity, and efficiency to maintain performance as they scale.

The Critical Role of AI Inference

AI inference is the process by which a trained AI model makes predictions or decisions based on new data or input. While training AI models is computationally intensive, inference requires real-time, low-latency performance to be effective in real-world applications. For example, multimodal AI models used in voice assistants, autonomous vehicles, or recommendation systems must process input and deliver an output in fractions of a second.

Traditionally, inference workloads have been carried out in data centers, where high-performance GPUs and specialized AI accelerators can process vast amounts of data. However, as AI becomes increasingly integrated into edge devices like PCs, smartphones, IoT devices, and autonomous systems, there is a strong push to decentralize AI inference. Running AI inference at the edge offers several benefits, including reduced latency, enhanced privacy, and lower reliance on continuous cloud connectivity.

Memory Technology at the Heart of AI Inference

As AI inference becomes an increasingly integral part of edge devices, from PCs to IoT devices and automotive systems, the importance of memory technology only grows. AI inference, particularly in edge devices, requires high performance balanced against power efficiency and compact form factors. Different devices, based on their specific use cases, demand different memory technologies to meet their bandwidth, capacity, and power needs. Three critical memory types dominate the landscape for AI inference across PCs, IoT, and automotive applications: DDR5, LPDDR5, and GDDR.

DDR5: Scaling Performance for PCs and Edge Devices

The growing complexity of AI inference workloads, especially with the rise of multimodal AI applications, places heavy demands on system memory. DDR5 offers significant improvements over its predecessor, DDR4, with double the bandwidth and improved data rates, which are crucial for processing the large amounts of data required by modern AI inference tasks.

For AI inference in PCs, DDR5’s high bandwidth is particularly important when handling real-time data processing. As mentioned earlier, multimodal AI models are growing larger and more complex, requiring more memory bandwidth to ensure fast data flow and greater capacity to hold increasingly large models. AI PCs with DDR5 running at 6400 megatransfers per second (MT/s) are coming to market soon. DDR5 ensures that inference tasks can be handled efficiently, providing faster response times in applications such as AI-driven content creation, gaming, and productivity tools. In addition, DDR5 provides cost-effective memory capacity for increasingly large inference models.
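As a back-of-the-envelope illustration (not from the DDR5 specification itself), peak theoretical bandwidth is simply data rate multiplied by interface width. The short Python sketch below assumes a standard 64-bit DDR5 DIMM data path and ignores protocol overhead, so sustained bandwidth in practice will be lower:

    # Peak theoretical bandwidth = data rate x interface width (illustrative sketch).
    # Assumes a standard 64-bit DDR5 DIMM data path; sustained bandwidth is lower.
    def peak_bandwidth_gbs(data_rate_mtps: float, bus_width_bits: int) -> float:
        """GB/s = (million transfers/s) x (bits/transfer) / 8 bits per byte / 1000."""
        return data_rate_mtps * bus_width_bits / 8 / 1000

    print(peak_bandwidth_gbs(6400, 64))  # DDR5-6400 -> 51.2 GB/s per module
    print(peak_bandwidth_gbs(3200, 64))  # DDR4-3200 -> 25.6 GB/s per module

At 6400 MT/s, that works out to 51.2 GB/s per module, double the 25.6 GB/s of a DDR4-3200 module.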

Beyond bandwidth and capacity, DDR5 also introduces improved power efficiency features compared to earlier generations, which is important for edge devices like desktops and laptops. DDR5’s power management IC (PMIC) is implemented directly on the DIMM, allowing for more granular power control, which reduces power consumption and improves thermal management. This is particularly important for AI inference tasks, which, although not as power-hungry as training, still require memory systems to operate efficiently within the thermal and power constraints of edge devices.

LPDDR: Power Efficiency, Bandwidth and Capacity

LPDDR evolved from DDR memory technology as a power-efficient alternative; LPDDR5, and its optional extension LPDDR5X, are the most recent updates to the standard. LPDDR5X focuses on improving performance, power, and flexibility, offering data rates up to 8.533 gigatransfers per second (GT/s) for a significant boost in speed and performance. Compared to DDR5, LPDDR5/5X limits the data bus width to 32 bits while increasing the data rate. The switch to a quarter-speed clock, as compared to the half-speed clock of LPDDR4, along with a new feature, Dynamic Voltage Frequency Scaling (DVFS), keeps higher-data-rate LPDDR5 operation within the same thermal budget as LPDDR4-based devices. To meet the space constraints of mobile devices, LPDDR5X can deliver capacities of up to 64GB by using multiple stacked DRAM dies in a multi-die package.

As demand for memory performance grows, LPDDR5 has continued to evolve in the market, with the major DRAM vendors announcing a further extension of LPDDR5 known as LPDDR5T, with the “T” standing for “turbo.” LPDDR5T boosts data rates to 9.6 GT/s, enabling an aggregate bandwidth of 38.4 GB/s in a x32 LPDDR5T package.
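The same arithmetic confirms these LPDDR figures. Here is a minimal Python sketch, assuming the x32 package width noted above:

    # Per-package bandwidth for a x32 LPDDR5X/5T interface (illustrative sketch).
    BUS_WIDTH_BITS = 32

    for name, rate_gtps in [("LPDDR5X", 8.533), ("LPDDR5T", 9.6)]:
        gb_per_s = rate_gtps * BUS_WIDTH_BITS / 8  # GT/s x bits / 8 -> GB/s
        print(f"{name}: {gb_per_s:.1f} GB/s per x32 package")

    # LPDDR5X: 34.1 GB/s; LPDDR5T: 38.4 GB/s, matching the figure cited above.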

With its low power consumption, high-bandwidth capabilities, and high capacity, LPDDR5 is a great choice of memory not just for cutting-edge mobile devices, but also for AI inference on endpoints where power efficiency and a compact form factor are crucial considerations. As such, we’re seeing LPDDR make inroads in AI PC laptops and automotive ADAS applications. The introduction of CAMM (Compression Attached Memory Module) form factors will only accelerate this trend. LPCAMM allows the benefits of LPDDR5 memory to be harnessed in a flexible, scalable, upgradeable module form factor.

GDDR: Turning Up Memory Bandwidth to Eleven

While LPDDR focuses on power efficiency, GDDR (Graphics DDR) is designed to maximize bandwidth, making it a critical memory technology for AI inference in high-performance applications like gaming PCs, workstations, and even automotive systems. GDDR6 and GDDR6X are standard in graphics cards, which are now increasingly leveraged for real-time AI inference workloads that demand extremely high data throughput.

In gaming, AI inference is used to enhance real-time experiences, such as AI-driven non-player characters (NPCs), real-time ray tracing, and even procedural content generation. These inference tasks require extremely low-latency, high-performance data processing, which is where GDDR memory shines. With data rates up to 24 GT/s, GDDR6 and GDDR6X offer bandwidths of up to 96 GB/s per attached GDDR memory device to feed the large, complex AI models used in real-time gaming environments. Similarly, GDDR-based GPUs are enabling multimodal AI inference for content creation, including image, music, speech, and video generation.

But as model sizes continue their meteoric growth, we need even more performance. Enter GDDR7, the latest iteration of the standard, which pushes data rates to 40 GT/s and above. A GPU with ten GDDR7 devices can achieve an aggregate memory bandwidth of 1.6 terabytes per second (TB/s). As pushing raw signaling rates becomes increasingly difficult due to signal- and power-integrity limitations, GDDR7 moves to PAM3 (pulse amplitude modulation, three levels) to move more data per clock cycle than the NRZ (aka PAM2) signaling used in GDDR6, LPDDR, and DDR. As the highest-data-rate memory, GDDR made the transition to multi-level PAM signaling first (GDDR6X uses PAM4), but expect this innovation to appear in other memory types as they continue to scale performance.
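These GDDR numbers follow from the same per-device calculation. A rough Python sketch, assuming the 32-bit interface of a GDDR6/GDDR7 device:

    # Per-device and aggregate GDDR bandwidth (illustrative sketch).
    # Assumes a 32-bit device interface. For signaling: NRZ carries 1 bit per
    # symbol, GDDR6X's PAM4 carries 2, and GDDR7's PAM3 encodes roughly 1.5
    # (3 bits over 2 symbols), which is how data rates rise without a
    # proportional increase in signaling frequency.
    DEVICE_WIDTH_BITS = 32

    def device_bw_gbs(rate_gtps: float) -> float:
        return rate_gtps * DEVICE_WIDTH_BITS / 8  # GT/s -> GB/s per device

    print(device_bw_gbs(24))       # GDDR6/6X at 24 GT/s -> 96 GB/s per device
    print(device_bw_gbs(40))       # GDDR7 at 40 GT/s -> 160 GB/s per device
    print(device_bw_gbs(40) * 10)  # ten GDDR7 devices -> 1600 GB/s = 1.6 TB/s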

Memory: Key to Enabling the Growth of AI Inference

The evolution of AI, particularly in the realm of multimodal generative AI like LLMs, is pushing the boundaries of memory technology for AI inference. AI inference, the process of using trained models for decision-making, has become critical across PCs, IoT devices, and automotive systems, demanding advancements in memory bandwidth, capacity, and power efficiency. DDR5 is designed to support AI inference in PCs and edge devices, delivering higher bandwidth and power efficiency. LPDDR5 is crucial for mobile and IoT devices, where low power consumption and compact form factors are essential. Meanwhile, GDDR6/6X and GDDR7 power high-performance applications like gaming PCs and automotive AI systems, offering extreme bandwidth and throughput for real-time data processing. The ongoing performance scaling of these memory technologies will be key to serving the growing demand for fast, efficient, and scalable AI inference. 
