Accelerating AI Workloads with NVMe® over Fabrics (NVMe-oF™) Technology

As Artificial Intelligence (AI) continues to transform industries, data centers must maximize GPU utilization and provide fast, shareable access to data to meet the needs of AI workloads. Unlike traditional High-Performance Computing (HPC), AI requires access to massive datasets that can reach petabytes in size and must be distributed across hundreds or thousands of GPUs. Streamlined I/O through high-performance, scale-out storage is critical to feeding the AI “beast.”

Enter NVM Express (NVMe®) technology, the industry standard for solid state drives (SSDs) in all form factors. It is a high-performance storage protocol engineered to deliver the speed, scalability and efficiency required by today’s data-intensive applications.

NVMe over Fabrics (NVMe-oF™) technology extends the speed and efficiency of NVMe storage across network fabrics such as TCP, RDMA, and Fibre Channel to meet the scale-out storage needs of AI applications. Looking ahead, NVM Express is also exploring support for Ultra Ethernet, opening new possibilities for even greater performance and interoperability in future data center architectures. Read on to learn more about the ways NVMe-oF technology is addressing challenges in AI.
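To make this concrete: on a Linux host, remote NVMe-oF namespaces are typically attached with the open-source nvme-cli utility, after which they appear as ordinary local NVMe devices. The sketch below drives nvme-cli over NVMe/TCP from Python; the target address and NQN are hypothetical placeholders, and nvme-cli must be installed and run with root privileges.

```python
# Minimal sketch: discover and connect to an NVMe/TCP target from a
# Linux host using nvme-cli (must be installed; run as root).
# The address, port, and NQN below are hypothetical placeholders.
import subprocess

TARGET_ADDR = "192.0.2.10"          # example target IP (placeholder)
TARGET_PORT = "4420"                # default NVMe/TCP port
TARGET_NQN = "nqn.2024-01.org.example:storage-pool-0"  # placeholder NQN

# Discover the subsystems exported by the target.
subprocess.run(
    ["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)

# Connect; the remote namespace then appears as a local /dev/nvmeXnY device.
subprocess.run(
    ["nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR,
     "-s", TARGET_PORT, "-n", TARGET_NQN],
    check=True,
)
```

Once connected, the remote namespace can be partitioned, formatted, and mounted exactly like a local NVMe SSD, which is what lets existing AI data pipelines use fabric-attached storage unchanged.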

NVMe Technology: Built for AI Scale and Growing Speed

AI workloads are complex and dynamic, consisting of multiple phases (e.g., data ingest, data preprocessing, training, checkpointing, archiving) with differing storage demands in each phase.

NVMe technology can support each of these phases by delivering high-throughput, ultra-low-latency access to storage. NVMe-oF technology enables responsiveness and dynamic scaling across nodes and clusters, helping AI and ML workloads operate at scale without being constrained by data bottlenecks.
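One practical way to check that storage is keeping a node fed is to time large sequential reads from a file on the NVMe-oF-backed mount. The following is a rough, standard-library-only sketch (a production benchmark would use a dedicated tool such as fio with direct I/O); the data path is a placeholder.

```python
# Rough sequential-read throughput check over a file on NVMe-oF-backed
# storage. Page-cache effects apply; use fio with O_DIRECT for rigor.
# DATA_PATH is a hypothetical placeholder.
import time

DATA_PATH = "/mnt/nvmeof/sample.bin"   # placeholder path
CHUNK = 8 * 1024 * 1024                # 8 MiB reads

total = 0
start = time.perf_counter()
with open(DATA_PATH, "rb", buffering=0) as f:
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.perf_counter() - start
print(f"Read {total / 1e9:.2f} GB in {elapsed:.2f} s "
      f"({total / 1e9 / elapsed:.2f} GB/s)")
```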

Another challenge is that AI training datasets are often too large to fit into GPU memory. While many configurations will work with a pool of direct-connect NVMe SSDs, NVMe-oF technology provides flexibility to expand and optimize the storage configuration with external, large-scale storage. By enabling shared access to centralized pools of storage, NVMe-oF technology allows for more efficient data orchestration, better resource utilization and simplified infrastructure management.
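Because the shared pool looks like ordinary block or file storage to each node, training code can stream datasets that far exceed GPU (or even host) memory. Below is a minimal sketch that assumes the dataset is stored as fixed-size record shards on a shared mount; the shard directory and record size are hypothetical.

```python
# Stream training records from shards on a shared NVMe-oF-backed mount,
# so the working set never has to fit in GPU or host memory.
# SHARD_DIR and RECORD_BYTES are hypothetical placeholders.
from pathlib import Path

SHARD_DIR = Path("/mnt/nvmeof/dataset")   # shared mount (placeholder)
RECORD_BYTES = 4096                       # fixed record size (assumed)

def stream_records(shard_dir: Path, record_bytes: int):
    """Yield raw records one shard at a time; each node can iterate a
    different subset of shards against the same shared pool."""
    for shard in sorted(shard_dir.glob("shard-*.bin")):
        with open(shard, "rb") as f:
            while record := f.read(record_bytes):
                yield record

for i, record in enumerate(stream_records(SHARD_DIR, RECORD_BYTES)):
    if i >= 3:      # just demonstrate the first few records
        break
    print(f"record {i}: {len(record)} bytes")
```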

Additionally, during training, AI models frequently save their state through checkpointing. If the storage system can’t keep up, training slows down and GPU utilization drops. NVMe-oF technology can reduce both the time required to gather and store checkpoint data from the GPUs and the time required to restore it. It can also add flexibility when checkpointed state must be loaded into a different set of GPUs.
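To make the checkpointing point concrete, the sketch below periodically serializes model state to a shared mount and restores it, timing both directions; because the pool is shared, the restore can run on a different node or GPU set than the one that wrote the checkpoint. Everything here is illustrative: a plain dict stands in for real framework tensors, and the checkpoint directory is a placeholder.

```python
# Illustrative checkpoint save/restore against shared NVMe-oF storage.
# A plain dict stands in for real model state (e.g., framework tensors);
# CKPT_DIR is a hypothetical shared mount.
import pickle
import time
from pathlib import Path

CKPT_DIR = Path("/mnt/nvmeof/checkpoints")   # shared mount (placeholder)
CKPT_DIR.mkdir(parents=True, exist_ok=True)

def save_checkpoint(state: dict, step: int) -> float:
    """Write a checkpoint and return the elapsed seconds."""
    path = CKPT_DIR / f"ckpt-{step:08d}.pkl"
    start = time.perf_counter()
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return time.perf_counter() - start

def load_checkpoint(step: int):
    """Read a checkpoint (possibly on a different node) and time it."""
    path = CKPT_DIR / f"ckpt-{step:08d}.pkl"
    start = time.perf_counter()
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state, time.perf_counter() - start

state = {"step": 1000, "weights": bytes(64 * 1024 * 1024)}  # 64 MiB stand-in
print(f"save took {save_checkpoint(state, 1000):.3f} s")
_, restore_s = load_checkpoint(1000)
print(f"restore took {restore_s:.3f} s")
```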

Ideal Use Cases for NVMe Technology in AI Workloads

NVMe-oF technology shines in environments that demand high performance and scalability. Some of these key use cases include:

  • Deep learning and foundation models, when training datasets are too large to fit within a single node’s local capacity or when data-sharing requirements exceed what local storage can provide.
  • Enterprise AI and autonomous systems, which require access to large, low-latency, high-bandwidth pools of shared storage.
  • Cloud-native AI workloads, where dynamic scaling and flexibility are essential.

Learn More & Access Resources

In conclusion, NVMe-oF technology enables a constant, high-speed data pipeline. It also supports parallel workloads, allowing multiple users or models to operate simultaneously without degrading performance. If you’re looking to modernize your AI infrastructure with NVMe-oF technology, the NVM Express website is a great place to start. Visit nvmexpress.org to explore the latest NVMe specifications and discover solutions tailored to your architecture in the Compliant Product List. Additionally, I invite you to watch my recent video, NVMe-oF Technology for AI Workloads.
