The Hitchhiker's Guide to Programming and Optimizing CXL-Based Heterogeneous Systems

By Zixuan Wang *, Suyash Mahar *, Luyi Li *, Jangseon Park ❆, Jinpyo Kim ❊, Theodore Michailidis *, Yue Pan *, Tajana Rosing *, Dean Tullsen *, Steven Swanson *, Kyung Chang Ryoo ❆, Sungjoo Park ❆, and Jishen Zhao
* University of California San Diego
❆ Samsung
❊ SK Hynix

Abstract

We present a thorough analysis of the use of CXL-based heterogeneous systems. We built a cluster of server systems that combines CPUs from different vendors with various types of CXL devices. We further developed a heterogeneous memory benchmark suite, Heimdall, to profile the performance of such heterogeneous systems. Using Heimdall, we uncovered the detailed architecture design of these systems, drew observations on optimizing workload performance, and identified directions for the future development of CXL-based heterogeneous systems.

1- Introduction

The ever-growing performance demands from modern applications drive the development of heterogeneous systems. However, heterogeneous systems’ communication bandwidth has become one of the key bottlenecks in system scalability, where the hardware bandwidth does not scale as fast as modern workloads’ bandwidth requirements. To improve the bandwidth, devices such as network cards and GPUs have developed dedicated communication links [1] to exchange data more efficiently. However, such new communication links often define their own communication protocols, requiring specialized operating system kernel drivers and system software libraries and imposing new programming models on workload developers. Adding such new devices to an existing heterogeneous system is thus non-trivial, which can hold back their adoption. This motivates the need for an industry-standard communication protocol to provide a consistent interface to workloads while enabling device manufacturers to integrate new devices into the ecosystem without changing the programming interface.

Cache-coherent interconnect protocols have been proposed to unify the communication interface between heterogeneous devices. Using cache-coherent interconnects, a processor can access data held by other connected processors and cache the remote data locally; at the same time, the protocol transparently updates the locally cached data when any connected processor modifies it. Such protocols allow processors to exchange data in a single standard scheme, simplifying data synchronization and reducing the need for dedicated communication drivers and libraries. Many existing cache-coherent protocols were initially deployed for homogeneous systems, such as inter-CPU links (Intel UPI [2] and AMD Infinity Fabric [3]). With the rapid emergence of new processors and memory devices, industry and academia have been exploring generic cache-coherent links [4–7] for heterogeneous systems to interconnect different types of processors and memory. Such generic protocols aim to unify communication schemes and optimize data exchange performance between devices.

Compute Express Link (CXL) [4] is a recent open-standard cache-coherent interconnect protocol that already has early commercial support. CXL defines a multi-device coherence protocol on top of the PCIe physical layer, allowing processors to reuse the existing PCIe standard (links, form factors, and more) as much as possible instead of adopting a new physical layer design. When interconnected with CXL, accelerators and memory devices can sit on the PCIe bus and exchange data coherently with other devices, including the host CPU. As of today, the CXL specification spans three generations, from 1.0 to 3.0. The CXL 1.0 spec lays the foundation of the standard, defining the basics of the coherence and device data exchange protocols; the revised 1.1 spec added the specification for memory expander devices. CXL 2.0 introduced switch-based topologies, enabling scalable multi-device configurations and improved memory pooling. The recent CXL 3.0 doubled the bandwidth by moving to PCIe Gen 6 and incorporated advanced features such as atomic operations and enhanced security, supporting more demanding applications such as AI and large-scale memory pools.
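To make the bandwidth doubling concrete, the following is a small back-of-the-envelope sketch rather than a measurement. It assumes the PCIe base-specification per-lane signaling rates (32 GT/s for Gen 5, 64 GT/s for Gen 6), that earlier CXL generations run on PCIe Gen 5, and a x16 link; it ignores encoding and protocol overhead, so achievable bandwidth is lower. The function and table names are illustrative only.

```python
# Raw one-direction link bandwidth, ignoring encoding and protocol overhead.
# Assumption: per-lane rates are the PCIe base-spec figures; CXL 1.x/2.0 use
# the PCIe Gen 5 physical layer and CXL 3.0 uses PCIe Gen 6.
PCIE_GT_PER_LANE = {5: 32, 6: 64}  # gigatransfers per second per lane


def raw_bandwidth_gb_s(pcie_gen: int, lanes: int = 16) -> float:
    """Return raw one-direction bandwidth in GB/s (one bit per transfer per lane)."""
    return PCIE_GT_PER_LANE[pcie_gen] * lanes / 8


if __name__ == "__main__":
    for name, gen in [("CXL 1.x/2.0", 5), ("CXL 3.0", 6)]:
        print(f"{name} on PCIe Gen {gen}, x16: {raw_bandwidth_gb_s(gen):.0f} GB/s raw")
```

Under these assumptions a x16 link moves from 64 GB/s to 128 GB/s of raw bandwidth per direction.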

To study the performance characteristics of CXL systems, we first built a cluster of CXL-based heterogeneous systems that combine various CPUs and CXL memory devices. We chose two types of CPUs, Intel Sapphire Rapids (SPR) and AMD Genoa (Zen4), for our machines, given their different capabilities: the Intel-SPR CPU implements the CXL 1.0 standard, supports Intel FPGA-based CXL Type 1 and Type 2 devices through CPU firmware, and emulates CXL.mem through the CXL.cache interface. The AMD-Zen4 CPU implements the CXL 1.1 standard with native support for the CXL.mem interface. We then incorporated two categories of CXL devices, FPGA-based and ASIC-based: we implemented CXL Type 1, Type 2, and Type 3 devices based on Intel Agilex 7 FPGAs and integrated them with Intel-SPR CPUs. Additionally, we attached ASIC-based CXL memory expanders to both Intel and AMD CPUs.

We then studied the performance characteristics of these heterogeneous systems, compared performance metrics side by side across different systems, and drew observations. To this end, we developed a benchmark suite, Heimdall, and used it to conduct a wide range of performance profiling. This benchmark suite consists of carefully crafted microbenchmarks that trigger specific system behaviors across system layers, from the microarchitecture to the operating system level. By analyzing the benchmark results on a single system, we observed characteristics such as the CPU and CXL device microarchitecture designs that support the CXL protocol, together with OS and system software performance when using CXL devices. Then, by comparing benchmark results across systems, we observed discrepancies between systems, including different CPU-side CXL designs between AMD and Intel and device-side architectural implications for performance.
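As an illustration of what a microbenchmark that triggers a specific behavior can look like, here is a minimal pointer-chasing sketch. It is not Heimdall's code, and the text above does not say which techniques Heimdall uses; pointer chasing is a common way to expose memory latency because each load depends on the previous one. The helper names `build_chain` and `chase` are hypothetical. Python is used only for readability: interpreter overhead dominates the timing here, so an actual measurement would need a compiled language and explicit control over which memory device backs the buffer.

```python
import random
import time


def build_chain(n: int, seed: int = 0) -> list:
    """Sattolo's algorithm: a random permutation that forms one cycle of length n."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt: list, steps: int) -> tuple:
    """Follow the chain for `steps` dependent loads; return (final index, ns per step)."""
    idx = 0
    start = time.perf_counter_ns()
    for _ in range(steps):
        idx = nxt[idx]  # the next address is unknown until this load completes
    elapsed = time.perf_counter_ns() - start
    return idx, elapsed / steps


if __name__ == "__main__":
    chain = build_chain(1 << 16)
    _, ns_per_step = chase(chain, 1 << 18)
    print(f"{ns_per_step:.1f} ns per dependent access (includes interpreter overhead)")
```

The single-cycle permutation matters: it ensures every element is visited before any repeats, so the working set is the full buffer and a hardware prefetcher cannot predict the access order.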

In summary, we make the following contributions:

  •  We built a cluster of CXL-based systems and summarized the lessons we learned throughout this process.
  •  We developed a benchmark suite, Heimdall, for heterogeneous memory systems.
  •  Using this benchmark suite, we studied a wide range of CXL-based heterogeneous system configurations in our cluster and uncovered CXL-related architecture and system designs.
  •  We drew key observations from our extensive experiments and pointed out future directions for developing CXL-based heterogeneous systems.
  •  We have further work in progress and will update this paper in the future to include power analysis, CXL FPGA internals, the performance of the latest CXL prototypes, and more.

